The project consists of three key components:
- Comprehensive Study of Existing Solutions: an in-depth analysis of existing automatic speech recognition (ASR) solutions and of the availability of training data (speech and the corresponding transcriptions).
- Creation of an Open-Source ASR Prototype: development of a basic open-source prototype that serves as a general-purpose client-server streaming automatic speech recognizer (ASR) for, among others, European small and medium-sized enterprises (SMEs) and European public administrations. The goal is to provide a freely accessible, well-documented, open-source ASR solution for both commercial and non-commercial use.
- Collection of the Low-resource European Languages Datasets ("LELD"): collection of speech data and creation of a database for three low-resource official European Union languages: Czech, Estonian, and Greek. The goal is to gather and provide essential data for European entities that aim to develop language technologies in the future, thereby supporting multilingualism in Europe.
Funding and Implementation
The project is financed under the 2022 Language Technology Solutions call for tenders implementing the DIGITAL Europe Programme in the field of language technologies.
Data Processing Disclaimer
To develop the ASR tool and collect the LELD datasets, some personal data will be processed in compliance with applicable regulations. This data comes from publicly available web archives, including Mozilla Common Voice and VoxPopuli, and from recordings published on the European Commission Audiovisual Portal and on the websites of Czech city councils, the Estonian Parliament, and the Greek Parliament. For more details, and to learn about your rights if you are a Data Subject, please refer to the Data Protection Record and Privacy Statement available at this link.