Language Technologies (LT) enable machines not only to read, analyse, process and generate human language, but also, thanks to recent scientific advancements, to bridge the divide between human communication and machine understanding.
Since language serves as the fundamental medium for human interaction, LT have gained immense importance across various industries and applications, from translation and localisation to customer support, healthcare, media creation and marketing. Common examples of this technology are speech recognition, smart assistants, machine translation, chatbots, text summarisation and automatic subtitling.
What is needed to develop Language Technologies?
Below are some of the key elements needed to develop language-based tools and services.
- Language data: this refers to the textual or spoken content – in one or more languages – that serves as input or training material for Natural Language Processing tasks, e.g., text generation or sentiment analysis. This data can come from a variety of sources, such as books, articles, social media posts and transcripts of spoken conversations. Since it forms the foundation of LT development, it is vital that it be collected in full respect of copyright (IPR) and personal data protection (GDPR) provisions.
- Training algorithms and language models: algorithms are the software, the ‘recipes’, used to create models of human languages. Given substantial amounts of quality data, the latest Machine Learning algorithms have become increasingly able to create models that represent the knowledge derived from language data. The larger the resources and models are, the broader and more general-purpose their applications tend to be.
- Computational power: significant computational resources are required, especially during the creation of language models, where high-performance computing and robust cloud infrastructures are crucial.
- Human expertise: successful LT involve collaboration between linguists, data scientists, computer engineers, and domain specialists.
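To make the relationship between language data, a training algorithm and the resulting model more concrete, here is a minimal sketch in Python. It is a deliberately tiny illustration and not how production LT or Large Language Models are built: the corpus is made up, and the function names `train_bigram_model` and `most_likely_next` are hypothetical. The "algorithm" simply counts which word follows which, and the "model" is the resulting table of counts.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Toy training algorithm: count which word follows which in the data."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair every token with its successor and count the pair.
        for current_word, next_word in zip(tokens, tokens[1:]):
            model[current_word][next_word] += 1
    return model


def most_likely_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Language data: a tiny, invented corpus (assumption, for illustration only).
corpus = [
    "machine translation supports multilingual communication",
    "machine translation supports public administrations",
    "speech recognition supports automatic subtitling",
]

model = train_bigram_model(corpus)
print(most_likely_next(model, "machine"))  # prints "translation"
```

Even in this toy setting, the quality of the model depends entirely on the data it was given: a word that never appears in the corpus yields no prediction at all, which mirrors, at a very small scale, the challenge faced by low-resource languages.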
What is Europe doing to develop Language Technologies?
In Europe, we have a complex language landscape. The Charter of Fundamental Rights of the European Union prohibits discrimination on grounds of language and places an obligation on the EU to respect linguistic diversity. Accountability, transparency, fairness and respect for our values are only a few of the ethical implications at stake. These rights and principles can only be guaranteed by an unbiased use of LT.
The European LT industry plays a key role in Europe’s strategic and technological autonomy, which should be further strengthened. European LT providers know our specific market needs best, and hundreds of them are listed in the Catalogue of eTranslation Services.
Publicly available solutions, including tools and services offered by the European Commission, complement the market offer while addressing gaps in technology support for low-resource languages. These basic solutions – machine translation (eTranslation), named entity recognition, summarisation, speech transcription and data pseudonymisation for GDPR compliance – are available to all European public administrations and small and medium-sized enterprises in all official languages.
Under Horizon 2020, the European Language Grid (ELG) created a one-stop shop for specialised LT solutions. Efforts in dissemination and community building have helped foster a common understanding of the need to join public and private forces and benefit from the best of both worlds in research and deployment.
The Horizon Europe Programme fosters research and innovation by supporting the development of advanced LT that go beyond the current state of the art, including Large Language Models. These models, designed to enhance human-machine interaction, will have multilingual capabilities, handle multiple modes of input, manage biases and exhibit context awareness.
The European Language Equality (ELE) initiative, a pilot project/preparatory action initiated by the European Parliament, developed an agenda and a roadmap for achieving full digital language equality in Europe by 2030.
Under the Digital Decade Policy Programme, the Commission is coordinating a Union effort across the Member States and the private sector to develop a European LT ecosystem.
Finally, the Commission acknowledges the value of language data as the foundation for training language models through the Common European Language Data Space (LDS). Funded under the DIGITAL Work Programme 2021-2022, the project aims to deploy a platform and a marketplace for the collection, sharing and re-use of multilingual and multimodal language data. Aligned with the European Strategy for Data and the very concept of Data Spaces, it will ensure that more language data becomes available for use in the economy, society and research, while keeping the companies and individuals who generate the data in control.
Bringing all these elements, projects and actors together is a major challenge for the Union, the European industry and national public administrations – the ultimate purpose being to support Europe’s Digital Decade for the benefit of all.