Public Access to European Language Data

List of potential internal and external language data sources.

The European Union acknowledges the value of data and has been supporting a number of initiatives devoted to collecting and sharing data. The most recent of these projects is the 'Common European Language Data Space' (LDS), launched in January 2023.

Common European Language Data Space (LDS)

Aligned with the European Data Strategy and the Data Spaces concept, the objective of the LDS is to be a functional platform and marketplace for the sharing of language data and models across the European Union. First exchanges of data are planned for month 24 of the project (M24), i.e. December 2024/January 2025. The LDS platform will support these exchanges, but will not host data itself.

Further information at Language Data Space.

What follows is a list of other relevant initiatives in no particular order:

DGT-Translation Memory

DGT-TM is a translation memory and contains segments from the Acquis Communautaire, the body of European legislation, comprising all the treaties, regulations and directives adopted by the European Union (EU). This dataset focuses on parallel texts in the 24 EU official languages.
Provided in TMX format, it encompasses around 2.6 billion words and grows by about 200 million words per year.
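
As an illustration of what working with a TMX file can look like, the sketch below reads translation units from a small, made-up TMX sample and yields aligned segment pairs for two languages. It is only a minimal example under stated assumptions: the sample content, the language codes (such as EN-GB) and the function name iter_segment_pairs are hypothetical, and the actual DGT-TM packaging and codes should be checked against the documentation on the download page.

```python
import io
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree exposes it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up TMX sample for illustration only; not taken from DGT-TM.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Article premier</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Artikel 1</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Entry into force</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Inkrafttreten</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def iter_segment_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each translation unit that
    contains both languages. Languages are matched on the primary subtag,
    so "en" matches "EN-GB". `source` is a path or a binary file object."""
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    # iterparse streams the file, which matters for large translation memories.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            primary = tuv.get(XML_LANG, "").lower().split("-")[0]
            seg = tuv.find("seg")
            if primary and seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[primary] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release the processed unit to keep memory use low


for en, de in iter_segment_pairs(io.BytesIO(SAMPLE_TMX.encode("utf-8")), "en", "de"):
    print(en, "|", de)
```

Units that lack one of the two requested languages are skipped, so the English-French request on this sample returns only the first unit.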

Further information, conditions for use and download are available at Joint Research Centre.

ELRC-SHARE

This repository is a collection of language resources that resulted from cooperation between the Member States’ administrations and various EU projects from 2015 to 2022. The main focus was on gathering parallel (translation) corpora. Overall, ELRC-SHARE contains approximately 6 000 datasets of different sizes and with different access policies.

The repository is available at ELRC-SHARE Repository.

High Performance Language Technologies (HPLT)

This project, funded under the HORIZON research programme, aims, among other things, to collect vast amounts of language resources in more than 100 different languages. It is the successor to Paracrawl, which focussed on the collection of parallel data for translation purposes.

Further information is available through the project website at High Performance Language Technologies and Paracrawl.

OpenWebSearch.EU

This research project’s main objective is to create a publicly accessible database that indexes websites from all around the world. To achieve this, the project will gather and analyse the contents of existing websites. This data collection effort could be useful for gathering language-related information.

Further information is available through the project website at OpenWebSearch.EU.

data.europa.eu

The portal is a central point of access to European open data from international, European Union, national, regional, local and geodata portals. It consolidates the former EU Open Data Portal and the European Data Portal. It currently contains more than 1.5 million European public sector datasets, grouped into 179 catalogues and covering different topical categories.

Further information and access to the datasets at data.europa.eu.

Publications Office of the European Union

The Publications Office of the European Union is the official provider of publishing services to all EU institutions, bodies, and agencies.
The Cellar is its common data repository and stores multilingual publications and metadata. It is open to all EU citizens and provides machine-readable data.
The EU Web Archive has been preserving the content and design of the websites of the EU institutions, agencies and bodies since 2013.

Further information at Publications Office of the European Union including the web archive.

Common Language Resources and Technology Infrastructure (CLARIN)

CLARIN ERIC (European Research Infrastructure Consortium) is a pan-European initiative which enables social sciences and humanities research on language resources (LR).

Further information at CLARIN. You can also access the language resources and services.

European Language Grid (ELG)

The ELG project established a single, scalable cloud platform as a one-stop shop for the European Language Technology industry and research community. The ELG catalogue provides access to a number of commercial and non-commercial running tools and services, models, lexica, terminologies and grammars. It also includes 8 000 corpora, with some overlap with the previous initiatives.

Further information at ELG and the ELG Catalogue.

Learn more about Language Technologies.
