Data formats

The ORTOLANG platform follows the research community's recommendations through CORLI, a CLARIN K-Centre that works on data formats for language resources.

ORTOLANG accepts a wide variety of resources (corpora, lexicons, terminologies and tools) and data formats, so that scientists are not limited in how they use the data in their research.

ORTOLANG’s validation process encourages producers to deposit their data in open formats that can be used and stored over the long term. These include standards recommended by CLARIN, such as TEI, XML and Unicode text, as well as appropriate formats for raw data (CSV, TXT), audio (MP3, WAV) and video (MPEG 1/2/4).
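Producers who want to review a deposit against the formats listed above before submitting it could start from something like the following Python sketch. It is only an illustration, not ORTOLANG's validation process: the extension table, the category labels and the function names `classify` and `split_deposit` are assumptions made for this example, and a file extension alone does not prove that a file is valid TEI or correctly encoded Unicode.

```python
from pathlib import PurePath

# Hypothetical extension table based on the formats named in the text.
# The extensions themselves are assumptions; ORTOLANG does not publish
# this mapping here.
OPEN_FORMATS = {
    ".xml": "text",      # XML, including TEI documents
    ".txt": "raw data",  # plain Unicode text
    ".csv": "raw data",
    ".mp3": "audio",
    ".wav": "audio",
    ".mpg": "video",     # MPEG-1/2
    ".mpeg": "video",
    ".mp4": "video",     # MPEG-4
}


def classify(filename):
    """Return the category for a filename, or None if it is not in the table."""
    suffix = PurePath(filename).suffix.lower()
    return OPEN_FORMATS.get(suffix)


def split_deposit(filenames):
    """Group filenames by category and collect the ones not in the table."""
    listed = {}
    other = []
    for name in filenames:
        category = classify(name)
        if category is None:
            other.append(name)
        else:
            listed.setdefault(category, []).append(name)
    return listed, other


if __name__ == "__main__":
    listed, other = split_deposit(["interview.wav", "notes.docx", "tokens.csv"])
    print(listed)
    print(other)
```

Files that end up in the second list are not necessarily rejected, since ORTOLANG accepts a wide variety of formats; they are simply candidates for an additional copy in an open format.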

Using open formats means resources can undergo automatic processing specific to linguistics. As a result, French-language corpora can benefit from full-text indexing, automatic syntactic processing and live display.

Producers can also provide their material in alternative formats to facilitate its dissemination and use.