The aim of the project is to extend the CORIS/CODIS Corpus. Thus, the DiaCORIS Corpus will include Italian texts produced between 1861 and 1945. The total size will be 15 million words.
The CORIS/CODIS corpus is a reference corpus for modern Italian. It contains texts from the last two decades of the 20th century, for a total of 100 million words.
It contains about 19 million words of literary and non-literary texts in prose and poetry written in early/old Italian from the beginning of the XIII century to 1375.
It consists of texts in the domain of the history of Italian Renaissance fine arts: about 200 literary and essayistic works by about 60 authors of the XV-XVII centuries.
In the Masoretic text of the Hebrew Bible, the cantillation marks the division and subdivision of each verse. This structural information for every verse has been represented as a tree in XML format, constituting a cantillation treebank.
This is a sentence-aligned English–Hungarian parallel corpus. It contains 23.7 million English and 29.4 million Hungarian words in 2.07 million sentence pairs from 5 genres of text.