Corpora

I have worked on the development of two very different corpora:

C-ORAL-ROM is a corpus of spontaneous speech from the Computational Linguistics Laboratory (LLI) of the Autónoma University of Madrid. The C-ORAL-ROM project has collected and annotated comparable corpora in four Romance languages (Italian, Portuguese, French and Spanish), developed respectively by the University of Florence, the University of Lisbon Foundation, the University of Aix-en-Provence and the Autónoma University of Madrid.

The Spanish C-ORAL-ROM corpus contains more than 300,000 transcribed and tagged words.

The texts were recorded to meet requirements of spontaneity, sound quality and speaker variety. The recordings were made in a wide range of contexts (family settings, conferences, etc.) and always with digital processing.


The UAM Spanish Treebank is a written corpus of 1,600 sentences taken from digital newspapers (El País and Compra Maestra). The sentences are annotated in a format very similar to that of the Penn Treebank, the syntactic treebank developed at the University of Pennsylvania.