SESCO

Corpora

This work takes as its point of reference the spoken corpus of the Computational Linguistics Laboratory of the Autónoma University of Madrid (http://www.lllf.uam.es/), which is in turn part of the European project "C-oral-ROM". The C-oral-ROM project is collecting a spoken corpus in four Romance languages (Italian, Portuguese, French and Spanish), developed respectively by the University of Florence, the University of Lisbon Foundation, the University of Aix-en-Provence and the Autónoma University of Madrid.

The Spanish C-oral-ROM Corpus is made up of more than 300,000 transcribed and tagged words.

Texts have been recorded to meet requirements of spontaneity, sound quality and variety of speakers. They have been recorded in a wide range of contexts (family settings, conferences, etc.), always using digital processing.

The variety of speakers divides the corpus into two main blocks, the informal and the formal. The informal block contains private and public recordings; in the private recordings the speakers do not act in a fixed public role (public roles are, for example, those of a salesman or a receptionist). The formal block contains texts taken from mass media and telephone conversations, as well as formal speech in natural contexts (conferences, lectures, etc.).

In a second stage, the UAM Spanish Treebank has also been tagged. This corpus is made up of 1600 sentences taken from digital newspapers (El País and Compra Maestra). These sentences are annotated in a format very similar to that of the Penn Treebank, the syntactic treebank of the University of Pennsylvania.
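
As a rough illustration of what a Pennsylvania-style annotation looks like, the sketch below reads a sentence written in bracketed notation, where each constituent is enclosed in parentheses and begins with its label. It assumes plain Penn-style bracketing; the example sentence, the tag names (S, NP, VP, D, N, V) and the helper functions parse_bracketed and leaves are invented for this illustration and are not taken from the UAM Spanish Treebank, whose actual tag set and conventions may differ.

```python
def parse_bracketed(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    # Put spaces around parentheses so they become separate tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        pos += 1
        label = tokens[pos]  # the first token after '(' is the constituent label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # skip the closing ')'
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of a parsed tree in sentence order."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hypothetical example sentence and tags, for illustration only.
example = "(S (NP (D El) (N corpus)) (VP (V contiene) (NP (N oraciones))))"
tree = parse_bracketed(example)
print(tree[0])                 # label of the root constituent
print(" ".join(leaves(tree)))  # the sentence recovered from the tree
```

Nested tuples keep the sketch free of dependencies; a real annotation would typically carry additional features on each node, which this reader does not model.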