This work takes as its point of reference the spoken corpus of the Computational Linguistics Laboratory of the Autónoma University of Madrid (http://www.lllf.uam.es/), which in turn forms part of the European project "C-oral-ROM". The C-oral-ROM project is collecting a spoken corpus in four Romance languages (Italian, Portuguese, French and Spanish), developed by the University of Florence, the University of Lisbon Foundation, the University of Aix-en-Provence and the Autónoma University of Madrid.
The Spanish C-oral-ROM Corpus is made up of more than 300,000 transcribed and tagged words.
The texts were recorded to meet requirements of spontaneity, sound quality and speaker variety. They were recorded in a wide range of contexts (family settings, conferences, etc.) and always digitally.
According to the variety of speakers,
the corpus is divided into two main blocks, the informal and the formal
one. The informal block has private and public recordings; in the
private recordings the speakers do not act in a fixed public role
(public roles are, for example, salesman or receptionist).
The formal block has texts taken from the mass media and telephone
conversations, and also formal speech in natural contexts (conferences, lectures, etc.).
In a second stage, the UAM
Spanish Treebank has also been tagged. This corpus is made up
of 1600 sentences taken from digital newspapers (El País and Compra Maestra). These sentences
are annotated following a format which is very similar to that of the
Penn Treebank.
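
As a rough illustration of what a Penn Treebank-style bracketed annotation looks like and how it can be read, the following Python sketch parses one bracketed sentence into a nested structure and recovers its words. The sample sentence, the labels (S, NP, VP, D, N, V) and the function names parse_bracketed and leaves are assumptions made for this example only; they are not taken from the UAM Spanish Treebank, whose actual tag set and file layout may differ.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumes the generic "(LABEL child child ...)" convention; the real
    treebank files may use a different or richer notation.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1                      # consume the label
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())   # nested constituent
            else:
                children.append(tokens[pos])    # a word (leaf)
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1                      # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _label, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hypothetical sentence and labels, for illustration only.
SAMPLE = "(S (NP (D El) (N corpus)) (VP (V contiene) (NP (N oraciones))))"

if __name__ == "__main__":
    tree = parse_bracketed(SAMPLE)
    print(tree[0])        # root label of the sample tree
    print(leaves(tree))   # words of the sample sentence
```

Nested tuples keep the sketch dependency-free; for real work an established toolkit that reads bracketed trees would probably be a better choice.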