Computational Linguistics Laboratory

Semantic Tagging System for Corpora
Universidad Autónoma de Madrid




     
1. Purposes
2. Theoretical Framework
3. Corpora
4. Papers
5. Further information and Bibliography














    Purposes

    Our main purpose is to develop a tagging system that allows the semantic representation of a linguistic corpus (Semantic Tagging System for Corpora, in Spanish Sistema de Etiquetado Semántico para Corpora: SESCO).

    One of the most striking aspects of our system is its universality, a characteristic derived from the semantic theory of Moreno Cabrera (1997). At present, it is being developed on both a spontaneous speech corpus and a written corpus, and SESCO is designed so that it can be used with any other language or type of corpus.

    The main arguments in support of our approach are the following:


            ·With SESCO we can deal with very large corpora. Software applications have been developed to support semi-automatic analysis.
            ·The SESCO system has the portability and universality of XML.
            ·In version 2.0, SESCO is compatible with other annotation systems.
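
    As a rough illustration of the XML point, the sketch below builds a tiny annotated utterance with Python's standard library and reads it back. The element and attribute names (`utterance`, `token`, `sem`) and the tag values are hypothetical placeholders; they do not reproduce the actual SESCO schema. The only point is that annotation stored as XML can be processed by any XML-aware tool.

```python
import xml.etree.ElementTree as ET


def annotate(tokens):
    """Build an <utterance> element from (word, semantic_tag) pairs.

    Element and attribute names are illustrative, not the SESCO schema.
    """
    utterance = ET.Element("utterance")
    for word, tag in tokens:
        token = ET.SubElement(utterance, "token", sem=tag)
        token.text = word
    return utterance


def read_tags(xml_text):
    """Parse serialized annotation back into (word, semantic_tag) pairs."""
    root = ET.fromstring(xml_text)
    return [(t.text, t.get("sem")) for t in root.iter("token")]


# Hypothetical tags: "entity" and "event" are placeholders.
sample = [("María", "entity"), ("duerme", "event")]
xml_text = ET.tostring(annotate(sample), encoding="unicode")
recovered = read_tags(xml_text)
```

    Because the serialized form is plain XML, the same text could equally be read by tools written in other languages.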
          
    Since the development of annotated speech corpora provides an effective way of testing theories and of discovering new problems, considerable growth in corpus-based semantics should be expected in the near future.







Theoretical Framework

The semantics of events has developed considerably in recent years (Tenny and Pustejovsky, 2000). Along the same lines, we assume that the system has to give information not only about language fragments, but also about the references and relations that arise during a discourse. This requires a dynamic model that allows the retrieval of representations that have already been referred to (Webber and Nilsson, 1981).
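
The idea of a dynamic model can be sketched as a store of discourse referents that grows as the discourse proceeds and can be queried when a later expression refers back to an earlier one. This is only a minimal illustration: the names `DiscourseModel`, `introduce` and `resolve`, the coarse `kind` labels, and the most-recent-match strategy are assumptions made for the example, not the mechanism described in the cited work or implemented in SESCO.

```python
from dataclasses import dataclass, field


@dataclass
class Referent:
    ident: int
    description: str
    kind: str  # hypothetical coarse label, e.g. "entity" or "event"


@dataclass
class DiscourseModel:
    referents: list = field(default_factory=list)

    def introduce(self, description, kind):
        """Add a new referent to the model as the discourse advances."""
        ref = Referent(len(self.referents) + 1, description, kind)
        self.referents.append(ref)
        return ref

    def resolve(self, kind):
        """Return the most recently introduced referent of the given kind."""
        for ref in reversed(self.referents):
            if ref.kind == kind:
                return ref
        return None


model = DiscourseModel()
model.introduce("a salesman", "entity")
model.introduce("the salesman opens the shop", "event")
model.introduce("a receptionist", "entity")
antecedent = model.resolve("entity")  # candidate antecedent for a later pronoun
```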

We chose this approach to semantic tagging because it is simple and reliable. We aim for a basic and flexible analysis that extracts the greatest possible amount of data from the corpus without tying it to an excessively restrictive theory. In this regard, our procedure is to arrive at the theory from the data. We look for the most accurate regularities, even at the risk of seeming to take an oversimplified point of view.

We recognize that language can convey countless meanings and that, although the set of morpho-syntactic phenomena is very limited, the formal semantic system must be rich. Nevertheless, our objective is limited: we want to gain the experience that is essential for establishing the general principles of semantics (Jackendoff, 1990).

We follow Moreno Cabrera's proposal (Moreno Cabrera, 1985, 1991a, 1991b, 1997) on event analysis, although we have also considered other very similar approaches (Fernandez, 2000). We therefore propose tags based on the analysis of event types. This lets us classify verbs by type and describe the classes of arguments they are related to.

The events expressed by verbs fall into three main types, which form a universal hierarchy (Moreno Cabrera, 1997): states, processes and actions. Syntactic and/or morphological aspects (depending on the language) depend on the event type. These three types are divided into subtypes according to the arguments they require.
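
The three-way classification can be sketched as a small data structure. The type names come from the text above; the example verbs, the argument labels, and the numeric ordering (which simply follows the order in which the types are listed) are assumptions for illustration and are not taken from the SESCO tag set.

```python
from dataclasses import dataclass
from enum import IntEnum


class EventType(IntEnum):
    # Values follow the order in which the text lists the types.
    STATE = 1
    PROCESS = 2
    ACTION = 3


@dataclass(frozen=True)
class VerbEntry:
    lemma: str
    event_type: EventType
    arguments: tuple  # hypothetical argument labels


# Hypothetical lexicon entries, for illustration only.
LEXICON = [
    VerbEntry("know", EventType.STATE, ("experiencer", "theme")),
    VerbEntry("fall", EventType.PROCESS, ("theme",)),
    VerbEntry("build", EventType.ACTION, ("agent", "theme")),
]


def group_by_type(entries):
    """Group verb lemmas under the event type they are tagged with."""
    groups = {event_type: [] for event_type in EventType}
    for entry in entries:
        groups[entry.event_type].append(entry.lemma)
    return groups


groups = group_by_type(LEXICON)
```

Subtypes could be modelled in the same way by keying on the `arguments` tuple within each event type.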































Corpora


The present work takes as its point of reference the spoken corpus of the Computational Linguistics Laboratory of the Universidad Autónoma de Madrid (http://www.lllf.uam.es/), which in turn is part of the European project "C-ORAL-ROM". The C-ORAL-ROM project is collecting a spoken corpus in four Romance languages (Italian, Portuguese, French and Spanish), developed by the University of Florence, the University of Lisbon Foundation, the University of Aix-en-Provence and the Universidad Autónoma de Madrid.

The Spanish C-ORAL-ROM corpus is made up of more than 300,000 transcribed and tagged words.

The texts were recorded to meet requirements of spontaneity, sound quality and speaker variety. They were recorded in a wide range of contexts (family settings, conferences, etc.) and always digitally.

The variety of speakers divides the corpus into two main blocks, informal and formal. The informal block contains private and public recordings; in the private recordings the speakers do not play a fixed public role (public roles are, for example, those of a salesman or a receptionist). The formal block contains texts taken from the mass media, telephone conversations and formal speech in natural contexts (conferences, lectures, etc.).

In a second stage, the UAM Spanish Treebank has also been tagged. This corpus is made up of 1600 sentences taken from digital newspapers (El País and Compra Maestra). These sentences are annotated in a format very similar to that of the Penn Treebank.






















Papers


    Alcántara Plá, Manuel and Antonio Moreno Cabrera: "Syntax to Semantics Transformation: Application to Treebanking"; in Proceedings of Frontiers in Corpus Annotation 2004; HLT-NAACL Boston 2004.

    Alcántara Plá, Manuel: "Semantic Tagging System for Corpora"; Fifth International Workshop on Computational Semantics; Tilburg University, 2003.


    Alcántara Plá, Manuel; Antonio Moreno Sandoval, Guillermo de la Madrid Heitzmann, Ana González Ledesma and Fernando Ares Chicote: "C-ORAL-ROM. Corpus integrado de referencia en lenguas romances"; Procesamiento de Lenguaje Natural, no. 31, September 2003.

    Guirao Miras, Jose María; Antonio Moreno Sandoval, Ana González Ledesma, Guillermo de la Madrid Heitzmann and Manuel Alcántara Plá: "Relating linguistic units to socio-contextual information in a spontaneous speech corpus of Spanish"; chapter in Corpus Linguistics across the Word; Rodopi, 2003.


























    Further information


    SESCO is the Ph.D. work of Manuel Alcántara Plá, supervised by Antonio Moreno Sandoval, professor at the Universidad Autónoma de Madrid.

    We would especially like to thank Juan Carlos Moreno Cabrera for his advice and assistance on the theoretical issues.
    We also owe a debt of gratitude to the "C-ORAL-ROM Madrid Team" for their assistance and support.


    Contact information:

            Computational Linguistics Laboratory (http://www.lllf.uam.es)
            Facultad de Filosofía y Letras
            Universidad Autónoma de Madrid
            Campus de Cantoblanco. Carretera de Colmenar Km.16
            (28049) Madrid. Spain.



                E-mail: manuel@maria.lllf.uam.es
       
                 
                Main website: http://www.lllf.uam.es/~manuel

                SESCO website in Spanish: http://www.lllf.uam.es/~manuel/sesco/sesco-ca.htm

                Telephone number: (+034) 91 397 52 50




    BIBLIOGRAPHY





