Semantic Tagging System for Corpora
Our main purpose is to develop a tagging
system that allows the semantic representation of a linguistic corpus
(Semantic Tagging System for Corpora - Sistema de Etiquetado
Semántico para Corpora: SESCO).
One of the most striking aspects of our
system is its universality, a characteristic derived from the semantic
theory of Moreno Cabrera (1997). At present, it is being
developed on both a spontaneous speech corpus and a written corpus, and
SESCO is designed so that it can be used with any other language or type
of corpus.
The main arguments in support of our
approach are the following:
·With SESCO we can handle
very large corpora. Software applications have been developed to
carry out semi-automatic analysis.
·The SESCO system has the
portability and universality of XML.
·In version 2.0, SESCO
is compatible with other annotation systems.
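To illustrate the XML point, here is a minimal sketch in Python. The element and attribute names (`sentence`, `event`, `type`, `verb`) are assumptions made for this example only and are not the actual SESCO schema, which is not reproduced on this page. The sketch only shows that an annotation written as XML can be read back by any standard XML parser.

```python
import xml.etree.ElementTree as ET


def annotate(text, verb, event_type):
    """Wrap a text fragment in a hypothetical <event> element.

    The tag and attribute names are illustrative, not the SESCO schema.
    """
    sentence = ET.Element("sentence")
    event = ET.SubElement(sentence, "event", {"type": event_type, "verb": verb})
    event.text = text
    return ET.tostring(sentence, encoding="unicode")


xml_string = annotate("Ana opens the door", "open", "action")

# Any standard XML parser can read the annotation back.
parsed = ET.fromstring(xml_string)
event = parsed.find("event")
print(event.get("type"), event.get("verb"), event.text)
```

Because the annotation is plain XML, the same file could be processed with tools written in any language that has an XML parser.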
Since the development of annotated speech
corpora provides an effective way of testing theories and of discovering
new problems, considerable growth in corpus-based semantics
should be expected in the near future.
The semantics of events has developed considerably in recent years (Tenny and Pustejovsky, 2000). Along the same lines, we assume that the system has to give information not only about language fragments, but also about the references and relations that can arise during a discourse. A dynamic model is needed that allows the retrieval of representations that have already been referred to (Webber and Nilsson, 1981).
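The idea of retrieving representations that have already been mentioned can be sketched as a small store of discourse referents. This is a toy illustration under our own assumptions, not the model of Webber and Nilsson (1981) and not the SESCO implementation; the class name `DiscourseModel`, its methods and the identifiers `e1` and `x1` are hypothetical.

```python
class DiscourseModel:
    """Keeps representations introduced so far so later mentions can retrieve them."""

    def __init__(self):
        self._referents = {}  # identifier -> representation
        self._order = []      # identifiers in order of introduction

    def introduce(self, ref_id, representation):
        """Register a new event or entity when it first appears in the discourse."""
        self._referents[ref_id] = representation
        self._order.append(ref_id)

    def retrieve(self, ref_id):
        """Return the stored representation, or None if it was never introduced."""
        return self._referents.get(ref_id)

    def most_recent(self):
        """Return the identifier introduced last, or None for an empty discourse."""
        return self._order[-1] if self._order else None


model = DiscourseModel()
model.introduce("e1", "Ana opens the door")  # an event
model.introduce("x1", "the door")            # an entity mentioned in that event

# A later mention can point back to what was already introduced.
antecedent = model.retrieve("x1")
```

A real system would also need rules for deciding which stored referent a later expression points to; the sketch only covers storage and lookup.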
Our semantic tagging scheme has been chosen because it is simple and reliable. We aim for an essential and flexible analysis that extracts the greatest possible amount of data from the corpus without tying it to an excessively restrictive theory. In this regard, our procedure is to reach the theory from the data. We look for the most accurate regularities, even if our point of view may seem overly simple.
We understand that countless meanings can be conveyed through language and that, although the set of morpho-syntactic phenomena is very limited, the formal semantic system must be rich. Nevertheless, our objective is limited: we want to make progress in acquiring the experience that is essential for establishing the general principles of semantics (Jackendoff, 1990).
We follow Moreno Cabrera's proposal (Moreno Cabrera, 1985, 1991a, 1991b, 1997) on event analysis, although we have also considered other very similar approaches (Fernandez, 2000). We therefore propose tags based on the analysis of event types. This allows us to classify verbs into different types and to describe the classes of arguments they take.
The events expressed by
verbs fall into three major types, which form a universal hierarchy
(Moreno Cabrera, 1997): states, processes and actions. Syntactic and/or
morphological aspects (depending on the language) depend on the
event type. These three types are divided into subtypes according to
the arguments they require.
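The three event types can be sketched as a small data structure. This is an illustration under our own assumptions, not the SESCO tag set: the class `EventTag`, the argument labels `arg1` and `arg2`, and the example classifications are hypothetical, and the subtype inventory is not modelled because it is not listed here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class EventType(Enum):
    """The three top-level event types named in the text."""
    STATE = "state"
    PROCESS = "process"
    ACTION = "action"


@dataclass
class EventTag:
    """A hypothetical tag attached to one verb occurrence."""
    verb: str
    event_type: EventType
    # Argument labels are illustrative; the real subtypes are not given here.
    arguments: Dict[str, str] = field(default_factory=dict)


def group_by_type(tags: List[EventTag]) -> Dict[EventType, List[str]]:
    """Group verb lemmas by their event type."""
    groups: Dict[EventType, List[str]] = {t: [] for t in EventType}
    for tag in tags:
        groups[tag.event_type].append(tag.verb)
    return groups


# Illustrative classifications, chosen for the example only.
tags = [
    EventTag("be", EventType.STATE, {"arg1": "the door", "arg2": "open"}),
    EventTag("open", EventType.PROCESS, {"arg1": "the door"}),
    EventTag("open", EventType.ACTION, {"arg1": "Ana", "arg2": "the door"}),
]
groups = group_by_type(tags)
```

Tagging each verb occurrence rather than each lemma lets the same verb appear under more than one event type, depending on the arguments it takes in context.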
The Spanish C-ORAL-ROM corpus is made up of more than 300,000 transcribed and tagged words.
Texts have been recorded following requirements of spontaneity, sound quality and speaker variety. They have been recorded in a wide range of contexts (family settings, conferences, etc.), always using digital processing.
The variety of speakers
divides the corpus into two major blocks, the informal and the formal
one. The informal block has private and public recordings; in the
private recordings the speakers do not play a fixed public role
(public roles are, for example, a salesperson or a receptionist).
The formal block has texts taken from mass media, telephone recordings and
formal speech in natural contexts (conferences, lectures, etc.).
In a second stage, the UAM
Spanish Treebank has also been tagged. This corpus is made up
of 1,600 sentences taken from digital newspapers (El País and Compra Maestra). These sentences
are annotated in a format that is very similar to that of the
University of Pennsylvania's syntactic treebank (the Penn Treebank).
Alcántara Plá, Manuel and Antonio Moreno Cabrera: "Syntax to Semantics Transformation: Application to Treebanking"; in Proceedings of Frontiers in Corpus Annotation 2004; HLT-NAACL, Boston, 2004.
Alcántara Plá, Manuel: "Semantic Tagging System for Corpora"; Fifth International Workshop on Computational Semantics; Tilburg University, 2003.
Alcántara Plá, Manuel; Antonio Moreno Sandoval, Guillermo de la Madrid Heitzmann, Ana González Ledesma and Fernando Ares Chicote: "C-ORAL-ROM. Corpus integrado de referencia en lenguas romances"; Procesamiento de Lenguaje Natural, no. 31, September 2003.
Guirao Miras, Jose María; Antonio Moreno Sandoval, Ana González Ledesma, Guillermo de la Madrid Heitzmann and Manuel Alcántara Plá: "Relating linguistic units to socio-contextual information in a spontaneous speech corpus of Spanish"; chapter in Corpus Linguistics across the Word; Rodopi, 2003.