at.ofai.gate.creole
Class MtlTransducer

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractProcessingResource
              extended by gate.creole.AbstractLanguageAnalyser
                  extended by at.ofai.gate.creole.MtlTransducer
All Implemented Interfaces:
gate.creole.ANNIEConstants, gate.Executable, gate.LanguageAnalyser, gate.ProcessingResource, gate.Resource, gate.util.FeatureBearer, gate.util.NameBearer, java.io.Serializable

public class MtlTransducer
extends gate.creole.AbstractLanguageAnalyser

Montreal Transducer: a cascaded, multi-phase, ontology-aware transducer using the JAPE language, which is a variant of the CPSL language. Requires Java 1.4 or higher.

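The Montreal Transducer is based on the Transducer from the ANNIE suite, with the following added features: more comparison operators, conjunctions of constraints on different types of annotation, and greedy Kleene operators.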

To use this resource, the user must load the repository (or directory) containing the creole.xml and the resource jar file. The repository must be accessible via the file:// protocol. Unlike most resources, the repository cannot be a web URL (http://www...). This is because the transducer compiles Java code (the grammar rules) every time it is loaded, and the resource jar file must be on the classpath when compiling, but only regular file URLs are allowed in the classpath. The resource tries to add the jar file to the classpath automatically; if problems arise when loading the transducer, add the jar file to the classpath manually before running the application.
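The file:// requirement can be checked before a repository is loaded. The following is a minimal sketch only: the class name, the helper name and the example locations are hypothetical and are not part of this resource or of GATE.

```java
import java.net.URI;

final class RepositoryUrlCheck {

    /** Returns true when the location uses the file scheme, false for http, https, etc. */
    static boolean isFileUrl(String location) {
        String scheme = URI.create(location).getScheme();
        return "file".equalsIgnoreCase(scheme);
    }

    public static void main(String[] args) {
        // Hypothetical repository locations, for illustration only.
        System.out.println(isFileUrl("file:///opt/gate/plugins/montreal/"));       // true
        System.out.println(isFileUrl("http://www.example.org/plugins/montreal/")); // false
    }
}
```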

More comparison operators

The Montreal Transducer offers more comparison operators for the left-hand-side constraints of a JAPE grammar. The standard ANNIE transducer allows only constraints like these:

The Montreal Transducer allows the following constraints:

See the notes on the equality operators, comparison operators, pattern matching operators and negation operator.

Notes on equality operators: "==" and "!="

The "!=" operator is the negation of the "==" operator, that is to say: {Annot.attribute != value} is equivalent to {!Annot.attribute == value}.

When a constraint on an attribute cannot be evaluated because an annotation does not have a value for the attribute, the equality operator returns false (and the difference operator returns true).

If the constraint's attribute is a string, then the String.equals method is called with the annotation's attribute as a parameter. If the constraint's attribute is an integer, then the Long.equals method is called. If the constraint's attribute is a float, then the Double.equals method is called. And if the constraint's attribute is a boolean, then the Boolean.equals method is called. The grammar parser does not allow other types of constraints.

Normally, when the types of the constraint's attribute and the annotation's attribute differ, they cannot be equal. However, because some ANNIE processing resources (namely the tokeniser) set all attribute values as strings even when they are numbers (Token.length is set to a string value, for example), the Montreal Transducer can convert the string to a Long/Double/Boolean before testing for equality. In other words, for the token "dog", whose Token.length is stored as the string "3", a constraint such as {Token.length == 3} can still match.
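The equality rules described in these notes can be sketched in plain Java. This is an illustration under stated assumptions, not the transducer's actual code: the class and method names are hypothetical, and the handling of strings that cannot be converted (treated here as "not equal") is an assumption.

```java
final class EqualitySketch {

    /** Mimics {Annot.attribute == constraintValue} for a single attribute value. */
    static boolean attributeEquals(Object constraintValue, Object annotationValue) {
        if (annotationValue == null) {
            return false; // no value for the attribute: "==" is false
        }
        if (annotationValue instanceof String && !(constraintValue instanceof String)) {
            // The annotation holds a string but the constraint is typed: try to convert.
            annotationValue = convert((String) annotationValue, constraintValue);
            if (annotationValue == null) {
                return false; // assumption: an unconvertible string is simply not equal
            }
        }
        return constraintValue.equals(annotationValue);
    }

    /** "!=" is the negation of "==", so a missing attribute yields true. */
    static boolean attributeDiffers(Object constraintValue, Object annotationValue) {
        return !attributeEquals(constraintValue, annotationValue);
    }

    /** Converts text to the type of the constraint value, or returns null if it cannot. */
    static Object convert(String text, Object target) {
        try {
            if (target instanceof Long) return Long.valueOf(text);
            if (target instanceof Double) return Double.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
        if (target instanceof Boolean
                && (text.equalsIgnoreCase("true") || text.equalsIgnoreCase("false"))) {
            return Boolean.valueOf(text);
        }
        return null;
    }

    public static void main(String[] args) {
        String tokenLength = "3"; // Token.length for "dog", stored as a string
        System.out.println(attributeEquals(3L, tokenLength));  // true, after conversion to Long
        System.out.println(attributeEquals("3", tokenLength)); // true, plain String.equals
        System.out.println(attributeEquals(4L, tokenLength));  // false
        System.out.println(attributeDiffers(3L, null));        // true, attribute missing
    }
}
```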

Notes on comparison operators: ">", "<", ">=" and "<="

If the constraint's attribute is a string, then the String.compareTo method is called with the annotation's attribute as a parameter (strings can be compared alphabetically). If the constraint's attribute is an integer, then the Long.compareTo method is called. If the constraint's attribute is a float, then the Double.compareTo method is called. The transducer issues a warning if an attempt is made to compare two Boolean values because, in Java 1.4, this type does not implement the Comparable interface and thus has no compareTo method.

The transducer issues a warning when it encounters an annotation's attribute that cannot be compared to the constraint's attribute because the value types are different, or because one value is null. For example, given a constraint {MyAnnot.attrib > 2}, a warning is issued for any MyAnnot in the document for which attrib is not an integer, such as attrib = "dog", because "dog" > 2 cannot be evaluated. Similarly, {MyAnnot.attrib > 2} cannot be evaluated against attrib = 2.5 because 2.5 is a float. In this case, write 2 as a float: {MyAnnot.attrib > 2.0}.

The transducer does not issue a warning when the constraint's attribute is an integer/float and the annotation's attribute is a string that can be parsed as an integer/float. Some ANNIE processing resources (namely the tokeniser) set all attribute values as strings even when they are numbers (Token.length is set to a string value, for example). Because {Token.length < "10"} would lead to an alphabetical comparison, a workaround was needed so that {Token.length < 10} could be written instead.
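The comparison rules in these notes can be sketched as follows. This is a hedged illustration rather than the transducer's implementation: the class and method names are hypothetical, an empty OptionalInt stands in for the cases where the transducer would issue a warning, and the sketch compares the annotation's value against the constraint's value so that the sign of the result reads naturally.

```java
import java.util.OptionalInt;

final class ComparisonSketch {

    /** Compares an annotation value with a constraint value; empty means "warning". */
    static OptionalInt compare(Object annotationValue, Object constraintValue) {
        if (annotationValue == null || constraintValue == null) {
            return OptionalInt.empty(); // one value is null
        }
        if (constraintValue instanceof String) {
            if (annotationValue instanceof String) {
                // Alphabetical comparison of two strings.
                return OptionalInt.of(
                        ((String) annotationValue).compareTo((String) constraintValue));
            }
            return OptionalInt.empty(); // types differ
        }
        if (constraintValue instanceof Long) {
            Long value = asLong(annotationValue);
            return value == null
                    ? OptionalInt.empty()
                    : OptionalInt.of(value.compareTo((Long) constraintValue));
        }
        if (constraintValue instanceof Double) {
            Double value = asDouble(annotationValue);
            return value == null
                    ? OptionalInt.empty()
                    : OptionalInt.of(value.compareTo((Double) constraintValue));
        }
        return OptionalInt.empty(); // e.g. Boolean constraints are not compared
    }

    /** Accepts a Long, or a string that parses as a Long; otherwise returns null. */
    static Long asLong(Object value) {
        if (value instanceof Long) return (Long) value;
        if (value instanceof String) {
            try {
                return Long.valueOf((String) value);
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return null;
    }

    /** Accepts a Double, or a string that parses as a Double; otherwise returns null. */
    static Double asDouble(Object value) {
        if (value instanceof Double) return (Double) value;
        if (value instanceof String) {
            try {
                return Double.valueOf((String) value);
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return null;
    }

    /** Mimics {Annot.attribute < constraintValue}. */
    static boolean lessThan(Object annotationValue, Object constraintValue) {
        OptionalInt result = compare(annotationValue, constraintValue);
        return result.isPresent() && result.getAsInt() < 0;
    }

    public static void main(String[] args) {
        System.out.println(lessThan("3", 10L));             // true: "3" is parsed as a Long
        System.out.println(lessThan("3", "10"));            // false: alphabetical comparison
        System.out.println(compare("dog", 2L).isPresent()); // false: cannot be evaluated
        System.out.println(compare(2.5, 2L).isPresent());   // false: float vs integer
        System.out.println(compare(2.5, 2.0).getAsInt() > 0); // true
    }
}
```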

Notes on pattern matching operators: "=~" and "!~"

The "!~" operator is the negation of the "=~" operator, that is to say: {Annot.attribute !~ "value"} is equivalent to {!Annot.attribute =~ "value"}.

When a constraint on an attribute cannot be evaluated because an annotation does not have a value for the attribute, the value defaults to an empty string ("").

The regular expression must be enclosed in double quotes, otherwise the transducer issues a warning:

The regular expression must be a valid java.util.regex.Pattern, otherwise a warning is issued.

To have a match, the regular expression must cover the entire attribute string, not only a part of it. For example:
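The whole-string requirement corresponds to the behaviour of Pattern.matches in java.util.regex. A minimal sketch, assuming hypothetical class and method names; returning false for an invalid pattern is an assumption, since the notes above only say that a warning is issued:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class PatternMatchSketch {

    /** Mimics {Annot.attribute =~ "regex"}: the regex must cover the whole value. */
    static boolean matchesWhole(String regex, String attributeValue) {
        // A missing attribute value defaults to the empty string.
        String value = attributeValue == null ? "" : attributeValue;
        try {
            return Pattern.matches(regex, value);
        } catch (PatternSyntaxException e) {
            return false; // assumption: an invalid pattern never matches
        }
    }

    /** "!~" is the negation of "=~". */
    static boolean doesNotMatch(String regex, String attributeValue) {
        return !matchesWhole(regex, attributeValue);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("dog", "dogs"));   // false: only part of the string
        System.out.println(matchesWhole("dog.*", "dogs")); // true: covers the whole string
        System.out.println(matchesWhole(".*", null));      // true: missing value is ""
        System.out.println(matchesWhole("[dog", "dog"));   // false: invalid pattern
    }
}
```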

Notes on the negation operator: "!"

Bindings: when a constraint contains both negated and regular elements, the negated elements do not affect the bindings of the regular elements. Thus, {Person, !Organization} binds to the same annotations (amongst those that start at the current node in the annotation graph) as {Person}; the difference between the two is that the first will simply not match if one of the annotations starting at the current node is an Organization. On the other hand, when a constraint contains only negated elements, such as {!Organization}, it binds to all annotations starting at the current node. It is important to keep this in mind, especially when a rule ends with a constraint that has negated elements only: the longest annotation at the current node will be preferred.
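The binding behaviour can be modelled with a small sketch. It is a simplification under stated assumptions: annotations starting at the current node are reduced to their type names, and the class and method names are hypothetical rather than taken from the transducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class NegationBindingSketch {

    /**
     * Returns the annotation types bound at a node, or empty when the constraint
     * does not match. typesAtNode lists the types of all annotations starting there.
     */
    static Optional<List<String>> bind(List<String> typesAtNode,
                                       Set<String> required,
                                       Set<String> negated) {
        for (String type : typesAtNode) {
            if (negated.contains(type)) {
                return Optional.empty(); // a negated type starts here: no match
            }
        }
        if (!typesAtNode.containsAll(required)) {
            return Optional.empty(); // a required type is missing: no match
        }
        if (required.isEmpty()) {
            // Only negated elements: bind to everything starting at this node.
            return Optional.of(new ArrayList<>(typesAtNode));
        }
        List<String> bound = new ArrayList<>();
        for (String type : typesAtNode) {
            if (required.contains(type)) {
                bound.add(type); // negated elements do not affect these bindings
            }
        }
        return Optional.of(bound);
    }

    public static void main(String[] args) {
        Set<String> person = Collections.singleton("Person");
        Set<String> organization = Collections.singleton("Organization");
        Set<String> none = Collections.emptySet();

        // {Person, !Organization} at a node with Person and Token: binds Person only.
        System.out.println(bind(Arrays.asList("Person", "Token"), person, organization));
        // Same constraint where an Organization also starts: no match.
        System.out.println(bind(Arrays.asList("Person", "Organization"), person, organization));
        // {!Organization}: binds every annotation starting at the node.
        System.out.println(bind(Arrays.asList("Person", "Token"), none, organization));
    }
}
```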

Conjunctions of constraints on different types of annotation

The Montreal Transducer allows constraints on different types of annotation. Though the JAPE implementation described in the GATE 2.1 User Guide details an algorithm that would allow such constraints, the ANNIE transducer does not implement it; this transducer does. These examples do not work as expected with the ANNIE transducer but do with this transducer:

As described in the algorithm, the first example above matches points in the document (or nodes in the annotation graph) where both a Person and an Organization annotation begin, even if they do not end at the same point in the document and even if other annotations begin at the same point. When a negation is involved, such as in the third example above, no annotation of that kind must begin at a given point for a match to occur (see the notes on the negation operator).

Greedy Kleene operators: "*" and "+"

The ANNIE transducer does not behave consistently regarding the "*" and "+" Kleene operators. Suppose we have the following rule with 2 bindings:

Given the sentence "the Honourable Mr. John Atkinson", we expect the following bindings:

But the ANNIE transducer could give something like:

This is not incorrect, but according to convention, "*" and "+" operators match as many tokens as possible before moving on to the next constraint. The Montreal Transducer guarantees that "*" and "+" are greedy.
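Greedy matching can be illustrated by analogy with java.util.regex, where "*" is greedy and "*?" is reluctant. The sketch below is not a JAPE rule and does not reproduce the rule discussed above; the title list, the patterns and the class name are assumptions chosen only to show how the two bindings shift.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedySketch {

    // Hypothetical title list and name pattern, for illustration only.
    static final String TITLE = "(?:Honourable |Mr\\. )";
    static final String NAME_WORDS = "((?:[A-Z][a-z]+\\.? ?)+)";

    /** Returns {titles, names}, or an empty array when nothing matches. */
    static String[] split(String sentence, boolean greedy) {
        String titles = "(" + TITLE + (greedy ? "*" : "*?") + ")";
        Matcher matcher = Pattern.compile(titles + NAME_WORDS).matcher(sentence);
        if (!matcher.find()) {
            return new String[0];
        }
        return new String[] { matcher.group(1).trim(), matcher.group(2).trim() };
    }

    public static void main(String[] args) {
        String sentence = "the Honourable Mr. John Atkinson";

        String[] greedy = split(sentence, true);
        // titles = "Honourable Mr.", names = "John Atkinson"
        System.out.println("[" + greedy[0] + "] [" + greedy[1] + "]");

        String[] reluctant = split(sentence, false);
        // titles = "", names = "Honourable Mr. John Atkinson"
        System.out.println("[" + reluctant[0] + "] [" + reluctant[1] + "]");
    }
}
```

With the greedy quantifier the first group takes as many titles as it can before the second group starts, which is the behaviour the Montreal Transducer guarantees for "*" and "+".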

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource
gate.creole.AbstractProcessingResource.InternalStatusListener, gate.creole.AbstractProcessingResource.IntervalProgressListener
 
Field Summary
protected  Batch batch
          The actual JapeTransducer used for processing the document(s).
static java.lang.String TRANSD_AUTHORISE_DUPLICATES_PARAMETER_NAME
           
static java.lang.String TRANSD_DOCUMENT_PARAMETER_NAME
           
static java.lang.String TRANSD_ENCODING_PARAMETER_NAME
           
static java.lang.String TRANSD_GRAMMAR_URL_PARAMETER_NAME
           
static java.lang.String TRANSD_INPUT_AS_PARAMETER_NAME
           
static java.lang.String TRANSD_OUTPUT_AS_PARAMETER_NAME
           
 
Fields inherited from class gate.creole.AbstractLanguageAnalyser
corpus, document
 
Fields inherited from class gate.creole.AbstractProcessingResource
interrupted
 
Fields inherited from class gate.creole.AbstractResource
name
 
Fields inherited from class gate.util.AbstractFeatureBearer
features
 
Fields inherited from interface gate.creole.ANNIEConstants
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_INSTANCE_FEATURE_NAME, LOOKUP_LANGUAGE_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
 
Constructor Summary
MtlTransducer()
          Default constructor.
 
Method Summary
 void cleanup()
          Remove this class' jar file from the system classpath so that the system state is the same as when the init method was called (and before this class' jar file was added to the classpath, if missing).
 void execute()
          Implementation of the run() method from Runnable.
 java.lang.Boolean getAuthoriseDuplicates()
          Gets the authoriseDuplicates flag that allows or prevents the transducer from creating annotations that already exist at some point in the document.
 java.lang.String getEncoding()
          Gets the encoding used for reading the grammar file(s).
 java.net.URL getGrammarURL()
          Gets the URL to the grammar used to build this transducer.
 java.lang.String getInputASName()
          Gets the AnnotationSet used as input by this transducer.
 gate.creole.ontology.Ontology getOntology()
          Gets the ontology used by this transducer.
 java.lang.String getOutputASName()
          Gets the AnnotationSet used as output by this transducer.
 gate.Resource init()
          This method is the one responsible for initialising the transducer.
 void interrupt()
          Notifies all the PRs in this controller that they should stop their execution as soon as possible.
 void setAuthoriseDuplicates(java.lang.Boolean newAuthoriseDuplicates)
          Sets the authoriseDuplicates flag that allows or prevents the transducer from creating annotations that already exist at some point in the document.
 void setEncoding(java.lang.String newEncoding)
          Sets the encoding to be used for reading the input file(s) forming the JAPE grammar.
 void setGrammarURL(java.net.URL newGrammarURL)
          Sets the grammar to be used for building this transducer.
 void setInputASName(java.lang.String newInputASName)
          Sets the AnnotationSet to be used as input for the transducer.
 void setOntology(gate.creole.ontology.Ontology ontology)
          Sets the ontology used by this transducer.
 void setOutputASName(java.lang.String newOutputASName)
          Sets the AnnotationSet to be used as output by the transducer.
 
Methods inherited from class gate.creole.AbstractLanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from class gate.creole.AbstractProcessingResource
addProgressListener, addStatusListener, fireProcessFinished, fireProgressChanged, fireStatusChanged, isInterrupted, reInit, removeProgressListener, removeStatusListener
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class gate.util.AbstractFeatureBearer
getFeatures, setFeatures
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.ProcessingResource
reInit
 
Methods inherited from interface gate.Resource
getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.FeatureBearer
getFeatures, setFeatures
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 
Methods inherited from interface gate.Executable
isInterrupted
 

Field Detail

TRANSD_DOCUMENT_PARAMETER_NAME

public static final java.lang.String TRANSD_DOCUMENT_PARAMETER_NAME
See Also:
Constant Field Values

TRANSD_INPUT_AS_PARAMETER_NAME

public static final java.lang.String TRANSD_INPUT_AS_PARAMETER_NAME
See Also:
Constant Field Values

TRANSD_OUTPUT_AS_PARAMETER_NAME

public static final java.lang.String TRANSD_OUTPUT_AS_PARAMETER_NAME
See Also:
Constant Field Values

TRANSD_ENCODING_PARAMETER_NAME

public static final java.lang.String TRANSD_ENCODING_PARAMETER_NAME
See Also:
Constant Field Values

TRANSD_GRAMMAR_URL_PARAMETER_NAME

public static final java.lang.String TRANSD_GRAMMAR_URL_PARAMETER_NAME
See Also:
Constant Field Values

TRANSD_AUTHORISE_DUPLICATES_PARAMETER_NAME

public static final java.lang.String TRANSD_AUTHORISE_DUPLICATES_PARAMETER_NAME
See Also:
Constant Field Values

batch

protected Batch batch
The actual JapeTransducer used for processing the document(s).

Constructor Detail

MtlTransducer

public MtlTransducer()
Default constructor. Does nothing apart from calling the default constructor from the super class. The actual object initialisation is done via the init() method.

Method Detail

init

public gate.Resource init()
                   throws gate.creole.ResourceInstantiationException
This method is responsible for initialising the transducer. It assumes that all the needed parameters have already been set using the appropriate setXXX() methods.

Specified by:
init in interface gate.Resource
Overrides:
init in class gate.creole.AbstractProcessingResource
Returns:
a reference to this
Throws:
gate.creole.ResourceInstantiationException

execute

public void execute()
             throws gate.creole.ExecutionException
Implementation of the run() method from Runnable. This method is responsible for doing all the processing of the input document.

Specified by:
execute in interface gate.Executable
Overrides:
execute in class gate.creole.AbstractProcessingResource
Throws:
gate.creole.ExecutionException

interrupt

public void interrupt()
Notifies all the PRs in this controller that they should stop their execution as soon as possible.

Specified by:
interrupt in interface gate.Executable
Overrides:
interrupt in class gate.creole.AbstractProcessingResource

setGrammarURL

public void setGrammarURL(java.net.URL newGrammarURL)
Sets the grammar to be used for building this transducer.

Parameters:
newGrammarURL - a URL to a file containing a JAPE grammar.

getGrammarURL

public java.net.URL getGrammarURL()
Gets the URL to the grammar used to build this transducer.

Returns:
a URL pointing to the grammar file.

setEncoding

public void setEncoding(java.lang.String newEncoding)
Sets the encoding to be used for reading the input file(s) forming the JAPE grammar. Note that if the input grammar is a multi-file one, then the same encoding will be used for reading all the files. Multi-file grammars with different encodings across the composing files are not supported.

Parameters:
newEncoding - a String representing the encoding.

getEncoding

public java.lang.String getEncoding()
Gets the encoding used for reading the grammar file(s).


setInputASName

public void setInputASName(java.lang.String newInputASName)
Sets the AnnotationSet to be used as input for the transducer.

Parameters:
newInputASName - the name of the input AnnotationSet

getInputASName

public java.lang.String getInputASName()
Gets the AnnotationSet used as input by this transducer.

Returns:
the name of the input AnnotationSet

setOutputASName

public void setOutputASName(java.lang.String newOutputASName)
Sets the AnnotationSet to be used as output by the transducer.

Parameters:
newOutputASName - the name of the output AnnotationSet

getOutputASName

public java.lang.String getOutputASName()
Gets the AnnotationSet used as output by this transducer.

Returns:
the name of the output AnnotationSet

setAuthoriseDuplicates

public void setAuthoriseDuplicates(java.lang.Boolean newAuthoriseDuplicates)
Sets the authoriseDuplicates flag that allows or prevents the transducer from creating annotations that already exist at some point in the document. This is particularly useful when the transducer is called more than once in a pipeline (as when the gazetteer is updated by a first pass of the transducer and the transducer then does a second pass).

Parameters:
newAuthoriseDuplicates - if set to false, the transducer performs right-hand-side actions as usual but does not add annotations to the output annotation set when an identical annotation exists at the same point in the document.
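
The effect of the flag can be sketched with a small stand-alone model. It is an illustration only: the class and method names are hypothetical, and treating two annotations as identical when their type and offsets agree is an assumption, since the exact identity test used by the transducer is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DuplicateFilterSketch {

    private final boolean authoriseDuplicates;
    private final Set<List<Object>> seen = new HashSet<>();
    private final List<List<Object>> output = new ArrayList<>();

    DuplicateFilterSketch(boolean authoriseDuplicates) {
        this.authoriseDuplicates = authoriseDuplicates;
    }

    /** Adds an annotation to the output; returns false when it was suppressed. */
    boolean add(String type, long start, long end) {
        // Assumption: type plus offsets identify an annotation.
        List<Object> key = Arrays.<Object>asList(type, start, end);
        boolean isNew = seen.add(key);
        if (!isNew && !authoriseDuplicates) {
            return false; // identical annotation already exists at this point
        }
        output.add(key);
        return true;
    }

    int size() {
        return output.size();
    }

    public static void main(String[] args) {
        DuplicateFilterSketch strict = new DuplicateFilterSketch(false);
        strict.add("Person", 0, 5);
        strict.add("Person", 0, 5); // second pass creates the same annotation
        System.out.println(strict.size()); // 1

        DuplicateFilterSketch lenient = new DuplicateFilterSketch(true);
        lenient.add("Person", 0, 5);
        lenient.add("Person", 0, 5);
        System.out.println(lenient.size()); // 2
    }
}
```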

getAuthoriseDuplicates

public java.lang.Boolean getAuthoriseDuplicates()
Gets the authoriseDuplicates flag that allows or prevents the transducer from creating annotations that already exist at some point in the document.

Returns:
true/false

getOntology

public gate.creole.ontology.Ontology getOntology()
Gets the ontology used by this transducer.

Returns:
an Ontology value.

setOntology

public void setOntology(gate.creole.ontology.Ontology ontology)
Sets the ontology used by this transducer.

Parameters:
ontology - an Ontology value.

cleanup

public void cleanup()
Remove this class' jar file from the system classpath so that the system state is the same as when the init method was called (and before this class' jar file was added to the classpath, if missing).

Specified by:
cleanup in interface gate.Resource
Overrides:
cleanup in class gate.creole.AbstractProcessingResource