at.ofai.gate
Class OFAIListGazetteer

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractProcessingResource
              extended by gate.creole.AbstractLanguageAnalyser
                  extended by gate.creole.gazetteer.AbstractGazetteer
                      extended by gate.creole.gazetteer.DefaultGazetteer
                          extended by at.ofai.gate.OFAIListGazetteer
All Implemented Interfaces:
gate.creole.ANNIEConstants, gate.creole.CustomDuplication, gate.creole.gazetteer.Gazetteer, gate.Executable, gate.LanguageAnalyser, gate.ProcessingResource, gate.Resource, gate.util.FeatureBearer, gate.util.NameBearer, java.io.Serializable

public class OFAIListGazetteer
extends gate.creole.gazetteer.DefaultGazetteer

This is a modified version of the GATE DefaultGazetteer class. In addition to the functionality of DefaultGazetteer, this class can annotate the part of the document text that directly precedes a match, back to the nearest non-word character (the prefix), and the part that directly follows a match, up to the next non-word character (the suffix). For example, if the gazetteer list contains "someword", the text " thisissomewordindeed " can be annotated so that "thisis" gets annotated as "Lookup_prefix", "someword" gets annotated as "Lookup" and "indeed" gets annotated as "Lookup_suffix".
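The example above can be traced with a minimal, self-contained sketch. It is not this class's implementation: the class name AffixSpanSketch, the method spans, and the use of Character.isLetter as the word-character test are assumptions made only for illustration. It computes the character offsets that the three annotations would cover.

```java
final class AffixSpanSketch {

    /**
     * Returns {prefixStart, matchStart, matchEnd, suffixEnd} for the first
     * occurrence of entry in text, or null if the entry does not occur.
     * Assumption: a "word character" is anything Character.isLetter accepts.
     */
    static int[] spans(String text, String entry) {
        int matchStart = text.indexOf(entry);
        if (matchStart < 0) {
            return null;
        }
        int matchEnd = matchStart + entry.length();

        // Extend to the left while the preceding character is a word character.
        int prefixStart = matchStart;
        while (prefixStart > 0 && Character.isLetter(text.charAt(prefixStart - 1))) {
            prefixStart--;
        }
        // Extend to the right while the following character is a word character.
        int suffixEnd = matchEnd;
        while (suffixEnd < text.length() && Character.isLetter(text.charAt(suffixEnd))) {
            suffixEnd++;
        }
        return new int[] {prefixStart, matchStart, matchEnd, suffixEnd};
    }

    public static void main(String[] args) {
        String text = " thisissomewordindeed ";
        int[] s = spans(text, "someword");
        System.out.println("Lookup_prefix: " + text.substring(s[0], s[1]));
        System.out.println("Lookup:        " + text.substring(s[1], s[2]));
        System.out.println("Lookup_suffix: " + text.substring(s[2], s[3]));
    }
}
```

An empty prefix or suffix shows up as two equal offsets, which a caller could use to skip creating that annotation.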

It also gives more control over which characters mark the beginning or end of a match in the document.

Matching prefixes or suffixes only makes sense if the gazetteer matches parts of words, so prefix and suffix matching are deactivated if the wholeWordsOnly parameter is true.

Prefix and suffix annotations can be switched on and off separately with the corresponding boolean parameters prefixAnnotations and suffixAnnotations.
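The interaction between wholeWordsOnly and the two switches can be summarised in a small sketch. The class and method names below are hypothetical, and the logic only restates the two preceding paragraphs; it is not taken from the class's source.

```java
final class AffixSwitchSketch {

    /** Prefix annotations are produced only for partial-word matching with the switch on. */
    static boolean prefixActive(boolean wholeWordsOnly, boolean prefixAnnotations) {
        return !wholeWordsOnly && prefixAnnotations;
    }

    /** Suffix annotations follow the same rule with their own switch. */
    static boolean suffixActive(boolean wholeWordsOnly, boolean suffixAnnotations) {
        return !wholeWordsOnly && suffixAnnotations;
    }

    public static void main(String[] args) {
        // wholeWordsOnly = true overrides both switches.
        System.out.println(prefixActive(true, true));
        // With partial-word matching, each switch is honoured independently.
        System.out.println(prefixActive(false, true));
        System.out.println(suffixActive(false, false));
    }
}
```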

Prefix and suffix annotations are created exactly as the corresponding lookup annotations, but have major annotation type "Lookup_prefix" and "Lookup_suffix" respectively. All features from the "Lookup" annotation are copied over to any corresponding "Lookup_prefix" or "Lookup_suffix" annotation.

Suffix annotations always include the "string" feature that contains the string of the suffix.

Lookup_prefix and Lookup annotations include the "string" feature only if the parameter includeStrings is set to true.
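The feature rules described above can be sketched with plain maps standing in for GATE feature maps. This is an illustration under assumptions: the class AffixFeatureSketch and its methods are hypothetical, and the "majorType" entry is only a sample feature value.

```java
import java.util.HashMap;
import java.util.Map;

final class AffixFeatureSketch {

    /** Features for a Lookup or Lookup_prefix annotation: "string" is optional. */
    static Map<String, Object> lookupOrPrefix(Map<String, Object> lookupFeatures,
                                              String coveredText,
                                              boolean includeStrings) {
        // Start from a copy of the features that came with the gazetteer entry.
        Map<String, Object> copy = new HashMap<>(lookupFeatures);
        if (includeStrings) {
            copy.put("string", coveredText);
        }
        return copy;
    }

    /** Features for a Lookup_suffix annotation: "string" is always present. */
    static Map<String, Object> suffix(Map<String, Object> lookupFeatures, String suffixText) {
        Map<String, Object> copy = new HashMap<>(lookupFeatures);
        copy.put("string", suffixText);
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> fromList = new HashMap<>();
        fromList.put("majorType", "example"); // illustrative value only
        System.out.println(lookupOrPrefix(fromList, "thisis", false));
        System.out.println(lookupOrPrefix(fromList, "thisis", true));
        System.out.println(suffix(fromList, "indeed"));
    }
}
```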

In addition to the features provided by the GATE Default gazetteer, the Lookup_prefix and Lookup annotations also include these features:

All parameters except encoding, gazetteerFeatureSeparator, and listsURL, are defined to be runtime parameters so it is much easier to change them during debugging without the need to re-create the processing resource.

The parameters wordChars, wordBoundaryChars and wordCharsClass influence whether characters before or after a matching string in the document are regarded as word separators or not. These parameters do not influence how any of the characters within the string that occurs in the gazetteer list are handled: these characters always have to match exactly as they occur in the gazetteer list.

Whitespace will ALWAYS be interpreted as a word boundary. Combining spacing marks and non-spacing marks will always be interpreted as part of a word.

The word characters as defined here only influence how the characters *outside* the actual gazetteer match are processed, i.e. how suffixes and prefixes are found. That means that an entry in a gazetteer list can contain non-word characters and still match. For example, "worda wordb" will match "theworda wordbs" even though the space is a non-word character, and will generate Lookup_prefix.string = "the" and Lookup_suffix.string = "s".
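A hedged sketch of this behaviour, with the word-character test passed in as a predicate: the entry is compared literally, and the predicate is consulted only for the characters outside the match. The class WordCharSketch, the method split, and the choice of predicates are assumptions for illustration, not the class's actual parameter handling.

```java
import java.util.function.IntPredicate;

final class WordCharSketch {

    /** Returns {prefix, match, suffix}, or null if the entry does not occur. */
    static String[] split(String text, String entry, IntPredicate isWordChar) {
        // The entry itself is matched literally, non-word characters included.
        int start = text.indexOf(entry);
        if (start < 0) {
            return null;
        }
        int end = start + entry.length();

        // Only characters outside the match are tested against the predicate.
        int prefixStart = start;
        while (prefixStart > 0 && isWordChar.test(text.charAt(prefixStart - 1))) {
            prefixStart--;
        }
        int suffixEnd = end;
        while (suffixEnd < text.length() && isWordChar.test(text.charAt(suffixEnd))) {
            suffixEnd++;
        }
        return new String[] {
            text.substring(prefixStart, start),
            text.substring(start, end),
            text.substring(end, suffixEnd)
        };
    }

    public static void main(String[] args) {
        // Letters only: the space inside the entry does not prevent the match.
        String[] a = split("theworda wordbs", "worda wordb", Character::isLetter);
        System.out.println(String.join("|", a));

        // Treating '-' as a word character as well changes only the prefix.
        IntPredicate lettersAndHyphen = c -> Character.isLetter(c) || c == '-';
        String[] b = split("x-theworda wordbs", "worda wordb", lettersAndHyphen);
        System.out.println(String.join("|", b));
    }
}
```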

NOTE: features like "firstcharUpper" are set to "true" or "false" as strings, not booleans.
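A consequence of this note is that client code should compare such feature values against the string "true" rather than against Boolean.TRUE. The sketch below uses a plain map in place of a GATE feature map; the class StringFlagSketch and the helper isTrue are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class StringFlagSketch {

    /** True only if the feature is present and holds the string "true". */
    static boolean isTrue(Map<String, Object> features, String name) {
        return "true".equals(features.get(name));
    }

    public static void main(String[] args) {
        Map<String, Object> features = new HashMap<>();
        features.put("firstcharUpper", "true"); // stored as a String, as the note says

        // A Boolean comparison silently fails on a string-valued feature.
        System.out.println(Boolean.TRUE.equals(features.get("firstcharUpper")));
        // The string comparison gives the intended answer.
        System.out.println(isTrue(features, "firstcharUpper"));
    }
}
```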

Author:
Valentin Tablan, Borislav Popov, Johann Petrak
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class gate.creole.gazetteer.DefaultGazetteer
gate.creole.gazetteer.DefaultGazetteer.CharMap, gate.creole.gazetteer.DefaultGazetteer.Iter
 
Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource
gate.creole.AbstractProcessingResource.InternalStatusListener, gate.creole.AbstractProcessingResource.IntervalProgressListener
 
Field Summary
 
Fields inherited from class gate.creole.gazetteer.DefaultGazetteer
DEF_GAZ_ANNOT_SET_PARAMETER_NAME, DEF_GAZ_CASE_SENSITIVE_PARAMETER_NAME, DEF_GAZ_DOCUMENT_PARAMETER_NAME, DEF_GAZ_ENCODING_PARAMETER_NAME, DEF_GAZ_FEATURE_SEPARATOR_PARAMETER_NAME, DEF_GAZ_LISTS_URL_PARAMETER_NAME, DEF_GAZ_LONGEST_MATCH_ONLY_PARAMETER_NAME, gazetteerFeatureSeparator, initialState, listsByNode
 
Fields inherited from class gate.creole.gazetteer.AbstractGazetteer
annotationSetName, caseSensitive, definition, encoding, features, listeners, listsURL, longestMatchOnly, mappingDefinition, wholeWordsOnly
 
Fields inherited from class gate.creole.AbstractLanguageAnalyser
corpus, document
 
Fields inherited from class gate.creole.AbstractProcessingResource
interrupted
 
Fields inherited from class gate.creole.AbstractResource
name
 
Fields inherited from interface gate.creole.ANNIEConstants
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DEFAULT_FILE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_INSTANCE_FEATURE_NAME, LOOKUP_LANGUAGE_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PLUGIN_DIR, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
 
Constructor Summary
OFAIListGazetteer()
          Builds a gazetteer using the default lists from the GATE resources.
 
Method Summary
 void execute()
          This method runs the gazetteer.
 java.lang.Boolean getIncludeStrings()
           
 java.lang.Boolean getPrefixAnnotations()
           
 java.lang.Boolean getSuffixAnnotations()
           
 java.lang.String getWordBoundaryChars()
           
 java.lang.String getWordChars()
           
 java.lang.Integer getWordCharsClass()
           
 gate.Resource init()
           
 boolean isWithinWord(char ch)
          Tests whether a character is internal to a word (i.e. if it's a letter or a combining mark (spacing or not)).
 void setIncludeStrings(java.lang.Boolean newIncludeStrings)
           
 void setPrefixAnnotations(java.lang.Boolean newPrefixAnnotations)
           
 void setSuffixAnnotations(java.lang.Boolean newSuffixAnnotations)
           
 void setWordBoundaryChars(java.lang.String newWordBoundaryChars)
           
 void setWordChars(java.lang.String newWordChars)
           
 void setWordCharsClass(java.lang.Integer newWordCharsClass)
           
 
Methods inherited from class gate.creole.gazetteer.DefaultGazetteer
add, addLookup, createLookups, duplicate, getFSMgml, getGazetteerFeatureSeparator, isWordInternal, lookup, readList, remove, removeLookup, setGazetteerFeatureSeparator
 
Methods inherited from class gate.creole.gazetteer.AbstractGazetteer
addGazetteerListener, fireGazetteerEvent, getAnnotationSetName, getCaseSensitive, getEncoding, getFeatures, getLinearDefinition, getListsURL, getLongestMatchOnly, getMappingDefinition, getWholeWordsOnly, reInit, setAnnotationSetName, setCaseSensitive, setEncoding, setFeatures, setListsURL, setLongestMatchOnly, setMappingDefinition, setWholeWordsOnly
 
Methods inherited from class gate.creole.AbstractLanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from class gate.creole.AbstractProcessingResource
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, removeProgressListener, removeStatusListener
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 
Methods inherited from interface gate.Executable
interrupt, isInterrupted
 

Constructor Detail

OFAIListGazetteer

public OFAIListGazetteer()
Builds a gazetteer using the default lists from the GATE resources.

Method Detail

getPrefixAnnotations

public java.lang.Boolean getPrefixAnnotations()

setPrefixAnnotations

public void setPrefixAnnotations(java.lang.Boolean newPrefixAnnotations)

getSuffixAnnotations

public java.lang.Boolean getSuffixAnnotations()

setSuffixAnnotations

public void setSuffixAnnotations(java.lang.Boolean newSuffixAnnotations)

getIncludeStrings

public java.lang.Boolean getIncludeStrings()

setIncludeStrings

public void setIncludeStrings(java.lang.Boolean newIncludeStrings)

getWordCharsClass

public java.lang.Integer getWordCharsClass()

setWordCharsClass

public void setWordCharsClass(java.lang.Integer newWordCharsClass)

getWordBoundaryChars

public java.lang.String getWordBoundaryChars()

setWordBoundaryChars

public void setWordBoundaryChars(java.lang.String newWordBoundaryChars)

getWordChars

public java.lang.String getWordChars()

setWordChars

public void setWordChars(java.lang.String newWordChars)

init

public gate.Resource init()
                   throws gate.creole.ResourceInstantiationException
Specified by:
init in interface gate.Resource
Overrides:
init in class gate.creole.gazetteer.DefaultGazetteer
Throws:
gate.creole.ResourceInstantiationException

isWithinWord

public boolean isWithinWord(char ch)
Tests whether a character is internal to a word (i.e. if it's a letter or a combining mark (spacing or not)).

Parameters:
ch - the character to be tested
Returns:
true if the character is internal to a word, false otherwise
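The test described here (a letter, or a spacing or non-spacing combining mark) can be expressed with java.lang.Character alone. The following is a sketch of that description, not the method's actual source, and it ignores any effect the wordChars, wordBoundaryChars or wordCharsClass parameters may have. The class name WithinWordSketch is hypothetical.

```java
final class WithinWordSketch {

    /** Letter, non-spacing mark (Mn) or combining spacing mark (Mc). */
    static boolean isWithinWord(char ch) {
        int type = Character.getType(ch);
        return Character.isLetter(ch)
            || type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    public static void main(String[] args) {
        System.out.println(isWithinWord('a'));      // a letter
        System.out.println(isWithinWord('\u0301')); // combining acute accent
        System.out.println(isWithinWord(' '));      // whitespace
        System.out.println(isWithinWord('-'));      // punctuation
    }
}
```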

execute

public void execute()
             throws gate.creole.ExecutionException
This method runs the gazetteer. It assumes that all the needed parameters are set. If they are not, an exception will be fired.

Specified by:
execute in interface gate.Executable
Overrides:
execute in class gate.creole.gazetteer.DefaultGazetteer
Throws:
gate.creole.ExecutionException