at.ofai.gate.extendedgazetteer
Class ExtendedGazetteer

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractProcessingResource
              extended by gate.creole.AbstractLanguageAnalyser
                  extended by gate.creole.gazetteer.AbstractGazetteer
                      extended by gate.creole.gazetteer.DefaultGazetteer
                          extended by at.ofai.gate.extendedgazetteer.ExtendedGazetteer
All Implemented Interfaces:
gate.creole.ANNIEConstants, gate.creole.CustomDuplication, gate.creole.gazetteer.Gazetteer, gate.Executable, gate.LanguageAnalyser, gate.ProcessingResource, gate.Resource, gate.util.FeatureBearer, gate.util.NameBearer, java.io.Serializable

@CreoleResource(name="ExtendedGazetteerCompat",
                comment="An extension of the GATE DefaultGazetteer that can handle prefixes and suffixes")
public class ExtendedGazetteer
extends gate.creole.gazetteer.DefaultGazetteer

This is a modified version of the GATE DefaultGazetteer class. In addition to the functionality of DefaultGazetteer, this class can also annotate the part of a word that directly precedes a match but follows any non-word character (the prefix) and the part that directly follows a match but precedes any non-word character (the suffix). For example, if the gazetteer list contains "someword", it can annotate " thisissomewordindeed " so that "thisis" gets annotated as "Lookup_prefix", "someword" gets annotated as "Lookup" and "indeed" gets annotated as "Lookup_suffix".
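The prefix/suffix idea described above can be illustrated with a minimal, self-contained sketch. It is not the ExtendedGazetteer implementation: the class AffixSketch, its methods, and the choice of "letter or digit" as the word-character test are all assumptions made for illustration only.

```java
/** Illustrative sketch only; not the ExtendedGazetteer implementation. */
final class AffixSketch {

    // Assumption: a word character is a letter or digit. The real class makes
    // this configurable through its word-character parameters.
    static boolean isWordChar(char ch) {
        return Character.isLetterOrDigit(ch);
    }

    /**
     * Finds the first occurrence of entry in text and returns
     * {prefix, match, suffix}, or null if the entry does not occur.
     */
    static String[] split(String text, String entry) {
        int start = text.indexOf(entry);
        if (start < 0) {
            return null;
        }
        int end = start + entry.length();

        // Walk left from the match until a non-word character is reached.
        int prefixStart = start;
        while (prefixStart > 0 && isWordChar(text.charAt(prefixStart - 1))) {
            prefixStart--;
        }
        // Walk right from the match until a non-word character is reached.
        int suffixEnd = end;
        while (suffixEnd < text.length() && isWordChar(text.charAt(suffixEnd))) {
            suffixEnd++;
        }
        return new String[] {
            text.substring(prefixStart, start),
            text.substring(start, end),
            text.substring(end, suffixEnd)
        };
    }

    public static void main(String[] args) {
        String[] parts = split(" thisissomewordindeed ", "someword");
        System.out.println("Lookup_prefix: " + parts[0]); // thisis
        System.out.println("Lookup:        " + parts[1]); // someword
        System.out.println("Lookup_suffix: " + parts[2]); // indeed
    }
}
```

Only the characters outside the match are tested with isWordChar; the entry itself is matched verbatim, so an entry containing a space can still produce a prefix and a suffix.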

It also gives more control over which characters mark the beginning or end of a match in the document.

Matching prefixes or suffixes only makes sense if the gazetteer matches parts of words, so prefix and suffix matching are deactivated if the wholeWordsOnly parameter is true.

Prefix and suffix annotations can be switched on and off separately with the corresponding boolean parameters prefixAnnotations and suffixAnnotations.

Prefix and suffix annotations are created exactly as the corresponding lookup annotations, but have major annotation type "Lookup_prefix" and "Lookup_suffix" respectively. All features from the "Lookup" annotation are copied over to any corresponding "Lookup_prefix" or "Lookup_suffix" annotation.

Suffix annotations always include the "string" feature that contains the string of the suffix.

Lookup_prefix and Lookup annotations include the string feature only if the parameter includeStrings is set to true.

In addition to the features provided by the GATE Default gazetteer, the Lookup_prefix and Lookup annotations also include these features:

All parameters except encoding, gazetteerFeatureSeparator and listsURL are defined as runtime parameters, so they can be changed during debugging without re-creating the processing resource.

The following parameters influence whether characters before or after a matching string in the document are regarded as word separators: wordCharsClass, wordBoundaryChars and wordChars. They do not influence how the characters within the string that occurs in the gazetteer list are handled: those characters always have to match exactly as they occur in the gazetteer list.

Whitespace will ALWAYS be interpreted as a word boundary. Combining spacing marks and non-spacing marks will always be interpreted as part of a word.
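The two fixed rules above can be sketched as a character test in which whitespace and combining marks are decided first, before any configurable character sets are consulted. This is a hedged illustration, not the class's actual code: WordCharSketch and its method are hypothetical, the two string arguments merely stand in for additional word and boundary characters, and both the "letter or digit" fallback and the precedence of boundary characters over word characters are assumptions.

```java
/** Illustrative sketch only; not the ExtendedGazetteer implementation. */
final class WordCharSketch {

    /**
     * Decides whether ch counts as part of a word.
     *
     * @param extraWordChars     characters additionally treated as word characters
     * @param extraBoundaryChars characters additionally treated as word boundaries
     */
    static boolean isWordChar(char ch, String extraWordChars, String extraBoundaryChars) {
        // Rule from the description: whitespace is always a word boundary.
        if (Character.isWhitespace(ch)) {
            return false;
        }
        // Rule from the description: combining marks are always part of a word.
        int type = Character.getType(ch);
        if (type == Character.NON_SPACING_MARK || type == Character.COMBINING_SPACING_MARK) {
            return true;
        }
        // Assumption: explicit boundary characters win over explicit word characters.
        if (extraBoundaryChars.indexOf(ch) >= 0) {
            return false;
        }
        if (extraWordChars.indexOf(ch) >= 0) {
            return true;
        }
        // Assumption: fall back to letters and digits as the base character class.
        return Character.isLetterOrDigit(ch);
    }

    public static void main(String[] args) {
        System.out.println(isWordChar('-', "-", ""));       // true: listed as word character
        System.out.println(isWordChar(' ', " ", ""));       // false: whitespace is always a boundary
        System.out.println(isWordChar('\u0301', "", "\u0301")); // true: combining mark is always in-word
    }
}
```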

The word characters as defined here will only influence how the characters *outside* the actual gazetteer match will be processed, i.e. how suffixes and prefixes are found. That means that an entry in a gazetteer list can contain non-word characters and still match, e.g. "worda wordb" will match "theworda wordbs" even though the space is a non-word character and will generate Lookup_prefix.string = "the" and Lookup_suffix.string = "s".

NOTE: features like "firstcharUpper" are set to "true" or "false" as strings, not booleans.
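Because such features hold the strings "true" or "false", a comparison against Boolean.TRUE silently fails. The sketch below shows the difference; a plain java.util.Map stands in for an annotation's feature map, and the class StringFlagSketch and its helper isSet are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; a plain Map stands in for an annotation's features. */
final class StringFlagSketch {

    // Treats the feature as set only if its value is the String "true".
    // A missing feature or any other value yields false.
    static boolean isSet(Map<Object, Object> features, String name) {
        return "true".equals(features.get(name));
    }

    public static void main(String[] args) {
        Map<Object, Object> features = new HashMap<>();
        features.put("firstcharUpper", "true"); // stored as a String, not a Boolean

        // Comparing against a Boolean does not match the String value.
        System.out.println(Boolean.TRUE.equals(features.get("firstcharUpper"))); // false

        // Comparing against the String does.
        System.out.println(isSet(features, "firstcharUpper")); // true
    }
}
```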

Author:
Valentin Tablan, Borislav Popov, Johann Petrak
See Also:
Serialized Form

Nested Class Summary
protected  class ExtendedGazetteer.FSM
           
 
Nested classes/interfaces inherited from class gate.creole.gazetteer.DefaultGazetteer
gate.creole.gazetteer.DefaultGazetteer.CharMap, gate.creole.gazetteer.DefaultGazetteer.Iter
 
Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource
gate.creole.AbstractProcessingResource.InternalStatusListener, gate.creole.AbstractProcessingResource.IntervalProgressListener
 
Field Summary
protected static java.util.Map<java.net.URL,ExtendedGazetteer.FSM> loadedGazetteers
           
protected  java.lang.Boolean memorySavingMode
           
protected  java.lang.String unescapedSeparator
           
 
Fields inherited from class gate.creole.gazetteer.DefaultGazetteer
DEF_GAZ_ANNOT_SET_PARAMETER_NAME, DEF_GAZ_CASE_SENSITIVE_PARAMETER_NAME, DEF_GAZ_DOCUMENT_PARAMETER_NAME, DEF_GAZ_ENCODING_PARAMETER_NAME, DEF_GAZ_FEATURE_SEPARATOR_PARAMETER_NAME, DEF_GAZ_LISTS_URL_PARAMETER_NAME, DEF_GAZ_LONGEST_MATCH_ONLY_PARAMETER_NAME, fsmStates, gazetteerFeatureSeparator, initialState, listsByNode
 
Fields inherited from class gate.creole.gazetteer.AbstractGazetteer
annotationSetName, caseSensitive, definition, encoding, features, listeners, listsURL, longestMatchOnly, mappingDefinition, wholeWordsOnly
 
Fields inherited from class gate.creole.AbstractLanguageAnalyser
corpus, document
 
Fields inherited from class gate.creole.AbstractProcessingResource
interrupted
 
Fields inherited from class gate.creole.AbstractResource
name
 
Fields inherited from interface gate.creole.ANNIEConstants
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DEFAULT_FILE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_INSTANCE_FEATURE_NAME, LOOKUP_LANGUAGE_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PLUGIN_DIR, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
 
Constructor Summary
ExtendedGazetteer()
          Builds a gazetteer using the default lists from the GATE resources.
 
Method Summary
 void cleanup()
           
 void execute()
          This method runs the gazetteer.
 java.lang.Boolean getIncludeStrings()
           
 java.lang.Boolean getMemorySavingMode()
           
 java.lang.Boolean getPrefixAnnotations()
           
 java.lang.Boolean getSuffixAnnotations()
           
 java.lang.String getWordBoundaryChars()
           
 java.lang.String getWordChars()
           
 WCClass getWordCharsClass()
           
 gate.Resource init()
           
 boolean isWithinWord(char ch)
          Tests whether a character is internal to a word (i.e. if it's a letter or a combining mark (spacing or not)).
protected  void loadData()
           
 void setIncludeStrings(java.lang.Boolean newIncludeStrings)
           
 void setMemorySavingMode(java.lang.Boolean yesno)
           
 void setPrefixAnnotations(java.lang.Boolean newPrefixAnnotations)
           
 void setSuffixAnnotations(java.lang.Boolean newSuffixAnnotations)
           
 void setWordBoundaryChars(java.lang.String newWordBoundaryChars)
           
 void setWordChars(java.lang.String newWordChars)
           
 void setWordCharsClass(WCClass newWordCharsClass)
           
 
Methods inherited from class gate.creole.gazetteer.DefaultGazetteer
add, addLookup, createLookups, duplicate, getFSMgml, getGazetteerFeatureSeparator, isWordInternal, lookup, readList, remove, removeLookup, setGazetteerFeatureSeparator
 
Methods inherited from class gate.creole.gazetteer.AbstractGazetteer
addGazetteerListener, fireGazetteerEvent, getAnnotationSetName, getCaseSensitive, getEncoding, getFeatures, getLinearDefinition, getListsURL, getLongestMatchOnly, getMappingDefinition, getWholeWordsOnly, reInit, setAnnotationSetName, setCaseSensitive, setEncoding, setFeatures, setListsURL, setLongestMatchOnly, setMappingDefinition, setWholeWordsOnly
 
Methods inherited from class gate.creole.AbstractLanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from class gate.creole.AbstractProcessingResource
addProgressListener, addStatusListener, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, removeProgressListener, removeStatusListener
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from interface gate.Resource
getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 
Methods inherited from interface gate.Executable
interrupt, isInterrupted
 

Field Detail

memorySavingMode

protected java.lang.Boolean memorySavingMode

unescapedSeparator

protected java.lang.String unescapedSeparator

loadedGazetteers

protected static java.util.Map<java.net.URL,ExtendedGazetteer.FSM> loadedGazetteers
Constructor Detail

ExtendedGazetteer

public ExtendedGazetteer()
Builds a gazetteer using the default lists from the GATE resources.

Method Detail

getPrefixAnnotations

public java.lang.Boolean getPrefixAnnotations()

setPrefixAnnotations

@CreoleParameter(comment="Generate prefix annotations?",
                 defaultValue="true")
@RunTime
public void setPrefixAnnotations(java.lang.Boolean newPrefixAnnotations)

getSuffixAnnotations

public java.lang.Boolean getSuffixAnnotations()

setSuffixAnnotations

@CreoleParameter(comment="Generate suffix annotations?",
                 defaultValue="true")
@RunTime
public void setSuffixAnnotations(java.lang.Boolean newSuffixAnnotations)

getIncludeStrings

public java.lang.Boolean getIncludeStrings()

setIncludeStrings

@CreoleParameter(comment="Include original string as feature?",
                 defaultValue="false")
@RunTime
public void setIncludeStrings(java.lang.Boolean newIncludeStrings)

getWordCharsClass

public WCClass getWordCharsClass()

setWordCharsClass

@CreoleParameter(comment="Types of characters that make up words?",
                 defaultValue="NONWHITESPACE")
@RunTime
public void setWordCharsClass(WCClass newWordCharsClass)

getWordBoundaryChars

public java.lang.String getWordBoundaryChars()

setWordBoundaryChars

@CreoleParameter(comment="Additional word boundary characters?",
                 defaultValue="")
@RunTime
public void setWordBoundaryChars(java.lang.String newWordBoundaryChars)

getWordChars

public java.lang.String getWordChars()

setWordChars

@CreoleParameter(comment="Additional word characters?",
                 defaultValue="-")
@RunTime
public void setWordChars(java.lang.String newWordChars)

setMemorySavingMode

@CreoleParameter(comment="Non-editable and memory saving mode, also enables FSM re-use for multiple copies",
                 defaultValue="true")
public void setMemorySavingMode(java.lang.Boolean yesno)

getMemorySavingMode

public java.lang.Boolean getMemorySavingMode()

init

public gate.Resource init()
                   throws gate.creole.ResourceInstantiationException
Specified by:
init in interface gate.Resource
Overrides:
init in class gate.creole.gazetteer.DefaultGazetteer
Throws:
gate.creole.ResourceInstantiationException

cleanup

public void cleanup()
Specified by:
cleanup in interface gate.Resource
Overrides:
cleanup in class gate.creole.AbstractProcessingResource

loadData

protected void loadData()
                 throws java.io.UnsupportedEncodingException,
                        java.io.IOException,
                        gate.creole.ResourceInstantiationException
Throws:
java.io.UnsupportedEncodingException
java.io.IOException
gate.creole.ResourceInstantiationException

isWithinWord

public boolean isWithinWord(char ch)
Tests whether a character is internal to a word (i.e. if it's a letter or a combining mark (spacing or not)).

Parameters:
ch - the character to be tested
Returns:
a boolean value

execute

public void execute()
             throws gate.creole.ExecutionException
This method runs the gazetteer. It assumes that all the needed parameters are set. If they are not, an exception will be thrown.

Specified by:
execute in interface gate.Executable
Overrides:
execute in class gate.creole.gazetteer.DefaultGazetteer
Throws:
gate.creole.ExecutionException