at.ofai.gate
Class OFAIListGazetteer
java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractProcessingResource
gate.creole.AbstractLanguageAnalyser
gate.creole.gazetteer.AbstractGazetteer
gate.creole.gazetteer.DefaultGazetteer
at.ofai.gate.OFAIListGazetteer
- All Implemented Interfaces:
- gate.creole.ANNIEConstants, gate.creole.CustomDuplication, gate.creole.gazetteer.Gazetteer, gate.Executable, gate.LanguageAnalyser, gate.ProcessingResource, gate.Resource, gate.util.FeatureBearer, gate.util.NameBearer, java.io.Serializable
public class OFAIListGazetteer
- extends gate.creole.gazetteer.DefaultGazetteer
This is a modified version of the GATE DefaultGazetteer class.
In addition to the functionality of DefaultGazetteer,
this class can also provide annotations
for the part of a gazetteer string that directly precedes a match but follows
any non-word character (the prefix)
and the part of a gazetteer string that directly follows a match
but precedes any non-word character (the suffix). That means, if the
gazetteer list contains "someword" it can annotate " thisissomewordindeed "
so that "thisis" gets annotated as "Lookup_prefix", "someword" gets
annotated as "Lookup" and "indeed" gets annotated as "Lookup_suffix".
It also gives more control over which characters mark the beginning or end
of a match in the document.
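The prefix and suffix behaviour described above can be illustrated with a small, self-contained sketch. This is not the actual OFAIListGazetteer code: the class PrefixSuffixSketch and its methods are hypothetical, and it assumes that words consist of letters only (comparable to wordCharsClass = 1). Starting from a match, it scans left and right until it reaches a non-word character.

```java
public class PrefixSuffixSketch {

    // Assumption for this sketch: only letters count as word characters.
    static boolean isWordChar(char ch) {
        return Character.isLetter(ch);
    }

    /**
     * Returns {prefix, match, suffix} for a match at [matchStart, matchEnd).
     * Hypothetical helper, not part of the GATE or OFAI API.
     */
    static String[] split(String text, int matchStart, int matchEnd) {
        // Walk left from the match while the preceding character is a word character.
        int prefixStart = matchStart;
        while (prefixStart > 0 && isWordChar(text.charAt(prefixStart - 1))) {
            prefixStart--;
        }
        // Walk right from the match while the following character is a word character.
        int suffixEnd = matchEnd;
        while (suffixEnd < text.length() && isWordChar(text.charAt(suffixEnd))) {
            suffixEnd++;
        }
        return new String[] {
            text.substring(prefixStart, matchStart),
            text.substring(matchStart, matchEnd),
            text.substring(matchEnd, suffixEnd)
        };
    }

    public static void main(String[] args) {
        String text = " thisissomewordindeed ";
        String entry = "someword";
        int start = text.indexOf(entry);
        String[] spans = split(text, start, start + entry.length());
        System.out.println("Lookup_prefix: " + spans[0]); // thisis
        System.out.println("Lookup:        " + spans[1]); // someword
        System.out.println("Lookup_suffix: " + spans[2]); // indeed
    }
}
```

The sketch only looks at characters outside the match; the characters inside the match are taken as given, which mirrors the way the gazetteer entry itself must match exactly.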
Matching prefixes or suffixes only makes
sense if the gazetteer matches parts of words,
so prefix and suffix matching are deactivated if the wholeWordsOnly
parameter is true.
Prefix and suffix annotation can be switched on and off separately by
the corresponding boolean parameters prefixAnnotations and suffixAnnotations.
Prefix and suffix annotations are created exactly as the corresponding lookup
annotations, but have major annotation type "Lookup_prefix" and "Lookup_suffix"
respectively. All features from the "Lookup" annotation are copied over
to any corresponding "Lookup_prefix" or "Lookup_suffix" annotation.
Suffix annotations always include the "string" feature that contains
the string of the suffix.
Lookup_prefix and Lookup annotations include the string feature only
if the parameter includeStrings is set to true.
In addition to the features provided by the GATE DefaultGazetteer, the
Lookup_prefix and Lookup annotations also include these features:
- firstcharUpper which is true if the first letter of the corresponding
string is upper case (for Lookup_prefix and Lookup)
- atEnd which is true if a Lookup annotation matched at the end of a
word according to the current word definition (for Lookup).
- atBeginning which is true if a Lookup annotation matched at the
beginning of a word according to the current word definition.
All parameters except encoding, gazetteerFeatureSeparator, and listsURL
are defined as runtime parameters, so they can be changed during
debugging without re-creating the processing resource.
The following parameters influence whether characters before or after
a matching string in the document are regarded as word separators or not.
These parameters do not influence how any of the characters within the
string that occurs in the gazetteer list are handled: these characters
always have to match exactly as they occur in the gazetteer list.
parameters:
- wordCharsClass:
- 0 - whitespace: gazetteer strings are delimited by whitespace characters. If
matching with wholeWordsOnly=true, only strings that are preceded and
followed by whitespace characters are considered "whole words". Note that
whitespace that is part of the gazetteer list entry is still part
of the string that must be matched successfully.
- 1 - letters: words only consist of letters (Unicode class). Everything
else (including digits or special characters) is interpreted
as a word boundary.
- 2 - digits: words only consist of digits (Unicode class). Everything
else (including letters and special characters) is interpreted
as a word boundary.
- 3 - letterOrDigit: words consist of digits or letters.
- wordChars: a string made up of additional characters that should be
accepted for words. Whitespace will be removed.
You might want to add e.g. a hyphen here.
- wordBoundaryChars: a string made up of additional characters that
should be interpreted as word boundaries. Whitespace will be removed.
Whitespace will ALWAYS be interpreted as a word boundary; combining spacing
marks and non-spacing marks will always be interpreted as part of a word.
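The rules above can be summarised as a single character test. The following is a minimal sketch, not the class's actual isWithinWord implementation: the class WordCharSketch, the method isWordChar and its extra arguments are hypothetical. It also assumes that wordBoundaryChars takes precedence over wordChars when a character appears in both, which the description above does not specify.

```java
public class WordCharSketch {

    // Values of the wordCharsClass parameter as listed above.
    static final int WHITESPACE = 0;
    static final int LETTERS = 1;
    static final int DIGITS = 2;
    static final int LETTER_OR_DIGIT = 3;

    /** Hypothetical helper: decides whether ch counts as part of a word. */
    static boolean isWordChar(char ch, int wordCharsClass,
                              String wordChars, String wordBoundaryChars) {
        // Whitespace is always a word boundary, whatever the other settings say.
        if (Character.isWhitespace(ch)) {
            return false;
        }
        // Combining spacing marks and non-spacing marks always belong to a word.
        int type = Character.getType(ch);
        if (type == Character.COMBINING_SPACING_MARK
                || type == Character.NON_SPACING_MARK) {
            return true;
        }
        // Explicit overrides (assumption: boundary characters win over word characters).
        if (wordBoundaryChars.indexOf(ch) >= 0) {
            return false;
        }
        if (wordChars.indexOf(ch) >= 0) {
            return true;
        }
        // Otherwise fall back to the selected character class.
        switch (wordCharsClass) {
            case WHITESPACE:
                return true; // any non-whitespace character is part of a word
            case LETTERS:
                return Character.isLetter(ch);
            case DIGITS:
                return Character.isDigit(ch);
            case LETTER_OR_DIGIT:
                return Character.isLetterOrDigit(ch);
            default:
                throw new IllegalArgumentException(
                        "unknown wordCharsClass: " + wordCharsClass);
        }
    }

    public static void main(String[] args) {
        // With letters only, a hyphen is a boundary unless it is added to wordChars.
        System.out.println(isWordChar('-', LETTERS, "", ""));  // false
        System.out.println(isWordChar('-', LETTERS, "-", "")); // true
    }
}
```

Because whitespace is tested first, whitespace inside wordChars or wordBoundaryChars has no effect in this sketch, which is consistent with the statement that whitespace is removed from both parameters.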
The word characters as defined here will only influence how the characters
*outside* the actual gazetteer match will be processed, i.e. how
suffixes and prefixes are found. That means that an entry in a gazetteer
list can contain non-word characters and still match, e.g.
"worda wordb" will match "theworda wordbs" even though the space is a
non-word character and will generate Lookup_prefix.string = "the" and
Lookup_suffix.string = "s".
NOTE: features like "firstcharUpper" are set to "true" or "false" as
strings, not booleans.
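Because these features are stored as strings, a comparison against Boolean.TRUE will not behave as one might expect. The sketch below shows one way to read such a feature safely. It uses a plain java.util.Map as a stand-in for the annotation's feature map, and the helper name flag is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class StringFeatureSketch {

    /**
     * Hypothetical helper: reads a feature that holds "true" or "false"
     * as a string and returns it as a boolean. A missing feature is false.
     */
    static boolean flag(Map<Object, Object> features, String name) {
        Object value = features.get(name);
        return value != null && Boolean.parseBoolean(value.toString());
    }

    public static void main(String[] args) {
        // Stand-in for the features of a Lookup annotation.
        Map<Object, Object> features = new HashMap<>();
        features.put("firstcharUpper", "true");
        features.put("atEnd", "false");

        // Comparing with Boolean.TRUE fails because the value is a String.
        System.out.println(Boolean.TRUE.equals(features.get("firstcharUpper"))); // false

        // Parsing the string gives the intended result.
        System.out.println(flag(features, "firstcharUpper")); // true
        System.out.println(flag(features, "atEnd"));          // false
    }
}
```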
- Author:
- Valentin Tablan, Borislav Popov, Johann Petrak
- See Also:
- Serialized Form
Nested classes/interfaces inherited from class gate.creole.gazetteer.DefaultGazetteer |
gate.creole.gazetteer.DefaultGazetteer.CharMap, gate.creole.gazetteer.DefaultGazetteer.Iter |
Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource |
gate.creole.AbstractProcessingResource.InternalStatusListener, gate.creole.AbstractProcessingResource.IntervalProgressListener |
Fields inherited from class gate.creole.gazetteer.DefaultGazetteer |
DEF_GAZ_ANNOT_SET_PARAMETER_NAME, DEF_GAZ_CASE_SENSITIVE_PARAMETER_NAME, DEF_GAZ_DOCUMENT_PARAMETER_NAME, DEF_GAZ_ENCODING_PARAMETER_NAME, DEF_GAZ_FEATURE_SEPARATOR_PARAMETER_NAME, DEF_GAZ_LISTS_URL_PARAMETER_NAME, DEF_GAZ_LONGEST_MATCH_ONLY_PARAMETER_NAME, gazetteerFeatureSeparator, initialState, listsByNode |
Fields inherited from class gate.creole.gazetteer.AbstractGazetteer |
annotationSetName, caseSensitive, definition, encoding, features, listeners, listsURL, longestMatchOnly, mappingDefinition, wholeWordsOnly |
Fields inherited from class gate.creole.AbstractLanguageAnalyser |
corpus, document |
Fields inherited from class gate.creole.AbstractProcessingResource |
interrupted |
Fields inherited from class gate.creole.AbstractResource |
name |
Fields inherited from interface gate.creole.ANNIEConstants |
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DEFAULT_FILE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_INSTANCE_FEATURE_NAME, LOOKUP_LANGUAGE_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PLUGIN_DIR, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME |
Constructor Summary |
OFAIListGazetteer()
Build a gazetteer using the default lists from the GATE resources |
Methods inherited from class gate.creole.gazetteer.DefaultGazetteer |
add, addLookup, createLookups, duplicate, getFSMgml, getGazetteerFeatureSeparator, isWordInternal, lookup, readList, remove, removeLookup, setGazetteerFeatureSeparator |
Methods inherited from class gate.creole.gazetteer.AbstractGazetteer |
addGazetteerListener, fireGazetteerEvent, getAnnotationSetName, getCaseSensitive, getEncoding, getFeatures, getLinearDefinition, getListsURL, getLongestMatchOnly, getMappingDefinition, getWholeWordsOnly, reInit, setAnnotationSetName, setCaseSensitive, setEncoding, setFeatures, setListsURL, setLongestMatchOnly, setMappingDefinition, setWholeWordsOnly |
Methods inherited from class gate.creole.AbstractLanguageAnalyser |
getCorpus, getDocument, setCorpus, setDocument |
Methods inherited from class gate.creole.AbstractProcessingResource |
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, removeProgressListener, removeStatusListener |
Methods inherited from class gate.creole.AbstractResource |
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface gate.LanguageAnalyser |
getCorpus, getDocument, setCorpus, setDocument |
Methods inherited from interface gate.Resource |
cleanup, getParameterValue, setParameterValue, setParameterValues |
Methods inherited from interface gate.util.NameBearer |
getName, setName |
Methods inherited from interface gate.Executable |
interrupt, isInterrupted |
OFAIListGazetteer
public OFAIListGazetteer()
- Build a gazetteer using the default lists from the GATE resources
getPrefixAnnotations
public java.lang.Boolean getPrefixAnnotations()
setPrefixAnnotations
public void setPrefixAnnotations(java.lang.Boolean newPrefixAnnotations)
getSuffixAnnotations
public java.lang.Boolean getSuffixAnnotations()
setSuffixAnnotations
public void setSuffixAnnotations(java.lang.Boolean newSuffixAnnotations)
getIncludeStrings
public java.lang.Boolean getIncludeStrings()
setIncludeStrings
public void setIncludeStrings(java.lang.Boolean newIncludeStrings)
getWordCharsClass
public java.lang.Integer getWordCharsClass()
setWordCharsClass
public void setWordCharsClass(java.lang.Integer newWordCharsClass)
getWordBoundaryChars
public java.lang.String getWordBoundaryChars()
setWordBoundaryChars
public void setWordBoundaryChars(java.lang.String newWordBoundaryChars)
getWordChars
public java.lang.String getWordChars()
setWordChars
public void setWordChars(java.lang.String newWordChars)
init
public gate.Resource init()
throws gate.creole.ResourceInstantiationException
- Specified by:
init
in interface gate.Resource
- Overrides:
init
in class gate.creole.gazetteer.DefaultGazetteer
- Throws:
gate.creole.ResourceInstantiationException
isWithinWord
public boolean isWithinWord(char ch)
- Tests whether a character is internal to a word (i.e. if it's a letter or
a combining mark (spacing or not)).
- Parameters:
ch
- the character to be tested
- Returns:
- a boolean value
execute
public void execute()
throws gate.creole.ExecutionException
- This method runs the gazetteer. It assumes that all the needed parameters
are set. If they are not, an exception will be thrown.
- Specified by:
execute
in interface gate.Executable
- Overrides:
execute
in class gate.creole.gazetteer.DefaultGazetteer
- Throws:
gate.creole.ExecutionException