gate.treetagger2
Class TreeTaggerChunk
java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractProcessingResource
gate.creole.AbstractLanguageAnalyser
gate.treetagger2.TreeTaggerBase
gate.treetagger2.TreeTaggerChunk
- All Implemented Interfaces:
- gate.creole.ANNIEConstants, gate.Executable, gate.LanguageAnalyser, gate.ProcessingResource, gate.Resource, gate.util.FeatureBearer, gate.util.NameBearer, java.io.Serializable
public class TreeTaggerChunk
- extends TreeTaggerBase
- implements gate.ProcessingResource
This class is a wrapper for the language-independent POS tagger from the
University of Stuttgart, Germany.
This class is for the chunker function of TreeTagger.
This is a modified version of the plugin that comes with GATE version
3.1 and earlier. This modified version includes several changes
and enhancements over the original version:
- The user does not have to specify the full path to the script
that runs TreeTagger. Instead, the plugin derives the location from
the location where the plugin is stored.
- The script that runs TreeTagger rarely needs to be changed, since it
uses several ways of determining how to invoke the TreeTagger binaries
and where to locate the TreeTagger parameter files (see below).
- The plugin does not read back the temporary TreeTagger input file
and store its content in temporary features of the Token annotations as
the old version did, thus increasing CPU and memory efficiency a bit.
The only consistency check now performed compares the number of Tokens
with the number of lines read back from the TreeTagger output, which
should be sufficient.
- All plugin parameters are now runtime parameters so they can be
changed after creating the processing resource much more easily.
- The plugin now provides two processing resources: one for POS tagging
and one for chunking (provided a chunking model is available). This
class handles the chunker.
- The plugin allows an annotation type other than Token to be specified
as input to the TreeTagger program.
The following parameters are available for the TreeTaggerChunk chunker
processing resource:
- annotationSetName: the name of the annotation set to process
- debugMode: true/false -- if true, will keep temporary files and show
debugging messages
- doPosTagging: if true (the default), this invocation of the TreeTagger
will generate both POS tags and the chunking information.
- encoding: the encoding to use when creating the file that will be
read in by the TreeTagger binary. This defaults to ISO-8859-1 and this
is currently the only encoding supported by TreeTagger.
- failOnUnmappableChar: what to do if the document contains a
character that cannot be converted to the specified encoding. If true,
the processing will be interrupted, if false, the character will be
silently ignored.
- includeLemma: if true (default is false), lemmata will be included in
the POS features.
- tokenAnnotationType: the type name of the tokens to be used for
POS tagging. The default is "Token".
- treeTaggerInvocationScriptParms: everything specified here is
passed on to the invocation script that actually locates and invokes
the TreeTagger binary. The invocation script is located in the cmd
subdirectory of the plugin directory and is named run-treetagger.pl.
It can be invoked manually with the command
perl cmd/run-treetagger.pl -h
to show all possible command line options, and with the command
perl cmd/run-treetagger.pl -man
to show the full documentation.
Note that some arguments do not apply to POS tagging and that
some arguments are passed automatically: -tmpdir and -chunk. The only
argument that needs to be specified in most cases is -lang.
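The interaction between the encoding and failOnUnmappableChar parameters described above can be illustrated with the standard java.nio.charset API. This is only a sketch of the described behaviour, not the plugin's actual code; the class name UnmappableCharSketch and the method encode are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the plugin.
class UnmappableCharSketch {

    // Encode text as it might be written to the tagger input file.
    // If failOnUnmappable is true, a character that cannot be represented
    // in the target encoding raises CharacterCodingException; if false,
    // the character is silently dropped.
    static byte[] encode(String text, String encoding, boolean failOnUnmappable)
            throws CharacterCodingException {
        CodingErrorAction action = failOnUnmappable
                ? CodingErrorAction.REPORT
                : CodingErrorAction.IGNORE;
        CharsetEncoder encoder = Charset.forName(encoding)
                .newEncoder()
                .onUnmappableCharacter(action)
                .onMalformedInput(action);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // The euro sign (U+20AC) has no ISO-8859-1 representation.
        String text = "caf\u00e9 \u20ac5";
        byte[] lenient = encode(text, "ISO-8859-1", false);
        System.out.println(lenient.length + " bytes written in lenient mode");
        try {
            encode(text, "ISO-8859-1", true);
        } catch (CharacterCodingException e) {
            System.out.println("strict mode rejected the text: " + e);
        }
    }
}
```

In lenient mode the unmappable character simply disappears from the output, which is why the number of tokens may still match the number of output lines even though the text handed to TreeTagger differs from the document text.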
The output of the chunker is stored in three additional features for each
annotation: chunktag, which holds the original tag assigned by
TreeTagger; chunkpart, which contains the sequence information from the
chunktag; and chunktype, which contains the type information from the
chunktag.
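As a rough illustration of how the three features relate, the sketch below splits a chunk tag into its sequence part and its type part. It assumes tags of the form B-NC or I-NC (sequence marker, hyphen, chunk type) and treats a tag without a hyphen as having no type; both the tag format and the class name ChunkTagSketch are assumptions made for this example, not something the plugin documentation guarantees.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the plugin.
class ChunkTagSketch {

    // Derive the chunktag, chunkpart and chunktype features from one tag.
    // Assumption: the tag looks like "B-NC"; a tag without a hyphen is
    // kept as the chunkpart and gets an empty chunktype.
    static Map<String, String> chunkFeatures(String chunkTag) {
        Map<String, String> features = new LinkedHashMap<>();
        features.put("chunktag", chunkTag);
        int dash = chunkTag.indexOf('-');
        if (dash < 0) {
            features.put("chunkpart", chunkTag);
            features.put("chunktype", "");
        } else {
            features.put("chunkpart", chunkTag.substring(0, dash));
            features.put("chunktype", chunkTag.substring(dash + 1));
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(chunkFeatures("B-NC"));
        System.out.println(chunkFeatures("I-VC"));
    }
}
```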
In order to create annotations that span several tokens of the same type,
you need to postprocess the output of the chunker with a JAPE transducer.
An example JAPE rule file that works with the default token annotation
type is provided in the resources/grammar subdirectory of the plugin
directory as the file join.jape.
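The joining step itself is done in JAPE, but the underlying idea can be sketched in plain Java. The example below merges consecutive tokens into one span per chunk, assuming that a chunkpart of "B" opens a chunk and a chunkpart of "I" with the same chunktype continues it. The Tok and Span classes, the B/I convention, and the join method are illustrative assumptions; the rules in the shipped JAPE file may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the plugin.
class ChunkJoinSketch {

    // Minimal token view: character offsets plus the two chunk features.
    static final class Tok {
        final int start, end;
        final String chunkpart, chunktype;
        Tok(int start, int end, String chunkpart, String chunktype) {
            this.start = start; this.end = end;
            this.chunkpart = chunkpart; this.chunktype = chunktype;
        }
    }

    // A joined chunk covering one or more tokens.
    static final class Span {
        final int start;
        int end;
        final String type;
        Span(int start, int end, String type) {
            this.start = start; this.end = end; this.type = type;
        }
        @Override public String toString() {
            return type + "[" + start + "," + end + "]";
        }
    }

    // Assumption: "B" opens a chunk, "I" with the same type extends it,
    // and any other token closes the chunk that is currently open.
    static List<Span> join(List<Tok> tokens) {
        List<Span> spans = new ArrayList<>();
        Span current = null;
        for (Tok t : tokens) {
            if (current != null && "I".equals(t.chunkpart)
                    && current.type.equals(t.chunktype)) {
                current.end = t.end;
                continue;
            }
            if (current != null) {
                spans.add(current);
                current = null;
            }
            if ("B".equals(t.chunkpart)) {
                current = new Span(t.start, t.end, t.chunktype);
            }
        }
        if (current != null) {
            spans.add(current);
        }
        return spans;
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok(0, 3, "B", "NC"));
        tokens.add(new Tok(4, 7, "I", "NC"));
        tokens.add(new Tok(8, 12, "B", "VC"));
        System.out.println(join(tokens));
    }
}
```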
- Author:
- Johann Petrak, Austrian Research Institute for AI (OFAI)
- See Also:
- Serialized Form
Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource |
gate.creole.AbstractProcessingResource.InternalStatusListener, gate.creole.AbstractProcessingResource.IntervalProgressListener |
Fields inherited from class gate.creole.AbstractLanguageAnalyser |
corpus |
Fields inherited from class gate.creole.AbstractProcessingResource |
interrupted |
Fields inherited from class gate.creole.AbstractResource |
name |
Fields inherited from class gate.util.AbstractFeatureBearer |
features |
Fields inherited from interface gate.creole.ANNIEConstants |
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME |
Method Summary |
void |
execute()
Run the TreeTagger on the current document. |
java.lang.Boolean |
getDoPosTagging()
Get the flag for whether POS tags are set in addition to the chunk
tags. |
protected void |
getFeatures4Tokens(java.util.ArrayList lines,
java.util.ArrayList tokens)
Parse the lines of TreeTagger Chunker output and create features for
the tokens. |
java.lang.Boolean |
getIncludeLemma()
Get the flag for whether lemmata are included in addition to the chunk
tags. |
void |
setDoPosTagging(java.lang.Boolean newValue)
Set the flag for whether we also want POS tags. |
void |
setIncludeLemma(java.lang.Boolean newValue)
Set the flag for whether we also want lemmata. |
Methods inherited from class gate.treetagger2.TreeTaggerBase |
getAnnotationSetName, getDebugMode, getDocument, getEncoding, getFailOnUnmappableChar, getTokenAnnotationType, getTreeTaggerInvocationScriptParms, init, setAnnotationSetName, setDebugMode, setDocument, setEncoding, setFailOnUnmappableChar, setTokenAnnotationType, setTreeTaggerInvocationScriptParms |
Methods inherited from class gate.creole.AbstractLanguageAnalyser |
getCorpus, setCorpus |
Methods inherited from class gate.creole.AbstractProcessingResource |
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, reInit, removeProgressListener, removeStatusListener |
Methods inherited from class gate.creole.AbstractResource |
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
Methods inherited from class gate.util.AbstractFeatureBearer |
getFeatures, setFeatures |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface gate.ProcessingResource |
reInit |
Methods inherited from interface gate.Resource |
cleanup, getParameterValue, init, setParameterValue, setParameterValues |
Methods inherited from interface gate.util.FeatureBearer |
getFeatures, setFeatures |
Methods inherited from interface gate.util.NameBearer |
getName, setName |
Methods inherited from interface gate.Executable |
interrupt, isInterrupted |
doPosTagging
protected boolean doPosTagging
includeLemma
protected boolean includeLemma
TreeTaggerChunk
public TreeTaggerChunk()
execute
public void execute()
throws gate.creole.ExecutionException
- Description copied from class:
TreeTaggerBase
- Run the TreeTagger on the current document. This writes the document text
to a temporary file, runs the tagger and processes its output to produce
TreeTaggerToken annotations on the document.
- Specified by:
execute
in interface gate.Executable
- Overrides:
execute
in class TreeTaggerBase
- Throws:
gate.creole.ExecutionException
getFeatures4Tokens
protected void getFeatures4Tokens(java.util.ArrayList lines,
java.util.ArrayList tokens)
- Parse the lines of TreeTagger Chunker output and create features for
the tokens.
The chunker output has the format:
word-POS POS/CHUNK
If the doPosTagging flag is true, we also set the POS tags from
this output (using the first part of the POS/CHUNK pair).
If the includeLemma flag is true, the invocation script is called
with the -lemma option. In this case, we will set
the lemma in addition to the chunk tags. The format in this case is:
word-POS POS/CHUNK LEMMA
- Specified by:
getFeatures4Tokens
in class TreeTaggerBase
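To make the two line formats concrete, here is a minimal parsing sketch. It assumes the fields are separated by whitespace, that the chunk tag follows the last slash of the POS/CHUNK field, and that a lemma contains no whitespace. The class ChunkerLineSketch, its Parsed holder, and the sample line are hypothetical; the actual getFeatures4Tokens implementation may parse the output differently.

```java
// Hypothetical sketch, not the plugin's actual parsing code.
class ChunkerLineSketch {

    // Values extracted from one chunker output line.
    static final class Parsed {
        final String pos, chunk, lemma; // lemma is null without -lemma
        Parsed(String pos, String chunk, String lemma) {
            this.pos = pos; this.chunk = chunk; this.lemma = lemma;
        }
    }

    // Parse "word-POS POS/CHUNK" or, if includeLemma is true,
    // "word-POS POS/CHUNK LEMMA".
    // Assumption: fields are whitespace-separated and the chunk tag is
    // everything after the last '/' in the second field.
    static Parsed parse(String line, boolean includeLemma) {
        String[] fields = line.trim().split("\\s+");
        int expected = includeLemma ? 3 : 2;
        if (fields.length != expected) {
            throw new IllegalArgumentException(
                    "expected " + expected + " fields but got "
                    + fields.length + ": " + line);
        }
        int slash = fields[1].lastIndexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("no POS/CHUNK pair: " + line);
        }
        String pos = fields[1].substring(0, slash);
        String chunk = fields[1].substring(slash + 1);
        String lemma = includeLemma ? fields[2] : null;
        return new Parsed(pos, chunk, lemma);
    }

    public static void main(String[] args) {
        // Made-up sample line in the documented format.
        Parsed p = parse("dogs-NNS NNS/I-NC dog", true);
        System.out.println(p.pos + " " + p.chunk + " " + p.lemma);
    }
}
```

Rejecting lines with an unexpected number of fields keeps a format mismatch (for example, a missing -lemma option) from silently shifting values into the wrong feature.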
setDoPosTagging
public void setDoPosTagging(java.lang.Boolean newValue)
- Set the flag for whether we also want POS tags.
getDoPosTagging
public java.lang.Boolean getDoPosTagging()
- Get the flag for whether POS tags are set in addition to the chunk
tags.
setIncludeLemma
public void setIncludeLemma(java.lang.Boolean newValue)
- Set the flag for whether we also want lemmata.
getIncludeLemma
public java.lang.Boolean getIncludeLemma()
- Get the flag for whether lemmata are included in addition to the chunk
tags.