public class OffsetTokenizer
extends org.apache.uima.analysis_engine.annotator.JTextAnnotator_ImplBase
A tokenizer similar to java.util.StringTokenizer, except that this tokenizer returns TokenAnnotation objects, which, in addition to the token text string, also contain the start and end offsets of the token in the original string.
The tokenizer will optionally perform stemming and case normalization on the tokens, and the set of characters that delimit tokens may be specified. The default stemmer is the Snowball Porter stemmer, but any stemmer may be supplied to the tokenizer as long as it implements the Stemmer interface.
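The offset-tracking idea described above can be illustrated with a small standalone sketch. It does not use the UIMA or cTAKES classes: the `Token` class here is a hypothetical stand-in for TokenAnnotation, and the exclusive end offset is an assumption rather than something this page states.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a standalone sketch, not the OffsetTokenizer implementation. */
final class OffsetTokenSketch {

    /** Hypothetical stand-in for TokenAnnotation: token text plus offsets into the source. */
    static final class Token {
        final String text;
        final int begin; // offset of the first character of the token
        final int end;   // assumed exclusive: one past the last character

        Token(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + begin + "," + end + ")";
        }
    }

    /** Splits input on any character in delimiters, recording where each token starts and ends. */
    static List<Token> tokenize(String input, String delimiters) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= input.length(); i++) {
            // The end of the input closes an open token just like a delimiter does.
            boolean atDelimiter = i == input.length()
                    || delimiters.indexOf(input.charAt(i)) >= 0;
            if (atDelimiter) {
                if (start >= 0) {
                    tokens.add(new Token(input.substring(start, i), start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i; // first character of a new token
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Chest pain, mild.", " ,.")) {
            System.out.println(t);
        }
    }
}
```

Unlike java.util.StringTokenizer, which hands back only the token strings, this keeps enough information to map each token back to its position in the original text.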
Modifier and Type | Field and Description |
---|---|
static String | PARAM_CASE_MATCH: Configuration parameter key/label for the case matching string |
static String | PARAM_STEMMER_CLASS: Configuration parameter key/label for the stemmer class spec |
static String | PARAM_TOKEN_DELIM: Configuration parameter key/label for the token delimiters string |
Constructor and Description |
---|
OffsetTokenizer(): Create a new OffsetTokenizer. |
Modifier and Type | Method and Description |
---|---|
static String | doFoldCase(String token) |
static String | doStemming(String token, Stemmer stemmer) |
protected void | doTokenization(org.apache.uima.jcas.JCas jcas, String documentText, String delimiters) |
protected String | foldCase(String token): If one of the case folding flags is true and the input string matches the character pattern corresponding to that flag, then convert all letters to lowercase. |
protected boolean | getCaseFoldAll(): Get the case folding flag for folding all tokens. |
protected boolean | getCaseFoldDigit(): Get the case folding flag for folding tokens with at least one digit character. |
protected boolean | getCaseFoldInitCap(): Get the case folding flag for folding tokens with an initial cap. |
protected String | getDelim(): Get the current list of delimiters used to separate the input string into tokens. |
Stemmer | getStemmer() |
protected boolean | getStemming(): Get the current stemming flag. |
String | getText() |
void | initialize(org.apache.uima.analysis_engine.annotator.AnnotatorContext annotatorContext): Initialize the annotator, which includes compilation of regular expressions, fetching configuration parameters from the XML descriptor file, and loading of the dictionary file. |
void | initTokenizer(String[] paramNames, Object[] paramValues) |
TokenAnnotation | newToken(org.apache.uima.jcas.JCas jcas) |
TokenAnnotation | nextToken(org.apache.uima.jcas.JCas jcas) |
protected void | overrideDelim(String delim): Set the delimiters used to separate the input string into tokens. |
void | process(org.apache.uima.jcas.JCas jcas, org.apache.uima.analysis_engine.ResultSpecification aResultSpec): Perform the actual analysis. |
void | processAllConfigurationParameters(String[] configParameterNames, Object[] configParameters) |
void | processConfigurationParameter(String configParameterName, Object configParameterValue) |
protected void | setDelim(String delim): Set the delimiters used to separate the input string into tokens. |
void | setStemmer(Stemmer stemmer) |
void | setText(String text): Set the text to tokenize. |
boolean | shouldFoldCase(String token) |
boolean | shouldStem() |
protected String | stem(String token): If the stemming flag is true, then return the stemmed form of the supplied word using the Porter stemmer. |
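The stemmer-related methods listed above suggest a pluggable design: a stemming flag plus a replaceable stemmer object. The sketch below shows one way such an arrangement could look. The `WordStemmer` interface, the `PluralStripper` class, and the `setStemming` method are hypothetical names introduced for illustration; the actual Stemmer interface is not shown on this page and may differ.

```java
/** Illustrative only: a standalone sketch of flag-guarded, pluggable stemming. */
final class StemmerSketch {

    /** Hypothetical single-method interface; the real Stemmer interface may differ. */
    interface WordStemmer {
        String stem(String word);
    }

    /** Toy stemmer that strips one trailing "s"; a placeholder for a real Porter stemmer. */
    static final class PluralStripper implements WordStemmer {
        @Override
        public String stem(String word) {
            boolean plural = word.length() > 3
                    && word.endsWith("s")
                    && !word.endsWith("ss");
            return plural ? word.substring(0, word.length() - 1) : word;
        }
    }

    private WordStemmer stemmer = new PluralStripper(); // default, replaceable
    private boolean stemming = true;                    // the stemming flag

    void setStemmer(WordStemmer stemmer) {
        this.stemmer = stemmer;
    }

    void setStemming(boolean stemming) {
        this.stemming = stemming;
    }

    /** Returns the stemmed form when the flag is on, otherwise the token unchanged. */
    String stem(String token) {
        return stemming ? stemmer.stem(token) : token;
    }

    public static void main(String[] args) {
        StemmerSketch sketch = new StemmerSketch();
        System.out.println(sketch.stem("tokens"));
        sketch.setStemming(false);
        System.out.println(sketch.stem("tokens"));
    }
}
```

Keeping the flag check in one place means callers can always call `stem` and let configuration decide whether anything happens.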
destroy, finalize, getContext, getTypeSystem, reconfigure, typeSystemInit
public static final String PARAM_CASE_MATCH
public static final String PARAM_STEMMER_CLASS
public static final String PARAM_TOKEN_DELIM
public OffsetTokenizer()
Create a new OffsetTokenizer. Initializes the default stemmer and sets up the regular expressions for the various case folding options.
public String getText()
public void setText(String text)
Set the text to tokenize. After this call, nextToken will return the first token from the input string as a TokenAnnotation; you can get the text by using TokenAnnotation.getText().
public Stemmer getStemmer()
public void setStemmer(Stemmer stemmer)
stemmer - The stemmer to set.
public TokenAnnotation newToken(org.apache.uima.jcas.JCas jcas)
public TokenAnnotation nextToken(org.apache.uima.jcas.JCas jcas)
protected String foldCase(String token)
token - The string to case fold.
public boolean shouldFoldCase(String token)
public boolean shouldStem()
protected void setDelim(String delim)
delim - The new set of delimiters.
protected void overrideDelim(String delim)
delim - The new set of delimiters.
protected String getDelim()
protected boolean getStemming()
protected boolean getCaseFoldInitCap()
protected boolean getCaseFoldDigit()
protected boolean getCaseFoldAll()
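The three case folding flags and the foldCase method can be read together as a simple rule: lowercase a token when any enabled flag's pattern matches it. A minimal sketch of that rule follows. The two regular expressions are assumptions, since this page does not give the patterns that OffsetTokenizer compiles, and the class name is hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative only: flag-driven case folding with assumed patterns. */
final class CaseFoldSketch {

    // Assumed pattern: one uppercase letter followed by lowercase letters.
    private static final Pattern INIT_CAP = Pattern.compile("\\p{Lu}\\p{Ll}+");
    // Assumed pattern: at least one digit anywhere in the token.
    private static final Pattern HAS_DIGIT = Pattern.compile(".*\\d.*");

    private final boolean foldAll;
    private final boolean foldDigit;
    private final boolean foldInitCap;

    CaseFoldSketch(boolean foldAll, boolean foldDigit, boolean foldInitCap) {
        this.foldAll = foldAll;
        this.foldDigit = foldDigit;
        this.foldInitCap = foldInitCap;
    }

    /** True when an enabled flag applies to this token. */
    boolean shouldFoldCase(String token) {
        return foldAll
                || (foldDigit && HAS_DIGIT.matcher(token).matches())
                || (foldInitCap && INIT_CAP.matcher(token).matches());
    }

    /** Lowercases the token when a flag applies, otherwise returns it unchanged. */
    String foldCase(String token) {
        return shouldFoldCase(token) ? token.toLowerCase(Locale.ROOT) : token;
    }

    public static void main(String[] args) {
        CaseFoldSketch initCapOnly = new CaseFoldSketch(false, false, true);
        System.out.println(initCapOnly.foldCase("Aspirin"));
        System.out.println(initCapOnly.foldCase("MRI"));
    }
}
```

With only the initial-cap flag set, an all-caps token such as an acronym is left alone under these assumed patterns, which is one plausible reason for having separate flags.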
public void initialize(org.apache.uima.analysis_engine.annotator.AnnotatorContext annotatorContext) throws org.apache.uima.analysis_engine.annotator.AnnotatorInitializationException, org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
Specified by: initialize in interface org.apache.uima.analysis_engine.annotator.BaseAnnotator
Overrides: initialize in class org.apache.uima.analysis_engine.annotator.Annotator_ImplBase
Throws: org.apache.uima.analysis_engine.annotator.AnnotatorInitializationException
Throws: org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
public void processAllConfigurationParameters(String[] configParameterNames, Object[] configParameters) throws org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
Throws: org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
public void process(org.apache.uima.jcas.JCas jcas, org.apache.uima.analysis_engine.ResultSpecification aResultSpec) throws org.apache.uima.analysis_engine.annotator.AnnotatorProcessException
jcas - the current CAS to process.
aResultSpec - a specification of the result annotation that should be created by this annotator.
Throws: org.apache.uima.analysis_engine.annotator.AnnotatorProcessException
See Also: JTextAnnotator.process(JCas, ResultSpecification)
public void initTokenizer(String[] paramNames, Object[] paramValues) throws Exception
Throws: Exception
protected void doTokenization(org.apache.uima.jcas.JCas jcas, String documentText, String delimiters)
jcas -
documentText -
delimiters -
public void processConfigurationParameter(String configParameterName, Object configParameterValue)
configParameterName -
configParameterValue -
protected String stem(String token)
token - the word to stem
Copyright © 2006–2021 The Apache Software Foundation. All rights reserved.