public class LmReaders
extends java.lang.Object
This software provides three main pieces of functionality:
(a) estimation of language models from text inputs,
(b) data structures for efficiently storing large collections of n-grams in memory, and
(c) an API for efficiently querying language models derived from n-gram collections.
Most of the techniques used here are described in "Faster and Smaller N-gram Language Models" (Pauls and Klein, 2011).
This software supports the estimation of two types of language models: Kneser-Ney language models (Kneser and Ney, 1995) and Stupid Backoff language models (Brants et al., 2007). Kneser-Ney language models can be estimated from raw text by calling createKneserNeyLmFromTextFiles(List, WordIndexer, int, File, ConfigOptions). This can also be done from the command line via the main() method of MakeKneserNeyArpaFromText; see the examples folder for a script which demonstrates its use. A Stupid Backoff language model can be read from a directory containing n-gram counts in the format used by Google's Web1T corpus by calling readLmFromGoogleNgramDir(String, boolean, boolean). Note that this software does not (yet) support building Google count directories from raw text, though this can be done using SRILM.
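A minimal sketch of the estimation call described above, assuming the BerkeleyLM classes are on the classpath; the input and output file names are hypothetical:

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

import edu.berkeley.nlp.lm.ConfigOptions;
import edu.berkeley.nlp.lm.StringWordIndexer;
import edu.berkeley.nlp.lm.io.LmReaders;

public class EstimateKneserNey {
    public static void main(String[] args) {
        // Hypothetical input: files of newline-separated raw text.
        List<String> inputFiles = Arrays.asList("train.txt");
        StringWordIndexer wordIndexer = new StringWordIndexer();
        // Estimate a 3-gram Kneser-Ney model and write it in ARPA format.
        LmReaders.createKneserNeyLmFromTextFiles(inputFiles, wordIndexer, 3,
            new File("train.arpa"), new ConfigOptions());
    }
}
```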
Loading/estimating language models from text files can be very slow. This software can use Java's built-in serialization to build language model binaries which are both smaller and faster to load. MakeLmBinaryFromArpa and MakeLmBinaryFromGoogle provide main() methods for doing this. See the examples folder for scripts which demonstrate their use.
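The same round trip can be done programmatically with writeLmBinary and readLmBinary; this is a sketch, and the file names are hypothetical:

```java
import edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm;
import edu.berkeley.nlp.lm.NgramLanguageModel;
import edu.berkeley.nlp.lm.io.LmReaders;

public class BinarizeLm {
    public static void main(String[] args) {
        // Read a (slow-to-load) ARPA file once...
        ArrayEncodedProbBackoffLm<String> lm =
            LmReaders.readArrayEncodedLmFromArpa("train.arpa", false);
        // ...and serialize it to a binary that is smaller and faster to load.
        LmReaders.writeLmBinary(lm, "train.binary");
        NgramLanguageModel<String> reloaded = LmReaders.readLmBinary("train.binary");
    }
}
```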
Language models can be read into memory from the ARPA format using readArrayEncodedLmFromArpa(String, boolean) and readContextEncodedLmFromArpa(String). The "array encoding" versus "context encoding" distinction is discussed in Section 4.2 of Pauls and Klein (2011). Again, since loading language models from textual representations can be very slow, they can instead be read from binaries using readLmBinary(String). The interfaces for these language models can be found in ArrayEncodedNgramLanguageModel and ContextEncodedNgramLanguageModel. For examples of these interfaces in action, have a look at PerplexityTest.
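For instance, a sketch of loading an ARPA model and scoring an n-gram through the array-encoded interface (the model file name is hypothetical):

```java
import java.util.Arrays;
import java.util.List;

import edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm;
import edu.berkeley.nlp.lm.io.LmReaders;

public class ScoreNgram {
    public static void main(String[] args) {
        ArrayEncodedProbBackoffLm<String> lm =
            LmReaders.readArrayEncodedLmFromArpa("train.arpa", false);
        // getLogProb scores the last word of the n-gram given the preceding words.
        List<String> ngram = Arrays.asList("the", "quick", "fox");
        float logProb = lm.getLogProb(ngram);
        System.out.println(logProb);
    }
}
```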
We implement the HASH, HASH+SCROLL, and COMPRESSED language model representations described in Pauls and Klein (2011) in this release. The SORTED implementation may be added later. See HashNgramMap and CompressedNgramMap for the implementations of the HASH and COMPRESSED representations.
To speed up queries, you can wrap language models with caches (ContextEncodedCachingLmWrapper and ArrayEncodedCachingLmWrapper). These caches are described in Section 4.1 of Pauls and Klein (2011). You should more or less always use these caches, since they are faster and have modest memory requirements.
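A sketch of wrapping a loaded model with the array-encoded cache (the static factory names below are from the BerkeleyLM release; the model file name is hypothetical):

```java
import edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel;
import edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm;
import edu.berkeley.nlp.lm.cache.ArrayEncodedCachingLmWrapper;
import edu.berkeley.nlp.lm.io.LmReaders;

public class CacheExample {
    public static void main(String[] args) {
        ArrayEncodedProbBackoffLm<String> lm =
            LmReaders.readArrayEncodedLmFromArpa("train.arpa", false);
        // Wrap with a per-thread cache; there is also a thread-safe variant
        // (wrapWithCacheThreadSafe) for shared use across threads.
        ArrayEncodedNgramLanguageModel<String> cached =
            ArrayEncodedCachingLmWrapper.wrapWithCacheNotThreadSafe(lm);
    }
}
```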
This software also supports a Java Map wrapper around an n-gram collection. You can read a map wrapper using readNgramMapFromGoogleNgramDir(String, boolean, WordIndexer).
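A sketch of the Map view over a count directory; the directory path is hypothetical, and the package locations of NgramMapWrapper and LongRef are as in the BerkeleyLM release:

```java
import java.util.Arrays;

import edu.berkeley.nlp.lm.io.LmReaders;
import edu.berkeley.nlp.lm.map.NgramMapWrapper;
import edu.berkeley.nlp.lm.util.LongRef;

public class MapWrapperExample {
    public static void main(String[] args) {
        // "google_dir" is a hypothetical Web1T-format count directory.
        NgramMapWrapper<String, LongRef> map =
            LmReaders.readNgramMapFromGoogleNgramDir("google_dir", false);
        // The wrapper behaves like a java.util.Map from n-grams to counts.
        LongRef count = map.get(Arrays.asList("the", "quick"));
        System.out.println(count);
    }
}
```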
ComputeLogProbabilityOfTextStream provides a main() method for computing the log probability of raw text. Some example scripts can be found in the examples/ directory.
| Constructor and Description |
|---|
| LmReaders() |
| Modifier and Type | Method and Description |
|---|---|
| static <W> void | createKneserNeyLmFromTextFiles(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, java.io.File arpaOutputFile, ConfigOptions opts) Estimates a Kneser-Ney language model from raw text, and writes a file (in ARPA format). |
| static <W> ArrayEncodedProbBackoffLm<W> | readArrayEncodedLmFromArpa(LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>> lmFile, boolean compress, WordIndexer<W> wordIndexer, ConfigOptions opts) Reads an array-encoded language model from an ARPA lm file. |
| static ArrayEncodedProbBackoffLm<java.lang.String> | readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress) |
| static <W> ArrayEncodedProbBackoffLm<W> | readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress, WordIndexer<W> wordIndexer) |
| static <W> ArrayEncodedProbBackoffLm<W> | readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress, WordIndexer<W> wordIndexer, ConfigOptions opts, int lmOrder) |
| static <W> ContextEncodedProbBackoffLm<W> | readContextEncodedKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts) Builds a context-encoded LM from raw text. |
| static <W> ContextEncodedProbBackoffLm<W> | readContextEncodedKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts, java.io.File tmpFile) |
| static <W> ContextEncodedProbBackoffLm<W> | readContextEncodedLmFromArpa(LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>> lmFile, WordIndexer<W> wordIndexer, ConfigOptions opts) |
| static ContextEncodedProbBackoffLm<java.lang.String> | readContextEncodedLmFromArpa(java.lang.String lmFile) |
| static <W> ContextEncodedProbBackoffLm<W> | readContextEncodedLmFromArpa(java.lang.String lmFile, WordIndexer<W> wordIndexer) |
| static <W> ContextEncodedProbBackoffLm<W> | readContextEncodedLmFromArpa(java.lang.String lmFile, WordIndexer<W> wordIndexer, ConfigOptions opts, int lmOrder) Reads a context-encoded language model from an ARPA lm file. |
| static StupidBackoffLm<java.lang.String> | readGoogleLmBinary(java.lang.String file, java.lang.String sortedVocabFile) |
| static <W> StupidBackoffLm<W> | readGoogleLmBinary(java.lang.String file, WordIndexer<W> wordIndexer, java.lang.String sortedVocabFile) Reads in a pre-built Google n-gram binary. |
| static <W> ArrayEncodedProbBackoffLm<W> | readKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, boolean compress, ConfigOptions opts, java.io.File tmpFile) |
| static <W> ArrayEncodedProbBackoffLm<W> | readKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts, boolean compress) Builds an array-encoded LM from raw text. |
| static <W> NgramLanguageModel<W> | readLmBinary(java.lang.String file) Reads a binary file representing an LM. |
| static ArrayEncodedNgramLanguageModel<java.lang.String> | readLmFromGoogleNgramDir(java.lang.String dir, boolean compress, boolean kneserNey) |
| static <W> ArrayEncodedNgramLanguageModel<W> | readLmFromGoogleNgramDir(java.lang.String dir, boolean compress, boolean kneserNey, WordIndexer<W> wordIndexer, ConfigOptions opts) Reads a stupid backoff lm from a directory with n-gram counts in the format used by Google n-grams. |
| static NgramMapWrapper<java.lang.String,LongRef> | readNgramMapFromBinary(java.lang.String binary, java.lang.String vocabFile) |
| static <W> NgramMapWrapper<W,LongRef> | readNgramMapFromBinary(java.lang.String binary, java.lang.String sortedVocabFile, WordIndexer<W> wordIndexer) |
| static NgramMapWrapper<java.lang.String,LongRef> | readNgramMapFromGoogleNgramDir(java.lang.String dir, boolean compress) |
| static <W> NgramMapWrapper<W,LongRef> | readNgramMapFromGoogleNgramDir(java.lang.String dir, boolean compress, WordIndexer<W> wordIndexer) |
| static <W> void | writeLmBinary(NgramLanguageModel<W> lm, java.lang.String file) Writes a binary file representing the LM using the built-in serialization. |
public static ContextEncodedProbBackoffLm<java.lang.String> readContextEncodedLmFromArpa(java.lang.String lmFile)

public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedLmFromArpa(java.lang.String lmFile, WordIndexer<W> wordIndexer)

public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedLmFromArpa(java.lang.String lmFile, WordIndexer<W> wordIndexer, ConfigOptions opts, int lmOrder)

public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedLmFromArpa(LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>> lmFile, WordIndexer<W> wordIndexer, ConfigOptions opts)

Reads a context-encoded language model from an ARPA lm file.
Type Parameters: W
Parameters: lmFile, wordIndexer, opts, lmOrder

public static ArrayEncodedProbBackoffLm<java.lang.String> readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress)

public static <W> ArrayEncodedProbBackoffLm<W> readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress, WordIndexer<W> wordIndexer)

public static <W> ArrayEncodedProbBackoffLm<W> readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress, WordIndexer<W> wordIndexer, ConfigOptions opts, int lmOrder)

public static <W> ArrayEncodedProbBackoffLm<W> readArrayEncodedLmFromArpa(LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>> lmFile, boolean compress, WordIndexer<W> wordIndexer, ConfigOptions opts)

Reads an array-encoded language model from an ARPA lm file.
Type Parameters: W
Parameters:
lmFile
compress - Compress the LM using block compression. This LM should be smaller but slower.
wordIndexer
opts
lmOrder

public static NgramMapWrapper<java.lang.String,LongRef> readNgramMapFromGoogleNgramDir(java.lang.String dir, boolean compress)

public static <W> NgramMapWrapper<W,LongRef> readNgramMapFromGoogleNgramDir(java.lang.String dir, boolean compress, WordIndexer<W> wordIndexer)

public static NgramMapWrapper<java.lang.String,LongRef> readNgramMapFromBinary(java.lang.String binary, java.lang.String vocabFile)

public static <W> NgramMapWrapper<W,LongRef> readNgramMapFromBinary(java.lang.String binary, java.lang.String sortedVocabFile, WordIndexer<W> wordIndexer)

Parameters:
sortedVocabFile - should be the vocab_cs.gz file from the Google n-gram corpus.

public static ArrayEncodedNgramLanguageModel<java.lang.String> readLmFromGoogleNgramDir(java.lang.String dir, boolean compress, boolean kneserNey)

public static <W> ArrayEncodedNgramLanguageModel<W> readLmFromGoogleNgramDir(java.lang.String dir, boolean compress, boolean kneserNey, WordIndexer<W> wordIndexer, ConfigOptions opts)

Reads a stupid backoff lm from a directory with n-gram counts in the format used by Google n-grams.
Type Parameters: W
Parameters: dir, compress, wordIndexer, opts

public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts)

public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts, java.io.File tmpFile)

Builds a context-encoded LM from raw text. This writes an intermediate ARPA file using createKneserNeyLmFromTextFiles(List, WordIndexer, int, File) and then reads the resulting file. Since the temp file can be quite large, it is important that the temp directory used by Java (java.io.tmpdir) has sufficient space.
Type Parameters: W
Parameters: files, wordIndexer, lmOrder, opts

public static <W> ArrayEncodedProbBackoffLm<W> readKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts, boolean compress)

public static <W> ArrayEncodedProbBackoffLm<W> readKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, boolean compress, ConfigOptions opts, java.io.File tmpFile)

Builds an array-encoded LM from raw text. This writes an intermediate ARPA file using createKneserNeyLmFromTextFiles(List, WordIndexer, int, File) and then reads the resulting file. Since the temp file can be quite large, it is important that the temp directory used by Java (java.io.tmpdir) has sufficient space.
Type Parameters: W
Parameters: files, wordIndexer, lmOrder, opts

public static <W> void createKneserNeyLmFromTextFiles(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, java.io.File arpaOutputFile, ConfigOptions opts)

Estimates a Kneser-Ney language model from raw text, and writes a file (in ARPA format).
Type Parameters: W
Parameters:
files - Files of raw text (new-line separated).
wordIndexer
lmOrder
arpaOutputFile

public static StupidBackoffLm<java.lang.String> readGoogleLmBinary(java.lang.String file, java.lang.String sortedVocabFile)

public static <W> StupidBackoffLm<W> readGoogleLmBinary(java.lang.String file, WordIndexer<W> wordIndexer, java.lang.String sortedVocabFile)

Reads in a pre-built Google n-gram binary. The vocabulary is kept in the separate vocab_cs.gz file (so that the corpus cannot be reproduced unless the user has the rights to do so).
Type Parameters: W
Parameters:
file - The binary file.
wordIndexer
sortedVocabFile - the vocab_cs.gz vocabulary file.

public static <W> NgramLanguageModel<W> readLmBinary(java.lang.String file)

Reads a binary file representing an LM. The returned LM should be cast to ContextEncodedNgramLanguageModel or ArrayEncodedNgramLanguageModel to be useful.

public static <W> void writeLmBinary(NgramLanguageModel<W> lm, java.lang.String file)

Writes a binary file representing the LM using the built-in serialization.
Type Parameters: W
Parameters: lm, file