Estimates a Kneser-Ney language model from raw text and writes the model
out in ARPA format. This is meant to closely resemble the functionality
of SRILM's
    ngram-count -text <text file> -ukndiscount -lm <outputfile>
with two main exceptions:
(a) Rather than calculating the discount for each n-gram order from counts,
a constant discount of 0.75 is used for all orders.
(b) Count thresholding is currently not implemented (SRILM by default
thresholds counts for n-grams with n > 3).
Note that if the input/output files have a .gz suffix, they will be
unzipped/zipped as necessary. If no input files are given (or "-" is
specified), lines will be read from standard input.
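
The file-handling convention described above could be sketched as follows. The helper name open_text is hypothetical, and both the UTF-8 encoding and the mapping of "-" to standard output when writing are assumptions; the description only states that "-" selects standard input.

```python
import gzip
import sys


def open_text(path, mode="r"):
    """Open `path` for text I/O, handling "-" and a .gz suffix (hypothetical helper)."""
    if path == "-":
        # "-" means standard input; using stdout for writes is an assumption.
        return sys.stdin if mode == "r" else sys.stdout
    if path.endswith(".gz"):
        # gzip.open needs an explicit text mode ("rt"/"wt") to yield str lines.
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")
```

A caller would then iterate over open_text(name) line by line without needing to know whether the file is compressed.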