The State of the Art in Language Modeling
Joshua Goodman, Microsoft Research
Summary:
This tutorial covers the state of the art in language modeling. Because of size
and data issues, language modeling is an especially challenging subfield of
machine learning. The bulk of the tutorial
will describe current techniques in language modeling, including techniques
like word clustering and smoothing (regularization) that are useful in many
areas besides language modeling, and more language-model-specific techniques
such as high-order n-grams and sentence mixture models. Finally, the talk will describe applications
of language modeling in more detail, including applications outside of
language, as well as available toolkits and corpora.
Tutorial Description and Outline
This tutorial will cover the state of the art
in language modeling. Language models give the probability of word sequences, e.g.
"recognize speech" is much more probable than "wreck a nice
beach." Language modeling is a very
challenging problem for most machine learning techniques, because instead of
predicting the probability of one or two things, we need a probability
distribution over words -- i.e. tens of thousands of things.
Language models are useful in a
large number of areas, including speech recognition, handwriting recognition,
machine translation, information retrieval, context-sensitive spelling
correction, and text entry for Chinese and Japanese or on small input devices. Many language modeling techniques can be
applied to other areas or to modeling any discrete sequence. This tutorial should be accessible to anyone
interested in machine learning.
The most basic language models -- n-gram
models -- essentially just count occurrences of words in training data. I will describe five relatively simple
improvements over this baseline: smoothing, caching, skipping, sentence-mixture
models, and clustering. I will talk a
bit about the applications of language modeling, including to areas other than
language, and then I will quickly describe other recent promising work, and
available tools and resources.
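The counting baseline described above can be illustrated with a minimal sketch. It assumes a bigram model estimated by maximum likelihood from a made-up three-sentence corpus; the function names, the `<s>` and `</s>` boundary markers, and the corpus itself are illustrative choices, not material from the tutorial.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Count bigrams and their contexts; return a function P(word | prev)."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # Hypothetical boundary markers so sentence starts and ends are modeled.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1

    def prob(word, prev):
        # Maximum-likelihood estimate: count(prev, word) / count(prev).
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob

def sentence_prob(prob, sentence):
    """Multiply the conditional probabilities of each word given its predecessor."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= prob(word, prev)
    return p

# Toy corpus, invented for illustration only.
toy_corpus = [
    "we recognize speech",
    "they recognize speech",
    "we recognize words",
]
prob = train_bigram_mle(toy_corpus)
print(prob("speech", "recognize"))
print(sentence_prob(prob, "we recognize speech"))
print(sentence_prob(prob, "wreck a nice beach"))
```

Any sentence containing a bigram that never occurred in the training data receives probability zero under this estimate, which is the data sparsity problem that smoothing is meant to address.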
I begin by describing conventional-style
language modeling techniques.
1) Smoothing (also called
regularization) addresses the problem of data sparsity:
there is rarely enough data to accurately estimate the parameters of a language
model. Smoothing gives a way to combine
less specific but more reliably estimated information with more specific but noisier data. I will describe two classic techniques -- deleted
interpolation and Katz (or Good-Turing) smoothing -- and one recent technique,
Modified Kneser-Ney smoothing, which is the best smoothing technique
known.
2) Caching is a widely used
technique that exploits the observation that recently observed words are likely to
occur again. Models built from recently
observed data can be combined with more general models to improve performance.
3) Skipping models use the
observation that even words that are not directly adjacent to the target word
contain useful information.
4) Sentence-mixture models use the
observation that there are many different kinds of sentences. By modeling each sentence type separately,
performance is improved.
5) Clustering is one of the most
useful language modeling techniques. Words can be grouped together into
clusters through various automatic techniques; then the probability of a
cluster can be predicted instead of the probability of the word. Clustering can be used to make smaller models
or better-performing ones. I will talk
briefly about the issues that arise when forming thousands of clusters from the
huge amounts of data used in language modeling (hundreds of millions of words).
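Two of the techniques in the list above, smoothing by interpolation and caching, can both be sketched as a weighted mixture of simpler models. The example below is a rough illustration under stated assumptions: the class and function names are hypothetical, the interpolation weights are fixed by hand (deleted interpolation would instead estimate them on held-out data), and it is not an implementation of Katz or Kneser-Ney smoothing.

```python
from collections import Counter

class InterpolatedBigram:
    """Bigram estimate mixed with unigram and uniform estimates (fixed weights)."""

    def __init__(self, sentences, lam_bigram=0.6, lam_unigram=0.3):
        # Hand-picked weights, an assumption made for illustration only.
        self.lam_bigram = lam_bigram
        self.lam_unigram = lam_unigram
        self.lam_uniform = 1.0 - lam_bigram - lam_unigram
        self.bigrams = Counter()
        self.contexts = Counter()
        self.unigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.contexts[prev] += 1
                self.unigrams[word] += 1
        self.total = sum(self.unigrams.values())
        # Every predictable word plus one slot standing in for all unknown words.
        self.vocab_size = len(self.unigrams) + 1

    def prob(self, word, prev):
        p_uni = self.unigrams[word] / self.total
        if self.contexts[prev]:
            p_bi = self.bigrams[(prev, word)] / self.contexts[prev]
        else:
            # Unseen context: fall back to the unigram estimate.
            p_bi = p_uni
        p_uniform = 1.0 / self.vocab_size
        return (self.lam_bigram * p_bi
                + self.lam_unigram * p_uni
                + self.lam_uniform * p_uniform)

def cached_prob(model, word, prev, history, cache_weight=0.1):
    """Mix the static model with a unigram model built from recently seen words."""
    if not history:
        return model.prob(word, prev)
    p_cache = history.count(word) / len(history)
    return (1.0 - cache_weight) * model.prob(word, prev) + cache_weight * p_cache

# Toy corpus, invented for illustration only.
corpus = ["we recognize speech", "they recognize speech", "we recognize words"]
model = InterpolatedBigram(corpus)

# "they words" never occurs in the corpus, yet it gets a nonzero probability.
print(model.prob("words", "they"))
# A recent history that mentions "words" raises its probability further.
print(cached_prob(model, "words", "they", ["words", "words", "speech"]))
```

The same mixing pattern applies whenever a specific but sparse estimate is combined with a more general one; only the component models and the way the weights are chosen differ.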
I will spend some time talking
about the applications of language modeling to areas including speech
recognition, spelling correction, and text entry in Chinese or Japanese, as well
as to non-language problems such as recommender systems.
I will briefly describe some
recent but more speculative language modeling techniques, including maximum
entropy models. Finally, I will also
talk about some practical aspects of language modeling. I will describe how freely available, off-the-shelf
tools can be used to easily build language models, where to get data to train a
language model, and how to use methods such as count cutoffs or relative-entropy
techniques to prune language models.
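Of the two pruning methods mentioned above, count cutoffs are the simpler to sketch. The example below assumes the convention that a cutoff of k discards every n-gram seen k or fewer times; the function name and the counts are made up for illustration. It shows only the filtering step, not the re-estimation of backoff probabilities that a real toolkit would perform afterwards, and it does not cover relative-entropy pruning.

```python
from collections import Counter

def apply_count_cutoff(ngram_counts, cutoff):
    """Keep only n-grams whose count is strictly greater than `cutoff`."""
    return Counter({ngram: count
                    for ngram, count in ngram_counts.items()
                    if count > cutoff})

# Hypothetical bigram counts, invented for illustration only.
counts = Counter({
    ("recognize", "speech"): 5,
    ("a", "nice"): 2,
    ("nice", "beach"): 1,
    ("wreck", "a"): 1,
})

pruned = apply_count_cutoff(counts, cutoff=1)
print(len(counts), "bigrams before pruning,", len(pruned), "after")
```

Because rare n-grams make up a large share of the distinct n-grams in a typical corpus, even a small cutoff can shrink a model considerably, at some cost in accuracy.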
Those who attend the tutorial
should walk away with a broad understanding of current language modeling
techniques, and the background needed either to build their own language
models or to apply some of these techniques to other fields.
The tutorial is somewhat
similar to a tutorial presented at AAAI 2002 (with Eugene Charniak), slides for
which are available as PowerPoint and in PostScript (6 slides/page or one
slide/page). Those slides give a
feeling for the content of the talk, but the ICML tutorial will be updated and
tailored to the machine learning community. Much of the material covered in the
presentation is also in the paper "A Bit of
Progress in Language Modeling."
About the Presenter
Joshua Goodman is a
Researcher at Microsoft Research. He has
worked at Dragon Systems on speech recognition.
He received his Ph.D. in Computer Science with a focus on Statistical
Natural Language Processing from