The State of the Art in Language Modeling

Joshua Goodman, Microsoft Research

Summary: This tutorial covers the state of the art in language modeling. Because of size and data issues, language modeling is an especially challenging subfield of machine learning.  The bulk of the tutorial will describe current techniques in language modeling, including techniques like word clustering and smoothing (regularization) that are useful in many areas besides language modeling, and more language-model specific techniques such as high order n-grams and sentence mixture models.  Finally, the talk will describe applications of language modeling in more detail, including applications outside of language, as well as available toolkits and corpora.

Tutorial Description and Outline

This tutorial will cover the state of the art in language modeling. Language models give the probability of word sequences, e.g. "recognize speech" is much more probable than "wreck a nice beach."  Language modeling is a very challenging problem for most machine learning techniques because, instead of predicting the probability of one or two outcomes, we need a probability distribution over words -- i.e. tens of thousands of outcomes.

Language models are useful in a large number of areas, including speech recognition, handwriting recognition, machine translation, information retrieval, context-sensitive spelling correction, and text entry for Chinese and Japanese or on small input devices.  Many language modeling techniques can be applied to other areas or to modeling any discrete sequence.  This tutorial should be accessible to anyone interested in machine learning.

The most basic language models -- n-gram models -- essentially just count occurrences of words in training data.  I will describe five relatively simple improvements over this baseline: smoothing, caching, skipping, sentence-mixture models, and clustering.  I will talk a bit about the applications of language modeling, including to areas other than language, and then I will quickly describe other recent promising work, and available tools and resources.
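To make the counting baseline concrete, here is a minimal sketch of an unsmoothed bigram model in Python. The toy corpus, the function names, and the `<s>` / `</s>` boundary markers are illustrative assumptions, not part of the tutorial materials.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count left contexts and bigrams over tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Assumed boundary markers so the first and last words are modeled too.
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def bigram_prob(prev, word, context_counts, bigram_counts):
    """Maximum-likelihood estimate P(word | prev) = c(prev, word) / c(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

# Hypothetical toy corpus; real models are trained on far more text.
toy_corpus = [
    ["recognize", "speech"],
    ["recognize", "speech"],
    ["recognize", "words"],
]
context_counts, bigram_counts = train_bigram_counts(toy_corpus)
p_seen = bigram_prob("recognize", "speech", context_counts, bigram_counts)
p_unseen = bigram_prob("recognize", "beach", context_counts, bigram_counts)
```

Here `p_seen` is 2/3, while `p_unseen` is exactly zero because "recognize beach" never occurs in the toy data. That zero is the data-sparsity problem that smoothing is meant to address.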

I begin by describing conventional language modeling techniques.

1) Smoothing (also called regularization) addresses the problem of data sparsity: there is rarely enough data to accurately estimate the parameters of a language model.  Smoothing gives a way to combine less specific but more reliable information with more specific but noisier data.  I will describe two classic techniques -- deleted interpolation and Katz (or Good-Turing) smoothing -- and one recent technique, Modified Kneser-Ney smoothing, which is the best-performing technique known.

2) Caching is a widely used technique that uses the observation that recently observed words are likely to occur again.  Models from recently observed data can be combined with more general models to improve performance.

3) Skipping models use the observation that even words that are not directly adjacent to the target word contain useful information.

4) Sentence-mixture models use the observation that there are many different kinds of sentences.  By modeling each sentence type separately, performance is improved.

5) Clustering is one of the most useful language modeling techniques. Words can be grouped together into clusters through various automatic techniques; then the probability of a cluster can be predicted instead of the probability of the word.  Clustering can be used to make smaller models or better-performing ones.  I will talk briefly about the issues that arise when forming thousands of clusters from the huge amounts of data used in language modeling (hundreds of millions of words).
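As a rough illustration of items 1 and 2 above, the sketch below combines a bigram estimate with a unigram estimate by simple linear interpolation, then mixes in a unigram cache of recent words. This is a simplified stand-in: the fixed weights `lam` and `mu`, the toy text, and the function names are assumptions made for the example. Deleted interpolation would tune such weights on held-out data, and Katz and Kneser-Ney smoothing use different formulas that are not shown here.

```python
from collections import Counter

# Toy training text; a real model would use millions of words.
tokens = "the cat sat on the mat the cat ate".split()
unigrams = Counter(tokens)                    # c(w)
contexts = Counter(tokens[:-1])               # c(prev), counted as a left context
bigrams = Counter(zip(tokens, tokens[1:]))    # c(prev, w)
total = len(tokens)

def interpolated_prob(prev, word, lam=0.7):
    """lam * P_ML(word | prev) + (1 - lam) * P_ML(word); lam is an assumed weight."""
    p_unigram = unigrams[word] / total
    if contexts[prev] == 0:
        # Unseen context: fall back entirely to the less specific estimate.
        return p_unigram
    p_bigram = bigrams[(prev, word)] / contexts[prev]
    return lam * p_bigram + (1 - lam) * p_unigram

def cached_prob(prev, word, history, mu=0.1):
    """Mix the static model with a unigram cache built from recent words."""
    p_static = interpolated_prob(prev, word)
    if not history:
        return p_static
    p_cache = history.count(word) / len(history)
    return mu * p_cache + (1 - mu) * p_static

p_unseen = interpolated_prob("the", "sat")   # bigram "the sat" never occurs
p_static = interpolated_prob("the", "mat")
p_cached = cached_prob("the", "mat", history=["mat", "mat", "on"])
```

The unseen bigram "the sat" now receives a small nonzero probability from the unigram term, and "mat" becomes more probable once it has appeared in the recent history.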

I will spend some time talking about the applications of language modeling, to areas including speech recognition, spelling correction, entering data in Chinese or Japanese, as well as to non-language problems, such as recommender systems.

I will briefly describe some recent but more speculative language modeling techniques, including maximum entropy models.  Finally, I will also talk about some practical aspects of language modeling.  I will describe how freely available, off-the-shelf tools can be used to easily build language models, where to get data to train a language model, and how to use methods such as count cutoffs or relative-entropy techniques to prune language models.
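Count cutoffs are the simpler of the two pruning methods mentioned above, and can be sketched in a few lines of Python. The function name and the convention that a cutoff of k discards n-grams seen k times or fewer are assumptions for illustration. Relative-entropy pruning is not shown.

```python
from collections import Counter

def apply_count_cutoff(ngram_counts, cutoff=1):
    """Keep only n-grams whose count exceeds `cutoff` (assumed convention)."""
    return Counter({ng: c for ng, c in ngram_counts.items() if c > cutoff})

# Hypothetical toy text; bigram counts are taken over adjacent word pairs.
tokens = "the cat sat on the mat the cat ate".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
pruned_counts = apply_count_cutoff(bigram_counts, cutoff=1)
```

With a cutoff of 1, every bigram seen only once is dropped and only ("the", "cat") remains. A full toolkit would presumably then estimate the model from the surviving counts, with the discarded n-grams handled by backing off to lower-order estimates.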

Those who attend the tutorial should walk away with a broad understanding of current language modeling techniques, and the background needed to either build their own language models, or to apply some of these techniques to other fields.

The tutorial is somewhat similar to a tutorial presented at AAAI 2002 (with Eugene Charniak), slides for which are available in PowerPoint and in PostScript (six slides per page or one slide per page).  Those slides give a feeling for the content of the talk, but the ICML tutorial will be updated and tailored to the machine learning community.  Much of the material covered in the presentation is also in the paper "A Bit of Progress in Language Modeling."

About the Presenter

Joshua Goodman is a Researcher at Microsoft Research.  He previously worked at Dragon Systems on speech recognition.  He received his Ph.D. in Computer Science, with a focus on statistical natural language processing, from Harvard University in 1998.  After graduation, he began work in the speech group of Microsoft Research, where he continued his research in language modeling.  He has presented tutorials on language modeling at NA-ACL (2000), AMTA, and AAAI, and his language modeling results in "A Bit of Progress in Language Modeling" are perhaps the best reported.  Currently, he works on spam filtering in the Machine Learning and Applied Statistics group of Microsoft Research.