Sunday, September 27, 2009

N-gram thoughts

Suppose we have an arbitrary natural language, and we represent its lexicon as a set of words V. An n-gram is then a sequence of n (not necessarily distinct) elements of V. An n-gram distribution is associated with a sequence of n-1 words and, for each word w in V, gives the probability that the nth word in the sequence is w. An n-gram model is a set of n-gram distributions, one for every possible sequence of n-1 words.
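
To make the definitions concrete, here's a minimal sketch of a trigram model (n = 3) as a mapping from each 2-word context to an n-gram distribution over V. The words and probabilities are made up purely for illustration:

```python
# Hypothetical toy trigram model (n = 3). Each key is a sequence of n-1 = 2 words;
# each value is an n-gram distribution: a map from words w in V to
# P(next word = w | context).
toy_model = {
    ("the", "cat"): {"sat": 0.6, "slept": 0.3, "ran": 0.1},
    ("cat", "sat"): {"on": 0.8, "quietly": 0.2},
}

def predict(model, context):
    """Return the most probable next word given an (n-1)-word context."""
    dist = model.get(tuple(context), {})
    return max(dist, key=dist.get) if dist else None

print(predict(toy_model, ["the", "cat"]))  # -> "sat"
```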

I use "empirical n-gram distribution" to refer to the n-gram distribution that occurs in the wild, which one could only determine precisely from an infinite corpus. Now suppose that there are no complexity or data sparsity issues, so that we can make n arbitrarily large and essentially have access to the empirical distributions for that n. If we used the resulting model for text prediction, how good would it be? As good as a human text predictor? Worse? Better? Why? (In other words, what's the upper bound on the performance of n-gram language models, on a scale where human performance is 1?)
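
For concreteness, here's a quick sketch of the estimation side (the function and variable names are my own, not anything standard): relative-frequency counts over a finite corpus, which approach the empirical distributions only in the infinite-corpus limit assumed above.

```python
from collections import Counter, defaultdict

def estimate_ngram_model(tokens, n):
    """Relative-frequency estimate of an n-gram model from a finite corpus.
    Only as the corpus grows without bound do these estimates approach the
    empirical n-gram distributions discussed above."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])     # the preceding n-1 words
        counts[context][tokens[i + n - 1]] += 1  # the nth word
    return {
        ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for ctx, ctr in counts.items()
    }

# Toy usage: a bigram model (n = 2) from a six-word "corpus".
model = estimate_ngram_model("the cat sat on the mat".split(), 2)
print(model[("the",)])  # -> {'cat': 0.5, 'mat': 0.5}
```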

My first impulse: >1 (better than a human). The amount of information contained in such a model is too vast for a human to compete with. As a document grows, so does the amount of context available to the model, and its predictions will asymptotically approach optimality.

My second impulse: <1 (worse than a human). Producing natural language is partly a planning problem: earlier text is partly determined by what the writer plans to say later, and that plan is invisible to an n-gram model. A human text predictor may have access to this sort of knowledge, or to other psychological data that will influence their predictions (e.g. prior knowledge of the writer). This will give humans the edge in any practical situation.

Thoughts?