UWEE Tech Report Series

Automatic Learning of Language Model Structure


UWEETR-2004-0014

Author(s):
Kevin Duh, Katrin Kirchhoff

Keywords:
statistical language modeling, genetic algorithms, structure learning

Abstract

In spite of recent advances in statistical algorithms and increased availability of large text corpora, statistical language modeling remains a challenging task, in particular for morphologically rich languages. Recently, new approaches based on factored language models have been developed to address this problem. Whereas standard language models only condition on preceding words, these models provide principled ways of including additional conditioning variables other than the preceding words, such as morphological or syntactic features. However, the number of possible choices for model parameters creates a large space of models that cannot be searched exhaustively. This paper presents an entirely data-driven model selection procedure based on genetic search, which is shown to outperform both knowledge-based and random selection procedures on two different language modeling tasks (Arabic and Turkish).
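
As a rough illustration of the kind of genetic search the abstract describes, the following minimal sketch evolves bit strings that switch candidate conditioning factors on or off. It is not the authors' implementation: the factor names, the bit-string encoding, the search settings, and the `toy_fitness` function (a stand-in for a real criterion such as held-out perplexity) are all assumptions made for this example.

```python
import random

# Hypothetical candidate conditioning factors (assumed for illustration only).
FACTORS = ["word-1", "word-2", "stem-1", "stem-2", "tag-1", "tag-2"]

# Arbitrary "good" subset used by the toy fitness; a real system would instead
# train a language model for each genome and score it on held-out data.
_TARGET = (1, 0, 1, 0, 1, 0)


def toy_fitness(genome):
    """Toy score (higher is better): number of bits that agree with _TARGET."""
    return sum(1 for g, t in zip(genome, _TARGET) if g == t)


def genetic_search(fitness, n_bits, pop_size=20, generations=30,
                   p_mut=0.05, seed=0):
    """Simple genetic search over fixed-length bit strings.

    Uses elitism (the best genome always survives), size-3 tournament
    selection, one-point crossover, and independent bit-flip mutation.
    Returns the best genome found and the best fitness per generation.
    """
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_bits))
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0]]  # elitism: carry the current best over unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: best of three randomly drawn genomes.
            a = max(rng.sample(pop, 3), key=fitness)
            b = max(rng.sample(pop, 3), key=fitness)
            # One-point crossover.
            cut = rng.randint(1, n_bits - 1)
            child = a[:cut] + b[cut:]
            # Bit-flip mutation, applied independently to each position.
            child = tuple(bit ^ 1 if rng.random() < p_mut else bit
                          for bit in child)
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = genetic_search(toy_fitness, len(FACTORS))
    selected = [name for name, bit in zip(FACTORS, best) if bit]
    print("selected factors:", selected)
    print("best fitness per generation:", history)
```

Because the best genome is always carried over, the best fitness never decreases from one generation to the next. In a realistic setting, evaluating each genome means training and scoring a model, so that evaluation would dominate the cost of the search.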

This report is an extended version of the paper that appeared at COLING 2004.
