UWEETR-2004-0014

Abstract

In spite of recent advances in statistical algorithms and increased availability of large text corpora, statistical language modeling remains a challenging task, in particular for morphologically rich languages. Recently, new approaches based on factored language models have been developed to address this problem. Whereas standard language models condition only on preceding words, these models provide principled ways of including additional conditioning variables other than the preceding words, such as morphological or syntactic features. However, the number of possible choices for model parameters creates a large space of models that cannot be searched exhaustively. This paper presents an entirely data-driven model selection procedure based on genetic search, which is shown to outperform both knowledge-based and random selection procedures on two different language modeling tasks (Arabic and Turkish).
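
To make the idea of genetic search over a model space concrete, the following is a minimal sketch and not the procedure used in the report. It assumes a model is encoded as a bit string in which each bit switches one candidate conditioning variable on or off. The factor names in `FACTORS`, the function names, and all settings (population size, mutation rate, tournament selection) are hypothetical choices made for illustration. `toy_fitness` is only a stand-in: in a real setting the fitness would come from training the encoded model and scoring it on held-out data, for example by perplexity.

```python
import random

# Hypothetical candidate conditioning variables; the names are illustrative only.
FACTORS = ["word[t-1]", "word[t-2]", "stem[t-1]",
           "stem[t-2]", "morph[t-1]", "morph[t-2]"]


def toy_fitness(genome):
    """Stand-in fitness (assumption): counts agreement with an arbitrary target.

    A real fitness function would train the model described by `genome`
    and return a score such as negative held-out perplexity.
    """
    target = (1, 0, 1, 0, 1, 0)
    return sum(1 for g, t in zip(genome, target) if g == t)


def genetic_search(fitness, n_bits, pop_size=20, generations=30,
                   p_mut=0.05, seed=0):
    """Search bit strings of length n_bits for a high-fitness genome."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_bits))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        def pick():
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.sample(population, 2)
            return a if fitness(a) >= fitness(b) else b

        children = [best]  # elitism: the best genome always survives
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randint(1, n_bits - 1)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability p_mut.
            child = tuple(1 - g if rng.random() < p_mut else g for g in child)
            children.append(child)

        population = children
        best = max(population, key=fitness)

    return best


def selected_factors(genome):
    """Translate a genome back into the list of enabled conditioning variables."""
    return [name for name, bit in zip(FACTORS, genome) if bit]


if __name__ == "__main__":
    best = genetic_search(toy_fitness, n_bits=len(FACTORS))
    print(selected_factors(best), toy_fitness(best))
```

Because the best genome is carried over unchanged into each new generation, the best fitness never decreases from one generation to the next. The sketch encodes only which variables are conditioned on; any further model parameters would need additional genes.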