In some languages, such as Finnish or Hungarian, phone duration is an
important distinctive acoustic cue. The conventional HMM speech recognition
framework, however, is known to model duration information poorly.
In this paper, we compare different duration models within the framework
of HMM/ANN hybrids. The tests are performed with two different hybrid
models: the conventional one and the recently proposed "averaging hybrid".
Independently of the model configuration, we find that the usual exponential
duration model has no detectable advantage over using no duration
model at all. Similarly, applying the same fixed value for all state
transition probabilities, as is usual with HMM/ANN systems, is found
to have no influence on the performance. However, the practical trick
of imposing a minimum duration on the phones turns out to be very
useful. The key part of the paper is the introduction of the gamma
distribution duration model, which proves clearly superior to the
exponential one, yielding a 12-20% relative improvement in the word
error rate, thus justifying the use of sophisticated duration models
in speech recognition.
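
The contrast between the two duration models can be illustrated with a
minimal sketch. It assumes the standard textbook forms, not necessarily the
exact formulation used in the paper: a single HMM state with a fixed
self-loop probability implies a geometric (discrete exponential) duration
distribution whose most likely duration is always one frame, whereas a gamma
density can peak at a realistic phone length. The function names, the
moment-matching fit, and the sample durations below are all hypothetical and
chosen only for illustration.

```python
import math


def geometric_log_prob(d, self_loop):
    """Log-probability of staying exactly d frames in a state whose
    self-loop probability is `self_loop` (the implicit HMM duration model)."""
    return (d - 1) * math.log(self_loop) + math.log(1.0 - self_loop)


def gamma_log_pdf(d, shape, scale):
    """Log-density of a gamma distribution with the given shape and scale."""
    return ((shape - 1.0) * math.log(d) - d / scale
            - math.lgamma(shape) - shape * math.log(scale))


def fit_gamma_moments(durations):
    """Estimate (shape, scale) by matching the sample mean and variance.
    Moment matching is an assumption made here for simplicity."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((x - mean) ** 2 for x in durations) / n
    return mean * mean / var, var / mean


# Made-up phone durations in frames, for illustration only.
durations = [6, 7, 8, 8, 9, 10, 12]
mean = sum(durations) / len(durations)

shape, scale = fit_gamma_moments(durations)
self_loop = 1.0 - 1.0 / mean  # geometric model with the same mean duration

for d in (1, 4, 8, 16):
    print(d, geometric_log_prob(d, self_loop), gamma_log_pdf(d, shape, scale))
```

Under these assumptions, the geometric model scores a one-frame phone higher
than an eight-frame one even though the sample durations cluster around
eight frames, while the fitted gamma density does the opposite. In a decoder,
such a log-score would typically be added to the path score when a phone
segment ends; how the paper integrates it is not specified in this abstract.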