A Discriminative Segmental Speech Model and its Application to Hungarian Number Recognition

László Tóth, András Kocsor, Kornél Kovács

1  Generative vs. Discriminative Modeling

Automatic Speech Recognition is concerned with mapping a speech signal $A$ to a corresponding string of symbols $W$, where the signal is given as a series of frame-based measurements, while $W = w_1 w_2 \ldots w_n$ is a phonetic transcription. Stochastic recognizers map an incoming signal $A$ to the sequence $W$ with the highest probability, so the result of the recognition is
$$W^* = \arg\max_W P(W \mid A) = \arg\max_W \frac{P(A \mid W)\,P(W)}{P(A)}.$$
Notice that $A$ is taken from a continuous space, while the $W$s form a discrete distribution, so $P(W \mid A)$ clusters the observation space (by means of probabilities). Thus systems that model $P(W \mid A)$ directly are called discriminative, while those that model $P(A \mid W)$ are called generative, since $P(A \mid W)$ gives the likelihood of $A$ having been generated by $W$.
Current recognition systems employ the generative approach almost exclusively. The most notable exceptions are the HMM/ANN hybrid recognizers, where the emission probabilities are modeled using discriminatively trained ANNs. The efficiency of these systems suggests that it is worth experimenting with discriminative models. This paper describes our results with a segment-based recognizer in which the phonetic segments are modeled discriminatively. We report our results on phonetic classification and on utterance-level integration. As a distinctive feature, our phonetic classifier uses Kernel-LDA to transform the feature space, so we also give a short description of this method.

2  Segment-Based Recognition

The basis of segment-based recognition is that the joint distribution $P(W,A)$ can be decomposed if we assume that the identity of a particular phoneme $w_i$ is affected only by the underlying segment of the signal $A_i$ and its immediate neighbours. In our discriminative system the decomposition
$$P(W \mid A) = \prod_i P(w_i \mid A_{i-1}, A_i, A_{i+1})$$
is applied; that is, we train phoneme probabilities conditioned on the underlying segment and on the neighbouring ones.
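To make the decomposition concrete, here is a minimal sketch in Python; the `classifier` object and its `posterior` method are illustrative names we introduce for this example, not part of the system described:

```python
import math

def score_transcription(phonemes, segments, classifier):
    """Log-probability of a phoneme string under a fixed segmentation.

    phonemes:   [w_1, ..., w_n] candidate phoneme labels
    segments:   [A_1, ..., A_n] acoustic segments (feature vectors)
    classifier: object whose posterior(label, prev_seg, seg, next_seg)
                returns P(w_i | A_{i-1}, A_i, A_{i+1}); assumed nonzero
    """
    log_p = 0.0
    for i, (w, seg) in enumerate(zip(phonemes, segments)):
        prev_seg = segments[i - 1] if i > 0 else None
        next_seg = segments[i + 1] if i < len(segments) - 1 else None
        log_p += math.log(classifier.posterior(w, prev_seg, seg, next_seg))
    return log_p
```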
During recognition the correct phonetic segmentation is not known, and automatic segmentation is known to be far from straightforward. So the best we can do is to evaluate many possible segmentations. In a discriminative model this means
$$P(W \mid A) = \sum_S P(W, S \mid A) = \sum_S P(W \mid S, A)\,P(S \mid A) \approx \max_S P(W \mid S, A)\,P(S \mid A),$$
where the approximation turns the summation into a search problem.
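Taking the maximum over segmentations suggests a dynamic-programming pass over candidate segment boundaries. The sketch below is only an illustration of this search idea; the frame-level granularity, the `seg_score` callable, and the absence of pruning are assumptions of ours, not the exact search used in our recognizer:

```python
def best_segmentation(n_frames, seg_score, max_len=30):
    """Highest-scoring segmentation of frames [0, n_frames).

    seg_score(a, b): combined log-score of a segment spanning frames
                     [a, b) (phoneme posterior plus the P(S|A) term).
    max_len:         longest segment considered, in frames.
    Returns (best_log_score, list_of_segment_boundaries).
    """
    NEG_INF = float("-inf")
    best = [NEG_INF] * (n_frames + 1)   # best[t]: best score up to frame t
    back = [0] * (n_frames + 1)         # backpointer to previous boundary
    best[0] = 0.0
    for t in range(1, n_frames + 1):
        for a in range(max(0, t - max_len), t):
            s = best[a] + seg_score(a, t)
            if s > best[t]:
                best[t], back[t] = s, a
    # recover the boundaries by walking the backpointers
    bounds, t = [n_frames], n_frames
    while t > 0:
        t = back[t]
        bounds.append(t)
    return best[n_frames], bounds[::-1]
```

In the full recognizer the phoneme identities are of course searched jointly with the boundaries; the sketch isolates the boundary search only.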
The factor $P(W \mid A, S)$ will be approximated as
$$P(W \mid A, S) = \prod_i P(w_i \mid A_{i-1}, A_i, A_{i+1}).$$
The phoneme classifier obtained this way can be trained on a manually segmented corpus with many discriminative learners, with good results [2]. The problematic point is $P(S \mid A)$, which should assess the probabilities of the possible segmentations without knowledge of $W$. In practice, it normalizes the different segmentation paths to make them comparable.
Training $P(S \mid A)$ discriminatively would mean contrasting the probability of a segmentation with all other possible segmentations. Since the space of all possible segmentations is prohibitively large, we made two approximations. Firstly, we localized the segmentations, which means that a particular segment is contrasted only with those segments of the competing segmentations that span the same time interval. Secondly, these local comparisons are also approximated by introducing the "anti-segment". More exactly, we trained the probabilities $P(S_i \mid A_{i-1}, A_i, A_{i+1})$ on (quasi-)randomly cut pieces of the training database, the training samples being labeled either "segment" or "anti-segment". During recognition $P(S \mid A)$ is approximated from the probabilities of the involved segments being "anti-segments". For this we tried several aggregation strategies, which will be discussed in the paper.
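One plausible aggregation strategy, shown below purely as an illustration (the paper compares several), treats the segments as contributing independently, each factor being the posterior of the piece being a genuine segment:

```python
import math

def segmentation_score(anti_posteriors):
    """Approximate log P(S|A) from per-segment anti-segment posteriors.

    anti_posteriors: [P(anti | A_{i-1}, A_i, A_{i+1}) for each segment i]
    Assumes independent contributions; each factor is 1 - P(anti | ...),
    clamped away from zero to keep the logarithm finite.
    """
    eps = 1e-12
    return sum(math.log(max(1.0 - p, eps)) for p in anti_posteriors)
```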

3  Discriminative Phoneme Classification

One advantage of the segmental approach is that it allows the modeling of feature trajectories along segments. For this we used the simple technique of taking the feature averages over segment thirds. These three averages of 24 critical-band log-energies and of four formant-band energies were used as features for phoneme classification (the latter were found to help gross phonetic categorization). To these we also added the segment length, which we believe to be especially important for languages like Hungarian, where phonemic duration plays a discriminative role.
In addition, we needed features to tell "segments" from "anti-segments". For this we used the feature variances and the feature derivatives at the segment boundaries.
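A minimal sketch of this feature extraction follows; it assumes a (frames x bands) matrix of frame-level measurements and at least three frames per segment, and the function and variable names are our own illustrative choices:

```python
import numpy as np

def segment_features(frames):
    """Segmental features from a (n_frames, n_bands) matrix of
    frame-level measurements (e.g. critical-band log-energies).

    Returns the band averages over the three thirds of the segment,
    the segment length, the per-band variances, and rough feature
    derivatives at the two boundaries.
    """
    n = len(frames)                                  # assumes n >= 3
    thirds = np.array_split(frames, 3)               # split along time
    means = np.concatenate([t.mean(axis=0) for t in thirds])
    length = np.array([n])                           # duration feature
    variances = frames.var(axis=0)                   # "segmentness" cue
    d_left = frames[1] - frames[0]                   # boundary derivatives
    d_right = frames[-1] - frames[-2]
    return np.concatenate([means, length, variances, d_left, d_right])
```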
Both our simple segmental model and the more sophisticated ones reported in the literature (e.g. [1]) were found to be slightly or considerably better than HMMs. Our experiments also confirmed that discriminative models are more efficient than generative ones. There are, however, some special problems concerning discriminative models, for example context modeling. For this reason we attach great importance to feature space transformations which can find representative features. After examining the effectiveness of the linear transformation methods LDA, PCA and ICA when combined with several learners [2], we have now moved towards studying non-linear transformations like Kernel-LDA [3]. In this paper we discuss these methods and the results we reached with them on the phoneme level.
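For illustration, here is a minimal two-class sketch following the common kernel Fisher discriminant formulation of Kernel-LDA; our actual multi-class setup and regularization details may differ, so treat this only as a sketch under those assumptions:

```python
import numpy as np

def kernel_lda_2class(K, y, mu=1e-3):
    """Two-class Kernel-LDA (kernel Fisher discriminant) sketch.

    K:  (n, n) Gram matrix of the training points, K[i, j] = k(x_i, x_j)
    y:  labels in {0, 1}
    mu: ridge term regularizing the within-class scatter
    Returns alpha; a new point x projects to sum_i alpha[i] * k(x_i, x).
    """
    n = K.shape[0]
    classes = [np.where(y == c)[0] for c in (0, 1)]
    m = [K[:, idx].mean(axis=1) for idx in classes]  # class means in feature space
    N = mu * np.eye(n)                               # regularized within-class scatter
    for idx in classes:
        Kc = K[:, idx]
        H = np.eye(len(idx)) - np.full((len(idx), len(idx)), 1.0 / len(idx))
        N += Kc @ H @ Kc.T
    return np.linalg.solve(N, m[1] - m[0])           # Fisher direction coefficients
```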

4  Results and Conclusions

We evaluated our system on a small corpus, the training set consisting of 20 and the test set of 6 speakers, each pronouncing 52 Hungarian numbers. As a comparison, an HMM system was trained on the same corpus using monophone models (the corpus is too small to train triphones). We found that even our simplest segment models reached the results of the HMM on the phoneme level, and later we considerably outperformed it thanks to the feature transformations. As regards the "anti-segment" recognition, when training 5 gross phonetic classes plus an "anti-phoneme" we reached 94% correct classification, which we find satisfactory. Finally, on the word level we could only approach the results of the HMM, which means that our aggregation method still needs refinement. Thus we are now working on training $P(S \mid A)$ discriminatively on the word level.

References

[1]
Fukada, T., Sagisaka, Y. and Paliwal, K. K. Model Parameter Estimation for Mixture Density Polynomial Segment Models. In Proceedings of ICASSP'97, 1997.
[2]
Kocsor, A., Tóth, L., Kuba, A. Jr., Kovács, K., Jelasity, M., Gyimóthy, T. and Csirik, J. A Comparative Study of Several Feature Transformation and Learning Methods for Phoneme Classification. Submitted to the International Journal of Speech Technology.
[3]
Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K., Rätsch, G. and Smola, A. J. Input Space vs. Feature Space in Kernel-Based Methods. IEEE Transactions on Neural Networks, 1999 (in press).
