1 Generative vs. Discriminative Modeling
Automatic Speech Recognition is concerned with mapping speech signals
A to a corresponding string of symbols W, where the signal is given
by a series of frame-based measurements, while W = w_1 w_2 ... w_n
is a phonetic transcription. Stochastic recognizers map an incoming
signal A to the sequence W with the highest probability, so the result
of the recognition is
W^* = \arg\max_W P(W|A) = \arg\max_W \frac{P(A|W)\,P(W)}{P(A)}.
Notice that A is taken from a continuous space, while the Ws form
a discrete distribution, so P(W|A) clusters the observation space (by
means of probabilities). Thus systems that model P(W|A) directly are
called discriminative, while the ones that model P(A|W) are called
generative, since P(A|W) gives the likelihood of A having been
generated by W.
Current recognition systems employ the generative approach
almost exclusively. The most notable exceptions are the
HMM/ANN
hybrid recognizers where the emission probabilities are modeled using
discriminatively trained
ANNs. The efficiency of these systems suggests
that it is worth experimenting with discriminative models. This paper
describes our results with a segment-based recognizer, in which the
phonetic segments are modeled discriminatively. We report our results
on phonetic classification and
on utterance-level integration.
As a specialty, our phonetic classifier uses Kernel-LDA to transform
the feature space, so we also give a short description of this technique.
2 Segment-Based Recognition
The basis of segment-based recognition is that the joint distribution
P(W,A) can be decomposed if we assume that the identity of a particular
phoneme w_i is affected only by the underlying segment A_i of the
signal. In our discriminative system the decomposition
P(W|A) = \prod_i P(w_i \mid A_{i-1}, A_i, A_{i+1})
is applied; that is, we train phoneme probabilities conditioned on
the underlying segment and on the neighbouring ones.
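In the log domain this decomposition is simply a sum of per-segment classifier scores. A minimal sketch, in which the segment representation and the constant stand-in classifier are hypothetical placeholders for the real trained model:

```python
import math

def utterance_log_prob(segments, classifier):
    """Sum of log P(w_i | A_{i-1}, A_i, A_{i+1}) over the segments,
    with None used as padding for the missing neighbours at the edges.
    `segments` is a list of (phoneme, segment_data) pairs."""
    total = 0.0
    for i, (w, seg) in enumerate(segments):
        prev_seg = segments[i - 1][1] if i > 0 else None
        next_seg = segments[i + 1][1] if i + 1 < len(segments) else None
        total += math.log(classifier(w, prev_seg, seg, next_seg))
    return total

# Stand-in classifier that ignores its context arguments; a real one
# would be a discriminatively trained model evaluated on three segments.
const_classifier = lambda w, prev, cur, nxt: 0.5
print(utterance_log_prob([("a", 1), ("b", 2)], const_classifier))  # 2*log(0.5)
```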
During recognition the correct phonetic segmentation is not known
and automatic segmentation is known to be far from straightforward.
So the best we can do is to evaluate many possible
segmentations. In a discriminative model this means
P(W|A) = \sum_S P(W,S|A) = \sum_S P(W|S,A)\,P(S|A) \approx \max_S P(W|S,A)\,P(S|A),
where the approximation turns the summation into a search problem.
The factor P(W|A,S) will be approximated as
P(W|A,S) = \prod_i P(w_i \mid A_{i-1}, A_i, A_{i+1}).
The phoneme classifier resulting this way can be trained on a manually
segmented corpus with many discriminative learners, with good results [2].
The problematic point is P(S|A), which should assess the probabilities
of the possible segmentations without knowledge about W. In practice,
it normalizes the different segmentation paths to make them comparable.
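The max over S can be organised as a dynamic-programming search over segment end times. The sketch below assumes a scoring function local_score(start, end) returning the combined log score (phoneme term plus segmentation term) of the best phoneme on that span; this function, the frame count and the maximum segment length are illustrative assumptions, not our actual decoder:

```python
def best_segmentation(n_frames, local_score, max_len=10):
    """Viterbi-style search: best[t] holds the best score of any
    segmentation of frames [0, t), computed as
    max over start s of best[s] + local_score(s, t)."""
    NEG_INF = float("-inf")
    best = [NEG_INF] * (n_frames + 1)
    back = [0] * (n_frames + 1)
    best[0] = 0.0
    for t in range(1, n_frames + 1):
        for s in range(max(0, t - max_len), t):
            cand = best[s] + local_score(s, t)
            if cand > best[t]:
                best[t], back[t] = cand, s
    # Recover the segment boundaries by walking the back-pointers.
    bounds, t = [], n_frames
    while t > 0:
        bounds.append((back[t], t))
        t = back[t]
    return best[n_frames], bounds[::-1]

# Toy score that favours segments of exactly 3 frames (purely illustrative).
score, segs = best_segmentation(9, lambda s, t: -abs((t - s) - 3))
print(segs)  # [(0, 3), (3, 6), (6, 9)]
```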
Training P(S|A) discriminatively would mean contrasting the probability
of a segmentation with all other possible segmentations. Since the space
of all possible segmentations is prohibitively large, we made two
approximations. Firstly, we localized the segmentations, which means
that a particular segment is contrasted only with those segments of the
competing segmentations which span the same time interval. Secondly,
these local comparisons are also approximated by introducing the
"anti-segment".
More exactly, we trained the probabilities
P(S_i \mid A_{i-1}, A_i, A_{i+1})
on (quasi-)randomly cut pieces of the training database,
the training samples being labeled either "segment" or "anti-segment".
During recognition P(S|A) is approximated from the probabilities of the
involved segments being "anti-segments". For this we tried several
aggregation strategies, which will be discussed in the paper.
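One natural baseline aggregation is to score a segmentation by the product of its segments' "segment" (i.e. one minus "anti-segment") probabilities, optionally length-normalised so that paths with different numbers of segments remain comparable. This particular rule is an illustrative assumption, not necessarily one of the strategies the paper evaluates:

```python
import math

def segmentation_log_score(anti_probs, normalize=True):
    """Aggregate per-segment anti-segment probabilities into an
    approximate log P(S|A): sum of log(1 - P(anti)) over segments,
    divided by the segment count when normalize=True (a geometric
    mean), so segmentations of different lengths stay comparable."""
    logs = [math.log(1.0 - p) for p in anti_probs]
    total = sum(logs)
    return total / len(logs) if normalize else total

# A path whose segments all look "segment-like" scores close to 0.
print(segmentation_log_score([0.1, 0.2, 0.1]))
```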
3 Discriminative Phoneme Classification
One advantage of the segmental approach is that it allows the modeling
of feature trajectories along segments. For this we used the simple
technique of taking the feature averages over segment thirds. These
three averages of 24 critical-band log-energies and of four formant-band
energies were used as features for phoneme classification (the latter
were found to help gross phonetic categorization). To these we also
added segment length, which we think is especially important for
languages like Hungarian, where phonemic duration plays a discriminative
role.
Besides, we needed features to tell "segments" from "anti-segments".
For this we used the feature variances and the feature derivatives
at the boundaries.
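The segmental features described above can be sketched with NumPy. The frame-matrix layout and the simple first-difference derivative estimate at the boundaries are assumptions made for illustration:

```python
import numpy as np

def segment_features(frames):
    """frames: (n_frames, n_coeffs) array of e.g. 24 critical-band
    log-energies for one segment. Returns the feature averages over
    the three thirds of the segment, the segment length, the
    per-coefficient variance, and first-difference derivative
    estimates at the two boundaries."""
    n = len(frames)
    thirds = [frames[: n // 3], frames[n // 3 : 2 * n // 3], frames[2 * n // 3 :]]
    means = [t.mean(axis=0) for t in thirds]
    variance = frames.var(axis=0)
    d_left = frames[1] - frames[0]     # derivative estimate at the left boundary
    d_right = frames[-1] - frames[-2]  # ... and at the right boundary
    return np.concatenate(means + [np.array([n]), variance, d_left, d_right])

feats = segment_features(np.random.rand(12, 24))
print(feats.shape)  # 3*24 + 1 + 24 + 24 + 24 = (145,)
```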
Both our simple segmental model and the more sophisticated ones reported
in the literature (e.g. [1]) were found slightly or considerably better
than HMMs. Our experiments also justified that discriminative models are
more efficient than generative ones. There are, however, some special
problems concerning discriminative models, for example context modeling.
For this reason we attach great importance to feature space transformations which
can find representative features. After examining the effectiveness
of the linear transformation methods LDA, PCA and ICA when combined
with several learners [2], we have now moved towards studying non-linear
transformations like Kernel-LDA [3]. In this paper we discuss these
methods and the results we reached with them on the phoneme level.
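As a point of reference for the transformation step, the linear LDA case can be sketched with scikit-learn; the synthetic data and class counts below are stand-ins for our segmental feature vectors, and Kernel-LDA [3] replaces this linear projection with one computed in a kernel-induced feature space:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for segmental feature vectors with phoneme labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 5, size=300)   # 5 hypothetical phoneme classes
X[y == 0] += 2.0                   # give one class some separation

# LDA used as a feature-space transformation before the classifier:
# it projects onto at most (n_classes - 1) discriminative directions.
lda = LinearDiscriminantAnalysis(n_components=4)
X_t = lda.fit_transform(X, y)
print(X_t.shape)  # (300, 4)
```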
4 Results and Conclusions
We evaluated our system on a small corpus: the training set consists of
20 talkers and the test set of 6 talkers, each pronouncing 52 Hungarian
numbers. As a comparison, an
HMM system was trained on the same corpus using monophone models (the corpus is
too small to train triphones). We found that even our simplest segment models
reached the results of the
HMM on the phoneme level, and later we considerably outperformed it
thanks to the feature transformations. As regards the "anti-segment"
recognition, when training 5 gross phonetic classes plus an
"anti-phoneme" we reached 94% correct classification, which we find
satisfying. Finally, on the word level we could only approach the
results of the HMM, which means that our aggregation method still needs
refinement. Thus we are now working on training P(S|A) discriminatively
on the word level.
References

[1] Fukada, T., Sagisaka, Y. and Kuldip, K. P. Model Parameter
Estimation For Mixture Density Polynomial Segment Models. In Proceedings
of ICASSP'97, 1997.

[2] Kocsor, A., Tóth, L., Kuba, A. Jr., Kovács, K., Jelasity, M.,
Gyimóthy, T. and Csirik, J. A Comparative Study of Several Feature
Transformation and Learning Methods for Phoneme Classification.
Submitted to International Journal of Speech and Technology.

[3] Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K.,
Rätsch, G. and Smola, A. J. Input Space vs. Feature Space in
Kernel-Based Methods. IEEE Transactions on Neural Networks, 1999
(in press).