Universal translator

Another great example of collaboration and interdisciplinarity. Scientists at Carnegie Mellon are developing oodles of juicy new technologies that make the good old AltaVista Babelfish translation service look like an illiterate Klingon speaking Portuguese. Gone is the time when you had to write down a speech you wanted translated. Now you can do it all in real time! Well, not quite, but a couple of seconds of delay isn’t that bad…

Being a statistician, what I find really interesting here is the principles used to convert human speech into a format easily understandable by a computer. Though I don’t know the specifics of the Carnegie Mellon initiative, one common and efficient approach is the use of Hidden Markov Models (HMMs). For the uninitiated, a Markov model is a stochastic (random) process in either discrete or continuous time in which the state at a certain time depends only on the “recent” past. For example, one could model the weather of each day by conditioning on the weather of the previous day only. The necessary ingredients are initial probabilities (on the first day, what’s the probability that it rains or that it is sunny?) and transition probabilities (given that it rained yesterday, what is the probability that it will be sunny today?).

Now an HMM adds a layer on top of such a process. That is, you observe a certain behaviour that is regulated by an underlying, and unobservable, Markov model. To go on with the example, let’s say you’re on vacation and want to know the weather in your home town, but can only do so by calling your friend who lives there and asking him what he is currently doing. He might say he went outside or he stayed inside. Him being your friend, he won’t lie. If you know the probabilities of the observed process given the hidden process (say, the probability of going outside if it rains), along with the initial and transition probabilities, then you can work out the most plausible sequence of weather for the week you were away.
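To make the vacation example concrete, here is a minimal sketch in Python of how the most plausible weather sequence could be computed. It uses the Viterbi algorithm, which is the standard way to find the most likely hidden sequence in an HMM. All the probabilities below are numbers I made up for illustration, and the names (`INITIAL`, `TRANSITION`, `EMISSION`, `viterbi`) are hypothetical.

```python
# All probabilities are invented for illustration; none come from real data.
STATES = ("rainy", "sunny")

# Initial probabilities: the weather on the first day.
INITIAL = {"rainy": 0.4, "sunny": 0.6}

# Transition probabilities: today's weather given yesterday's.
TRANSITION = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.2, "sunny": 0.8},
}

# Observation probabilities: what the friend does given the weather.
EMISSION = {
    "rainy": {"inside": 0.9, "outside": 0.1},
    "sunny": {"inside": 0.3, "outside": 0.7},
}


def viterbi(observations):
    """Return (most probable weather sequence, its joint probability)."""
    if not observations:
        return [], 1.0
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (INITIAL[s] * EMISSION[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Pick the best previous state to arrive from.
            prob, path = max(
                ((best[p][0] * TRANSITION[p][s], best[p][1]) for p in STATES),
                key=lambda pair: pair[0],
            )
            step[s] = (prob * EMISSION[s][obs], path + [s])
        best = step
    prob, path = max(best.values(), key=lambda pair: pair[0])
    return path, prob


# One phone call per day over a (short) week away.
calls = ["outside", "outside", "inside", "inside", "outside"]
weather, probability = viterbi(calls)
print(weather, probability)
```

For long sequences the probabilities become tiny, so a real implementation would usually work with log-probabilities instead of multiplying raw values.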

So what does it all have to do with speech recognition? Well, speech is built from syllables. Some syllables are followed by specific syllables more often than by others, which is to say that speech is not completely random. The key modelling assumption is that the syllable you are about to say depends only on the syllable you just pronounced. That is a Markov model. However, the computer doesn’t know the syllables, let alone understand what you just said. But it can look at the acoustic features you produced. That is the observed process. Then all it needs to do is calculate the most likely sequence of syllables that could have produced the acoustic features. It is rather involved, but it works fairly well in the long run. Then, given the syllables, you need to work out the words, and once you have the words, you need to group them into phrases prior to translating them.
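As a rough illustration of "some syllables are more often followed by specific syllables", here is a small Python sketch that estimates transition probabilities by counting which syllable follows which in a set of example sequences. The toy corpus and the function name `estimate_transitions` are hypothetical, and I am not claiming this is how the Carnegie Mellon system is trained; it only shows where the transition probabilities of a Markov model could come from.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequences):
    """Estimate P(next syllable | current syllable) from relative frequencies."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count every adjacent (current, following) pair in the sequence.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }


# Toy, made-up syllable sequences (illustration only).
corpus = [["ba", "na", "na"], ["ba", "by"], ["na", "no"]]
probs = estimate_transitions(corpus)
print(probs["ba"])
```

With enough recorded speech, the same counting idea gives the transition table that the hidden layer of the HMM relies on; the acoustic features then play the role of the friend on the phone.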

One delivery method that has been developed is a pair of glasses on which the translation appears as text. A future development would be a simple earpiece that translates directly to speech with your favourite mechanical accent. Anyone want to listen to Shakespeare in its original Klingon?

October 28th, 2005 | General Science, Technology
