2000

Speech Interfaces for Games
Part 1: How Speech Recognition Works
by François Dominic Laramée

I'll let you in on a little secret: while nobody was looking, I sneaked out of the game industry in early 1998 and spent a year in a now-defunct "spoken dialogue interface" company's research unit. It's true, it's true; I'm not only a mediocre game designer and producer, I'm also a lousy AI freak. It's amazing how many different things you can fail at when you benefit from attention deficit disorder.

As a result, I have an idea or ten about what the game industry would gain by using speech as an input mechanism, in addition to mice, keyboards and $200 flight sticks. It is my contention that spoken interfaces may soon become the "new frontier" for a certain class of game developers, as super-powerful graphics hardware leaves gigahertz-spinning CPUs twiddling their virtual thumbs in boredom.

This three-part article series will therefore examine the basic techniques behind speech recognition, natural language understanding, automated dialogue strategies, and their applications to the games of the future. Nothing too fancy for sure; this is heavy-duty stuff, no game developer in her right mind will ever code her own speech recognizer, and I am not about to risk forcible defenestration by revealing my former employer's trade secrets.

Why Speech Recognition Is Hard

First, the obvious: speech is a complex audio signal, made up of a large number of component sound waves. Speech can easily be captured in wave form, transmitted and reproduced by common equipment; this is how the telephone has worked for a century.

However, once we move up the complexity scale and try to make a computer understand the message encoded in speech, the actual wave form is unreliable. Vastly different sounds can produce similar wave forms, while a subtle change in inflection can transform a phoneme's wave form into something completely alien. In fact, much of the speech signal is of no value to the recognition process. Worse still: any reasonably accurate mathematical representation of the entire signal would be far too large to manipulate in real time.

Therefore, a manageable number of discriminating features must somehow be extracted from the wave before recognition can take place. A common scheme involves "cepstral coefficients" ("cepstral" is a deliberately scrambled form of "spectral"); the recognizer collects 8,000 speech samples per second, slices them into short frames, and extracts a "feature vector" of at most a few dozen numbers from each frame, through a mathematical analysis process that is far beyond the scope of this article. From now on, when I mention "wave form" or "signal", I am actually talking about these collections of feature vectors.
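For the curious, here is a rough sketch of what such an extraction step might look like. It computes a plain "real cepstrum" (the inverse transform of the log magnitude spectrum) over short overlapping frames, which is a simpler relative of what production recognizers use. The frame length, the step between frames, the number of coefficients kept and every function name below are assumptions chosen for illustration, not values from any particular recognizer.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second, as in the text
FRAME_LEN = 200      # 25 ms analysis window (assumed)
HOP = 80             # 10 ms step between frames (assumed)
NUM_COEFFS = 12      # size of each feature vector (assumed)

def dft(values):
    """Naive discrete Fourier transform: slow, but dependency-free."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def real_cepstrum(frame, num_coeffs=NUM_COEFFS):
    """Return the first few real cepstral coefficients of one frame."""
    n = len(frame)
    # Taper the frame edges with a Hamming window.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # Log magnitude spectrum; the small constant avoids log(0).
    log_mag = [math.log(abs(c) + 1e-10) for c in dft(windowed)]
    # Inverse transform, keeping only the low-order coefficients.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(num_coeffs)]

def feature_vectors(samples):
    """Slice the signal into overlapping frames, one vector per frame."""
    return [real_cepstrum(samples[start:start + FRAME_LEN])
            for start in range(0, len(samples) - FRAME_LEN + 1, HOP)]

# A tenth of a second of a 440 Hz tone stands in for real speech.
tone = [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(800)]
vectors = feature_vectors(tone)
print(len(vectors), len(vectors[0]))   # 8 frames, 12 numbers each
```

The point to notice is the compression: 800 raw samples shrink to 8 vectors of 12 numbers, and those vectors are what the rest of the recognizer works with.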

Acoustic Pattern Recognition

Now that you have a compact, manageable and hopefully sufficient data set, how does the computer "understand" it? By comparison with words it already knows, using techniques from statistical pattern recognition.

A speech recognizer is equipped with two crucial data structures:

  • A database of typical wave forms for all of the phonemes (i.e., basic component sounds) of a language. Since many of these phonemes' pronunciation varies with context, the database usually contains a number of different wave forms, or "allophones", for each phoneme. Databases containing 200, 800 or even 10,000 allophones of the English language can be purchased on the open market.

  • A lexicon containing transcriptions of all the words known to the system into a phonetic language. There must be a "letter" in this phonetic alphabet for each allophone in the acoustic database. A good lexicon will contain several transcriptions for most words; the word "the", for example, can be pronounced "duh" or "dee", so it should have at least these two entries in the lexicon.

The allophone database works on a single phoneme at a time, so it must be expanded to recognize entire words and phrases. A data structure called a "trellis" is required; picture it as a massive Hidden Markov Model which consists of a combinatorial explosion of every possible allophone followed by every possible allophone, linked by weighted transitions, ad nauseam, until the trellis is long enough to recognize the longest word in the lexicon. (In practice, a complete trellis would require more memory than there is on Earth, and various techniques are used to simulate/reduce it.)

The recognizer then walks through each path in the trellis, using the Viterbi dynamic programming algorithm, and compares the models of its allophones with the features extracted from the input utterance. The paths receive scores according to how closely they match the data; the recognizer then reads the allophone sequences represented by the highest scores and looks them up in the lexicon, to see if they form legal words. If a lexicon entry scores much higher than all others, the recognizer has "understood" that word with a high level of confidence; otherwise, it may return several choices and hope that the rest of the application will be able to pick the most likely one on its own.
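To make the walk through the trellis a little more concrete, here is a toy version of the Viterbi search. It is only a sketch: the three allophones, the probabilities, and the idea of replacing feature vectors with three crude acoustic labels ("buzz", "low", "high") are all invented for illustration. A real recognizer scores continuous feature vectors against far larger models.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log score, best state path) for a sequence of observations."""
    # Each trellis column maps a state to (best log score, previous state).
    trellis = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]),
                    None)
                for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max((trellis[-1][p][0] + math.log(trans_p[p][s]), p)
                              for p in states)
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        trellis.append(column)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: trellis[-1][s][0])
    best_score = trellis[-1][last][0]
    path = [last]
    for column in reversed(trellis[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return best_score, path

def collapse(path):
    """Merge consecutive repeats: a long sound spans several frames."""
    return tuple(s for i, s in enumerate(path) if i == 0 or s != path[i - 1])

# Toy models (all numbers are made up).
states = ["dh", "ah", "iy"]
start_p = {"dh": 0.8, "ah": 0.1, "iy": 0.1}
trans_p = {"dh": {"dh": 0.4, "ah": 0.3, "iy": 0.3},
           "ah": {"dh": 0.1, "ah": 0.8, "iy": 0.1},
           "iy": {"dh": 0.1, "ah": 0.1, "iy": 0.8}}
emit_p = {"dh": {"buzz": 0.8, "low": 0.1, "high": 0.1},
          "ah": {"buzz": 0.1, "low": 0.8, "high": 0.1},
          "iy": {"buzz": 0.1, "low": 0.1, "high": 0.8}}

# A two-entry lexicon: both pronunciations of "the".
lexicon = {("dh", "ah"): "the", ("dh", "iy"): "the"}

score, path = viterbi(["buzz", "buzz", "high", "high"],
                      states, start_p, trans_p, emit_p)
word = lexicon.get(collapse(path))
print(path, word)   # ['dh', 'dh', 'iy', 'iy'] the
```

Working in log probabilities keeps the scores from underflowing on long utterances. The dynamic programming trick is that each column only needs the best score per state from the previous column, so the search never enumerates every path explicitly.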

Training Acoustic Models

What if the allophonic database in your recognizer does not match your voice or accent? For example, what happens when a database compiled in the United States is deployed in England (or India)?

This is a real problem, and the only way to solve it is to "train" the acoustic models. For single-user systems, the user is expected to read a pre-determined text to the recognizer. For multi-user systems (e.g., 411 phone directory assistance, which can't ask people to talk for 15 minutes before answering their questions), a corpus of utterances spoken by many people is compiled and transcribed by hand; it is then fed to the recognizer and the weights on the transitions in its HMM are adjusted to nudge the results towards an "average" model capable of identifying as many of the "correct" values as possible. This is a rather ad hoc solution, and it quickly reaches the point of diminishing returns after a few thousand utterances.

Language Models

When recognizing complete sentences (as opposed to isolated words), a system can take advantage of the fact that language has structure.

For example, suppose that the acoustic recognizer believes that the current word is either "ours" or "hours". If you know that the previous word was "two", you can safely assume that the correct choice is "hours", because the sequence "two hours" makes sense, while "two ours" does not.

This is the theory behind statistical language models. The "trigram" model is very popular: given two previous words, a trigram gives probabilities that the next word will be X or Y, independent of the actual signal! Therefore, the system knows what to "expect", and may spend more time looking for probable word sequences than on pure hodge-podge. Building a trigram model is very easy: feed it a chunk of text, and it will count the frequencies of all three-word sequences!
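The counting really is that simple, as the following sketch suggests. The function names and the one-line "corpus" are made up for illustration; a usable model would be trained on millions of words.

```python
from collections import Counter, defaultdict

def train_trigrams(text):
    """Count, for each pair of words, which words follow it and how often."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def next_word_probability(counts, w1, w2, candidate):
    """Estimate P(candidate | w1 w2) from raw trigram frequencies."""
    following = counts.get((w1, w2), Counter())
    total = sum(following.values())
    return following[candidate] / total if total else 0.0

corpus = "i waited two hours then i waited two days then i waited two hours"
model = train_trigrams(corpus)

p_hours = next_word_probability(model, "waited", "two", "hours")
p_ours = next_word_probability(model, "waited", "two", "ours")
print(p_hours, p_ours)   # "hours" gets 2 of the 3 counts, "ours" gets none
```

Note that any sequence absent from the training text gets a probability of exactly zero here, which is too harsh for real use; dealing with unseen sequences is one of the refinements the reference mentioned below covers.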

See the reference section at the end of the article for more details on language models and their refinements.

Limitations of Speech Recognition

For all of the effort that dozens of PhDs have been putting into their work for years, speech recognition is nowhere near Star Trek yet. Among the unresolved issues:

  • Plain old speech recognizers are dumb. Even those smart enough to recognize complete sentences and equipped with language models will only spit out collections of words. It's up to someone or something else to make sense of them. (This is why most, if not all, speech input systems also include a natural language understanding unit; I will describe this component in detail next month.)

  • Speech recognition works best in quiet, controlled environments. Trying to make it work in Quake III noise levels is not very effective.

  • The larger the vocabulary, the easier it is to confuse a recognizer. If the vocabulary contains true homonyms, the system is in trouble.

  • Speech recognition is a processor hog; it can easily eat up the equivalent of a 300 MHz Pentium II, leaving chump change for the rest of the application.

  • It is a lot easier to differentiate between long words; unfortunately, most common words are short.

  • For some people, speech recognition just doesn't work, and not necessarily because of an unusual accent either. For others, it works almost too well; they could be speaking through a kazoo in the middle of Mardi Gras, and the recognizer would return flawless results. No one knows why; the phenomenon has been whimsically called "sheep and goats", after the Biblical good guys and bad guys.

  • If the input equipment has trouble handling high frequencies, the recognizer will produce less accurate results, on average, for female speakers. This is a significant problem with telephone-based systems.

  • Even at the best of times, a speech recognizer will constantly skip words, hear words the user hasn't said, or misunderstand. In perfect circumstances, word accuracy rates can reach a seemingly impressive 95%... But statistically, if errors strike each word independently, 95% means that an 8-10 word sentence will contain at least one error roughly 35 to 40 percent of the time, and that any sentence of 14 words or more will contain one more often than not.
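For the skeptics, the arithmetic behind that last point can be checked in a few lines. This sketch assumes, simplistically, that every word is recognized correctly with the same probability and that errors are independent of each other; real recognizers tend to make errors in clusters, so treat the numbers as a rough guide. The function names are mine.

```python
import math

def sentence_error_rate(word_accuracy, num_words):
    """Chance that a sentence of num_words contains at least one error."""
    return 1.0 - word_accuracy ** num_words

def words_for_even_odds(word_accuracy):
    """Shortest sentence for which an error is more likely than not."""
    return math.floor(math.log(0.5) / math.log(word_accuracy)) + 1

for length in (8, 10, 14):
    print(length, round(sentence_error_rate(0.95, length), 3))

print(words_for_even_odds(0.95))
```

At 95% word accuracy, this gives an error in about 34% of 8-word sentences and about 40% of 10-word sentences, and the odds tip past one half at 14 words.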

Games and Speech Recognition

Given all of this, three things should be clear:

  • Since speech recognition does not work well for everyone in all circumstances, no game will be able to get by with only a spoken interface.

  • The first games equipped with significant speech interfaces will be "quiet" ones, like trivia or adventure, rather than noisy action fare.

  • Because of its performance limitations, simple speech recognition is not enough for applications which require long inputs. Single-word commands are OK; detailed orders are not.

Here are the only games I can think of which might benefit from being hooked up to a no-frills recognizer:

  • A trivia game. Most players enjoy these games in a quiet environment. It is also possible to build a custom lexicon for each question, containing only a few likely answers; this reduces the risk of a "false positive" match.

  • An arcade game where the interface can be limited to a few single-word commands. Arkanoid could be played with two words: "Left" and "Right"!
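Here is a sketch of how a game might exploit such a tiny vocabulary. It assumes a hypothetical recognizer that hands back a list of (text, score) guesses with scores between 0 and 1; the function name, the score scale and the 0.2 margin are all invented for illustration. A real engine would more likely restrict the search itself to the small lexicon rather than filter the results afterwards, but the effect on the game is similar.

```python
def pick_answer(hypotheses, allowed_answers, min_margin=0.2):
    """Return the winning answer, or None if the result is unusable.

    hypotheses: (text, score) pairs from a hypothetical recognizer.
    allowed_answers: the small lexicon for the current question.
    """
    candidates = sorted(((score, text) for text, score in hypotheses
                         if text in allowed_answers),
                        reverse=True)
    if not candidates:
        return None   # nothing the game knows about was heard
    if len(candidates) > 1 and candidates[0][0] - candidates[1][0] < min_margin:
        return None   # too close to call: ask the player to repeat
    return candidates[0][1]

heard = [("paris", 0.55), ("ferris", 0.50), ("harris", 0.30)]

# With a lexicon of plausible answers, the sound-alikes drop out.
print(pick_answer(heard, {"paris", "london", "rome"}))     # paris

# With all three sound-alikes in the vocabulary, nobody wins clearly.
print(pick_answer(heard, {"paris", "ferris", "harris"}))   # None
```

The same function would serve the arcade case: with only "left" and "right" in the lexicon, almost any stray noise is simply ignored.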

Speech Generation

Finally, a few words about automated speech generation. While the commercial tools tend to be very easy to use (e.g., one function call, passing a string of text and receiving a WAV file in return), speech quality is questionable at best. For games, this is rarely acceptable; unless you want a robot-like voice, you should have an actor record the computer character(s)' most common responses, and use word-stitching for the rest.

Further reading

If you find the gross oversimplifications, shortcuts and warp-speed half-truths in this article revolting, I suggest you look up Frederick Jelinek's book, Statistical Methods for Speech Recognition, published by MIT Press in 1997.

Next month, I will discuss how to extract meaning from a sequence of words, how this dramatically extends spoken interfaces' usefulness, and why it is more difficult when dealing with speech instead of written data. See you there!

Bio

François-Dominic Laramée is a freelance interactive game designer, developer and producer.  He has been involved in one form or another of the game industry, whether PC, console, online, set-top box or even play-by-mail, for the past decade, including more than 8 years of experience as Head of Studio, Game Designer, Software Engineer, Producer and Quality Assurance Manager in the interactive entertainment and spoken dialogue interfaces industries.  Learn more on his website at: http://pages.infinit.net/idjy/ 

GIGnews is a publication of GIGnews.com, Inc.
"Get In the Game" is a registered trademark used with permission.

© 1999-2005 GIGnews.com, Inc.