
2000
Speech Interfaces for Games
Part 1: How Speech Recognition Works
by François-Dominic Laramée
I'll let you in on a little
secret: while nobody was looking, I sneaked out of the game industry in
early 1998 and spent a year in a now-defunct "spoken dialogue
interface" company's research unit. It's true, it's true; I'm not
only a mediocre game designer and producer, I'm also a lousy AI freak.
It's amazing how many different things you can fail at when you benefit
from attention deficit disorder.
As a result, I have an idea
or ten about what the game industry would gain by using speech as an
input mechanism, in addition to mice, keyboards and $200 flight sticks.
It is my contention that spoken interfaces may soon become the "new
frontier" for a certain class of game developers, as super-powerful
graphics hardware leaves gigahertz-spinning CPUs twiddling their
virtual thumbs in boredom.
This three-part article
series will therefore examine the basic techniques behind speech
recognition, natural language understanding, automated dialogue
strategies, and their applications to the games of the future. Nothing
too fancy for sure; this is heavy-duty stuff, no game developer in her
right mind will ever code her own speech recognizer, and I am not about
to risk forcible defenestration by revealing my former employer's trade
secrets.
Why Speech Recognition Is Hard
First, the obvious: speech
is a complex audio signal, made up of a large number of component sound
waves. Speech can easily be captured in wave form, transmitted and
reproduced by common equipment; this is how the telephone has worked for
a century.
However, once we move up the
complexity scale and try to make a computer understand the
message encoded in speech, the actual wave form is unreliable. Vastly
different sounds can produce similar wave forms, while a subtle change
in inflection can transform a phoneme's wave form into something
completely alien. In fact, much of the speech signal is of no value to
the recognition process. Worse still: any reasonably accurate
mathematical representation of the entire signal would be far too large
to manipulate in real time.
Therefore, a manageable
number of discriminating features must somehow be extracted from the
wave before recognition can take place. A common scheme involves "cepstral
coefficients" ("cepstral" is a deliberate scrambling of "spectral"); the
recognizer collects 8,000 speech samples per second, groups them into
short overlapping frames, and extracts a "feature vector" of at most a
few dozen numbers from each frame,
through a mathematical analysis process that is far beyond the scope of
this article. From now on, when I mention "wave form" or
"signal", I am actually talking about these collections of
feature vectors.
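The feature-extraction step can be sketched in a few lines of Python.
This is a minimal illustration, not the front end of any real
recognizer: it computes the plain "real cepstrum" (the inverse Fourier
transform of the log magnitude spectrum) over short frames of samples.
The 25 ms frame length, the 10 ms step and the 12 coefficients per frame
are assumed values chosen for the example, not figures from this
article, and the slow textbook Fourier transform is used only to keep
the code self-contained.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second, as quoted in the article
FRAME_LEN = 200      # assumed: 25 ms of audio per frame
FRAME_STEP = 80      # assumed: a new frame starts every 10 ms
NUM_COEFFS = 12      # assumed: keep a dozen numbers per frame


def dft(values):
    """Textbook discrete Fourier transform (slow, but dependency-free)."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, num_coeffs=NUM_COEFFS):
    """First few coefficients of the inverse DFT of the log magnitude spectrum."""
    n = len(frame)
    # The small constant avoids log(0) on perfectly silent frequency bins.
    log_magnitude = [math.log(abs(c) + 1e-10) for c in dft(frame)]
    # The log magnitude of a real signal is symmetric, so its inverse DFT
    # is real and reduces to a sum of cosines.
    return [
        sum(log_magnitude[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
        for q in range(num_coeffs)
    ]


def feature_vectors(samples):
    """Slice the signal into overlapping frames, one feature vector per frame."""
    last_start = len(samples) - FRAME_LEN
    return [
        real_cepstrum(samples[start:start + FRAME_LEN])
        for start in range(0, last_start + 1, FRAME_STEP)
    ]


# A tenth of a second of a 440 Hz tone stands in for recorded speech.
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE // 10)]
features = feature_vectors(signal)
```

Each 200-sample frame shrinks to a 12-number vector, which is the kind
of compact representation the recognizer works with from here on.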
Acoustic Pattern Recognition
Now that you have a compact,
manageable and hopefully sufficient data set, how does the computer
"understand" it? By comparison with words it already knows,
using techniques from statistical pattern recognition.
A speech recognizer is
equipped with two crucial data structures:
-
A database of typical
wave forms for all of the phonemes (i.e., basic component sounds) of
a language. Since many of these phonemes' pronunciation varies with
context, the database usually contains a number of different wave
forms, or "allophones", for each phoneme. Databases
containing 200, 800 or even 10,000 allophones of the English
language can be purchased on the open market.
-
A lexicon containing
transcriptions of all the words known to the system into a phonetic
language. There must be a "letter" in this phonetic
alphabet for each allophone in the acoustic database. A good lexicon
will contain several transcriptions for most words; the word
"the", for example, can be pronounced "duh" or
"dee", so it should have at least these two entries in the
lexicon.
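As a rough picture of the second structure, a lexicon can be sketched as
a table from words to their transcriptions, plus a reverse index so that
a recognized allophone sequence can be looked up. The phonetic symbols
below are hypothetical placeholders; a real system uses whatever
alphabet its acoustic database defines.

```python
# Word -> list of transcriptions. The symbols are invented for the example.
LEXICON = {
    "the":   [("dh", "ah"), ("dh", "iy")],   # "duh" and "dee"
    "two":   [("t", "uw")],
    "hours": [("aw", "er", "z")],
    "ours":  [("aw", "er", "z")],            # a true homonym of "hours"
}


def build_reverse_index(lexicon):
    """Map each transcription back to every word that can be pronounced that way."""
    index = {}
    for word, transcriptions in lexicon.items():
        for phones in transcriptions:
            index.setdefault(phones, []).append(word)
    return index


REVERSE_INDEX = build_reverse_index(LEXICON)


def words_for(phones):
    """Return the legal words for an allophone sequence (empty list if none)."""
    return REVERSE_INDEX.get(tuple(phones), [])
```

Here `words_for(["dh", "iy"])` yields only "the", while
`words_for(["aw", "er", "z"])` yields both "hours" and "ours": the
lexicon alone cannot tell homonyms apart.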
The allophone database works
on a single phoneme at a time, so it must be expanded to recognize
entire words and phrases. A data structure called a "trellis"
is required; picture it as a massive Hidden Markov Model which consists
of a combinatorial explosion of every possible allophone followed by
every possible allophone, linked by weighted transitions, ad nauseam,
until the trellis is long enough to recognize the longest word in the
lexicon. (In practice, a complete trellis would require more memory than
there is on Earth, and various techniques are used to simulate/reduce
it.)
The recognizer then walks
through each path in the trellis, using the Viterbi dynamic programming
algorithm, and compares the models of its allophones with the features
extracted from the input utterance. The paths receive scores according
to how closely they match the data; the recognizer then reads the
allophone sequences represented by the highest scores and looks them up
in the lexicon, to see if they form legal words. If a lexicon entry
scores much higher than all others, the recognizer has
"understood" that word with a high level of confidence;
otherwise, it may return several choices and hope that the rest of the
application will be able to pick the most likely one on its own.
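The scoring walk can be illustrated on a toy model. The sketch below is
a textbook Viterbi search over a two-allophone HMM; the allophone names,
the probabilities and the "hiss"/"tone" observation labels are all
invented for the example, and a real recognizer scores continuous
feature vectors rather than two discrete labels.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best state sequence, its log score) for the observations."""
    # Each column maps a state to (log score of the best path ending
    # in that state, the state that path came from).
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        previous = columns[-1]
        column = {}
        for s in states:
            # Keep only the best way of arriving in state s.
            best_prev = max(
                states,
                key=lambda p: previous[p][0] + math.log(trans_p[p][s]),
            )
            score = (previous[best_prev][0]
                     + math.log(trans_p[best_prev][s])
                     + math.log(emit_p[s][obs]))
            column[s] = (score, best_prev)
        columns.append(column)

    # Start from the best final state and follow the back-pointers.
    state = max(states, key=lambda s: columns[-1][s][0])
    best_score = columns[-1][state][0]
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path, best_score


# Toy model of the word "hi": a noisy "h" followed by a voiced "ay".
STATES = ["h", "ay"]
START = {"h": 0.9, "ay": 0.1}
TRANSITIONS = {"h": {"h": 0.5, "ay": 0.5}, "ay": {"h": 0.1, "ay": 0.9}}
EMISSIONS = {"h": {"hiss": 0.8, "tone": 0.2}, "ay": {"hiss": 0.1, "tone": 0.9}}

path, score = viterbi(["hiss", "hiss", "tone", "tone"],
                      STATES, START, TRANSITIONS, EMISSIONS)
# path is ["h", "h", "ay", "ay"]
```

Working in log scores avoids multiplying long chains of small
probabilities, and keeping only the best predecessor per state is what
lets the search avoid enumerating every path explicitly.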
Training Acoustic Models
What if the allophonic
database in your recognizer does not match your voice or accent? For
example, what happens when a database compiled in the United States is
deployed in England (or India)?
This is a real problem, and
the only way to solve it is to "train" the acoustic models.
For single-user systems, the user is expected to read a pre-determined
text to the recognizer. For multi-user systems (e.g., 411 phone
directory assistance, which can't ask people to talk for 15 minutes
before answering their questions), a corpus of utterances spoken by many
people is compiled and transcribed by hand; it is then fed to the
recognizer and the weights on the transitions in its HMM are adjusted to
nudge the results towards an "average" model capable of
identifying as many of the "correct" values as possible. This
is a rather ad hoc solution, and it quickly reaches the point of
diminishing returns after a few thousand utterances.
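A heavily simplified sketch of the weight adjustment: if every utterance
in the corpus had already been labelled frame by frame with its
allophone, the transition weights could be estimated simply by counting.
That frame-level labelling is an assumption made for the example; real
trainers work from word-level transcriptions and rely on iterative
re-estimation (the Baum-Welch algorithm) because the alignment is
hidden.

```python
from collections import Counter, defaultdict


def estimate_transitions(labelled_sequences):
    """Estimate transition probabilities by counting observed transitions."""
    counts = defaultdict(Counter)
    for sequence in labelled_sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    # Normalise each state's counts so its outgoing weights sum to 1.
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Two speakers saying "hi", labelled one allophone per frame (invented data).
corpus = [
    ["h", "h", "ay", "ay"],
    ["h", "ay", "ay", "ay"],
]
weights = estimate_transitions(corpus)
# "h" is followed by "ay" in two of its three observed transitions
```

Every additional speaker shifts the counts a little, which is how the
model drifts towards an "average" voice.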
Language Models
When recognizing complete
sentences (as opposed to isolated words), a system can take advantage of
the fact that language has structure.
For example, suppose that
the acoustic recognizer believes that the current word is either
"ours" or "hours". If you know that the previous
word was "two", you can safely assume that the correct choice
is "hours", because the sequence "two hours" makes
sense, while "two ours" does not.
This is the theory behind
statistical language models. The "trigram" model is very
popular: given two previous words, a trigram gives probabilities that
the next word will be X or Y, independent of the actual signal!
Therefore, the system knows what to "expect", and may spend
more time looking for probable word sequences than on pure hodge-podge.
Building a trigram model is very easy: feed it a chunk of text, and it
will count the frequencies of all three-word sequences!
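The counting really is that simple. Here is a minimal sketch, assuming
whitespace-separated text and no smoothing; the tiny corpus is invented
for the example, and a usable model needs vastly more text.

```python
from collections import Counter, defaultdict


def train_trigrams(text):
    """Count, for every pair of consecutive words, which words follow it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1
    return counts


def next_word_probability(counts, w1, w2, candidate):
    """Estimate P(candidate | w1 w2) from the raw counts."""
    followers = counts.get((w1, w2))
    if not followers:
        return 0.0  # this two-word history never appeared in the corpus
    return followers[candidate] / sum(followers.values())


corpus = ("we waited for two hours and then we walked "
          "for two hours more but the win was ours")
model = train_trigrams(corpus)

p_hours = next_word_probability(model, "for", "two", "hours")  # 1.0
p_ours = next_word_probability(model, "for", "two", "ours")    # 0.0
```

Note that any sequence missing from the training text gets a
probability of exactly zero here; practical models smooth the counts so
that unseen sequences remain possible, merely unlikely.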
See the reference section at
the end of the article for more details on language models and their
refinements.
Limitations of Speech Recognition
For all of the effort which
dozens of PhDs have been putting into their work for years, speech
recognition is nowhere near Star Trek yet. Among the unresolved
issues:
-
Plain old speech
recognizers are dumb. Even those smart enough to recognize complete
sentences and equipped with language models will only spit out
collections of words. It's up to someone or something else to make
sense of them. (This is why most, if not all, speech input systems
also include a natural language understanding unit; I will describe
this component in detail next month.)
-
Speech recognition works
best in quiet, controlled environments. Trying to make it work amid
Quake III noise levels is not very effective.
-
The larger the
vocabulary, the easier it is to confuse a recognizer. If the
vocabulary contains true homonyms, the system is in trouble.
-
Speech recognition is a
processor hog; it can easily eat up the equivalent of a 300 MHz
Pentium II, leaving chump change for the rest of the application.
-
It is a lot easier to
differentiate between long words; unfortunately, most common words
are short.
-
For some people, speech
recognition just doesn't work, and not necessarily because of an
unusual accent either. For others, it works almost too well; they
could be speaking through a kazoo in the middle of Mardi Gras, and
the recognizer would return flawless results. No one knows why; the
phenomenon has been whimsically called "sheep and goats",
after the Biblical good guys and bad guys.
-
If the input equipment
has trouble handling high frequencies, the recognizer will produce
less accurate results, on average, for female speakers. This is a
significant problem with telephone-based systems.
-
Even at the best of
times, a speech recognizer will constantly skip words, hear words
the user hasn't said, or misunderstand. In perfect circumstances,
word accuracy rates can reach a seemingly impressive 95%. But
statistically, if word errors are independent, 95% means that 8-10
word sentences will contain at least one error 34% to 40% of the
time, and 14-word sentences more than half the time.
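The arithmetic behind this last point is easy to check, under the
simplifying assumption that each word is recognized independently of
the others: an n-word sentence is then error-free with probability p
to the power n, where p is the word accuracy.

```python
def sentence_error_rate(word_accuracy, num_words):
    """Chance that a sentence contains at least one misrecognized word."""
    return 1.0 - word_accuracy ** num_words


def shortest_unreliable_sentence(word_accuracy, threshold=0.5):
    """Smallest sentence length whose error rate exceeds the threshold."""
    num_words = 1
    while sentence_error_rate(word_accuracy, num_words) <= threshold:
        num_words += 1
    return num_words


rate_8 = sentence_error_rate(0.95, 8)     # about 0.34
rate_10 = sentence_error_rate(0.95, 10)   # about 0.40
break_even = shortest_unreliable_sentence(0.95)  # 14 words
```

At 95% word accuracy, 8-10 word sentences come out at roughly 34% to
40%, and the error rate passes one half at 14 words.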
Games and Speech Recognition
Given all of this, three
things should be clear:
-
Since speech recognition
does not work well for everyone in all circumstances, no game will
be able to get by with only a spoken interface.
-
The first games equipped
with significant speech interfaces will be "quiet" ones,
like trivia or adventure, rather than noisy action fare.
-
Because of its
performance limitations, simple speech recognition is not enough for
applications which require long inputs. Single-word commands are OK;
detailed orders are not.
Here are the only games I
can think of which might benefit from being hooked up to a
no-frills recognizer:
-
A trivia game. Most
players enjoy these games in a quiet environment. It is also
possible to build a custom lexicon for each question, containing
only a few likely answers; this reduces the risk of a
"false positive" match.
-
An arcade game where the
interface can be limited to a few single-word commands. Arkanoid
could be played with two words: "Left" and
"Right"!
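Both ideas boil down to filtering the recognizer's guesses against a
very small list of legal words. The sketch below assumes a hypothetical
recognizer that returns its guesses as (word, confidence) pairs; that
output format, the 0.6 confidence threshold and the sample data are all
invented for the example.

```python
def pick_word(hypotheses, allowed_words, min_confidence=0.6):
    """Return the most confident hypothesis found in the tiny lexicon, or None."""
    allowed = {word.lower() for word in allowed_words}
    candidates = [
        (confidence, word.lower())
        for word, confidence in hypotheses
        if word.lower() in allowed and confidence >= min_confidence
    ]
    if not candidates:
        return None
    return max(candidates)[1]


# Trivia: the lexicon for this question holds only a few likely answers.
answer = pick_word(
    [("Pairs", 0.71), ("Paris", 0.69), ("Harris", 0.40)],
    ["Paris", "London", "Rome"],
)

# Arcade: two commands are enough to move a paddle.
PADDLE_COMMANDS = {"left": -1, "right": +1}


def paddle_step(hypotheses):
    """Translate a recognized command into paddle movement (0 if unsure)."""
    return PADDLE_COMMANDS.get(pick_word(hypotheses, PADDLE_COMMANDS), 0)
```

In the trivia example the recognizer's top guess, "Pairs", is discarded
because it is not a legal answer, and "Paris" wins even though it scored
slightly lower. When nothing legal is confident enough, the paddle
simply stays put instead of acting on a bad guess.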
Speech Generation
Finally, a few words about
automated speech generation. While the commercial tools tend to be very
easy to use (e.g., one function call, passing a string of text and
receiving a WAV file in return), speech quality is questionable at best.
For games, this is rarely acceptable; unless you want a robot-like
voice, you should have an actor record the computer character(s)' most
common responses, and use word-stitching for the rest.
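One way to organise this, sketched under assumptions: look for a
complete recorded phrase first, then try to stitch the response together
from single-word recordings, and only fall back to synthesis when a word
is missing. The file names are hypothetical, and the sketch only decides
which clips to play; playing them is left to the game's audio code.

```python
def plan_speech(response, recorded_phrases, recorded_words):
    """Return the list of clips to play, or None if synthesis is needed."""
    key = response.lower()
    if key in recorded_phrases:
        return [recorded_phrases[key]]      # the actor recorded it whole
    clips = []
    for word in key.split():
        if word not in recorded_words:
            return None                     # a word is missing: synthesize
        clips.append(recorded_words[word])
    return clips                            # stitch word by word


# Hypothetical recordings made by the voice actor.
PHRASES = {"you cannot go that way": "cannot_go.wav"}
WORDS = {"take": "take.wav", "the": "the.wav", "lamp": "lamp.wav"}

whole = plan_speech("You cannot go that way", PHRASES, WORDS)
stitched = plan_speech("Take the lamp", PHRASES, WORDS)
missing = plan_speech("Take the sword", PHRASES, WORDS)
```

The more responses are recorded as whole phrases, the less often the
player hears the seams between stitched words.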
Further reading
If you find the gross
oversimplifications, shortcuts and warp-speed half-truths in this
article revolting, I suggest you look up Frederick Jelinek's book, Statistical
Methods for Speech Recognition, published by MIT Press in 1997.
Next month, I will discuss
how to extract meaning from a sequence of words, how this
dramatically extends spoken interfaces' usefulness, and why it is more
difficult when dealing with speech instead of written data. See you
there!
Bio
François-Dominic
Laramée is a freelance interactive game designer, developer and
producer. He has been involved in one form or another of the game
industry, whether PC, console, online, set-top box or even play-by-mail,
for the past decade, including more than 8 years of experience as Head
of Studio, Game Designer, Software Engineer, Producer and Quality
Assurance Manager in the interactive entertainment and spoken dialogue
interfaces industries. Learn more on his website at: http://pages.infinit.net/idjy/