
11 October 2005
Speech Interfaces for Games
Part 2: Natural Language Understanding
by François-Dominic Laramée
Oh,
good, you're back. Last month, we discussed how a computer extracts
words from an audio signal. Now, time to explain what to do with the
results.
To a computer, a list of words has no
intrinsic meaning. Taking speech recognition results and extracting
useful information which the computer can act upon requires very
sophisticated programming techniques. This is not surprising: it takes a
child years to master the basics of his or her mother tongue, and anyone
who has tried to learn a second or third language as an adult can
testify to the brain-wracking involved, especially when the language is
not related to one the student already understands. And, since the exact
ways in which the human brain processes language are still basically
unknown, the first computational linguists had to start from scratch.
This article will cover some of the
basics of natural language understanding (NLU), and how they can make
spoken interfaces feasible for a class of games much larger than simple
speech recognition.
Who Works In This Field?
These days, most of the advanced research
in natural language understanding is performed by (or for) telephone
companies or, increasingly, by internet corporations involved in IP
telephony.
The reason for this lies in the poor
quality of the telephone's audio signal. As we saw last time, speech
recognition can reach word accuracy rates of 90% to 95% or more when
conditions are ideal, for example when a calm speaker talks into a
quality microphone, in a noise-free environment. However, the
telephone's low bandwidth so degrades the voice signal that 80% accuracy
is about as good as it gets, which means that roughly one word in
five, and so at least one word in almost every sentence, should be
expected to come out wrong. When
cell phones are involved (especially the hands-free type), performance
drops to 70% or less.
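The arithmetic behind these figures can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a measurement: it assumes that each word is misrecognized independently of its neighbors and that a typical sentence runs to about ten words. Neither assumption comes from the figures above, and the function names are made up for the illustration.

```python
def expected_errors(word_accuracy, sentence_length):
    """Average number of misrecognized words in a sentence."""
    return (1 - word_accuracy) * sentence_length


def chance_of_any_error(word_accuracy, sentence_length):
    """Probability that at least one word is wrong.

    Assumes word errors are independent, which real recognizers
    do not guarantee.
    """
    return 1 - word_accuracy ** sentence_length


if __name__ == "__main__":
    for accuracy in (0.95, 0.80, 0.70):
        print(accuracy,
              expected_errors(accuracy, 10),
              round(chance_of_any_error(accuracy, 10), 2))
```

Under these assumptions, a ten-word sentence heard over a telephone line at 80% accuracy contains two wrong words on average.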
With such terrible accuracy results,
next-generation automated telephony applications would be hopeless
without some sort of error-fixing mechanism. This is where NLU comes
into its own: a run-of-the-mill NLU can take 75% accurate recognition
results and extract 85% or 90% of the underlying meaning, which is what
counts in the end.
Top-Down and Bottom-Up Parsing
Before getting in too deep, a word of
warning: for our purposes, analyzing an utterance's grammatical
structure is of limited interest. While nearly all NLU systems implement
Augmented Transition Networks and related techniques to break sentences
into structural components and accelerate meaning extraction, this
aspect of the work is largely irrelevant to a game designer. If you want
more details on this topic, look up the references at the end of the
text.
As far as the application developer is
concerned, NLU works through a process very similar to software
compilation, but in reverse. Called "bottom-up parsing", this
technique is inherently pessimistic: while a compiler's top-down parser
will scream in pain at the slightest misplaced semicolon, a bottom-up
parser assumes that something will be wrong with its input and prepares
to do the best job it can anyway.
For example, suppose that a C compiler is
looking at the statement:
expos = 4 * betterThanYankees;
The compiler knows that C contains a
construct called "assignment", which requires the presence of
an assignment operator, a variable, and an expression whose value will
be copied into it. If the variable "expos" has not yet been
defined, the compiler will holler in protest; otherwise, and assuming
that "4 * betterThanYankees" is a valid expression in the
language, the compiler will accept the (unfortunately preposterous)
statement and move on.
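To make the all-or-nothing behavior concrete, here is a minimal sketch of a top-down (recursive descent) parser for that one construct. It is written in Python rather than C, it handles only a toy version of the assignment rule, and it is in no way how a production compiler is built. The names `tokenize` and `parse_assignment` are assumptions made for the example, and only the assigned variable is checked against the set of declared names.

```python
import re


def tokenize(text):
    # Names, integers, and any other single non-space character.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)


def parse_assignment(statement, declared):
    """Top-down parse of: <ASSIGNMENT> -> <NAME> = <EXPR> ;"""
    tokens = tokenize(statement)
    pos = 0

    def expect(test, what):
        # Consume one token or give up on the whole statement.
        nonlocal pos
        token = tokens[pos] if pos < len(tokens) else None
        if token is None or not test(token):
            raise SyntaxError(f"expected {what}, got {token!r}")
        pos += 1
        return token

    def operand():
        return expect(lambda t: t.isidentifier() or t.isdigit(),
                      "a name or a number")

    target = expect(str.isidentifier, "a variable name")
    if target not in declared:
        raise NameError(f"variable {target!r} has not been defined")
    expect(lambda t: t == "=", "'='")
    expression = [operand()]
    while pos < len(tokens) and tokens[pos] in ("+", "-", "*", "/"):
        expression.append(expect(lambda t: True, "an operator"))
        expression.append(operand())
    expect(lambda t: t == ";", "';'")
    if pos != len(tokens):
        raise SyntaxError("unexpected text after ';'")
    return target, expression
```

Called with the statement above and a set of declared names containing "expos", the function returns the target and the expression tokens. Drop the final semicolon and it raises an error instead of guessing.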
However, before top-down parsing could be
applied to English, several problems would have to be overcome, namely:
- All human languages teem with
ambiguity, exceptions to every grammatical rule, and inconsistency.
Defining a complete parsing grammar for any language more
complicated than Esperanto is absolutely out of the question.
- People can't be expected to write (or
talk) in precisely defined fashion, and any system requiring such
unhuman behavior would be exceedingly brittle.
NLU employs a more focused, robust
approach to the problem.
- First, an application will only
attempt to define a grammar for the small subset of language
appropriate to its domain; this way, much ambiguity can be
side-stepped. For example, a routine trying to determine a
starship's destination system might know a lot about things like
"Aldebaran", "Departure" and "Warp
Drive", but nothing at all about "Honey Bees" and
"Sunshine".
- Second, instead of parsing from the
top down, the NLU will parse from the bottom up, trying to squeeze
out fragments of meaning from short word chunks, and then combining
them into as big a picture as can be painted.
For example, consider the phrase "I
want to go to Boston this afternoon." A bottom-up parser might
interpret the words "this afternoon" as a fragment describing
a moment in time; the word "Boston" as a city name; the word
"to" followed by a city name as a destination; and the rest of
the phrase as an intention, which can be discarded because the rest of
the phrase (time and destination) is enough to convey the utterance's
meaning.
If something had gone wrong during input
and the NLU had received the utterance "Eye wand two grow to Boston
three afternoon", the important fragments "to Boston" and
"afternoon" would still have been understood (possibly with
uncertainty as to which day the user is talking about) and 80-100% of
the meaning conveyed to the system, despite a truly revolting 38% word
accuracy rate.
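The example can be checked with a short Python sketch. The word lists and the function names (`word_accuracy`, `spot_fragments`) are invented for the illustration, and the accuracy measure assumes that both utterances contain the same number of words, which holds here but not for recognizers that insert or drop words.

```python
import re

# Hypothetical domain knowledge: a few cities and parts of the day.
CITIES = {"boston", "montreal", "toronto"}
DAY_PARTS = {"morning", "afternoon", "evening"}


def word_accuracy(spoken, recognized):
    """Share of word positions the recognizer got right."""
    reference = spoken.lower().split()
    hypothesis = recognized.lower().split()
    hits = sum(r == h for r, h in zip(reference, hypothesis))
    return hits / len(reference)


def spot_fragments(utterance):
    """Keep the chunks that carry meaning, ignore everything else."""
    words = re.findall(r"[a-z']+", utterance.lower())
    found = {}
    for i, word in enumerate(words):
        if word in DAY_PARTS:
            found["time"] = word
        # "to" followed by a city name is read as a destination.
        if word == "to" and i + 1 < len(words) and words[i + 1] in CITIES:
            found["destination"] = words[i + 1]
    return found


if __name__ == "__main__":
    spoken = "I want to go to Boston this afternoon"
    heard = "Eye wand two grow to Boston three afternoon"
    print(word_accuracy(spoken, heard))
    print(spot_fragments(heard))
```

Only three of the eight words survive recognition (3/8 = 0.375), yet both the destination and the time of day are recovered from the garbled version.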
Grammars
Both types of parsing require a set of
strict, complete and unambiguous rules, encoded into a grammar.
Decades ago, linguist and political
activist Noam Chomsky defined several classes of languages and the
grammars which underlie them. For example, the "regular" grammars
(the ones behind regular expressions) can be traversed in linear
fashion, one "word" at a time, while the "context-free" grammars
which define computer languages are recursive in nature. Chomsky's
hierarchy revolutionized linguistics, and all NLU systems are founded on
his ideas.
Here is an example of a minuscule grammar
for departures and arrivals, in a travel domain:
<CITY> -> Boston
<CITY> -> Montreal
<CITY> -> New York
<ARRIVAL> -> [going] to <CITY>
<DEPARTURE> -> [starting] from <CITY>
<TRIP> -> <DEPARTURE> <ARRIVAL>
<TRIP> -> <ARRIVAL> <DEPARTURE>
The terms enclosed within <> are
variable names. Words within square brackets are optional. The
<TRIP> rules specify that a trip is defined either as a departure
city followed by a destination, or the reverse. The <ARRIVAL> rule
says that users can specify a destination either by saying
"to" followed by the name of a city, or by saying "going
to" followed by the city name.
Therefore, our system would be able to
understand phrases such as "To Boston starting from New York",
or "Starting from Montreal, going to Boston".
Even if the input was garbled into "BZZT
to brrZZT grrrzzz from New York", the system would at least be able
to grasp the departure city and ask a pointed question about the
destination, instead of being reduced to "Please repeat, I did not
understand."
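Here is one way the miniature grammar could be applied from the bottom up, sketched in Python. It is an illustration under simplifying assumptions, not the design of any particular NLU product: the function names are invented, the optional words "going" and "starting" are ignored because they add no meaning, and the <TRIP> level is reduced to a dictionary holding whichever endpoints were found.

```python
import re

# <CITY> rules, keyed by the words that make up each name.
CITIES = {
    ("boston",): "Boston",
    ("montreal",): "Montreal",
    ("new", "york"): "New York",
}


def tag_cities(words):
    """Lowest level: replace city names with ("CITY", name) nodes."""
    nodes, i = [], 0
    while i < len(words):
        for length in (2, 1):
            key = tuple(words[i:i + length])
            if key in CITIES:
                nodes.append(("CITY", CITIES[key]))
                i += len(key)
                break
        else:
            nodes.append(words[i])  # unknown word, kept as is
            i += 1
    return nodes


def find_endpoints(nodes):
    """Next level: "to" <CITY> is an arrival, "from" <CITY> a departure."""
    trip = {"departure": None, "arrival": None}
    for previous, node in zip(nodes, nodes[1:]):
        if isinstance(node, tuple):
            if previous == "to":
                trip["arrival"] = node[1]
            elif previous == "from":
                trip["departure"] = node[1]
    return trip


def parse_trip(utterance):
    words = re.findall(r"[a-z]+", utterance.lower())
    return find_endpoints(tag_cities(words))


def next_question(trip):
    """Ask only for what is still missing."""
    if trip["departure"] and trip["arrival"]:
        return None
    if trip["departure"]:
        return f"Where would you like to go from {trip['departure']}?"
    if trip["arrival"]:
        return f"Where will you be leaving from to reach {trip['arrival']}?"
    return "Please repeat, I did not understand."
```

Fed the garbled utterance, `parse_trip` still returns New York as the departure, and `next_question` asks only about the destination.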
The Bad News: It's More Difficult With
Speech
When people are writing, they are
relatively careful. They can spend some time thinking about proper
formulations, correct spelling mistakes, etc. However, when they are
speaking, things are not so simple. Numerous sources of errors will be
introduced into the NLU's input stream:
- People don't talk at a constant rate.
They hesitate, stop, speed up, insert non-words like "uh",
"ah" and "like" at the most inopportune times.
- Quite often, speakers change their
minds in mid-sentence. A human listener may understand that
"No, I'm sorry, I meant Friday" means that any previous
reference to a weekday can be discarded, but to a computer, this is
difficult.
- People stutter, duplicate phrases, or
use wrong parts of speech at the wrong times. If you listen to an
interview on TV, it will seem to flow quite naturally; but if you
try to read a verbatim transcript of the same interview, it will
drive you insane.
- People assume that a listener will be
able to refer pronouns and locutions like "it",
"she" and "the other one" to the appropriate
concepts. Again, this is incredibly complicated for a machine.
- And even if the user speaks in a clear
and decisive manner, the recognizer may throw a monkey wrench into
the whole system by misunderstanding "the" as
"they" and wrecking the entire structure of the utterance.
Therefore, while an NLU system connected
to a keyboard may expect input like: "I want to go to Boston from
Toronto Friday at ten AM", its cousin working with a speech
recognizer is far more likely to receive gibberish like "Ah, yes,
uh, I want, like, a ticket for New York at 8 on Thurs- no, I mean
Friday, in the evening I mean. Oh, and I'm in Boston now." And if
the user really wants to confuse the machine, he'll add: "No, I
meant Toronto". For which of the trip's endpoints?
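A first line of defense against this kind of input can be sketched in Python. The filler list, the treatment of a trailing hyphen as a cut-off word, and the "latest mention wins" rule are all assumptions chosen for the illustration; a real system would need far more care (dropping "like" also damages "I would like", for instance).

```python
import re

# Hypothetical list of non-words to discard.
FILLERS = {"ah", "uh", "um", "er", "oh", "like"}
DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}


def strip_fillers(utterance):
    """Drop fillers and words the speaker abandoned halfway ("Thurs-")."""
    words = re.findall(r"[a-z0-9']+-?", utterance.lower())
    return [w for w in words if w not in FILLERS and not w.endswith("-")]


def last_day_mentioned(words):
    """When the speaker changes his mind, keep the latest weekday."""
    day = None
    for word in words:
        if word in DAYS:
            day = word
    return day


if __name__ == "__main__":
    messy = ("Ah, yes, uh, I want, like, a ticket for New York at 8 on "
             "Thurs- no, I mean Friday, in the evening I mean. "
             "Oh, and I'm in Boston now.")
    cleaned = strip_fillers(messy)
    print(" ".join(cleaned))
    print(last_day_mentioned(cleaned))
```

This handles the weekday correction, because only one slot can hold a weekday. It does nothing for "No, I meant Toronto", where the correction could apply to either endpoint of the trip.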
How To Write A Grammar
To make the system as robust as possible,
the linguists who write its grammar will have to create rules accepting
the most common errors as correct input, hope that their higher-level
rules (like our <TRIP> rules above) support enough formulations to
handle a majority of cases (without "overgenerating" and
accepting pure gibberish as valid data), and rely on the parser's
inherent robustness for the rest. Thank God for bottom-up parsing!
Whenever possible, grammar writers try to
squeeze their domains into regular expressions, which can be parsed in
O(n) time, where n is the number of words in the utterance, while
general context-free grammars can require up to O(n³). Unfortunately,
this is rarely possible except in very simple cases.
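The travel grammar shown earlier is one of those very simple cases: none of its rules refers back to itself, so the whole thing can be flattened into a single regular expression. The Python sketch below shows the idea. The group names are invented for the example, and Python's `re` module is a backtracking matcher that does not by itself guarantee linear time; the point is only that no recursion is needed.

```python
import re

CITY = r"(?:boston|montreal|new york)"

# <TRIP> -> <DEPARTURE> <ARRIVAL> | <ARRIVAL> <DEPARTURE>, written out in full.
TRIP = re.compile(
    rf"(?:starting )?from (?P<dep1>{CITY}) (?:going )?to (?P<arr1>{CITY})"
    rf"|(?:going )?to (?P<arr2>{CITY}) (?:starting )?from (?P<dep2>{CITY})"
)


def match_trip(utterance):
    """Return both endpoints, or None unless the whole utterance fits."""
    text = " ".join(re.findall(r"[a-z]+", utterance.lower()))
    match = TRIP.fullmatch(text)
    if match is None:
        return None
    return {
        "departure": match.group("dep1") or match.group("dep2"),
        "arrival": match.group("arr1") or match.group("arr2"),
    }
```

Note the price of this compactness: because the expression must match the entire utterance, a garbled input yields nothing at all, whereas a fragment-by-fragment parser would still salvage one endpoint.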
Finally, if it isn't already obvious, let
me state that writing grammars is difficult. Every time the writer adds
a rule to cover a new case, she runs the risk of introducing ambiguity
and breaking the system for some other utterance. The growth in
complexity is explosive; once the system reaches several hundred rules,
supporting one more utterance form without breaking ten becomes almost
impossible.
As a result, some researchers have
experimented with automated learning of grammar rules; for example, an
intelligent version of our travel system might have
"discovered" that a city name following the word
"to" is usually indicative of a destination, if we had fed it
enough examples where we had said that "to Boston" means that
we want to go there. So far, the results are interesting, but nowhere
near production-grade.
Games With NLU
Text adventures have used simple parsers
for twenty years, so the concept has been proven viable for games. These
days, if I wanted to write a game driven by spoken input and using NLU,
I would employ the following strategy:
- Let the machine drive the game's
interactions with the user, via single question-answer pairs. This
helps set up expectations as to what the user will say, and
therefore eliminates some ambiguity. For example, if the machine
asks "How many phaser banks shall we fire at the Klingons,
captain?", the answer "3" will be interpreted as a
number of weapons instead of a simple numeral.
- Define small, independent grammars for
each task the NLU is expected to handle, instead of trying to fit
them all into a big one. This keeps things simple and allows
division of labor. For example, one grammar might handle the
starship's movement between star systems, while another would take
over during combat.
- Use the NLU to handle simple commands
which can be specified in a single sentence. "Let's go to Gamma
Epsilon IV" is complete in itself. So is "Who is George
Washington?" in a Jeopardy knock-off. Complete dialogues with
NPCs are beyond the power of an NLU and require discourse
management software, which we will talk about next month.
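The first two points of this strategy can be sketched together in Python: each game state carries its own question and its own tiny grammar, and the question that was just asked decides how the answer is read. The state names, command names, and the navigation prompt are hypothetical, and the two "grammars" are deliberately crude stand-ins for real ones.

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4}


def parse_count(answer):
    """Combat grammar: find the first number, spoken or written."""
    for token in re.findall(r"[a-z]+|\d+", answer.lower()):
        if token.isdigit():
            return int(token)
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
    return None


def parse_destination(answer):
    """Navigation grammar: whatever follows "to" is the destination."""
    match = re.search(r"\bto ([a-z0-9 ]+)", answer.lower())
    return match.group(1).strip() if match else None


# One small, independent grammar per task: (question, parser, command).
GRAMMARS = {
    "combat": ("How many phaser banks shall we fire at the Klingons, captain?",
               parse_count, "fire_phasers"),
    "navigation": ("Where shall we set course, captain?",
                   parse_destination, "set_course"),
}


def interpret(state, answer):
    """Read the answer in light of the question the machine just asked."""
    question, parser, command = GRAMMARS[state]
    value = parser(answer)
    if value is None:
        return ("ask_again", question)
    return (command, value)
```

In the combat state, the answer "3" becomes a number of phaser banks. The same answer in the navigation state fits nothing, so the machine simply repeats its question.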
I believe that just about any type of
game short of dialogue-intensive interactive fiction can be handled
through NLU. Real-time strategy, with its simple unit-based orders, is a
natural fit.
Further Reading
If you care to find out more, take a peek
at James Allen's book, Natural
Language Understanding, published by Addison-Wesley. The second
edition, released in 1995, is current. You may also want to look up the
proceedings of several very active conferences, including those
organized by the Association for Computational Linguistics and the
IEEE-sponsored International Conference on Spoken Language Processing.
Next month, we will close out the series
by discussing how to build a complete dialogue into an application and
give the computer as realistic an interaction style as possible. See you
there!
Bio
François-Dominic
Laramée is a freelance interactive game designer, developer and
producer. He has been involved in one form or another of the game
industry, whether PC, console, online, set-top box or even play-by-mail,
for the past decade, including more than 8 years of experience as Head
of Studio, Game Designer, Software Engineer, Producer and Quality
Assurance Manager in the interactive entertainment and spoken dialogue
interfaces industries. Learn more on his website at: http://pages.infinit.net/idjy/