11 October 2005 

Speech Interfaces for Games
Part 2: Natural Language Understanding
by François Dominic Laramée

(If you missed Part 1, click here).

Oh, good, you're back. Last month, we discussed how a computer extracts words from an audio signal. Now, time to explain what to do with the results.

To a computer, a list of words has no intrinsic meaning. Taking speech recognition results and extracting useful information which the computer can act upon requires very sophisticated programming techniques. This is not surprising: it takes a child years to master the basics of his or her mother tongue, and anyone who has tried to learn a second or third language as an adult can testify to the brain-wracking involved, especially when the language is not related to one the student already understands. And, since the exact ways in which the human brain processes language are still basically unknown, the first computational linguists had to start from scratch.

This article will cover some of the basics of natural language understanding (NLU), and how they can make spoken interfaces feasible for a class of games much larger than simple speech recognition.

Who Works In This Field?

These days, most of the advanced research in natural language understanding is performed by (or for) telephone companies or, increasingly, by internet corporations involved in IP telephony.

The reason for this lies in the poor quality of the telephone's audio signal. As we saw last time, speech recognition can reach word accuracy rates of 90% to 95% or more when conditions are ideal, for example when a calm speaker talks into a quality microphone in a noise-free environment. However, the telephone's low bandwidth degrades the voice signal so much that 80% accuracy is about as good as it gets. With one word in five misrecognized on average, at least one word in a typical sentence should be expected to come out wrong. When cell phones are involved (especially the hands-free type), performance drops to 70% or less.

With such terrible accuracy results, next-generation automated telephony applications would be hopeless without some sort of error-fixing mechanism. This is where NLU comes into its own: a run-of-the-mill NLU can take 75% accurate recognition results and extract 85% or 90% of the underlying meaning, which is what counts in the end.

Top-Down and Bottom-Up Parsing

Before getting in too deep, a word of warning: for our purposes, analyzing an utterance's grammatical structure is of limited interest. While nearly all NLU systems implement Augmented Transition Networks and related techniques to break sentences into structural components and accelerate meaning extraction, this aspect of the work is largely irrelevant to a game designer. If you want more details on this topic, look up the references at the end of the text.

As far as the application developer is concerned, NLU works through a process very similar to software compilation, but in reverse. Called "bottom-up parsing", this technique is inherently pessimistic: while a compiler's top-down parser will scream in pain at the slightest misplaced semicolon, a bottom-up parser assumes that something will be wrong with its input and prepares to do the best job it can anyway.

For example, suppose that a C compiler is looking at the statement:

expos = 4 * betterThanYankees;

The compiler knows that C contains a construct called "assignment", which requires the presence of an assignment operator, a variable, and an expression whose value will be copied into it. If the variable "expos" has not yet been defined, the compiler will holler in protest; otherwise, and assuming that "4 * betterThanYankees" is a valid expression in the language, the compiler will accept the (unfortunately preposterous) statement and move on.
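
To see this all-or-nothing behavior for yourself, here is a minimal sketch that uses Python's standard ast module in place of a C compiler (my substitution, for convenience). The helper name strictly_parses is made up for this example. Note that Python only checks syntax at this stage, so unlike the C compiler it will not complain if "expos" was never defined.

```python
import ast

def strictly_parses(statement: str) -> bool:
    """Return True only if the whole statement is syntactically valid Python."""
    try:
        ast.parse(statement)
        return True
    except SyntaxError:
        # A strict parser gives up on the entire input at the first error.
        return False

# A well-formed assignment is accepted as a whole.
print(strictly_parses("expos = 4 * betterThanYankees"))

# Drop a single operand and the whole statement is rejected.
print(strictly_parses("expos = 4 * "))
```

Nothing is salvaged from the second statement, even though most of it is perfectly readable. That is exactly the attitude we cannot afford with speech.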

However, before top-down parsing could be applied to English, several problems would have to be overcome, namely:

  • All human languages teem with ambiguity, exceptions to every grammatical rule, and inconsistency. Defining a complete parsing grammar for any language more complicated than Esperanto is absolutely out of the question.
  • People can't be expected to write (or talk) in precisely defined fashion, and any system requiring such unhuman behavior would be exceedingly brittle.

NLU employs a more focused, robust approach to the problem.

  • First, an application will only attempt to define a grammar for the small subset of language appropriate to its domain; this way, much ambiguity can be side-stepped. For example, a routine trying to determine a starship's destination system might know a lot about things like "Aldebaran", "Departure" and "Warp Drive", but nothing at all about "Honey Bees" and "Sunshine".
  • Second, instead of parsing from the top down, the NLU will parse from the bottom up, trying to squeeze out fragments of meaning from short word chunks, and then combining them into as big a picture as can be painted.

For example, consider the phrase "I want to go to Boston this afternoon." A bottom-up parser might interpret the words "this afternoon" as a fragment describing a moment in time; the word "Boston" as a city name; the word "to" followed by a city name as a destination; and the rest of the phrase as an intention, which can be discarded because the rest of the phrase (time and destination) is enough to convey the utterance's meaning.

If something had gone wrong during input and the NLU had received the utterance "Eye wand two grow to Boston three afternoon", the important fragments "to Boston" and "afternoon" would still have been understood (possibly with uncertainty as to which day the user is talking about) and 80-100% of the meaning conveyed to the system, despite a truly revolting 38% word accuracy rate.
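
The Boston example can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not a production NLU: the pattern table, the function names and the three-city list are all invented for the occasion, and the accuracy measure simply compares words position by position (real recognizer scoring aligns the two word strings first).

```python
import re

# Hypothetical fragment patterns for a tiny travel domain.
CITIES = ("boston", "montreal", "new york")
FRAGMENTS = {
    "destination": re.compile(r"\bto (" + "|".join(CITIES) + r")\b"),
    "time_of_day": re.compile(r"\b(morning|afternoon|evening)\b"),
}

def extract_fragments(utterance: str) -> dict:
    """Collect every fragment of meaning we can find; ignore the rest."""
    text = utterance.lower()
    found = {}
    for name, pattern in FRAGMENTS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found

def word_accuracy(reference: str, recognized: str) -> float:
    """Fraction of words that match position by position.

    Simplification: assumes both utterances contain the same number of words.
    """
    ref = reference.lower().split()
    hyp = recognized.lower().split()
    correct = sum(r == h for r, h in zip(ref, hyp))
    return correct / len(ref)

clean = "I want to go to Boston this afternoon"
garbled = "Eye wand two grow to Boston three afternoon"

print(extract_fragments(clean))
print(extract_fragments(garbled))
print(word_accuracy(clean, garbled))
```

Both utterances yield the same destination and time-of-day fragments, even though only three of the eight words in the garbled version survived recognition.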

Grammars

Both types of parsing require a set of strict, complete and unambiguous rules, encoded into a grammar.

Decades ago, linguist and political activist Noam Chomsky defined several classes of languages and the grammars which underlie them. For example, the "regular" languages, which are the ones described by regular expressions, can be traversed in linear fashion, one "word" at a time, while the "context-free" grammars which define computer languages are recursive in nature. Chomsky's hierarchy revolutionized linguistics, and all NLU systems are founded on his ideas.

Here is an example of a minuscule grammar for departures and arrivals, in a travel domain:

<CITY> -> Boston
<CITY> -> Montreal
<CITY> -> New York
<ARRIVAL> -> [going] to <CITY>
<DEPARTURE> -> [starting] from <CITY>
<TRIP> -> <DEPARTURE> <ARRIVAL>
<TRIP> -> <ARRIVAL> <DEPARTURE>

The terms enclosed within <> are variable names. Words within square brackets are optional. The <TRIP> rules specify that a trip is defined either as a departure city followed by a destination, or the reverse. The <ARRIVAL> rule says that users can specify a destination either by saying "to" followed by the name of a city, or by saying "going to" followed by the city name.

Therefore, our system would be able to understand phrases such as "To Boston starting from New York", or "Starting from Montreal, going to Boston".

Even if the input was garbled into "BZZT to brrZZT grrrzzz from New York", the system would at least be able to grasp the departure city and ask a pointed question about the destination, instead of being reduced to "Please repeat, I did not understand."
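
Here is one way the little travel grammar might be put to work, as a rough Python sketch. The function names parse_trip and next_prompt, and the wording of the prompts, are my own inventions. The sketch looks for the <ARRIVAL> and <DEPARTURE> fragments independently, which accepts both orders allowed by the <TRIP> rules and lets a partial match survive.

```python
import re

# <CITY> -> Boston | Montreal | New York
CITY = r"(?:boston|montreal|new york)"
# <ARRIVAL> -> [going] to <CITY>
ARRIVAL = re.compile(r"\b(?:going )?to (" + CITY + r")\b")
# <DEPARTURE> -> [starting] from <CITY>
DEPARTURE = re.compile(r"\b(?:starting )?from (" + CITY + r")\b")

def parse_trip(utterance: str) -> dict:
    """Extract whichever trip endpoints can be recognized."""
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", utterance.lower())
    text = " ".join(text.split())
    arrival = ARRIVAL.search(text)
    departure = DEPARTURE.search(text)
    return {
        "destination": arrival.group(1) if arrival else None,
        "departure": departure.group(1) if departure else None,
    }

def next_prompt(trip: dict) -> str:
    """Ask a pointed question about whatever is still missing."""
    if trip["departure"] is None and trip["destination"] is None:
        return "Please repeat, I did not understand."
    if trip["destination"] is None:
        return f"Where would you like to go from {trip['departure'].title()}?"
    if trip["departure"] is None:
        return f"Where will you be leaving from to reach {trip['destination'].title()}?"
    return f"Booking {trip['departure'].title()} to {trip['destination'].title()}."

print(parse_trip("To Boston starting from New York"))
print(parse_trip("Starting from Montreal, going to Boston"))
print(next_prompt(parse_trip("BZZT to brrZZT grrrzzz from New York")))
```

With the garbled input, only the departure city is filled in, so the system asks about the destination instead of starting over.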

The Bad News: It's More Difficult With Speech

When people are writing, they are relatively careful. They can spend some time thinking about proper formulations, correct spelling mistakes, etc. However, when they are speaking, things are not so simple. Numerous sources of errors will be introduced into the NLU's input stream:

  • People don't talk at a constant rate. They hesitate, stop, speed up, insert non-words like "uh", "ah" and "like" at the most inopportune times.
  • Quite often, speakers change their minds in mid-sentence. A human listener may understand that "No, I'm sorry, I meant Friday" means that any previous reference to a weekday can be discarded, but to a computer, this is difficult.
  • People stutter, duplicate phrases, or use wrong parts of speech at the wrong times. If you listen to an interview on TV, it will seem to flow quite naturally; but if you try to read a verbatim transcript of the same interview, it will drive you insane.
  • People assume that a listener will be able to refer pronouns and locutions like "it", "she" and "the other one" to the appropriate concepts. Again, this is incredibly complicated for a machine.
  • And even if the user speaks in a clear and decisive manner, the recognizer may throw a monkey wrench into the whole system by misunderstanding "the" as "they" and wrecking the entire structure of the utterance.

Therefore, while an NLU system connected to a keyboard may expect input like: "I want to go to Boston from Toronto Friday at ten AM", its cousin working with a speech recognizer is far more likely to receive gibberish like "Ah, yes, uh, I want, like, a ticket for New York at 8 on Thurs- no, I mean Friday, in the evening I mean. Oh, and I'm in Boston now." And if the user really wants to confuse the machine, he'll add: "No, I meant Toronto". For which of the trip's endpoints?
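
To give a feel for the clean-up work involved, here is a deliberately naive Python sketch. The filler list, the function names and the "last weekday mentioned wins" rule are all my own assumptions, and they only scratch the surface: blindly dropping "like" will hurt some legitimate sentences, and nothing here can decide which endpoint "No, I meant Toronto" refers to.

```python
import re

# Hypothetical list of non-words to throw away.
FILLERS = {"uh", "ah", "um", "like"}
WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")

def clean_tokens(utterance: str) -> list:
    """Split into words, dropping fillers and truncated words like 'Thurs-'."""
    # A trailing hyphen marks a word the speaker abandoned midway.
    tokens = re.findall(r"[a-z0-9']+(?:-[a-z0-9']+)*-?", utterance.lower())
    return [t for t in tokens if t not in FILLERS and not t.endswith("-")]

def resolve_weekday(utterance: str):
    """Naive self-correction rule: the last weekday mentioned wins."""
    days = [t for t in clean_tokens(utterance) if t in WEEKDAYS]
    return days[-1] if days else None

spoken = ("Ah, yes, uh, I want, like, a ticket for New York at 8 on Thurs- "
          "no, I mean Friday, in the evening I mean.")

print(clean_tokens(spoken))
print(resolve_weekday(spoken))
print(resolve_weekday("Monday. No, I'm sorry, I meant Friday"))
```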

How To Write A Grammar

To make the system as robust as possible, the linguists who write its grammar will have to create rules accepting the most common errors as correct input, hope that their higher-level rules (like our <TRIP> rules above) support enough formulations to handle a majority of cases (without "overgenerating" and accepting pure gibberish as valid data), and rely on the parser's inherent robustness for the rest. Thank God for bottom-up parsing!

Whenever possible, grammar writers try to squeeze their domains into regular expressions, which can be parsed in O(n) time, while general context-free grammars can require up to O(n³) time in the worst case. Unfortunately, this is rarely possible except in very simple cases.
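
Where does the cubic cost come from? The sketch below implements the classic CYK recognizer, one well-known algorithm for context-free grammars (I am not claiming it is what any particular NLU product uses). It requires the grammar in Chomsky normal form, so I have rewritten a cut-down version of the travel grammar by hand: single-word cities only, and no optional "going" or "starting". The table names are my own.

```python
from itertools import product

# Cut-down travel grammar in Chomsky normal form (an assumption of this sketch).
LEXICON = {
    "to": {"TO"},
    "from": {"FROM"},
    "boston": {"CITY"},
    "montreal": {"CITY"},
}
BINARY_RULES = {
    ("TO", "CITY"): {"ARRIVAL"},
    ("FROM", "CITY"): {"DEPARTURE"},
    ("DEPARTURE", "ARRIVAL"): {"TRIP"},
    ("ARRIVAL", "DEPARTURE"): {"TRIP"},
}

def cyk_recognize(words, start="TRIP") -> bool:
    """Return True if the whole word list can be derived from `start`."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds every symbol that derives words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, ()))
    # Three nested loops over positions: this is the O(n^3) part.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]

print(cyk_recognize("from montreal to boston".split()))
print(cyk_recognize("to boston from montreal".split()))
print(cyk_recognize("to boston to montreal".split()))
```

This particular grammar is simple enough to be handled by a regular expression, which is precisely the kind of shortcut grammar writers look for.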

Finally, if it isn't already obvious, let me state that writing grammars is difficult. Every time the writer adds a rule to cover a new case, she runs the risk of introducing ambiguity and breaking the system for some other utterance. The growth in complexity is explosive; once the system reaches several hundred rules, supporting one more utterance form without breaking ten becomes almost impossible.

As a result, some researchers have experimented with automated learning of grammar rules; for example, an intelligent version of our travel system might have "discovered" that a city name following the word "to" is usually indicative of a destination, if we had fed it enough examples where we had said that "to Boston" means that we want to go there. So far, the results are interesting, but nowhere near production-grade.
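
In its crudest form, the idea amounts to counting. The Python sketch below is a toy of my own devising, not a description of any research system: it tallies which role a city name played after each cue word in a handful of hand-labeled examples, and keeps the most frequent role. The training pairs are invented purely for illustration.

```python
from collections import Counter, defaultdict

def learn_roles(examples):
    """Map each cue word to the role it most often signals in the examples."""
    counts = defaultdict(Counter)
    for cue_word, role in examples:
        counts[cue_word][role] += 1
    return {cue: roles.most_common(1)[0][0] for cue, roles in counts.items()}

# Invented toy data: (word preceding the city name, role the city played).
training = [
    ("to", "destination"),
    ("to", "destination"),
    ("from", "departure"),
    ("to", "destination"),
    ("from", "departure"),
    ("from", "destination"),  # a noisy label, to show that the majority wins
]

rules = learn_roles(training)
print(rules)
```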

Games With NLU

Text adventures have used simple parsers for twenty years, so the concept has been proven viable for games. These days, if I wanted to write a game driven by spoken input and using NLU, I would employ the following strategy:

  • Let the machine drive the game's interactions with the user, via single question-answer pairs. This helps set up expectations as to what the user will say, and therefore eliminates some ambiguity. For example, if the machine asks "How many phaser banks shall we fire at the Klingons, captain?", the answer "3" will be interpreted as a number of weapons instead of a simple numeral.
  • Define small, independent grammars for each task the NLU is expected to handle, instead of trying to fit them all into a big one. This keeps things simple and allows division of labor. For example, one grammar might handle the starship's movement between star systems, while another would take over during combat.
  • Use the NLU to handle simple commands which can be specified in a single sentence. "Let's go to Gamma Epsilon IV" is complete in itself. So is "Who is George Washington?" in a Jeopardy knock-off. Complete dialogues with NPCs are beyond the power of an NLU and require discourse management software, which we will talk about next month.
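
The strategy above might be sketched as follows. Everything in this Python example is hypothetical: the task names, the slot names, the "Where shall we set course" question and the two tiny parsing functions standing in for real grammars. The point is only that the question being asked selects which small grammar interprets the answer.

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4}

def parse_count(utterance: str):
    """Tiny 'grammar' for combat: find the first number in the answer."""
    for token in re.findall(r"[a-z0-9]+", utterance.lower()):
        if token.isdigit():
            return int(token)
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
    return None

def parse_destination(utterance: str):
    """Tiny 'grammar' for navigation: whatever follows the word 'to'."""
    match = re.search(r"\bto ([a-z0-9 ]+)", utterance.lower())
    return match.group(1).strip() if match else None

# One small, independent grammar per task: (question, slot name, grammar).
TASKS = {
    "fire_phasers": (
        "How many phaser banks shall we fire at the Klingons, captain?",
        "phaser_banks",
        parse_count,
    ),
    "set_course": (
        "Where shall we set course, captain?",
        "destination",
        parse_destination,
    ),
}

def interpret(task: str, answer: str):
    """Interpret the answer using only the grammar of the question just asked."""
    _question, slot, grammar = TASKS[task]
    value = grammar(answer)
    return {slot: value} if value is not None else None

print(interpret("fire_phasers", "3"))
print(interpret("set_course", "Let's go to Gamma Epsilon IV"))
print(interpret("set_course", "3"))
```

The same answer, "3", means three phaser banks in one context and nothing at all in the other, which is exactly the ambiguity the question-answer structure removes.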

I believe that just about any type of game short of dialogue-intensive interactive fiction can be handled through NLU. Real-time strategy, with its simple unit-based orders, is a natural fit.

Further Reading

If you care to find out more, take a peek at James Allen's book, Natural Language Understanding, published by Addison-Wesley. The second edition, released in 1995, is current. You may also want to look up the proceedings of several very active conferences, including those organized by the Association for Computational Linguistics and the IEEE-sponsored International Conference on Spoken Language Processing.

Next month, we will close out the series by discussing how to build a complete dialogue into an application and give the computer as realistic an interaction style as possible. See you there!

Bio

François-Dominic Laramée is a freelance interactive game designer, developer and producer. He has been involved in one form or another of the game industry, whether PC, console, online, set-top box or even play-by-mail, for the past decade, including more than 8 years of experience as Head of Studio, Game Designer, Software Engineer, Producer and Quality Assurance Manager in the interactive entertainment and spoken dialogue interfaces industries. Learn more on his website at: http://pages.infinit.net/idjy/


 

GIGnews is a publication of GIGnews.com, Inc.
"Get In the Game" is a registered trademark used with permission.

© 1999-2005 GIGnews.com, Inc.