11 October 2005 

Speech Interfaces for Games
Part 3:
Designing a Speech-Based Application 
by François-Dominic Laramée


So, after the first two articles in this series, we have the infrastructure required to extract word sequences and meaning from spoken utterances, and transform them into a data structure (often called a "logical form", by the way) which the computer can digest.

Great. Now, what can we do with it?

Well, if you're lucky enough to benefit from a perfectly trained user who provides you with every bit of needed information in intelligible form every time, you can take the logical form, turn it into a set of commands and go on to do whatever it is that you are supposed to do with them. However, this is about as unlikely as a grocery wrapper becoming the NFL's MVP... What if the user's request is incomplete or contradictory? Must you throw your hands up in the air and give up, or is there some way to do better?

A Conversation With Your CPU

Consider the following bits of dialogue:

Scenario #1

Computer: "Captain, do you want to open hailing frequencies, open fire or activate the Jump Drive?"

User: "Fire all weapons on the nearest enemy frigate."

Computer: "Fire lasers, missiles or all weapons?"

User: "All weapons."

Computer: "Fire on a frigate, a battleship or a freighter?"

User: "I said the nearest enemy frigate!"

Computer: "There are three enemy frigates within range: the Saratoga, the Seville and the San Diego. Which one do you want to target?"

User: "The nearest, damn you!"

Computer: "I'm sorry, I did not understand. Please try again."

User: "Malefice! You foul silicon-based denizen of the tar pits of Hell! A plague on you and your unnatural family!"

<Ship is broad-sided by an enemy battle cruiser and destroyed.>

Scenario #2

User: "Computer, fire aft lasers at the target of your choice, then open a channel to the refueling ship and ask for a rendezvous at 157 Mark 8."

Computer: "Aye, captain!"

<Computer activates Jump Drive and plows the ship right into the sun.>

Scenario #3

Computer: "Yes, Captain?"

User: "Do you have any suggestions?"

Computer: "We are low on fuel, and aft laser batteries have suffered heavy damage. Do you want me to plot a course to the starbase?"

User: "What is the status of our forward batteries?"

Computer: "Loaded and undamaged. We have 11 more rounds of missiles in storage."

User: "Good. Open fire on the weakest enemy vessel within range."

Computer: "That would be the frigate Saratoga, sir. Confirm attack with all forward batteries?"

User: "Confirmed!"

<Ship opens fire and destroys the enemy frigate, opening a safe corridor to the starbase.>

Scenario #1 is an example of a fixed-initiative system, and a rather bad one. (Effective fixed-initiative systems are in common use in 411 directory assistance; they are still somewhat brittle and work without human intervention in a mere fraction of calls, but the phone company still saves fortunes thanks to them.)

Scenario #1's computer asks all the questions, handles very specific answers, and doesn't even listen to anything else. If the user volunteers information to try to speed up the interaction, the computer will ignore it until the "proper" time for this bit of data is reached. Only in the simplest of cases (and with the most docile of users) will a fixed-initiative dialogue achieve good results quickly and naturally.
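To make the problem concrete, here is a minimal sketch of a fixed-initiative loop in Python. The prompts, slot names and keyword sets are all made up for illustration; a real system would sit on top of a recognizer and a grammar rather than plain keyword matching.

```python
import re

# Hypothetical questions: (slot name, prompt, keywords accepted for that slot).
QUESTIONS = [
    ("action", "Open hailing frequencies, open fire or activate the Jump Drive?",
     {"hail", "fire", "jump"}),
    ("weapon", "Fire lasers, missiles or all weapons?",
     {"lasers", "missiles", "all"}),
    ("target", "Fire on a frigate, a battleship or a freighter?",
     {"frigate", "battleship", "freighter"}),
]

def fixed_initiative(replies):
    """Ask the questions in a fixed order. Each reply is checked only
    against the keywords of the question currently being asked, so any
    extra information the user volunteers is thrown away."""
    slots = {}
    turns = 0
    replies = iter(replies)
    for slot, _prompt, allowed in QUESTIONS:
        while slot not in slots:
            reply = next(replies, None)
            if reply is None:
                return slots, turns  # the user gave up
            turns += 1
            words = set(re.findall(r"[a-z]+", reply.lower()))
            match = words & allowed
            if len(match) == 1:
                slots[slot] = match.pop()
            # Otherwise the same question is simply asked again.
    return slots, turns
```

Fed the captain's lines from Scenario #1, this loop needs three turns to fill three slots, even though the very first utterance already contained the action, the weapon and the target.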

Scenario #2 is an example of what happens when proper error-handling mechanisms are not implemented in a speech-based application. In this case, the computer fails to understand the user's very complicated query, defaults to a simple action (i.e., moving straight ahead) which is incredibly ill-suited to the context, and wrecks the game without giving the user time to ask for corrections.
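One common safeguard is to refuse to act on a shaky interpretation. The sketch below assumes that the recognizer reports a confidence score between 0 and 1 and that the game can flag certain commands as irreversible; both are assumptions of mine, and the threshold values are arbitrary.

```python
def decide(confidence, irreversible, confirm_below=0.8, reject_below=0.4):
    """Choose what to do with an interpreted command.

    confidence   -- hypothetical recognizer score in [0, 1]
    irreversible -- True if the action cannot be undone (e.g. a jump)
    """
    if confidence < reject_below:
        return "reprompt"   # too unsure: ask the user to rephrase
    if irreversible or confidence < confirm_below:
        return "confirm"    # repeat the command back before acting
    return "execute"
```

Under these rules, a Jump Drive activation would always be read back to the captain first, no matter how sure the recognizer claims to be.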

Scenario #3 is a best-case mixed-initiative dialogue. The computer responds to the captain's request for suggestions by following a pre-programmed script, which checks on the ship's condition and determines that "escape" would probably be preferable to "attack" in this situation; since there is a starbase nearby, the computer picks "go to starbase" as a suitable way to implement the "escape" goal, seizes the initiative of the conversation and asks for the captain's approval. However, the captain ignores the suggestion, taking back the initiative, and asks for specific data on his ship's weaponry; the computer falls back to "slave mode" and provides the information. Then, when the captain under-specifies his ship's next target, the computer assumes that the captain wants to fire the forward batteries, looks for a target in front of the ship and asks for confirmation before firing.
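The last step of that exchange, filling in what the captain left out and asking before acting, can be sketched as a small slot-filling routine. The slot names and the shape of the `defaults` dictionary are hypothetical; in a real game the defaults would come from the game state (which batteries are loaded, which ships are in range, and so on).

```python
REQUIRED = ("action", "weapon", "target")

def next_move(frame, heard, defaults):
    """Merge everything the user just said into the frame, then pick
    the system's next dialogue move.

    frame    -- slots filled by the user so far
    heard    -- slots extracted from the latest utterance
    defaults -- guesses the game is willing to make for missing slots
    """
    frame = {**frame, **heard}  # keep everything volunteered
    missing = [s for s in REQUIRED if s not in frame]
    unknown = [s for s in missing if s not in defaults]
    if unknown:
        return frame, ("ask", unknown[0])
    if missing:
        guess = {s: defaults[s] for s in missing}
        # Guessed values are never executed without the user's approval.
        return frame, ("confirm", {**frame, **guess})
    return frame, ("execute", frame)
```

Unlike the fixed-initiative loop, this routine accepts slots in any order and in any number per utterance, and it only takes the initiative when something is still missing.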

Our job as speech-based application designers is to get as close to Scenario #3 as possible in as many cases as possible. Perfect performance is rarely achievable, and the user will probably have to study the system's quirks and adjust his behavior accordingly, but never as much as in Scenario #1, or else nobody will ever play our game!

Dialogue management, the design and implementation of techniques allowing computers to conduct natural conversations with humans, is a wide-open research topic, and there is a lot of work to be done before a truly effective and versatile system can be commercialized. Fortunately, there are ways for us to cheat and give our players the illusion of more smarts than are actually in the machine; I will now discuss a few of them.

Dialogue Phenomena

Natural language understanding works within the scope of single utterances, which are complicated enough. When we up the ante to entire conversations, several characteristics of human speech compound the difficulty.

First, the topic being discussed may be difficult in itself. There may be lots of information to transmit, or technical details, or special cases, etc.; after all, if it were easy to convey the information, we would do it in a single utterance and wouldn't need dialogue management at all! This is a bit of a circular argument, but it is valid.

Next, we must contend with ellipsis. When talking, humans often skip words and phrases that are clearly implied by the context. For example, if the computer asks "What is our destination, captain?", an adult is far more likely to respond "Alpha Centauri" than "Our destination is Alpha Centauri"; the computer must be able to translate the name of a star system into a destination. Ellipsis resolution is trickier in the following bit of dialogue:

User: "What is the status of Saratoga?"

Computer: "Heavily damaged, sir."

User: "And Kinshasa?"

Here, the computer must remember that the user's last question was about a ship's status. The new ship name is not intended as a topic of conversation, but rather as a new object for that "old" question.
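A minimal sketch of this kind of ellipsis resolution, assuming that logical forms are plain dictionaries with hypothetical `predicate` and `object` keys:

```python
def resolve_ellipsis(fragment, last_question):
    """If the new utterance has no predicate of its own, borrow the
    previous question and overwrite only what the fragment supplies."""
    if fragment.get("predicate") is None and last_question is not None:
        supplied = {k: v for k, v in fragment.items() if v is not None}
        return {**last_question, **supplied}
    return fragment

last = {"predicate": "status", "object": "Saratoga"}
fragment = {"predicate": None, "object": "Kinshasa"}   # "And Kinshasa?"
resolved = resolve_ellipsis(fragment, last)
```

Here `resolved` becomes a status query about Kinshasa, while a complete utterance with its own predicate passes through untouched.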

Another problem is that of pronoun resolution. Consider the following dialogue snippet:

User: "Plot a course for Alpha Centauri."

Computer: "Aye, captain!"

User: "Before we go, please open a channel to Saratoga, and ask her to meet us there."

Here, the computer must not only understand that "her" refers to "Saratoga", but also that "there" refers to "Alpha Centauri", which was mentioned in an entirely different utterance.
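One simple (and admittedly naive) way to get this behavior is to keep a running list of the entities mentioned so far, tagged by type, and to resolve each pronoun to the most recent entity of a compatible type. The pronoun table and the type names below are illustrative assumptions, not a complete treatment of English pronouns.

```python
# Hypothetical mapping from pronoun to the entity type it can refer to.
PRONOUN_TYPES = {"her": "ship", "it": "ship", "there": "place"}

class DialogueContext:
    def __init__(self):
        self.mentions = []  # (type, name) pairs, oldest first

    def mention(self, kind, name):
        self.mentions.append((kind, name))

    def resolve(self, word):
        """Return the most recently mentioned entity of the right type,
        the word itself if it is not a pronoun, or None if nothing fits."""
        kind = PRONOUN_TYPES.get(word.lower())
        if kind is None:
            return word
        for mentioned_kind, name in reversed(self.mentions):
            if mentioned_kind == kind:
                return name
        return None

ctx = DialogueContext()
ctx.mention("place", "Alpha Centauri")  # from the first utterance
ctx.mention("ship", "Saratoga")         # from the second utterance
```

With this context, "her" resolves to Saratoga and "there" to Alpha Centauri, even though the two were mentioned in different utterances.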

Both ellipsis and pronoun resolution are typically handled by establishing expectations in the dialogue manager. These rules allow under-specified information to be "promoted" to more detailed (and useful) status, by assuming common human speech patterns and working accordingly. For example, if the computer asks: "Where to, captain?", the expectation rule "Promote star system name to Destination" will be activated, while the rival "Promote star system name to <something else>" will not. Expectations also serve as an error-detection mechanism: if the user's response does not meet them, then it is quite possible that his previous utterance was misunderstood and that the system was expecting the wrong thing all along.
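Expectation rules can be sketched as a table keyed by the system's last prompt. The prompt identifiers, entity types and role names here are hypothetical.

```python
# last prompt -> (entity type we expect, role to promote it to)
EXPECTATIONS = {
    "ask_destination": ("star_system", "destination"),
    "ask_target": ("ship", "target"),
}

def apply_expectation(last_prompt, entity_type, value):
    """Promote a bare entity to a role if it matches what the last
    prompt asked for. The second return value is True when an active
    expectation was violated, which hints at a misunderstanding."""
    expected = EXPECTATIONS.get(last_prompt)
    if expected is None:
        return {"unresolved": value}, False
    wanted_type, role = expected
    if entity_type == wanted_type:
        return {role: value}, False
    return {"unresolved": value}, True
```

After "Where to, captain?", a bare "Alpha Centauri" is promoted to a destination; a ship name in the same position raises the violation flag instead, so the dialogue manager can double-check what it thinks it heard.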

We're Off To See The Wizard

To appear intelligent, a dialogue manager must be able to conduct the same basic conversation in a vast number of different ways. Application designers will therefore build a variety of conversation paths into the system's controlling mechanism (something akin to an expert system, or a finite state machine). If a specific conversation happens to fall outside of these typical scenarios, the system may attempt to steer it back into one of them, or give up and resort to a default fixed-initiative alternative; it won't seem as natural, but it will get the job done.
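As a rough illustration of the finite state machine approach, the sketch below encodes a handful of conversation paths as a transition table and drops into a fixed-initiative fallback state as soon as the user does something the table does not cover. State and dialogue-act names are invented for the example.

```python
# (current state, user dialogue act) -> next state
TRANSITIONS = {
    ("idle", "request_suggestion"): "suggesting",
    ("suggesting", "accept"): "executing",
    ("suggesting", "query_status"): "reporting",
    ("reporting", "give_order"): "confirming",
    ("idle", "give_order"): "confirming",
    ("confirming", "accept"): "executing",
}
FALLBACK = "fixed_initiative"

def run(acts, state="idle"):
    """Follow the scripted paths; leave them for the fallback state
    on the first act that has no transition."""
    path = [state]
    for act in acts:
        state = TRANSITIONS.get((state, act), FALLBACK)
        path.append(state)
        if state == FALLBACK:
            break
    return path
```

The conversation in Scenario #3 corresponds to the act sequence request_suggestion, query_status, give_order, accept, which stays on the scripted paths from start to finish.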

In order to build a sufficient number of dialogue scenarios and a robust grammar, speech application designers use an iterative technique: create a few simple examples, ask users to test the system, look at what didn't work, and add new rules and scenarios accordingly. After a while, the system's built-in scenarios may cover 90% or more of all user utterances, and asking the user to re-phrase a difficult sentence will usually be enough to fix the rest.
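The "look at what didn't work" step can be partly automated by replaying logged utterances against the current rule set. In the sketch below, regular expressions stand in for real grammar rules, which is a simplification of mine; the point is only the bookkeeping.

```python
import re

def coverage(transcripts, patterns):
    """Return the fraction of logged utterances matched by at least one
    pattern, plus the list of utterances that no pattern matched."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    missed = [t for t in transcripts
              if not any(c.search(t) for c in compiled)]
    covered = len(transcripts) - len(missed)
    rate = covered / len(transcripts) if transcripts else 0.0
    return rate, missed

log = ["Fire all weapons", "What is the status of Saratoga?",
       "And Kinshasa?", "Open fire"]
rate, missed = coverage(log, [r"\bfire\b", r"\bstatus\b"])
```

The `missed` list is the raw material for the next iteration: each entry is either a candidate for a new rule or a case to hand over to a re-phrasing request.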

However, it turns out that users' behavior changes markedly depending on whether they talk to people or machines. Because of this peculiar quirk of human nature, face-to-face interviews are not the most effective way to build speech applications. Therefore, until the system is mature enough to be used as its own iterative development mechanism, users will often be placed in front of dummy applications whose lines are fed by hidden human operators, located in another room. This way, the user is tricked into acting as he would with the real system!

This technique has been whimsically named "Wizard of Oz development", and it is a universal constant in the industry.

Of course, no computer will be able to resolve a deliberately obfuscated utterance filled with circular references and confusion, but really, who cares?

Any Way To Learn This Automatically?

Since the state of the art in spoken dialogue application development largely consists of writing conversation scripts by hand, one at a time, and hoping that most users will happen to fall within one of them, researchers have recently begun to attempt automated learning of intelligent dialogue strategies. The results, however, are more amusing than impressive.

For example, one very basic experiment attempted to teach a computer receptionist how to select the department to which to route incoming phone calls. Correct routings received bonuses, and incorrect routings were heavily penalized. After being trained on literally tens of thousands of sample calls, the system finally developed its first "strategy": hang up on everybody, to avoid bad routing penalties!
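A bit of arithmetic shows why such a strategy can look attractive to the learner. The numbers below are entirely hypothetical (the experiment's actual reward values are not given here), and I am assuming that hanging up earns neither a bonus nor a penalty.

```python
def expected_reward(accuracy, bonus=1.0, penalty=10.0):
    """Average reward per call for a router that is right with
    probability `accuracy` (illustrative bonus and penalty values)."""
    return accuracy * bonus - (1.0 - accuracy) * penalty

def break_even_accuracy(bonus=1.0, penalty=10.0):
    """Accuracy at which routing pays exactly as much as hanging up,
    assuming that hanging up is worth zero."""
    return penalty / (bonus + penalty)
```

With a penalty ten times the size of the bonus, the router has to be right more than 10 times out of 11 before attempting to route a call beats doing nothing, which an untrained system is unlikely to manage.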

All humor aside, automated development of spoken dialogue applications is a long, long way away.

Final Hints

If you ever decide to write a spoken interface for a game, keep the following in mind:

  • For the computer, merely deciding when speech begins is a tough job. A recognizer eats just as much processing power when it is confined to "listener mode", so if spoken interaction happens in short or infrequent bursts, you might want to let the recognizer sleep while it is idle, and add a GUI device to your application to "activate" it on demand. This will leave you with more CPU time to spend on something else the rest of the time.
  • Do not ever try to cover 100% of all conversation scenarios for a specific interaction. It will take you years and you will undoubtedly end up adding contradictory rules at some point, which means that any extra effort past a certain level will actually be detrimental to your global performance. If your game can go through 85% of conversations flawlessly and fix most of the rest with a small number of confirmation or re-phrasing requests, you win.
  • Use long, unique words for your game's important concepts. If you need star names in your game, pick "Alpha Centauri" and "Aldebaran" and "Betelgeuse" instead of "Vega" or "Sol". Speech recognizers prefer big words.
  • Try to reach a balance between splitting the application into many small domains (with small vocabularies which help recognition and small rule sets which make correct behavior choices easier) and having a single, versatile system which can handle any type of user request at any time. The latter seems more natural, but not if it covers so much ground that it gets confused all the time!
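
The first hint above, letting the recognizer sleep until the player activates it, might look something like the following sketch. The `recognize` callable is a stand-in for whatever speech engine you use; nothing here corresponds to a real recognizer API.

```python
class GatedRecognizer:
    """Wraps a recognizer so that audio is only processed while the
    player holds down a (hypothetical) push-to-talk control."""

    def __init__(self, recognize):
        self.recognize = recognize  # stand-in for the real engine
        self.active = False
        self.calls = 0              # how often the engine actually ran

    def set_active(self, pressed):
        self.active = pressed

    def feed(self, audio):
        if not self.active:
            return None             # audio is dropped, no CPU spent
        self.calls += 1
        return self.recognize(audio)
```

A side benefit of this design is that the system no longer has to guess when speech begins: the button press tells it.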

All right, that's about it for me. Thanks for reading, and I'll be looking for your speech-based games at E3 next year!

Bio

François-Dominic Laramée is a freelance interactive game designer, developer and producer. He has been involved in the game industry in one form or another, whether PC, console, online, set-top box or even play-by-mail, for the past decade, including more than 8 years of experience as Head of Studio, Game Designer, Software Engineer, Producer and Quality Assurance Manager in the interactive entertainment and spoken dialogue interfaces industries. Learn more on his website at: http://pages.infinit.net/idjy/


GIGnews is a publication of GIGnews.com, Inc.
"Get In the Game" is a registered trademark used with permission.

© 1999-2005 GIGnews.com, Inc.