
11 October 2005
Speech Interfaces for Games
Part 3:
Designing a Speech-Based Application
by François-Dominic Laramée
So,
after the first two articles in this series, we have the infrastructure
required to extract word sequences and meaning from spoken utterances,
and transform them into a data structure (often called a "logical
form", by the way) which the computer can digest.
Great. Now, what can we do with it?
Well, if you're lucky enough to benefit
from a perfectly trained user who provides you with every bit of needed
information in intelligible form every time, you can take the logical
form, turn it into a set of commands and go on to do whatever it is that
you are supposed to do with them. However, this is about as unlikely as
a grocery bagger becoming the NFL's MVP... What if the user's request
is incomplete or contradictory? Must you throw your hands up in the air
and give up, or is there some way to do better?
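To make the problem concrete, here is a minimal sketch (in Python) of a logical form and a completeness check that tells the dialogue manager what is still missing. The field names (action, weapon, target, destination) and the REQUIRED_SLOTS table are assumptions invented for this example; they do not come from any particular speech toolkit.

```python
# Hypothetical slot requirements per action; all names are invented for this example.
REQUIRED_SLOTS = {
    "fire": ["weapon", "target"],
    "hail": ["target"],
    "jump": ["destination"],
}

def missing_slots(logical_form):
    """Return the required slots that the logical form does not fill."""
    required = REQUIRED_SLOTS.get(logical_form.get("action"), [])
    return [slot for slot in required if not logical_form.get(slot)]

# "Fire all weapons": the user never said what to shoot at.
request = {"action": "fire", "weapon": "all"}
print(missing_slots(request))  # ['target']
```

An empty result means the request can go straight to the game; anything else is a cue to ask a follow-up question rather than give up.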
A Conversation With Your CPU
Consider the following bits of dialogue:
Scenario #1
Computer: "Captain, do you want to
open hailing frequencies, open fire or activate the Jump Drive?"
User: "Fire all weapons on the
nearest enemy frigate."
Computer: "Fire lasers, missiles
or all weapons?"
User: "All weapons."
Computer: "Fire on a frigate, a
battleship or a freighter?"
User: "I said the nearest enemy
frigate!"
Computer: "There are three enemy
frigates within range: the Saratoga, the Seville and the
San Diego. Which one do you want to target?"
User: "The nearest, damn
you!"
Computer: "I'm sorry, I did not
understand. Please try again."
User: "Malefice! You foul
silicon-based denizen of the tar pits of Hell! A plague on you and
your unnatural family!"
<Ship is broad-sided by an enemy
battle cruiser and destroyed.>
Scenario #2
User: "Computer, fire aft lasers
at the target of your choice, then open a channel to the refueling
ship and ask for a rendezvous at 157 Mark 8."
Computer: "Aye, captain!"
<Computer activates Jump Drive and
plows the ship right into the sun.>
Scenario #3
Computer: "Yes, Captain?"
User: "Do you have any
suggestions?"
Computer: "We are low on fuel, and
aft laser batteries have suffered heavy damage. Do you want me to plot
a course to the starbase?"
User: "What is the status of our
forward batteries?"
Computer: "Loaded and undamaged.
We have 11 more rounds of missiles in storage."
User: "Good. Open fire on the
weakest enemy vessel within range."
Computer: "That would be the
frigate Saratoga, sir. Confirm attack with all forward
batteries?"
User: "Confirmed!"
<Ship opens fire and destroys the
enemy frigate, opening a safe corridor to the starbase.>
Scenario #1 is an example of a
fixed-initiative system, and a rather bad one. (Effective
fixed-initiative systems are in common use in 411 directory assistance;
they are still somewhat brittle and work without human intervention in a
mere fraction of calls, but the phone company still saves fortunes
thanks to them.)
Scenario #1's computer asks all the
questions, handles very specific answers, and doesn't even listen to
anything else. If the user volunteers information to try to speed up the
interaction, the computer will ignore it until the "proper"
time for this bit of data is reached. Only in the simplest of cases (and
with the most docile of users) will a fixed-initiative dialogue achieve
good results quickly and naturally.
Scenario #2 is an example of what happens
when proper error-handling mechanisms are not implemented in a
speech-based application. In this case, the computer fails to understand
the user's very complicated query, defaults to a simple action (i.e.,
moving straight ahead) which is incredibly ill-suited to the context,
and wrecks the game without giving the user time to ask for corrections.
Scenario #3 is a best-case
mixed-initiative dialogue. The computer responds to the captain's
request for suggestions by following a pre-programmed script, which
checks on the ship's condition and determines that "escape"
would probably be preferable to "attack" in this situation;
since there is a starbase nearby, the computer picks "go to
starbase" as a suitable way to implement the "escape"
goal, seizes the initiative of the conversation and asks for the
captain's approval. However, the captain ignores the suggestion, taking
back the initiative, and asks for specific data on his ship's weaponry;
the computer falls back to "slave mode" and provides the
information. Then, when the captain under-specifies his ship's next
target, the computer assumes that the captain wants to fire the forward
batteries, looks for a target in front of the ship and asks for
confirmation before firing.
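One ingredient of mixed-initiative behavior can be sketched in a few lines: keep everything the user volunteers, and only ask about what is still missing. The Python below is a simplified illustration, not a full dialogue manager; the slot names and the QUESTIONS table are assumptions made for this example.

```python
# Hypothetical prompts, in the order a fixed-initiative system would ask them.
QUESTIONS = [
    ("weapon", "Fire lasers, missiles or all weapons?"),
    ("target", "Which ship do you want to target?"),
]

def next_prompt(filled):
    """Return the next question to ask, or None when nothing is missing."""
    for slot, question in QUESTIONS:
        if slot not in filled:
            return question
    return None

def absorb(filled, parsed_utterance):
    """Keep every slot the user supplied, even ones we had not asked about yet."""
    merged = dict(filled)
    merged.update({k: v for k, v in parsed_utterance.items() if v is not None})
    return merged

# "Fire all weapons on the nearest enemy frigate" fills both slots at once.
state = absorb({}, {"weapon": "all", "target": "nearest enemy frigate"})
print(next_prompt(state))  # None: nothing left to ask
```

Scenario #1's computer behaves as if absorb() threw away every slot except the one it had just asked about, which is why the captain has to repeat himself.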
Our job as speech-based application
designers is to get as close to Scenario #3 as possible in as many cases
as possible. Perfect performance is rarely achievable, and the user will
probably have to study the system's quirks and adjust his behavior
accordingly, but never as much as in Scenario #1, or else nobody will
ever play our game!
Dialogue management, the design and
implementation of techniques allowing computers to conduct natural
conversations with humans, is a wide-open research topic, and there is a
lot of work to be done before a truly effective and versatile system can
be commercialized. Fortunately, there are ways for us to cheat and give
our players the illusion of more smarts than are actually in the
machine; I will now discuss a few of them.
Dialogue Phenomena
Natural language understanding works
within the scope of single utterances, which are complicated enough.
When we up the ante to entire conversations, several characteristics of
human speech compound the difficulty.
First, the topic being discussed may be
difficult in itself. There may be lots of information to transmit, or
technical details, or special cases, etc.; after all, if it were easy to
convey the information, we would do it in a single utterance and
wouldn't need dialogue management at all! This is a bit of a circular
argument, but it is valid.
Next, we must contend with ellipsis. When
talking, humans often skip words and phrases that are clearly implied by
the context. For example, if the computer asks "What is our
destination, captain?", an adult is far more likely to respond
"Alpha Centauri" than "Our destination is Alpha
Centauri"; the computer must be able to translate the name of a
star system into a destination. Ellipsis resolution is trickier in the
following bit of dialogue:
User: "What is the status of Saratoga?"
Computer: "Heavily damaged,
sir."
User: "And Kinshasa?"
Here, the computer must remember that the
user's last question was about a ship's status. The new ship name
is not intended as a topic of conversation, but rather as a new object
for that "old" question.
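A minimal sketch of this kind of ellipsis resolution, assuming the parser hands us small dictionaries with hypothetical "ask" and "object" keys, might look like this in Python:

```python
class DialogueContext:
    """Remembers the last complete question so that fragments can reuse it."""

    def __init__(self):
        self.last_query = None  # e.g. {"ask": "status", "object": "Saratoga"}

    def interpret(self, parsed):
        # A complete question: store it and return it unchanged.
        if "ask" in parsed:
            self.last_query = dict(parsed)
            return parsed
        # A bare object ("And Kinshasa?"): reuse the previous question type.
        if "object" in parsed and self.last_query is not None:
            resolved = dict(self.last_query, object=parsed["object"])
            self.last_query = resolved
            return resolved
        return None  # nothing to fall back on; ask the user to rephrase

ctx = DialogueContext()
ctx.interpret({"ask": "status", "object": "Saratoga"})
print(ctx.interpret({"object": "Kinshasa"}))
# {'ask': 'status', 'object': 'Kinshasa'}
```

Remembering only the single most recent question is a deliberate simplification; a real system would likely keep a longer history and discard it when the topic changes.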
Another problem is that of pronoun
resolution. Consider the following dialogue snippet:
User: "Plot a course for Alpha
Centauri."
Computer: "Aye, captain!"
User: "Before we go, please open a
channel to Saratoga, and ask
her to meet us there."
Here, the computer must not only
understand that "her" refers to "Saratoga",
but also that "there" refers to "Alpha Centauri",
which was mentioned in an entirely different utterance.
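A very naive way to approximate this is to remember the most recently mentioned entity of each kind and to map each pronoun to a kind. The Python sketch below makes exactly that assumption; the PRONOUN_KINDS table and the "ship" and "place" categories are invented for this example, and real pronoun resolution is considerably harder.

```python
class EntityMemory:
    """Tracks the most recently mentioned entity of each kind."""

    # Assumed mapping from pronoun to the kind of entity it can stand for.
    PRONOUN_KINDS = {"her": "ship", "it": "ship", "there": "place"}

    def __init__(self):
        self.latest = {}

    def mention(self, kind, name):
        """Record that an entity of this kind was just named."""
        self.latest[kind] = name

    def resolve(self, word):
        kind = self.PRONOUN_KINDS.get(word.lower())
        if kind is None:
            return word               # not a pronoun we know; leave it alone
        return self.latest.get(kind)  # None means no antecedent was found

memory = EntityMemory()
memory.mention("place", "Alpha Centauri")  # from the earlier utterance
memory.mention("ship", "Saratoga")         # from the current utterance
print(memory.resolve("her"), "/", memory.resolve("there"))
# Saratoga / Alpha Centauri
```

Because the memory outlives a single utterance, "there" can still be resolved even though the destination was named two turns earlier.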
Both ellipsis and pronoun resolution are
typically handled by establishing expectations in the dialogue manager.
These rules allow under-specified information to be "promoted"
to more detailed (and useful) status, by assuming common human speech
patterns and working accordingly. For example, if the computer asks:
"Where to, captain?", the expectation rule "Promote star
system name to Destination" will be activated, while the rival
"Promote star system name to <something else>" will not.
Expectations also serve as an error-detection mechanism: if the user's
response does not meet the active expectations, it is quite possible that
his previous utterance was misunderstood and that the system has been
expecting the wrong thing all along.
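As a rough sketch, an expectation rule can be written as a function that promotes a bare entity to whatever role the last prompt was waiting for, and reports whether the expectation was met. The Python below assumes a hypothetical parser output with "kind" and "name" fields; the ACCEPTS table is likewise invented for this example.

```python
# Hypothetical table: which kind of entity each expected role accepts.
ACCEPTS = {"destination": "star_system", "target": "ship"}

def apply_expectation(expected_role, parsed):
    """Promote a bare entity to the role the last prompt was expecting.

    Returns (logical_form, met). met is False when the reply did not
    contain the kind of entity the expectation was waiting for, which
    hints that something was misunderstood earlier.
    """
    wanted_kind = ACCEPTS[expected_role]
    if parsed.get("kind") == wanted_kind:
        return {expected_role: parsed["name"]}, True
    return {}, False

# After asking "Where to, captain?" the manager expects a destination.
reply = {"kind": "star_system", "name": "Alpha Centauri"}
print(apply_expectation("destination", reply))
# ({'destination': 'Alpha Centauri'}, True)
```

When met comes back False, the manager can ask for confirmation or a re-phrasing instead of acting on a guess.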
We're Off To See The Wizard
To appear intelligent, a dialogue manager
must be able to conduct the same basic conversation in a vast number of
different ways. Application designers will therefore build a variety of
conversation paths into the system's controlling mechanism (something
akin to an expert system, or a finite state machine). If a specific
conversation happens to fall outside of these typical scenarios, the
system may attempt to steer it back into one of them, or give up and
resort to a default fixed-initiative alternative; it won't seem as
natural, but it will get the job done.
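Here is a minimal finite-state sketch of that idea in Python: scripted conversation paths live in a transition table, and anything that falls outside them drops back to a fixed-initiative menu. The state names, intent names and table contents are all assumptions made for this example.

```python
# Hypothetical transition table: (state, user intent) -> next state.
TRANSITIONS = {
    ("idle", "ask_suggestion"): "suggesting",
    ("suggesting", "ask_status"): "reporting",
    ("suggesting", "accept"): "executing",
    ("reporting", "order_attack"): "confirming",
    ("confirming", "accept"): "executing",
}

FALLBACK_STATE = "menu"  # fixed-initiative prompt: "Do you want to A, B or C?"

def step(state, intent):
    """Follow a scripted path when one exists, otherwise fall back to the menu."""
    return TRANSITIONS.get((state, intent), FALLBACK_STATE)

# Replay the sequence of intents from Scenario #3.
state = "idle"
for intent in ["ask_suggestion", "ask_status", "order_attack", "accept"]:
    state = step(state, intent)
print(state)  # executing
```

Adding a new conversation path means adding rows to the table, which fits the iterative development style described next.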
In order to build a sufficient number of
dialogue scenarios and a robust grammar, speech application designers
use an iterative technique: create a few simple examples, ask users to
test the system, look at what didn't work, and add new rules and
scenarios accordingly. After a while, the system's built-in scenarios
may cover 90% or more of all user utterances, and asking the user to
re-phrase a difficult sentence will usually be enough to fix the rest.
However, it turns out that users'
behavior changes markedly depending on whether they talk to people or
machines. Because of this peculiar quirk of human nature, face-to-face
interviews are not the most effective way to build speech applications.
Therefore, until the system is mature enough to be used as its own
iterative development mechanism, users will often be placed in front of
dummy applications whose lines are fed by hidden human operators,
located in another room. This way, the user is tricked into acting as he
would with the real system!
This technique has been whimsically named
"Wizard of Oz development", and it is standard practice in
the industry.
Of course, no computer will be able to
resolve a deliberately obfuscated utterance filled with circular
references and confusion, but really, who cares?
Any Way To Learn This Automatically?
Since the state of the art in spoken
dialogue application development largely consists of writing
conversation scripts by hand, one at a time, and hoping that most users
will happen to fall within one of them, researchers have recently begun
to attempt automated learning of intelligent dialogue strategies. The
results, however, are more amusing than impressive.
For example, one very basic experiment
attempted to teach a computer receptionist how to select the department
to which to route incoming phone calls. Correct routings received
bonuses, and incorrect routings were heavily penalized. After training
the system with literally tens of thousands of sample calls, it finally
developed its first "strategy": hang up on everybody, to avoid
bad routing penalties!
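A little arithmetic shows why such a strategy can look attractive to a learner. The figures below are invented for illustration only (the experiment's actual reward values are not given here), and the sketch assumes that hanging up scores exactly zero. Under those assumptions, routing is only worth attempting once accuracy exceeds penalty / (bonus + penalty).

```python
def expected_reward(accuracy, bonus, penalty):
    """Average reward per call when the system always attempts a routing."""
    return accuracy * bonus - (1.0 - accuracy) * penalty

def break_even_accuracy(bonus, penalty):
    """Accuracy above which routing beats hanging up (assumed to score zero)."""
    return penalty / (bonus + penalty)

# Invented figures: +1 for a correct routing, -10 for a wrong one.
print(round(expected_reward(0.8, 1.0, 10.0), 2))   # -1.2
print(round(break_even_accuracy(1.0, 10.0), 3))    # 0.909
```

With these made-up numbers, a system that routes correctly 80% of the time still loses on average, so "never route anything" is the rational policy until accuracy passes roughly 91%.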
All humor aside, automated development of
spoken dialogue applications is a long, long way away.
Final Hints
If you ever decide to write a spoken
interface for a game, keep the following in mind:
- For the computer, merely deciding
when speech begins is a tough job. A recognizer eats just as much
processing power when it is confined to "listener mode" as when it is actively decoding speech,
so if spoken interaction happens in short or infrequent bursts,
you might want to let the recognizer sleep when it is not needed,
and add a GUI device to your application to "activate"
it when needed. This will leave you with more CPU time to spend on
something else the rest of the time.
- Do not ever try to cover 100%
of all conversation scenarios for a specific interaction. It will
take you years and you will undoubtedly end up adding
contradictory rules at some point, which means that any extra
effort past a certain level will actually be detrimental to your
global performance. If your game can go through 85% of
conversations flawlessly and fix most of the rest with a small
number of confirmation or re-phrasing requests, you win.
- Use long, unique words for your
game's important concepts. If you need star names in your game,
pick "Alpha Centauri" and "Aldebaran" and
"Betelgeuse" instead of "Vega" or
"Sol". Speech recognizers prefer big words.
- Try to reach a balance between
splitting the application into many small domains (with small
vocabularies which help recognition and small rule sets which make
correct behavior choices easier) and having a single, versatile
system which can handle any type of user request at any time. The
latter seems more natural, but not if it covers so much ground
that it gets confused all the time!
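The first hint above (letting the recognizer sleep until the player activates it) can be sketched as a thin wrapper around the recognizer. In the Python below, the recognize callable stands in for whatever speech engine you use; its name and one-frame-at-a-time interface are assumptions made for this example, not a real API.

```python
class GatedRecognizer:
    """Forwards audio to the recognizer only while the player has activated it."""

    def __init__(self, recognize):
        self._recognize = recognize  # any callable taking one audio frame
        self.active = False
        self.frames_skipped = 0

    def set_active(self, active):
        """Called by the GUI, e.g. when a 'talk' button is pressed or released."""
        self.active = active

    def feed(self, frame):
        if not self.active:
            self.frames_skipped += 1  # no recognition work is done for this frame
            return None
        return self._recognize(frame)

# A stand-in recognizer that simply upper-cases its input.
gate = GatedRecognizer(lambda frame: frame.upper())
gate.feed("background noise")  # ignored while the gate is closed
gate.set_active(True)
print(gate.feed("fire lasers"))  # FIRE LASERS
```

The frames_skipped counter is only there to make the saving visible during testing; a shipping game would simply not capture audio while the gate is closed.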
All right, that's about it for me. Thanks
for reading, and I'll be looking for your speech-based games at E3 next
year!
Bio
François-Dominic
Laramée is a freelance interactive game designer, developer and
producer. He has been involved in the game industry in one form or
another, whether PC, console, online, set-top box or even play-by-mail,
for the past decade, including more than 8 years of experience as Head
of Studio, Game Designer, Software Engineer, Producer and Quality
Assurance Manager in the interactive entertainment and spoken dialogue
interfaces industries. Learn more on his website at: http://pages.infinit.net/idjy/