Home | Contact | About D.B. | D.B.'s Résumé | Eclectic Editing
From Technology Review, April 1996

Talking Hands

By David Brittan

JUSTINE CASSELL, a professor in the Learning and Common Sense section of the MIT Media Laboratory, is seated at her desk inspecting photos that capture some of her peers in the act of speaking. Her practiced eye zooms in on the hands. "Here's a shot where my colleague is clearly talking about digital books—I would put money on it." And indeed the man appears to be grasping a piece of air the shape of a small hand-held computer. "This person is talking about the interaction between different kinds of experiences," Cassell ventures, noting the splayed fingers on the verge of interlacing. In another photo, a hand seems to pinch imaginary grains of salt. "There's a 'precision' gesture," she says.

Cassell dares to attempt this parlor trick only because of her special knowledge: first, she is acquainted with the speakers, and, second, she has spent a decade studying the rich, if imprecise, language of the hands. Her painstaking observations of videotapes have led her to conclude that the gestures that accompany speech, sometimes dismissed as redundant or extraneous, are anything but. Now she is fine-tuning her theories with a computer model. Animated Conversation, a program that displays two simulated humans conversing and gesturing more or less spontaneously, enables her to test different assumptions and record the outcome. It is, she says, "the first formal model of the relationship between speech and gesture." Encouraged by the results, she believes gestures can play an important role in the future of human-machine interaction. As computers come to be viewed more as conversational partners than mere tools, interfaces that understand hand motions could allow more natural communication between people and machines.

Researchers have identified four different types of gesture associated with speech. "Iconics" are those air pictures that literally represent some aspect of an object under discussion—the hands delineating a digital book, for example. "Metaphorics," exemplified by the salt-pinch of precision and the dovetailed fingers of interaction, are gestures that stand for abstract concepts. "Deictics" (pronounced "dike tics") are pointing motions—they represent the spatial location of persons, places, and things. "Beats" are little waves of the hand that underscore the value of the accompanying speech.

This quartet of gestures is found in every culture, Cassell says, "but what's universal is the types, not the shapes." Metaphorics are especially diverse, probably because metaphors themselves vary so widely from one language or culture to another. When speakers of English tell a story, for instance, they often begin with what Cassell calls a "conduit" gesture—a handing over of information in the form of a package. Chinese speakers, she says, are more likely to spread out their hands, palms down, as if laying out a landscape.

After careful scrutiny of the way ordinary people tell stories, Cassell is convinced that gestures complement, rather than duplicate, speech. "We know we can understand conversations just fine in the absence of gesture; we talk on the phone, no problem. But when gestures are available, people take them into account." For example, when speakers represent two characters or concepts on different hands, listeners find it easier to tell the entities apart.

In one study, conducted with David McNeill, a psycholinguist at the University of Chicago, Cassell asked a storyteller to use incongruous gestures as he related a tale about the cartoon characters Sylvester and Tweety Bird. The narrator bounced his hand up and down when referring to Sylvester coming out the bottom of a drainpipe, for example, and mixed up the hands representing the two antagonists. Listeners detected nothing out of the ordinary, noting only that the narrator was "an animated storyteller." But the mismatched gestures had a potent effect on their understanding: when asked to recall the story, the listeners remembered Sylvester as having bounced down the street instead of emerging from a pipe, and appeared confused about which character did what.

Rules of Thumb

Despite individual and cultural differences, Cassell theorizes that speakers employ gestures in predictable ways. For example, she says, gestures tend to mark information that is new to the listener. In the statement "Joan bought earrings," the new information might be "earrings" (if the listener is aware that Joan bought jewelry but doesn't know what kind) or "Joan" (if the listener knows that earrings were purchased but doesn't know by whom). The speaker would make either an iconic gesture to represent earrings or a deictic gesture to single out Joan.

Iconics are especially apt to appear, Cassell finds, if the entity being discussed is outside the everyday lexicon. "You probably don't make a gesture for the moon even when it's mentioned for the first time, because that's shared knowledge," she says. But test subjects who are asked to retell a story involving a streetcar inevitably make an iconic gesture when they come to the apparatus that connects the trolley to the wire. "We probably can't recover 'pantograph' without some help," says Cassell.

It is the predictability she discerns in human gesturing that has enabled Cassell to assemble her formal model. In Animated Conversation, a system she built with colleagues at the University of Pennsylvania, two on-screen characters, Gilbert and George, engage in unscripted dialogue at a bank. Gilbert has rudimentary knowledge of the duties of a teller. George knows he is a customer who needs $50. As one program creates dialogue, a "gesture generator" applies Cassell's putative rules for when, whether, and how hand motions will occur. "It knows that if an entity is mentioned for the first time, a gesture will be produced," says Cassell. "Next, it looks at the nature of the entity—is it concrete? is it metaphorical? does it have an existence in space, or does it not?" One of four "subgenerators"—iconic, metaphoric, deictic, or beat—then determines the form of the gesture. While deictics and beats are always pointing or waving gestures, iconics and metaphorics are drawn from dictionaries containing many possible gestures—a departure from real-life spontaneity that may eventually be remedied with software that builds new hand motions on the spot.
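
The decision sequence Cassell describes can be pictured with a short Python sketch. This is an illustration only, not the code of Animated Conversation: the names `Entity`, `choose_gesture`, and the two dictionaries are hypothetical, the dictionary entries are invented, and the order in which the entity's properties are checked (with a beat as the fallback) is an assumption the article does not spell out.

```python
from dataclasses import dataclass

# Hypothetical gesture dictionaries; the entries are invented for illustration.
ICONIC_GESTURES = {"check": "trace a small rectangle with both hands"}
METAPHORIC_GESTURES = {"help": "extend open palms toward the listener"}


@dataclass
class Entity:
    """An entity mentioned in the dialogue, with the properties the generator inspects."""
    name: str
    concrete: bool = False
    metaphorical: bool = False
    spatial: bool = False


def choose_gesture(entity, already_mentioned):
    """Return (gesture_type, form), or None when no gesture is produced.

    `already_mentioned` is a set of entity names seen so far; it is updated in place.
    """
    # Rule from the article: a gesture accompanies the first mention of an entity.
    if entity.name in already_mentioned:
        return None
    already_mentioned.add(entity.name)

    # Assumed ordering: iconic, then metaphoric, then deictic, then beat.
    if entity.concrete and entity.name in ICONIC_GESTURES:
        return ("iconic", ICONIC_GESTURES[entity.name])
    if entity.metaphorical and entity.name in METAPHORIC_GESTURES:
        return ("metaphoric", METAPHORIC_GESTURES[entity.name])
    if entity.spatial:
        return ("deictic", "point toward the entity's location")
    return ("beat", "small wave of the hand")


if __name__ == "__main__":
    seen = set()
    for entity in (Entity("check", concrete=True), Entity("check", concrete=True)):
        print(entity.name, "->", choose_gesture(entity, seen))
```

In this sketch the second mention of the check yields no gesture, mirroring the first-mention rule; everything beyond that rule and the four gesture types is a guess at how such a generator might be organized.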

Other body parts get into the act as well: a program designed by Catherine Pelachaud while a postdoctoral fellow in computer science at the University of Pennsylvania governs Gilbert's and George's head nods, eyebrow movements, and shifts in gaze. And the characters' intonation—the rises in pitch that coincide with facial movements and hand gestures—varies according to a program written by Scott Prevost, a post-doc in Cassell's group at the Media Lab.

Witnessing Gilbert and George in action is a bit like watching the "two wild and crazy guys" of vintage Saturday Night Live. George is clueless about how to make a withdrawal, thrusting out his hands imploringly as he asks, "Will you help me get $50?" While a human teller might reach for the alarm button at this point, Gilbert explains that George will need a blank check, which he floridly models for his naive customer. Meanwhile the characters' heads nod constantly, as if on springs, and both sets of eyebrows work overtime. As much as their gestures seem to occur in the right places, the pair are not quite of this world.

All of which is good for science. "I realize there are too many gestures, too much facial expression," says Cassell. "The simulation allows us to see that we don't have the balance quite right."

As she and her colleagues work the bugs out of the model, Cassell is already thinking about the next steps. One is to send Gilbert or George into cyberspace to serve as a speaking, gesturing "avatar," or representative, of a computer user. The other step is to allow the user to communicate, using speech and gestures, with an onscreen agent. This, she believes, would be a milestone on the way to equipping computers with true "social interfaces." "The dominant metaphor for computer-human interaction these days is conversational: you're supposed to be able to talk to the machine, and it's supposed to talk back. But the people who've designed such interfaces thus far haven't fully exploited the metaphor." A real conversational partner, Cassell maintains, must do what humans do when they interact: interpret gestures. "It has to know that when I represent spatial information on both hands I'm drawing a contrast between two things. It has to know that if I say 'He went down the street' and my hands are bobbing up and down, I mean 'bounced,' even though I didn't say that."

A big challenge in designing a gesture-recognizing interface is enabling the computer to see what the user's hands are doing. One possibility relies on work done by Media Lab colleague Alex P. Pentland and his students, who have devised a video system that can follow the orientation of a person's palms. But fingers, which are crucial to gestures, are difficult to track because they are so small. Another solution may emerge from the work of the Media Lab's Neil Gershenfeld on magnetic force fields, which, when hands are thrust into them, "seem to do a good job of picking up changes in hand shapes."

In gesture generation and recognition, Cassell sees liberation. "To get computers to understand us," she says, "humans have learned to act like computers. We shape our behavior to an extraordinary extent when we sit down in front of a computer. But less and less so. Teaching a computer to interpret gestures is just another step toward it shaping its behavior."

A necessary step, or just frosting on the cake? "It may be frosting for people who already use computers a lot—though even they could work faster by getting simultaneous information across," says Cassell. "But consider the still vast set of people who don't use computers. The ability to converse and gesture naturally may open up the world of computing to many more of them."
Justine Cassell's homepage,  Northwestern University