Prospects for in-depth story understanding by computer

by Erik T. Mueller

November 29, 1999

Abstract
While much research on the hard problem of in-depth story understanding by computer was performed starting in the 1970s, interest shifted in the 1990s to information extraction and word sense disambiguation. Now that a degree of success has been achieved on these easier problems, I propose it is time to return to in-depth story understanding. In this paper I examine the shift away from story understanding, discuss some of the major problems in building a story understanding system, present some possible solutions involving a set of interacting understanding agents, and provide pointers to useful tools and resources for building story understanding systems.

Introduction

In 1976 John McCarthy wrote a memo discussing the problem of getting a computer to understand the following story from the New York Times:

A 61-year old furniture salesman was pushed down the shaft of a freight elevator yesterday in his downtown Brooklyn store by two robbers while a third attempted to crush him with the elevator car because they were dissatisfied with the $1,200 they had forced him to give them.
The buffer springs at the bottom of the shaft prevented the car from crushing the salesman, John J. Hug, after he was pushed from the first floor to the basement. The car stopped about 12 inches above him as he flattened himself at the bottom of the pit.
... (McCarthy, 1990, p. 70)

In 1999 the problems raised by McCarthy remain unsolved. No program can understand the above story at a level anywhere near that of a human reader. Only a limited understanding is currently possible.

Information extraction technology (Chinchor, 1999), for instance, can be used to read the above story and fill a template with information such as the following:

Incident: Location          United States: Brooklyn (CITY)
Incident: Type              Robbery
Perpetrator: Individual ID  "two robbers"
Human Target: Name          "John J. Hug"
Human Target: Description   "61-year old furniture salesman"

Given a particular class of events such as terrorist incidents, programs can be built that extract basic information such as the type of incident, perpetrators, and targets. Information extraction technology was refined during a series of DARPA-supported Message Understanding Conferences (MUCs) beginning in 1987 (Grishman & Sundheim, 1996).

Word sense disambiguation technology (Ide & Véronis, 1998) can be used to assign a sense (meaning) to each ambiguous word in a text. Various word sense disambiguation methods have been developed that can read the above story and determine that:

elevator refers to a lift in a building rather than the control surface on the tail of an airplane, and
store refers to a shop rather than a storehouse or memory device.

Given the heavy ambiguity of natural language, word sense disambiguation is essential to understanding. The first conference for evaluating word sense disambiguation systems was held in 1998 (Kilgarriff, Palmer, Rosenzweig, & Tiberius, 1998).

But the type of understanding McCarthy was concerned with is much deeper than information extraction and word sense disambiguation. He proposed that a computer should be able to demonstrate its understanding by answering questions about the above story such as:

Who was in the store when the events began?
Who was in the store during the attempt to kill Mr. Hug?
Who had the money at the end?
What would have happened if Mr. Hug had not flattened himself at the bottom of the pit?
Did Mr. Hug want to be crushed?
Did the robbers tell Mr. Hug their names?
Did Mr. Hug like the robbers, and did they like him?
What would have happened if Mr. Hug had tried to run away?
What can Mr. Hug do to avoid this in the future?
Did Mr. Hug know he was going to be robbed?
Does he know that he was robbed?
How did the robber try to crush him with the car?
How long did the events take?
What crimes were committed?
...
(excerpted from McCarthy, 1990, pp. 70-71)

These questions are beyond the state of the art.

In the 1970s and 1980s a number of researchers worked on the hard problem McCarthy discusses, which I refer to as in-depth story understanding. What happened to this research?

Charniak (1972) built a program for understanding simple children's stories and continued to work on story understanding well into the 1980s, developing marker passing techniques for explaining the motivations of characters. But he then shifted his focus to statistical techniques for syntactic parsing and word sense disambiguation (Charniak, 1993), apparently having run out of steam on story understanding. In 1986 he wrote:

The problem with marker passing is that it is not obvious if it can do the job of finding important inferences in a very large and interconnected database. ... Indeed, [the program] finds a lot of garbage.

Beginning in the 1970s, Roger Schank and his academic descendants (see Belew, 1999) built a number of programs for understanding and answering questions about simple stories (Lehnert, 1978; Schank & Riesbeck, 1981; Dyer, 1983) by tracking typical situations (scripts) and the plans and goals of story characters (Schank & Abelson, 1977). But by the early 1990s, many Schank descendants had drifted away from the story understanding problem. Lehnert (1994) provides an interesting account of her own evolution toward information extraction:

SAM [Script Applier Mechanism] at this time [1975] only knew about one script (whereas people have hundreds or thousands of them). ... In truth, SAM was carefully engineered to handle the input it was designed to handle and produce the output it was designed to produce. ...
SAM was just a prototype. As such, it didn't have to be robust or bullet proof. ...
As a graduate student, I was repeatedly reassured that it was necessary to walk before one could hope to run. Robust systems are nothing to worry over when you're still trying to master the business of putting one foot in front of the other. When I became a professor I said the same thing to my own students, but with a growing unease about the fact that none of these robustness issues had budged at all in 5 or 10 or 15 years. ...
By the time we got into the 90s, portability and scalability had become legitimate issues for basic research. In particular, our government sponsors were increasingly preoccupied with the question of whether or not certain technologies would scale up. It was within this atmosphere that the series of Message Understanding Conferences were first conceived. ...
I'd say that the UMass/MUC-4 system understands terrorism about as well as SAM understood restaurants. ... On the other hand, UMass/MUC-4 runs rings around SAM when it comes to robustness and generality. ...

My guess is that researchers abandoned story understanding for a variety of reasons. Some decided they would rather succeed on an easy problem than fail on a hard problem. Some got tired of the topic or ran out of ideas. Some did not spend enough time coding and debugging and thus were never able to realize their ideas. Some became distracted by easier subproblems. Some wished to develop immediate business applications. Funding pressures and changing fashions played a large role. Whereas in 1983, Dyer could write that "a simple metric, such as 'this program correctly reads all stories 100 words or less in length' is unattainable" (p. 18), by 1990 measurement in natural language processing was all the rage. Though performance measurement is certainly useful for evaluating and comparing systems, it can also skew research. As Shieber (1994) puts it:

Any behavioral test that is sufficiently constrained for our current technology must so limit the task and domain as to render the test scientifically uninteresting. ... The trend among [MUC] entrants over the last several conferences has been toward less and less sophisticated natural-language-processing techniques, concentrating instead on engineering tricks oriented to the exigencies of the restricted task—keyword-spotting, template-matching, and similar methods. In short, this is because such limited tests are better addressed in the near term by engineering (building bigger springs) than science (discovering the airfoil).

With some exceptions (see, for example, Domeshek, Jones, & Ram, 1999), little research on in-depth story understanding has been performed in the last decade. Work has instead concentrated on the easier tasks of information extraction and word sense disambiguation. (I have even built an information extraction system [Mueller, 2000b].) Now that research on these tasks has achieved some success, and indeed may be reaching a point of diminishing returns (for example Ide & Véronis [1998] write about word sense disambiguation that "we may have nearly reached the limit of what can be achieved in the current framework" [p. 27]), I propose it is time to return to the harder problem of story understanding.

In the remainder of this paper, I present one approach for building an in-depth story understanding system. First, I discuss some of the major problems that arise in attempting to build a story understanding system, and some possible solutions. Second, I provide pointers to tools and resources that can be used when building a story understanding system, including part-of-speech taggers, syntactic parsers, lexicons, and commonsense knowledge bases. These tools were not widely available when the story understanding problem was first attacked in the 1970s. My focus in this paper is on understanding the English language.

The hard understanding problem

How does a human understand language and how do we get a computer to understand language? Of course these are separate problems. My primary concern is with getting a computer to understand language using whatever techniques we can invent. Yet we know humans can understand language, so even an inkling of how humans do this can be helpful for building understanding computers.

In this section I present more problems than solutions, but that is the challenge of building a story understanding system.

Managing complexity

To get a computer to understand stories, we have a large problem on our hands. By understanding I do not mean simply generating parse trees, disambiguating words, or filling templates, but being able to answer arbitrary questions, generate paraphrases and summaries, fill arbitrary templates, make inferences, reason about the story, follow reasoning in the story, relate the story to general knowledge, hypothesize alternative versions of the story, look back over the story, and more. Thus our first problem is:

How, as human programmers, do we manage the complexity of building a language understanding system?

Hobbs, Stickel, Appelt, and Martin (1993) claimed that "issues of modularity simply go away" (p. 115) in their abduction-based story understanding system in which all linguistic and world knowledge is uniformly represented as predicate calculus axioms. But this approach suffers from combinatorial explosion problems (Domeshek, Jones, & Ram, 1999, pp. 93-94).

Minsky (1986, pp. 291-299) has hypothesized that the mind understands language simultaneously in multiple realms of thought. Each realm is concerned with a different aspect of the world, such as the physical, personal, or mental. He proposes that "the agencies within each realm can simultaneously apply their own methods for dealing with the corresponding aspect of the common topic of concern" (p. 294).

A growing body of research (Hirschfeld & Gelman, 1994) supports the view that much of human cognition is domain-specific. That is, humans have distinct explanatory frames, skills, and mechanisms for areas such as numbers, substances, middle-sized objects, physical processes, living kinds, artifacts, kinship categories, mental states, and moral beliefs.

This suggests a strategy: Take one area and build an understanding agent that is an expert in understanding that area. Then build a UA that "smashes" the next area (solves it well), and so on, for many areas.

The mind may not divide up the world in the same way as our UAs. But the division into UAs makes it easier for the mind of the programmer to cope with the complexity of the task and build the system.

What areas do we need UAs for? Let's consider what realms are involved for some of the questions McCarthy listed:

Who had the money at the end?
The realm of coercion (since the story stated that the robbers forced Mr. Hug to give them money) and the realm of possession (give).
What would have happened if Mr. Hug had not flattened himself at the bottom of the pit?
The physical realm.
Did Mr. Hug want to be crushed?
The realm of human needs and goals (pain avoidance and survival).
Did the robbers tell Mr. Hug their names?
The realm of communication (tell) and the legal and criminal realms (that robbery is illegal and robbers do not want to get caught).
Did Mr. Hug like the robbers, and did they like him?
The emotional realm.
What would have happened if Mr. Hug had tried to run away?
The emotional and criminal realms (that robbers are often armed and sometimes harm or kill).
Did Mr. Hug know he was going to be robbed? Does he know that he was robbed?
The mental realm.
How did the robber try to crush him with the car?
The physical realm and the device realm (that buttons or levers control an elevator).

In the above cases it is fairly clear what realms are involved. However, deciding what realm a concept falls into, and what UA should handle it, is not always easy:

Should Thea peeked around the tree be handled by the physical UA or the eyesight UA?
Does How long did the events take? involve the "temporal realm"? In order to make an overall judgment about duration, an understanding of the story in many realms is needed: the physical realm (how long it takes to walk from one part of the room to another), the device realm (how long it takes an elevator to descend), the criminal realm (how long a robbery takes), the personal realm (when someone comes to help), and so forth.
Is Who was in the store when the events began? in the physical realm? What about the "salesman realm"? (A salesperson works in a store. Sometimes there are customers in the store and sometimes there are not.) The "robber realm"? (Robbers prefer to strike when there are fewer people present. People might attempt to prevent the robbery or notify the police.)

Thus our next problem is:

What is the "right" division? How do we choose the modules of the system, and once they are chosen, how do we decide where any given piece of functionality should go? Can the understanding task really be modularized? Will this even work technically?

For example, the physical UA makes the inference that Mr. Hug might be crushed if the elevator descends. That Mr. Hug makes a similar inference and knows he might be crushed is an inference made by the mental UA. That Mr. Hug then has the goal to avoid being crushed is an inference made by the human need/goal UA. These UAs seem hopelessly intertwined. It seems futile to break the problem into UAs, since each UA ends up redoing the work of (and having the same expertise as) other UAs.

The answer is that UAs are not isolated entities. Rather, UAs depend on each other since information in one UA may be essential to another UA's understanding. Further, UAs need to come to some sort of agreement on an interpretation. We would like to avoid major inconsistencies where one UA thinks Mr. Hug is stuck in an elevator shaft while another thinks he is selling furniture to a customer.

Managing interaction

How do UAs interact and how do we manage their interaction? This is our next problem:

How do the modules of a language understanding system provide information to one another, make use of this information, and settle on a unified interpretation?

Minsky (1986, p. 294) has proposed a type of agent called a paranome that operates simultaneously on agents in several different realms. In understanding Mary gives Jack the kite, an origin paranome forces agents in the physical, possessional, and psychological realms to consider and agree on the origin of the transfer of the kite. A destination paranome forces agents to consider and agree on the destination of the transfer.

What do the UAs need to agree on when processing the newspaper story? Let's consider some of the alternative hypotheses that might be made by the UAs. Hypotheses that will eventually be ruled out are marked with an asterisk.

personal UA for S

S is furniture salesman
S works at furniture store
S is 61 years old

personal UA for R

R is robber

physical UA

*elevator = control surface
elevator = lift
push = physically push

R pushes S to elevator
R pushes S down elevator shaft
- S falls

crush = compress

R tries to crush S with elevator
- S is between bottom of elevator and ground
- Elevator descends
- S flattens himself
- Elevator hits buffer springs
- Elevator stops

give = hand over

S opens cash register
S grabs $1200
S moves hand to R's hand
S releases $1200
R grabs $1200

*forced = physically force

*R grabs $1200 from S

store = shop

Store is in downtown Brooklyn
Elevator is in store

human need/goal UA for S

Elevator descends

S wants to stay alive
- S wants to avoid being crushed
- Elevator stops
- S avoids being crushed

S wants to keep his job

human need/goal UA for an armed robber R

R is armed
R is not afraid to kill
R wants lots of money

R wants to rob furniture store

coercion UA

forced = coerce

R forces S to give R $1200
R threatens S with gun

*push = impel

*R impels S to go down elevator shaft

device UA

*store = memory device
*elevator = control surface
elevator = lift

R pushes buttons on elevator
- Elevator descends

*car = motor vehicle
car = part of lift

emotion UA for S

S is fearful of R

emotion UA for R

R is dissatisfied with $1200

R is angry at S

possession UA

give = transfer of possession

S gives R $1200
- R has $1200

competition UA

*crush = defeat

Each UA may entertain several alternative hypotheses. For example, the physical UA considers the details of how the robbers obtained the money: Perhaps the robber physically forced the money from the salesman, or perhaps the salesman handed the money to the robber.

Alternatives arise for a number of reasons: Words in the story may have multiple parts of speech (force may be a noun or verb), words may have multiple senses (give may mean any transfer of possession as well as physically handing over and force may mean physical force as well as coercion), sentences may have alternative syntactic parse trees, pronouns may have several possible referents, and so on. Leaving many details unspecified is the norm for natural language.

The UAs need to coordinate and settle on an overall shared interpretation. How is this done?

In general there are hard and soft constraints on the shared interpretation. Constraints may be within a single UA or across UAs. A hard constraint is that the device and physical UAs must agree on whether the sense of elevator is control surface or lift. A soft constraint is that "S is fearful of R" is correlated with "R tries to crush S with elevator."

Let's take a shorter example:

Jim set the milk on the table
1. set = place
2. set = jell
3. milk = glass of milk
4. milk = milk by itself
5. table = dining table
6. table = desk
7. table = table in a document
human need/goal UA for Jim:
8. Jim is thirsty
9. Jim wants to drink
10. Jim is hungry
11. It is mealtime
12. Jim wants to eat meal
13. Jim wants to do something strange
physical UA:
14. Table is in kitchen
15. Table is in dining room
16. Table is in office

The hard constraints are:

One sense per word:
xor(1,2)
xor(3,4)
xor(5,6,7)
One goal:
xor(9,12,13)
One location:
xor(14,15,16)

The soft constraints are:

Placing a glass of milk on the dining table agrees with eating a meal:
and(1,3,5,12)
Placing a glass of milk on the dining table or desk agrees with drinking:
and(1,3,or(5,6),9)
Allowing milk to jell on the table agrees with doing something strange:
and(2,4,5,13)
Goals agree with their motivation:
and(8,9)
and(or(10,11),12)
Dining table agrees with kitchen:
and(5,14)
Dining table agrees with dining room:
and(5,15)
Desk agrees with office:
and(6,16)

Reasonable solutions are:

set = place
milk = glass of milk
table = dining table
It is mealtime
Jim wants to eat meal
Table is in dining room

or:

set = jell
milk = milk by itself
table = desk
Jim wants to do something strange
Table is in kitchen

Solutions may be found in a number of ways: UAs can negotiate directly with one another or through intermediary paranomes. Ferber (1999) discusses protocols for collaboration in a society of agents. A weighted maximum satisfiability (MAX-SAT) algorithm (see, for example, Jiang, Kautz, & Selman, 1995) might be used.

Each UA can use whatever representations are best for understanding its realm. The physical UA might use a combination of spatial propositions, 2-dimensional grids, and 3-dimensional models. The emotion UA might use analog vectors. But in order for UAs to communicate, they must share a language.

If one UA is unsure about two alternatives, it might request the opinion of a UA that is an expert in the area. The trouble is, the UA might not even know the right question to ask the expert UA. It might not even realize the two alternatives exist.

A UA needs to be able to link a part of its representation to a part of a representation in another UA. That is, just as there can be causal links within a UA such as:

R is angry at S BECAUSE
R is dissatisfied with $1200
(in the emotion UA for R)

there can be causal links across UAs:

R is dissatisfied with $1200 (in the emotion UA for R) BECAUSE
S gives R $1200 (in the possession UA)
R pushes S to elevator (in the physical UA) BECAUSE
R is angry at S (in the emotion UA for R)

Linked representations are not necessarily in the same language, though a common language must be used to set up the link.

Managing updates

As new inputs are processed, the solution needs to be updated. This is our next problem:

How does the system update its interpretation as new inputs are received? How are new inputs related to previous inputs? How does the system again settle on a unified interpretation?

Each input provides new information that must be incorporated into the current understanding, and also provides new constraints on the interpretation. Norman (1982) describes an experiment that demonstrates how readers update their hypotheses about what the story is about after reading each sentence:

Dear little thing.
letter, pet
It was nice to feel it again.
pet, toy, indefinite
She had taken it out of its box that afternoon, given it a good brush and rubbed life back into the dim little eyes.
pet, toy
Little rogue! Yes, she really felt that way about it.
pet, toy
She put it on.
clothing, fur
Little rogue, biting its tail just by her left ear.
clothing, fur
When she breathed, something gentle seemed to move on her bosom.
fur
The day was cool and she was glad she had decided on her little fur.
fur (adapted from pp. 92-94)

The UAs must negotiate an agreement about how to relate and merge events in the discourse: For example, the second sentence of the newspaper story states that Mr. Hug was pushed from the first floor to the basement. The physical UA must realize that this is the same event as the one mentioned in the first sentence: that Mr. Hug was pushed down the shaft of a freight elevator. The physical UA has enough knowledge of the physical world to make this connection, but other UAs may not.

Managing possibilities

How long should UAs hold onto possibilities? That is:

How many alternative interpretations does the language understanding system maintain? When does it drop possibilities?

The trouble with keeping possibilities is that every possibility leads to more possibilities and we get a combinatorial explosion. The trouble with dropping unlikely possibilities is that those possibilities may later prove correct.

There are two ways the system could recover from a poor choice about what to drop: The system could fix the current interpretation on-the-fly, or it could back up to an earlier point in the story and reread (as humans sometimes do).

Managing detail

The next problem is:

How much detail does the language understanding system compute while reading the story? How much does it compute on demand during question answering (and other tasks)?

The more we compute as we read, the faster we can answer questions. The less we compute as we read, the faster we can read.

For example, some of the questions about the newspaper story can be answered based on information that was computed by the UAs while reading the story:

Who had the money at the end?
possession UA: R has $1200
Did Mr. Hug want to be crushed?
human need/goal UA for S: S wants to avoid being crushed
How did the robber try to crush him with the car?
physical UA: S is between bottom of elevator and ground
device UA: R pushes buttons on elevator
device and physical UAs: Elevator descends
How long did the events take?
Durations are calculated by many UAs and these are constrained to be in rough agreement.

Other questions require further reasoning:

What would have happened if Mr. Hug had not flattened himself at the bottom of the pit?
The physical UA reasons that Mr. Hug may have been crushed.
What would have happened if Mr. Hug had tried to run away?
The emotional and human need/goal UAs reason about the likely behavior of the robbers.
Does he know that he was robbed?
A mental state UA infers that he knew.

How much inferencing humans perform during reading is a topic of considerable debate (McKoon & Ratcliff, 1992), but it seems clear that we perform varying amounts of processing depending on the situation: When we glance at a story our understanding is similar to that of an information extraction system (incident: robbery, location: Brooklyn). When reading for pleasure we imagine the details of the setting. When reading a mystery we might engage in an elaborate stream of thought trying to crack it. How much energy we devote depends on our reasons for reading the story, our mood, and other factors. An in-depth story understanding system should have the same flexibility.

We can set a baseline level of understanding for the system by providing a list of questions that each UA should be able to answer quickly. (These baseline questions could eventually be used in a MUC-like evaluation for story understanding systems.) The baseline questions for the human need/goal UA for a person P are:

Why did P perform an action or cause a state?
How did P react to an action or state?
What did P expect to happen when performing an action or causing a state?
What was P's goal?
Did P's goal succeed or fail?

The baseline questions for the emotion UA for P are:

How did P feel?
Why did P feel a state?
How did P's emotions/feelings affect P's actions?
Who/what did P like/dislike?
Why did P like/dislike someone/something?
How did P liking/disliking someone/something affect P's actions?

The physical UA should create physical models (propositional, 2-d, and/or 3-d) for the events in the story and be able to answer the questions:

Where was someone/something?
Where did someone/something go?
Where did someone/something come from?
How did someone/something move?
Who/what was near/in/on/... someone/something?
Where was someone/something in relation to someone/something else?

The baseline questions for the possession UA are:

Who had something before an action?
Who had something after an action?
Who transferred something to someone?

The baseline questions for the competition UA are:

Were the goals of person1 and person2 in concord or conflict?
What were those goals?
What was the outcome?

Other baseline questions that involve multiple UAs are:

What time of day was it?
How long did something take?
Why did an action or state occur?
Who/what performed an action or caused a state?
What is the theme of the story?

Managing processing

A deeper level of understanding is possible if the system asks itself questions and tries to answer them (Ram, 1994). The answers lead to further questions and answers. The system considers alternatives and the consequences of those alternatives. Eventually it decides what it thinks happened.

A problem that arises is how to control this process. While humans have an amazing ability to cut straight to the heart of the matter, computers often get lost. That is:

How does the language understanding system decide what to consider next? How does it decide what not to consider? How does it know when it has reached a stumbling block? How do we trigger and direct processing?

If the system considers alternatives, and those alternatives lead to yet other alternatives, the system easily gets bogged down. When reading the newspaper story, it has a number of opportunities to get lost in fruitless lines of thought:

What was the layout of the store?

Was the elevator in the same room as the cash register?

Maybe there were two rooms.
Maybe there were three rooms.
Maybe there were four rooms.
...

Why did the robbers push him down the elevator shaft?

Were the robbers unarmed?
...

How did Mr. Hug get into the furniture business?

Maybe his family was in the business.
...

Why didn't somebody call the police?

Why was nobody in the store?

Maybe it was the weekend.
Maybe it was 8 pm.
Maybe it was 9 pm.
Maybe it was 10 pm.
...

Why was a small amount of money in the cash register?

Maybe they had just taken cash to the bank.
Maybe the store was not doing well.

Why was the store not doing well?
- Maybe their prices were too high.
- Maybe they did not have a good selection.
- ...

Where do the robbers live?

Maybe they live in Brooklyn.
Maybe they live in the Bronx.
Maybe they live in Manhattan.
...

Whether the above streams of thought are in response to questions generated by the system or questions provided as input, we have the same problem: how to control processing.

Minsky (1986, p. 59) has proposed dividing the brain into an A-Brain and a B-Brain. The A-Brain is hooked up to the external world and the B-Brain watches and advises the A-Brain. He suggested the following B-Brain rules:

A seems disordered and confused.
A appears to be repeating itself.
A does something B considers good.
A is occupied with too much detail.
A is not being specific enough. Inhibit that activity.
Make A stop. Do something else.
Make A remember this.
Make A take a higher-level view.
Focus A on lower-level details.

Processing in the language understanding system needs to be carried out by A-Brain tasks that are subject to control (starting, reducing priority, stopping) by a B-Brain. In addition to the domain-independent rules above, the B-Brain can contain rules particular to the reading task:

If A is spending too much time considering alternative spatial layouts and spatial layout is not crucial to the story, make A stop.
If A is spending too much time on a task that does not relate to the purpose of reading the story, make A stop.
If A is confused, tell A to go back and reread. A may have ruled out a possibility earlier that needs to be reconsidered.

The basics of understanding

I now provide some pointers to tools and resources to support the UAs described in the previous section. I review a number of off-the-shelf programs, emphasizing what is readily available.

Recognizing textual entities

Various textual entities in the input text must be recognized. Examples in the newspaper article are:

Words: was, pushed
Phrases: freight elevator, buffer springs
Times: yesterday
Places: downtown Brooklyn
Names: John J. Hug
Numbers: $1,200, 12 inches

These entities may be detected using various techniques. Regular expressions and pattern matching are often used. Apple Data Detectors (Nardi, Miller, & Wright, 1998) use context-free grammars. Silberztein (1997) uses finite-state transducers for parsing spelled out numbers such as four thousand five hundred twenty three.

Reynar and Ratnaparkhi (1997) and Palmer and Hearst (1994) provide methods for recognizing the ends of sentences.

ThoughtTreasure (Mueller, 1998a, pp. 81-104) provides text agents for recognizing lexical entries, names, places, times, telephone numbers, media objects, products, prices, communicons, and email headers.

Tagging

Taggers can be used to determine the most likely parts of speech for each word in a sentence. The de facto standard tagger is Eric Brill's (1994) freely available rule-based tagger. This tagger's accuracy is about 97 percent. A best tagger provides a single guess per word while an n-best tagger provides several guesses per word. The result of running the Brill best tagger on the first sentence of the news story is:

A/DT 61-year/JJ old/JJ furniture/NN salesman/NN was/VBD pushed/VBN
down/RP the/DT shaft/NN of/IN a/DT freight/NN elevator/NN yesterday/NN
in/IN his/PRP$ downtown/NN Brooklyn/NNP store/NN by/IN two/CD
robbers/NNS while/IN a/DT third/JJ attempted/VBD to/TO crush/VB
him/PRP with/IN the/DT elevator/NN car/NN because/IN they/PRP were/VBD
dissatisfied/JJ with/IN the/DT $1,200/CD they/PRP had/VBD forced/VBN
him/PRP to/TO give/VB them/PRP

The part-of-speech abbreviations used in the first line above are:

DT  determiner
JJ  adjective
NN  noun, singular or mass
VBD verb, past tense
VBN verb, past participle

Silberztein (1997, pp. 198-201) argues against the use of tagging and for exhaustive lexical analysis: Even with 97 percent accuracy per word, the accuracy for a 10-word sentence is only 73.7 percent. If one word in a sentence is incorrectly tagged, syntactic parsing is likely to fail. Taggers also do not take into account phrases. Given All of a sudden Jim set the French fries on the table, the Brill tagger produces:

All/DT of/IN a/DT sudden/JJ Jim/NNP set/VBD the/DT French/JJ fries/NNS
on/IN the/DT table/NN

It makes life easier on the syntactic parser if all of a sudden is treated as a single adverb and French fries as a single noun. ThoughtTreasure (Mueller, 1998a, pp. 261-271) contains a dictionary of inflections and mechanisms for inflectional and derivational morphology. Technology for exhaustive lexical analysis is also available from Teragram.

Syntactic parsing

A syntactic parser takes text or tokenized text such as Jim set the milk on the table and converts it into parse trees such as:

[S
 [NP [Name Jim]]
 [VP
  [V set]
  [NP [Det the] [N milk]]
  [PP [Prep on] [NP [Det the] [N table]]]]]

Many syntactic parsers have been built. Some that can be downloaded are:

Name Formalism English lexical entries

Link Grammar Parser (Sleator & Temperley, 1993) link grammar 25,000

ThoughtTreasure 0.00022 (Mueller, 1998b) CYK primordial soup 35,000

XTAG (XTAG Research Group, 1999) FB-LTAG 105,000

Full syntactic parsing is often slow and subject to combinatorial explosions resulting from ambiguity. For example, prepositional phrase attachment ambiguity leads to an alternative parse of the above sentence:

[S
 [NP [Name Jim]]
 [VP
  [V set]
  [NP [Det the]
      [N milk]
      [PP [Prep on] [NP [Det the] [N table]]]]]]

Partial parsing is therefore often used: Instead of returning a complete list of all the possible parses of a sentence, the parser returns a list of parse fragments (Bangalore, 1997).

Semantic parsing

A semantic parser converts input sentences or syntactic parse trees into semantic representations such as frames or logical formulas. These representations provide a surface-level understanding and serve as a point of departure for a more in-depth understanding.

Given Jim set the milk on the table, the ThoughtTreasure semantic parser (Mueller, 1998a, pp. 125-173) produces something like:

[set-on Jim1 milk1 table1]

The UNITRAN parser (Dorr, 1992) produces something like:

[CAUSE
  ([Thing JIM],
   [GO Ident
       ([Thing MILK],
        [TOWARD Ident
                ([Thing MILK],
                 [ON Ident
                     ([Thing MILK],
                      [Thing TABLE])])])])]

The Cyc-NL semantic parser (Burns & Davis, 1999) produces something like:

(and (isa Action1 SettingAnObject)
     (isa Milk1 Milk)
     (isa Table1 Table)
     (isa Jim1 Person)
     (objectActedOn Action1 Milk1)
     (onLocation Action1 Table1)
     (performedBy Action1 Jim1))

A semantic parser requires information about argument structure. For a verb this consists of information about what noun phrases, prepositional phrases, adjective phrases, and clauses can be provided as arguments, and how those arguments map to thematic roles (slots) of the output frame. For the verb set the lexicon contains something like:

NP1 set NP2 on NP3
=>
SettingAnObject
performedBy NP1
objectActedOn NP2
onLocation NP3

Lexicons containing broad coverage of argument structure and thematic roles useful for semantic parsing are not widely available: Mitton's (1992) lexical database contains argument structure information for 23,227 verbs (of a total of 70,646 lexical entries). However, the database does not distinguish word senses and does not specify argument structure in detail. It indicates, for example, that a verb takes a prepositional phrase argument, but not what particular prepositions are used.

Derived from Mitton's database, Comlex Syntax (Grishman, Macleod, & Meyers, 1994) contains detailed syntactic information for 38,000 English words (including 21,000 nouns, 8,000 adjectives, and 6,000 verbs). Comlex Syntax uses 92 features for describing verb argument structure.

WordNet (Fellbaum, 1998) uses 34 argument structure patterns and associates each verb sense with one or more of these patterns. However, detailed information such as what preposition is used is not always provided. Kohl, Jones, Berwick, and Nomura (1998) have extended WordNet to 226 argument structure patterns and entered these patterns for 2,600 verbs, but they are not yet part of the general release. These patterns also provide the mapping from arguments to thematic roles.

Levin (1993) argued that the syntactic behavior of a verb is determined to a large degree by its meaning. Levin classified 4,183 verbs into 191 classes with similar meaning and syntactic behavior. This work is often consulted when building lexicons for semantic parsers, but not used directly: In coding the argument structure for 4,000 verbs, Burns and Davis (1999) encountered a number of exceptions and concluded that the classification "does not completely obviate entering single words by hand" (p. 16). Kohl, Jones, Berwick, and Nomura (1998) used the example sentences and annotated them in Prolog from scratch instead of using the proposed classes.

A semantic parser also incorporates selectional constraints that specify what types of arguments are permitted. Selectional constraints may be on concepts, as in:

bark prefers a dog as a subject
eat prefers food as an object

Selectional constraints may also apply to specific words:

a yellow lemon is called yellow
yellow hair is called blond

Anaphora

A story understanding system must resolve various anaphoric entities into the objects to which they refer. Examples of anaphoric entities are pronouns (she, they), possessive determiners (my, his), and arbitrary constructions involving:

adjectives (the pink milk),
genitives (Jim's milk),
indefinite and definite articles (an elevator salesman, the shaft, the buffer springs),
names (John J. Hug), and
relative clauses (the $1,200 they had forced him to give them, the milk that fell on the floor).

Anaphora resolution is a difficult problem. ThoughtTreasure contains tools for recognizing anaphoric entities and suggesting possible referents (Mueller, 1998a, pp. 154-162). ThoughtTreasure partially implements a method proposed by Huls, Bos, and Claassen (1995) for determining the likelihood of each possible referent. Resolution of anaphora is best performed by the UAs.

Integrated parsing

Dyer (1983) argued for integrated parsing in which all parsing processes work in close cooperation. Integrated parsing is valid from a cognitive standpoint since humans are able to understand in real time as words are read or heard; there is no need to wait until the end of a sentence in order to understand it. Integrated parsing also reduces combinatorial explosion since incorrect possibilities can be eliminated sooner rather than later.

However, integration is difficult to achieve since the right information is not always available at the right time.

Suppose we want to integrate anaphor resolution into semantic parsing. In order to resolve a pronoun, its antecedent must be available. But only those antecedents from prior sentences are available inside semantic parsing. Antecedents within the current sentence might not yet be available, since those are the very concepts that semantic parsing is in the process of constructing. (What concepts are available inside semantic parsing depends on where one is in the tree and the order of traversal of branches.)

Suppose we want to integrate semantic parsing into syntactic parsing: A syntactic subtree may not contain all the information needed to parse it. For example, it is more difficult to perform semantic parsing of a verb phrase in isolation, since selectional constraints on the unavailable subject cannot be used to disambiguate the verb sense. Many subordinate clauses cannot be parsed in isolation either since they inherit arguments from their superordinate clauses.

A solution to the syntax-semantics integration problem has been proposed by Mahesh, Eiselt, and Holbrook (1999).

Commonsense knowledge bases

Several databases of commonsense knowledge are being built. Although the eventual goal is to capture the common sense that humans have, the emphasis so far has been on defining the atomic concepts of the world such as car and motor-vehicle and relations on those concepts such as:

ako: Cars are a kind of motor-vehicle.
isa: Boston is a city.
part-of: Cars have ignition-switches.
material-of: T-shirts are made of cotton.
used-for: Cars are used to drive.
used-at: Frying-pans are used in the kitchen.
color-of: Grass is green.
size-of: Chairs are 3 feet tall.
duration-of: Plays last 2 hours.
typical-subject-of: Dogs bark.
typical-object-of: Food is eaten.
implies: Cry implies sad.
causes: Cheer-up causes happy.

The following table shows the commonsense knowledge bases that currently exist and counts of the number of concepts and common relations:

Name Concepts ako/isa part-of/material-of Other

Cyc (Guha & Lenat, 1994) 30,000 25% 35% 40%

Cyc Upper Ontology 2.1 (Cycorp, 1998) 2,846 7,161 0 2,579

Mikrokosmos (Mahesh, 1996) 4,500 - - -

MindNet (Richardson, Vanderwende, & Dolan, 1993) 45,000 47,000 14,100 32,900

SENSUS (Knight & Luk, 1994) 70,000 - - -

ThoughtTreasure 0.00022 (Mueller, 1998b) 27,093 28,818 666 21,821

WordNet 1.6 (Fellbaum, 1998) 99,642 78,446 19,441 42,700

ThoughtTreasure also contains a database of 100 typical situations or scripts (Mueller, 2000a).

A commonsense knowledge base is a useful resource for a story understanding system: The unambiguous concept names can be used for communication between UAs as well as within a single UA. The relational information in the commonsense knowledge base can be used by UAs to recognize and resolve ambiguities.

Most importantly:

The commonsense knowledge base can be evolved along with the story understanding system: Whenever a piece of commonsense knowledge would come in handy in the story understanding system, it can be added to the database. The database can thus be grown to be useful for the story understanding application.

The above databases have various advantages and disadvantages: WordNet (Fellbaum, 1998) was designed as a lexical, rather than a conceptual, database so it lacks links between words in different syntactic categories. For example there is no link between the noun creation and the verb create.

The accuracy of the relations in MindNet is only about 78 percent (Richardson, Vanderwende, & Dolan, 1993) since it was built automatically by parsing a machine-readable dictionary. The concept names also cannot be relied upon since they were obtained by automatic word sense disambiguation (Richardson, Dolan, & Vanderwende, 1998).

Summary

In summary, I propose the following plan for building an in-depth story understanding system:

Feed input text to text agents for recognizing entities such as names and places.
Perform lexical analysis on the input.
Use part-of-speech taggers and word sense disambiguators to reduce possibilities.
Feed textual entities and lexical entries to a partial syntactic parser.
Feed syntactic parse fragments to a semantic parser.
Feed semantic parse fragments to a collection of understanding agents.
Build understanding agents for multiple realms including the physical realm, devices, human needs and goals, emotions, and mental states.
Design and implement mechanisms for understanding agents to negotiate a shared interpretation and renegotiate that interpretation as each input is received.
Build B-Brain mechanisms for controlling processing by understanding agents.
Adapt and extend existing parsers and lexicons.
Evolve existing commonsense databases by adding knowledge as needed.
Build links between existing resources, allowing multiple resources to be used.

Acknowledgments

This research was supported by the MIT Media Laboratory and Signiform.

References

Bangalore, Srinivas. (1997). Complexity of lexical descriptions and its relevance to partial parsing (PhD thesis). University of Pennsylvania, Philadelphia, PA. Available: http://www.cis.upenn.edu/~srini/diss.ps.gz

Belew, Richard K. (1999). AI genealogy database (Online). University of California, San Diego. Available: http://www-cse.ucsd.edu/users/rik/.

Brill, Eric. (1994). Some advances in transformation-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence. pp. 722-727. Menlo Park, CA: AAAI Press and Cambridge, MA: MIT Press. Available: http://www.cs.jhu.edu/~brill/TAGGING_ADVANCES.ps

Burns, Kathy J., & Davis, Anthony R. (1999). Building and maintaining a semantically adequate lexicon using Cyc. In Viegas, Evelyne. (Ed.), Breadth and depth of semantic lexicons. Dordrecht: Kluwer.

Charniak, Eugene. (1972). Toward a model of children's story comprehension (AI Laboratory Technical Report 266). Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Available: ftp://publications.ai.mit.edu/ai-publications/pdf/AITR-266.pdf and ftp://publications.ai.mit.edu/ai-publications/0-499/AITR-266.ps

Charniak, Eugene. (1986). A neat theory of marker passing. In Proceedings of the Fifth National Conference on Artificial Intelligence. pp. 584-588. Menlo Park, CA: AAAI Press.

Charniak, Eugene. (1993). Statistical language learning. Cambridge, MA: MIT Press.

Chinchor, N. (Ed.). (1999). MUC-7 proceedings (Online). Available: http://www.muc.saic.com/proceedings/muc_7_toc.html

Cycorp. (1998). Cyc Upper Ontology 2.1 (Online). Austin, TX: Cycorp. Available: http://www.cyc.com/cyc-2-1/index.html

Domeshek, Eric, Jones, Eric, & Ram, Ashwin. (1999). Capturing the contents of complex narratives. In Ram, Ashwin, & Moorman, Kenneth. (Eds.), Understanding language understanding. pp. 73-105. Cambridge, MA: MIT Press.

Dorr, Bonnie J. (1992). The use of lexical semantics in interlingual machine translation. Journal of Machine Translation, 7(3), 135-193. Available: ftp://ftp.umiacs.umd.edu/pub/bonnie/jmt92.ps

Dyer, Michael G. (1983). In-depth understanding. Cambridge, MA: MIT Press.

Fellbaum, Christiane. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

Ferber, Jacques. (1999). Multi-agent systems. Harlow, England: Addison-Wesley.

Grishman, Ralph, Macleod, Catherine, & Meyers, Adam. (1994). COMLEX Syntax: Building a computational lexicon. In Proceedings of the 15th COLING. Kyoto, Japan. Available: ftp://cs.nyu.edu/pub/html/comlex.html/coling94.ps

Grishman, Ralph, & Sundheim, Beth. (1996). Message Understanding Conference - 6: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics. Copenhagen: Ministry of Research, Denmark. Available: http://cs.nyu.edu/cs/projects/proteus/muc/muc6-history-coling.ps

Guha, R. V., & Lenat, Douglas B. (1994). Enabling agents to work together. Communications of the ACM. 37(7), 126-142.

Hirschfeld, Lawrence A., & Gelman, Susan A. (Eds.). (1994). Mapping the mind: Domain specificity in cognition and culture. Cambridge: Cambridge University Press.

Hobbs, Jerry R., Stickel, Mark E., Appelt, Douglas E., & Martin, Paul. (1993). Interpretation as abduction. In Pereira, Fernando C., & Grosz, Barbara J. Natural language processing. pp. 69-142. Cambridge, MA: MIT Press.

Huls, Carla, Bos, Edwin, & Claassen, Wim. (1995). Automatic referent resolution of deictic and anaphoric expressions. Computational Linguistics, 21(1), 59-79.

Ide, Nancy, & Véronis, Jean. (Eds.) (1998). Special Issue on Word Sense Disambiguation. Computational Linguistics, 24(1).

Jiang, Yuejun, Kautz, Henry, & Selman, Bart. (1995). Solving problems with hard and soft constraints using a stochastic algorithm for MAX-SAT. In First International Joint Workshop on Artificial Intelligence and Operations Research. Available: http://www.research.att.com/~kautz/papers-ftp/maxsat.ps

Kilgarriff, Adam, Palmer, Martha, Rosenzweig, Joseph, & Tiberius, Carole. (1998). Proceedings of SENSEVAL (Online). Available: http://www.itri.brighton.ac.uk/events/senseval/PROCEEDINGS/

Knight, Kevin, & Luk, Steve K. (1994). Building a large knowledge base for machine translation. In Proceedings of the Twelfth National Conference on Artificial Intelligence. pp. 773-778. Menlo Park, CA: AAAI Press and Cambridge, MA: MIT Press. Available: http://www.isi.edu/natural-language/resources/aaai94-sensus.ps

Kohl, Karen T., Jones, Douglas A., Berwick, Robert C., & Nomura, Naoyuki. (1998). Representing verb alternations in WordNet. In Fellbaum, Christiane. (Ed.), WordNet: An electronic lexical database. pp. 153-178. Cambridge, MA: MIT Press.

Lehnert, Wendy G. (1978). The process of question answering. Hillsdale, NJ: Erlbaum.

Lehnert, Wendy G. (1994). Cognition, computers, and car bombs: How Yale prepared me for the 1990s. In Schank, Roger C., and Langer, Ellen. (Eds.), Beliefs, reasoning, and decision making. pp. 143-173. Hillsdale, NJ: Erlbaum. Available: http://ciir.cs.umass.edu/info/psfiles/tepubs/cognition3.ps and http://ciir.cs.umass.edu/info/psfiles/tepubs/cognition.ps [no graphics].

Levin, Beth. (1993). English verb classes and alternations. Chicago: University of Chicago Press.

Mahesh, Kavi. (1996). Ontology development for machine translation (Technical Report MCCS 96-292). Computing Research Laboratory, New Mexico State University, Las Cruces, New Mexico. Available: http://crl.nmsu.edu/Research/Projects/mikro/htmls/theoretical.tr-htmls/theoretical-tr.html

Mahesh, Kavi, Eiselt, Kurt P., & Holbrook, Jennifer K. (1999). Sentence processing in understanding: Interaction and integration of knowledge sources. In Ram, Ashwin, & Moorman, Kenneth. (Eds.), Understanding language understanding. pp. 27-72. Cambridge, MA: MIT Press.

McCarthy, John. (1990). An example for natural language understanding and the AI problems it raises. In John McCarthy, Formalizing common sense. pp. 70-76. Norwood, NJ: Ablex. Available: http://www-formal.stanford.edu/jmc/mrhug.html

McKoon, G., & Ratcliff, R. (1992). Inference during reading. Psychological Review. 99, 440-466.

Minsky, Marvin. (1986). The society of mind. New York: Simon and Schuster.

Mitton, Roger. (1992). A description of a computer-usable dictionary file based on the Oxford Advanced Learner's Dictionary of Current English (Online). London: Department of Computer Science, Birkbeck College, University of London. Available: ftp://ota.ox.ac.uk/pub/ota/public/dicts/710/

Mueller, Erik T. (1998a). Natural language processing with ThoughtTreasure. New York: Signiform.

Mueller, Erik T. (1998b). ThoughtTreasure: A natural language/commonsense platform (Online). Available: http://www.signiform.com/tt/htm/overview.htm

Mueller, Erik T. (2000a). A database and lexicon of scripts for ThoughtTreasure. In preparation. Available: http://www.signiform.com/tt/htm/script.htm

Mueller, Erik T. (2000b). Making news understandable to computers. In preparation. Available: http://www.signiform.com/erik/pubs/newsund.htm

Nardi, Bonnie A., Miller, James R., & Wright, David J. (1998). Collaborative, programmable intelligent agents. Communications of the ACM. 41(3), 96-104.

Norman, Donald A. (1982). Learning and memory. San Francisco: W. H. Freeman.

Palmer, David D., & Hearst, Marti A. (1994). Adaptive sentence boundary disambiguation (Report UCB/CSD 94/797). University of California, Berkeley. Available: http://xxx.lanl.gov/abs/cmp-lg/9411022

Ram, Ashwin. (1994). AQUA: Questions that drive the explanation process. In Schank, Roger C., Kass, Alex, & Riesbeck, Christopher K. (Eds.), Inside case-based explanation. pp. 207-261. Hillsdale, NJ: Erlbaum. Available: http://www.cc.gatech.edu/faculty/ashwin/papers/git-cc-93-47.ps.Z

Reynar, Jeffrey C., & Ratnaparkhi, Adwait. (1997). A maximum entropy approach to identifying sentence boundaries. Available: http://xxx.lanl.gov/abs/cmp-lg/9704002

Richardson, Stephen D., Dolan, William B., & Vanderwende, Lucy. (1998). MindNet: Acquiring and structuring semantic information from text (Technical Report MSR-TR-98-23). Redmond, WA: Microsoft Research. ftp://ftp.research.microsoft.com/pub/tr/tr-98-23.doc

Richardson, Stephen D., Vanderwende, Lucy, & Dolan, William. (1993). Combining dictionary-based and example-based methods for natural language analysis (Technical Report MSR-TR-93-08). Redmond, WA: Microsoft Research. Available: ftp://ftp.research.microsoft.com/pub/tr/tr-93-08.doc

Schank, Roger C., & Abelson, Robert P. (1977). Scripts, plans, goals, and understanding. Hillsdale, NJ: Lawrence Erlbaum.

Schank, Roger C., & Riesbeck, Christopher K. (1981). Inside computer understanding. Hillsdale, NJ: Erlbaum.

Shieber, Stuart M. (1994). Lessons from a restricted Turing test. Communications of the ACM. 37(6), 70-78. Available: http://www.eecs.harvard.edu/shieber/papers/loebner-rev-html/loebner-rev-html.html

Silberztein, Max D. (1997). The lexical analysis of natural languages. In Roche, Emmanuel, & Schabes, Yves. (Eds.), Finite-state language processing. pp. 175-203. Cambridge, MA: MIT Press.

Sleator, Daniel, & Temperley, Davy. (1993). Parsing English with a link grammar. In Third International Workshop on Parsing Technologies. Available: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/link/pub/www/papers/ps/LG-IWPT93.ps

XTAG Research Group. (1999). A lexicalized tree adjoining grammar for English (Online). Philadelphia, PA: University of Pennsylvania. Available: http://www.cis.upenn.edu/~xtag/tech-report/tech-report.html

Name	Formalism	English lexical entries
Link Grammar Parser (Sleator & Temperley, 1993)	link grammar	25,000
ThoughtTreasure 0.00022 (Mueller, 1998b)	CYK primordial soup	35,000
XTAG (XTAG Research Group, 1999)	FB-LTAG	105,000

Name	Concepts	ako/isa	part-of/material-of	Other
Cyc (Guha & Lenat, 1994)	30,000	25%	35%	40%
Cyc Upper Ontology 2.1 (Cycorp, 1998)	2,846	7,161	0	2,579
Mikrokosmos (Mahesh, 1996)	4,500	-	-	-
MindNet (Richardson, Vanderwende, & Dolan, 1993)	45,000	47,000	14,100	32,900
SENSUS (Knight & Luk, 1994)	70,000	-	-	-
ThoughtTreasure 0.00022 (Mueller, 1998b)	27,093	28,818	666	21,821
WordNet 1.6 (Fellbaum, 1998)	99,642	78,446	19,441	42,700