McNeill, David (1998) Models of Speaking (To Their Amazement) Meet Speech-Synchronized Gestures.

MODELS OF SPEAKING (TO THEIR AMAZEMENT) MEET SPEECH-SYNCHRONIZED GESTURES

David McNeill

1. INTRODUCTION

The chapters in this volume have generally accepted the argument that speech-gesture integration is basic to language use. But what explains the integration itself? I will attempt to make the case that it can be understood with the concept of a `growth point' or GP (McNeill & Duncan, this volume). It is called a GP since it is a theoretical unit in which principles that explain mental growth -- differentiation, internalization, dialectic, and reorganization -- apply to realtime utterance generation by adults (and children). It is also called a GP since it is meant to be the initial form of a thinking-while-speaking unit out of which a dynamic process of organization emerges. The emergence unpacks the GP into a surface utterance and gesture that articulates its meaning implications.

2. THE GROWTH POINT

A thread that runs through all GP examples that we have analyzed is the following: GPs are the newsworthy elements of the immediate context -- points of departure from the preceding discourse. They are what Vygotsky (1987) called psychological predicates (not necessarily grammatical predicates). The concept of a psychological predicate illuminates the theoretical link between the GP and the immediate context of speaking. This is because the psychological predicate and its context are mutually defining; the psychological predicate:

1. marks a significant departure in the immediate context; and

2. implies this context as a background.

Regarding the GP as a psychological predicate suggests a mechanism of GP formation in which differentiation of a focus from a background plays an essential part. An example that illustrates the phenomenon of differentiation and construction of background is presented in the following. GPs are inferred from speech-gesture synchrony. Let's look at an example to see how this works. The following is the complete description by one speaker of an episode from a cartoon story. This much context is necessary to understand the GPs underlying the various utterances, as I will explain.

2.1. Example

The following bit of a narrative text will be used to illustrate a number of points important for explaining the GP.

Battle Plan for Vivian

(1)     he tries going [[up the insid][e of the drain pipe and]]

                1H: RH rises up

(2)     Tweety Bird runs and gets a bowling ba[ll and drops it do wn the drai]npipe #

               Symmetrical: 2SHs move down

(3)     [[ and / as he's co ming up]

                Asymmetrical: 2DHs, LH holds, RH up

(4)     [[ and the bowling ball's coming d]][own

                Asymmetrical: 2DHs, RH holds, LH down

(5)         he sssw allows it]

                    Asymmetrical: 2DHs, RH up, LH down

(6)         [ # and he comes out the bot tom of the drai][npipe

                    1H: LH comes down

(7)         and he's got thi s big bowling ball inside h][im

                    Symmetrical: 2SHs

(8)         [[and he rolls on down] [into a bow ling all]][ey

                    Symmetrical: 2SHs

(9)         and then you hear a sstri]ke #

                    Symmetrical: 2SHs



To pick one item for analysis, consider the utterance and gesture in (2). My purpose will be to show how this utterance can be explained utilizing GP as the model.

First, to explain the inferred GP itself. The gesture in (2) was made with two symmetrical hands -- the palms loosely cupped and facing downward as if placed on top of a large spherical object, and the hands moved down during the linguistic segments "it do(wn)". The inferred GP is this image of downward movement plus the linguistic content of the "it" (i.e., the bowling ball) and the PATH particle "down". The GP is both image and linguistic categorial content: an image, as it were, with a foot inside the door of language. Such imagery is important, since it grounds the linguistic categories in a specific visuo-spatial context. It may also provide the GP with the property of `chunking', a hallmark of expert performance, in this case speech performance (cf. Chase & Ericsson, 1981), whereby a chunk of linguistic output is organized around the presentation of an image. The downward content of the gesture is a specific case of "down", the linguistic category -- a specific visualization of it -- in which imagery is the context of the category and possibly the unit of performance. The linguistic categorization is also crucial, since it brings the image into the system of categories of the language, which is both a system of classification and a way of patterning action. The speech and its synchronized gesture are the key to this theoretical unit.

2.2. Incorporating context

A GP is a psycholinguistic unit based on contrast, and this concept brings in context as a fundamental component. I will use the terms field of oppositions and significant contrast to analyze the role of contextualization in thinking. A significant contrast and the field of oppositions within which it is drawn are linked meaning structures, under the creative control of the speaker at the moment of speaking. Control by the individual ensures that GPs establish meanings true to the speaker's intentions and memory. The formation of a GP is the highlighting or differentiation of what is novel in such a field. The field defines the significance of the contrast: it establishes what is meaningful about it. The contrast itself is the source of the GP. This whole process of differentiation is meant to be a dynamic system within which new fields are formed and new GPs are differentiated. The background of thinking-for-speaking is constantly being updated, and this is possible since it is a creation by the speaker during the course of the discourse.

2.2.1. Catchments. To explain the utterance in (2) and why it has the growth point we infer, we must consider the complete context above and study how the different parts of it came to be embodied in the utterance. A useful way of organizing this analysis is by means of catchments--a phenomenon first noted by Kendon in 1972. A catchment is defined as a group of gestures with partially recurring features across discourse segments forming a thematic unit. Catchments are objectively recognizable and can potentially support automated detection and monitoring techniques of topic groups. Each gesture in a catchment is simultaneously shaped by its semantic content and its relationship to the catchment, or catchments (since there may be several), of which it is part. At least 3 catchments must be considered to explain (2).

C1. The first is the catchment of one-handed gestures in items (1) and (6). These gestures accompany descriptions of Sylvester's motion, first up the pipe then out of it with the bowling ball inside him. Thus C1 ties together references to Sylvester as a solo force. This one-handed catchment differs from the two-handed gestures, which in turn divide into two other catchments:

C2. Two-handed symmetrical gestures in (2), (7), (8) and (9). These gestures group descriptions where the bowling ball is the antagonist, the dominant force. Sylvester becomes what he eats, a kind of living bowling ball, and the symmetric gestures accompany the descriptions where the bowling ball asserts this power. In (2) the bowling ball is beginning its career as antagonist. The rest of the catchment is where it has achieved its result. A two-handed symmetric gesture form highlights the shape of the bowling ball or its motion, an iconicity appropriate for its antagonist role.

C3. Two-handed asymmetric gestures in items (3), (4) and (5). This catchment groups items in which the bowling ball and Sylvester mutually approach each other in the pipe. Here, in contrast to the symmetric set, Sylvester and the bowling ball are equals differing only in their direction of motion.
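The grouping into C1, C2 and C3 can be sketched computationally. What follows is a minimal illustration, not the procedure used in the analysis above: it assumes each gesture is reduced to the single hand-configuration label given in the transcript (1H, 2SH, 2DH), and the names `GESTURES` and `group_catchments` are hypothetical. Actual catchments rest on partially recurring features and on thematic content, which a single label cannot capture.

```python
from collections import defaultdict

# Hand-configuration labels copied from the transcript annotations:
# 1H = one-handed, 2SH = two hands symmetrical, 2DH = two hands asymmetrical.
GESTURES = {
    1: "1H", 2: "2SH", 3: "2DH", 4: "2DH", 5: "2DH",
    6: "1H", 7: "2SH", 8: "2SH", 9: "2SH",
}

def group_catchments(gestures):
    """Group utterance numbers by a shared gesture feature (hypothetical helper)."""
    groups = defaultdict(list)
    for item, feature in sorted(gestures.items()):
        groups[feature].append(item)
    return dict(groups)

catchments = group_catchments(GESTURES)
for feature, items in catchments.items():
    print(feature, items)
```

Grouping on this one feature recovers the three sets named above: items (1) and (6) for C1, items (2), (7), (8) and (9) for C2, and items (3), (4) and (5) for C3.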

With these catchments (others can be identified in the episode), we can analyze the origins in realtime of the utterance and gesture in (2) in a way that incorporates context as a fundamental component.

The occurrence of (2) in the symmetrical catchment shows one of the factors that comprised its field of oppositions at this point--the various guises in which the bowling ball appeared in the role of an antagonist. This catchment set the bowling ball apart from its role in C3 where the bowling ball was on a par with Sylvester. The significant contrast in C2 was the downward motion of the bowling ball toward Sylvester. Because of the field of oppositions at this point, this downward motion had significance as an antagonistic force. We can write this meaning as Antagonistic Force: Downward toward Sylvester; this was the contrast. Thus, "it down", unlikely though it may be as a unit from a grammatical point of view, was the intellectual core of the utterance in (2)--the "it" indexing the bowling ball and the "down" indexing the significant contrast itself in the field of oppositions.

The verb "drops", therefore, was excluded from this GP. We can explain this as follows. The verb describes what Tweety did, not what the bowling ball did (it went down), and thus was not a significant contrast in the field of oppositions involving the bowling ball. The core idea at (2) was the bowling ball and its action, not Tweety and his. The detailed synchrony of speech and gesture thus also depended on the context at the moment of speaking.

2.2.2. Unpacking. The gesture at (2) however also contrasted with C1--a one-handed gesture depicting Sylvester as a solo force. This significant contrast led to the other parts of the utterance in (2) via a partial repetition of the utterance structure of (1). Contrasting verbal elements appeared in close to equivalent slots (the match is as close as possible given that the verb in (2) is transitive while that in (1) is intransitive):

(1') (Sylvester) up in "he tries going up the inside of the drainpipe"

(2') (Tweety) down in "and Ø drops it down the drainpipe"

The thematic opposition in this paradigm is counter forces--Tweety-down vs. Sylvester-up. Our feeling that the paradigm is slightly ajar is due to the shift from spontaneous to caused motion with "drops". This verb does not alter the counter forces paradigm but transfers the counter force from Tweety to the bowling ball, as appropriate for the gesture with its downward bowling ball imagery.

The significant contrast of (2) with (1), in addition to bringing out Tweety and downness, was thus also the source of "drops". The verb expressed Tweety's role in that contrast and shifted the downward force theme to the field of oppositions about the bowling ball.

2.2.3. One utterance, several contexts. In this way the utterance at (2), though a single grammatical construction, emerged out of two distinct contexts. The two contexts were clamped together by the gesture in (2). This gesture put the bowling ball in a field of oppositions in which the significant contrast was what it did (rather than how it was launched), made the nongrammatical pair, "it down" into a unit, and made a place for the thematic shift of downness to the bowling ball via "drops", thus providing a rationale for this verb choice.

Let's summarize the contexts that brought (2) into being:

1. The field of oppositions in which the significance of the downward motion of the bowling ball was that of an antagonistic force--the contrast of (2) with (3), (4), (5): this gave the growth point core meaning centered on "it down".

2. The field of oppositions in which the significance was the counter forces of Sylvester-up vs. Tweety-down. This gave a sentence schema that included the words "drops", "down", "drainpipe", and the repetition of the sentence structure with Tweety in the subject slot.

The unpacking sequence would have begun with the first contrast. This was the core meaning embodied in the gesture. Unpacking then went to the counter forces context in (1) for information on how to unpack the sentence in (2). The word order ("drops it down") obviously does not correspond to the genetic sequence. This was something more like: "it down" --> "Ø drops", etc.

Understanding that one utterance can clamp together multiple contexts also removes the seeming anachronism of explaining (2) in part via contrasts with utterances yet to come. The speaker, recounting the story from memory, knew that the bowling ball was the ultimate force and also that Sylvester and the bowling ball would first have to approach each other as equals. She structured her narrative around these memories and this was the basis of her contrast at (2). The later sentences themselves were not present; they arose from their own growth points at their own moments of speaking.

2.3. The moral

All of this implies that every utterance, even though a seemingly self-contained grammatical structure, organically contains content from outside of its own structure. This other content ties the utterance to the discourse at the level of thinking. Such a model predicts a context, rather than adds context as a parameter. That two contexts could collaborate to form one grammatical structure also implies that a sense of grammatical form enters into utterances in piecemeal and oblique ways that do not necessarily follow the rule-governed patterns of the utterance's formal linguistic description. The gesture had the role of clamping the contexts together, and this may be one reason why gestures occur in the first place.

3. GOALS BEHIND THINKING

Why did the GP of (2) embody C2 vs. C3 rather than C2 vs. C1? The answer goes beyond thought; it arises from goals. The contrast that led to the GP depended on the speaker's sense of narrative direction and her goal of getting Sylvester out of the pipe with the bowling ball inside. This is beyond thinking and illustrates Vygotsky's concept of an `affective-volitional tendency'. At the end of Thought and Language Vygotsky wrote:

"Thought is not begotten by thought; it is engendered by motivation, i.e., by our desires and needs, our interests and emotions. Behind every thought there is an affective-volitional tendency, which holds the answer to the last `why' in the analysis of thinking" (p. 252).

And in our case too -- the ultimate explanation for why "drops" was excluded was the speaker's goals. Given these goals, her GP was centered on the bowling ball as antagonist and here thinking was a process of differentiating these significant contrasts.

4. COMPARISON WITH INFORMATION PROCESSING MODELS

So, what follows from this GP analysis? What is gained over the straightforward kind of information processing (IP) model put forth by de Ruiter (this volume) that posits a sequence of steps such as the following:

(a) forming a message (`Tweety drops a bowling ball into the pipe'),

(b) formulating a sentence (the one uttered) that expresses the message, and at the same time,

(c) sketching a gesture (the one made)?

For one thing, the GP explains speech-gesture synchrony on genetic grounds without need of additional steps to get the timing right.[2] Synchrony arises in the form of the thought itself. IP models also can explain speech-gesture synchrony (though their success at this has been disputed by Duncan, 1996), but they require positing timing links between speech and gesture outputs.

4.1 The differences

4.1.1. Predicting contexts. A unique accomplishment of the GP model is that it can also `predict' the context of the utterance. The IP model does not make this prediction at all. The GP includes context as a fundamental component. The context is embodied in a field of oppositions. This field for item (2) was worked out above (Section 2.2). Basing the GP on context means that we can `predict' what this context must have been for just this image and lexical category to have congealed into a unit of thinking-for-speaking: namely, the right field of oppositions for this image and lexical content to be contrastive in it. It is at this point that we must consider Vivian's full description above. As shown there, two contexts fed into the utterance at (2). One was the context embodied in the GP proper, the other was the context that fueled the unpacking of the GP (see McNeill & Duncan, this volume). Whether or not other utterances can be analyzed in exactly this way, the point remains: to get a grip on the origins of utterances, it is necessary to view them fundamentally as manipulations of contexts.

4.1.2. Incorporating gesture. Another basic difference between IP models and the GP is the positioning of gesture at the foundation level in the overall architecture of thought and language. In de Ruiter's Sketch, a gesture is an addition to more core-like processes in the model -- conceptualizing, formulating, and articulating. In the GP, gesture is a basic component of thought and language. It is capable of adding material substance without which speech would not occur, or not, at least, occur in the same form (see McNeill & Duncan, this volume).

4.2. Three IP models

Three information processing (IP) models based on Levelt's (1989) Speaking model have been recently proposed with the purpose of extending IP to gesture performance. Gesture was not a phenomenon originally considered in this modular model. Two of the extended versions of the Levelt model have been presented in this volume (de Ruiter, and Krauss, et al.); the third is in Cassell & Prevost (1996).

4.2.1. The common element. All three adopt basically the same strategy, which is to attach a new `module'-- a gesture module -- to the speaking module at some point. Thus all three regard gesture as an outgrowth of more basic operations. All also share the IP design feature of treating contextualized thinking as peripheral, not as a central component of speech performance. Context or `situation knowledge' in the Levelt model is in an external information store to which the modular components have access. Context is treated as background data along with such long-term data as encyclopedic information and models of appropriate discourse. This store is segregated from the conceptualizer-formulator mechanism in its operational mode.

4.2.2. Variations. Though the strategies are the same the models differ in other ways. In particular, they differ in where in the speaking module the new gesture module is affixed. In fact, all possible ways of hooking gesture to a hypothetical central processor have been adopted. De Ruiter's model (like the growth point) takes seriously the implications of speech-synchronized gestures as having an impact on conceptualization during speech; the other models do not take this step, as seen in the following:

Krauss, et al. (this volume) -- pre-conceptualizer link-up: the gesture component is linked to the IP structure preconceptually in working memory.

de Ruiter (this volume) -- conceptualizer link-up: the gesture component is linked to the IP structure at the conceptualizer.

Cassell-Prevost (1996) -- post-conceptualizer link-up: the gesture component is linked to the IP structure at the equivalent of the formulator (called the sentence planner).

4.2.3. Context outside the model. Although there is variation, none of the models provides a way to include context as a fundamental component of speaking. The consequences of this absence can be demonstrated by examining examples cited by the authors themselves.

Cassell-Prevost: The sentence planner plans a sentence like "Road Runner zipped over Coyote"; finds the rheme; chooses a gesture to depict this rheme (driving, say) and puts principal stress on the verb going with this gesture. The process, in other words, starts from the sentence and, in keeping with the peripheral status of the rheme, adds the rheme, the gesture, and finally the stress. A gesture-imbued-with-context approach is more or less the reverse of this. The process starts from the narrative context, as it has been built up at the moment of speaking, and differentiates some meaningful step from this context (this is like the rheme). Unpacking the content of the gesture is the process of working out a surface sentence in such a way that it can carry this meaning into a full linguistic realization.

de Ruiter: A gesture occurred in which a speaker appeared to hold binoculars before her eyes, synchronized with `with binoculars at' (part of a longer utterance in Dutch, "...enne, da's dus Sylvester die zit met een verrekijker naar (Eng: 'with binoculars at') de overkant te kijken", `and eh so that is Sylvester who is watching the other side with binoculars'). de Ruiter claims that to say that `with binoculars at' is the linguistic affiliate of the gesture would be circular. But this is because, in his IP model, context is excluded. With context included, we have another source of information that breaks into the logical circle. Moreover, this incorporation of context into the model amounts to a prediction of what the context would have to have been: a context in which the instrument and direction were jointly the newsworthy content (we don't know what the context in fact was since de Ruiter, thinking in IP terms, doesn't report it). His and the other IP models can't make this type of prediction at all. If the context was not one in which this gesture-speech combination would be newsworthy, we would have the much sought after falsifying observation that, according to de Ruiter, is possible only in the IP framework. This potential disproves the claim that IP models alone are falsifiable.

Krauss et al.: The key step in this model is the use of gesture to assist lexical retrieval: "In our model, the lexical gesture provides input to the phonological encoder via the kinesic monitor. The input consists of features of the source concept represented in motoric or kinesic form. ...These features, represented in motoric form, facilitate retrieval of the word form by a process of cross modal priming." (pp. XX-XX). If there is some delay in retrieving a word, drainpipe, say, a gesture with some of the spatial features of a drainpipe (e.g., `verticality', `hollowness') can be fed into the lexical system and prime the word, and thereby aid retrieval. This mechanism will work to fish out missing words but it does not explain item (2). The word "drops" had features that matched the downward thrusting gesture (DOWNWARD, CURVED, etc.), yet this word was excluded from the gesture stroke via a prestroke hold, the gesture withheld until the word had gone by. Furthermore, the same DOWNWARD and CURVED features matched a different set of lexical features, those of the bowling ball (indexed by "it") and the path ("down"). Why did the gesture target these lexical items and exclude "drops"? We know the answer -- the contextual contrast of C2 vs. C3. Thus, to achieve its lexical retrieval goals, the gesture system must have access to contextual information at its foundational core. But this content is excluded by the modular design of the Krauss, et al. model.

4.2.4. Summary of IP. In all the models described, context has been excluded and this has created a disparity between what the models are able to do and the facts of human language use as we currently understand them. For, if gesture shows anything, speaking and thinking, at their cores, are manipulations of context, both created and interpreted, with meanings and forms in integral contact -- the very antithesis of IP modularity.

5. PREDICTING GESTURE TIMING

The seemingly straightforward question of when gestures occur in relation to speech has been the source of much confusion. Once this confusion is resolved, we can see that the timing of gestures is a point at which GP demonstrates predictive power.

5.1. Myths of speech-gesture timing

5.1.1. The myth of imprecision. This question is: Do gestures time exactly, or only `approximately', with their semantically related speech?[3] De Ruiter invokes the adjective approximate to characterize the temporal relationship of speech and gesture. It is true that hand and other movements and speech segments can be aligned temporally from videotaped recordings with no more accuracy than the frame-replacement rate of the video that one is employing. In de Ruiter's European tapes, this is 25 times a second, or 40 msecs per video frame; in our North American tapes, it is slightly better -- 30 times a second, or 33 msecs per frame. These figures however have no meaning without a standard of comparison. If we think in terms of electronic devices such as computers, a margin of 33 msecs is wide indeed. On this scale, approximate is justified. However, on the scale of human behavioral events and speech events in particular, 33 to 40 msecs warrants a better cognomen. Syllables take 200 or 300 msecs. 33 msecs is shorter than the release of stop consonants (Lehiste & Peterson 1961)[4]. Observations of speech-gesture timing are therefore well within the confines of a syllable and approach the resolving power of the phonetic system itself. The evidence on which the inference of the GP in "Tweety Bird runs and gets a bowling ba[ll and drops it do wn the drai]npipe" was based thus had an accuracy well within the durations of the speech segments being considered. Observations with a 33 to 40 msecs margin of error can be dignified as `precise' if we consider precision proportionately to what is being measured.
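The arithmetic behind these figures is easy to check. The sketch below only restates the numbers given in the text (25 and 30 frames per second, syllables of 200 to 300 msecs); the function name `frame_ms` and the variable names are illustrative assumptions.

```python
def frame_ms(frames_per_second):
    """Duration of one video frame in milliseconds."""
    return 1000.0 / frames_per_second

european = frame_ms(25)        # 40.0 msecs per frame
north_american = frame_ms(30)  # about 33.3 msecs per frame

# Worst case: the coarser frame against the shortest syllable in the text.
worst_fraction = european / 200.0
# Best case: the finer frame against the longest syllable in the text.
best_fraction = north_american / 300.0

print(f"{european:.1f} ms, {north_american:.1f} ms")
print(f"one frame spans {best_fraction:.0%} to {worst_fraction:.0%} of a syllable")
```

On these figures a single frame covers roughly a ninth to a fifth of a syllable, which is the sense in which the measurement falls well within the duration of the segments being timed.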

5.1.2. The myth of asynchrony. There is a second source of confusion. Morrel-Samuels & Krauss (1991), in a widely cited paper, report a median asynchrony (with gesture leading) between gestures and their semantically related speech of three-quarters of a second -- 20 times the resolving power just claimed (Butterworth & Hadar, 1989, have made similar claims). There is no question that speech and gesture can be asynchronous on occasion, but is 750 msecs the rule? I will argue that the three-quarters of a second lead-time that Morrel-Samuels and Krauss have observed is actually not an asynchrony at all. It shows, instead, that movements can be synchronous with something else.

To interpret the three-quarters of a second finding, it is necessary to be clear on what is being measured. Morrel-Samuels & Krauss made use of a motion detection device sensitive to the onsets of movement. What preceded the semantically relevant speech by three-quarters of a second was thus movement onset. A gesture typically passes through more than one temporal phase (Kendon 1972). These phases have been termed preparation, prestroke hold, stroke, poststroke hold, and retraction (see Kita 1990).[5] All but the stroke are optional. The stroke is the meaningful phase of the gesture that is performed with the quality known as `effort' (Dell 1971). The onset of movement coincides with the stroke only when there is no preparation or prestroke hold. Let us consider a typical gesture with a preparation and prestroke hold phase; one such is (2). The speaker's hands began to move upward while she was saying "ball" and then froze in midair during the word "drops" -- a prestroke hold. This example is quite typical. There was no other reason for making the upward movement than to get the hands in position for the upcoming stroke. Thus one can infer from the onset of movement that the preparation movement was the moment at which the imagery content of the GP began to take form in the speaker's on-line thinking.
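The difference between timing a gesture from movement onset and timing it from stroke onset can be made concrete with a small sketch. The timings below are invented purely for illustration (they are not measurements from the recording), and the class and function names are hypothetical; only the phase names and their order come from the text.

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str      # preparation, prestroke hold, stroke, poststroke hold, retraction
    start_ms: int  # invented onset time, for illustration only
    end_ms: int    # invented offset time, for illustration only

# A hypothetical gesture shaped like (2): preparation, prestroke hold, stroke.
gesture = [
    Phase("preparation", 0, 250),
    Phase("prestroke hold", 250, 600),
    Phase("stroke", 600, 900),
    Phase("retraction", 900, 1200),
]

def movement_onset(phases):
    """What a motion detector registers: the start of the first phase."""
    return phases[0].start_ms

def stroke_onset(phases):
    """Start of the stroke, the only obligatory phase and the meaningful one."""
    return next(p.start_ms for p in phases if p.name == "stroke")

lead_ms = stroke_onset(gesture) - movement_onset(gesture)
print(f"movement onset leads the stroke by {lead_ms} ms")
```

In this toy timeline the whole lead is made up of the preparation and the prestroke hold. A measure keyed to movement onset therefore reports a lead even when the stroke itself coincides with its related speech.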

The Morrel-Samuels & Krauss three-quarters of a second finding shows the lead-time of GPs in advance of where, in the median case, the GP is presented in the surface linguistic output. This presentation depends on meeting the well-formedness standards of the linguistic system and where it can unpack the GP content. The three-quarters of a second latency therefore would have nothing to do with retrieving words from the lexicon, but would merely reflect grammatical constraints on the linear sequentiality of speech (this argument joins other appraisals of the lexical retrieval hypothesis, see the chapters by Kita, Nobe, and de Ruiter).

From the phases of gestures we can uncover the dynamics of on-line thinking (cf. Duncan et al., 1995). Far from uncertainty over the temporal relationship of speech to gesture, there is a precisely working system. This system of dynamics enables us to predict gesture timing.

5.2. Predicting gesture timing

The GP predicts timing points based on the concept that individual utterances contain content from outside their own structure. Such hidden content reveals itself in gesture timing:
 

* Stroke. The image embodied in the gesture automatically synchronizes with the linguistic categories that are also part of the GP. In the case of (2), the GP was comprised of downward motion imagery categorized as `bowling ball' and `down'. This timing was derived from the contrast between C2 and C3.
 

* Prestroke hold. When two contexts intersect in a single utterance, the context embodied in the growth point can be delayed while the other context is materialized in lexical form. In the case of (2), the verb "drops" emerged from the C1 vs. C2 contrast and, while it was being articulated, the GP materialization of the C2-C3 opposition was placed on hold.
 

* Preparation onset. This is explained as the moment when the next idea unit starts to take form. When Viv. mentioned the bowling ball in the preceding clause, "Tweety Bird runs and gets a bowling ba[ll and drops it do wn the drai]npipe", she simultaneously began to organize the GP unit in which the bowling ball was conceptualized as opposing Sylvester by coming down. The bowling ball, as we saw, took over the opposing forces paradigm initiated by Tweety. This shift occurred exactly when Viv. mentioned the bowling ball in the preceding context. Thus, the point at which the ball was mentioned in the discourse is the earliest moment for forming the next GP around the bowling ball and its downward path. Again, the timing is derived from the developing context, in this case the contrast between C1 and C2.

Susan Duncan (pers. com.) has pointed out contrasting utterances that demonstrate the systematic shifting of gesture timing as the context varies; the GP predicts this by generating different GPs according to contexts. In the first example, Sylvester is climbing up the outside of the drainpipe -- this action is the new content -- and the gesture stroke synchronizes with the verb "climbs". In the second example, produced just after the first, Sylvester is making his second ascent -- this time on the inside of the pipe. Again the stroke synchronizes with the new content and the verb "climbs" is passed over while the gesture synchronizes with "inside". The gestures themselves are quite similar, thus the timing difference is not due to shifting gesture content but to different fields of opposition and the contrasts within them -- Ways of Reaching Tweety (`climbing the drainpipe') and Ways of Climbing Up the Pipe (`inside'), respectively:[6]

(10)     second part [is he climbs up the drai] to get* <uh> try and get Tweety
                Gesture shows Sylvester moving up the pipe

(11)    tries to climb up in through the [drain* / <nn> / inside the dra inpipe]
                Gesture shows Sylvester moving up the pipe

Inasmuch as predicting the exact timing of two gesture phases from basic principles embodied in the GP counts as support for the model, the timing of the preparation and stroke gesture phases (accurate within 33 msecs!) supports the GP model.[7]

6. GESTURES IN LANGUAGE AND THOUGHT

Why do we make gestures at all? Gestures may be part of thinking. This is the answer that Susan Duncan and I propose in our joint contribution to this volume. Gestures seem to be connected to the novelty or newsworthiness of the content they embody. A new contrast in thought tends to be realized via a gesture of some kind. There are several possible reasons for this:

(a) The gesture adds to the differentiation of the meaning. The mind-body distinction blurs as a body motion enhances a mental difference.

(b) The gesture carries unique meaning components that possibly are critical to the contrast in the field of oppositions.

(c) The gesture is the nexus of the convergence of contexts that help assemble the utterance. If the utterance is pieced together from different contexts, the utterance can't pull the contexts together, but the gesture can.

For at least these reasons, then, gestures are important elements of thinking-for-speaking. They may occur because they are part of thinking itself.[8]

7. CONCLUSIONS

Given the unity of speech and gesture, gesture should be at the center of an IP model of speaking and gesture. Furthermore, the joint pragmatic and semantic content of gesture suggests that contextual elements also should be at the center of the model; and this has proven to be the problem:

1. Gestures and speech production are intimately connected -- `one system'.

2. Gestures embody context as a fundamental component, and this is manifested in their form, meaning, and timing.

3. Therefore, models in which discourse context and propositional content are separated into modules fail to model speech and gesture as they are actually put together. Languages are geared to visuospatial-kinesic cognition, and such models miss this fact totally.

I consider gestures to be a crucial test of speech production models. Models based on modularity fail this test rather spectacularly. Modular models have the opposite effect -- pulling context and gesture/speech as far apart as possible. Is some other kind of module possible -- one that puts context and propositional content in one package? I think not: this would fly against the concept of a subroutine as a device for carrying out recurrent processes; context is precisely that which is not recurrent, but is built up in the process of thinking and speaking, on-line and ad hoc.

I do not advocate an anti-modeling position. On the contrary, I endorse modeling as a tool for cognitive science (cf. de Ruiter, this volume). But if a model is to realize its potential it must be appropriate to the domain being modeled. I hope that someone will find a suitable approach for constructing models of speech and gesture production. I have made clear my belief that any version of IP, including those presented in this volume, is blocked by its own internal organization from ever achieving a successful model of gesture performance. Such modeling cannot be based on other-worldly assumptions such as that there are encapsulated modules at the level of conceptualization and speaking, for these have the contrary-to-fact consequence of forcing the model to exclude context. A dynamic systems approach might offer fewer obstacles. I would invite those with expertise in this area to take up the challenge of modeling speech-synchronized gesture performance, with context as a fundamental component.

NOTES


[1] Preparation of this paper was supported by a grant from the Spencer Foundation.

[2] A similar point was made by Duncan (1996).

[3] There is also an issue of defining what is `semantically related' in speech. I won't discuss this question but will endorse de Ruiter's term, `conceptual affiliate'. Semantically related speech is the sum of speech used to express the conceptual affiliate of the gesture. This might be a word, part of a word, or more than a word, a construction.

[4] I am grateful to Karl-Erik McCullough for this reference.

[5] Gesture phases can be identified by experienced coders with excellent reliability (95%). To ask untrained volunteers to segment gestures into phases, however, as Morrel-Samuels & Krauss did, before they gave it up as `unreliable', inevitably breeds inaccuracies and omissions.

[6] See McNeill and Duncan (this volume, note 10) for the contexts in which these different oppositions took form.

[7] That the preparation phase of a gesture signals the formation of the GP explains why the onset of gesture movement often precedes and never follows the semantically related speech (Kendon, 1972; Morrel-Samuels & Krauss, 1992; Nobe, this volume). It also explains why the extent of gesture anticipation would be negatively correlated with the frequency of use of the lexical affiliate (Morrel-Samuels & Krauss, 1992). This might be caused, as Kita (this volume) argues, not by lexical retrieval difficulties per se, but by conceptual organization complexities manifested in part by a lower-frequency word choice. It would be important to know through what phase(s) the gesture anticipation takes place. If it is due entirely to an extension of the preparation and/or pre-stroke hold phases, this would suggest that the gesture stroke is waiting for the lexical item, not that the gesture is aiding the retrieval itself. Low frequency words take longer to retrieve from one's mental lexicon and thus delay the completion of a surface output that meets well-formedness conditions. The delay does not imply that the speaker is stuck for meaning; it is finding the form that takes longer. Moreover, as de Ruiter (this volume) points out, a message source (in current terms, a GP) is not necessarily manifested in a single lexical item; a phrase or an even more widely distributed surface structure choice might be engaged by a growth point. Thus a GP can have come into being and the speaker not have found a signifier form with which to present it. The `tip of the tongue' phenomenon demonstrates the reality of having categorized meanings in mind while lacking the word forms with which to express them (Brown & McNeill, 1966).

[8] The GP embodies a vision of language that can be located in a language origins scenario. Gesture is a self-generated kind of visual/actional thinking, a meshing of visual thinking with language whereby speaking and visual cognition can be integrated. If the origin of language depended on visuospatial cognition, gesture could have been essential, since gesture creates a form of visuospatial cognition that is not dependent on external stimulation. It empowers the individual to control his own visual representations, free from domination by environmental contingencies. A capacity for language that evolved to create spoken outputs in the presence of visuospatial thinking, yet free of environmental domination, could have selected gestures as part of this evolution, assuring thereby a self-generative capacity for visual thinking with speech. Growth points, then, would have evolved as the fundamental units of controlled visuospatial cognition meshed with language in order to make this unprecedented process work. The brain circuits to support GPs would tend to link the right and left cerebral hemispheres (McNeill & Pedelty, 1994, describe disruptions of gesture-language coordination after right hemisphere injuries due to strokes).