Approaches to representing context and intentions in conversational AI

A review of formalisms that aim to capture intents in language modelling


An important approach to capturing ‘language-independent knowledge structures’ (i.e. knowledge beyond lexical semantics) relevant to dialog are associative network formalisms (e.g. Schank’s (1972, 1975, 1980) conceptual dependency theory). The goal of this kind of formalism is to enable directed and efficient mechanisms that model inference processes based on associations (including ‘causal connections’) between representations grounded in linguistic form, working up across different levels of abstraction (towards modeling higher cognitive functions). This line of work is relevant to dialog systems because such formal models of knowledge representation are indispensable for developing useful symbolic natural language understanding systems that deal with meaning beyond what is ‘directly’ encoded in linguistic form. In talk-in-interaction this is (1) knowledge about the intentions, plans and goals of different agents, and (2) knowledge about the preceding discourse (see Schank’s work and Gillis et al. 2009).
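The core idea of an associative network can be sketched in a few lines: nodes linked by labeled relations, with inference as traversal along those links. The node labels and relation names below are illustrative stand-ins, not Schank’s actual conceptual dependency primitives.

```python
class Node:
    """A concept node in a toy associative network."""
    def __init__(self, label):
        self.label = label
        self.links = {}  # relation name -> set of Nodes

    def link(self, relation, other):
        self.links.setdefault(relation, set()).add(other)

def spread(start, relation_path):
    """Follow a chain of relations from a start node — a crude stand-in
    for association-based inference across levels of abstraction."""
    frontier = {start}
    for rel in relation_path:
        frontier = {n for node in frontier for n in node.links.get(rel, set())}
    return {n.label for n in frontier}

# Toy causal chain: asking is motivated by wanting to know, whose goal is knowing.
ask = Node("ASK")
want_know = Node("WANT-TO-KNOW")
know = Node("KNOW")
ask.link("motivated-by", want_know)
want_know.link("goal", know)

print(spread(ask, ["motivated-by", "goal"]))  # {'KNOW'}
```

Real systems in this tradition add typed primitives, recursive structures and inference rules on top of such a graph; the sketch only shows the associative backbone.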

To date, speech act theory (SAT) and its many variations remain the most popular framework for modeling speaker intention. Within this community, linguistic behaviour is often conceived and modeled as some form of ‘rational’ behaviour. For example, Allen and Perrault (1979) view linguistic behaviour as goal-driven planning: speakers plan an utterance in order to achieve a communicative goal, while their interlocutors aim to infer that goal from linguistic form (see also Cohen and Perrault (1979)). Depending on the specific framework, formal SAT approaches then design formalisms that aim to capture the specific “communicative goals and plans” tied to various linguistic forms and represent this knowledge using, for instance, Type Theory with Records (TTR). This view of linguistic behaviour as goal-oriented and plan-based is still widely adopted in the dialog modeling community (e.g. Kobsa 1989; Carberry 1989; Cohen et al. 1990; see also the QUALM system (Lehnert 1978)). Of course, there is more to linguistic behaviour than this, and these rather simple question-as-database-query models have been criticized for failing to capture more complex dimensions of inference (e.g. Wilensky 1983).
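The plan-based view treats a speech act like a planning operator with preconditions and effects. The sketch below is a simplified, hedged rendering in that spirit (loosely after Cohen and Perrault 1979); the predicate strings and the INFORM operator’s exact conditions are illustrative, not the original formalism.

```python
from dataclasses import dataclass, field

@dataclass
class SpeechAct:
    """A speech act as a STRIPS-like planning operator."""
    name: str
    preconditions: set = field(default_factory=set)
    effects: set = field(default_factory=set)

    def applicable(self, state):
        # The act can be planned only if its preconditions hold.
        return self.preconditions <= state

    def apply(self, state):
        # Performing the act adds its effects to the world/belief state.
        return state | self.effects

# INFORM(s, h, p): speaker s tells hearer h that p (simplified conditions).
inform = SpeechAct(
    name="INFORM(s, h, p)",
    preconditions={"BELIEVE(s, p)", "WANT(s, KNOW(h, p))"},
    effects={"BELIEVE(h, BELIEVE(s, p))"},
)

state = {"BELIEVE(s, p)", "WANT(s, KNOW(h, p))"}
print(inform.applicable(state))  # True
print(inform.apply(state))
```

Plan recognition then runs this machinery in reverse: given an observed utterance, the hearer searches for a goal whose plan would contain that speech act.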

A common way to expand these simple speech act approaches is to add more sophisticated models of context and cotext. For an adequate interpretation of an utterance, the hearer needs to take into account the situational context, the interlocutor’s knowledge of the domain, as well as all sorts of links between the current utterance and those that precede it. Various aspects of ‘context’ have been studied under topics such as reference, anaphora resolution and discourse coherence. Many models in these fields rely on defining larger discourse structures that go beyond turns and sentences, such as schemas or frames. For instance, the pioneering work by McKeown (1985) and Sidner (1983, 1985) on discourse generation is based on formalisms of certain underlying patterns that structure discourse, e.g. “schemas of discourse generation for attaining discourse goals” that both speakers and listeners rely on when jointly producing stretches of talk (Carberry 1989).

What do these ‘discourse formalisms’ look like? Most proposed formalisms for the comprehension and production of discourse still tend to ignore many of these complexities and treat discourse as a product: they provide a discourse grammar that consists of formal discourse rules, together with some kind of augmented transition network (ATN) formalism for analyzing discourse. In contrast, qualitative research traditions of discourse and talk, most prominently conversation analysis (CA), have long advocated that dialog is a joint interactional achievement of multiple parties. In fact, the highly dynamic and complex view of dialog that conversation analysis advocates has yet to find an adequate formal representation, though some attempts have been made to fruitfully integrate insights from CA into dialog modeling (e.g. Luff et al. 1990; Gillis et al. 2009).
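A minimal transition-network discourse grammar can be written as a table mapping (state, dialog move) pairs to successor states. The states, arc labels, and example move sequence below are invented for illustration; full ATNs additionally allow registers, tests, and recursive subnetworks.

```python
# Transition table: (current state, dialog move) -> next state.
DISCOURSE_ATN = {
    ("START", "question"): "Q-RAISED",
    ("Q-RAISED", "clarification-request"): "CLARIFYING",
    ("CLARIFYING", "clarification"): "Q-RAISED",
    ("Q-RAISED", "answer"): "RESOLVED",
}

def parse_discourse(moves, network=DISCOURSE_ATN, start="START"):
    """Return the final state if the move sequence is accepted, else None."""
    state = start
    for move in moves:
        state = network.get((state, move))
        if state is None:  # no arc for this move: sequence is ill-formed
            return None
    return state

print(parse_discourse(["question", "clarification-request",
                       "clarification", "answer"]))  # RESOLVED
print(parse_discourse(["answer"]))                   # None
```

The CA critique applies exactly here: such a grammar treats a question–answer sequence as a fixed product rather than something jointly and contingently achieved turn by turn.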

One of the prerequisites for developing dialog systems in adherence to conversation analytic theory are models that capture and update knowledge about the participants of a dialog, and numerous such modules have been developed as part of dialog systems under the topic of user models (Wilensky et al. 1988; Chin 1989; Morik 1989). In the context of dialog systems, user models are knowledge bases that contain (and update) all aspects of the user that may be relevant to the dialog behaviour of the system and that inform ‘intelligent’ interaction (Allen 1994). User models are crucial modules of dialog systems that, by storing relevant information about the interlocutors, enable more appropriate linguistic behaviour through recipient-tailored language generation. The knowledge that is gathered during interaction and stored in user models is kept separable from the rest of the system, as it has to be dynamically reconstructed during each dialog, but it also has to be connected to, and inform, many other dialog system modules. The specific design of user models may vary significantly between task-oriented and non-task-oriented dialog systems (Wahlster and Kobsa 1989).

Discourse models and user models are largely intertwined, and the exact relationship and interaction between the two is subject to ongoing debate in the field of dialog modeling (see for instance Heland 1988; Norman 1989 and the special issue Computational Linguistics 14/3). User model design has been a fast-changing field in the past decades, and many advanced designs are unfortunately developed behind closed doors and not openly accessible to the scientific community. Generally, user models can be divided into canonical user models, which store knowledge about users in general, and individual user models, which store knowledge related to individual users and keep records per user.
A second distinction can be made between long-term and short-term user models: long-term models track more general knowledge, such as a history of discussed topics and pursued goals, while short-term models track more fine-grained specifics of an interaction (and often interact with discourse models) (for more on this distinction, see also Rich 1988; Kobsa 1989).
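The long-term/short-term split can be made concrete in a toy user model: long-term fields persist across dialogs, while short-term state is reconstructed for each new dialog. Field names and update logic below are illustrative assumptions, not taken from any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class UserModel:
    user_id: str
    # Long-term knowledge: persists across dialogs.
    discussed_topics: list = field(default_factory=list)
    pursued_goals: list = field(default_factory=list)
    # Short-term knowledge: reconstructed per dialog, tied to the discourse model.
    current_beliefs: dict = field(default_factory=dict)

    def observe_turn(self, topic, goal=None, beliefs=None):
        """Update the model from one user turn."""
        self.discussed_topics.append(topic)
        if goal:
            self.pursued_goals.append(goal)
        if beliefs:
            self.current_beliefs.update(beliefs)

    def new_dialog(self):
        """Short-term knowledge is cleared; long-term knowledge survives."""
        self.current_beliefs.clear()

um = UserModel("u42")
um.observe_turn("train schedules", goal="book a ticket",
                beliefs={"knows_route": True})
um.new_dialog()
print(um.discussed_topics)   # ['train schedules']  (long-term survives)
print(um.current_beliefs)    # {}                   (short-term reset)
```

Keeping the model in its own object like this also reflects the separability requirement noted above: other modules consult it through a narrow interface rather than sharing its internal state.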

User models are primarily constructed from the actual input of a user, from which the knowledge, plans and goals of the particular user are inferred. This type of knowledge can be related to other sources of data, such as user profiles or interaction histories. Combining all available knowledge, a user model is created and dynamically updated as the interaction proceeds. Here the notion of a stereotype (sometimes also prototype) is useful (Rosch and Mervis 1975). As part of building individual user models, users that display similar behaviour and traits can be grouped together to form user classes. The class attributes, or stereotypes, then enable the system to infer whole sets of user characteristics on the basis of a smaller number of observations. Stereotypes are useful for assigning traits to individual users based on less data, but the particular inference process of assigning class membership is of course defeasible and involves adding all sorts of (arbitrary) uncertainty measures such as numerical confidence ratings (Rich 1989; Chin 1989). However, of the approaches to stereotype management that have been proposed, many remain somewhat unintuitive due to the complexity involved in tasks such as the resolution of contradictory inferences and the confidence management of assigned properties. This active area of research is sometimes also referred to as truth maintenance: the study of user model design that deals with assigning new properties to the model while maintaining its consistency and adequacy (Doyle 1983).
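Stereotype-based default inference with numerical confidence ratings can be sketched as follows, in the spirit of Rich (1989). The stereotype names, traits, and the simple multiply-and-override policy are invented for illustration.

```python
# Stereotype library: class -> default traits with prior confidence ratings.
STEREOTYPES = {
    "novice": {"wants_explanations": 0.9, "knows_jargon": 0.1},
    "expert": {"wants_explanations": 0.2, "knows_jargon": 0.95},
}

def infer_traits(observed, stereotype, confidence):
    """Merge stereotype defaults into observed traits.

    Defaults are discounted by the confidence in the class assignment, and
    direct observations always defeat them — the inference is defeasible."""
    traits = {t: v * confidence for t, v in STEREOTYPES[stereotype].items()}
    traits.update(observed)  # observed facts override stereotype defaults
    return traits

# The user was classified as an expert with confidence 0.8, and was
# directly observed using technical vocabulary.
observed = {"knows_jargon": 1.0}
print(infer_traits(observed, "expert", confidence=0.8))
```

A genuine truth maintenance system would go further and record the justification for each inferred property, so that retracting a class assignment also retracts everything derived from it.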

Building on user models (which are already challenging enough to design), language generation is the even more complex task of deciding what to say and how to say it (for an overview of the field, see, for instance, Kempen 1989 or McDonald 1992). The generation of adequate turns needs to take into account knowledge related to many aspects, including the current domain of discourse, user models, and the situational context and cotext. A lot of work in this field is related to Rhetorical Structure Theory (RST) in one way or another. Generally, RST is used to identify rhetorical relations between units of speech, which are then used to produce larger stretches of discourse in a ‘coherent’ fashion. RST can be applied to various levels of discourse structure, ranging from sequential relations between turns and turn-constructional units to putting together whole sequences of talk in an ostensibly coherent manner (e.g. Hovy 1990, 1991; Cawsey 1990; Paris 1991; Scott and de Souza 1990). Other approaches have introduced the notion of focus (also question-under-discussion (QUD)), which, broadly speaking, aims to model attention and topic management in dialog (McCoy and Cheng 1991). Other increasingly important fields in language generation are multimodality and backchannel generators (also ‘linguistic feedback generators’) that model ‘non-linguistic’ or non-lexical aspects of talk-in-interaction.
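The basic RST move — linking discourse units into nucleus–satellite pairs by a rhetorical relation, then linearizing the tree into text — can be sketched briefly. The two relations and the connective-based linearization rule are simplified assumptions, far cruder than real RST-based generators.

```python
from dataclasses import dataclass

@dataclass
class RSTNode:
    relation: str      # e.g. "ELABORATION", "EVIDENCE"
    nucleus: object    # a string leaf or a nested RSTNode
    satellite: object

# Relation-specific connectives used when realizing the satellite.
CONNECTIVES = {"ELABORATION": "specifically,", "EVIDENCE": "because"}

def linearize(node):
    """Flatten an RST tree into text: nucleus first, then the satellite
    introduced by the connective of the linking relation."""
    if isinstance(node, str):
        return node
    return (f"{linearize(node.nucleus)} "
            f"{CONNECTIVES[node.relation]} {linearize(node.satellite)}")

tree = RSTNode("EVIDENCE",
               nucleus="The route is closed",
               satellite=RSTNode("ELABORATION",
                                 nucleus="a bridge is under repair",
                                 satellite="the one at mile 12"))
print(linearize(tree))
# The route is closed because a bridge is under repair specifically, the one at mile 12
```

Actual RST generators choose both the relations and their ordering from communicative goals and the user model, rather than taking a fixed tree as input.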

In summary, current approaches to dialog modeling combine a multitude of different frameworks, each consisting of different representation techniques such as rules, logic, frames or grammars. Orchestrated as a whole, these symbolic programming constructs can achieve an unprecedented level of abstraction that allows them to mimic human-like linguistic flexibility and creativity. However, due to this high level of abstraction, today’s symbolic systems have become extremely complex to handle, as each exception requires additional rules and processing (the scalability problem). In recent years, data-intensive machine learning techniques have been added at many levels of these systems, further increasing complexity while promising potential solutions for existing scalability and robustness problems. Generally, the dialog modeling community seems to have moved away from the description of linguistic structure per se towards a description of more ‘cognitive’ aspects of talk-in-interaction in the form of large associative networks (e.g. semantic webs, framenets, construction networks etc.). In line with this trend, the remainder of this post focuses on describing the links between linguistic structure and processes of ‘social action formation’. First, I will review existing approaches in the field, such as speech act theories and dialog act taxonomies. Then I will outline an approach to this task that conceptualizes the links between structure and social action as a multitude of fine-grained, continuous, often ambiguous and always recipient-designed processes. Finally, I will explore how these processes of reflexive action formation (Levinson 2013) can inform a corpus-based method for the development of representations of speech acts as tensors.


Allen, J.F., 1983. Recognizing intentions from natural language utterances. Computational models of discourse, pp.107-166.

Allen, J., 1995. Natural language understanding. Pearson.

Bolden, G. B., 2015. Transcribing as research: “manual” transcription and conversation analysis. Research on Language and Social Interaction, 48(3), 276-280.

Carberry, S., 1983, August. Tracking User Goals in an Information-Seeking Environment. In AAAI (pp. 59-63).

Cawsey, A., 1990. A Computational Model of Explanatory Discourse: local interactions in a plan-based explanation. In Computers and conversation (pp. 221-234). Academic Press.

Chin, D.N., 1989. KNOME: Modeling what the user knows in UC. In User models in dialog systems (pp. 74-107). Springer, Berlin, Heidelberg.

Cohen, P.R. and Perrault, C.R., 1979. Elements of a plan-based theory of speech acts. Cognitive science, 3(3), pp.177-212.

Hovy, E.H., 1988, June. Planning coherent multisentential text. In Proceedings of the 26th annual meeting on Association for Computational Linguistics (pp. 163-169). Association for Computational Linguistics.

Kobsa, A., 1989. A taxonomy of beliefs and goals for user models in dialog systems. In User models in dialog systems (pp. 52-68). Springer, Berlin, Heidelberg.

Kobsa, A., and Wahlster, W. ed., 1989. User models in dialog systems (p. 10). Berlin: Springer-Verlag.

McCoy, K.F., 1989. Highlighting a user model to respond to misconceptions. In User models in dialog systems (pp. 233-254). Springer, Berlin, Heidelberg.

McKeown, K.R., 1985. Discourse strategies for generating natural-language text. Artificial Intelligence, 27(1), pp.1-41.

Rosch, E. and Mervis, C.B., 1975. Family resemblances: Studies in the internal structure of categories. Cognitive psychology, 7(4), pp.573-605.

Schank, R.C., 1972. Conceptual dependency: A theory of natural language understanding. Cognitive psychology, 3(4), pp.552-631.

Scott, D. and de Souza, C.S., 1990. Getting the message across in RST-based text generation. Current research in natural language generation, 4, pp.47-73.

Wilensky, R., 1983. Planning and understanding: A computational approach to human reasoning.

Zock, M. and Sabah, G., 1988. Advances in Natural Language Generation. Greenwood Publishing Group Inc..

Andreas Liesenfeld
Postdoc in language technology

I am a social scientist working in Conversational AI. My work focuses on understanding how humans interact with voice technologies in the real world. I aspire to produce useful insights for anyone who designs, builds, uses or regulates talking machines. Late naturalist under pressure. “You must collect things for reasons you don’t yet understand.” — Daniel J. Boorstin