The quest for more natural talk between humans and voice bots

Dreaming of talking robots

Image Credit: Isaac Asimov - Robot Dreams

Conversational AI is the field that develops new ways for humans and technology to interact using speech. Or, commercially speaking, the field that enables and enhances user experiences of voice-enabled products. Part of this is to create experiences that make the user feel as if they are talking to a human. Voice-enabled technology becomes part of our lives when a virtual assistant wakes us up in the morning, when we ask our car for directions on our commute, and when we contact customer service chatbots while shopping. Ideally, we speak in our normal voices and the device understands and replies naturally and effortlessly. The ability to simulate human behaviour is an important part of such products. The inner workings of this type of interactive, voice-enabled technology are complex, usually involving multi-step processes and requiring considerable computing power. Typically, the processing stages involved are grouped into three parts:

1. Automatic Speech Recognition (ASR)
2. Natural Language Processing (NLP) or Natural Language Understanding (NLU)
3. Text-to-Speech (TTS) with voice synthesis

The first stage is speech recognition, the conversion of human speech input into subtitle-like transcriptions. The second is natural language processing and understanding, the stage at which a variety of modules further process the ASR transcription, usually by interacting with databases or web services, for instance to translate the input into another language or to perform a web search. The last stage, TTS, is the technology in charge of producing synthesized voice output back to the user.

Figure 1: A typical ASR pipeline design. (Credit: developer.nvidia.com/conversational-ai)

This typical conceptualization of a conversational AI pipeline suggests that the pipeline consists of three parts: a recognition part where speech is translated into text; a processing part where a certain task is fulfilled, such as machine translation, seq-to-seq question answering or query search; and a voice production part that communicates the output of the process back to the user.
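The three-part conceptualization can be sketched in code. This is a minimal illustration, not a real system: the stage functions (`recognize`, `understand`, `synthesize`) are hypothetical placeholders standing in for the ASR, NLU and TTS models a production pipeline would call.

```python
def recognize(audio: bytes) -> str:
    """ASR stage: turn raw audio into a subtitle-like transcript.
    A real system would run a speech recognition model here."""
    return "what is the weather like today"  # stubbed transcript

def understand(transcript: str) -> str:
    """NLP/NLU stage: interpret the transcript and fulfil a task,
    e.g. by querying a database or web service."""
    if "weather" in transcript:
        return "It is sunny and 21 degrees."
    return "Sorry, I did not understand that."

def synthesize(text: str) -> bytes:
    """TTS stage: turn the response text back into audio.
    A real system would run a voice-synthesis model here."""
    return text.encode("utf-8")  # stand-in for an audio waveform

def pipeline(audio: bytes) -> bytes:
    """Chain the three stages: ASR -> NLU -> TTS."""
    transcript = recognize(audio)
    response = understand(transcript)
    return synthesize(response)
```

The point of the sketch is the shape of the design: each stage consumes only the output of the previous one, which is exactly why everything downstream of ASR depends on what the transcript does (and does not) capture.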

This is the first point at which Conversation Analysis (CA) offers a critique. A word-level transcript is a reduction of an utterance to only some of its information, rather than a transduction or translation of the talk. A lot is lost in the process, and the subtitle-like transcript is certainly not an ideal basis for working out "why this now", that is, why this utterance was produced at this moment. This is why CA has developed another format of transcribing talk, one that is better at capturing how humans speak in natural interaction (Bolden 2015).

The conclusion: better representations are certainly needed to do more with speech input. But before getting there, a closer look at the current state of ASR.

Andreas Liesenfeld
Postdoc in language technology

I am a social scientist working in Conversational AI. My work focuses on understanding how humans interact with voice technologies in the real world. I aspire to produce useful insights for anyone who designs, builds, uses or regulates talking machines. Late naturalist under pressure. “You must collect things for reasons you don’t yet understand.” — Daniel J. Boorstin
