Automatic speech recognition. The state of the art.

An introduction to current automatic speech recognition systems

Image Credit: History of speech recognition

Automatic speech recognition (ASR) technology consumes human voice input and outputs readable text, usually in the form of subtitle-like transcriptions of talk. ASR is a key step in conversational AI pipelines because all further processing down the line is based on this output. The ability of ASR modules to produce these subtitles has significantly increased in recent years as deep learning has replaced other statistical methods such as Hidden Markov Models or Gaussian Mixture Models. Popular current ASR modules include DeepSpeech, Wav2letter, Listen-Attend-Spell and Jasper. They all fall into the category of end-to-end ASR systems and typically work something like this:

Figure 1: A typical ASR pipeline design. (Credit:

The first step is to record sound with some kind of capturing device, usually a microphone array that provides preprocessing capabilities such as noise suppression, echo cancellation and automatic gain control. These preprocessing steps aim to better separate human voice input from noise and other unwanted input. Second, Mel Frequency Cepstral Coefficient (MFCC) techniques are used to capture audio spectral features in a spectrogram. By “printing” sound as spectrograms, ASR from here on essentially becomes an image recognition task. The question becomes: how accurately can our system recognise letters, and by extension words or Chinese characters, in the spectrogram data? Humans can actually also learn to read spectrograms.

ASR modules attempt this by passing the spectrograms to a deep learning-based acoustic model that predicts the probability of single letters over a short stretch of spectrogram input. This is a matching task with two inputs: the spectrogram, chopped up into a series of time steps, and a small set of letters from an alphabet-like list. For English, this might be the letters A-Z; for Chinese, it might be Pinyin syllables. The model then predicts the character for each time step of the spectrogram. It can do that because it has been trained on large datasets that consist of hours of audio with aligned transcripts (e.g. LibriSpeech, the WSJ ASR corpus or Google AudioSet). The output is a series of letters aligned to the time steps of the spectrogram input.
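To make the idea of "chopping sound into time steps" a bit more concrete, here is a minimal sketch in plain Python. It is not an MFCC implementation: it only computes a basic magnitude spectrogram (MFCCs would add a mel filterbank, a logarithm and a cosine transform on top). The function names, the frame length of 64 samples, the hop of 32 samples and the synthetic 1000 Hz test tone are all assumptions chosen for illustration.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Cut a 1-D signal into overlapping frames (the 'time steps')."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Hann-window one frame and return naive DFT magnitudes
    for the non-negative frequency bins."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def spectrogram(samples, frame_len=64, hop=32):
    """One row of frequency magnitudes per time step."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


# Synthetic input (an assumption, standing in for recorded speech):
# a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]

spec = spectrogram(samples)

# Find the strongest frequency bin in the first time step.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), "time steps,", len(spec[0]), "frequency bins, peak at", peak_hz, "Hz")
```

Each row of `spec` is one time step, and each column is one frequency band. An acoustic model would take rows like these (or MFCC features derived from them) as its input.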

Figure 2: Step-wise estimation of letters from spectrogram. (Credit:

The letter-by-letter output of the acoustic model might look like this:

   H H E E E L L L O O O O O O


   Y Y H H E E L L O O O O
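A letter series like the ones above can be thought of as picking the most probable letter at every time step. The sketch below illustrates that step only. The tiny four-letter alphabet and the probability values are made up for the example, and a real acoustic model would produce such probabilities from the spectrogram rather than from a hand-written table.

```python
# Hypothetical mini-alphabet; a real system would cover at least A-Z.
ALPHABET = ["E", "H", "L", "O"]


def greedy_letters(step_probs, alphabet):
    """Pick the most probable letter for each time step."""
    letters = []
    for probs in step_probs:
        best = max(range(len(probs)), key=lambda i: probs[i])
        letters.append(alphabet[best])
    return letters


# Made-up per-time-step probabilities, one row per time step,
# one column per letter in ALPHABET (order: E, H, L, O).
step_probs = [
    [0.10, 0.70, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.00, 0.80, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]

letters = greedy_letters(step_probs, ALPHABET)
print(" ".join(letters))
```

Because neighbouring time steps often cover the same sound, the same letter tends to be predicted several times in a row.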

Next, a decoder estimates which word matches each series of letters. With the help of a language model (a statistical model of word sequences, estimated from a large corpus of the target language), the decoder computes “Hello” and “Yellow” from series of single letters like those above, based on co-occurrence patterns. Depending on the product, the words can then be further buffered into phrases or neat sentences with added punctuation before being sent to additional natural language processing or natural language understanding modules. To save time and computing power, decoders often rely on pretrained models such as BERT.
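As a rough illustration of the decoding step, the sketch below collapses repeated letters and then picks the vocabulary word that best balances "looks like the letter series" against "is a likely word". This is a toy stand-in, not how any of the systems named above decode: the three-word vocabulary, its probabilities, the `mismatch_penalty` parameter and the use of edit distance are all assumptions made for the example.

```python
import math
from itertools import groupby


def collapse_repeats(letters):
    """Merge runs of the same letter: H H E E L L O -> HELO."""
    return "".join(key for key, _ in groupby(letters))


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical word probabilities standing in for a language model.
WORD_PROBS = {"HELLO": 0.5, "YELLOW": 0.3, "HALO": 0.2}


def decode(letters, word_probs, mismatch_penalty=2.0):
    """Return the word with the best combined score.

    Collapsing cannot tell a single letter from a double one,
    so each candidate word is collapsed in the same way before comparing.
    """
    observed = collapse_repeats(letters)

    def score(word):
        distance = edit_distance(observed, collapse_repeats(word))
        return math.log(word_probs[word]) - mismatch_penalty * distance

    return max(word_probs, key=score)


series = "H H E E E L L L O O O O O O".split()
print(collapse_repeats(series), "->", decode(series, WORD_PROBS))
```

Note that simple collapsing turns the double L into a single one, which is one reason the decoder needs knowledge about real words to recover “Hello”.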

The Tchaikovsky problem

Character sets and language models as limitations: character sets and non-lexical utterances, out-of-vocabulary (OOV) words, Zipf distributions and the proper noun problem.


Andreas Liesenfeld
Postdoc in language technology

I am a social scientist working in Conversational AI. My work focuses on understanding how humans interact with voice technologies in the real world. I aspire to produce useful insights for anyone who designs, builds, uses or regulates talking machines. Late naturalist under pressure. “You must collect things for reasons you don’t yet understand.” — Daniel J. Boorstin