Principles of speech technology development and system design
On the conceptualization of language engineering
Like any good engineering discipline, the development of speech technology is guided by the following principles:
problem reduction: a task is decomposed into a number of simpler subtasks, which can be further decomposed until subtasks are reached that can be directly modeled and implemented.
modularity: the program achieving a task is divided into different modules, often corresponding to major subtasks, which are more or less autonomous, specialized information processors.
formalization: within each module, a formalism is designed that brings together the knowledge and methods the module needs to accomplish its task. The formalism must capture the representation and manipulation of the relevant slice of knowledge, i.e. it provides a bridge between theory and implementation. Such formalizations organize the necessary knowledge in a particular way; however, this necessary step of (re)organizing the data is often problematic for a number of theoretical reasons (see section “On reductionism”). Formalization ultimately enables inference over the data to solve the task at hand, either through ‘procedural formal models’, which support direct inference on specific instances, or through ‘declarative formal models’, which exhibit a larger degree of separation between the knowledge-representation and inference aspects of a problem. Formalisms represent linguistic knowledge in various forms: in the area of grammar and linguistic structure they often take the form of state-space search models or logic- or rule-based formalizations, while in semantics they may come in the form of graph-based representations such as semantic networks and frames (see section “Framenets and Constructicons”).
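The distinction between declarative knowledge representation and a separate inference procedure can be made concrete with a toy sketch. In the example below (all rule names and the miniature lexicon are invented for illustration), a tiny rule-based grammar is pure data, and a generic recursive recognizer performs inference over it — changing the grammar requires no change to the inference code:

```python
# Declarative formal model sketch: the grammar is plain data (knowledge
# representation), and a generic recognizer (inference) is kept separate.
# The toy grammar and lexicon below are invented for illustration.

GRAMMAR = {                     # rewrite rules: nonterminal -> expansions
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {                     # preterminals and the words they cover
    "Det": {"the", "a"},
    "N":   {"dog", "cat"},
    "V":   {"sees", "sleeps"},
}

def recognize(symbol, words, i):
    """Return all positions reachable after deriving `symbol` from words[i:]."""
    if symbol in LEXICON:                        # preterminal: consume one word
        if i < len(words) and words[i] in LEXICON[symbol]:
            return {i + 1}
        return set()
    positions = set()
    for expansion in GRAMMAR.get(symbol, []):    # nonterminal: try each rule
        frontier = {i}
        for sym in expansion:
            frontier = {k for j in frontier for k in recognize(sym, words, j)}
        positions |= frontier
    return positions

def parses(sentence):
    """True if the whole sentence is derivable from the start symbol S."""
    words = sentence.split()
    return len(words) in recognize("S", words, 0)
```

A procedural formal model would instead hard-wire such rules into the control flow of the program itself, trading this separation for directness.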
As for dialog systems, the task of interpreting and generating natural language is usually decomposed into various subtasks tied to ‘linguistic units’, i.e. phoneme, word, turn (or sentence) and discourse. Based on this division, a larger architecture of various modules is established on both the language understanding and the language generation side. Possible modules in language understanding are: speech recognition, morphological analysis, syntactic parsing, semantic analysis and discourse analysis. And for language generation: discourse planning, turn (or sentence) generation, morphological generation and speech production.
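A sequential understanding pipeline of this kind can be sketched as a chain of functions, each consuming the previous module's output. The module bodies below are placeholder stubs (their behavior is invented for illustration); in a real system each would be a substantial component:

```python
# Toy sketch of a sequential language-understanding pipeline.
# Each module is a hypothetical stub standing in for a real component.

def speech_recognition(audio):
    # stub: pretend the recognizer already produced a lowercased word string
    return audio.lower()

def morphological_analysis(text):
    # stub: tokenize and strip a toy plural suffix 's'
    return [w[:-1] if w.endswith("s") else w for w in text.split()]

def syntactic_parsing(tokens):
    # stub: a flat 'parse' pairing each token with a dummy category
    return [(tok, "WORD") for tok in tokens]

def semantic_analysis(parse):
    # stub: a bag-of-predicates 'meaning representation'
    return {tok for tok, _ in parse}

def understand(audio):
    """Sequential architecture: the output of one module feeds the next."""
    out = audio
    for module in (speech_recognition, morphological_analysis,
                   syntactic_parsing, semantic_analysis):
        out = module(out)
    return out
```

The generation side would mirror this chain in the opposite direction, from discourse planning down to speech production.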
The modules are often organized as part of a sequential architecture, in which the different modules are accessed in sequence and the output of one module is fed directly into the next. Recently, more interactive architectures have also gained popularity, in which the modules are not accessed in a strictly sequential fashion but feature more links and feedback loops between different modules (e.g. distributed connectionist models). This is especially true for the syntax-semantics interface, where a trend can be observed away from ‘syntax first’ strategies in which a syntactic parse is computed first and then fed into a semantic module. Instead, new approaches to the syntax-semantics interface enable an integration of these formerly separate modules, often drawing on cognitive grammars (see section “Framenets and Constructicons”). Notably, this trend is a reminder that the division into modules is in fact driven by design decisions based on the principles of ‘problem reduction’ and ‘modularity’, and that it is by no means guaranteed that the dividing lines in such a system accurately reflect talk-in-interaction in humans. This also applies to the semantics-pragmatics interface.
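The contrast with the strictly sequential design can be illustrated with a minimal feedback loop between syntax and semantics. In the sketch below (the candidate parses and the plausibility check are invented for illustration), the parser enumerates alternative analyses and the semantic module feeds acceptance back, so neither module runs to completion before the other:

```python
# Toy sketch of an interactive syntax-semantics interface: candidate parses
# are proposed by syntax and filtered by semantic feedback in a loop.
# The two readings and the plausibility test are hypothetical.

def candidate_parses(tokens):
    # stub: yield two attachment hypotheses for an ambiguous 'with'-phrase
    yield {"attach": "verb"}    # instrument reading ('saw ... using a telescope')
    yield {"attach": "noun"}    # modifier reading ('the man who had a telescope')

def semantically_plausible(parse, context):
    # feedback from semantics: only the reading matching context survives
    return parse["attach"] == context["preferred_attachment"]

def interpret(tokens, context):
    """Syntax and semantics cooperate instead of running strictly in sequence."""
    for parse in candidate_parses(tokens):
        if semantically_plausible(parse, context):
            return parse
    return None
```

A ‘syntax first’ system would instead commit to one parse before semantics is consulted, and would have to backtrack across the module boundary when that parse turns out to be implausible.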