Research Areas and Projects

Research and Development at I2R: Multi-lingual Speech Synthesis

Philosophy and Methodology of AI
  Computationalism
  Connectionism
  Interactionism
  Phenomenalism
  The Case of Text-to-Speech Synthesis
Speech Synthesis Using HMM and Unit Selection
Speech Synthesis with Deep Learning

Philosophy and Methodology of Artificial Intelligence:
From Computationalism to Phenomenalism

Putting speech synthesis in a larger context helps us understand it better. At first glance, reading a text aloud seems a trivial task, but it actually involves a significant level of intelligence, and making a computer perform the task is an essential part of artificial intelligence (AI). Clarifying the philosophy and methodology of AI in general puts speech synthesis in a clearer context.

AI is almost as old as modern computers. Its philosophy and methodology have evolved over its short history of several decades: basic assumptions have been challenged and modified, and new methods have been developed. We may identify four major approaches.

Computationalism and Rule-Basedness

Basic Assumption: The human brain works essentially like a computer, so human intelligence can be completely achieved through a computer.

The first generation of AI researchers held the above strong assumption; in some cases a computer was literally called an "electronic brain." This view was to a large extent based on the logical empiricist philosophy of science. The approach was vehemently criticised by Dreyfus, who identified four specific assumptions underlying it: the biological, the psychological, the epistemological, and the ontological assumption (What Computers Still Can't Do: A Critique of Artificial Reason, MIT Press, 1992, Part II).

Rules, especially logical rules, played a crucial role in the computationalist approach. As logical empiricism lost favour and insurmountable obstacles were encountered, this approach too went out of fashion. We could call it "old-fashioned AI."

Connectionism and Brain Simulation

Basic Assumption: In order to create intelligence artificially with a computer, we have to simulate the human brain.

Facing the failure of the computationalist approach, connectionism moved away from the strong computationalist claim, and the human brain received some respect. On the connectionist view, however, the brain is a neural network, so the simulation amounted to building an artificial neural network (ANN) in software. Explicit rules disappeared in an ANN; instead there were only activation weights between layers of nodes, the virtual neurons. Because of convergence problems and the heavy computational cost, ANNs built in the 1980s and 1990s had to be kept very simple, with only one hidden layer between the input and output layers. This limited their power, and they quickly went out of fashion.
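
As a concrete contrast with rule-based systems, the following minimal sketch (in Python with NumPy; the layer sizes, weights, and input are illustrative only, not taken from any real system) shows the kind of single-hidden-layer network described above: all the "knowledge" lives in numeric weights rather than explicit rules.

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  # A single hidden ("middle") layer between input and output,
  # as in the small ANNs of the 1980s and 1990s.
  rng = np.random.default_rng(0)
  W1 = rng.normal(size=(4, 8))    # input (4 features) -> hidden (8 nodes)
  W2 = rng.normal(size=(8, 2))    # hidden -> output (2 classes)

  def forward(x):
      hidden = sigmoid(x @ W1)    # only activation weights, no explicit rules
      return sigmoid(hidden @ W2)

  print(forward(np.array([0.2, -1.0, 0.5, 0.1])))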

Both issues were addressed significantly in the first decade of the new millennium, with effective pre-training methods and greatly improved computing power. This allowed researchers to build much more complex ANNs, and the age of deep learning was ushered in. As usual, significant technological advances fanned wild fantasies and hyperbolic claims; we should be careful before trying to revive the old strong-AI dreams.

The connectionist approach originally worked best in pattern recognition. With the greater modelling power of deep ANNs, it will certainly find more fields of application. However, it can be called brain simulation only in a very loose sense: is back-propagation really happening in the brain? In any case, it is committed to the reductionist claim that everything the human brain does can be reduced to the electro-chemical interactions among neurons as described by neuroscience.

Interactionism and the Sensory-Motor Model

Basic Assumption: The interaction between the agent and the environment is an important foundation of intelligence.

Interactionism is a more natural approach than connectionism: besides the brain, the human body was also given due respect. To some extent this approach was inspired by Dreyfus's criticism. Under the influence of Heidegger's philosophy, he held that intelligence had to be situated in a particular environment and could not be based on a "brain in a vat" alone.

The interactionist approach is obviously the foundation of robotics. The sensory-motor model and learning are its two essential features: as a robot actively explores its environment, it interacts with its surroundings and modifies its own behaviour. The limitation of this approach is that, in practice, it can only be applied to lower levels of intelligence.
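
A bare-bones sketch of such a sense-act-learn loop is given below (Python; the toy two-action "environment" and the update rule are entirely made up for illustration): the agent acts, observes the outcome, and adjusts its behaviour accordingly.

  import random

  # Toy environment: only action 1 yields a reward (a made-up stand-in
  # for the rich physical surroundings a real robot explores).
  def environment(action):
      return 1.0 if action == 1 else 0.0

  value = [0.0, 0.0]                  # the agent's estimate of each action
  for step in range(100):
      # Sense/decide: mostly exploit current estimates, sometimes explore.
      if random.random() < 0.1:
          action = random.randrange(2)
      else:
          action = value.index(max(value))
      reward = environment(action)    # act and observe the outcome
      value[action] += 0.1 * (reward - value[action])   # modify behaviour

  print(value)   # estimates converge toward the rewarding action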

Phenomenalism and Data-Drivenness

Basic Assumption: Some parts of human intelligence cannot be achieved through bottom-up approaches, or only with extreme difficulty. In such cases we have to face and handle the phenomena directly.

Higher levels of intelligence, such as language abilities, look far beyond the reach of bottom-up approaches. If we start from neurons or from sensory-motor interaction with the environment, language seems light years away. Here we need a phenomenalist approach: we do not need to care how the intelligence is generated behind the curtain, but only attend to its phenomena. For instance, we can treat speech simply as an acoustic signal and handle its features directly, and we can treat text writing as a statistical generative process (as ChatGPT does).
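
As a small illustration of treating speech purely as an acoustic signal, the sketch below (Python with NumPy and SciPy; the synthetic sine wave is only a stand-in for a real recording) computes a spectrogram, the kind of surface feature a data-driven system handles directly, without any model of how the sound was produced.

  import numpy as np
  from scipy.signal import spectrogram

  fs = 16000                                   # sample rate in Hz
  t = np.arange(0, 1.0, 1.0 / fs)
  signal = np.sin(2 * np.pi * 220 * t)         # stand-in for a speech recording

  # Short-time spectral features: the phenomena themselves,
  # with no assumptions about the vocal tract behind them.
  freqs, times, power = spectrogram(signal, fs=fs, nperseg=400, noverlap=240)
  print(power.shape)                           # (frequency bins, analysis frames)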

In a sense computationalism is also a type of phenomenalist approach, but it holds a simplistic view of the phenomena: it treats the mind as computation and language as a logical calculus. In contrast, genuine phenomenalism gives the phenomena full respect. An important distinction is that, whereas computationalism relies heavily on pre-defined rules, genuine phenomenalism is based on raw data.

The Case of Text-to-Speech Synthesis

Language abilities belong to the higher levels of human intelligence, so it is no wonder that human language processing (HLP) has been an essential part of AI. Since language has two forms, text and speech, an important part of HLP is the conversion between them: conversion from speech to text is speech recognition, and conversion in the opposite direction is speech synthesis. Although speech synthesis is relatively easy compared with other fields of HLP, as mentioned in the introduction, it is still a very challenging task, and even more so when we aim higher.

Since ancient times, engineers have tried to mimic human speech with machines, but significant results were only achieved in the 20th century. Speech synthesis systems have evolved through several generations over the past decades.

The first-generation systems (up to the 1970s and early 1980s) adopted the approach of vocal tract simulation: the output speech was generated from scratch with signal processing techniques, on the basis of some model of the vocal tract. Pre-defined rules were inevitable, but since hand-written rules could not capture all the complexity of human speech, the generated speech reached only basic intelligibility.
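
A toy source-filter sketch in the spirit of that era is shown below (Python with NumPy and SciPy; the pitch, formant frequencies, and bandwidths are hand-picked for illustration, not taken from any actual rule system): an impulse-train "glottal source" is passed through a few resonators standing in for the vocal tract.

  import numpy as np
  from scipy.signal import lfilter

  fs = 16000
  f0 = 120                                # illustrative pitch in Hz
  n = int(fs * 0.5)                       # half a second of sound

  # Source: an impulse train approximating glottal pulses.
  source = np.zeros(n)
  source[::fs // f0] = 1.0

  def resonator(x, freq, bw, fs):
      # Second-order resonator modelling one formant of the vocal tract.
      r = np.exp(-np.pi * bw / fs)
      theta = 2 * np.pi * freq / fs
      return lfilter([1.0 - r], [1.0, -2 * r * np.cos(theta), r * r], x)

  # Filter: a cascade of hand-chosen formants, roughly an /a/-like vowel.
  speech = source
  for freq, bw in [(700, 110), (1220, 120), (2600, 160)]:
      speech = resonator(speech, freq, bw, fs)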

The second-generation systems (1980s and 1990s) moved away from explicit synthesis models and relied more on real speech data. Specifically, diphones were extracted from recorded speech and concatenated to generate new utterances. During concatenation the pitch and timing of the diphones were modified with signal processing techniques, and further techniques were applied to smooth the unit boundaries.
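
The concatenation idea can be sketched as follows (Python with NumPy; the two "diphones" are synthetic arrays standing in for units cut from recorded speech, and a simple linear crossfade stands in for the boundary-smoothing techniques mentioned above).

  import numpy as np

  fs = 16000

  def crossfade_concat(a, b, overlap):
      # Join two recorded units, blending them over `overlap` samples.
      fade = np.linspace(0.0, 1.0, overlap)
      return np.concatenate([
          a[:-overlap],
          a[-overlap:] * (1.0 - fade) + b[:overlap] * fade,
          b[overlap:],
      ])

  # Stand-ins for two diphones cut out of real recordings.
  t = np.arange(0, 0.1, 1.0 / fs)
  diphone_1 = np.sin(2 * np.pi * 200 * t)
  diphone_2 = np.sin(2 * np.pi * 250 * t)

  utterance = crossfade_concat(diphone_1, diphone_2, overlap=160)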

In the third-generation systems (since the late 1990s) real speech data became dominant, while signal processing was reduced to a minimum. The hidden Markov model (HMM) and unit selection were the two major techniques: the former builds a statistical model of the raw speech data, whereas the latter uses the raw data directly. With the recent adoption of deep learning, the development of speech synthesis has entered a new phase.
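
To make the unit-selection idea concrete, here is a small sketch (Python; the candidate lists, target costs, and join-cost penalty are made-up numbers, not from any real voice database): for each target unit the system keeps several recorded candidates and, via dynamic programming, picks the sequence that minimizes the combined target cost (how well a candidate matches the specification) and join cost (how smoothly adjacent candidates fit together).

  # target_cost[i][k]: made-up cost of the k-th recorded candidate
  # for the i-th target unit (a real system compares linguistic context).
  target_cost = [
      [0.2, 0.6],        # candidates for unit 0
      [0.5, 0.1, 0.4],   # candidates for unit 1
      [0.3, 0.3],        # candidates for unit 2
  ]

  def join_cost(prev_cand, next_cand):
      # Made-up smoothness penalty (a real system compares spectra and pitch).
      return 0.1 * abs(prev_cand - next_cand)

  def select_units(target_cost):
      # Viterbi-style dynamic programming over candidate sequences.
      best = [(cost, [k]) for k, cost in enumerate(target_cost[0])]
      for i in range(1, len(target_cost)):
          new_best = []
          for j, tc in enumerate(target_cost[i]):
              total, path = min(
                  (prev_cost + join_cost(path[-1], j), path)
                  for prev_cost, path in best
              )
              new_best.append((total + tc, path + [j]))
          best = new_best
      return min(best)

  print(select_units(target_cost))   # (total cost, chosen candidate indices)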

The fourth-generation systems (since the 2010s) take advantage of the modelling power of deep neural networks and model waveform samples directly from huge datasets, reducing the role of signal processing even further. The quality of the generated speech has improved greatly; in normal text reading it comes very close to human performance.
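
The core idea behind direct waveform modelling, predicting each sample from the ones before it with a neural network, can be sketched as follows (Python with NumPy; the tiny untrained one-hidden-layer predictor is purely illustrative, where a real system uses a deep network trained on a huge speech corpus).

  import numpy as np

  rng = np.random.default_rng(0)
  context = 64                         # how many past samples the model sees

  # Untrained stand-in for a deep waveform model; real systems use many
  # layers and millions of parameters learned from large datasets.
  W1 = rng.normal(scale=0.1, size=(context, 32))
  W2 = rng.normal(scale=0.1, size=(32,))

  def predict_next(past):
      hidden = np.tanh(past @ W1)
      return np.tanh(hidden @ W2)      # next waveform sample in [-1, 1]

  # Autoregressive generation: each new sample is fed back as input.
  samples = list(rng.normal(scale=0.01, size=context))
  for _ in range(1600):                # 0.1 s of audio at 16 kHz
      samples.append(predict_next(np.array(samples[-context:])))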

We may highlight the following points in the development of speech synthesis technology:
  Reliance on hand-written rules has steadily decreased.
  Reliance on real speech data has steadily increased.
  The role of signal processing has shrunk from generation to generation.
  Speech quality has risen from basic intelligibility to near-human naturalness.

This guest lecture I gave at the National University of Singapore provides a historical introduction to speech synthesis: Part I, Part II