Invited Speakers

Aäron van den Oord (Google DeepMind, London, UK)


Deep learning for speech synthesis

With the advent of deep learning, generative modelling has improved dramatically, almost to the point that generated samples cannot be distinguished from real data. WaveNet has shown that it is possible to model high-dimensional audio well enough for speech synthesis, outperforming the best previously known methods, such as concatenative and vocoder-based systems. The main advantage of generative TTS, however, may be the flexibility of these learning-based approaches. The same system that learns to speak English fluently can also be trained on other languages, such as Mandarin, or even synthesize non-voice audio such as music. A single model can learn several speaker voices at once and switch between them by conditioning on the speaker identity. It can also adapt quickly to new, unseen data, learning a new speaker's voice from as little as a few sentences. Finally, generative TTS systems open the door to a wide variety of new applications, such as unsupervised phonetic unit discovery and speech compression.
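To make the conditioning idea concrete, here is a minimal sketch (ours, in PyTorch; not DeepMind's implementation) of a WaveNet-style stack of dilated causal convolutions whose gated activations are shifted by a learned speaker embedding, so a single network can voice several speakers. All class and parameter names here are illustrative.

```python
# Minimal, illustrative sketch of WaveNet-style speaker conditioning:
# dilated causal convolutions with gated activations, each shifted by
# a learned speaker embedding. Not DeepMind's actual implementation.
import torch
import torch.nn as nn

class TinyConditionalWaveNet(nn.Module):
    def __init__(self, n_speakers, channels=32, n_layers=6, n_classes=256):
        super().__init__()
        self.embed = nn.Embedding(n_speakers, channels)  # speaker identity
        self.input = nn.Conv1d(1, channels, kernel_size=1)
        self.filters = nn.ModuleList()
        self.gates = nn.ModuleList()
        for i in range(n_layers):
            d = 2 ** i  # dilation doubles per layer: 1, 2, 4, ...
            self.filters.append(nn.Conv1d(channels, channels, 2, dilation=d))
            self.gates.append(nn.Conv1d(channels, channels, 2, dilation=d))
        self.output = nn.Conv1d(channels, n_classes, 1)  # quantized-amplitude logits

    def forward(self, x, speaker_id):
        # x: (batch, 1, time) waveform; speaker_id: (batch,) int tensor
        h = self.input(x)
        cond = self.embed(speaker_id).unsqueeze(-1)  # (batch, channels, 1)
        for f, g in zip(self.filters, self.gates):
            pad = f.dilation[0]  # left-pad so the convolution stays causal
            hp = nn.functional.pad(h, (pad, 0))
            h = h + torch.tanh(f(hp) + cond) * torch.sigmoid(g(hp) + cond)
        return self.output(h)  # per-sample logits over quantized amplitudes

net = TinyConditionalWaveNet(n_speakers=4)
wave = torch.randn(2, 1, 1000)
logits = net(wave, torch.tensor([0, 3]))  # two different speakers in one batch
print(logits.shape)  # torch.Size([2, 256, 1000])
```

Because the speaker embedding is just an extra bias on every gated activation, switching voices at synthesis time only requires changing one integer.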



Claire Gardent (CNRS/LORIA, Nancy, France)


Natural Language Generation: Creating Text (https://www.superlectures.com/ssw2019/natural-language-generation-creating-text)

Natural Language Generation (NLG) aims to create text from some input (data, text, a meaning representation) and some communicative goal (summarising, verbalising, comparing, etc.). In the pre-neural era, different input types and communicative goals led to distinct computational models. In contrast, deep learning encoder-decoder models introduced a paradigm shift in that they provide a unifying framework for all NLG tasks. In my talk, I will start by briefly introducing the three main types of input considered in NLG. I will then give an overview of how neural models handle these inputs and present some of our work on generating text from meaning representations, from data, and from text.
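As a rough illustration of why encoder-decoder models unify these tasks, the sketch below (PyTorch; our illustration, not the speaker's system) shows a single architecture that serves any input that can be rendered as a token sequence, whether data, text, or a linearized meaning representation.

```python
# Minimal encoder-decoder sketch: one architecture for all NLG tasks,
# provided the input is linearized into tokens. Illustrative only.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.project = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        # src: (batch, src_len) token ids -- e.g. a linearized data record,
        # a meaning representation, or a source document.
        # tgt: (batch, tgt_len) the text generated so far.
        _, state = self.encoder(self.src_embed(src))       # encode the input
        out, _ = self.decoder(self.tgt_embed(tgt), state)  # condition the decoder
        return self.project(out)  # next-token logits at each position

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 12)), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 1000])
```

Only the tokenization of the input changes between data-to-text, meaning-representation-to-text, and text-to-text generation; the model itself stays the same.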



Tecumseh Fitch (University of Vienna, Austria), Bart de Boer (Vrije Universiteit Brussel, Belgium)


Synthesizing animal vocalizations and modelling animal speech (https://www.superlectures.com/ssw2019/synthesizing-animal-vocalizations-and-modelling-animal-speech)

Over the last two decades, theory from speech science and methods from digital signal processing have been used productively to study animal communication in many different ways. This has led to fundamental advances in our understanding of how animals produce and perceive their vocalizations and use them to communicate with one another. A central insight was that the source-filter theory of vocal production, initially developed in speech science, applies to most vertebrate vocal systems as well. This opened the door to using methods such as linear prediction to analyze source and filter characteristics, and to re-synthesize realistic vocalizations with precise changes to fundamental frequency, formants, and other characteristics. We give an overview of this progress, covering several specific examples from our own work in more detail.
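As an informal illustration of this analysis/resynthesis loop (our sketch, not the authors' code), the snippet below estimates an all-pole vocal-tract filter with linear prediction, inverse-filters to recover the source, and then drives the same filter with a pulse train at a new fundamental frequency. The filename and parameters are hypothetical, and a real system would work frame-by-frame rather than with a single filter per recording.

```python
# Illustrative LPC analysis/resynthesis: change f0 while keeping the
# formant (filter) structure of an animal call. Sketch only; a real
# pipeline would window the signal and fit a filter per frame.
import numpy as np
from scipy.signal import lfilter
import librosa

y, sr = librosa.load("call.wav", sr=16000)  # hypothetical recording
a = librosa.lpc(y, order=12)                # all-pole vocal-tract filter A(z)

# Inverse filtering with A(z) removes the filter's contribution and
# leaves an estimate of the source (glottal/syringeal) signal.
residual = lfilter(a, [1.0], y)

# Build a new source: an impulse train at a shifted fundamental frequency,
# scaled to match the energy of the original residual.
new_f0 = 220.0
source = np.zeros_like(y)
source[::int(sr / new_f0)] = 1.0
source *= np.sqrt(np.mean(residual**2)) / np.sqrt(np.mean(source**2))

# Re-synthesize: excite the original filter 1/A(z) with the new source,
# changing the fundamental frequency but preserving the formants.
y_new = lfilter([1.0], a, source)
```

The same decomposition lets one instead keep the original source and warp the filter coefficients, shifting formants while leaving f0 untouched.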