Trainable videorealistic speech animation

1 July 2002

proceedings article
Published by Association for Computing Machinery (ACM)

Vol. 21 (3), 388-398
https://doi.org/10.1145/566570.566594

Abstract

We describe how to use machine learning techniques to create a generative speech animation module. A human subject is first recorded with a video camera as he/she utters a predetermined speech corpus. After the corpus is processed automatically, a visual speech module is learned from the data; this module can synthesize the subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence that contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.

The two key contributions of this paper are 1) a variant of the multidimensional morphable model (MMM) to synthesize new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is automatically trained from the recorded video corpus and which can synthesize trajectories in MMM space corresponding to any desired utterance.
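
To make the idea of regularization-based trajectory synthesis concrete, here is a minimal sketch in Python. It is not the paper's formulation, which the abstract does not spell out; it is a generic Tikhonov-style smoother that pulls each frame toward a per-frame target (standing in for a phoneme's learned mean in MMM parameter space) while penalizing frame-to-frame jumps. The function name `synthesize_trajectory` and the parameters `targets`, `precisions`, and `smoothness` are all hypothetical.

```python
import numpy as np

def synthesize_trajectory(targets, precisions, smoothness=1.0):
    """Return y minimizing
        sum_t precisions[t] * (y[t] - targets[t])**2
        + smoothness * sum_t (y[t+1] - y[t])**2.

    targets may have shape (n,) or (n, d); each of the d columns is
    smoothed independently along the time axis. All names here are
    illustrative assumptions, not taken from the paper.
    """
    targets = np.asarray(targets, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    n = targets.shape[0]
    # First-order difference operator: (D @ y)[t] = y[t+1] - y[t]
    D = np.diff(np.eye(n), axis=0)
    P = np.diag(precisions)
    # Setting the gradient to zero gives the linear system
    # (P + smoothness * D^T D) y = P targets
    return np.linalg.solve(P + smoothness * D.T @ D, P @ targets)

# A hard step between two phoneme targets becomes a gradual transition.
step = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
smooth = synthesize_trajectory(step, np.ones(6), smoothness=5.0)
```

Raising `smoothness` spreads the transition over more frames, while raising a frame's precision forces the trajectory closer to that frame's target. In a trained system these quantities would be estimated from the recorded corpus rather than set by hand.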
