A bootstrapping training technique for obtaining demisyllable reference patterns

Abstract
The process of obtaining reference patterns for syllable-like units is tedious, error-prone, and time-consuming. Speech recognition systems based on such units are therefore usually tested on only a single talker. A procedure is described that takes demisyllable reference patterns excised from the spoken utterances of one talker and automatically creates demisyllable reference patterns for a new talker. The procedure is based on dynamic time warping alignment of the spoken utterances (each containing the relevant demisyllable) and on the assumption that the optimum warping path identifies the best-matching demisyllable within the utterance. The automatic procedure was used to create demisyllable reference patterns for two new talkers, each speaking over a dialed-up telephone line (the original recordings were made with a high-quality microphone). Reference patterns for 100 isolated words were built from a given lexical specification of each word in terms of the demisyllables in the inventory. Word recognition accuracies greater than 90% were obtained for both talkers on the 100-word vocabulary. In a second experiment, using a 1109-word vocabulary, recognition accuracies obtained with reference patterns based on automatically extracted demisyllables were 2 to 5% worse than those obtained with reference patterns based on hand-corrected demisyllables. The automatic demisyllable extraction technique thus appears to provide a very good first-pass set of demisyllables which, combined with some manual corrections, yields a set of demisyllable patterns suitable for use in a recognition system.
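The alignment step can be illustrated with a minimal sketch of dynamic time warping with free endpoints in the utterance. It is not the authors' implementation: the abstract does not specify the feature representation, the local distance, or the path constraints, so the Euclidean frame distance, the three-way step pattern, and the names `frame_distance` and `locate_pattern` are all assumptions made for illustration. The sketch shows how an optimum warping path can mark where a known reference pattern best matches inside a longer utterance.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def locate_pattern(reference, utterance):
    """Find the utterance segment that best matches `reference`.

    Returns (start, end, cost): inclusive frame indices into `utterance`
    and the accumulated distance along the optimum warping path.
    """
    n, m = len(reference), len(utterance)
    if n == 0 or m == 0:
        raise ValueError("reference and utterance must be non-empty")

    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    # back[i][j] holds the predecessor cell on the best path into (i, j).
    back = [[None] * m for _ in range(n)]

    # Free start: the pattern may begin at any utterance frame.
    for j in range(m):
        cost[0][j] = frame_distance(reference[0], utterance[j])

    for i in range(1, n):
        for j in range(m):
            # Assumed step pattern: vertical, diagonal, horizontal.
            candidates = [(cost[i - 1][j], (i - 1, j))]
            if j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
                candidates.append((cost[i][j - 1], (i, j - 1)))
            best, prev = min(candidates, key=lambda c: c[0])
            cost[i][j] = best + frame_distance(reference[i], utterance[j])
            back[i][j] = prev

    # Free end: pick the utterance frame where the full pattern ends most cheaply.
    end = min(range(m), key=lambda j: cost[n - 1][j])

    # Trace the optimum path back to the first reference frame.
    i, j = n - 1, end
    while back[i][j] is not None:
        i, j = back[i][j]
    return j, end, cost[n - 1][end]
```

Under these assumptions, calling `locate_pattern(old_talker_pattern, new_talker_utterance)` returns the frame range of the new talker's utterance that would be excised as the new reference pattern; in practice the inputs would be sequences of spectral feature vectors rather than toy values.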
