SpeeDurCont
Segmental Duration in German Speech

Natural-sounding speech is a key factor in the acceptability of practical voice output systems, and the main contributors to naturalness are segmental quality and prosody. Recent improvements in the segmental quality of synthesized speech have made it clear that truly high-quality speech synthesis now depends crucially on adequate, natural-sounding prosody as well. Most current research in prosody is directed at intonation and its realization through fundamental frequency (f0), while duration and amplitude have mostly been regarded as secondary factors.

The project aimed at a better model of segmental duration. We also wanted a clearer understanding of the relation between discourse structure and prosodic parameters. Central to the approach is the investigation of the interdependencies between intonation and duration: fundamental frequency is explicitly taken into account by means of tone labelling. Another novelty is the explicit incorporation of discourse-related information, such as the division into topic, focus and background, in a dedicated part of the corpus. The corpus used in our study is the first of its kind for Austrian German. To take a large number of (potential) parameters into account, the corpus, which had to be prosodically labelled, needed to be of considerable size (50,000+ phonemes). The statistical methods employed had to cope with the inherently uneven distribution of feature values (data sparsity). We opted for machine learning methods, in particular structural regression trees (SRTs), which integrate the statistical method of regression trees with the inductive logic programming paradigm.
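To illustrate the regression-tree half of this idea, the following is a minimal sketch in Python. It is a plain regression tree over categorical features, not a structural regression tree: the inductive logic programming component is left out entirely. The feature names (`phone_class`, `stressed`, `phrase_final`), the durations, and all function names are invented for illustration and are not taken from the project corpus or software.

```python
from statistics import mean

# Toy data, invented for illustration: (segment features, duration in ms).
SEGMENTS = [
    ({"phone_class": "vowel", "stressed": True, "phrase_final": True}, 140.0),
    ({"phone_class": "vowel", "stressed": True, "phrase_final": False}, 100.0),
    ({"phone_class": "vowel", "stressed": False, "phrase_final": True}, 90.0),
    ({"phone_class": "vowel", "stressed": False, "phrase_final": False}, 60.0),
    ({"phone_class": "consonant", "stressed": True, "phrase_final": True}, 80.0),
    ({"phone_class": "consonant", "stressed": True, "phrase_final": False}, 60.0),
    ({"phone_class": "consonant", "stressed": False, "phrase_final": True}, 70.0),
    ({"phone_class": "consonant", "stressed": False, "phrase_final": False}, 50.0),
]


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, min_leaf):
    """Find the 'feature == value' test that minimises the summed SSE."""
    best = None
    for feature in sorted(rows[0][0]):
        for value in sorted({f[feature] for f, _ in rows}, key=str):
            left = [r for r in rows if r[0][feature] == value]
            right = [r for r in rows if r[0][feature] != value]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # too few examples on one side (data sparsity)
            cost = sse([d for _, d in left]) + sse([d for _, d in right])
            if best is None or cost < best[0]:
                best = (cost, feature, value, left, right)
    return best


def build_tree(rows, min_leaf=2):
    """Grow a tree recursively; each leaf predicts the mean duration."""
    durations = [d for _, d in rows]
    split = best_split(rows, min_leaf) if len(rows) >= 2 * min_leaf else None
    if split is None or split[0] >= sse(durations):
        return {"leaf": mean(durations)}
    _, feature, value, left, right = split
    return {
        "feature": feature,
        "value": value,
        "yes": build_tree(left, min_leaf),
        "no": build_tree(right, min_leaf),
    }


def predict(tree, features):
    """Follow the tests from the root down to a leaf."""
    while "leaf" not in tree:
        branch = "yes" if features[tree["feature"]] == tree["value"] else "no"
        tree = tree[branch]
    return tree["leaf"]


tree = build_tree(SEGMENTS)
print(predict(tree, {"phone_class": "vowel", "stressed": True, "phrase_final": True}))
```

The `min_leaf` threshold is one simple way to keep the tree from splitting on feature combinations that are barely attested, which is the practical face of the data sparsity problem mentioned above.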

The results of the study were integrated into our existing speech synthesis component. This provided us with the necessary tool to experimentally test the hypotheses in the evaluation phase. It also forms a showcase for demonstrating the practical enhancement of the quality of synthesized speech by means of duration control.
