Functions

Functions for Concatenating Units

us_unit_concat

void us_unit_concat(EST_Utterance &utt, float window_factor, const EST_String &window_name, bool no_waveform=false)

Iterate through the Unit relation and create the SourceCoef relation, which contains a series of windowed frames of speech and a track of pitch-synchronous coefficients.

SourceCoef contains a single item with two features, coefs and frame.

The value of coefs is a track with all the concatenated pitchmarks and coefficients from the units.

us_unit_concat is where the pitch-synchronous windowing of the frames in each Unit is performed; the result is stored as the value of frame.

Require: Unit

Provide: SourceCoef

Parameters
utt

utterance

window_factor

This specifies how large the analysis window is in relation to the local pitch period. A value of 1.0 is often used as this means each frame approximately extends from the previous pitch mark to the next.

window_name

This specifies the type of window used. "hanning" is standard but any window type available from the signal processing library can be used.

no_waveform

If this is set to true, only the coefficients are copied into SourceCoef; no waveform analysis is performed.
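
The pitch-synchronous windowing described above can be illustrated with a minimal sketch. It does not use the EST classes: Frame and window_frames are hypothetical names, pitchmarks are given as sample indices, and the code assumes that window_factor scales the half-width of a Hanning window measured in local pitch periods, so that a factor of 1.0 reaches roughly from the previous pitchmark to the next.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one windowed frame (not the EST type).
using Frame = std::vector<float>;

// Cut one Hanning-windowed frame around each pitchmark.
// signal: waveform samples; pm: pitchmark positions in samples, ascending;
// window_factor: assumed to be the window half-width in local pitch periods.
std::vector<Frame> window_frames(const std::vector<float> &signal,
                                 const std::vector<long> &pm,
                                 float window_factor)
{
    assert(window_factor > 0.0f);
    const double pi = 3.14159265358979323846;
    const long n_samples = static_cast<long>(signal.size());
    std::vector<Frame> frames;

    for (std::size_t i = 0; i < pm.size(); ++i) {
        // Local period: distance to the previous pitchmark
        // (or to the next one for the very first pitchmark).
        long period = 1;
        if (i > 0) period = pm[i] - pm[i - 1];
        else if (pm.size() > 1) period = pm[1] - pm[0];

        long half = std::lround(window_factor * period);
        if (half < 1) half = 1;
        const long len = 2 * half + 1;

        Frame f(static_cast<std::size_t>(len), 0.0f);
        for (long k = 0; k < len; ++k) {
            const long s = pm[i] - half + k;
            if (s < 0 || s >= n_samples) continue;   // zero-pad outside the signal
            const double w = 0.5 - 0.5 * std::cos(2.0 * pi * k / (len - 1));
            f[k] = static_cast<float>(w * signal[s]);
        }
        frames.push_back(f);
    }
    return frames;
}
```

With pitchmarks 20 samples apart and a window_factor of 1.0, each frame in this sketch is 41 samples long, with a weight of 1.0 at the pitchmark and 0.0 at both ends.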

us_get_copy_wave

void us_get_copy_wave(EST_Utterance &utt, EST_Wave &source_sig, EST_Track &source_pm, EST_Relation &source_seg, float window_factor, const EST_String &window_name)

This function provides the setup for copy resynthesis. In copy resynthesis, a natural waveform is used as the source speech for synthesis rather than diphones or other concatenated units. This is often useful for testing a prosody module or for altering the pitch or duration of a natural waveform for an experiment. (As such, this function should really be thought of as a very simple unit selection module.)

In addition to the speech waveform itself, the function requires a set of pitchmarks in the standard form, and a set of labels which mark segment boundaries. The Segment relation must already exist in the utterance prior to calling this function.

First, the function creates a Unit relation with a single item containing the waveform and the pitchmarks. Next, it adds a source_end feature to each item in the Segment relation. It does this by calculating a mapping between the Segment relation and the input labels. This mapping is performed by dynamic programming, as the two sets of labels often don't match exactly.

The final result, therefore, is a Unit relation and a Segment relation with source_end features. As this is exactly the same output as that of the standard concatenative synthesis modules, from here on the utterance can be processed as if the units were from a genuine synthesizer.

Copy synthesis itself can be performed by ....

Require: Segment

Provide: Unit

Parameters
utt

utterance

source_sig

waveform

source_pm

pitchmarks belonging to waveform

source_seg

set of items with end times referring to points in the waveform

window_factor

This specifies how large the analysis window is in relation to the local pitch period. A value of 1.0 is often used as this means each frame approximately extends from the previous pitch mark to the next.

window_name

This specifies the type of window used. "hanning" is standard but any window type available from the signal processing library can be used.

us_unit_raw_concat

void us_unit_raw_concat(EST_Utterance &utt)

This function produces a waveform from the Unit relation without prosodic modification. In effect, this function simply concatenates the waveform parts of the units in the Unit relation. An overlap-add operation is performed at unit boundaries so that waveform discontinuities don't occur.
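
As a rough illustration of joining units with an overlap at each boundary, the sketch below uses a linear cross-fade over a fixed number of samples. The function name concat_crossfade, the overlap parameter, and the choice of a linear fade are all assumptions made for the example; the text does not say how us_unit_raw_concat shapes its overlap region.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Concatenate unit waveforms, cross-fading `overlap` samples at each join.
// Hypothetical helper: plain sample vectors stand in for the unit waveforms.
std::vector<float> concat_crossfade(const std::vector<std::vector<float>> &units,
                                    std::size_t overlap)
{
    std::vector<float> out;
    for (const std::vector<float> &u : units) {
        // The overlap cannot exceed what is already written or the new unit.
        const std::size_t n = std::min({overlap, out.size(), u.size()});
        const std::size_t start = out.size() - n;

        // Fade the tail of the output out while fading the new unit in.
        for (std::size_t k = 0; k < n; ++k) {
            const float w = static_cast<float>(k + 1) / static_cast<float>(n + 1);
            out[start + k] = (1.0f - w) * out[start + k] + w * u[k];
        }
        // Append the part of the new unit that lies after the overlap.
        out.insert(out.end(), u.begin() + static_cast<std::ptrdiff_t>(n), u.end());
    }
    return out;
}
```

Each join shortens the result by the overlap length, so two 4-sample units with an overlap of 2 give 6 output samples.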

us_energy_normalise

void us_energy_normalise(EST_Relation &unit)

Items in the Unit relation can take an optional flag, energy_factor, which scales the amplitude of the unit waveform. This is useful because units often have different energy levels due to different recording circumstances. An energy_factor of 1.0 leaves the waveform unchanged.
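
The effect of energy_factor can be shown with a small sketch. The Unit struct and apply_energy_factor below are hypothetical stand-ins for items in the Unit relation and for the scaling step; they are not the EST types.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an item in the Unit relation.
struct Unit {
    std::vector<float> sig;       // unit waveform samples
    float energy_factor = 1.0f;   // optional flag; 1.0 leaves the waveform unchanged
};

// Scale every unit waveform by its own energy_factor.
void apply_energy_factor(std::vector<Unit> &units)
{
    for (Unit &u : units) {
        if (u.energy_factor == 1.0f) continue;   // nothing to do
        for (float &s : u.sig) s *= u.energy_factor;
    }
}
```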

Functions for Producing Mappings

us_mapping

void us_mapping(EST_Utterance &utt, const EST_String &method)

This function produces the mapping between the SourceCoef track and TargetCoef track. The mapping is controlled by two types of modification, duration and pitch.

Duration is specified by the Segment relation. Each item in this relation has two features, source_end and target_end. source_end marks the end point of that segment in the concatenated set of source coefficients, while target_end marks the desired end of that segment.

Pitch modification is specified by the patterns of pitchmarks in the SourceCoef track and TargetCoef track. While these tracks actually represent periods, their reciprocal represents the source and target F0 contours.

The mapping is an integer array with one element for every pitchmark in the TargetCoef track. Therefore every target pitchmark has a mapping element, and the value of that element is the frame number in the SourceCoef track which should be used to generate the frame of speech for that target pitchmark. Depending on the mapping, source frames can be duplicated or skipped.

If the duration is constant, a higher target pitch will mean source frames are duplicated. If the pitch is constant, a longer target duration will also mean source frames are duplicated. The duration and pitch modifications are calculated at the same time, leading to a single mapping.

Require: SourceCoef, TargetCoef, Segment

Provide: US_Map
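
One way to picture how a single mapping can cover both duration and pitch changes is the sketch below. It is an assumption-laden illustration, not the method selected by the method argument: it warps time linearly inside each segment (from the target_end scale to the source_end scale) and then picks the nearest source pitchmark. Segment and make_mapping are hypothetical names, times are in seconds, and the first segment is assumed to start at time 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an item in the Segment relation.
struct Segment {
    float source_end;   // end of the segment in the source coefficients
    float target_end;   // desired end of the segment in the output
};

// For each target pitchmark, pick the index of the source frame to use.
std::vector<int> make_mapping(const std::vector<float> &source_pm,
                              const std::vector<float> &target_pm,
                              const std::vector<Segment> &segs)
{
    assert(!source_pm.empty() && !segs.empty());
    std::vector<int> map(target_pm.size(), 0);
    std::size_t j = 0;   // current segment

    for (std::size_t i = 0; i < target_pm.size(); ++i) {
        const float t = target_pm[i];
        while (j + 1 < segs.size() && t > segs[j].target_end) ++j;

        // Linear time warp inside segment j (target time -> source time).
        const float t0 = (j == 0) ? 0.0f : segs[j - 1].target_end;
        const float s0 = (j == 0) ? 0.0f : segs[j - 1].source_end;
        const float tlen = segs[j].target_end - t0;
        float frac = (tlen > 0.0f) ? (t - t0) / tlen : 0.0f;
        frac = std::min(1.0f, std::max(0.0f, frac));
        const float s = s0 + frac * (segs[j].source_end - s0);

        // Nearest source pitchmark to the warped time.
        auto it = std::lower_bound(source_pm.begin(), source_pm.end(), s);
        std::size_t k = static_cast<std::size_t>(it - source_pm.begin());
        if (k == source_pm.size()) k = source_pm.size() - 1;
        else if (k > 0 && (s - source_pm[k - 1]) <= (source_pm[k] - s)) k = k - 1;
        map[i] = static_cast<int>(k);
    }
    return map;
}
```

With four source pitchmarks, one segment stretched to twice its length, and eight target pitchmarks at the original spacing, the resulting map runs from 0 to 3 with repeated entries, which is the frame duplication described above.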

Pitchmark Functions

f0_to_pitchmarks

void f0_to_pitchmarks(EST_Track &fz, EST_Track &pm, int num_channels=0, float default_f0=100.0, float target_end=-1)

This function generates the target pitchmarks from the target F0 contour. The pitchmarks are generated by reading a value, $f_0$, off the F0 contour at time $t$, calculating the local pitch period $\tau = 1/f_0$, and placing a pitchmark at time $t + \tau$. The process is then repeated by reading the F0 value at this new point, and so on.

The F0 contour must be continuous in all regions; that is, unvoiced regions must also have pseudo F0 values. Although artificial contours are best generated in this way to begin with, the function \ref{**} can be used to interpolate through unvoiced regions for non-continuous contours.

As the last F0 value in the contour may not be the end of the utterance (for example if the last phone is unvoiced), the pitchmarks may be extended past the end of the contour.

After processing, the generated track only contains the target pitchmarks, but later functions may fill the amplitude array of the track with target coefficients, and hence the space for these can be allocated at this stage.

Parameters
fz

input F0 contour.

pm

set of pitchmarks to be generated. These are set to the correct size in the function.

num_channels

(optional) number of coefficients used in further processing.

default_f0

(optional) f0 value for interpolated end values

target_end

(optional) fill from the end of the contour to this point with default f0 values.
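
The read-period-step loop described for this function can be sketched as follows. F0Contour and place_pitchmarks are hypothetical names; the contour is assumed to be evenly spaced and to start at time 0, and the nearest earlier F0 value is read without interpolation. Past the end of the contour the sketch falls back to default_f0 until target_end is reached.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an F0 track: evenly spaced values starting at t = 0.
struct F0Contour {
    float shift;             // frame spacing in seconds
    std::vector<float> f0;   // F0 in Hz, continuous through unvoiced regions
};

// Place pitchmarks one local pitch period apart, following the contour.
std::vector<float> place_pitchmarks(const F0Contour &fz,
                                    float default_f0 = 100.0f,
                                    float target_end = -1.0f)
{
    assert(fz.shift > 0.0f && default_f0 > 0.0f);
    const float contour_end = fz.shift * static_cast<float>(fz.f0.size());
    const float end = std::max(contour_end, target_end);

    std::vector<float> pm;
    float t = 0.0f;
    for (;;) {
        // Read F0 at the current time; use default_f0 beyond the contour.
        float f0 = default_f0;
        if (t < contour_end) {
            std::size_t idx = static_cast<std::size_t>(t / fz.shift);
            idx = std::min(idx, fz.f0.size() - 1);
            if (fz.f0[idx] > 0.0f) f0 = fz.f0[idx];
        }
        t += 1.0f / f0;           // step forward by one pitch period
        if (t > end) break;
        pm.push_back(t);
    }
    return pm;
}
```

A flat 100 Hz contour lasting 0.5 seconds yields pitchmarks about 0.01 seconds apart; passing a target_end of 1.0 extends them at default_f0 up to one second.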

pitchmarks_to_f0

void pitchmarks_to_f0(EST_Track &pm, EST_Track &fz, float shift)

This is a utility function for converting a set of pitchmarks back to an F0 contour and is usually used in system development etc. The generated F0 is evenly spaced.

Parameters
pm

input set of pitchmarks.

fz

output F0 contour.

shift

frame shift of generated contour in seconds.
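
The reverse conversion can be sketched in the same spirit. The function marks_to_f0 is a hypothetical name, and the code assumes that the F0 value at each evenly spaced output time is the reciprocal of the pitch period that contains that time; the text does not specify how pitchmarks_to_f0 samples the periods.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert pitchmark times (seconds, ascending) into an evenly spaced F0 contour.
std::vector<float> marks_to_f0(const std::vector<float> &pm, float shift)
{
    assert(shift > 0.0f);
    std::vector<float> f0;
    if (pm.size() < 2) return f0;   // no period can be measured

    std::size_t k = 1;              // pm[k-1]..pm[k] is the current period
    for (int n = 0;; ++n) {
        const float t = static_cast<float>(n) * shift;
        if (t > pm.back()) break;
        while (k + 1 < pm.size() && pm[k] < t) ++k;
        const float period = pm[k] - pm[k - 1];
        f0.push_back(period > 0.0f ? 1.0f / period : 0.0f);
    }
    return f0;
}
```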

Functions for Generating Waveforms

us_generate_wave

void us_generate_wave(EST_Utterance &utt, const EST_String &filter_method, const EST_String &ola_method)

Standard waveform generation function. This function generates the actual synthetic speech waveform, using information in the SourceCoef, TargetCoef and US_map relations.

The first stage involves time domain processing, whereby a speech waveform or residual waveform is generated. The second (optional) stage passes this waveform through the set of filter coefficients specified in the TargetCoef track. The output synthetic waveform is put in the Wave relation.

LPC resynthesis is performed by the lpc_filter_1 function.

Require: SourceCoef, TargetCoef, US_map

Provide: Wave

Parameters
utt

utterance

filter_method

type of filter used - normally "lpc" or none ("")

ola_method

type of time domain synthesis.

map_coefs

void map_coefs(EST_Track &source_coef, EST_Track &target_coef, EST_IVector &map)

This copies coefficients from source_coef into target_coef according to the frame mapping specified by map. target_coef should already have been allocated, and the pitchmarks in the time array set to appropriate values. (This can be done by the f0_to_pitchmarks function.)
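
The copy itself amounts to an indexed lookup, as the sketch below shows. CoefFrame and copy_mapped are hypothetical names, and vectors of coefficient frames stand in for the amplitude arrays of the two tracks; entries of map that point outside the source are skipped here as a defensive assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one row of coefficients in a track.
using CoefFrame = std::vector<float>;

// target[i] receives a copy of source[map[i]]; target must already be allocated.
void copy_mapped(const std::vector<CoefFrame> &source,
                 std::vector<CoefFrame> &target,
                 const std::vector<int> &map)
{
    const std::size_t n = std::min(target.size(), map.size());
    for (std::size_t i = 0; i < n; ++i) {
        const int s = map[i];
        if (s < 0 || static_cast<std::size_t>(s) >= source.size()) continue;
        target[i] = source[static_cast<std::size_t>(s)];
    }
}
```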

td_synthesis

void td_synthesis(EST_FrameVector &frames, EST_Track &target_pm, EST_Wave &target_sig, EST_IVector &map)

Time domain resynthesis.

Generate a speech waveform by copying frames into a set of time positions given by target_pm. The frame used for each time position is given by map, and the frames themselves are stored individually as waveforms in frames.

Parameters
target_sig

output waveform

target_pm

new pitchmark positions

frames

array containing waveforms, each representing a single analysis frame

map

mapping between target_pm and frames.
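
A minimal sketch of this kind of time domain resynthesis is given below. It assumes that each frame is added into the output centred on its target pitchmark (overlap-add), that pitchmarks are sample indices, and that invalid map entries are skipped; overlap_add is a hypothetical name and plain vectors stand in for the EST types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Build a waveform by adding frames[map[i]] centred on target_pm[i].
std::vector<float> overlap_add(const std::vector<std::vector<float>> &frames,
                               const std::vector<long> &target_pm,
                               const std::vector<int> &map)
{
    const std::size_t n = std::min(target_pm.size(), map.size());

    // Does entry i refer to an existing frame?
    auto valid = [&](std::size_t i) {
        return map[i] >= 0 && static_cast<std::size_t>(map[i]) < frames.size();
    };

    // First pass: find how long the output has to be.
    long length = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!valid(i)) continue;
        const long size = static_cast<long>(frames[map[i]].size());
        length = std::max(length, target_pm[i] - size / 2 + size);
    }

    // Second pass: add each frame into place.
    std::vector<float> out(static_cast<std::size_t>(length), 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        if (!valid(i)) continue;
        const std::vector<float> &frame = frames[map[i]];
        const long start = target_pm[i] - static_cast<long>(frame.size()) / 2;
        for (std::size_t k = 0; k < frame.size(); ++k) {
            const long pos = start + static_cast<long>(k);
            if (pos >= 0 && pos < length) out[pos] += frame[k];
        }
    }
    return out;
}
```

Because the same source frame may appear several times in map, a frame can be added at more than one target position, and neighbouring frames sum wherever they overlap.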