Good practices for extracting pitch using script


Katrina Li
 

Hi all,

I'm interested in extracting dynamic pitch patterns for tone/intonation analysis, primarily using time-normalised measures (10 points per syllable), but potentially also values at fixed time intervals (one value every 0.01 s). I've encountered a few solutions, but I am curious about your recommendations on some choices; specifically, I'm interested in the following questions:
--------------

1. Pitch extraction is commonly combined with a segmenting TextGrid indicating the syllables/vowels that we want to extract f0 from. However, the boundaries of segments/syllables might not be suitable for the purpose of pitch extraction. For example, the first few cycles of vibration are not regular, and the end of a syllable (especially the end of an utterance) might show some peculiar behaviour. To deal with this problem, would you recommend:

A. discard the first and last 10% of the pitch values for all intervals
B. specify the analysis window in the script, e.g. 'pitch_window_threshold' in Pitch Dynamics Script for Praat, version 6.2, or 'perturbation_length' and 'final_offset' in ProsodyPro
C. manually adjust the textgrid boundaries to exclude the irregular period

My concern with A is that 10% + 10% of the data can be a lot to lose, and some interesting changes might be discarded as well.
My concern with C is that the pitch analysis window will still go beyond the assigned boundaries, and it's also difficult to predict where the optimal part for pitch extraction is.
I haven't tried B, but there's a question of how to determine these values.
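For what it's worth, option A combined with time-normalised sampling can be sketched roughly like this (pure Python; not taken from any of the scripts mentioned above, and all names are my own):

```python
# Sketch: given f0 samples at a fixed time step, take n time-normalised
# points per interval, optionally trimming a proportion off each edge
# first (option A above).

def time_normalised_f0(times, f0, start, end, n_points=10, edge_trim=0.10):
    """Return n_points f0 values evenly spaced over [start, end],
    after discarding edge_trim of the interval at each edge."""
    dur = end - start
    start += dur * edge_trim
    end -= dur * edge_trim
    samples = []
    for i in range(n_points):
        # target time for the i-th normalised point (midpoint spacing)
        t = start + (end - start) * (i + 0.5) / n_points
        # nearest available f0 frame
        j = min(range(len(times)), key=lambda k: abs(times[k] - t))
        samples.append(f0[j])
    return samples
```

With `edge_trim=0` this reduces to plain time normalisation, so the same function covers both the trimmed and untrimmed cases.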
--------------

2. What is the recommended object from which to extract pitch measurements?

ProsodyPro allows users to modify vocal pulses in a PointProcess object and then convert it to a PitchTier to get pitch values. This is indeed helpful for periods containing creaky voice, when it is not possible for the algorithm to correctly identify f0.
Many other scripts (e.g. Pitch Dynamics Script for Praat, version 6.2, better-f0, this Praat demo) directly generate a Pitch object. They do not seem to allow changes to the pitch values, though.

I previously asked a question about Pitch and PitchTier objects (link). Thanks very much for the answer, but I'm not sure whether that consideration is relevant for tone/intonation analysis.
--------------

3. If some human inspection and adjustment is inevitably needed, when should it happen?

A. in the PointProcess, as in ProsodyPro
B. in the Pitch object
C. no modifications of any Praat objects; get the values and then clean the data later on (e.g. delete extreme values)
--------------

4. If I'm going to examine each utterance separately in the workflow, is it beneficial to adjust the following settings for each utterance, even for files from the same speaker? Or is a universal setting good enough?

A. octave jump
B. voicing threshold
C. the analysis window (as in choice B in Q1)
D. pitch floor and pitch ceiling

I tend to say yes to D, but I'm not sure about the others.
--------------

5. Do you recommend using any smoothing algorithms?
--------------
I know this is a LOT to ask, and probably very broad too. I listed the questions together here since they are related. Any comments or suggestions will be much appreciated!


Christian DiCanio
 

Hi there,

There is a bit to unpack here, but here goes.

1. The pitch window threshold in my script (Pitch Dynamics Script for Praat, version 6.2) is simply a timepoint over which the script begins to track F0. You can think of this as getting a running start on f0 estimation because autocorrelation requires it for accuracy. It's set to 30 ms before the interval and 30 ms after as a "default", but there are really no defaults here. You could set it to 50 ms if you wish.

2. How much user-specific f0 selection you wish to do is always up to the researcher. Given the amount of data we typically analyze, my take is that it is far better to do either post-script filtering or to perhaps run subsets of one's data with different parameters set. One advantage of these approaches is that they are replicable. If you decide to re-run data and you did hand-correction with lots of experimental data, you have to redo all the hand-corrected analyses. So, identify the problematic cases via data visualization and then re-run the script thinking about what your parameters are. The second advantage to this approach is that it works with corpus data well - you probably do not end up wanting to reselect f0 targets when you are extracting 10,000 - 20,000 or more trajectories.

3. Do C, but see above (2).

4. There is no universal setting for what works best for f0 estimation. The major issues that arise are either (a) no estimable pitch value or (b) pitch doubling/halving. The first issue is addressed by adjusting the voicing threshold, but honestly, if you have good recordings, it's not often an issue. (It can be an issue when recordings are made with the gain set too low, though.) The second issue can be addressed with the octave jump setting, but really, I think it comes up less when you have speaker-specific f0 ranges set (my recommendation). So, look across recordings to estimate what the max f0 seems to be, and find a reasonable minimum via visual inspection of the recordings. Then set the ceiling 10-20 Hz above the observed maximum and the floor 10-20 Hz below the observed minimum.
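Christian's rule of thumb above can be sketched as follows (a rough illustration in pure Python; the function name and the particular margin value are my own):

```python
def speaker_pitch_range(observed_f0, margin_hz=15):
    """Speaker-specific floor/ceiling from observed f0 values,
    padded by a 10-20 Hz margin as suggested above."""
    voiced = [v for v in observed_f0 if v is not None and v > 0]
    floor = min(voiced) - margin_hz
    ceiling = max(voiced) + margin_hz
    return max(floor, 0.0), ceiling
```

For example, observed values spanning 180-250 Hz with a 15 Hz margin would give a floor of 165 Hz and a ceiling of 265 Hz for that speaker.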

5. It depends on what you wish to achieve. This is really a statistical question, though. R uses LOESS smoothing by default for curves in plotting (specifically in ggplot2), but one could also model f0 curves using GAMMs or other models (growth curves, for instance, though those have a model selection issue).

Best,

Christian DiCanio
cdicanio@...

Associate Professor
Director of Graduate Admissions
Department of Linguistics
University at Buffalo


Piet Mertens
 

Hi K.L.,

To deal with the problem of pitch perturbations at the boundaries of a syllable, a syllabic nucleus, a segment, or some other time interval taken from a TextGrid, I would suggest the approach (used in Prosogram v3) which detects the undefined (unvoiced) pitch frames and the pitch perturbations at such (nucleus) boundaries. 
(a) First synchronise (i.e. adjust) the interval boundaries with the pitch frames in the Pitch object (Prosogram uses a fixed time step, by default 5 ms, for high temporal resolution).
(b) Next, for the updated time interval, trim undefined pitch frames at the start and the end of the interval. This results in a new interval without undefined pitch frames at the borders (avoiding invalid pitch values near frames with undefined pitch).  
(c) Then, if the resulting interval contains additional undefined frames, split it into smaller parts at these frames. 

To deal with octave jumps or pitch perturbations at transitions between unvoiced and voiced sounds: 
(d) First pitch discontinuities are detected within a candidate interval, on the basis of the frequency change between successive pitch frames. 
(e) Among the obtained continuous parts, one part (considered representative) is selected on the basis of the following (heuristic) criteria: (1) the mean intensity of the corresponding speech signal, (2) its duration, and (3) the distance from the median F0 of the speaker. In this way parts are ranked so as to avoid those with low intensity, short duration, and large deviation from the median pitch. Note: rather than selecting one part as the representative one (i.e. discarding those with a pitch one octave higher or lower than the perceived one), it would be possible to keep all continuous parts and postpone the decision.

This procedure does not require manual changes to the pitch object.
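Assuming f0 frames at a fixed time step, with undefined (unvoiced) frames marked as None, steps (b)-(d) might be sketched like this (a rough illustration only, not the actual Prosogram code; the semitone threshold is my own choice):

```python
from math import log2

def continuous_parts(f0, max_jump_st=3.0):
    """Split a frame sequence into continuous voiced parts, breaking at
    undefined frames and at pitch discontinuities larger than max_jump_st
    semitones between successive frames. Leading/trailing undefined
    frames are trimmed automatically, since they start no part."""
    parts, current = [], []
    prev = None
    for i, v in enumerate(f0):
        jump = abs(12 * log2(v / prev)) if (v and prev) else 0.0
        if v is None or (current and jump > max_jump_st):
            if current:
                parts.append(current)
            current = []
        if v is not None:
            current.append((i, v))
        prev = v
    if current:
        parts.append(current)
    return parts
```

Step (e), selecting the representative part, would then rank these parts by intensity, duration, and distance from the speaker's median f0.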

As for the choice of pitch floor and pitch ceiling, I suggest using two-pass pitch extraction: pass 1 obtains the median pitch for a very broad pitch range covering both low and high voices; pass 2 uses this central value to set the pitch floor and ceiling of the final pitch measurement. 
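A minimal sketch of the second pass (illustrative only; the 0.75 x Q1 / 1.5 x Q3 rule follows De Looze & Hirst's published proposal for automatic pitch-range estimation and may differ from Prosogram's own formula):

```python
from statistics import quantiles

def two_pass_range(pass1_f0):
    """Derive pass-2 pitch floor/ceiling from pass-1 f0 values that were
    obtained with a very broad range covering low and high voices."""
    voiced = sorted(v for v in pass1_f0 if v)
    q1, _, q3 = quantiles(voiced, n=4)  # quartiles of the distribution
    return 0.75 * q1, 1.5 * q3
```

The asymmetric multipliers reflect the fact that f0 distributions are skewed: speakers range much further above their central value than below it.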

For full details see the Prosogram script.

Piet Mertens






Katrina Li
 

Thank you very much for the explanations! They are very helpful suggestions. I still have some follow-up questions/clarifications:

1. For pitch window length
If the pitch contour shows some irregular pattern (e.g. a perturbation) and I want to trim those values, should I modify the interval boundaries so that the interval 'shrinks', and/or choose a smaller pitch window length?

2 & 4. User-specific f0 parameters
I totally agree with the 'replication' considerations. However, my own dataset is not a huge amount of data, so in a way every measurement counts.

I have a semi-automatic approach in mind, like this one. For each utterance, if the visualisation looks correct, accept it; if wrong, adjust some settings (e.g. f0 floor/ceiling, or the pitch window length). Taking the 'replication' issues into consideration, I think the compromise is to note down not only the collected pitch values but also the settings adopted for each file. Based on your explanation, perhaps the octave jump and voicing threshold are less helpful parameters to adjust in this approach?
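For instance, noting down the settings together with the values could be as simple as appending one row per file (an illustrative sketch only; the field names are my own invention):

```python
import csv

def log_measurement(path, filename, settings, f0_values):
    """Append one row per analysed file: the settings used (so the
    analysis can be re-run or audited later) plus the extracted values."""
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([filename,
                         settings["floor"], settings["ceiling"],
                         settings["window"],
                         " ".join(f"{v:.1f}" for v in f0_values)])
```

Keeping this log alongside the measurements means a hand-tuned semi-automatic session is still reproducible: anyone can re-run each file with the recorded settings.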

5. Smoothing algorithms
Sorry I wasn't being clear. In ProsodyPro there is 'a trimming algorithm that removes spikes and sharp edges (cf. Appendix 1 in Xu 1999), and a triangular smoothing function'. I was a bit unsure whether this means we are not getting the 'raw' data - but perhaps there's no 'raw' data after all.
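For illustration, a generic 3-point triangular smoother looks like this (a sketch of the general idea only; ProsodyPro's actual trimming and smoothing follow Xu 1999 and may use a wider window):

```python
def triangular_smooth(f0):
    """Weighted moving average with triangular weights (1, 2, 1)/4;
    endpoint values are kept unchanged."""
    out = list(f0)
    for i in range(1, len(f0) - 1):
        out[i] = (f0[i-1] + 2 * f0[i] + f0[i+1]) / 4
    return out
```

Note that any such filter attenuates single-frame spikes but also slightly flattens genuine sharp turning points, which is exactly the raw-vs-smoothed trade-off in question.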


Katrina Li
 

Thank you for introducing the ideas behind Prosogram! I did have a look; the visuals are very cool, but I do need to output f0 measurements (perhaps normalised) for many utterances (with repetitions) to conduct statistical analysis. Is Prosogram able to do that?


Daniel McCloy
 

On Thu, Nov 17, 2022 at 10:04 AM, Katrina Li wrote:

I have the semi-automatic approach in mind like this one ( https://github.com/drammock/praat-semiauto ).

Do let me know if you end up using that approach and run into any difficulties (I wrote that one). Since I'm responding anyway now, I'll second what Christian has said already: voicing threshold is usually not a problem for reasonable-quality recordings, and using speaker-specific pitch floor and ceiling numbers (in my experience) gives better quality results than trying to find one floor & ceiling that works for all talkers.


Yi Xu
 

Hi Katrina,

You may find the answers to many of your questions about ProsodyPro in this presentation, e.g., the effects of trimming and smoothing can be seen in slide 20.

Regarding pitch window length, it depends on your purpose. The default setting in ProsodyPro, for example, is not ideal for showing all the detailed consonantal perturbation effects, as discussed in this paper.

Yi



------------------------------
Yi Xu, Ph.D.
Professor of Speech Sciences
Department of Speech, Hearing and Phonetic Sciences

University College London
Chandler House
2 Wakefield Street
London WC1N 1PF
UK

Tel:  020 7679 4082 (internal: 24082)
email: yi.xu@...
http://www.homepages.ucl.ac.uk/~uclyyix/
------------------------------



Piet Mertens
 

Hi K.L.,

For information on the output files generated by the Prosogram script, please consult the User's Guide (in PDF, 54 p, tutorial and reference manual) available at the Prosogram web page ( https://sites.google.com/site/prosogram ), more specifically sections 
- 5.1 Overview of data files 
- 5.2 Table with prosodic features per syllable 
- 9.3 Exporting the stylisation to another program
The output file referred to in 5.2 contains a set of several F0 (and other) values per syllable, before and after stylisation. This implies a prior segmentation into syllables (either from a TextGrid tier or automatically by Prosogram).
The script can also output F0 values (at a fixed time step) obtained from the stylised pitch contour (not documented in the User's Guide). 

By "f0 measurements (and perhaps normalised)", do you mean time-normalised values or normalised pitch values? Prosogram also provides PRNP (pitch range normalised pitch) values, where the value shown (at a given time) is a fraction of the speaker's pitch range (calculated by the script on semitone values).
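As an illustration of range normalisation computed on semitone values (my own sketch, not the actual PRNP formula; see the User's Guide for that):

```python
from math import log2

def st(hz, ref=1.0):
    """Convert Hz to semitones relative to a reference frequency."""
    return 12 * log2(hz / ref)

def range_normalised(f0_hz, floor_hz, ceiling_hz):
    """Express a pitch value as a fraction of the speaker's pitch range,
    with the range measured in semitones rather than Hz."""
    return (st(f0_hz) - st(floor_hz)) / (st(ceiling_hz) - st(floor_hz))
```

Working in semitones rather than Hz makes the fraction comparable across speakers with very different absolute ranges, since pitch perception is roughly logarithmic.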

Time (duration) is a crucial factor in pitch perception: an F0 change on a sound will be perceived either as a changing pitch or as a flat pitch, depending on the duration of that sound. The presence of a pause also has an impact on perception.
Applying time normalisation to sentences from repetitions by the same speaker and/or multiple speakers will therefore result in comparing stimuli with different durations, which are not necessarily comparable from a perceptual point of view.