Good practices for extracting pitch using script
Katrina Li
Hi all,
I'm interested in extracting dynamic pitch patterns for tone/intonation analysis, primarily using time-normalised measures (10 points per syllable), but potentially also values at a fixed time interval (one value every 0.01 s). I've encountered a few solutions, but I am curious about your recommendations on some choices. Specifically, I'm interested in the following questions:
--------------
1. Pitch extraction is commonly combined with a segmented TextGrid indicating the syllables/vowels that we want to extract f0 from. However, the segment/syllable boundaries might not be suitable for pitch extraction. For example, the first few cycles of vibration are not regular, and the end of a syllable (especially the end of an utterance) might show some peculiar behaviour. To deal with this problem, would you recommend:
A. discarding the first and last 10% of the pitch values for all intervals
B. specifying the analysis window in the script, e.g. 'pitch_window_threshold' in Pitch Dynamics Script for Praat, version 6.2, or 'pertubation_length' and 'final_offset' in ProsodyPro
C. manually adjusting the TextGrid boundaries to exclude the irregular periods
My concern with A is that 10% + 10% can be a lot of data to lose, and some interesting changes might be ignored as well. My concern with C is that the pitch analysis window will still extend beyond the assigned boundaries, and it is also difficult to predict where the optimal part for pitch extraction is. I haven't tried B, but there's a question of how to determine these values.
--------------
2. What is the recommended object to extract pitch measurements from? ProsodyPro allows people to modify vocal pulses in a PointProcess object and then convert it to a PitchTier to get pitch values. This is indeed helpful for periods containing creaky voice, where the algorithm cannot correctly identify f0. Many other scripts (e.g. Pitch Dynamics Script for Praat, version 6.2, better-f0, this praat demo) directly generate the 'pitch' object. They do not seem to allow changes to the pitch values, though. I previously asked a question about the Pitch and PitchTier objects (link). Thanks very much for the answer, but I'm not sure whether this consideration is relevant for tone/intonation analysis.
--------------
3. If some human inspection and adjustment is inevitable, when should it happen?
A. in the PointProcess, as in ProsodyPro
B. in the Pitch object
C. no modification of any Praat objects; get the values and then clean the data later on (e.g. delete the extreme values)
--------------
4. If I'm going to examine each utterance separately in the workflow, is it beneficial to adjust the following settings for each utterance, even for files from the same speaker? Or is a universal setting good enough?
A. octave jump
B. voicing threshold
C. the analysis window (as in choice B in Q1)
D. pitch floor and pitch ceiling
I tend to say yes to D, but I don't know about the others.
--------------
5. Do you recommend using any smoothing algorithms?
--------------
I know this is a LOT to ask and probably very broad too. I listed the questions all here since they are related. Any comment or suggestion will be much appreciated!
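For concreteness, the two kinds of measurement described in this post (10 time-normalised points per syllable, and one value per fixed step) can be sketched in Python as below. This is only an illustration under stated assumptions: `times` and `f0` are hypothetical lists holding the voiced frames of an already extracted pitch track, values between frames are linearly interpolated, and the `trim` argument implements option A from question 1 as a fraction of the interval duration. None of the function names come from Praat, ProsodyPro or any script mentioned in the thread.

```python
from bisect import bisect_left

def interpolate_f0(times, f0, t):
    """Linearly interpolate f0 (Hz) at time t from voiced frames sorted by time."""
    if t <= times[0]:
        return f0[0]
    if t >= times[-1]:
        return f0[-1]
    i = bisect_left(times, t)          # first frame at or after t
    t0, t1 = times[i - 1], times[i]
    w = (t - t0) / (t1 - t0)
    return f0[i - 1] + w * (f0[i] - f0[i - 1])

def time_normalised(times, f0, start, end, n_points=10, trim=0.0):
    """Sample n_points (>= 2) equally spaced f0 values within [start, end].

    trim is the fraction of the interval discarded at EACH edge
    (trim=0.1 corresponds to option A in question 1).
    """
    dur = end - start
    lo = start + trim * dur
    hi = end - trim * dur
    step = (hi - lo) / (n_points - 1)
    return [interpolate_f0(times, f0, lo + k * step) for k in range(n_points)]

def fixed_step(times, f0, start, end, step=0.01):
    """Sample f0 every `step` seconds from start up to (at most) end."""
    n = int((end - start) / step + 1e-9) + 1
    return [interpolate_f0(times, f0, start + k * step) for k in range(n)]
```

Keeping `trim` as a parameter, rather than deleting the edge samples afterwards, means that trimmed and untrimmed versions of the same syllable still yield the same number of points.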
Hi there,
Christian DiCanio
cdicanio@...
Associate Professor
Director of Graduate Admissions
Department of Linguistics
University at Buffalo
Piet Mertens
Hi K.L.,
To deal with the problem of pitch perturbations at the boundaries of a syllable, a syllabic nucleus, a segment, or some other time interval taken from a TextGrid, I would suggest the approach used in Prosogram v3, which detects the undefined (unvoiced) pitch frames and the pitch perturbations at such (nucleus) boundaries.
(a) First synchronise (i.e. adjust) the interval boundaries with the pitch frames in the Pitch object (Prosogram uses a fixed time step, by default 5 ms, for high temporal resolution).
(b) Next, for the updated time interval, trim undefined pitch frames at the start and the end of the interval. This results in a new interval without undefined pitch frames at the borders (avoiding invalid pitch values near frames with undefined pitch).
(c) Then, if the resulting interval contains additional undefined frames, split it into smaller parts at these frames.
To deal with octave jumps or pitch perturbations at transitions between unvoiced and voiced sounds:
(d) First, pitch discontinuities are detected within a candidate interval, on the basis of the frequency change between successive pitch frames.
(e) Among the continuous parts obtained, one part (considered representative) is selected on the basis of the following (heuristic) criteria: (1) the mean intensity of the corresponding speech signal, (2) its duration, and (3) the distance from the median F0 of the speaker. In this way parts are ranked to avoid those with low intensity, short duration and large deviation from the median pitch.
Note. Rather than selecting one part as the representative one (i.e. discarding those with a pitch one octave higher or lower than the perceived one), it would be possible to keep all continuous parts and postpone decisions.
This procedure does not require manual changes to the pitch object.
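Steps (b) to (e) above can be sketched roughly as follows. This is not the Prosogram code: it is a simplified Python illustration in which a pitch track is a hypothetical list of Hz values with `None` for undefined frames, step (a) is assumed to have been done already, the jump threshold `max_jump_st` and the deviation limit `max_dev_st` are made-up parameters, and the intensity criterion of step (e) is left out for brevity.

```python
import math
from statistics import median

def trim_undefined(f0):
    """Step (b): drop undefined (None) frames at both ends of the interval."""
    voiced = [i for i, v in enumerate(f0) if v is not None]
    return f0[voiced[0]:voiced[-1] + 1] if voiced else []

def split_continuous(f0, max_jump_st=4.0):
    """Steps (c)-(d): split at undefined frames and at large frame-to-frame jumps.

    max_jump_st is an illustrative threshold in semitones, not a Prosogram value.
    """
    parts, current = [], []
    for v in f0:
        if v is None:                      # step (c): undefined frame ends a part
            if current:
                parts.append(current)
            current = []
            continue
        # step (d): a large change between successive frames starts a new part
        if current and abs(12 * math.log2(v / current[-1])) > max_jump_st:
            parts.append(current)
            current = []
        current.append(v)
    if current:
        parts.append(current)
    return parts

def pick_representative(parts, speaker_median_hz, max_dev_st=6.0):
    """Step (e), simplified: prefer parts close to the speaker median, then longer ones."""
    def key(part):
        dev = abs(12 * math.log2(median(part) / speaker_median_hz))
        return (dev <= max_dev_st, len(part))
    return max(parts, key=key)

frames = [None, 100.0, 102.0, 104.0, 210.0, 212.0, None, 101.0, None]
parts = split_continuous(trim_undefined(frames))
best = pick_representative(parts, speaker_median_hz=100.0)
```

In this toy example the octave-doubled stretch (210 to 212 Hz) ends up in its own part and is ranked below the longer part near 100 Hz. Returning `parts` unchanged instead of `best` corresponds to the "keep all continuous parts and postpone decisions" option in the note.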
As for the choice of pitch floor and pitch ceiling, I suggest using two-pass pitch extraction: pass 1 obtains the median pitch for a very broad pitch range covering both low and high voices; pass 2 uses this central value to set the pitch floor and ceiling of the final pitch measurement. For full details see the Prosogram script.
Piet Mertens
On Wed, Nov 16, 2022 at 1:56 AM <kl502@...> wrote:
Hi all,
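The two-pass idea in the reply above (a broad first pass to find the speaker's median, then a floor and ceiling derived from it) can be illustrated with a small Python function. The distances below and above the median, `octaves_below` and `octaves_above`, are hypothetical defaults chosen for this sketch; the values Prosogram actually uses should be looked up in its script.

```python
from statistics import median

def second_pass_range(pass1_f0, octaves_below=1.0, octaves_above=1.5):
    """Derive (floor, ceiling) in Hz for pass 2 from the pass-1 pitch track.

    pass1_f0: f0 values in Hz from a first extraction with a very broad
    range; None marks undefined frames. The octave distances are assumptions.
    """
    voiced = [v for v in pass1_f0 if v is not None]
    if not voiced:
        raise ValueError("no voiced frames in the first pass")
    centre = median(voiced)                # robust against a few octave errors
    floor = centre / 2 ** octaves_below
    ceiling = centre * 2 ** octaves_above
    return floor, ceiling
```

For a first pass with a median of 200 Hz and the defaults above, the second pass would use a floor of 100 Hz and a ceiling of about 566 Hz. Storing the returned pair with each file also documents the settings used.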
Katrina Li
Thank you very much for the explanations! They are very helpful suggestions. I still have some follow-up questions/clarifications:
1. Pitch window length
If the pitch contour shows some irregular pattern (e.g. perturbation) and I want to trim these values, should I modify the interval boundaries so that the interval 'shrinks', and/or choose a smaller pitch window length?
2 & 4. User-specific f0 parameters
I totally agree with the 'replication' considerations. However, my own dataset is not large, so in a way every measurement counts. I have in mind a semi-automatic approach like this one: for each utterance, if the visualisation looks correct, accept it; if it looks wrong, adjust some settings (e.g. f0 floor/f0 ceiling, or the pitch window length). Taking the 'replication' issues into consideration, I think the compromise is to note down not only the collected pitch values but also the settings adopted for each file. Based on your explanation, perhaps the octave jump and voicing threshold are less helpful parameters to adjust in this approach?
5. Smoothing algorithm
Sorry I wasn't clear. In ProsodyPro there is 'a trimming algorithm that removes spikes and sharp edges (cf. Appendix 1 in Xu 1999), and a triangular smoothing function'. I was a bit unsure whether this means we are not getting the 'raw' data, but perhaps there's no 'raw' data after all.
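To make the smoothing question concrete, here is a generic triangular smoother in Python. It is a textbook weighted moving average, not ProsodyPro's implementation, and it does not reproduce the trimming algorithm from Xu (1999); the name `triangular_smooth` and the `half_width` parameter are assumptions made for this sketch.

```python
def triangular_smooth(values, half_width=2):
    """Weighted moving average with triangular weights.

    The centre frame gets weight half_width + 1, and weights fall off
    linearly to 1 at the edge of the window. Near the ends of the list the
    window is truncated and the weights are renormalised.
    """
    n = len(values)
    out = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for k in range(-half_width, half_width + 1):
            j = i + k
            if 0 <= j < n:
                w = half_width + 1 - abs(k)
                num += w * values[j]
                den += w
        out.append(num / den)
    return out
```

With `half_width=1`, a single-frame spike of 130 Hz in an otherwise flat 100 Hz contour is reduced to 115 Hz and its neighbours rise to 107.5 Hz. Saving the unsmoothed values alongside the smoothed ones is one way to keep the choice reversible.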
Katrina Li
Thank you for introducing the ideas behind Prosogram! I did have a look, the visuals are very cool, but I do need output f0 measurements (and perhaps normalised) for many utterances (with repetitions) to conduct statistical analysis. Is Prosogram able to do that?
On Thu, Nov 17, 2022 at 10:04 AM, Katrina Li wrote:
Do let me know if you end up using that approach and run into any difficulties (I wrote that one). Since I'm responding anyway now, I'll second what Christian has said already: voicing threshold is usually not a problem for reasonable-quality recordings, and using speaker-specific pitch floor and ceiling numbers (in my experience) gives better quality results than trying to find one floor & ceiling that works for all talkers.
Yi Xu
Hi Katrina,
You may find the answers to many of your questions about ProsodyPro in this presentation, e.g., the effects of trimming and smoothing can be seen in slide 20.
Regarding pitch window length, it depends on your purpose. The default setting in ProsodyPro, for example, is not ideal for showing all the detailed consonantal perturbation effects, as discussed in this paper.
Yi
------------------------------
Yi Xu, Ph.D.
Professor of Speech Sciences
Department of Speech, Hearing and Phonetic Sciences
University College London
Chandler House
2 Wakefield Street
London WC1N 1PF
UK
Tel: 020 7679 4082 (internal: 24082)
email: yi.xu@...
http://www.homepages.ucl.ac.uk/~uclyyix/
------------------------------
Piet Mertens
Hi K.L.,
- 5.1 Overview of data files
- 5.2 Table with prosodic features per syllable
- 9.3 Exporting the stylisation to another program
The output file referred to in 5.2 contains a set of several F0 (and other) values per syllable, before and after stylisation. This implies a prior segmentation into syllables (either from a TextGrid tier or automatically by Prosogram). The script can also output F0 values (at a fixed time step) obtained from the stylised pitch contour (not documented in the User's Guide).
By "f0 measurements (and perhaps normalised)", do you mean time-normalised values or normalised pitch values? Prosogram also provides PRNP (pitch range normalised pitch) values, where the value shown (at a given time) is a fraction of the speaker's pitch range (calculated by the script, on semitone values).
Time (duration) is a crucial factor in pitch perception: an F0 change on a sound will be perceived either as a changing pitch or as a flat pitch, depending on the duration of that sound. The presence of a pause also has an impact on perception. Applying time normalisation to sentences from repetitions by the same speaker and/or multiple speakers will result in a comparison of stimuli that have different durations and are hence not necessarily comparable from a perceptual point of view.
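The idea of expressing pitch as a fraction of the speaker's range on a semitone scale, as described for PRNP above, can be illustrated as follows. How Prosogram estimates the bottom and top of the range is not stated in this thread, so in this Python sketch they are simply passed in by the caller; `bottom_hz`, `top_hz` and the function names are hypothetical.

```python
import math

def hz_to_st(hz, ref_hz=1.0):
    """Convert Hz to semitones relative to ref_hz."""
    return 12 * math.log2(hz / ref_hz)

def range_normalised(hz, bottom_hz, top_hz):
    """Position of hz within the speaker's pitch range, computed on semitones.

    0.0 is the bottom of the range and 1.0 the top. bottom_hz and top_hz are
    supplied by the caller (an assumption of this sketch).
    """
    bottom = hz_to_st(bottom_hz)
    top = hz_to_st(top_hz)
    return (hz_to_st(hz) - bottom) / (top - bottom)
```

Because the computation is done on semitones, 200 Hz lies exactly halfway through a 100 to 400 Hz range (0.5), whereas on a linear Hz scale it would lie only a third of the way up.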