#### Inter-rater reliability on start/end times

I'm looking for a way to calculate the inter-rater reliability of the start/end times segmented by two independent coders using Praat TextGrids. For the segment labels (e.g., vocalization, pause, etc.), I calculated Cohen's kappa, but I don't think I can use that for the start/end times because they are ratio variables. I'm not looking for exact agreement between the two coders, since it's almost impossible for our segmentations to align down to the decimals. However, I'm wondering whether we are consistently setting our start/end times roughly within a 0.5-1 second window. What is the best method to test for reliability, also taking chance agreement into consideration?

Any help would be much appreciated!

Best regards,

Eun Ae

PhD Candidate

Speech and Hearing Sciences Dept.

University of Washington



**Paul Boersma**

**Professor of Phonetic Sciences**

**University of Amsterdam**

Spuistraat 134, room 632

1012VB Amsterdam, The Netherlands

http://www.fon.hum.uva.nl/paul/

Start and end times are interval variables (their differences are meaningful) but not ratio variables (a global time shift, which is irrelevant, would change their ratio).

For continuous variables like times you indeed wouldn't use discrete-category-related reliability measures, i.e. nothing with "class" in the name. Instead, you could measure the standard deviation of the annotated times across raters.


On top of Paul's proposal, in order to visualise differences
among raters, I would draw and compare histograms of intervals
(end time - start time differences) with mean, median and standard
deviations marked on those histograms.

Bogdan Rozborski, ArsDigita

*Proceedings of Meetings on Acoustics* 2013.
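The histogram comparison suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `rater_intervals` data, the rater names, and the 0.25 s bin width are hypothetical, and the text histogram stands in for whatever plotting tool is actually used.

```python
import statistics
from collections import Counter

# Hypothetical (start, end) times in seconds for the same recording,
# one list of intervals per rater. Not taken from the thread.
rater_intervals = {
    "rater_A": [(0.00, 1.20), (1.20, 2.85), (2.85, 4.05), (4.05, 4.90)],
    "rater_B": [(0.05, 1.25), (1.25, 2.80), (2.80, 4.10), (4.10, 5.00)],
}

def durations(intervals):
    """Interval durations: end time minus start time."""
    return [end - start for start, end in intervals]

def histogram(values, bin_width=0.25):
    """Count values per bin; keys are the lower bin edges in seconds."""
    counts = Counter(int(v // bin_width) for v in values)
    return {round(i * bin_width, 10): counts[i] for i in sorted(counts)}

for name, intervals in rater_intervals.items():
    d = durations(intervals)
    # Summary statistics to mark on (or print next to) each histogram.
    print(name,
          f"mean={statistics.mean(d):.3f}",
          f"median={statistics.median(d):.3f}",
          f"sd={statistics.stdev(d):.3f}")
    for edge, count in histogram(d).items():
        print(f"  {edge:4.2f} s | {'#' * count}")
```

Comparing the two printed distributions side by side shows whether one rater tends to draw systematically longer or shorter intervals than the other; it does not by itself show whether individual boundaries agree.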


On 15 Feb 2023, at 00:44, eachoi89 via groups.io <eachoi89@...> wrote:

Thank you for your response, Dr. Boersma! To find the standard deviation of the annotated times across two raters, would each rater need to rate a reference audio snippet multiple times to establish the standard deviations? And from there, should I run an ANOVA to make sure that the differences between the annotated times are not significant?

**Paul Boersma**

Your question was about inter-rater reliability, i.e. about variability *between* raters, not about variability *within* raters (although that could be measured as well).

To measure variability between raters, having one value per rater (per annotated time) would be enough. If you have 10 raters, you would have 10 values per annotated time, and those 10 values have one observed mean (the sum of the 10 values, divided by 10) and one observed standard deviation (take each value minus the observed mean, square these 10 differences, sum them, divide the sum by 9, and take the square root).

No p-value test can be done on these 10 values, because there is no null hypothesis to compare the values to. (Besides, in general, in the realm of p-value testing, including ANOVAs, finding no statistical significance for an effect doesn't provide any evidence at all that the effect doesn't exist.)

If you have N points in time, take the root-mean-square of the N standard deviations to obtain a single estimate for the between-rater standard deviation. That's something in milliseconds, probably.
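The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the boundaries of all raters have already been matched up one-to-one; the `boundary_times` values and the function name `between_rater_sd` are hypothetical.

```python
import math
import statistics

# Hypothetical data: each row is one annotated time point (boundary),
# each column is one rater's value for it, in seconds.
boundary_times = [
    [1.20, 1.25, 1.18],
    [2.87, 2.80, 2.91],
    [4.02, 4.10, 4.05],
]

def between_rater_sd(times_by_boundary):
    """Root-mean-square of the per-boundary standard deviations.

    statistics.stdev uses the n - 1 denominator, i.e. it divides by 9
    when there are 10 raters.
    """
    sds = [statistics.stdev(row) for row in times_by_boundary]
    return math.sqrt(sum(sd ** 2 for sd in sds) / len(sds))

pooled = between_rater_sd(boundary_times)
print(f"between-rater SD: {pooled * 1000:.1f} ms")
```

With only two raters, each per-boundary standard deviation reduces to the absolute difference between the two times divided by the square root of 2, so the same function applies unchanged.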


Well, I wouldn't rule out using a class/group reliability measure.
Let's assume there are N raters (each treated as a separate
group). Each rater produces a sequence of time points (the data
within a group), which in turn define some intervals. Let's
further assume that it's not only interval length that matters,
but also the intervals' absolute position on the time axis.
Furthermore, we may allow raters to skip some intervals, so that
the number of data points may differ from rater to rater. It seems
correct to use interclass correlation as a measure of similarity
among all raters.

Bogdan Rozborski.
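One possible reading of this suggestion is the intraclass correlation coefficient (ICC). That reading is an assumption, as is the choice of the one-way random-effects form, ICC(1,1), in the sketch below. The sketch also assumes complete data, with every rater marking every boundary, so it does not cover the case where raters skip intervals. The data and the name `icc_oneway` are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    Rows are annotated boundaries (targets), columns are raters.
    Assumes every row has the same number of raters (no skipped boundaries).
    """
    n = len(ratings)       # number of boundaries
    k = len(ratings[0])    # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Mean square between boundaries and mean square within boundaries.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical boundary times in seconds: rows are boundaries, columns are raters.
times = [
    [1.20, 1.25],
    [2.87, 2.80],
    [4.02, 4.10],
    [4.95, 5.00],
]
print(f"ICC(1,1) = {icc_oneway(times):.4f}")
```

Because boundary times are spread over the whole recording, the variance between boundaries is usually far larger than any disagreement between raters, so a coefficient computed on raw times tends to sit very close to 1. It is therefore worth reporting alongside an absolute measure in seconds or milliseconds rather than on its own.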