Inter-rater reliability on start/end times


eachoi89@...
 

Hello,

I'm looking for a way to calculate the inter-rater reliability of the start/end times segmented by two independent coders using Praat TextGrids. For the segment labels (e.g., vocalization, pause, etc.), I calculated Cohen's kappa, but I don't think I can use that for the start/end times because they are ratio variables. I'm not looking for exact agreement between the two coders, since it's almost impossible for our segmentations to align down to the decimal. However, I'm wondering whether we are consistently setting our start/end times roughly within a 0.5-1 second window of each other. What is the best method to test for reliability, also taking chance agreement into consideration?

Any help would be much appreciated!

Best regards,
Eun Ae

PhD Candidate
Speech and Hearing Sciences Dept.
University of Washington 


Stefan Werner
 

Hi Eun Ae,

Would the intraclass correlation coefficient (ICC) be useful for you? It's available in R, for example, via the irr package and its icc function.

Best,
Stefan

---
Stefan Werner, PhD
Joensuu, Finland
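
A minimal sketch of this suggestion in R, assuming the irr package is installed; the boundary times below are hypothetical (one row per boundary, one column per coder, in seconds):

# ICC for boundary times from two coders
library(irr)

times <- cbind(coder1 = c(0.52, 1.87, 3.10, 4.64),
               coder2 = c(0.49, 1.95, 3.02, 4.71))

# Two-way model, absolute agreement, single-rater ICC, as one reasonable choice
icc(times, model = "twoway", type = "agreement", unit = "single")

The agreement-type ICC penalizes systematic offsets between coders as well as random scatter, which seems closer to the question of whether both coders' boundaries land within the same 0.5-1 second window.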



Boersma Paul
 

Start and end times are interval variables (their differences are meaningful) but not ratio variables (a global time shift, which is irrelevant, would change their ratio).

For continuous variables like times, you indeed wouldn't use discrete-category-related reliability measures, i.e. nothing with "class" in the name. Instead, you could measure the standard deviation of the annotated times across raters.


_____

Paul Boersma
Professor of Phonetic Sciences
University of Amsterdam
Spuistraat 134, room 632
1012VB Amsterdam, The Netherlands
http://www.fon.hum.uva.nl/paul/
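
A minimal sketch of this idea in R; the two columns of boundary times are hypothetical:

# Hypothetical boundary times (seconds): one row per annotated boundary, one column per rater
times <- cbind(rater1 = c(0.52, 1.87, 3.10, 4.64),
               rater2 = c(0.49, 1.95, 3.02, 4.71))

# Standard deviation of the annotated times across raters, separately for each boundary
apply(times, 1, sd)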


eachoi89@...
 

Thank you for your response, Dr. Boersma! To find the standard deviation of the annotated times across two raters, would each rater need to rate a reference audio snippet multiple times to establish the standard deviations? And from there, should I run an ANOVA to make sure that the differences between the annotated times are not significant?


Bogdan Rozborski
 


On top of Paul's proposal, in order to visualise differences among raters, I would draw and compare histograms of the interval durations (end time minus start time) with the mean, median, and standard deviation marked on those histograms.

Bogdan Rozborski, ArsDigita
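
A small base-R sketch of this visualisation; the interval durations below are hypothetical:

# Hypothetical interval durations (end - start, in seconds) for two raters
durations <- list(rater1 = c(0.8, 1.2, 0.6, 1.5, 0.9),
                  rater2 = c(0.7, 1.3, 0.5, 1.6, 1.0))

par(mfrow = c(1, 2))
for (name in names(durations)) {
  d <- durations[[name]]
  hist(d, xlab = "interval duration (s)",
       main = sprintf("%s: mean %.2f, median %.2f, sd %.2f",
                      name, mean(d), median(d), sd(d)))
  abline(v = mean(d), lwd = 2)      # mean
  abline(v = median(d), lty = 2)    # median
}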


Mark Liberman
 

For another approach to this sort of question, see "Automating phonetic measurement: The case of voice onset time", Proceedings of Meetings on Acoustics, 2013.





Boersma Paul
 

Your question was about inter-rater reliability, i.e. about variability *between* raters, not about variability *within* raters (although that could be measured as well).

To measure variability between raters, having one value per rater (per annotated time) would be enough. If you have 10 raters, you would have 10 values per annotated time, and those 10 values have one observed mean (the sum of the 10 values, divided by 10) and one observed standard deviation (the square root of { the sum of the 10 squared deviations from that mean, divided by 9 }).

No p-value test can be done on these 10 values, because there is no null hypothesis to compare them to. (Besides, in general, in the realm of p-value testing, including ANOVAs, finding no statistical significance for an effect doesn't provide any evidence at all that the effect doesn't exist.)

If you have N points in time, take the root-mean-square of the N standard deviations to obtain a single estimate of the between-rater standard deviation. That's something in milliseconds, probably.

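A minimal sketch of this computation in R, reusing the same kind of hypothetical times matrix as above (one row per boundary, one column per rater, times in seconds):

times <- cbind(rater1 = c(0.52, 1.87, 3.10, 4.64),
               rater2 = c(0.49, 1.95, 3.02, 4.71))

# One observed standard deviation per boundary, across raters
# (sd() uses the n - 1 denominator, i.e. 2 - 1 = 1 here)
sd_per_boundary <- apply(times, 1, sd)

# Root-mean-square of the N per-boundary SDs: one overall between-rater SD
pooled_sd <- sqrt(mean(sd_per_boundary^2))
pooled_sd * 1000   # expressed in milliseconds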


Bogdan Rozborski
 


Well, I wouldn't rule out using a class/group reliability measure. Let's assume there are N raters (each treated as a separate group). Now, each rater produces a sequence of time points (the data within a group), which in turn define some intervals. Let's further assume that it's not only the interval lengths that matter, but also their absolute positions on the time axis. Furthermore, we may allow raters to skip some intervals, so that the number of data points may differ from rater to rater. It seems correct to use the intraclass correlation as a measure of similarity among all raters.

Bogdan Rozborski.
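
One hedged way to get an intraclass correlation when raters may skip boundaries is to estimate it from variance components; the lme4 package, the long-format layout, and the data below are illustrative assumptions, not part of the suggestion above, and the sketch groups the observations by boundary rather than by rater:

library(lme4)

# Hypothetical long-format data: one row per annotated boundary time;
# rater 2 skipped boundary 3, so the raters have different numbers of points
d <- data.frame(boundary = factor(c(1, 2, 3, 4, 1, 2, 4)),
                rater    = factor(c(1, 1, 1, 1, 2, 2, 2)),
                time     = c(0.52, 1.87, 3.10, 4.64, 0.49, 1.95, 4.71))

# One-way random-effects model with boundaries as the grouping factor
m <- lmer(time ~ 1 + (1 | boundary), data = d)
vc <- as.data.frame(VarCorr(m))

# One-way ICC: share of total variance due to boundaries
# rather than to rater disagreement and noise
vc$vcov[vc$grp == "boundary"] / sum(vc$vcov)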


eachoi89@...
 

Thank you, everyone, for your recommendations. Much appreciated!!!