Evaluating speaker diarization systems with γ inter-annotator agreement #16

Open
hbredin opened this issue Nov 11, 2020 · 3 comments

@hbredin
hbredin commented Nov 11, 2020

Nice package 👍

I am wondering whether it would make sense to use γ inter-annotator agreement for evaluating speaker diarization systems (in place of good old diarization error rate, aka DER):

  • one annotator would be the (manual) reference annotation,
  • another annotator would be the annotation hypothesized by the (automatic) diarization system.

I understand (maybe incorrectly) that both annotators need to use the same set of speaker labels.

How would you handle the case where both annotators use different sets of labels?
Would you need to match them first (like what is already done in DER)?

How would you choose (temporal) alpha and (categorical) beta weights?

@hadware
Collaborator

hadware commented Nov 12, 2020

Thanks sensei!

It would indeed make sense, and we've thought about it a lot! We're just not entirely sure yet...

  • Regarding the reference/hypothesis problem, we already have a very experimental solution implemented (not documented yet, because it hasn't really been tested or proven to work correctly). The idea is to mark one (or more) annotators as "ground truth" when computing γ, which tells pygamma-agreement to sample random annotations only from these ground-truth annotators.

  • Regarding the case where annotators use different sets of labels, we would indeed have to match them first (in the same way as DER does it). As a side note, it would still work pretty nicely with the current CombinedCategoricalDissimilarity implementation, using the cat_dissimilarity_matrix.

  • Regarding alpha and beta, I'm leaving it to @Rachine 🤔
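A minimal sketch of the "match them first" step, assuming annotations are plain `(start, end, label)` tuples. It brute-forces every one-to-one mapping between hypothesis and reference labels and keeps the one with the largest total overlap duration. `overlap` and `best_label_mapping` are hypothetical helper names, not part of pygamma-agreement, and the exhaustive search is only practical for a handful of speakers.

```python
from itertools import permutations


def overlap(a, b):
    """Duration of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def best_label_mapping(reference, hypothesis):
    """Exhaustively search for the hypothesis -> reference label mapping
    that maximizes the total co-occurrence duration.

    Both arguments are lists of (start, end, label) tuples.
    """
    ref_labels = sorted({label for _, _, label in reference})
    hyp_labels = sorted({label for _, _, label in hypothesis})

    # cooccurrence[(hyp_label, ref_label)] = total overlapping duration
    cooccurrence = {}
    for r_start, r_end, r_label in reference:
        for h_start, h_end, h_label in hypothesis:
            key = (h_label, r_label)
            shared = overlap((r_start, r_end), (h_start, h_end))
            cooccurrence[key] = cooccurrence.get(key, 0.0) + shared

    # Pad the shorter side with None so that extra labels can stay unmatched.
    size = max(len(ref_labels), len(hyp_labels))
    padded_ref = ref_labels + [None] * (size - len(ref_labels))
    padded_hyp = hyp_labels + [None] * (size - len(hyp_labels))

    best_score, best_mapping = -1.0, {}
    for perm in permutations(padded_ref):
        pairs = list(zip(padded_hyp, perm))
        score = sum(cooccurrence.get(pair, 0.0) for pair in pairs)
        if score > best_score:
            best_score = score
            best_mapping = {
                h: r for h, r in pairs if h is not None and r is not None
            }
    return best_mapping, best_score


reference = [(0.0, 5.0, "alice"), (5.0, 10.0, "bob")]
hypothesis = [(0.0, 4.5, "spk1"), (4.5, 10.0, "spk0")]
mapping, score = best_label_mapping(reference, hypothesis)
print(mapping, score)
```

The hypothesis labels can then be renamed with `mapping` before both annotations are handed to the agreement measure. For larger label sets, the same assignment problem is usually solved with the Hungarian algorithm rather than by enumeration.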

@Rachine
Collaborator

Rachine commented Nov 12, 2020

Hello Hervé!
Thank you! This is a question we want to explore, and one we discuss a lot!

We tried to apply γ as a replacement for IER, and the behaviors were not consistent at all. I think the frameworks are very similar, but there are differences to take into account. I think γ has some limitations and needs adaptations.

  • To use γ as a metric, we do not want its chance correction to use the hypothesized annotations. In other words, we do not want the chance term to vary across diarization systems. There is an option to specify this: https://github.com/bootphon/pygamma-agreement/blob/master/pygamma_agreement/continuum.py#L173
  • alpha/beta are parameters to set manually. For instance, if we have two segments of one second each in the two annotations, setting alpha=beta=1 means that we give the same weight to displacing one of the 1 s segments as to making a category mistake.
  • There is one main difference with classic DER: the splitting of units. In DER, splitting a speech unit is not penalized at all (except for missed detection). Yet, since γ finds an alignment, diarization systems are penalized a lot for it. We can extend γ to take multiple alignment paths into account (the hard way), or diarization systems need to have the same biases as humans to improve their γ (the second hard way).
  • For the assignment problem in diarization, if you do not have many classes, I imagine you can go through all the possible matchings and take the smallest gamma?
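To make the alpha/beta trade-off concrete, here is a minimal sketch of a combined dissimilarity, written from my reading of the γ definition (squared relative positional shift plus a 0/1 category mismatch, each scaled by the cost of an unaligned unit). The function names and the `delta_empty` parameter are assumptions for illustration; this is not the pygamma-agreement implementation.

```python
def positional_dissimilarity(u, v, delta_empty=1.0):
    """Squared sum of boundary shifts, relative to the summed durations."""
    (start_u, end_u), (start_v, end_v) = u, v
    shift = abs(start_u - start_v) + abs(end_u - end_v)
    total_duration = (end_u - start_u) + (end_v - start_v)
    return (shift / total_duration) ** 2 * delta_empty


def categorical_dissimilarity(cat_u, cat_v, delta_empty=1.0):
    """0 when the categories agree, delta_empty when they differ."""
    return (0.0 if cat_u == cat_v else 1.0) * delta_empty


def combined_dissimilarity(u, v, alpha=1.0, beta=1.0, delta_empty=1.0):
    """Weighted sum of both terms; u and v are (start, end, category)."""
    positional = positional_dissimilarity(u[:2], v[:2], delta_empty)
    categorical = categorical_dissimilarity(u[2], v[2], delta_empty)
    return alpha * positional + beta * categorical


unit = (0.0, 1.0, "A")
shifted = (1.0, 2.0, "A")    # same category, displaced by its full duration
relabeled = (0.0, 1.0, "B")  # same position, different category

# With alpha = beta = 1, both mistakes cost the same.
print(combined_dissimilarity(unit, shifted))
print(combined_dissimilarity(unit, relabeled))

# Raising alpha makes the temporal mistake more expensive than the label one.
print(combined_dissimilarity(unit, shifted, alpha=2.0, beta=1.0))
```

Under these assumptions, a one-second segment displaced by one second and a one-second segment with the wrong category both get a dissimilarity of `delta_empty` when alpha=beta=1, which is the equal-weight situation described above.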

@hbredin
Author

hbredin commented Nov 17, 2020

Thank you both for your detailed answers.

To summarize my understanding: using this metric for speaker diarization is not that obvious and remains an open research question.

Thinking out loud: maybe its use for combining multiple speaker diarization systems would be something to look at, as well (in the same spirit as in https://github.com/desh2608/dover-lap/ by @desh2608)
