diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..e645833 --- /dev/null +++ b/.gitignore @@ -0,0 +1,2 @@ +*~ +*.pyc \ No newline at end of file diff --git a/README.md b/README.md index 0d410dd..9f8b16c 100644 --- a/README.md +++ b/README.md @@ -1 +1,257 @@ -Please check back soon for the scoring tools for the [First DIHARD Challenge](https://coml.lscp.ens.fr/dihard/). +I. Overview +========= +This suite supports evaluation of diarization system output relative +to a reference diarization subject to the following conditions: + +- both the reference and system diarizations are saved within [Rich Transcription Time Marked (RTTM)](#rttm) files +- for any pair of recordings, the sets of speakers are disjoint + + +II. Dependencies +========== +The following Python packages are required to run this software: + +- Python >= 2.7.1 (https://www.python.org/) +- NumPy >= 1.6.1 (https://github.com/numpy/numpy) +- SciPy >= 0.10.0 (https://github.com/scipy/scipy) +- intervaltree >= 2.1.0 (https://pypi.python.org/pypi/intervaltree) +- tabulate >= 0.5.0 (https://pypi.python.org/pypi/tabulate) + + +III. Metrics +====== +Diarization error rate +--------------------------- +Following tradition in this area, we report diarization error rate (DER), which +is the sum of + +- speaker error -- percentage of scored time for which the wrong speaker id + is assigned within a speech region +- false alarm speech -- percentage of scored time for which a nonspeech + region is incorrectly marked as containing speech +- missed speech -- percentage of scored time for which a speech region is + incorrectly marked as not containing speech + +As with word error rate, a score of zero indicates perfect performance and +higher scores (which may exceed 100) indicate poorer performance. 
For more +details, consult section 6.1 of the [NIST RT-09 evaluation plan](https://web.archive.org/web/20100606041157if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf). + + +Clustering metrics +--------------------------------- +An alternate approach to system evaluation is to convert both the reference and +system outputs to frame-level labels, then evaluate using one of many +well-known approaches for evaluating clustering performance. Each recording +is converted to a sequence of 10 ms frames, each of which is assigned a single +label corresponding to one of the following cases: + +- the frame contains no speech +- the frame contains speech from a single speaker (one label per speaker + identified) +- the frame contains overlapping speech (one label for each element in the + powerset of speakers) + +These frame-level labelings are then scored with the following metrics: + +### Goodman-Kruskal tau +Goodman-Kruskal tau is an asymmetric association measure dating back to work +by Leo Goodman and William Kruskal in the 1950s (Goodman and Kruskal, 1954). +For a reference labeling ``ref`` and a system labeling ``sys``, +``GKT(ref, sys)`` corresponds to the fraction of variability in ``sys`` that +can be explained by ``ref``. Consequently, ``GKT(ref, sys)`` is 1 when ``ref`` +is perfectly predictive of ``sys`` and 0 when it is not predictive at all. +Correspondingly, ``GKT(sys, ref)`` is 1 when ``sys`` is perfectly predictive +of ``ref`` and 0 when it has no predictive power. + +### B-cubed precision, recall, and F1 +The B-cubed precision for a single frame assigned speaker ``S`` in the +reference diarization and ``C`` in the system diarization is the proportion of +frames assigned ``C`` that are also assigned ``S``. Similarly, the B-cubed +recall for a frame is the proportion of all frames assigned ``S`` that are +also assigned ``C``. 
The overall precision and recall, then, are just the mean +of the frame-level precision and recall measures and the overall F1 their +harmonic mean. For additional details see Bagga and Baldwin (1998). + +### Information theoretic measures +We report four information theoretic measures: + +- ``H(ref|sys)`` -- conditional entropy in bits of the reference + labeling given the system labeling +- ``H(sys|ref)`` -- conditional entropy in bits of the system + labeling given the reference labeling +- ``MI`` -- mutual information in bits between the reference and system + labelings +- ``NMI`` -- normalized mutual information between the reference and system + labelings; that is, ``MI`` scaled to the interval [0, 1]. In this case, the + normalization term used is ``sqrt(H(ref)*H(sys))``. + +``H(ref|sys)`` is the number of bits needed to describe the reference +labeling given that the system labeling is known and ranges from 0 in +the case that the system labeling is perfectly predictive of the reference +labeling to ``H(ref)`` in the case that the system labeling is not at +all predictive of the reference labeling. Similarly, ``H(sys|ref)`` measures +the number of bits required to describe the system labeling given the +reference labeling and ranges from 0 to ``H(sys)``. + +``MI`` is the number of bits shared by the reference and system labeling and +indicates the degree to which knowing either reduces uncertainty in the other. +It is related to conditional entropy and entropy as follows: +``MI(ref, sys) = H(ref) - H(ref|sys) = H(sys) - H(sys|ref)``. ``NMI`` is +derived from ``MI`` by normalizing it to the interval [0, 1]. Multiple +normalizations are possible depending on the upper-bound for ``MI`` that is +used, but we report ``NMI`` normalized by ``sqrt(H(ref)*H(sys))``. + + +IV. Scoring +====== +To evaluate system output stored in [RTTM](#rttm) files ``sys1.rttm``, +``sys2.rttm``, ... 
against a corresponding reference diarization stored in RTTM +files ``ref1.rttm``, ``ref2.rttm``, ...: + + python score.py -r ref1.rttm ref2.rttm ... -s sys1.rttm sys2.rttm ... + + which will calculate and report the following metrics both overall and on + a per-file basis: + +- ``DER`` -- diarization error rate +- ``B3-Precision`` -- B-cubed precision +- ``B3-Recall`` -- B-cubed recall +- ``B3-F1`` -- B-cubed F1 +- ``GKT(ref, sys)`` -- Goodman-Kruskal tau in the direction of the reference + diarization to the system diarization +- ``GKT(sys, ref)`` -- Goodman-Kruskal tau in the direction of the system + diarization to the reference diarization +- ``H(ref|sys)`` -- conditional entropy in bits of the reference diarization + given the system diarization +- ``H(sys|ref)`` -- conditional entropy in bits of the system diarization + given the reference diarization +- ``MI`` -- mutual information in bits +- ``NMI`` -- normalized mutual information + +Alternately, we could have specified the reference and system RTTM files via +script files of paths (one per line) using the ``-R`` and ``-S`` flags: + + python score.py -R ref.scp -S sys.scp + +By default the scoring regions for each file will be determined automatically +from the reference and system turns. However, it is possible to specify +explicit scoring regions using a NIST [un-partitioned evaluation map (UEM)](#uem) file and the ``-u`` flag. For instance, the following: + + python score.py -u all.uem -R ref.scp -S sys.scp + +will load the files to be scored plus scoring regions from ``all.uem``, filter +out and warn about any speaker turns not present in those files, and trim the +remaining turns to the relevant scoring regions before computing the metrics +as before. + +DER is scored using the NIST ``md-eval.pl`` tool with +a default collar size of 0 ms and explicitly including regions that contain +overlapping speech in the reference diarization. 
If desired, this behavior +can be altered using the ``--collar`` and ``--ignore_overlaps`` flags. For +instance: + + python score.py --collar 0.100 --ignore_overlaps -R ref.scp -S sys.scp + +would compute DER using a 100 ms collar and with overlapped speech ignored. +All other metrics are computed off of frame-level labelings generated from the +reference and system speaker turns **WITHOUT** any use of collars. The default +frame step is 10 ms, which may be altered via the ``--step`` flag. For more +details, consult the docstrings within the ``scorelib.metrics`` module. + +The overall and per-file results will be printed to STDOUT as a table; for instance: + + File DER B3-Precision B3-Recall B3-F1 GKT(ref, sys) GKT(sys, ref) H(ref|sys) H(sys|ref) MI NMI + --------------------------- ----- -------------- ----------- ------- --------------- --------------- ------------ ------------ ---- ----- + CMU_20020319-1400_d01_NONE 6.10 0.91 1.00 0.95 1.00 0.88 0.22 0.00 2.66 0.96 + ICSI_20000807-1000_d05_NONE 17.37 0.72 1.00 0.84 1.00 0.68 0.65 0.00 2.79 0.90 + ICSI_20011030-1030_d02_NONE 13.06 0.80 0.95 0.87 0.95 0.80 0.54 0.11 5.10 0.94 + LDC_20011116-1400_d06_NONE 5.64 0.95 0.89 0.92 0.85 0.93 0.10 0.27 1.87 0.91 + LDC_20011116-1500_d07_NONE 1.69 0.96 0.96 0.96 0.95 0.95 0.14 0.12 2.39 0.95 + NIST_20020305-1007_d01_NONE 42.05 0.51 0.95 0.66 0.93 0.44 1.58 0.11 2.13 0.74 + *** OVERALL *** 14.31 0.81 0.96 0.88 0.96 0.80 0.55 0.10 5.45 0.94 + +Some basic control of the formatting of this table is possible via the ``--n_digits`` and +``--table_fmt`` flags. The former controls the number of decimal places printed for floating +point numbers, while the latter controls the table format. For a list of valid table formats plus example +outputs, consult the [documentation](https://pypi.python.org/pypi/tabulate) for the ``tabulate`` package. + +For additional details consult the docstring of ``score.py``. + + +V. 
File formats +======== +RTTM +------- +Rich Transcription Time Marked (RTTM) files are space-delimited text files containing one turn per line, each line containing ten fields: + +- ``Type`` -- segment type; should always be ``SPEAKER`` +- ``File ID`` -- file name; basename of the recording minus extension (e.g., + ``rec1_a``) +- ``Channel ID`` -- channel (1-indexed) that turn is on; should always be + ``1`` +- ``Turn Onset`` -- onset of turn in seconds from beginning of recording +- ``Turn Duration`` -- duration of turn in seconds +- ``Orthography Field`` -- should always be ``<NA>`` +- ``Speaker Type`` -- should always be ``<NA>`` +- ``Speaker Name`` -- name of speaker of turn; should be unique within scope + of each file +- ``Confidence Score`` -- system confidence (probability) that information + is correct; should always be ``<NA>`` +- ``Signal Lookahead Time`` -- should always be ``<NA>`` + +For instance: + + SPEAKER CMU_20020319-1400_d01_NONE 1 130.430000 2.350 <NA> <NA> juliet <NA> <NA> + SPEAKER CMU_20020319-1400_d01_NONE 1 157.610000 3.060 <NA> <NA> tbc <NA> <NA> + SPEAKER CMU_20020319-1400_d01_NONE 1 130.490000 0.450 <NA> <NA> chek <NA> <NA> + +If you would like to confirm that a set of RTTM files is valid, use the +included ``validate_rttm.py`` script. For instance, if you have RTTMs +``fn1.rttm``, ``fn2.rttm``, ..., then + + python validate_rttm.py fn1.rttm fn2.rttm ... + +will iterate over each line of each file and warn on any that do not match the +spec. + +UEM +------ +Un-partitioned evaluation map (UEM) files are used to specify the scoring +regions within each recording. 
For each scoring region, the UEM file contains +a line with the following four space-delimited fields + +- ``File ID`` -- file name; basename of the recording minus extension (e.g., + ``rec1_a``) +- ``Channel ID`` -- channel (1-indexed) that scoring region is on; ignored by + ``score.py`` +- ``Onset`` -- onset of scoring region in seconds from beginning of recording +- ``Offset`` -- offset of scoring region in seconds from beginning of + recording + +For instance: + + CMU_20020319-1400_d01_NONE 1 125.000000 727.090000 + CMU_20020320-1500_d01_NONE 1 111.700000 615.330000 + ICSI_20010208-1430_d05_NONE 1 97.440000 697.290000 + + +VI. References +========= +- Bagga, A. and Baldwin, B. (1998). "Algorithms for scoring coreference + chains." Proceedings of LREC 1998. +- Cover, T.M. and Thomas, J.A. (1991). Elements of Information Theory. +- Goodman, L.A. and Kruskal, W.H. (1954). "Measures of association for + cross classifications." Journal of the American Statistical Association. +- NIST. (2009). The 2009 (RT-09) Rich Transcription Meeting Recognition + Evaluation Plan. https://web.archive.org/web/20100606041157if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf +- Nguyen, X.V., Epps, J., and Bailey, J. (2010). "Information theoretic + measures for clustering comparison: Variants, properties, normalization + and correction for chance." Journal of Machine Learning Research. +- Pearson, R. (2016). GoodmanKruskal: Association Analysis for Categorical + Variables. https://CRAN.R-project.org/package=GoodmanKruskal. +- Rosenberg, A. and Hirschberg, J. (2007). "V-Measure: A conditional + entropy-based external cluster evaluation measure." Proceedings of + EMNLP 2007. +- Strehl, A. and Ghosh, J. (2002). "Cluster ensembles -- A knowledge + reuse framework for combining multiple partitions." Journal of Machine + Learning Research. 
\ No newline at end of file diff --git a/score.py b/score.py new file mode 100755 index 0000000..f9a3d96 --- /dev/null +++ b/score.py @@ -0,0 +1,303 @@ +#!/usr/bin/env python +"""Score diarization system output. + +To evaluate system output stored in RTTM files ``sys1.rttm``, ``sys2.rttm``, +... against a corresponding reference diarization stored in RTTM files +``ref1.rttm``, ``ref2.rttm``, ...: + + python score.py -r ref1.rttm ref2.rttm ... -s sys1.rttm sys2.rttm ... + +which will calculate and report the following metrics both overall and on +a per-file basis: + +- diarization error rate (DER) +- B-cubed precision (B3-Precision) +- B-cubed recall (B3-Recall) +- B-cubed F1 (B3-F1) +- Goodman-Kruskal tau in the direction of the reference diarization to the + system diarization (GKT(ref, sys)) +- Goodman-Kruskal tau in the direction of the system diarization to the + reference diarization (GKT(sys, ref)) +- conditional entropy of the reference diarization given the system + diarization in bits (H(ref|sys)) +- conditional entropy of the system diarization given the reference + diarization in bits (H(sys|ref)) +- mutual information in bits (MI) +- normalized mutual information (NMI) + +Alternately, we could have specified the reference and system RTTM files via +script files of paths (one per line) using the ``-R`` and ``-S`` flags: + + python score.py -R ref.scp -S sys.scp + +By default the scoring regions for each file will be determined automatically +from the reference and system turns. However, it is possible to specify +explicit scoring regions using a NIST un-partitioned evaluation map (UEM) file +and the ``-u`` flag. For instance, the following: + + python score.py -u all.uem -R ref.scp -S sys.scp + +will load the files to be scored + scoring regions from ``all.uem``, filter out +and warn about any speaker turns not present in those files, and trim the +remaining turns to the relevant scoring regions before computing the metrics +as before. 
+ +Diarization error rate (DER) is scored using the NIST ``md-eval.pl`` tool with +a default collar size of 0 ms and explicitly including regions that contain +overlapping speech in the reference diarization. If desired, this behavior +can be altered using the ``--collar`` and ``--ignore_overlaps`` flags. For +instance + + python score.py --collar 0.100 --ignore_overlaps -R ref.scp -S sys.scp + +would compute DER using a 100 ms collar and with overlapped speech ignored. +All other metrics are computed off of frame-level labelings generated from the +reference and system speaker turns **WITHOUT** any use of collars. The default +frame step is 10 ms, which may be altered via the ``--step`` flag. For more +details, consult the docstrings within the ``scorelib.metrics`` module. + +The overall and per-file results will be printed to STDOUT as a table formatted +using the ``tabulate`` package. Some basic control of the formatting of this +table is possible via the ``--n_digits`` and ``--table_fmt`` flags. The +former controls the number of decimal places printed for floating point +numbers, while the latter controls the table format. 
For a list of valid +table formats plus example outputs, consult the documentation for the +``tabulate`` package: + + https://pypi.python.org/pypi/tabulate +""" +from __future__ import print_function +from __future__ import unicode_literals +import argparse +import os +import sys + +from tabulate import tabulate + +from scorelib import __version__ as VERSION +from scorelib.argparse import ArgumentParser +from scorelib.rttm import load_rttm +from scorelib.turn import merge_turns, trim_turns +from scorelib.score import score +from scorelib.six import iterkeys +from scorelib.uem import gen_uem, load_uem +from scorelib.utils import error, info, warn, xor + + +class RefRTTMAction(argparse.Action): + def __call__(self, parser, namespace, values, option_string=None): + setattr(namespace, self.dest, values) + if not xor(namespace.ref_rttm_fns, namespace.ref_rttm_scpf): + parser.error('Exactly one of -r and -R must be set.') + + +class SysRTTMAction(argparse.Action): + def __call__(self, parser, namespace, values, option_string=None): + setattr(namespace, self.dest, values) + if not xor(namespace.sys_rttm_fns, namespace.sys_rttm_scpf): + parser.error('Exactly one of -s and -S must be set.') + + +def load_rttms(rttm_fns): + """Load speaker turns from RTTM files. + + Parameters + ---------- + rttm_fns : list of str + Paths to RTTM files. + + Returns + ------- + turns : list of Turn + Speaker turns. + + file_ids : set + File ids found in ``rttm_fns``. + """ + turns = [] + file_ids = set() + for rttm_fn in rttm_fns: + if not os.path.exists(rttm_fn): + error('Unable to open RTTM file: %s' % rttm_fn) + sys.exit(1) + try: + turns_, _, file_ids_ = load_rttm(rttm_fn) + turns.extend(turns_) + file_ids.update(file_ids_) + except IOError as e: + error('Invalid RTTM file: %s. 
%s' % (rttm_fn, e)) + sys.exit(1) + return turns, file_ids + + +def check_for_empty_files(ref_turns, sys_turns, uem): + """Warn on files in UEM without reference or system turns.""" + ref_file_ids = set([turn.file_id for turn in ref_turns]) + sys_file_ids = set([turn.file_id for turn in sys_turns]) + for file_id in sorted(iterkeys(uem)): + if file_id not in ref_file_ids: + warn('File "%s" missing in reference RTTMs.' % file_id) + if file_id not in sys_file_ids: + warn('File "%s" missing in system RTTMs.' % file_id) + # TODO: Clarify below warnings; this indicates that there are no + # ELIGIBLE reference/system turns. + if not ref_turns: + warn('No reference speaker turns found within UEM scoring regions.') + if not sys_turns: + warn('No system speaker turns found within UEM scoring regions.') + + +def load_script_file(fn): + """Load file names from ``fn``.""" + with open(fn, 'rb') as f: + return [line.decode('utf-8').strip() for line in f] + + +def print_table(file_to_scores, global_scores, n_digits=2, + table_format='simple'): + """Pretty print scores as table. + + Parameters + ---------- + file_to_scores : dict + Mapping from file ids in ``uem`` to ``Scores`` instances. + + global_scores : Scores + Global scores. + + n_digits : int, optional + Number of decimal digits to display. + (Default: 2) + + table_format : str, optional + Table format. Passed to ``tabulate.tabulate``. + (Default: 'simple') + """ + col_names = ['File', + 'DER', # Diarization error rate. + 'B3-Precision', # B-cubed precision. + 'B3-Recall', # B-cubed recall. + 'B3-F1', # B-cubed F1. + 'GKT(ref, sys)', # Goodman-Kruskal tau (ref, sys). + 'GKT(sys, ref)', # Goodman-Kruskal tau (sys, ref). + 'H(ref|sys)', # Conditional entropy of ref given sys. + 'H(sys|ref)', # Conditional entropy of sys given ref. + 'MI', # Mutual information. + 'NMI', # Normalized mutual information. 
+ ] + rows = [] + for file_id in sorted(iterkeys(file_to_scores)): + scores = file_to_scores[file_id] + row = [file_id, scores.der, scores.bcubed_precision, + scores.bcubed_recall, scores.bcubed_f1, scores.tau_ref_sys, + scores.tau_sys_ref, scores.ce_ref_sys, scores.ce_sys_ref, + scores.mi, scores.nmi] + rows.append(row) + rows.append(['*** OVERALL ***', global_scores.der, global_scores.bcubed_precision, + global_scores.bcubed_recall, global_scores.bcubed_f1, + global_scores.tau_ref_sys, global_scores.tau_sys_ref, + global_scores.ce_ref_sys, global_scores.ce_sys_ref, + global_scores.mi, global_scores.nmi]) + floatfmt = '.%df' % n_digits + tbl = tabulate( + rows, headers=col_names, floatfmt=floatfmt, tablefmt=table_format) + print(tbl) + + + +if __name__ == '__main__': + # Parse command line arguments. + parser = ArgumentParser( + description='Score diarization from RTTM files.', add_help=True, + usage='%(prog)s [options]') + parser.add_argument( + '-r', nargs='+', default=[], metavar='STR', dest='ref_rttm_fns', + action=RefRTTMAction, + help='reference RTTM files (default: %(default)s)') + parser.add_argument( + '-R', nargs=None, metavar='STR', dest='ref_rttm_scpf', + action=RefRTTMAction, + help='reference RTTM script file (default: %(default)s)') + parser.add_argument( + '-s', nargs='+', default=[], metavar='STR', dest='sys_rttm_fns', + action=SysRTTMAction, + help='system RTTM files (default: %(default)s)') + parser.add_argument( + '-S', nargs=None, metavar='STR', dest='sys_rttm_scpf', + action=SysRTTMAction, + help='system RTTM script file (default: %(default)s)') + parser.add_argument( + '-u', '--uem', nargs=None, metavar='STR', dest='uemf', + help='un-partitioned evaluation map file (default: %(default)s)') + parser.add_argument( + '--collar', nargs=None, default=0.0, type=float, metavar='FLOAT', + help='collar size in seconds for DER computation ' + '(default: %(default)s)') + parser.add_argument( + '--ignore_overlaps', action='store_true', default=False, + 
help='ignore overlaps when computing DER') + parser.add_argument( + '--step', nargs=None, default=0.010, type=float, metavar='FLOAT', + help='step size in seconds (default: %(default)s)') + parser.add_argument( + '--n_digits', nargs=None, default=2, type=int, metavar='INT', + help='number of decimal places to print (default: %(default)s)') + parser.add_argument( + '--table_fmt', nargs=None, dest='table_format', default='simple', + metavar='STR', + help='tabulate table format (default: %(default)s)') + parser.add_argument( + '--version', action='version', + version='%(prog)s ' + VERSION) + if len(sys.argv) == 1: + parser.print_help() + sys.exit(1) + args = parser.parse_args() + + # Check that at least one reference RTTM and at least one system RTTM + # was specified. + if args.ref_rttm_scpf is not None: + args.ref_rttm_fns = load_script_file(args.ref_rttm_scpf) + if args.sys_rttm_scpf is not None: + args.sys_rttm_fns = load_script_file(args.sys_rttm_scpf) + if len(args.ref_rttm_fns) < 1: + error('No reference RTTMs specified.') + sys.exit(1) + if len(args.sys_rttm_fns) < 1: + error('No system RTTMs specified.') + sys.exit(1) + + # Load reference/system speaker turns and UEM. If no UEM specified, + # determine it automatically. + info('Loading speaker turns from reference RTTMs...', file=sys.stderr) + ref_turns, ref_file_ids = load_rttms(args.ref_rttm_fns) + info('Loading speaker turns from system RTTMs...', file=sys.stderr) + sys_turns, sys_file_ids = load_rttms(args.sys_rttm_fns) + if args.uemf is not None: + info('Loading un-partitioned evaluation map...', file=sys.stderr) + uem = load_uem(args.uemf) + else: + warn('No un-partitioned evaluation map specified. Approximating from ' + 'reference and system turn extents...') + uem = gen_uem(ref_turns, sys_turns) + + # Trim turns to UEM scoring regions and merge any that overlap. 
+ info('Trimming reference speaker turns to UEM scoring regions...', + file=sys.stderr) + ref_turns = trim_turns(ref_turns, uem) + info('Trimming system speaker turns to UEM scoring regions...', + file=sys.stderr) + sys_turns = trim_turns(sys_turns, uem) + info('Checking for overlapping reference speaker turns...', + file=sys.stderr) + ref_turns = merge_turns(ref_turns) + info('Checking for overlapping system speaker turns...', + file=sys.stderr) + sys_turns = merge_turns(sys_turns) + + # Score. + check_for_empty_files(ref_turns, sys_turns, uem) + file_to_scores, global_scores = score( + ref_turns, sys_turns, uem, args.collar, args.ignore_overlaps, + args.step) + print_table(file_to_scores, global_scores, args.n_digits, args.table_format) diff --git a/scorelib/__init__.py b/scorelib/__init__.py new file mode 100644 index 0000000..4ecf640 --- /dev/null +++ b/scorelib/__init__.py @@ -0,0 +1,2 @@ +"""Diarization system scoring.""" +__version__ = '1.0.0' diff --git a/scorelib/argparse.py b/scorelib/argparse.py new file mode 100644 index 0000000..909c371 --- /dev/null +++ b/scorelib/argparse.py @@ -0,0 +1,17 @@ +"""Custom argument parser and action classes.""" +from __future__ import absolute_import +from __future__ import print_function +from __future__ import unicode_literals +import argparse +import sys + + +__all__ = ['ArgumentParser'] + + +class ArgumentParser(argparse.ArgumentParser): + """Sub-class of ``ArgumentParser`` that writes errors to STDERR.""" + def error(self, message): + sys.stderr.write('error: %s\n' % message) + self.print_help() + sys.exit(2) diff --git a/scorelib/md-eval-22.pl b/scorelib/md-eval-22.pl new file mode 100755 index 0000000..27b7bc9 --- /dev/null +++ b/scorelib/md-eval-22.pl @@ -0,0 +1,2906 @@ +#!/usr/bin/perl -w +use strict; + +my $version = "22"; + +################################# +# History: +# +# version 22: * JGF: added an option '-m FILE' to hold a CSV speaker map file. 
+# +# version 21: * JGF: added a flag '-n' to not remove the directory paths from the source +# files in the UEM file. +# +# version 20: * change metadata discard rule: rather than discard if the midpoint +# (or endpoint) of the metadata object lies in a no-eval zone, discard +# if there is ANY overlap whatsoever between the metadata object and +# a no-eval zone. This holds for system output objects only if the +# system output metadata object is not mapped to a ref object. +# * optimize IP and SU mapping by giving a secondary bonus mapping score +# to candidate ref-sys MD map pairs if the end-words of both coincide. +# +# version 19: * bug fix in subroutine speakers_match +# * bug fix in tag_ref_words_with_metadata_info +# +# version 18: * cosmetic fix to error message in eval_condition +# * added conditional output options for word coverage performance +# * added secondary MD word coverage optimization to word alignment +# * further optimize word alignment by considering MD subtypes +# * further optimize MD alignment by considering MD subtypes +# * add a new SU discard rule: discard if TEND in no-eval zone +# * enforce legal values for su_extent_limit +# +# version 17: create_speaker_segs modified to accommodate the same speaker +# having multiple overlapping speaker segments. (This is an +# error and pathological condition, but the system must either +# disallow (abort on) the condition, or perform properly under +# the pathological condition. The second option is chosen.) +# +# version 16: * If neither -w nor -W is specified, suppress warnings about +# ref SPEAKER records subsuming no lexemes. 
+# * Output the overall speaker diarization stats after the +# stats for the individual files +# * Do not alter the case of alphabetic characters in the filename +# field from the ref rttm file +# * Made the format of the overall speaker error line more similar to +# the corresponding line of output from SpkrSegEval, to facilitate +# use of existing "grep" commands in existing scripts. +# +# version 15: * bug fix in create_speaker_segs to accommodate +# contiguous same-speaker segments +# * added conditional file/channel scoring to +# speaker diarization evaluation +# +# version 14: bug fix in md_score +# +# version 13: add DISCOURSE_RESPONSE as a FILLER subtype +# +# version 12: make REF LEXEMES optional if they aren't required +# +# version 11: change default for noscore MD regions +# +# version 10: bug fix +# +# version 09: +# * avoid crash when metadata discard yields no metadata +# * make evaluated ref_wds sensitive to metadata type +# * defer discarding of system output metadata until after +# metadata mapping, then discard only unmapped events. +# * extend 1-speaker scoring inhibition to metadata +# * eliminate demand for SPKR-INFO subtype for speakers +# * correct ref count of IP and SU exact boundary words +# * add official RT-04F scores +# * add conditional analyses for file/chnl/spkr/gender +# +# version 08: +# * bug fixes speaker diarization scoring +# - count of EVAL_WORDS corrected +# - no-score extended to nearest SPEAKER boundary +# +# version 07: +# * warning issued when discarding metadata events +# that cover LEXEMEs in the evaluation region +# +# version 06: +# * eliminated unused speakers from speaker scoring +# * changed discard algorithm for unannotated SU's and +# complex EDIT's to discard sys SU's and EDIT's when +# their midpoints overlap (rather than ANY overlap). 
+# * fixed display_metadata_mapping +# +# version 05: +# * upgraded display_metadata_mapping +# +# version 04: +# * diagnostic metadata mapping output added +# * uem_from_rttm bug fix +# +# version 03: +# * adjusted times used for speaker diarization +# * changed usage of max_extend to agree with cookbook +# +# version 02: speaker diarization evaluation added +# +# version 01: a merged version of df-eval-v14 and su-eval-v16 +# +################################# + +#global data +my $epsilon = 1E-8; +my $miss_name = " MISS"; +my $fa_name = " FALSE ALARM"; +my %rttm_datatypes = (SEGMENT => {eval => 1, "" => 1}, + NOSCORE => {"" => 1}, + NO_RT_METADATA => {"" => 1}, + LEXEME => {lex => 1, fp => 1, frag => 1, "un-lex" => 1, + "for-lex" => 1, alpha => 1, acronym => 1, + interjection => 1, propernoun => 1, other => 1}, + "NON-LEX" => {laugh => 1, breath => 1, lipsmack => 1, + cough => 1, sneeze => 1, other => 1}, + "NON-SPEECH" => {noise => 1, music => 1, other => 1}, + FILLER => {filled_pause => 1, discourse_marker => 1, + discourse_response => 1, explicit_editing_term => 1, + other => 1}, + EDIT => {repetition => 1, restart => 1, revision => 1, + simple => 1, complex => 1, other => 1}, + IP => {edit => 1, filler => 1, "edit&filler" => 1, + other => 1}, + SU => {statement => 1, backchannel => 1, question => 1, + incomplete => 1, unannotated => 1, other => 1}, + CB => {coordinating => 1, clausal => 1, other => 1}, + "A/P" => {"" => 1}, + SPEAKER => {"" => 1}, + "SPKR-INFO" => {adult_male => 1, adult_female => 1, child => 1, unknown => 1}); +my %md_subtypes = (FILLER => $rttm_datatypes{FILLER}, + EDIT => $rttm_datatypes{EDIT}, + IP => $rttm_datatypes{IP}, + SU => $rttm_datatypes{SU}); +my %spkr_subtypes = (adult_male => 1, adult_female => 1, child => 1, unknown => 1); + +my $noeval_mds = { + DEFAULT => { + NOSCORE => {"" => 1}, + NO_RT_METADATA => {"" => 1}, + }, +}; +my $noscore_mds = { + DEFAULT => { + NOSCORE => {"" => 1}, + LEXEME => {"un-lex" => 1}, + SU => 
{unannotated => 1}, + }, + MIN => { + NOSCORE => {"" => 1}, + SU => {unannotated => 1}, + }, + FRAG_UNLEX => { + NOSCORE => {"" => 1}, + LEXEME => {frag => 1, "un-lex" => 1}, + SU => {unannotated => 1}, + }, + FRAG => { + NOSCORE => {"" => 1}, + LEXEME => {frag => 1}, + SU => {unannotated => 1}, + }, + NONE => { + }, +}; +my $noeval_sds = { + DEFAULT => { + NOSCORE => {"" => 1}, + }, +}; +my $noscore_sds = { + DEFAULT => { + NOSCORE => {"" => 1}, + "NON-LEX" => {laugh => 1, breath => 1, lipsmack => 1, + cough => 1, sneeze => 1, other => 1}, + }, +}; + +my %speaker_map; + +my $default_extend = 0.50; #the maximum time (in seconds) to extend a no-score zone +my $default_collar = 0.00; #the no-score collar (in +/- seconds) to attach to SPEAKER boundaries +my $default_tgap = 1.00; #the max gap (in seconds) between matching ref/sys words +my $default_Tgap = 1.00; #the max gap (in seconds) between matching ref/sys metadata events +my $default_Wgap = 0.10; #the max gap (in words) between matching ref/sys metadata events +my $default_su_time_limit = 0.50; #the max extent (in seconds) to match for SU's +my $default_su_word_limit = 2.00; #the max extent (in words) to match for SU's +my $default_word_delta_score = 10.0; #the max delta score for word-based DP alignment of ref/sys words +my $default_time_delta_score = 1.00; #the max delta score for time-based DP alignment of ref/sys words + +my $usage = "\n\nUsage: $0 [-h] -r -s \n\n". + "Description: md-eval evaluates EARS metadata detection performance\n". + " by comparing system metadata output data with reference data\n". + "INPUT:\n". + " -R A file containing a list of the reference metadata files\n". + " being evaluated, in RTTM format. If the word-mediated alignment\n". + " option is used then this data must include reference STT data\n". + " in addition to the metadata being evaluated.\n". + " OR\n". + " -r A file containing reference metadata, in RTTM format\n\n". 
+ " -S A file containing a list of the system output metadata\n". + " files to be evaluated, in RTTM format. If the word-mediated\n". + " alignment option is used then this data must include system STT\n". + " output data in addition to the metadata to be evaluated.\n". + " OR\n". + " -s A file containin system output metadata, in RTTM format\n\n". + " input options:\n". + " -x to include complex edits in the analysis and scoring.\n". + " -w for word-mediated alignment.\n". + " * The default (time-mediated) alignment aligns ref and sys metadata\n". + " according to the time overlap of the original ref and sys metadata\n". + " time intervals.\n". + " * Word-mediated alignment aligns ref and sys metadata according to\n". + " the alignment of the words that are subsumed within the metadata\n". + " time intervals.\n". + " -W for word-optimized mapping.\n". + " * The default (time-optimized) mapping maps ref and sys metadata\n". + " so as to maximize the time overlap of mapped metadata events.\n". + " * Word-optimized mapping maps ref and sys metadata so as to\n". + " maximize the overlap in terms of the number of reference words\n". + " that are subsumed within the overlapping time interval.\n". + " -a Conditional analysis options for metadata detection performance:\n". + " c for performance versus channel,\n". + " f for performance versus file,\n". + " g for performance versus gender, and\n". + " s for performance versus speaker.\n". + " -A Conditional analysis options for word coverage performance:\n". + " c for performance versus channel,\n". + " f for performance versus file,\n". + " -t