diff --git a/README.md b/README.md index d54c739b..dcc9eeb7 100644 --- a/README.md +++ b/README.md @@ -3,38 +3,42 @@ [![CircleCI](https://circleci.com/gh/facebookresearch/wav2letter.svg?style=svg)](https://circleci.com/gh/facebookresearch/wav2letter) [![Join the chat at https://gitter.im/wav2letter/community](https://badges.gitter.im/wav2letter/community.svg)](https://gitter.im/wav2letter/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) -wav2letter++ is a [highly efficient](https://arxiv.org/abs/1812.07625) end-to-end automatic speech recognition (ASR) toolkit written entirely in C++, leveraging [ArrayFire](https://github.com/arrayfire/arrayfire) and [flashlight](https://github.com/facebookresearch/flashlight). +## Important Note: +### wav2letter has been moved and consolidated [into Flashlight](https://github.com/facebookresearch/flashlight) in the [ASR application](https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr). -The toolkit started from models predicting letters directly from the raw waveform, and now evolved as an all-purpose end-to-end ASR research toolkit, supporting a wide range of models and learning techniques. It also embarks a very efficient modular beam-search decoder, for both structured learning (CTC, ASG) and seq2seq approaches. +Future wav2letter development will occur in Flashlight. -**Important disclaimer**: as a number of models from this repository could be used for other modalities, we moved most of the code to flashlight. +*To build the old, pre-consolidation version of wav2letter*, check out the [wav2letter v0.2](https://github.com/facebookresearch/wav2letter/releases/tag/v0.2) release, which depends on the old [Flashlight v0.2](https://github.com/facebookresearch/flashlight/releases/tag/v0.2) release. The [`wav2letter-lua`](https://github.com/facebookresearch/wav2letter/tree/wav2letter-lua) project can be found on the `wav2letter-lua` branch, accordingly. 
+For more information on wav2letter++, see or cite [this arXiv paper](https://arxiv.org/abs/1812.07625). + +## Recipes This repository includes recipes to reproduce the following research papers as well as **pre-trained** models: -- [NEW] [Pratap et al. (2020): Scaling Online Speech Recognition Using ConvNets](recipes/streaming_convnets/) -- [NEW SOTA] [Synnaeve et al. (2020): End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures](recipes/sota/2019) +- [Pratap et al. (2020): Scaling Online Speech Recognition Using ConvNets](recipes/streaming_convnets/) +- [Synnaeve et al. (2020): End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures](recipes/sota/2019) - [Kahn et al. (2020): Self-Training for End-to-End Speech Recognition](recipes/self_training) - [Likhomanenko et al. (2019): Who Needs Words? Lexicon-free Speech Recognition](recipes/lexicon_free/) - [Hannun et al. (2019): Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions](recipes/seq2seq_tds/) -Data preparation for our training and evaluation can be found in [data](data) folder. +Data preparation for training and evaluation can be found in the [data](data) directory. -The previous iteration of wav2letter can be found in the: -- (before merging codebases for wav2letter and flashlight) [wav2letter-v0.2](https://github.com/facebookresearch/wav2letter/tree/v0.2) branch. -- (written in Lua) [`wav2letter-lua`](https://github.com/facebookresearch/wav2letter/tree/wav2letter-lua) branch. +### Building the Recipes -## Build recipes -First, isntall [flashlight](https://github.com/facebookresearch/flashlight) with all its dependencies. Then +First, install [Flashlight](https://github.com/facebookresearch/flashlight) with the [ASR application](https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr). Then, after cloning the project source: +```shell +mkdir build && cd build +cmake .. 
&& make -j8 ``` -mkdir build && cd build && cmake .. && make -j8 +If Flashlight or ArrayFire are installed in nonstandard paths via a custom `CMAKE_INSTALL_PREFIX`, they can be found by passing +```shell +-Dflashlight_DIR=[PREFIX]/usr/share/flashlight/cmake/ -DArrayFire_DIR=[PREFIX]/usr/share/ArrayFire/cmake ``` -If flashlight or ArrayFire are installed in nonstandard paths via `CMAKE_INSTALL_PREFIX`, they can be found by passing `-Dflashlight_DIR=[PREFIX]/usr/share/flashlight/cmake/ -DArrayFire_DIR=[PREFIX]/usr/share/ArrayFire/cmake` when running `cmake`. +when running `cmake`. ## Join the wav2letter community * Facebook page: https://www.facebook.com/groups/717232008481207/ * Google group: https://groups.google.com/forum/#!forum/wav2letter-users * Contact: vineelkpratap@fb.com, awni@fb.com, qiantong@fb.com, jacobkahn@fb.com, antares@fb.com, avidov@fb.com, gab@fb.com, vitaliy888@fb.com, locronan@fb.com -See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out. - ## License wav2letter++ is BSD-licensed, as found in the [LICENSE](LICENSE) file. diff --git a/data/ami/README.md b/data/ami/README.md new file mode 100644 index 00000000..448abf0d --- /dev/null +++ b/data/ami/README.md @@ -0,0 +1,64 @@ +# A Recipe for the AMI corpus. + +"The AMI Meeting Corpus consists of 100 hours of meeting recordings. The recordings use a range of signals synchronized to a common timeline. These include close-talking and far-field microphones, individual and room-view video cameras, and output from a slide projector and an electronic whiteboard. During the meetings, the participants also have unsynchronized pens available to them that record what is written. The meetings were recorded in English using three different rooms with different acoustic properties, and include mostly non-native speakers." See http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml for more details. + +We use the individual headset microphone (IHM) setting for preparing train, dev and test sets. 
The recipe here is heavily inspired by the preprocessing scripts in Kaldi - https://github.com/kaldi-asr/kaldi/tree/master/egs/ami . + +## Steps to download and prepare the audio and text data + +Prepare train, dev and test sets as list files to be used for training with wav2letter. Replace `[...]` with appropriate paths + +``` +python prepare.py --dst [...] +``` + +The above script downloads the AMI data and segments it into shorter `.flac` audio files based on word timestamps. Limited-supervision training subsets are generated as well: six 10 min lists (1 hr in total) and a 9 hr list, which together cover 10 hr. + +The following structure will be generated +``` +>tree -L 4 +. +├── audio +│   ├── EN2001a +│   │   ├── EN2001a.Headset-0.wav +│   │   ├── ... +│   │   └── EN2001a.Headset-4.wav +│   ├── EN2001b +│   ├── ... +│   ├── ... +│   ├── IS1009d +│   │   ├── ... +│   │   └── IS1009d.Headset-3.wav +│   └── segments +│ ├── ES2005a +│ │ ├── ES2005a_H00_MEE018_0.75_1.61.flac +│ │ ├── ES2005a_H00_MEE018_13.19_16.05.flac +│ │ ├── ... +│ │ └── ... +│      ├── ... +│      └── IS1009d +│      ├── ... +│ └── ... +├── lists +│ ├── dev.lst +│ ├── test.lst +│ ├── train_10min_0.lst +│ ├── train_10min_1.lst +│ ├── train_10min_2.lst +│ ├── train_10min_3.lst +│ ├── train_10min_4.lst +│ ├── train_10min_5.lst +│ ├── train_9hr.lst +│ └── train.lst +│ +└── text + ├── ami_public_manual_1.6.1.zip + └── annotations + ├── 00README_MANUAL.txt + ├── ... 
+ ├── transcripts0 + ├── transcripts1 + ├── transcripts2 + ├── words + └── youUsages +``` diff --git a/data/ami/ami_split_segments.pl b/data/ami/ami_split_segments.pl new file mode 100644 index 00000000..db9ad083 --- /dev/null +++ b/data/ami/ami_split_segments.pl @@ -0,0 +1,218 @@ +#!/usr/bin/env perl + +# Copyright 2014 University of Edinburgh (Author: Pawel Swietojanski) + +# The script - based on punctuation times - splits segments longer than #words (input parameter) +# and produces bit more more normalised form of transcripts, as follows +# MeetID Channel Spkr stime etime transcripts + +#use List::MoreUtils 'indexes'; +use strict; +use warnings; + +sub split_transcripts; +sub normalise_transcripts; + +sub merge_hashes { + my ($h1, $h2) = @_; + my %hash1 = %$h1; my %hash2 = %$h2; + foreach my $key2 ( keys %hash2 ) { + if( exists $hash1{$key2} ) { + warn "Key [$key2] is in both hashes!"; + next; + } else { + $hash1{$key2} = $hash2{$key2}; + } + } + return %hash1; +} + +sub print_hash { + my ($h) = @_; + my %hash = %$h; + foreach my $k (sort keys %hash) { + print "$k : $hash{$k}\n"; + } +} + +sub get_name { + #no warnings; + my $sname = sprintf("%07d_%07d", $_[0]*100, $_[1]*100) || die 'Input undefined!'; + #use warnings; + return $sname; +} + +sub split_on_comma { + + my ($text, $comma_times, $btime, $etime, $max_words_per_seg)= @_; + my %comma_hash = %$comma_times; + + print "Btime, Etime : $btime, $etime\n"; + + my $stime = ($etime+$btime)/2; #split time + my $skey = ""; + my $otime = $btime; + foreach my $k (sort {$comma_hash{$a} cmp $comma_hash{$b} } keys %comma_hash) { + print "Key : $k : $comma_hash{$k}\n"; + my $ktime = $comma_hash{$k}; + if ($ktime==$btime) { next; } + if ($ktime==$etime) { last; } + if (abs($stime-$ktime)/20) { + $st=$comma_hash{$skey}; + $et = $etime; + } + my (@utts) = split (' ', $utts1[$i]); + if ($#utts < $max_words_per_seg) { + my $nm = get_name($st, $et); + print "SplittedOnComma[$i]: $nm : $utts1[$i]\n"; + $transcripts{$nm} = 
$utts1[$i]; + } else { + print 'Continue splitting!'; + my %transcripts2 = split_on_comma($utts1[$i], \%comma_hash, $st, $et, $max_words_per_seg); + %transcripts = merge_hashes(\%transcripts, \%transcripts2); + } + } + return %transcripts; +} + +sub split_transcripts { + @_ == 4 || die 'split_transcripts: transcript btime etime max_word_per_seg'; + + my ($text, $btime, $etime, $max_words_per_seg) = @_; + my (@transcript) = @$text; + + my (@punct_indices) = grep { $transcript[$_] =~ /^[\.,\?\!\:]$/ } 0..$#transcript; + my (@time_indices) = grep { $transcript[$_] =~ /^[0-9]+\.[0-9]*/ } 0..$#transcript; + my (@puncts_times) = delete @transcript[@time_indices]; + my (@puncts) = @transcript[@punct_indices]; + + if ($#puncts_times != $#puncts) { + print 'Ooops, different number of punctuation signs and timestamps! Skipping.'; + return (); + } + + #first split on full stops + my (@full_stop_indices) = grep { $puncts[$_] =~ /[\.\?]/ } 0..$#puncts; + my (@full_stop_times) = @puncts_times[@full_stop_indices]; + + unshift (@full_stop_times, $btime); + push (@full_stop_times, $etime); + + my %comma_puncts = (); + for (my $i=0, my $j=0;$i<=$#punct_indices; $i++) { + my $lbl = "$transcript[$punct_indices[$i]]$j"; + if ($lbl =~ /[\.\?].+/) { next; } + $transcript[$punct_indices[$i]] = $lbl; + $comma_puncts{$lbl} = $puncts_times[$i]; + $j++; + } + + #print_hash(\%comma_puncts); + + print "InpTrans : @transcript\n"; + print "Full stops: @full_stop_times\n"; + + my @utts1 = split (/[\.\?]/, uc join(' ', @transcript)); + my %transcripts = (); + for (my $i=0; $i<=$#utts1; $i++) { + my (@utts) = split (' ', $utts1[$i]); + if ($#utts < $max_words_per_seg) { + print "ReadyTrans: $utts1[$i]\n"; + $transcripts{get_name($full_stop_times[$i], $full_stop_times[$i+1])} = $utts1[$i]; + } else { + print "TransToSplit: $utts1[$i]\n"; + my %transcripts2 = split_on_comma($utts1[$i], \%comma_puncts, $full_stop_times[$i], $full_stop_times[$i+1], $max_words_per_seg); + print "Hash TR2:\n"; 
print_hash(\%transcripts2); + print "Hash TR:\n"; print_hash(\%transcripts); + %transcripts = merge_hashes(\%transcripts, \%transcripts2); + print "Hash TR_NEW : \n"; print_hash(\%transcripts); + } + } + return %transcripts; +} + +sub normalise_transcripts { + my $text = $_[0]; + + #DO SOME ROUGH AND OBVIOUS PRELIMINARY NORMALISATION, AS FOLLOWS + #remove the remaining punctation labels e.g. some text ,0 some text ,1 + $text =~ s/[\.\,\?\!\:][0-9]+//g; + #there are some extra spurious puncations without spaces, e.g. UM,I, replace with space + $text =~ s/[A-Z']+,[A-Z']+/ /g; + #split words combination, ie. ANTI-TRUST to ANTI TRUST (None of them appears in cmudict anyway) + #$text =~ s/(.*)([A-Z])\s+(\-)(.*)/$1$2$3$4/g; + $text =~ s/\-/ /g; + #substitute X_M_L with X. M. L. etc. + $text =~ s/\_/. /g; + #normalise and trim spaces + $text =~ s/^\s*//g; + $text =~ s/\s*$//g; + $text =~ s/\s+/ /g; + #some transcripts are empty with -, nullify (and ignore) them + $text =~ s/^\-$//g; + $text =~ s/\s+\-$//; + # apply few exception for dashed phrases, Mm-Hmm, Uh-Huh, etc. 
those are frequent in AMI + # and will be added to dictionary + $text =~ s/MM HMM/MM\-HMM/g; + $text =~ s/UH HUH/UH\-HUH/g; + + return $text; +} + +if (@ARGV != 2) { + print STDERR "Usage: ami_split_segments.pl \n"; + exit(1); +} + +my $meet_file = shift @ARGV; +my $out_file = shift @ARGV; +my %transcripts = (); + +open(W, ">$out_file") || die "opening output file $out_file"; +open(S, "<$meet_file") || die "opening meeting file $meet_file"; + +while(<S>) { + + my @A = split(" ", $_); + if (@A < 9) { print "Skipping line @A"; next; } + + my ($meet_id, $channel, $spk, $channel2, $trans_btime, $trans_etime, $aut_btime, $aut_etime) = @A[0..7]; + my @transcript = @A[8..$#A]; + my %transcript = split_transcripts(\@transcript, $aut_btime, $aut_etime, 30); + + for my $key (keys %transcript) { + my $value = $transcript{$key}; + my $segment = normalise_transcripts($value); + my @times = split(/\_/, $key); + if ($times[0] >= $times[1]) { + print "Warning, $meet_id, $spk, $times[0] > $times[1]. Skipping. \n"; next; + } + if (length($segment)>0) { + print W join " ", $meet_id, "H0${channel2}", $spk, $times[0]/100.0, $times[1]/100.0, $segment, "\n"; + } + } + +} +close(S); +close(W); + +print STDERR "Finished." diff --git a/data/ami/ami_xml2text.sh b/data/ami/ami_xml2text.sh new file mode 100644 index 00000000..b5d4fd63 --- /dev/null +++ b/data/ami/ami_xml2text.sh @@ -0,0 +1,47 @@ +#!/usr/bin/env bash + +# Copyright, University of Edinburgh (Pawel Swietojanski and Jonathan Kilgour) + +if [ $# -ne 1 ]; then + echo "Usage: $0 " + exit 1; +fi + +adir=$1 +wdir=$1/annotations + +[ ! -f $adir/annotations/AMI-metadata.xml ] && echo "$0: File $adir/annotations/AMI-metadata.xml not found." && exit 1; + +mkdir -p $wdir/log + +JAVA_VER=$(java -version 2>&1 | sed 's/java version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q') + +if [ "$JAVA_VER" -ge 15 ]; then + if [ ! -d $wdir/nxt ]; then + echo "Downloading NXT annotation tool..." 
+ wget -O $wdir/nxt.zip http://sourceforge.net/projects/nite/files/nite/nxt_1.4.4/nxt_1.4.4.zip + [ ! -s $wdir/nxt.zip ] && echo "Downloading failed! ($wdir/nxt.zip)" && exit 1 + unzip -d $wdir/nxt $wdir/nxt.zip &> /dev/null + fi + + if [ ! -f $wdir/transcripts0 ]; then + echo "Parsing XML files (can take several minutes)..." + nxtlib=$wdir/nxt/lib + java -cp $nxtlib/nxt.jar:$nxtlib/xmlParserAPIs.jar:$nxtlib/xalan.jar:$nxtlib \ + FunctionQuery -c $adir/annotations/AMI-metadata.xml -q '($s segment)(exists $w1 w):$s^$w1' -atts obs who \ + '@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent,global_name, 0)'\ + '@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent, channel, 0)' \ + transcriber_start transcriber_end starttime endtime '$s' '@extract(($w w):$s^$w & $w@punc="true", starttime,0,0)' \ + 1> $wdir/transcripts0 2> $wdir/log/nxt_export.log + fi +else + echo "$0. Java not found. Will download exported version of transcripts." + annots=ami_manual_annotations_v1.6.1_export + wget -O $wdir/$annots.gzip http://groups.inf.ed.ac.uk/ami/AMICorpusAnnotations/$annots.gzip + gunzip -c $wdir/${annots}.gzip > $wdir/transcripts0 +fi + +#remove NXT logs dumped to stdio +grep -e '^Found' -e '^Obs' -i -v $wdir/transcripts0 > $wdir/transcripts1 + +exit 0; diff --git a/data/ami/prepare.py b/data/ami/prepare.py new file mode 100644 index 00000000..721a2a91 --- /dev/null +++ b/data/ami/prepare.py @@ -0,0 +1,120 @@ +""" +Copyright (c) Facebook, Inc. and its affiliates. + +This source code is licensed under the BSD-style license found in the +LICENSE file in the root directory of this source tree. + +---------- + +Script to package original AMI dataset into a form readable in +wav2letter++ pipelines + +Command : python3 prepare.py --dst [...] + +Replace [...] 
with appropriate path +""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import os +from multiprocessing import Pool + +from tqdm import tqdm +from utils import split_audio, create_limited_sup + + +LOG_STR = " To regenerate this file, please, remove it." + +MIN_DURATION_MSEC = 50 # 50 msec +MAX_DURATION_MSEC = 30000 # 30 sec + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="AMI Dataset creation.") + parser.add_argument( + "--dst", + help="destination directory where to store data", + default="./ami", + ) + parser.add_argument( + "-p", + "--process", + help="number of process for multiprocessing", + default=8, + type=int, + ) + + args = parser.parse_args() + + splits = {"train": [], "dev": [], "test": []} + audio_path = os.path.join(args.dst, "audio") + text_path = os.path.join(args.dst, "text") + lists_path = os.path.join(args.dst, "lists") + os.makedirs(audio_path, exist_ok=True) + os.makedirs(text_path, exist_ok=True) + os.makedirs(lists_path, exist_ok=True) + audio_http = "http://groups.inf.ed.ac.uk/ami" + + # Download the audio data + print("Downloading the AMI audio data...", flush=True) + cmds = [] + for split in splits.keys(): + with open(os.path.join("splits", f"split_{split}.orig")) as f: + for line in f: + line = line.strip() + splits[split].append(line) + cur_audio_path = os.path.join(audio_path, line) + os.makedirs(cur_audio_path, exist_ok=True) + num_meetings = 5 if line in ["EN2001a", "EN2001d", "EN2001e"] else 4 + for meetid in range(num_meetings): + cmds.append( + f"wget -nv --continue -o /dev/null -P {cur_audio_path} {audio_http}/AMICorpusMirror/amicorpus/{line}/audio/{line}.Headset-{meetid}.wav" + ) + + for i in tqdm(range(len(cmds))): + os.system(cmds[i]) + + print("Downloading the text data ...", flush=True) + annotver = "ami_public_manual_1.6.1.zip" + cmd = f"wget -nv --continue -o /dev/null -P {text_path} {audio_http}/AMICorpusAnnotations/{annotver};" + cmd 
= cmd + f"mkdir -p {text_path}/annotations;" + cmd = cmd + f"unzip -q -o -d {text_path}/annotations {text_path}/{annotver} ;" + os.system(cmd) + + print("Parsing the transcripts ...", flush=True) + cmd = f"sh ami_xml2text.sh {text_path};" + os.system(cmd) + cmd = f"perl ami_split_segments.pl {text_path}/annotations/transcripts1 {text_path}/annotations/transcripts2 2>&1 > {text_path}/annotations/split_segments.log" + os.system(cmd) + + # Prepare the audio data + print("Segmenting audio files...", flush=True) + with open(f"{text_path}/annotations/transcripts2") as f: + lines = f.readlines() + lines = [audio_path + " " + line for line in lines] + os.makedirs(os.path.join(audio_path, "segments"), exist_ok=True) + with Pool(args.process) as p: + samples = list( + tqdm( + p.imap(split_audio, lines), + total=len(lines), + ) + ) + samples = [s for s in samples if s is not None] # filter None values + print("Wrote {} audio segment samples".format(len(samples))) + + print("Writing to list files...", flush=True) + for split, meetings in splits.items(): + cur_samples = [s for s in samples if s[0] in meetings] + with open(os.path.join(lists_path, f"{split}.lst"), "w") as fout: + for sample in cur_samples: + if ( + float(sample[3]) > MIN_DURATION_MSEC + and float(sample[3]) < MAX_DURATION_MSEC + ): + fout.write("\t".join(sample[1:]) + "\n") + + print("Preparing limited supervision subsets", flush=True) + create_limited_sup(lists_path) + + print("Done!", flush=True) diff --git a/data/ami/splits/split_dev.orig b/data/ami/splits/split_dev.orig new file mode 100644 index 00000000..90e6ae26 --- /dev/null +++ b/data/ami/splits/split_dev.orig @@ -0,0 +1,18 @@ +ES2011a +ES2011b +ES2011c +ES2011d +IB4001 +IB4002 +IB4003 +IB4004 +IB4010 +IB4011 +IS1008a +IS1008b +IS1008c +IS1008d +TS3004a +TS3004b +TS3004c +TS3004d diff --git a/data/ami/splits/split_test.orig b/data/ami/splits/split_test.orig new file mode 100644 index 00000000..5c89342d --- /dev/null +++ b/data/ami/splits/split_test.orig 
@@ -0,0 +1,16 @@ +EN2002a +EN2002b +EN2002c +EN2002d +ES2004a +ES2004b +ES2004c +ES2004d +IS1009a +IS1009b +IS1009c +IS1009d +TS3003a +TS3003b +TS3003c +TS3003d diff --git a/data/ami/splits/split_train.orig b/data/ami/splits/split_train.orig new file mode 100644 index 00000000..2a66efa4 --- /dev/null +++ b/data/ami/splits/split_train.orig @@ -0,0 +1,137 @@ +EN2001a +EN2001b +EN2001d +EN2001e +EN2003a +EN2004a +EN2005a +EN2006a +EN2006b +EN2009b +EN2009c +EN2009d +ES2002a +ES2002b +ES2002c +ES2002d +ES2003a +ES2003b +ES2003c +ES2003d +ES2005a +ES2005b +ES2005c +ES2005d +ES2006a +ES2006b +ES2006c +ES2006d +ES2007a +ES2007b +ES2007c +ES2007d +ES2008a +ES2008b +ES2008c +ES2008d +ES2009a +ES2009b +ES2009c +ES2009d +ES2010a +ES2010b +ES2010c +ES2010d +ES2012a +ES2012b +ES2012c +ES2012d +ES2013a +ES2013b +ES2013c +ES2013d +ES2014a +ES2014b +ES2014c +ES2014d +ES2015a +ES2015b +ES2015c +ES2015d +ES2016a +ES2016b +ES2016c +ES2016d +IB4005 +IN1001 +IN1002 +IN1005 +IN1007 +IN1008 +IN1009 +IN1012 +IN1013 +IN1014 +IN1016 +IS1000a +IS1000b +IS1000c +IS1000d +IS1001a +IS1001b +IS1001c +IS1001d +IS1002b +IS1002c +IS1002d +IS1003a +IS1003b +IS1003c +IS1003d +IS1004a +IS1004b +IS1004c +IS1004d +IS1005a +IS1005b +IS1005c +IS1006a +IS1006b +IS1006c +IS1006d +IS1007a +IS1007b +IS1007c +IS1007d +TS3005a +TS3005b +TS3005c +TS3005d +TS3006a +TS3006b +TS3006c +TS3006d +TS3007a +TS3007b +TS3007c +TS3007d +TS3008a +TS3008b +TS3008c +TS3008d +TS3009a +TS3009b +TS3009c +TS3009d +TS3010a +TS3010b +TS3010c +TS3010d +TS3011a +TS3011b +TS3011c +TS3011d +TS3012a +TS3012b +TS3012c +TS3012d diff --git a/data/ami/utils.py b/data/ami/utils.py new file mode 100644 index 00000000..cb671070 --- /dev/null +++ b/data/ami/utils.py @@ -0,0 +1,210 @@ +""" +Copyright (c) Facebook, Inc. and its affiliates. + +This source code is licensed under the BSD-style license found in the +LICENSE file in the root directory of this source tree. 
+""" + + +from __future__ import absolute_import, division, print_function, unicode_literals + +import copy +import os +import random +from collections import namedtuple + +import sox + +Speaker = namedtuple("Speaker", ["id", "gender"]) +FileRecord = namedtuple("FileRecord", ["fid", "length", "speaker"]) + + +def split_audio(line): + apath, meetid, hset, spk, start, end, transcript = line.strip().split(" ", 6) + key = "_".join([meetid, hset, spk, start, end]) + os.makedirs(os.path.join(apath, "segments", meetid), exist_ok=True) + idx = hset[-1] + fn = f"{meetid}.Headset-{idx}.wav" + infile = os.path.join(apath, meetid, fn) + assert os.path.exists(infile), f"{infile} doesn't exist" + new_path = os.path.join(apath, "segments", meetid, key + ".flac") + sox_tfm = sox.Transformer() + sox_tfm.set_output_format( + file_type="flac", encoding="signed-integer", bits=16, rate=16000 + ) + start = float(start) + end = float(end) + sox_tfm.trim(start, end) + sox_tfm.build(infile, new_path) + sx_dur = sox.file_info.duration(new_path) + if sx_dur is not None and abs(sx_dur - end + start) < 0.5: + return [meetid, key, new_path, str(round(sx_dur * 1000, 2)), transcript.lower()] + + +def do_split(all_records, spkrs, total_seconds, handles_chosen=None): + """ + Greedily selecting speakers, provided we don't go over budget + """ + time_taken = 0.0 + records_filtered = [] + idx = 0 + speakers = copy.deepcopy(spkrs) + current_speaker_time = {spk: 0 for spk in speakers} + current_speaker_idx = {spk: 0 for spk in speakers} + while True: + if len(speakers) == 0: + break + speaker = speakers[idx % len(speakers)] + idx += 1 + tocontinue = False + while True: + cur_spk_idx = current_speaker_idx[speaker] + if cur_spk_idx == len(all_records[speaker]): + speakers.remove(speaker) + tocontinue = True + break + cur_record = all_records[speaker][cur_spk_idx] + current_speaker_idx[speaker] += 1 + if handles_chosen is None or cur_record.fid not in handles_chosen: + break + if tocontinue: + continue + 
+ records_filtered.append(cur_record) + time_taken += cur_record.length + current_speaker_time[speaker] += cur_record.length + if abs(time_taken - total_seconds) < 10: + break + + return records_filtered, time_taken + + +def get_speakers(train_file): + cache = {} + all_speakers = [] + with open(train_file) as f: + for line in f: + spl = line.split() + speaker_id = spl[0].split("_")[2] + gender = speaker_id[0] + if gender not in ["M", "F"]: + continue + if speaker_id not in cache: + cache[speaker_id] = 1 + speaker = Speaker(id=speaker_id, gender=gender) + all_speakers.append(speaker) + return all_speakers + + +def get_fid2length(train_file): + fids = [] + lengths = [] + with open(train_file) as f: + for line in f: + spl = line.split() + fids.append(spl[0]) + lengths.append(float(spl[2]) / 1000) + return list(zip(fids, lengths)) + + +def full_records(speakers, fid2length, subset_name=None): + all_records = [] + speakers = {speaker.id: speaker for speaker in speakers} + + for fid, length in fid2length: + speaker = fid.split("_")[2] + assert speaker in speakers, f"Unknown speaker! 
{speaker}" + + speaker = speakers[speaker] + + if subset_name is not None: + assert subset_name == speaker.subset + frecord = FileRecord(speaker=speaker, length=length, fid=fid) + all_records.append(frecord) + return all_records + + +def get_speaker2time(records, lambda_key, lambda_value): + from collections import defaultdict + + key_value = defaultdict(int) + + for record in records: + key = lambda_key(record) + value = lambda_value(record) + key_value[key] += value + + return key_value + + +def create_limited_sup(list_dir): + random.seed(0) + train_file = os.path.join(list_dir, "train.lst") + assert os.path.exists(train_file) + + speakers = get_speakers(train_file) + print("Found speakers", len(speakers)) + + write_records = {} + chosen_records = {} + + fid2length = get_fid2length(train_file) + all_records = full_records(speakers, fid2length) + + for gender in ["M", "F"]: + print(f"Selecting from gender {gender}") + records = [rec for rec in all_records if rec.speaker.gender == gender] + + speaker2time = get_speaker2time( + records, lambda_key=lambda r: r.speaker.id, lambda_value=lambda r: r.length + ) + + # select 15 random speakers + min_minutes_per_speaker = 15 + speakers_10hr = { + r.speaker.id + for r in records + if speaker2time[r.speaker.id] >= min_minutes_per_speaker * 60 + } + speakers_10hr = sorted(speakers_10hr) + random.shuffle(speakers_10hr) + speakers_10hr = speakers_10hr[:15] + + print(f"Selected speakers from gender {gender} ", speakers_10hr) + + cur_records = {} + for speaker in speakers_10hr: + cur_records[speaker] = [r for r in records if r.speaker.id == speaker] + random.shuffle(cur_records[speaker]) + + # 1 hr as 6 x 10min splits + key = "10min_" + gender + write_records[key] = {} + for i in range(6): + speakers_10min = random.sample(speakers_10hr, 3) + write_records[key][i], _ = do_split( + cur_records, speakers_10min, 10 * 60 / 2, chosen_records + ) + for kk in write_records[key][i]: + chosen_records[kk.fid] = 1 + + # 9 hr + key = 
"9hr_" + gender + write_records[key], _ = do_split( + cur_records, speakers_10hr, (9 * 60 * 60) / 2, chosen_records + ) + + train_lines = {} + with open(train_file) as f: + for line in f: + train_lines[line.split()[0]] = line.strip() + + print("Writing 6 x 10min list files...") + for i in range(6): + with open(os.path.join(list_dir, f"train_10min_{i}.lst"), "w") as fo: + for record in write_records["10min_M"][i] + write_records["10min_F"][i]: + fo.write(train_lines[record.fid]) + + print("Writing 9hr list file...") + with open(os.path.join(list_dir, "train_9hr.lst"), "w") as fo: + for record in write_records["9hr_M"] + write_records["9hr_F"]: + fo.write(train_lines[record.fid]) diff --git a/recipes/mls/README.md b/recipes/mls/README.md index e54ad793..8692f06b 100644 --- a/recipes/mls/README.md +++ b/recipes/mls/README.md @@ -69,7 +69,7 @@ Follow the steps [here](../../data/mls/) to download and prepare the datset for #### Viterbi ``` -[...]/flashlight/build/bin/asr/fl_asr_test --am=[...]/am.bin --lexicon=[...]/train_lexicon.txt --datadir=[...] --test=test.lst --tokensdir=[...] --tokens=[...]/tokens.txt --emission_dir='' --nouselexicon --show +[...]/flashlight/build/bin/asr/fl_asr_test --am=[...]/am.bin --lexicon=[...]/train_lexicon.txt --datadir=[...] --test=test.lst --tokens=[...]/tokens.txt --emission_dir='' --nouselexicon --show ``` #### Beam search with language model diff --git a/recipes/mls/decode/dutch.cfg b/recipes/mls/decode/dutch.cfg index a5012c39..b176e2cf 100644 --- a/recipes/mls/decode/dutch.cfg +++ b/recipes/mls/decode/dutch.cfg @@ -1,6 +1,5 @@ --am=[...]/am.bin ---tokensdir=[...] ---tokens=[...]/tokens.txt +--tokens=[...]/[...]/tokens.txt --lm=[...]/5-gram_lm.arpa --lexicon=[...]/joint_lexicon.txt --datadir=[...] diff --git a/recipes/mls/decode/english.cfg b/recipes/mls/decode/english.cfg index 03502f4e..058f8ff7 100644 --- a/recipes/mls/decode/english.cfg +++ b/recipes/mls/decode/english.cfg @@ -1,6 +1,5 @@ --am=[...]/am.bin ---tokensdir=[...] 
---tokens=[...]/tokens.txt +--tokens=[...]/[...]/tokens.txt --lm=[...]/5-gram_lm.arpa --lexicon=[...]/joint_lexicon.txt --datadir=[...] diff --git a/recipes/mls/decode/french.cfg b/recipes/mls/decode/french.cfg index 6ec22a77..cf1ebb61 100644 --- a/recipes/mls/decode/french.cfg +++ b/recipes/mls/decode/french.cfg @@ -1,6 +1,5 @@ --am=[...]/am.bin ---tokensdir=[...] ---tokens=[...]/tokens.txt +--tokens=[...]/[...]/tokens.txt --lm=[...]/5-gram_lm.arpa --lexicon=[...]/joint_lexicon.txt --datadir=[...] diff --git a/recipes/mls/decode/german.cfg b/recipes/mls/decode/german.cfg index fe4cc9b7..2d0acad8 100644 --- a/recipes/mls/decode/german.cfg +++ b/recipes/mls/decode/german.cfg @@ -1,6 +1,5 @@ --am=[...]/am.bin ---tokensdir=[...] ---tokens=[...]/tokens.txt +--tokens=[...]/[...]/tokens.txt --lm=[...]/5-gram_lm.arpa --lexicon=[...]/joint_lexicon.txt --datadir=[...] diff --git a/recipes/mls/decode/italian.cfg b/recipes/mls/decode/italian.cfg index e75a3c27..2b804e9d 100644 --- a/recipes/mls/decode/italian.cfg +++ b/recipes/mls/decode/italian.cfg @@ -1,6 +1,5 @@ --am=[...]/am.bin ---tokensdir=[...] ---tokens=[...]/tokens.txt +--tokens=[...]/[...]/tokens.txt --lm=[...]/5-gram_lm.arpa --lexicon=[...]/joint_lexicon.txt --datadir=[...] diff --git a/recipes/mls/decode/polish.cfg b/recipes/mls/decode/polish.cfg index e6876f6c..0e428297 100644 --- a/recipes/mls/decode/polish.cfg +++ b/recipes/mls/decode/polish.cfg @@ -1,6 +1,5 @@ --am=[...]/am.bin ---tokensdir=[...] ---tokens=[...]/tokens.txt +--tokens=[...]/[...]/tokens.txt --lm=[...]/5-gram_lm.arpa --lexicon=[...]/joint_lexicon.txt --datadir=[...] diff --git a/recipes/mls/decode/portuguese.cfg b/recipes/mls/decode/portuguese.cfg index e8647f5e..e8986140 100644 --- a/recipes/mls/decode/portuguese.cfg +++ b/recipes/mls/decode/portuguese.cfg @@ -1,6 +1,5 @@ --am=[...]/am.bin ---tokensdir=[...] ---tokens=[...]/tokens.txt +--tokens=[...]/[...]/tokens.txt --lm=[...]/5-gram_lm.arpa --lexicon=[...]/joint_lexicon.txt --datadir=[...] 
diff --git a/recipes/mls/decode/spanish.cfg b/recipes/mls/decode/spanish.cfg index 8d762517..dceaf3e0 100644 --- a/recipes/mls/decode/spanish.cfg +++ b/recipes/mls/decode/spanish.cfg @@ -1,6 +1,5 @@ --am=[...]/am.bin ---tokensdir=[...] ---tokens=[...]/tokens.txt +--tokens=[...]/[...]/tokens.txt --lm=[...]/5-gram_lm.arpa --lexicon=[...]/joint_lexicon.txt --datadir=[...] diff --git a/recipes/mls/train/dutch.cfg b/recipes/mls/train/dutch.cfg index e8af7fae..17ae45b2 100644 --- a/recipes/mls/train/dutch.cfg +++ b/recipes/mls/train/dutch.cfg @@ -27,10 +27,8 @@ --saug_tmaskp=0.1 --saug_tmaskn=10 --datadir=[...] ---archdir=[...] ---arch=arch.txt ---tokensdir=[...] ---tokens=tokens.txt +--arch=[...]/arch.txt +--tokens=[...]/tokens.txt --lexicon=[...]/train_lexicon.txt --train=train.lst --valid=dev.lst diff --git a/recipes/mls/train/english.cfg b/recipes/mls/train/english.cfg index d3c98333..7362b446 100644 --- a/recipes/mls/train/english.cfg +++ b/recipes/mls/train/english.cfg @@ -28,10 +28,8 @@ --saug_tmaskn=10 --reportiters=5000 --datadir=[...] ---archdir=[...] ---arch=arch.txt ---tokensdir=[...] ---tokens=tokens.txt +--arch=[...]/arch.txt +--tokens=[...]/tokens.txt --lexicon=[...]/train_lexicon.txt --train=train.lst --valid=dev.lst diff --git a/recipes/mls/train/french.cfg b/recipes/mls/train/french.cfg index e8af7fae..17ae45b2 100644 --- a/recipes/mls/train/french.cfg +++ b/recipes/mls/train/french.cfg @@ -27,10 +27,8 @@ --saug_tmaskp=0.1 --saug_tmaskn=10 --datadir=[...] ---archdir=[...] ---arch=arch.txt ---tokensdir=[...] ---tokens=tokens.txt +--arch=[...]/arch.txt +--tokens=[...]/tokens.txt --lexicon=[...]/train_lexicon.txt --train=train.lst --valid=dev.lst diff --git a/recipes/mls/train/german.cfg b/recipes/mls/train/german.cfg index e8af7fae..17ae45b2 100644 --- a/recipes/mls/train/german.cfg +++ b/recipes/mls/train/german.cfg @@ -27,10 +27,8 @@ --saug_tmaskp=0.1 --saug_tmaskn=10 --datadir=[...] ---archdir=[...] ---arch=arch.txt ---tokensdir=[...] 
---tokens=tokens.txt +--arch=[...]/arch.txt +--tokens=[...]/tokens.txt --lexicon=[...]/train_lexicon.txt --train=train.lst --valid=dev.lst diff --git a/recipes/mls/train/italian.cfg b/recipes/mls/train/italian.cfg index 2563a0e9..1a25973c 100644 --- a/recipes/mls/train/italian.cfg +++ b/recipes/mls/train/italian.cfg @@ -27,10 +27,8 @@ --saug_tmaskp=0.1 --saug_tmaskn=10 --datadir=[...] ---archdir=[...] ---arch=arch.txt ---tokensdir=[...] ---tokens=tokens.txt +--arch=[...]/arch.txt +--tokens=[...]/tokens.txt --lexicon=[...]/train_lexicon.txt --train=train.lst --valid=dev.lst diff --git a/recipes/mls/train/polish.cfg b/recipes/mls/train/polish.cfg index 2563a0e9..1a25973c 100644 --- a/recipes/mls/train/polish.cfg +++ b/recipes/mls/train/polish.cfg @@ -27,10 +27,8 @@ --saug_tmaskp=0.1 --saug_tmaskn=10 --datadir=[...] ---archdir=[...] ---arch=arch.txt ---tokensdir=[...] ---tokens=tokens.txt +--arch=[...]/arch.txt +--tokens=[...]/tokens.txt --lexicon=[...]/train_lexicon.txt --train=train.lst --valid=dev.lst diff --git a/recipes/mls/train/portuguese.cfg b/recipes/mls/train/portuguese.cfg index 2563a0e9..1a25973c 100644 --- a/recipes/mls/train/portuguese.cfg +++ b/recipes/mls/train/portuguese.cfg @@ -27,10 +27,8 @@ --saug_tmaskp=0.1 --saug_tmaskn=10 --datadir=[...] ---archdir=[...] ---arch=arch.txt ---tokensdir=[...] ---tokens=tokens.txt +--arch=[...]/arch.txt +--tokens=[...]/tokens.txt --lexicon=[...]/train_lexicon.txt --train=train.lst --valid=dev.lst diff --git a/recipes/mls/train/spanish.cfg b/recipes/mls/train/spanish.cfg index e8af7fae..17ae45b2 100644 --- a/recipes/mls/train/spanish.cfg +++ b/recipes/mls/train/spanish.cfg @@ -27,10 +27,8 @@ --saug_tmaskp=0.1 --saug_tmaskn=10 --datadir=[...] ---archdir=[...] ---arch=arch.txt ---tokensdir=[...] 
---tokens=tokens.txt +--arch=[...]/arch.txt +--tokens=[...]/tokens.txt --lexicon=[...]/train_lexicon.txt --train=train.lst --valid=dev.lst diff --git a/recipes/rasr/README.md b/recipes/rasr/README.md new file mode 100644 index 00000000..2a7af51a --- /dev/null +++ b/recipes/rasr/README.md @@ -0,0 +1,86 @@ +# RASR release + +This repository shares pre-trained acoustic models and language models for our new paper [Rethinking Evaluation in ASR: Are Our Models Robust Enough?](https://arxiv.org/abs/2010.11745). + + +## Dependencies + +* [`Flashlight`](https://github.com/facebookresearch/flashlight) +* [`Flashlight` ASR app](https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr) + +## Models + +### Acoustic Model + +All the acoustic models are retrained using `Flashlight`, into which [wav2letter++](https://github.com/facebookresearch/wav2letter) has been consolidated. `Tedlium` is not used as training data here due to licensing issues. All the training data uses the more standard 16kHz sample rate rather than the 8kHz used in the paper. + +Here, we are releasing models with different architectures and sizes. Note that the models may not fully reproduce the results in the paper because of discrepancies in both the data and the toolkit implementation. 
+ +|Architecture |# Param |Arch File |Path | +| :---: | :---: | :---: | :---: | +|Transformer |300 mil |[am_transformer_ctc_stride3_letters_300Mparams.arch](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_300Mparams.arch) |[am_transformer_ctc_stride3_letters_300Mparams.bin](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_300Mparams.bin) | +|Transformer |70 mil |[am_transformer_ctc_stride3_letters_70Mparams.arch](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_70Mparams.arch) |[am_transformer_ctc_stride3_letters_70Mparams.bin](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_70Mparams.bin) | +|Conformer |300 mil |[am_conformer_ctc_stride3_letters_300Mparams.arch](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_conformer_ctc_stride3_letters_300Mparams.arch) |[am_conformer_ctc_stride3_letters_300Mparams.bin](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_conformer_ctc_stride3_letters_300Mparams.bin) | +|Conformer |87 mil |[am_conformer_ctc_stride3_letters_87Mparams.arch](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_conformer_ctc_stride3_letters_87Mparams.arch) |[am_conformer_ctc_stride3_letters_87Mparams.bin](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_conformer_ctc_stride3_letters_87Mparams.bin) | +|Conformer |28 mil |[am_conformer_ctc_stride3_letters_25Mparams.arch](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_conformer_ctc_stride3_letters_25Mparams.arch) |[am_conformer_ctc_stride3_letters_25Mparams.bin](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_conformer_ctc_stride3_letters_25Mparams.bin) | + + + +### Language Model + +The language models are trained on the Common Crawl corpus, as described in the paper. 
We provide 4-gram LMs with different pruning settings, built on the [top 200k words](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_200kvocab.txt). All the LMs are trained with the [KenLM toolkit](https://kheafield.com/code/kenlm/). + +| Pruning Param |Size (GB) |Path | +| :---: | :---: | :---: | +|0 0 5 5 |8.4 |[large](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_large_4gram_prun0-0-5_200kvocab.bin) | +|0 6 15 15 |2.5 |[small](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin) | + +The perplexities of the LMs on different development sets are listed below. + +| LM |nov93dev |TL-dev |CV-dev |LS-dev-clean |LS-dev-other |RT03 | +| :---: | :---: | :---: | :---: | :---: | :---: | :---: | +| Large |313 |158 |243 |303 |304 |227 | +| Small |331 |178 |262 |330 |325 |226 | + + +### WER + +Here we summarize the decoding WER for all released models. All the numbers in the table are in the format `viterbi WER -> beam search WER (small beam/large beam)`. 
+ +|Architecture |# Param |nov92 |TL-test |CV-test |LS-test-clean |LS-test-other |Hub05-SWB |Hub05-CH | +| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | +|Transformer |300 mil |3.4 → 2.9/2.9 |7.6 → 5.5/5.4 |15.5 → 11.6/11.2 |3.0 → 3.2/3.2 |7.2 → 6.4/6.4 |6.8 → 6.2/6.2 |11.6 → 10.8/10.7 | +|Transformer |70 mil |4.5 → 3.7/3.5 |9.4 → 6.2/6.1 |19.8 → 13.8/13.0 |4 → 3.6/3.6 |9.7 → 7.7/7.5 |7.5 → 6.6/6.5 |13 → 11.8/11.7 | +|Conformer |300 mil |3.5 → 3.3/3.3 |8.4 → 6.2/6.0 |17 → 12.7/12.0 |3.2 → 3.4/3.4 |8 → 7/6.8 |7 → 6.4/6.5 |11.9 → 10.7/10.5 | +|Conformer |87 mil |4.3 → 3.3/3.3 |8.7 → 6.1/5.9 |18.2 → 13.1/12.4 |3.7 → 3.5/3.5 |8.6 → 7.4/7.2 |7.3 → 6.7/6.7 |12.2 → 11.7/11.5 | +|Conformer |28 mil |5 → 3.9/3.8 |10.5 → 6.9/6.6 |22.2 → 15.4/14.4 |4.7 → 4/3.9 |11.1 → 8.9/8.6 |8.8 → 7.8/7.7 |13.7 → 12.4/12.2 | + +Decoding is done with a lexicon-based beam-search decoder using the 200k Common Crawl lexicon and the small Common Crawl LM. +* [tokens](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/tokens.txt) +* [inference lexicon](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt) +* Decoding parameters (`beamthreshold=100, beamsizetoken=30`): + +|Architecture |# Param |LM Weight |Word Score |Beam Size | +| :---: | :---: | :---: | :---: | :---: | +|Transformer |300 mil |1.5 |0 |50/500 | +|Transformer |70 mil |1.7 |0 |50/500 | +|Conformer |300 mil |1.8 |2 |50/500 | +|Conformer |87 mil |2 |0 |50/500 | +|Conformer |28 mil |2 |0 |50/500 | + +## Tutorial + +To serialize all the models and interact with them, please refer to the [`Flashlight` ASR app tutorials](https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr/tutorial). 
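As a rough illustration of how the decoding-parameter table above might be turned into a flags file, here is a minimal Python sketch. It is not part of the release. The flag names `--am`, `--tokens`, `--lm`, and `--lexicon` follow the recipe `.cfg` files in this change, and `beamthreshold` and `beamsizetoken` are quoted above. The names `--lmweight`, `--wordscore`, and `--beamsize` are assumptions about the `Flashlight` ASR decoder and should be checked against the tutorials. `DecodeParams`, `decode_params`, and `render_flags` are hypothetical helpers, and all paths are left as `[...]` placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeParams:
    lm_weight: float
    word_score: float
    beam_size: int  # 50 (small beam) or 500 (large beam)
    beam_threshold: float = 100  # from `beamthreshold=100` above
    beam_size_token: int = 30  # from `beamsizetoken=30` above


# (architecture, # param) -> (LM weight, word score), copied from the table above.
TABLE = {
    ("Transformer", "300 mil"): (1.5, 0),
    ("Transformer", "70 mil"): (1.7, 0),
    ("Conformer", "300 mil"): (1.8, 2),
    ("Conformer", "87 mil"): (2, 0),
    ("Conformer", "28 mil"): (2, 0),
}


def decode_params(arch, size, large_beam=False):
    """Look up the tuned values for one model and pick the beam size."""
    lm_weight, word_score = TABLE[(arch, size)]
    return DecodeParams(lm_weight, word_score, 500 if large_beam else 50)


def render_flags(p, am, tokens, lm, lexicon):
    """Render one flag per line, in the style of the recipe .cfg files."""
    return "\n".join([
        f"--am={am}",
        f"--tokens={tokens}",
        f"--lm={lm}",
        f"--lexicon={lexicon}",
        f"--lmweight={p.lm_weight:g}",  # assumed flag name
        f"--wordscore={p.word_score:g}",  # assumed flag name
        f"--beamsize={p.beam_size}",  # assumed flag name
        f"--beamthreshold={p.beam_threshold:g}",
        f"--beamsizetoken={p.beam_size_token}",
    ])


if __name__ == "__main__":
    params = decode_params("Conformer", "300 mil", large_beam=True)
    # Paths are placeholders; substitute the downloaded files.
    print(render_flags(params, "[...]/am.bin", "[...]/tokens.txt",
                       "[...]/lm.bin", "[...]/lexicon.txt"))
```

Keeping the table values in one dictionary makes it easy to switch between the small and large beam for the same model without retyping the LM weight and word score.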
+ + + +## Citation + +``` +@article{likhomanenko2020rethinking, + title={Rethinking Evaluation in ASR: Are Our Models Robust Enough?}, + author={Likhomanenko, Tatiana and Xu, Qiantong and Pratap, Vineel and Tomasello, Paden and Kahn, Jacob and Avidov, Gilad and Collobert, Ronan and Synnaeve, Gabriel}, + journal={arXiv preprint arXiv:2010.11745}, + year={2020} +} +```