WER evaluation conditions #1853
yassine-ajaaoun asked this question in Q&A (unanswered)
Hello,
I'm currently investigating open-source STT solutions for live transcription of company meetings. I'm comparing the 0.9.3 English DeepSpeech model with Vosk-Kaldi. The idea is to compare WERs on simple .wav files (meetings, read sentences, ...).
Before automating my tests on a larger dataset, I wanted to find the conditions under which DeepSpeech transcribes best.
What I've seen so far is that the best conditions are (besides having 16 kHz, 16-bit PCM WAVs): enough signal amplitude, not exceeding 15 s of audio, and not too much background noise.
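A quick way to check that a WAV meets those constraints is sketched below (a minimal example using only the Python standard library; the file names are placeholders, and the sox command in the comment is just one common way to convert non-conforming files):

```python
import wave

def check_wav(path):
    """Report whether a WAV matches DeepSpeech's expected input:
    16 kHz sample rate, 16-bit samples, mono PCM.
    Non-conforming files can be converted with e.g.:
        sox input.wav -r 16000 -b 16 -c 1 output.wav
    """
    with wave.open(path, "rb") as w:
        ok = (w.getframerate() == 16000
              and w.getsampwidth() == 2
              and w.getnchannels() == 1)
        print(f"{path}: {w.getframerate()} Hz, "
              f"{8 * w.getsampwidth()}-bit, "
              f"{w.getnchannels()} channel(s) -> "
              f"{'OK' if ok else 'needs conversion'}")
        return ok

check_wav("meeting_sample.wav")  # hypothetical file name
```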
I've collected such wavs and proceeded to transcribe them with DeepSpeech and Vosk.
For the DeepSpeech part, I used this code, with the wavTranscriber functions available here: https://github.com/mozilla/DeepSpeech-examples/tree/r0.9/vad_transcriber
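For reference, a minimal sketch of a DeepSpeech 0.9.3 transcription call (using the deepspeech Python package directly rather than the wavTranscriber helpers; model, scorer, and file paths are placeholders, and this is not the exact script used above):

```python
import wave
import numpy as np
from deepspeech import Model

MODEL_PATH = "deepspeech-0.9.3-models.pbmm"     # placeholder path
SCORER_PATH = "deepspeech-0.9.3-models.scorer"  # placeholder path

def transcribe(wav_path):
    """Transcribe a 16 kHz / 16-bit / mono PCM WAV with DeepSpeech 0.9.3."""
    ds = Model(MODEL_PATH)
    ds.enableExternalScorer(SCORER_PATH)
    with wave.open(wav_path, "rb") as w:
        frames = w.readframes(w.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16)
    return ds.stt(audio)

print(transcribe("meeting_sample.wav"))  # hypothetical file name
```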
I'm not using evaluate.py for evaluation but a personal script that collects the transcriptions produced by the code above for all the WAVs and compares their WERs with the Vosk-Kaldi ones.
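For comparison, the usual way such a score is computed is word-level Levenshtein distance divided by the reference length (a minimal sketch assuming lower-cased, whitespace-tokenized text; this is not the personal script mentioned above):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the meeting starts at noon", "the meeting start at noon"))  # 0.2
```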
In conclusion, Vosk-Kaldi produced noticeably better transcriptions overall, and DeepSpeech's mean WER often exceeds 60%, which isn't really good.
I would like to know whether I was doing something wrong, whether you see anything that dramatically degrades the transcription, or whether this is a common WER for average speech audio unrelated to the training set, which would mean that fine-tuning is mandatory to obtain acceptable results for real meetings (with interruptions, noise, etc.).
Thank you for your time