Prosodically annotated parallel speech corpus generation using dubbed movies.
Dubbed media content is a valuable resource for generating prosodically rich parallel audio corpora. movie2parallelDB automates this process, taking in the audio tracks and subtitles of a movie and outputting aligned voice segments with text and prosody annotation.
(The instructions below are for an earlier version of the library that is no longer maintained. For instructions for the latest version, see README.md.)
Inputs:
- Movie audio in language 1 - <audio_1>
- Movie audio in language 2 - <audio_2>
- Subtitles (.srt) in language 1 - <srt_1>
- Subtitles (.srt) in language 2 - <srt_2>
Outputs:
- Language 1 cropped sentences directory - <lang1-sentence-segments-output-folder>
- Language 2 cropped sentences directory - <lang2-sentence-segments-output-folder>
- Parallel text data - <parallel-data-textdump>
- Parallel speech + prosodic parameters directory - <parallel-db-folder>
Required installations on a Linux system:
- Python 2.7, avconv, meteor, Praat
Required Python libraries:
- yandex_translate, numpy, nltk
Required API access:
- Scriber - credentials for the word aligner software API (https://scribe.vocapia.com/) should be set in src/credentials.py for this step to run. If you don't have access credentials for this service, the word segmentation output you supply should look like example/example-scriber-wordsegmentation.xml.
To extract MP3 audio from a multichannel video file, you can use ffmpeg:
ffmpeg -i <multichannel-movie-file> -map 0:1 -c:a libmp3lame -b:a:0 320k <audio_1.mp3>
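The stream index in -map 0:1 depends on how the container orders its streams, so it is worth checking which audio stream carries which language first. A minimal sketch using ffprobe (shipped with ffmpeg), with a hypothetical input file dubbed_movie.mkv and hypothetical output names:
# List the container's streams; audio tracks appear as
# "Stream #0:N ... Audio" with a language tag when present.
ffprobe -hide_banner dubbed_movie.mkv
# Extract streams 0:1 and 0:2 (here assumed to be the two dubs).
ffmpeg -i dubbed_movie.mkv -map 0:1 -c:a libmp3lame -b:a 320k audio_fre.mp3
ffmpeg -i dubbed_movie.mkv -map 0:2 -c:a libmp3lame -b:a 320k audio_ger.mp3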
Call segment_movie.py to extract sentences from an audio and subtitle pair:
python src/segment_movie.py -a <audio> -s <srt> -o <sentence-segments-output-folder> -l <lang-code> [-d debug_en]
lang-code is the ISO 639-2 language code; supported languages are [ara, fre, ger, pol, tur]
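For example, assuming the hypothetical French audio file audio_fre.mp3 from above and a matching subtitle file subs_fre.srt, with segments_fre as an arbitrary output folder name:
python src/segment_movie.py -a audio_fre.mp3 -s subs_fre.srt -o segments_fre -l fre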
To extract prosodic parameters (run from main directory):
./src/batch_f0_parametrization.sh <sentence-segments-output-folder> <lang-code>
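Continuing the hypothetical French example, this computes prosodic parameters for every sentence segment extracted into segments_fre:
./src/batch_f0_parametrization.sh segments_fre fre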
Up to this point, a monolingual prosodically annotated corpus has been created. Repeat these steps for the audio and subtitle pair of each language. Then, to create a parallel corpus, execute the following.
To find parallel sentences between the two monolingual datasets:
python src/sentenceMapper.py -e <lang1-sentence-segments-output-folder>/<lang1-code>_sentenceData.csv -s <lang2-sentence-segments-output-folder>/<lang2-code>_sentenceData.csv -o <sentence-mappings-file>
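With the hypothetical French and German runs from above, and the <lang-code>_sentenceData.csv indexes produced in each output folder, the call could look like this (fre-ger_mappings.txt is an arbitrary output name):
python src/sentenceMapper.py -e segments_fre/fre_sentenceData.csv -s segments_ger/ger_sentenceData.csv -o fre-ger_mappings.txt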
To reindex and store only parallel sentences:
./src/createParallelCorpus.sh <lang1-sentence-segments-output-folder> <lang2-sentence-segments-output-folder> <sentence-mappings-file> <parallel-db-folder> <parallel-data-textdump>
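Completing the hypothetical French-German example, this gathers the mapped sentence pairs into a parallel database folder and a parallel text dump (output names are arbitrary):
./src/createParallelCorpus.sh segments_fre segments_ger fre-ger_mappings.txt parallel_db_fre-ger parallel_fre-ger.txt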
Sample data is placed in the example directory. Run beta_run.sh to test the system on the example data.
lib/cmdautomated_ProsodyPro.praat was developed by Yi Xu.
lib/xml2textgrid_v2.pl was developed by Yvan Josse and revised by Iván Latorre & Mónica Domínguez.
Sample movie data in the example directory is from the film "The Man Who Knew Too Much" (1956) from Universal Pictures.
This work was introduced at the 10th Workshop on Building and Using Comparable Corpora (BUCC) at ACL 2017:
@inproceedings{movie2parallelDB,
author = {Alp Oktem and Mireia Farrus and Leo Wanner},
title = {Automatic extraction of parallel speech corpora from dubbed movies},
booktitle = {Proceedings of the 10th Workshop on Building and Using Comparable Corpora (BUCC)},
year = {2017},
address = {Vancouver, Canada}
}