Preprocess the prosodically annotated TED corpus. Annotations of talks are prepared using: https://github.com/laic/prosody
### Input files:
- `.word.txt` (cmd input as `-w`)
- `.word.txt.norm.align` (cmd input as `-l`)
- `.aggs.alignword.txt` for fundamental frequency and intensity (cmd input as `-f` and `-i`)
### Output file:
- CSV file with word-aligned features (cmd input as `-o`)
```
python tedDataToPickle.py -w data/raw/txt-sent/0001.word.txt -l data/raw/txt-sent/0001.word.txt.norm.align -f data/raw/derived/segs/f0/0001.aggs.alignword.txt -i data/raw/derived/segs/i0/0001.aggs.alignword.txt -o 0001.csv
```
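The exact column layout of the output CSV is defined by `tedDataToPickle.py`; as a minimal sketch, assuming it is a standard comma-separated table with one row per word, it can be inspected like this:

```python
# Minimal sketch: inspect the word-aligned feature CSV produced above.
# Assumes a standard CSV with one row per word; the exact column names
# (word, f0/intensity aggregates, etc.) depend on tedDataToPickle.py.
import pandas as pd

features = pd.read_csv("0001.csv")
print(features.shape)              # number of words x number of features
print(features.columns.tolist())   # feature names emitted by the script
print(features.head())
```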
To process all talks:
```
./processAllTedData.sh data/raw data/compiled
```
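`processAllTedData.sh` is the supported way to run the whole corpus. The sketch below only illustrates the idea of such a batch run, assuming the directory layout from the single-talk example above (`txt-sent/`, `derived/segs/f0/`, `derived/segs/i0/`); it is not the script itself.

```python
# Illustrative sketch of a batch run over all talks (NOT processAllTedData.sh).
# Assumes the directory layout from the single-talk example and calls
# tedDataToPickle.py once per talk.
import glob
import os
import subprocess

raw_dir = "data/raw"
out_dir = "data/compiled"
os.makedirs(out_dir, exist_ok=True)

for word_file in sorted(glob.glob(os.path.join(raw_dir, "txt-sent", "*.word.txt"))):
    talk_id = os.path.basename(word_file).split(".")[0]   # e.g. "0001"
    subprocess.run([
        "python", "tedDataToPickle.py",
        "-w", word_file,
        "-l", word_file + ".norm.align",
        "-f", os.path.join(raw_dir, "derived/segs/f0", talk_id + ".aggs.alignword.txt"),
        "-i", os.path.join(raw_dir, "derived/segs/i0", talk_id + ".aggs.alignword.txt"),
        "-o", os.path.join(out_dir, talk_id + ".csv"),
    ], check=True)
```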
### Obtaining a punkProse-processable corpus
To collect samples from talks into one corpus partitioned into training/development/testing sets:
```
python corpusMaker.py -i data/compiled/ -o data/corpus -r 0.7 -v 2 -l 50
```
(Training and development sets are sampled into sequences of size 50 (`-l`). The training set constitutes 0.7 (`-r`) of all data. The word vocabulary is created with a minimum word occurrence of 2 (`-v`).)
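The following is a minimal sketch of the split and vocabulary logic that these flags describe. It is not `corpusMaker.py` itself; it assumes each compiled talk CSV has a `word` column (a hypothetical column name) and treats the talks as a flat word stream.

```python
# Minimal sketch of the split/vocabulary logic described by the flags above.
# Assumes each compiled talk CSV has a "word" column (hypothetical name).
import glob
from collections import Counter

import pandas as pd

SEQ_LEN = 50       # -l: sequence length for training/development samples
TRAIN_RATIO = 0.7  # -r: fraction of all data used for training
MIN_COUNT = 2      # -v: minimum occurrence for a word to enter the vocabulary

words = []
for path in sorted(glob.glob("data/compiled/*.csv")):
    words.extend(pd.read_csv(path)["word"].astype(str).tolist())

# Partition the word stream: first 70% train, remainder split into dev/test.
n_train = int(len(words) * TRAIN_RATIO)
train, rest = words[:n_train], words[n_train:]
dev, test = rest[: len(rest) // 2], rest[len(rest) // 2 :]

# Cut the train/dev streams into fixed-length sequences of SEQ_LEN words.
train_seqs = [train[i:i + SEQ_LEN] for i in range(0, len(train), SEQ_LEN)]
dev_seqs = [dev[i:i + SEQ_LEN] for i in range(0, len(dev), SEQ_LEN)]

# Build the word vocabulary from the training set only, dropping rare words.
counts = Counter(train)
vocab = {w for w, c in counts.items() if c >= MIN_COUNT}
print(len(train_seqs), len(dev_seqs), len(test), len(vocab))
```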