# ted_preprocess

Preprocess the prosodically annotated TED corpus. Annotations of talks are prepared using: https://github.com/laic/prosody

## Processing a single talk with `tedDataToPickle.py`

### Input files

1. `.word.txt` (command-line argument `-w`)
2. `.word.txt.norm.align` (command-line argument `-l`)
3. `.aggs.alignword.txt` for fundamental frequency and intensity (command-line arguments `-f` and `-i`)

### Output file

1. CSV file with word-aligned features (command-line argument `-o`)

Sample run:

    python tedDataToPickle.py -w data/raw/txt-sent/0001.word.txt -l data/raw/txt-sent/0001.word.txt.norm.align -f data/raw/derived/segs/f0/0001.aggs.alignword.txt -i data/raw/derived/segs/i0/0001.aggs.alignword.txt -o 0001.csv
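
The resulting CSV can be inspected directly, for example with pandas. This is only a convenience sketch for checking the output; the exact columns depend on what `tedDataToPickle.py` writes.

```python
# Quick look at the word-aligned feature CSV produced by the sample run above.
# Assumes the file has a header row; the columns are whatever tedDataToPickle.py emits.
import pandas as pd

features = pd.read_csv("0001.csv")
print(features.shape)              # roughly: rows = words, columns = features
print(features.columns.tolist())   # feature names written by the script
print(features.head())
```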

## Batch processing talks

    ./processAllTedData.sh data/raw data/compiled
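
For reference, the loop below is a hypothetical Python equivalent of the batch step, assuming the directory layout shown in the sample run above; `processAllTedData.sh` itself is the authoritative version.

```python
# Hypothetical re-implementation of the batch step: run tedDataToPickle.py for
# every talk found under data/raw, using the layout from the sample run above.
import glob
import os
import subprocess

raw_dir, out_dir = "data/raw", "data/compiled"
os.makedirs(out_dir, exist_ok=True)

for word_file in sorted(glob.glob(os.path.join(raw_dir, "txt-sent", "*.word.txt"))):
    talk_id = os.path.basename(word_file).split(".")[0]   # e.g. "0001"
    subprocess.run([
        "python", "tedDataToPickle.py",
        "-w", word_file,
        "-l", word_file + ".norm.align",
        "-f", os.path.join(raw_dir, "derived/segs/f0", talk_id + ".aggs.alignword.txt"),
        "-i", os.path.join(raw_dir, "derived/segs/i0", talk_id + ".aggs.alignword.txt"),
        "-o", os.path.join(out_dir, talk_id + ".csv"),
    ], check=True)
```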

## Obtaining a punkProse-processable corpus

To collect samples from talks into one corpus partitioned into training/development/testing sets:

    python corpusMaker.py -i data/compiled/ -o data/corpus -r 0.7 -v 2 -l 50

Training and development sets are sampled into sequences of length 50 (`-l`). The training set constitutes 0.7 (`-r`) of all data. The word vocabulary is built with a minimum word occurrence of 2 (`-v`).
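
The sketch below only illustrates the kind of partitioning described above; it is not the actual logic of `corpusMaker.py`, and the even split of the non-training portion between development and test is an assumption made here.

```python
# Illustrative sketch only: split fixed-length word sequences into train / dev /
# test (train_ratio of the data goes to training; the assumed remainder is split
# evenly), and build a vocabulary of words seen at least min_occurrence times.
import random
from collections import Counter

def partition(sequences, train_ratio=0.7, seed=0):
    random.seed(seed)
    shuffled = sequences[:]
    random.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_dev = (len(shuffled) - n_train) // 2          # assumed even dev/test split
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

def build_vocabulary(train_sequences, min_occurrence=2):
    counts = Counter(word for seq in train_sequences for word in seq)
    return {word for word, count in counts.items() if count >= min_occurrence}
```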
