Processes to scrape cortsvalencianes.es and generate a speech dataset.
The scraping process currently works in steps, with a separate script launched for each. The results are currently saved to files rather than a database.
- `scrape_corts.py`: Scrapes the list of sessions from the search page and puts it in the `items.json` file.
- `download.py`: Uses the `items.json` information to choose the plenary sessions and download the relevant videos from the streaming source.
- `generate_diaris.py`: From the list of plenary sessions in `items.json`, generates the transcript links and saves them to `items_diaris.json`.
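For illustration, a minimal sketch of how `download.py` might select the plenary sessions from `items.json`; the record fields (`title`, `video_url`) and the `"Ple"` filter are hypothetical placeholders, since the actual schema is whatever `scrape_corts.py` scrapes:

```python
import json

# Load the session list produced by scrape_corts.py.
with open("items.json", encoding="utf-8") as f:
    sessions = json.load(f)

# Keep only plenary sessions. The field name and the "Ple" marker are
# illustrative assumptions, not the real scraped schema.
plenary = [s for s in sessions if "Ple" in s.get("title", "")]

for session in plenary:
    print(session.get("video_url"))
```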
Install the requirements via virtualenv:

```
virtualenv --python=python venv
source venv/bin/activate
pip install -r requirements.txt  # assumes a requirements.txt at the repo root
```
Currently the scripts are launched one after the other without any parameters. The only requirement is that `scrape_corts.py` runs first, since the other two scripts read the `items.json` it produces.
```
python scrape_corts.py
python download.py
python generate_diaris.py
```
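The three steps could also be chained in a small runner; this is a minimal sketch, not part of the repo, and it assumes each script exits with a non-zero status on failure:

```python
import subprocess

# Run the pipeline steps in order; scrape_corts.py must come first
# because the other two scripts read the items.json it produces.
for script in ("scrape_corts.py", "download.py", "generate_diaris.py"):
    subprocess.run(["python", script], check=True)
```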
- [ ] fix `download.py` for continuous download
- [ ] fix `generate_diaris.py`
- [ ] match diaris with videos
- [ ] scrape speakers and timestamps for each intervention in a session
- [ ] write a diari parser in (speaker, intervention) format (see the sketch after this list)
- [ ] structure the output for the `long-audio-alignment` format
- [ ] persist the results in a db
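As a rough target for the diari parser item above, a hypothetical example of the (speaker, intervention) output; the field names, sample values, and output filename are placeholders, not a fixed schema:

```python
import json

# Hypothetical shape of one parsed session: an ordered list of
# interventions, each tagged with its speaker. The start/end timestamps
# (seconds) would be filled in later by the alignment step.
parsed_session = [
    {
        "speaker": "President de les Corts",
        "text": "Es reprén la sessió.",
        "start": None,  # to be filled by long-audio-alignment
        "end": None,
    },
]

with open("session_0001_diari.json", "w", encoding="utf-8") as f:
    json.dump(parsed_session, f, ensure_ascii=False, indent=2)
```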