Script(s) for the Data Rescue Boulder event
- Assumes you have virtualenv installed. If not, follow the instructions here.
- Assumes you have git or Git Shell installed (Windows/Mac Git setup instructions).
- These scripts have only been tested on a Mac. The use of virtualenv and pip should result in the same behavior on Windows/Linux systems, but you might need to tweak the code. Very open to suggestions and/or Pull Requests!
- For Linux distributions where python2 and python3 coexist, use the python2 version.
- In your terminal/iTerm (Mac/Unix) or Command Line/Git Shell (Windows), clone the repo and create a python virtualenv:

git clone https://github.com/rchakra3/dr-boulder.git
cd dr-boulder
virtualenv env
- Activate your virtualenv:
  - Mac/Unix: source env/bin/activate
  - Windows: .\env\Scripts\activate

**You should now see (env) in your terminal/command prompt before your folder structure**
- Download all the requirements:

pip install -r requirements.txt

That should have everything set up in your virtualenv.
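If your system has both python2 and python3 (see the note above), one way to check which interpreter the virtualenv picked up is:

python -c "import sys; print(sys.version); print(sys.prefix)"

sys.prefix should point into the env folder when the virtualenv is active.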
There are currently 3 important scripts in this repository:
**1. Generate a list of all the files available on an FTP server**

- To run:

python -W ignore ftp_utils/get_all_files_from_ftp_server.py --server=<server domain name or IP> --output_file=<output file name>

- This will generate a list of all the files that are available for download at a particular domain.
- The name/IP of the server is required. If the output file is not specified, it will write to ftp_files.txt.
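For context, here is a minimal, hypothetical sketch of how such an FTP crawl can be done with Python's ftplib. It is not the repository's actual implementation, and the server name is a placeholder:

```python
# Hypothetical sketch of recursively listing files on an anonymous FTP
# server with ftplib (not the repo's actual get_all_files_from_ftp_server.py).
from ftplib import FTP, error_perm

def collect_ftp_files(host, path="/"):
    """Walk an FTP tree and return the full paths of every file found."""
    ftp = FTP(host)
    ftp.login()                      # anonymous login
    files = []
    _walk(ftp, path, files)
    ftp.quit()
    return files

def _walk(ftp, path, files):
    ftp.cwd(path)
    for name in ftp.nlst():
        if name in (".", ".."):
            continue
        child = path.rstrip("/") + "/" + name
        try:
            ftp.cwd(child)           # only succeeds for directories
            _walk(ftp, child, files)
        except error_perm:
            files.append(child)      # not a directory, so record it as a file

if __name__ == "__main__":
    # "ftp.example.gov" is a placeholder server name
    for f in collect_ftp_files("ftp.example.gov"):
        print(f)
```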
**2. Download all the URLs listed in a file [NON FTP]**

- This helps download a huge list of URLs (pdfs, json, xmls, etc.)
- Put the list of URLs in a file, one URL per line
- Help:

python download_data.py -h

- To run:

python -W ignore download_data.py --filename=<file containing the URL list> --max_space=<max disk space to use (defaults to 5GB)> --downloads_folder=<name of folder where you want to store the data>
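As a rough illustration of what this downloader does (not its exact logic), the sketch below streams each URL to disk and stops once a space budget is exceeded. The file and folder names are placeholders, and it assumes the requests library is installed:

```python
# Rough sketch of downloading a list of URLs while respecting a disk space
# budget (illustrative only; download_data.py's real logic differs).
from __future__ import print_function
import os
import requests   # assumed to be available in the virtualenv

MAX_BYTES = 5 * 1024 ** 3            # mirrors the 5GB default of --max_space
URL_LIST = "urls.txt"                # placeholder: one URL per line
DOWNLOADS = "downloads"              # placeholder folder

used = 0
if not os.path.isdir(DOWNLOADS):
    os.makedirs(DOWNLOADS)

with open(URL_LIST) as fh:
    for url in (line.strip() for line in fh):
        if not url:
            continue
        dest = os.path.join(DOWNLOADS, url.rstrip("/").split("/")[-1])
        try:
            resp = requests.get(url, stream=True, timeout=30)
            resp.raise_for_status()
        except requests.RequestException as exc:
            print("skipping", url, exc)   # e.g. 404s, timeouts
            continue
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=8192):
                out.write(chunk)
                used += len(chunk)
                if used > MAX_BYTES:
                    raise SystemExit("space budget exhausted")
```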
**3. Download a list of files at FTP endpoints**

- Same as the previous script, but for FTP files
- You can use the file generated by the ftp_utils/get_all_files_from_ftp_server.py script as the input file, or create a new file with one FTP file per line
- FTP downloads seem to be much slower in general, so it's recommended to run the script over a small number of files at a time
- Help:

python ftp_utils/download_ftp_files.py -h

- Run:

python -W ignore ftp_utils/download_ftp_files.py --filename=<file containing the FTP file list> --downloads_folder=<folder where you want to save the files>
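For reference, a minimal, hypothetical sketch of fetching one ftp:// URL per line with Python's ftplib might look like the following. The input file name matches the default output of the first script, the downloads folder is a placeholder, and download_ftp_files.py itself may work differently:

```python
# Hypothetical sketch of downloading ftp:// URLs listed one per line
# (illustrative only; not the repo's download_ftp_files.py).
import os
from ftplib import FTP
try:
    from urllib.parse import urlparse    # Python 3
except ImportError:
    from urlparse import urlparse        # Python 2

FTP_LIST = "ftp_files.txt"               # default output of the first script
DOWNLOADS = "ftp_downloads"              # placeholder folder

if not os.path.isdir(DOWNLOADS):
    os.makedirs(DOWNLOADS)

with open(FTP_LIST) as fh:
    for line in fh:
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)            # e.g. ftp://host/path/to/file.pdf
        dest = os.path.join(DOWNLOADS, os.path.basename(parsed.path))
        ftp = FTP(parsed.netloc)
        ftp.login()                       # anonymous login
        with open(dest, "wb") as out:
            ftp.retrbinary("RETR " + parsed.path, out.write)
        ftp.quit()
```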
So far there's only one site-specific script, for edg.epa.gov/data/public:

cd edg_epa_data_public
python -W ignore find_data_edg_epa.py

- This script will generate 3 files:
  - edg_epa_file_list.txt: the list of all the files that aren't ftp://
  - edg_epa_ftp_file_list.txt: the list of all the files that are ftp://
  - edg_epa_skipped_file_list.txt: the list of files that weren't downloaded for various reasons, including running out of disk space, exceeding the space limit specified, and 404s
- Use the scripts described above to download the URLs in the edg_epa_file_list.txt and edg_epa_ftp_file_list.txt lists.
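For example, assuming the list files are written to the edg_epa_data_public folder and you run from the repository root, the downloads could be kicked off with something like (the downloads folder names here are just examples):

python -W ignore download_data.py --filename=edg_epa_data_public/edg_epa_file_list.txt --downloads_folder=edg_epa_downloads
python -W ignore ftp_utils/download_ftp_files.py --filename=edg_epa_data_public/edg_epa_ftp_file_list.txt --downloads_folder=edg_epa_ftp_downloads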