Script(s) for the Data Rescue Boulder event
- Assumes you have virtualenv installed. If not, follow the instructions here.
- Assumes you have git or Git Shell installed (Windows/Mac Git setup instructions).
- These scripts have only been tested on a Mac. The use of virtualenv and pip should result in the same behavior on Windows/Linux systems, but you might need to tweak the code. Very open to suggestions and/or Pull Requests!
- For Linux distributions where python2 and python3 coexist, use the python2 version.
- In your terminal/iTerm (Mac/Unix) or Command Line/Git Shell (Windows), clone the repo and create a python virtualenv:

git clone https://github.com/rchakra3/dr-boulder.git
cd dr-boulder
virtualenv env
- Activate your virtualenv:
  - Mac/Unix: source env/bin/activate
  - Windows: .\env\Scripts\activate

**You should now see (env) in your terminal/command prompt before your folder structure**
- Download all the requirements:

pip install -r requirements.txt

That should have everything set up in your virtualenv.
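If your system has both python2 and python3 (see the note above), one way to check which interpreter the virtualenv picked up is:

python -c "import sys; print(sys.version); print(sys.prefix)"

sys.prefix should point into the env folder when the virtualenv is active.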
There are currently 3 important scripts in this repository:
**1. Generate a list of all the files available on an FTP server**

- To run:

python -W ignore ftp_utils/get_all_files_from_ftp_server.py --server=<server domain name or IP> --output_file=<output file name>

- This will generate a list of all the files that are available for download at a particular domain.
- The name/IP of the server is required. If the output file is not specified, it will write to ftp_files.txt.
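For context, here is a minimal, hypothetical sketch of how such an FTP crawl can be done with Python's ftplib. It is not the repository's actual implementation, and the server name is a placeholder:

```python
# Hypothetical sketch of recursively listing files on an anonymous FTP
# server with ftplib (not the repo's actual get_all_files_from_ftp_server.py).
from ftplib import FTP, error_perm

def collect_ftp_files(host, path="/"):
    """Walk an FTP tree and return the full paths of every file found."""
    ftp = FTP(host)
    ftp.login()                      # anonymous login
    files = []
    _walk(ftp, path, files)
    ftp.quit()
    return files

def _walk(ftp, path, files):
    ftp.cwd(path)
    for name in ftp.nlst():
        if name in (".", ".."):
            continue
        child = path.rstrip("/") + "/" + name
        try:
            ftp.cwd(child)           # only succeeds for directories
            _walk(ftp, child, files)
        except error_perm:
            files.append(child)      # not a directory, so record it as a file

if __name__ == "__main__":
    # "ftp.example.gov" is a placeholder server name
    for f in collect_ftp_files("ftp.example.gov"):
        print(f)
```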
**2. Download all the URLs listed in a file [NON FTP]**

- This helps download a huge list of URLs (pdfs, json, xmls, etc.)
- Put the list of URLs in a file, one URL per line
- Help:

python download_data.py -h

- To run:

python -W ignore download_data.py --filename=<file containing the URL list> --max_space=<max disk space to use (defaults to 5GB)> --downloads_folder=<name of folder where you want to store the data>
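As a rough illustration of what this downloader does (not its exact logic), the sketch below streams each URL to disk and stops once a space budget is exceeded. The file and folder names are placeholders, and it assumes the requests library is installed:

```python
# Rough sketch of downloading a list of URLs while respecting a disk space
# budget (illustrative only; download_data.py's real logic differs).
from __future__ import print_function
import os
import requests   # assumed to be available in the virtualenv

MAX_BYTES = 5 * 1024 ** 3            # mirrors the 5GB default of --max_space
URL_LIST = "urls.txt"                # placeholder: one URL per line
DOWNLOADS = "downloads"              # placeholder folder

used = 0
if not os.path.isdir(DOWNLOADS):
    os.makedirs(DOWNLOADS)

with open(URL_LIST) as fh:
    for url in (line.strip() for line in fh):
        if not url:
            continue
        dest = os.path.join(DOWNLOADS, url.rstrip("/").split("/")[-1])
        try:
            resp = requests.get(url, stream=True, timeout=30)
            resp.raise_for_status()
        except requests.RequestException as exc:
            print("skipping", url, exc)   # e.g. 404s, timeouts
            continue
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=8192):
                out.write(chunk)
                used += len(chunk)
                if used > MAX_BYTES:
                    raise SystemExit("space budget exhausted")
```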
**3. Download a list of files at FTP endpoints**

- Same as the previous script, but for FTP files
- You can use the file generated by the ftp_utils/get_all_files_from_ftp_server.py script as the input file, or create a new file with one FTP file per line
- FTP downloads seem to be much slower in general, so it's recommended to run the script over a small number of files at a time
- Help:

python ftp_utils/download_ftp_files.py -h

- Run:

python -W ignore ftp_utils/download_ftp_files.py --filename=<file containing the FTP file list> --downloads_folder=<folder where you want to save the files>
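For reference, a minimal, hypothetical sketch of fetching one ftp:// URL per line with Python's ftplib might look like the following. The input file name matches the default output of the first script, the downloads folder is a placeholder, and download_ftp_files.py itself may work differently:

```python
# Hypothetical sketch of downloading ftp:// URLs listed one per line
# (illustrative only; not the repo's download_ftp_files.py).
import os
from ftplib import FTP
try:
    from urllib.parse import urlparse    # Python 3
except ImportError:
    from urlparse import urlparse        # Python 2

FTP_LIST = "ftp_files.txt"               # default output of the first script
DOWNLOADS = "ftp_downloads"              # placeholder folder

if not os.path.isdir(DOWNLOADS):
    os.makedirs(DOWNLOADS)

with open(FTP_LIST) as fh:
    for line in fh:
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)            # e.g. ftp://host/path/to/file.pdf
        dest = os.path.join(DOWNLOADS, os.path.basename(parsed.path))
        ftp = FTP(parsed.netloc)
        ftp.login()                       # anonymous login
        with open(dest, "wb") as out:
            ftp.retrbinary("RETR " + parsed.path, out.write)
        ftp.quit()
```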
So far there's only one site-specific script, for edg.epa.gov/data/public:

cd edg_epa_data_public
python -W ignore find_data_edg_epa.py

- This script will generate 3 files:
  - edg_epa_file_list.txt: the list of all the files that aren't ftp://
  - edg_epa_ftp_file_list.txt: the list of all the files that are ftp://
  - edg_epa_skipped_file_list.txt: the list of files that weren't downloaded for various reasons, including running out of disk space, exceeding the space limit specified, and 404s
- Use the scripts described above to download the URLs in the edg_epa_file_list.txt and edg_epa_ftp_file_list.txt lists.
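For example, assuming the list files are written to the edg_epa_data_public folder and you run from the repository root, the downloads could be kicked off with something like (the downloads folder names here are just examples):

python -W ignore download_data.py --filename=edg_epa_data_public/edg_epa_file_list.txt --downloads_folder=edg_epa_downloads
python -W ignore ftp_utils/download_ftp_files.py --filename=edg_epa_data_public/edg_epa_ftp_file_list.txt --downloads_folder=edg_epa_ftp_downloads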