Fuel has a growing number of built-in datasets that simplify working on standard benchmark datasets, such as MNIST or CIFAR10.
These datasets are defined in the fuel.datasets
module. Some user
intervention is needed before they're used for the first time: a given
dataset has to be downloaded and converted into a format that is recognized by
its corresponding dataset class. Fortunately, Fuel also has built-in tools
to automate these operations.
In order for Fuel to know where to look for its data, the data_path
configuration variable has to be set inside ~/.fuelrc
. It's expected to be
a sequence of paths separated by an OS-specific delimiter (:
for Linux and
OSX, ;
for Windows):
# ~/.fuelrc
data_path: "/first/path/to/my/data:/second/path/to/my/data"
When looking for a specific file (e.g. mnist.hdf5
), Fuel will search each of
these paths in sequence, using the first matching file that it finds.
This configuration variable can be overridden by setting the FUEL_DATA_PATH
environment variable:
$ export FUEL_DATA_PATH="/first/path/to/my/data:/second/path/to/my/data"
Let's now change directory for the rest of this tutorial:
$ cd $FUEL_DATA_PATH
We're going to download the raw data files for the MNIST dataset with the
fuel-download
script that was installed with Fuel:
$ fuel-download mnist
The script is pretty simple: you call it and pass it the name of the dataset
you'd like to download. In order to know which datasets are available to
download via fuel-download
, type
$ fuel-download -h
You can pass dataset-specific arguments to the script. In order to know which
arguments are accepted, append -h
to your dataset choice:
fuel-download mnist -h
Two arguments are always accepted:
-d DIRECTORY
: define where the dataset files will be downloaded. By default,fuel-download
uses the current working directory.--clear
: delete the dataset files instead of downloading them, if they exist.
You should now have four new files in your directory:
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz
Those are the original files that can be downloaded off Yann Lecun's website.
We now need to convert those files into a format that the MNIST
dataset
class will recognize. This is done through the fuel-convert
script:
$ fuel-convert mnist
This will generate an mnist.hdf5
file in your directory, which the
MNIST
class recognizes.
Once again, the script accepts dataset-specific arguments which you can discover
by appending -h
to your dataset choice:
fuel-convert mnist -h
Two arguments are always accepted:
-d DIRECTORY
: wherefuel-convert
should look for the input files.-o OUTPUT_FILE
: where to save the converted dataset.
Let's delete the raw input files, as we don't need them anymore:
$ fuel-download mnist --clear
Six months from now, you may have a bunch of dataset files lying on disk, each
with slight differences that you can't identify or reproduce. At that time,
you'll be glad that fuel-info
exists.
When a dataset is generated through fuel-convert
, the script tags it with
what command was issued to generate the file and what were the versions of
relevant parts of the library at that time.
You can inspect this metadata calling fuel-info
and passing an HDF5 file as
argument:
$ fuel-info mnist.hdf5
Metadata for mnist.hdf5
=======================
The command used to generate this file is
fuel-convert mnist
Relevant versions are
H5PYDataset 0.1
fuel.converters 0.1
By default, Fuel looks for downloaders and converters in the
fuel.downloaders
and fuel.converters
modules, respectively, but you're
not limited to that.
Fuel can be told to look into additional modules by setting the
extra_downloaders
and extra_converters
configuration variables in
~/.fuelrc
. These variables are expected to be lists of module names.
For instance, suppose you'd like to include the following modules:
package1.extra_downloaders
package2.extra_downloaders
package1.extra_converters
package2.extra_converters
You should include the following in your ~/.fuelrc
:
# ~/.fuelrc
extra_downloaders:
- package1.extra_downloaders
- package2.extra_downloaders
extra_converters:
- package1.extra_converters
- package2.extra_converters
These configuration variables can be overridden through the
FUEL_EXTRA_DOWNLOADERS
and FUEL_EXTRA_CONVERTERS
environment variables,
which are expected to be strings of space-separated module names, like so:
export FUEL_EXTRA_DOWNLOADERS="package1.extra_downloaders package2.extra_downloaders"
export FUEL_EXTRA_CONVERTERS="package1.extra_converters package2.extra_converters"
This feature lets external developers define their own Fuel dataset downloader/converter packages, and also makes working with private datasets more straightforward.