MNIST dataset and Docker containers for Getting Started / Use Case documents and Katacoda scenarios #2318
Labels: A: docs, type: enhancement
In our discussions with @shcheklein, he emphasized the importance of a stable and standard dataset for the whole documentation.
The example project uses a subset of the Stack Overflow question tagging dataset. That data is still being updated and is difficult to present as a downloadable asset. The example project uses Random Forests for classification. In the Katacoda scenarios the dataset is trimmed to the first 12000 records due to RAM limitations, and that amount of data is not adequate for a meaningful demonstration of pipeline parameters. For example, increasing the feature size, n_grams, or the number of predictors may or may not move the accuracy out of the 0.41-0.46 range in the Katacoda environment.
Also, reproducing the whole (non-trimmed) version requires at least 8GB of RAM. Although this is a modest requirement for Deep Learning workflows, aiming for 4GB seems more sensible, since the example may be run in a virtual environment for quick assessment.
We also have a use case based on Chollet's cats and dogs tutorial. It uses an older version of Keras. Although it works on Katacoda, a single `python train.py` run takes around 30 minutes. This is probably due to the feature generation part and having separate files for each image, and it can probably be engineered to work faster.

For the experimentation features in DVC 2.0, @dberenbaum has created several showcases in https://github.com/iterative/dvc-checkpoints-mnist. These use MNIST with PyTorch. I tested them on Katacoda without success, most probably due to PyTorch's memory requirements.
Yesterday I tested the TensorFlow MNIST example in Katacoda; it runs quickly and reaches 0.97 accuracy. It's not a very advanced model (it has a single hidden Dense/128 layer), but I think it can be extended by adding two CNN layers, plus a few parameters that control them, to improve performance.
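For reference, a model of the kind described above might look like the following sketch. It assumes the `tf.keras` API and MNIST's 28x28 grayscale inputs with 10 classes. The function names, the `hidden_units` parameter, and the optimizer and loss choices are illustrative assumptions, not taken from the example that was actually tested.

```python
def dense_params(n_in, n_out):
    """Number of trainable weights in a fully connected layer (weights + biases)."""
    return n_in * n_out + n_out


def build_model(hidden_units=128):
    """Build a single-hidden-layer MNIST classifier (illustrative sketch).

    TensorFlow is imported lazily so that the helper above can be used
    without it installed.
    """
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),          # one grayscale MNIST image
        tf.keras.layers.Flatten(),               # 28 * 28 = 784 features
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(10),               # one logit per digit class
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model


if __name__ == "__main__":
    # Size of the default model, computed without TensorFlow.
    total = dense_params(28 * 28, 128) + dense_params(128, 10)
    print(f"trainable parameters: {total}")
```

A value such as `hidden_units` is the kind of setting that could be exposed as a pipeline parameter, so that changing it has a visible effect on accuracy.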
What I propose is something like this:
- A standard dataset based on MNIST to replace the `data.xml` files. This can be a copy of the TF MNIST dataset and can have multiple versions to simulate changing data. A corresponding update to models, training, evaluation, parameters, etc. is necessary as well. These models may be modest and open to improvement, with a sensible initial performance, as our goal is to show how DVC can be used for this kind of problem.
- For each of the GS/UC documents, we can create a Docker container. These can be run on the user's machine with a simple `docker run -it dvc/get-started-versioning` and have all the code, data, requirements, and artifacts needed to run them identically to the document's version on the site.
- These containers can be run in Katacoda as well. Currently, each Katacoda environment has a custom startup script, which is a maintenance burden. (Most of them weren't even starting up until a few weeks ago.) These startup scripts may be replaced with `docker run` commands as well.
- These containers can be used to replay commands in docs (with a tool like rundoc) and check for changes in output or data.
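
To make the last point concrete, here is a minimal sketch of what replaying documented commands could involve. It is not how rundoc works. It assumes a console-style transcript in which commands start with `$ ` and every other line is recorded output, and the function names `parse_session` and `replay` are hypothetical.

```python
import subprocess


def parse_session(text):
    """Split a console transcript into (command, expected_output) pairs.

    Assumption: lines starting with "$ " are commands and all other
    lines are the recorded output of the preceding command.
    """
    pairs = []
    for line in text.splitlines():
        if line.startswith("$ "):
            pairs.append([line[2:], []])
        elif pairs:
            pairs[-1][1].append(line)
    return [(cmd, "\n".join(out)) for cmd, out in pairs]


def replay(text):
    """Run each command and collect those whose stdout differs from the record."""
    mismatches = []
    for cmd, expected in parse_session(text):
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        actual = result.stdout.rstrip("\n")
        if actual != expected.rstrip("\n"):
            mismatches.append((cmd, expected, actual))
    return mismatches


if __name__ == "__main__":
    doc = "$ echo hello\nhello\n$ echo changed\nold output\n"
    for cmd, expected, actual in replay(doc):
        print(f"output changed for {cmd!r}: expected {expected!r}, got {actual!r}")
```

Run inside one of the proposed containers, a check like this would flag document snippets whose output no longer matches what the commands produce.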
This issue is mainly for discussing these bullet points. Thanks.
@shcheklein @dberenbaum @jorgeorpinel @dmpetrov