MNIST dataset and Docker containers for Getting Started / Use Case documents and Katacoda scenarios #2318

Closed
iesahin opened this issue Mar 19, 2021 · 5 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions


@iesahin
Contributor

iesahin commented Mar 19, 2021

In our discussions with @shcheklein, he emphasized the importance of a stable and standard dataset for the whole documentation.

The example project uses a subset of the Stack Overflow question tagging dataset. The data is still being updated and is difficult to present as a downloadable asset. The example project uses Random Forests for classification. In Katacoda scenarios the dataset is trimmed to the first 12000 records due to RAM limitations, and that amount of data is not adequate for a meaningful demonstration of pipeline parameters. For example, increasing the feature size, n_grams, or predictors may or may not move the accuracy, which stays in the 0.41-0.46 range in the Katacoda environment.

Also, reproducing the whole (non-trimmed) version requires at least 8GB of RAM. Although this is a modest requirement for Deep Learning workflows, aiming for 4GB seems rather sensible, considering the example may be run in a virtual environment for quick assessment.

We also have a use case based on Chollet's cats and dogs tutorial. It uses an older version of Keras. Although it works on Katacoda, a single python train.py run takes around 30 minutes. This is probably due to the feature generation part and having separate files for each image. This can probably be engineered to work faster.

For the experimentation features in DVC 2.0, @dberenbaum has created several showcases (https://github.com/iterative/dvc-checkpoints-mnist) that use MNIST with PyTorch. I tested them on Katacoda without success, most probably due to PyTorch's memory requirements.

Yesterday I tested the TensorFlow MNIST example in Katacoda; it runs quickly and reaches 0.97 accuracy. It's not a very advanced model, with a single hidden Dense/128 layer, but I think it can be modified by adding two more CNN layers, plus a few parameters to tune them, to improve performance.

What I propose is something like this:

  • A standard dataset based on MNIST 3 to replace data.xml files. This can be a copy of the TF MNIST dataset and can have multiple versions to simulate the changing data. A corresponding update in models, training, evaluation, parameters, etc. is necessary as well. These models may be modest and open to improvement with a sensible initial performance as our goal is to show how DVC can be used for this kind of problem.

  • For each of the GS/UC documents, we can create a Docker container. Each can be run on the user's machine with a simple docker run -it dvc/get-started-versioning and has all the code, data, requirements, and artifacts needed to run identically to the document's version on the site.

  • These containers can be run in Katacoda as well. Currently, Katacoda environments each have custom startup scripts. This is a maintenance burden. (Most of them weren't even starting up until a few weeks ago.) These startup scripts may be replaced with docker run commands as well.

  • These containers can be used to replay commands in docs (with a tool like rundoc) and check the changes in output or data.
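
A minimal sketch of the replay idea in the last bullet, assuming each documented step is stored as a command plus the output recorded in the docs. The `ReplayResult` and `replay` names are hypothetical, and the sample steps use the Python interpreter as a stand-in for the real commands, which would be run inside the proposed containers.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ReplayResult:
    """Outcome of replaying one documented command (hypothetical helper type)."""
    command: list
    expected: str
    actual: str

    @property
    def matches(self) -> bool:
        # Ignore leading/trailing whitespace so a trailing newline is not a diff.
        return self.expected.strip() == self.actual.strip()


def replay(steps):
    """Run each (command, expected_output) pair and record what was printed."""
    results = []
    for command, expected in steps:
        completed = subprocess.run(
            command, capture_output=True, text=True, check=False
        )
        results.append(ReplayResult(command, expected, completed.stdout))
    return results


# Stand-in steps: the second one simulates output that drifted from the docs.
PY = sys.executable
steps = [
    ([PY, "-c", "print('accuracy: 0.97')"], "accuracy: 0.97"),
    ([PY, "-c", "print('accuracy: 0.95')"], "accuracy: 0.97"),
]

results = replay(steps)
changed = [r for r in results if not r.matches]
for r in changed:
    print("output changed:", r.expected.strip(), "->", r.actual.strip())
```

Exact string comparison is the simplest possible check; outputs that contain timings or hashes would probably need a looser comparison.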

This issue is mainly for discussing these bullet points. Thanks.

@shcheklein @dberenbaum @jorgeorpinel @dmpetrov

@dberenbaum
Contributor

Nice synopsis, @iesahin! Before giving my substantive thoughts, what do you think about separating the dataset and the Docker proposals into separate issues?

@iesahin
Contributor Author

iesahin commented Mar 20, 2021

> Before giving my substantive thoughts, what do you think about separating the dataset and the Docker proposals into separate issues?

I'll do that, thanks @dberenbaum.

@shcheklein shcheklein added A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions labels Apr 3, 2021
@iesahin
Contributor Author

iesahin commented Apr 5, 2021

I have created the repository that contains Dockerfiles for Katacoda scenarios here.

Currently it has a script to build and push all containers.

We can discuss the naming convention for the containers in a separate issue, #2354.

I also built a markdown code runner after testing several other tools with the (non-standard) Katacoda .md files. Executable Katacoda code blocks have {{execute}} as a suffix, and other tools fail to recognize these.
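
To illustrate the parsing problem, here is a rough sketch of how such blocks could be picked out. It is not the actual tool's code. It assumes an executable block is a fenced block whose closing fence is immediately followed by {{execute}}; `executable_commands` and the sample text are made up for the example.

```python
import re

# Built programmatically so the example can itself live inside a fenced block.
FENCE = "`" * 3

# Assumption: the {{execute}} suffix sits directly after the closing fence.
BLOCK_RE = re.compile(
    "^" + FENCE + r"[^\n]*\n(?P<body>.*?)^" + FENCE
    + r"(?P<suffix>\{\{execute\}\})?[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def executable_commands(markdown):
    """Return the non-empty lines of every block marked with {{execute}}."""
    commands = []
    for match in BLOCK_RE.finditer(markdown):
        if match.group("suffix"):
            commands.extend(
                line for line in match.group("body").splitlines() if line.strip()
            )
    return commands


# Made-up scenario text: one executable block, one plain output block.
sample = "\n".join([
    "Pull the data:",
    "",
    FENCE,
    "dvc pull",
    FENCE + "{{execute}}",
    "",
    "Sample output (not executable):",
    "",
    FENCE,
    "(sample output text)",
    FENCE,
])

commands = executable_commands(sample)
print(commands)
```

Plain fenced blocks are matched as well but skipped, so output listings in the same file are never run as commands.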

Currently I'm testing it with the Katacoda and documentation .md files. I'll transfer it to iterative when a 0.1 version seems appropriate.

BTW, I'm looking for a better name for this tool, any ideas are welcome 😄

@shcheklein @dberenbaum @jorgeorpinel

@iesahin
Contributor Author

iesahin commented Apr 6, 2021

I have transferred Markdown Code Runner to iterative.

@iesahin
Contributor Author

iesahin commented Apr 24, 2021

I have merged the new scenario and closed this issue. You can replay and review the scenario at https://katacoda.com/dvc/courses/get-started/experiments

@dberenbaum @shcheklein @jorgeorpinel

@iesahin iesahin closed this as completed Apr 24, 2021