chore: use --no-cache-dir flag to pip in dockerfiles, to save space #11352

Rajpratik71

Using the "--no-cache-dir" flag with pip install makes sure that packages
downloaded by pip are not cached on the system. This is a best practice that
ensures packages are fetched from the repo instead of from a local cache.
Further, in the case of Docker containers, restricting caching reduces the
image size. In terms of stats, the saving depends on the number of Python
packages multiplied by their respective sizes, e.g. for heavy packages with
a lot of dependencies, not caching pip packages reduces the size a lot.

More detailed information can be found at

https://medium.com/sciforce/strategies-of-docker-images-optimization-2ca9cc5719b6

Signed-off-by: Pratik Raj [email protected]
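
As a minimal sketch of the change described above (the base image and the package name are placeholders, not taken from the Airflow Dockerfile), the flag is passed on each pip call:

```dockerfile
# Hypothetical base image and package, for illustration only.
FROM python:3.8-slim

# --no-cache-dir tells pip not to keep downloaded or built packages in its
# cache directory, so they do not end up in this image layer.
RUN pip install --no-cache-dir requests
```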

@boring-cyborg

boring-cyborg bot commented Oct 8, 2020

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, pylint and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@madison-ookla
Contributor

Perhaps it might be better to add the no cache dir environment variable, so all pip instructions automatically refrain from using the cache without having to explicitly add it to each call?

ENV PIP_NO_CACHE_DIR="1"
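
For illustration, a sketch of how that could look in a Dockerfile. The base image and the package are placeholders. pip reads its command-line options from `PIP_`-prefixed environment variables, so the variable applies to every later pip call in the image and in images built on top of it:

```dockerfile
# Hypothetical base image and package, for illustration only.
FROM python:3.8-slim

# Set once; equivalent to passing --no-cache-dir on each later pip call.
ENV PIP_NO_CACHE_DIR="1"

# No explicit flag needed here, or in images that use this one as their base.
RUN pip install requests
```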

@kaxil
Member

kaxil commented Oct 8, 2020

Perhaps it might be better to add the no cache dir environment variable, so all pip instructions automatically refrain from using the cache without having to explicitly add it to each call?

ENV PIP_NO_CACHE_DIR="1"

That's a good idea actually

@madison-ookla
Contributor

That's a good idea actually

It's what we use so we don't have to worry about adding that flag in all our dependent images 🙂

@kaxil
Member

kaxil commented Oct 8, 2020

That's a good idea actually

It's what we use so we don't have to worry about adding that flag in all our dependent images 🙂

:) @Rajpratik71 Can you update the PR to use the env var instead?

@Rajpratik71
Author

That's a good idea actually

It's what we use so we don't have to worry about adding that flag in all our dependent images 🙂

:) @Rajpratik71 Can you update the PR to use the env var instead?

On examining this, I noticed that it is a multi-stage Docker build, with a build image and the main image. All the dependencies are installed in the builder image, so this change is not needed: after the build, only the main image is used and pushed.

So there is no need for this PR.

Hence, closing.

@Rajpratik71 Rajpratik71 closed this Oct 8, 2020
@Rajpratik71
Author

Perhaps it might be better to add the no cache dir environment variable, so all pip instructions automatically refrain from using the cache without having to explicitly add it to each call?

ENV PIP_NO_CACHE_DIR="1"

As for this, old versions of pip have conflicts with it, which give the errors mentioned at pypa/pip/issues/5385 and pypa/pip/issues/5735.

It is fixed in the latest versions at

@potiuk
Member

potiuk commented Oct 8, 2020

On examining this, I noticed that it is a multi-stage Docker build, with a build image and the main image. All the dependencies are installed in the builder image, so this change is not needed: after the build, only the main image is used and pushed.

@Rajpratik71. Exactly. That is not a good idea. We have a multi-segmented build and the "pip install" step is done in the "build" segment. Then only the installed Python libraries from "${HOME}/.local" are copied to the final image using COPY --from. It's actually even better to leave the pip cache enabled, because it makes rebuilds of the image much faster.

In the build segment we run pip install twice: the first time to install the "current master" dependencies and then, when we build the image, with the actual dependencies from sources. This way we get faster rebuilds when setup.py changes, and we do not have to re-install everything from scratch when we iterate on the image (for example when we are running kubernetes tests). So removing the cache in this case is not a good idea at all.
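
A rough sketch of the layout described above, not the actual Airflow Dockerfile: the base image, stage names, the `/src` path and the stand-in package for the "current master" dependencies are all assumptions made for illustration.

```dockerfile
# "build" segment: every pip install happens here, with pip's cache left on.
FROM python:3.8-slim AS build

# First pass (hypothetical stand-in for the "current master" dependencies).
# With --user, packages are installed under ${HOME}/.local (/root/.local here).
RUN pip install --user apache-airflow

# Second pass: install from the local sources. Dependencies already installed
# by the first pass do not have to be installed again from scratch.
COPY . /src
RUN pip install --user /src

# Final image: only the installed libraries are copied over, so anything pip
# cached in the build segment is not part of this image.
FROM python:3.8-slim AS main
COPY --from=build /root/.local /root/.local
ENV PATH="/root/.local/bin:${PATH}"
```

Because the final stage only receives the contents of `/root/.local`, adding `--no-cache-dir` in the build segment would not make the pushed image smaller under this layout.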
