feat: docker-compose to work off repo Dockerfile #27434
Conversation
Currently our docker-compose setup pulls images that were built recently on the `master` branch. While this works in most cases, it's non-deterministic and not guaranteed to always work. For example, if I merge a PR to `master` that removes a certain Python library, development branches out there that still rely on that dependency will break.

In this PR, I change the docker-compose setup(s) to:
- reference the local Dockerfile
- point to the right cache location (apache/superset-cache:....)
- make that DRY, since it's repeated many times across the docker-compose files
- touch up both docker-compose.yml and docker-compose-non-dev.yml with the same approach

As far as testing goes, I made sure this builds and that the resulting setup is functional. It was also very fast in my experience; the cache was clearly leveraged here.
context: .
target: dev
cache_from:
  - apache/superset-cache:3.9-slim-bookworm
oh I was not aware of this, what process pushes to it?
Anything that uses scripts/build_docker.py (a CLI that wraps the `docker build` CLI) will use the `cache-from` and `cache-to`, but can only push if it's logged in (push or pull_request against the main repo). Currently I think all the GitHub Actions that build images (`pull_request`, `push` on master, and releases) use this and hopefully hit the cache.
docker-compose can piggyback on this cache here, which should really speed up builds since in most cases most layers can be reused from the `master` builds.
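For what it's worth, here is a minimal sketch of how a compose file can piggyback on that cache while keeping the repeated build settings DRY with a YAML anchor. The anchor name `x-superset-build` and the service names are illustrative assumptions, not necessarily what this PR ships:

```yaml
# Sketch only: shared build config via a YAML anchor so the cache
# settings aren't repeated per service. Names are illustrative.
x-superset-build: &superset-build
  context: .
  target: dev
  cache_from:
    - apache/superset-cache:3.9-slim-bookworm

services:
  superset:
    build: *superset-build
  superset-worker:
    build: *superset-build
```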
got it:
here: https://github.com/apache/superset/blob/master/scripts/build_docker.py#L27
and here: https://github.com/apache/superset/blob/master/scripts/build_docker.py#L193
`pull_request` will only push to the cache when not in forks
side note - one thing I noticed is that the cache doesn't always seem to hit when I think it should. I'm guessing we have some limits / intelligent cache pruning that prevents cache hits from always working.... The cache hit rate is still pretty decent though, and build times aren't awful either when missing the cache.
@@ -14,7 +14,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
x-superset-image: &superset-image apachesuperset.docker.scarf.sh/apache/superset:${TAG:-latest}
I personally think it's kind of cool to have `non-dev` point to a pre-built image TAG. Also, this docker-compose does not mount the current code into the container like docker-compose.yaml does, so the non-deterministic cases probably do not apply here.
About this, I think both use cases are valid. To me, if I'm in a repo on a specific ref (a branch, a release tag, or my own little branch with a feature) and I run some docker-related thing (whether it's `docker build` or something `docker-compose`-related), I'm assuming that what I'm building is the particular ref I'm on right now.
I think the 2 options I want to provide here are really just "interactive", where we mount the code, and "non-interactive", where it's an immutable set of dockers that gets me a fully working, testable cluster lined up with the branch.
Now maybe we should ADD a new way to do `docker-compose-any-image.yml` that would work along with a TAG env var.
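To make that idea concrete, here is a hypothetical sketch of what such a file could contain. The filename and the default tag are assumptions based on the comment above, not merged code:

```yaml
# Hypothetical docker-compose-any-image.yml: run any pre-built image by
# setting TAG, e.g. `TAG=3.1.1 docker-compose -f docker-compose-any-image.yml up`
x-superset-image: &superset-image apache/superset:${TAG:-latest}

services:
  superset:
    image: *superset-image
```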
fine by me! makes sense
ok, I'm making a bunch of changes here and re-writing the docs too...
Notes about docker-compose viability as the main tool for development -> I just did a session with @rtexelm, and we found that his 8GB MacBook M1 struggles quite a bit running it. In any case, I think it'd be worth considering an alternative approach, the one where you run these two commands on the host as opposed to inside docker. This has tradeoffs, but on @rtexelm's machine it was MUCH faster. So.
oh! saw your comment after I posted mine. Let's get this done.
@@ -0,0 +1,101 @@
#
this file is effectively the old docker-compose-non-dev.yml, just renamed to be more clear
@dpgaspar I evolved this PR quite a bit, taking a harder stance against using docker-compose in production. Curious to hear your thoughts.
The slowness you're seeing is likely caused by the fact that we're running linux/amd64 containers on ARM hardware. I would check the images being pulled down to ensure they're the "arm" variants.
pre-built images from docker-hub

More on these two approaches after setting up the requirements for either.
nice! ^^^
Superset (which is running in its docker container). Other databases may have slightly different
configurations but the gist would be the same and boils down to 2 steps -

1. **(Mac users may skip this step)** Configuring the local postgresql/database instance to accept
nit: before we had:
1. ** ......
2. ...
but now there's only `1.` - does it still make sense?
Interestingly, @rtexelm's setup and mine were night and day in terms of build time. Both Apple silicon, but he has 8GB of RAM, so I assumed he was memory constrained and swapping. Though it could be that we used different base images - like he's virtualizing amd64 and I'm on an arm base. Not impossible. It's a bit buried, but I added an option and documented it here ->
Also note the cache we have from the `master` builds.
@rtexelm can you check whether you are/were virtualizing amd64 on your arm host? It'd be great to clarify.
I was. I used the setting DOCKER_DEFAULT_PLATFORM set to linux/amd64 in the past to get it to work on my system, so it must still be in effect.
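For anyone hitting the same thing on Apple silicon, one way to avoid silently emulating amd64 is to unset DOCKER_DEFAULT_PLATFORM and pin the platform per service. A sketch, with an illustrative service name:

```yaml
# Sketch: pin the native ARM variant so Docker doesn't fall back to
# emulating linux/amd64 under QEMU (which is much slower).
services:
  superset:
    platform: linux/arm64
    image: apache/superset:${TAG:-latest}
```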
Alright, this is dandy. Mergin'
Sorry I'm late to respond here. I am fine with the substance of these changes if they improve things for developers. I agree with "don't try to run this docker-compose file in production". I may reintroduce some content for people interested in running docker compose in production; I think Airflow has great language here:
I think "you'll need to modify this and know what you're doing" is more nuanced than "don't use docker compose in production" -- IMO it's for companies to decide whether the tradeoffs of the extra complexity of Kubernetes are worth it for them. My org & I could never have tried or deployed Superset if not for docker compose; fortunately, that approach was explained in the docs and I could make the necessary modifications (e.g., use an Azure Postgres instance for my metadata db).
It makes sense to me that there's a docker compose setup intended entirely for development. I am also of the opinion that for many people who would like to deploy Superset in a small production manner, it is easiest to do so using docker compose on a hosted VM that meets the specs for serving the entire stack (minus, perhaps, the production metadata store). The reality is that Superset is relatively difficult to deploy in production right now. I don't think that it has to be that way. One idea would be to have a dedicated, production-oriented compose setup. It is true, of course, that people who aim to use Superset in production should understand the nuances of docker compose. However, if you contrast the ease of self-deploying Superset vs. Discourse, for example, I think that Superset has some room for improvement in supporting a preconfigured type of "default" production deployment. p.s. I'm new to the Superset community and still learning about the codebase, though I hope to contribute once I understand everything that's happening.
Trying to list out what you'd need for a production use case on docker-compose:
All of these configuration hooks become hard for us to manage. To me, that complexity belongs in your environment, ideally in the form of terraform/helm/k8s constructs in a git repo of your own. And now that you need to do all this regardless, why not go k8s/helm so that you can evolve into supporting some elasticity and more resilience?
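For readers who take the compose-in-production route anyway, the kind of customization discussed above usually lands in an override file along these lines. This is only a sketch under assumptions: the DATABASE_* and SUPERSET_SECRET_KEY variable names mirror the docker/.env conventions and may differ in your checkout:

```yaml
# docker-compose.override.yml (hypothetical sketch): pin a release tag
# and point at an external metadata database instead of the bundled one.
services:
  superset:
    image: apache/superset:3.1.1          # pin a release, never :latest
    environment:
      DATABASE_HOST: mydb.example.com     # external Postgres
      DATABASE_USER: superset
      DATABASE_PASSWORD: ${DATABASE_PASSWORD}
      SUPERSET_SECRET_KEY: ${SUPERSET_SECRET_KEY}
```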
Thinking about it some more, knowing that docker-compose is designed (AFAIK) as a single-host solution, I don't think it's fair to call this "production" by any standard around high availability. Knowing that in most settings you'd grow a need for multi-host support, either to support HA or as usage/demand grows to where it can't be served by a single host, I think trying to support production-type use cases with docker-compose is doing a disservice to people in the community. What I'd suggest as a stance for maintainers is:
That list is a good resource for the pitfalls / customization needs of deploying with docker compose. In my case, we can address those shortcomings easily enough (e.g., point to image 3.1.1 in our docker-compose.yml, use an Azure Postgres server). The only feature we don't get vs. Kubernetes is scalability ... and we just don't experience significant growth or fluctuations that make this an issue. We wouldn't gain anything by adopting k8s. On the other hand, for a more old-school org like ours, the increased complexity of Kubernetes would make deploying Superset unfeasible. There are fewer people in general with that skill set vs. familiarity with docker, and we don't have anyone on staff positioned to stand up a k8s deployment. If the Superset project stance were "docker compose is unacceptable", we would have gone with Metabase or PowerBI. I find docker compose the simplest option to install and maintain, more so than PyPI, and feel like keeping it as an option -- while acknowledging the downsides -- is ultimately good for Superset as a project, as it gets more people in the door. I mentioned above liking Airflow's approach. I just looked at more peer projects off the top of my head; they varied, but none was explicitly "don't use docker":
I wonder if our Scarf telemetry tells us anything at least directionally useful about the share of installations that are docker compose vs. helm vs. pip.
How about using something like MiniKube https://minikube.sigs.k8s.io/docs/start/ ?
Regarding this statement:
I respect your opinion and understand your viewpoint here, but I have to disagree. "Production" and "High Availability" have drastically different meanings in different environments. In my 25+ year career working with data at about a dozen different companies (large, small, government, private, across industries), I have yet to be in an environment where it would be unacceptable to run a well-maintained instance of pretty much any type of software on a single server. In my current role, I work for a large and well-known entertainment company with > 20k employees, and all of our internal tools run on single servers on a private OpenStack cloud. There's not even a provision for easily deploying with k8. This is common in companies that have established enterprise infrastructure. We get by just fine. Is it ideal? Of course not, but definitely not a deal breaker. It worries me that the project would not provide a clear and documented route to individuals who want to run this in production on a single server. It doesn't matter if that is with MiniKube or Docker Compose or some other tech. However, like Sam mentioned, I do think this becomes somewhat of a gatekeeper to using Superset when there are other options available that may have a lower barrier to entry. I am not a long-term member of this community and realize that my opinion here carries less weight. I do hope you will consider how to make Superset a more welcoming project for people who do not need to define "production" and "high-availability" with multi-node k8 clusters that, in my opinion, simply are not feasible in a lot of environments. Edit: Let me provide slightly more detail about our use case in the aforementioned org... We are a team of about 50 people working on a specialized function in the larger organization. We will only ever need about 75 people, maximum, to be able to view Superset dashboards and/or receive email reports. There will only ever be 5-7 people actually connecting datasets and creating charts and dashboards for the others to view. Datasets for analysis through Superset contain from 1 to 10 billion rows of data, are staged for Superset usage (by the same 5-7 people), and mostly reside on Clickhouse servers that exclusively serve this team's needs. Other datasets may reside in Snowflake. We self-host Clickhouse on our own dedicated hardware both for performance reasons and to avoid query costs associated with Snowflake. The company is quite siloed like this, sometimes for good reason, so it's common to see other teams of similar size taking similar approaches to their work.
Loud and clear. Thank you for taking the time to write this - sometimes, coming out of large companies with very large infra teams, we forget the preferences and constraints of smaller environments. To be clear, we absolutely want to provide a clear path to production to as many orgs as possible, while providing the guarantees and flexibility that people need. The desire to run on a single host is indeed totally reasonable - though here I'd love to also offer an easy path to take that single-host setup towards a multi-host setup without having to switch the stack. Some more thoughts:
It's always difficult for communities like this one to support the variety of constraints people have. The matrix of possible environments is crazy-complicated.
Thanks for the reply. I think this is one area where it's preferable if the suggested single-machine deployment is highly opinionated, with links to information about different possible configurations and technologies. It's reasonable to tell people interested in this type of deployment what specs to have on a machine (or VM), what distro of Linux to use if they want to follow the instructions exactly, and exactly which config values must be set for a minimally-customized deployment. For example, it may say "Use an Ubuntu VM with at least 16 GB of RAM and a 40 GB hard drive. Install microk8s, following steps 1-3 of these instructions. Put your .crt and .key files in /some/directory and then assign their paths to SOME_CONSTANTS in the config file." Etc. The issue with a k8 backbone is that there are many additional options that can provide quality of life (automatic certificate renewal, etc.) that are not core to getting Superset running. The only option I would provide is to suggest that users have a metadata database that is not part of the k8 setup, showing them how to include the connection string from an envvar, but also provide a pathway for running Postgres in the k8 cluster as part of the deployment. I would go so far as to say that even the security of envvars is not the core business of Superset, so perhaps instruct users to put them in an .env file for the rapid deployment and then link to resources for how to better secure them. As you mention, there are myriad possible configurations. That's why I personally feel like taking an opinionated approach here is best: it cuts through the noise and helps provide more of a quick start for a small single-machine production deployment. Incidentally, from what I've read the past few days, microk8s seems more suitable and tuned for a single-machine k8 cluster than minikube, which often is mentioned as being more appropriate for testing. This is, of course, a matter of opinion. However this ends up, it will be a benefit to the Superset project, in my opinion, because it will lower the entry barrier for smaller teams. I remain of the opinion that a docker compose setup is simpler, requires fewer resources, and can be less opinionated, but if that's antithetical to the high-availability goals of the project, then so be it.
Ok, so assuming that postgres-on-docker is NOT viable, and we want to be opinionated, we would force the user to provide a postgres host/username/password (in the form of a sqlalchemy url) AND a secret key at a minimum. Curious if:
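As a concrete illustration of that minimum, the user-supplied bits could look something like the following. This is only a sketch; whether these exact variable names are honored depends on how the config is wired up:

```yaml
# Hypothetical minimum for an opinionated single-host setup: an external
# metadata DB as a SQLAlchemy URL, plus a secret key. Variable names
# are assumptions for illustration.
services:
  superset:
    environment:
      SQLALCHEMY_DATABASE_URI: postgresql+psycopg2://superset:${DB_PASSWORD}@db.example.com:5432/superset
      SUPERSET_SECRET_KEY: ${SUPERSET_SECRET_KEY}
```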
Let's get off this PR and let me start a discussion on "Get Superset / Helm to work on MiniKube and document the process" -> #27570
I just tried minikube and was up and running in like 10 minutes, and I have little experience with k8s. Pretty much just followed the quick start.
I found it similarly easy on my laptop but struggled to get the Ingress working when I tried it on a VM. I also found it a bit of a challenge to debug when containers weren't starting properly.
curious what the issue was and the workaround, if you found any. Looks like between
Personally I thought this was fantastic with
I have no qualms with the k8 ecosystem but I find it heavy for a single-machine deployment. In my specific case, I have to set up an SSL certificate and a variety of compliance logging tools and related security tech (such as an HCP Vault, etc.), configure SSO with Active Directory, etc. It was unclear how to get this smoothly working with the kubernetes instructions on the Superset website and I don't have a ton of experience working with k8. With regards to Ingress, I tried about 30 combinations of different setups, verified the firewall, etc. and was never able to connect except on an insecure port. My schedule is busy right now, so I was not able to devote more time to it. However, I was able to get it all working through Docker compose in about 20 minutes, in a manner that will work well for my team. When I have more free time, I'll return to it and try to get it working with minikube.
This sounds like it's specific to your internal policy / k8s setup (?) Also, all these things seem virtuous and important, though maybe overkill for a sandbox/POC-type environment like the one you're seeking. I'm guessing a quick-and-dirty minikube-on-an-EC2-host wouldn't have those requirements. Another thing to think about is that it may make sense for your organization to also have some infra for a more relaxed k8s setup, for a lower tier of internal applications that don't need the same level of rigor as top-tier internal apps. I know that sounds like a bit of an investment, but if the org doesn't provide the right level of infra (in this case something lightweight), you end up with oddball services in the dusty corners of your cloud. In my experience, making it easy for people to do the right things goes a long way in terms of overall efficiency. Everyone seems to think k8s has to be this huge complicated thing, and has to be one huge cluster to rule them all with all of the policies and compliance enforced. But you know, your refrigerator probably runs it, your car, ...
I don't mean to derail this to my specific use case, though I appreciate and agree with most of your comments. In fact, the org is working on an internal k8 deployment platform, though it may be a while before it is ready, and the features available on it are not yet published. The company handles a lot of PII. Like many companies, there is a tier of BI analysts who mostly handle scrubbed data, there are those who handle sensitive financial data, and there are teams who work with all of the data. Some of the latter are legal, customer support, and security; others are marketing, business dev, etc. in a targeting role. As a company based in the EU, there is a huge emphasis on data security because of the GDPR. In my experience, this is relatively common, even for companies based elsewhere. That is to say that there really is no such thing as a "lower tier" of internal application, because all applications are viewed as exposed surfaces from a security standpoint and are considered to increase the potential risk of malicious actors gaining entry to the intranet (which is an ongoing, usually-detectable threat). A sufficiently bad breach could threaten the company's existence. From a security hygiene point of view, this is a good thing because it encourages good habits. Were it easy enough to do, we should all strive to use valid SSL certificates on servers that are properly configured and have encrypted secrets. These are disparate pieces of tech and getting them all to work well together can be a challenge, as you know. With regards to Superset, my opinion largely has two simple layers: 1) it should be as easy as possible for as many people as possible to get it up and running 2) in a reasonably secure manner that has a clear upgrade path for patching security vulnerabilities. Getting it running might initially be for prototype purposes, though hardening it for a simple production deployment ideally should be a straightforward and/or well-documented step. Whether that is with docker compose or k8 does not matter much as long as it can be achieved on accessible hardware. To hop back to what I said earlier in this thread, I don't think Superset is far away from what is needed to support these use cases. There are plenty of teams in the world that will be able to clone the repo and fully customize their setup without any handholding. There are teams that will simply sign up with Preset (or Tableau or PowerBI) and outsource the problem. Then there are the teams (of all sizes) who are going to want to self-deploy and need a well-documented (opinionated is fine) secure way to do it. Remember how Wordpress became all the rage in 2004 or so and then became the vector for countless hacks? There are basically two approaches to avoid that: make it so the software is too difficult for an average person to get running on their own, or provide a clear route and documentation to deploy and maintain it in a manner that reduces the chances of that happening. This has always been one of the lessons that Discourse took to heart, resulting in software that is remarkably easy both to deploy and to keep up-to-date. Over the years, I've worked in various roles in game development. I recall from the late 1990s, in my first role, when people would want to come in and pitch game ideas to us... one of the lead designers told me before a pitch meeting: "A document is fine, a picture is worth a thousand documents, but a working prototype is worth a thousand pictures."
That's been an excellent lesson for my entire career; instead of telling somebody what you're going to do, actually do it in a prototype capacity and then explain to them how to make it a reality. There's such a hunger for high-quality BI and data visualization these days. Superset is so close to being a platform that is easy to stand up and show as a prototype, though I'm of the opinion that it's just a little bit too hard to deploy right now. ...and then comes the biggest problem of them all, regardless of business size: software meant to be only a prototype ends up in production usage and suddenly you've got a problem because it wasn't deployed in a way to fit that intention. This is where it absolutely matters that it can be secured easily and that people who don't normally deploy production software have access to opinionated resources for making it happen. Apologies for this being too long and rambling. I haven't had enough coffee yet. :) It's a rainy weekend, so I'll see if I can work on documenting a more tangible approach to deploying in a secure manner with minikube.
Richest comment I've ever seen in a PR. For real. 🏆! Love the idea of having a better version checker that shows admins they need to upgrade. The more I think about things, the more I think that the approach I'm pushing forward, using docker-compose for sandboxing and development and k8s for production, is a good one. For people who want to go lower level and run straight on metal or EC2 equivalents, they can take the helm chart as a recipe for how to do this.
SUMMARY
This PR improves docker-compose support for development, testing, and staging, but takes a stance against using docker-compose for production use cases. The reality is that the challenges around building a functional dev setup and a stable production environment look similar on the surface yet diverge significantly. More segmentation here will lead to more sanity on both sides and more safety. docker-compose is now 100% focused on supporting development workflows: fast startup, fully loaded builds with testing/dev tools where needed, ROOT access in-Docker so that you can bash into the container to debug, debug flags on, ...
For production support, see our helm chart and installation docs.
Now. Currently our docker-compose setup pulls images that have been built recently on the `master` branch (`apache/superset:${TAG:-latest}`). While this works in most cases, it's non-deterministic and not guaranteed to always work. For example, if I merge a PR to master that removes a certain python library, people in branches out there doing development that still have that dependency are not going to work.

In this PR, I change the docker-compose setup(s) to:
- reference the local Dockerfile
- point to the right cache location (apache/superset-cache:....)
- make that DRY since it's repeated many times across the docker-compose files
- touch up both docker-compose.yml and docker-compose-non-dev.yml with the same approach
- skip the headless-browser download in `superset-node`, which added significant load time at every `docker-compose up` when in most cases we don't need that headless browser. I think that's there to support puppeteer, and it can still be set with the env var `PUPPETEER_SKIP_CHROMIUM_DOWNLOAD`, which seemed declared but orphaned, as nothing references it
- merge `docker/.env` and `docker/.env-non-dev`, as it's all meant for development now

This should work across platforms, though I could only validate on `linux/arm64`.
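For context, here is a sketch of how that skip flag is typically applied to a node container in compose. The service shape below is an assumption for illustration, not the exact contents of the merged file:

```yaml
# Illustrative only: PUPPETEER_SKIP_CHROMIUM_DOWNLOAD is a standard
# puppeteer install-time flag; setting it avoids the Chromium download
# during `npm install` inside the node container.
services:
  superset-node:
    build:
      context: .
      target: superset-node
    environment:
      PUPPETEER_SKIP_CHROMIUM_DOWNLOAD: "true"
```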
TESTING INSTRUCTIONS
As far as testing goes, I made sure this builds and that the resulting setup is functional. It was also very fast in my experience; the cache was clearly leveraged here.