Argo Workflows is capable of running 1,000s of workflows a day, each with 10,000s of nodes. But you'll need to do some work to achieve this.
You must be running at least v3.1 for several of these recommendations to work. Upgrade to the very latest patch release; performance fixes often come in patches.
You'll need a big cluster, with a correspondingly big Kubernetes control plane (the API server and etcd in particular).
Users often encounter problems because Kubernetes itself has not been configured for the scale, e.g. the Kubernetes API server is too small. Before installing Argo, we recommend you test your cluster to make sure it can run the number of pods you need: create pods at the rate you expect them to be created in production, and check that Kubernetes can keep up with deleting pods at the same rate, as sketched below.
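A rough sketch: the throw-away pod below (namespace, name prefix, and image are placeholders) can be created repeatedly with `kubectl create -f` at your expected production rate, then deleted at the same rate, while you watch API server latency and error rates:

```yaml
# Minimal throw-away pod for load-testing pod creation/deletion rates.
# Namespace, name prefix and image are placeholders - adjust to your environment.
apiVersion: v1
kind: Pod
metadata:
  generateName: scale-test-   # generateName lets one manifest create many uniquely-named pods
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: registry.k8s.io/pause:3.9   # pause container: starts fast, uses almost no resources
      resources:
        requests:
          cpu: 1m
          memory: 4Mi
```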
You'll need to garbage-collect data quickly: the less data Kubernetes and Argo have to deal with, the less work they need to do. Use pod GC and workflow GC to achieve this (sketched below).
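A sketch of a workflow spec that enables both; the strategy and TTL values are illustrative, and the same settings can also be applied cluster-wide via `workflowDefaults` in the controller ConfigMap:

```yaml
# Sketch: aggressive garbage collection on a Workflow (values are illustrative).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: gc-example-
spec:
  entrypoint: main
  podGC:
    strategy: OnPodCompletion      # delete each pod as soon as it completes
  ttlStrategy:
    secondsAfterCompletion: 300    # delete the Workflow object 5m after it finishes
  templates:
    - name: main
      container:
        image: busybox
        command: [echo, hello]
```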
Where Argo has a lot of work to do, the Kubernetes API can be overwhelmed. There are several strategies to reduce this load (a configuration sketch follows the list):
- Use the Emissary executor (>= v3.1). This does not make any Kubernetes API requests (except for the `resource` template).
- Limit the number of concurrent workflows using `parallelism`.
- Rate-limit pod creation via configuration (>= v3.1).
- Set `DEFAULT_REQUEUE_TIME=1m` (see below).
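A minimal `workflow-controller-configmap` sketch covering the first three items. The key names are, to the best of our knowledge, those used in v3.1 (check them against the `workflow-controller-configmap` reference for your version), and the numbers are illustrative rather than recommendations:

```yaml
# Sketch of workflow-controller-configmap settings to reduce Kubernetes API load (v3.1+).
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  containerRuntimeExecutor: emissary   # executor that avoids Kubernetes API requests
  parallelism: "100"                   # cap on workflows running concurrently
  resourceRateLimit: |                 # rate-limit creation of pods and other resources
    limit: 10
    burst: 1
```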
If you're running workflows with many nodes, you'll probably be offloading data to a database. Offloaded data is kept for 5m. You can reduce the number of records created by setting `DEFAULT_REQUEUE_TIME=1m`. This will slow reconciliation, but will suit workflows where nodes run for over 1m.
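A sketch of one way to set it, assuming node status offload is already enabled via `persistence.nodeStatusOffLoad` in the controller ConfigMap: a strategic-merge patch on the workflow-controller Deployment. The Deployment and container names below are those used by the default install manifests; verify them in your environment.

```yaml
# requeue-patch.yaml - sketch of a strategic-merge patch setting DEFAULT_REQUEUE_TIME.
# Apply with: kubectl -n argo patch deployment workflow-controller --patch-file requeue-patch.yaml
spec:
  template:
    spec:
      containers:
        - name: workflow-controller   # merged by name with the existing container
          env:
            - name: DEFAULT_REQUEUE_TIME
              value: "1m"             # reconcile less often, so fewer offload records are written
```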