Skip to content

Commit

Permalink
fix(metricprovider): support Datadog v2 API Fixes #2813 (#2997)
Browse files Browse the repository at this point in the history
* Add note in CONTRIBUTING.md that I would have found useful.

Signed-off-by: mitchell amihod <[email protected]>

* Fix gen-openapi to be more portable - make sure it includes the GOPATH in the call.

Signed-off-by: mitchell amihod <[email protected]>

* Update docs.

* Expand working with v2 information
* Contacted Datadog to get latest info re: v1 deprecation, api limits.
* Add tips about rate limits, using helm for templates
* Add more example templates

Signed-off-by: mitchell amihod <[email protected]>

* Update Datadog Analysis Type

* Make Query omitempty since now possible it won't exist
* Add some descriptions
* Add new properties we need for v2

Queries: We can pass in key:query for queries
Formula: Makes formulas using the keys from queries

* Defaults!
Use annotations to declare defaults for some fields. This lets us remove some guard rails from the code itself

Interval: 5m - Move this from code to here
ApiVersion: v1 - Move this from code to here

* Enums!
Much like defaults, having enums lets us make assumptions about the incoming metric so we dont need as many guardrails.

ApiVersion: Enum to restrict to v1 or v2
Signed-off-by: mitchell amihod <[email protected]>

* Output of make codegen

Everything looks ok.

Signed-off-by: mitchell amihod <[email protected]>

* Pass in metric to provider factory. Validate metric.

Validating the metric on initialization, rather than spread out throughout. You get earlier feedback if you have a bad metric defined. (Not perfect, but there's limitations with our annotation generator for the rules in the crd. eg: If we could use oneOf, we wouldn't need a lot of this validation)

We check all the mutually exclusive props.
The props where one requires another.
We don't have to check for defaults and set them anymore, since they are guaranteed by the crd.

rules:

- ensure we have only query OR queries
- restrict v1 to query only
- make sure you only provide a formula with queries
- make sure multiple queries are accompanied by a formula

Signed-off-by: mitchell amihod <[email protected]>

* Remove DefaultApiVersion, remove impossible  AnalysisPhaseError

ApiVersion is guaranteed to have value, and the enum ensures its v1/v2 when user provided.

Updated v1 tests to reflect some of these new realities

Signed-off-by: mitchell amihod <[email protected]>

* extract urlBuilding from run

run was getting a bit long according to the checks

Signed-off-by: mitchell amihod <[email protected]>

* Remove some unnecessary stuff for interval.

It is a straightline to initialize since default is set to 5m for incoming metrics where it is not set.

Signed-off-by: mitchell amihod <[email protected]>

* Update createRequest

Split into createRequest v1/v2
v1 : pretty much unchanged. just extracted
v2: support for v2/query/scalar

We don't need all the timeseries. I did some testing fetching both scalar and timeseries, and they pretty much lined up.

Also confirmed with DD: From support ticket with DD: "...I have also tested the scalar api endpoint with the last aggregator as well as the timeseries api endpoint. They do indeed return the same values when retrieving the values via the api endpoints. Observing the time it takes to retrieve the values, they remain relatively the same..."

re: query + v2: Keep backwards compat. if we get in a query, we turn it into a queries object to pass on to requestv2 queries into the QueriesPayload.
Signed-off-by: mitchell amihod <[email protected]>

* Handle v2 scalar responses

* update the datadogResponseV2 for scalar values
* handle no results so it has parity with v1 - empty will now usually result in `[]` unless something goes very wrong on dd side

Signed-off-by: mitchell amihod <[email protected]>

* update v1 no data tests to better reflect reality

Signed-off-by: mitchell amihod <[email protected]>

* update v2 tests

* add some new test cases
* update mock server to handle queries / formulas validation
* update no data tests to reflect reality
* stop all values being the same. it makes it difficult to know find which test case failed. move meaning from comments into the metric name.
* stop using deprecated ioutil

Signed-off-by: mitchell amihod <[email protected]>

* re-codegen

Signed-off-by: zachaller <[email protected]>

* fix lint

Signed-off-by: zachaller <[email protected]>

---------

Signed-off-by: mitchell amihod <[email protected]>
Signed-off-by: zachaller <[email protected]>
Co-authored-by: zachaller <[email protected]>
  • Loading branch information
meeech and zachaller authored Oct 14, 2023
1 parent 1d89c33 commit ebf1a3e
Show file tree
Hide file tree
Showing 20 changed files with 1,520 additions and 736 deletions.
4 changes: 2 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ DEV_IMAGE ?= false

# E2E variables
E2E_INSTANCE_ID ?= argo-rollouts-e2e
E2E_TEST_OPTIONS ?=
E2E_TEST_OPTIONS ?=
E2E_PARALLEL ?= 1
E2E_WAIT_TIMEOUT ?= 120
GOPATH ?= $(shell go env GOPATH)
Expand Down Expand Up @@ -149,7 +149,7 @@ gen-mocks: install-go-tools-local ## generate mock files
# generates openapi_generated.go
.PHONY: gen-openapi
gen-openapi: $(DIST_DIR)/openapi-gen ## generate openapi files
PATH=${DIST_DIR}:$$PATH openapi-gen \
PATH=${DIST_DIR}:$$PATH GOPATH=${GOPATH} openapi-gen \
--go-header-file ${CURRENT_DIR}/hack/custom-boilerplate.go.txt \
--input-dirs github.com/argoproj/argo-rollouts/pkg/apis/rollouts/v1alpha1 \
--output-package github.com/argoproj/argo-rollouts/pkg/apis/rollouts/v1alpha1 \
Expand Down
6 changes: 6 additions & 0 deletions docs/CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -178,6 +178,12 @@ make start-e2e E2E_INSTANCE_ID=''
```


6. Working on CRDs? While editing them directly works when you are finding the shape of things you want, the final CRDs are autogenerated. Make sure to regenerate them before submitting PRs. They are controlled by the relevant annotations in the types file:

eg: Analysis Templates are controlled by annotations in `pkg/apis/rollouts/v1alpha1/analysis_types.go`.

Refer to https://book-v1.book.kubebuilder.io/beyond_basics/generating_crd.html and https://book.kubebuilder.io/reference/markers/crd-validation.html for more info on annotations you can use.

## Running Local Containers

You may need to run containers locally, so here's how:
Expand Down
137 changes: 133 additions & 4 deletions docs/analysis/datadog.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,10 +26,7 @@ spec:
sum:requests.error.rate{service:{{args.service-name}}}
```
The field `apiVersion` refers to the API version of Datadog (v1 or v2). Default value is `v1` if this is omitted.

!!! note
Datadog is moving away from the legacy v1 API. Rate limits imposed by Datadog are therefore stricter when using v1. It is recommended to switch to v2 soon. If you switch to v2, you will not be able to use formulas (operations between individual queries).
The field `apiVersion` refers to the API version of Datadog (v1 or v2). Default value is `v1` if this is omitted. See "Working with Datadog API v2" below for more information.

Datadog api and app tokens can be configured in a kubernetes secret in argo-rollouts namespace.

Expand All @@ -46,3 +43,135 @@ stringData:
```

`apiVersion` here is different from the `apiVersion` from the Datadog configuration above.

### Working with Datadog API v2

#### Moving to v2

If your old v1 was just a simple metric query - no formula as part of the query - then you can just move to v2 by updating the `apiVersion` in your existing Analysis Template, and everything should work.

If you have a formula, you will need to update how you configure your metric. Here is a before/after example of what your Analysis Template should look like:

Before:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: log-error-rate
spec:
args:
- name: service-name
metrics:
- name: error-rate
interval: 30s
successCondition: default(result, 0) < 10
failureLimit: 3
provider:
datadog:
apiVersion: v1
interval: 5m
query: "moving_rollup(sum:requests.errors{service:{{args.service-name}}}.as_count(), 60, 'sum') / sum:requests{service:{{args.service-name}}}.as_count()"
```

After:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: loq-error-rate
spec:
args:
- name: service-name
metrics:
- name: error-rate
# Polling rate against the Datadog API
interval: 30s
successCondition: default(result, 0) < 10
failureLimit: 3
provider:
datadog:
apiVersion: v2
# The window of time we are looking at in DD. Basically we will fetch data from (now-5m) to now.
interval: 5m
queries:
a: sum:requests.errors{service:{{args.service-name}}}.as_count()
b: sum:requests{service:{{args.service-name}}}.as_count()
formula: "moving_rollup(a, 60, 'sum') / b"
```

#### Examples

Simple v2 query with no formula

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: canary-container-restarts
spec:
args:
# This is set in rollout using the valueFrom: podTemplateHashValue functionality
- name: canary-hash
- name: service-name
- name: restarts.initial-delay
value: "60s"
- name: restarts.max-restarts
value: "4"
metrics:
- name: kubernetes.containers.restarts
initialDelay: "{{ args.restarts.initial-delay }}"
interval: 15s
failureCondition: default(result, 0) > {{ args.restarts.max-restarts }}
failureLimit: 0
provider:
datadog:
apiVersion: v2
interval: 5m
queries:
# The key is arbitrary - you will use this key to refer to the query if you use a formula.
q: "max:kubernetes.containers.restarts{service-name:{{args.service-name}},rollouts_pod_template_hash:{{args.canary-hash}}}"
```

### Tips

#### Datadog Results

Datadog queries can return empty results if the query takes place during a time interval with no metrics. The Datadog provider will return a `nil` value yielding an error during the evaluation phase like:

```
invalid operation: < (mismatched types <nil> and float64)
```

However, empty query results yielding a `nil` value can be handled using the `default()` function. Here is a succeeding example using the `default()` function:

```yaml
successCondition: default(result, 0) < 0.05
```

#### Templates and Helm

Helm and Argo Rollouts both try to parse things between `{{ ... }}` when rendering templates. If you use Helm to deliver your manifests, you will need to escape `{{ args.whatever }}`. Using the example above, here it is set up for Helm:

```yaml
...<snip>
metrics:
- name: kubernetes.containers.restarts
initialDelay: "{{ `{{ args.restarts.initial-delay }}` }}"
interval: 15s
failureCondition: default(result, 0) > {{ `{{ args.restarts.max-restarts }}` }}
failureLimit: 0
provider:
datadog:
apiVersion: v2
interval: 5m
queries:
q: "{{ `max:kubernetes.containers.restarts{kube_app_name:{{args.kube_app_name}},rollouts_pod_template_hash:{{args.canary-hash}}}` }}"
```
#### Rate Limits
For the `v1` API, you ask for an increase on the `api/v1/query` route.

For the `v2` API, the Ratelimit-Name you ask for an increase in is the `query_scalar_public`.
54 changes: 45 additions & 9 deletions docs/features/kustomize/rollout_cr_schema.json
Original file line number Diff line number Diff line change
Expand Up @@ -244,18 +244,30 @@
"datadog": {
"properties": {
"apiVersion": {
"default": "v1",
"enum": [
"v1",
"v2"
],
"type": "string"
},
"formula": {
"type": "string"
},
"interval": {
"default": "5m",
"type": "string"
},
"queries": {
"additionalProperties": {
"type": "string"
},
"type": "object"
},
"query": {
"type": "string"
}
},
"required": [
"query"
],
"type": "object"
},
"graphite": {
Expand Down Expand Up @@ -4750,18 +4762,30 @@
"datadog": {
"properties": {
"apiVersion": {
"default": "v1",
"enum": [
"v1",
"v2"
],
"type": "string"
},
"formula": {
"type": "string"
},
"interval": {
"default": "5m",
"type": "string"
},
"queries": {
"additionalProperties": {
"type": "string"
},
"type": "object"
},
"query": {
"type": "string"
}
},
"required": [
"query"
],
"type": "object"
},
"graphite": {
Expand Down Expand Up @@ -9256,18 +9280,30 @@
"datadog": {
"properties": {
"apiVersion": {
"default": "v1",
"enum": [
"v1",
"v2"
],
"type": "string"
},
"formula": {
"type": "string"
},
"interval": {
"default": "5m",
"type": "string"
},
"queries": {
"additionalProperties": {
"type": "string"
},
"type": "object"
},
"query": {
"type": "string"
}
},
"required": [
"query"
],
"type": "object"
},
"graphite": {
Expand Down
13 changes: 11 additions & 2 deletions manifests/crds/analysis-run-crd.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -179,13 +179,22 @@ spec:
datadog:
properties:
apiVersion:
default: v1
enum:
- v1
- v2
type: string
formula:
type: string
interval:
default: 5m
type: string
queries:
additionalProperties:
type: string
type: object
query:
type: string
required:
- query
type: object
graphite:
properties:
Expand Down
13 changes: 11 additions & 2 deletions manifests/crds/analysis-template-crd.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -175,13 +175,22 @@ spec:
datadog:
properties:
apiVersion:
default: v1
enum:
- v1
- v2
type: string
formula:
type: string
interval:
default: 5m
type: string
queries:
additionalProperties:
type: string
type: object
query:
type: string
required:
- query
type: object
graphite:
properties:
Expand Down
13 changes: 11 additions & 2 deletions manifests/crds/cluster-analysis-template-crd.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -175,13 +175,22 @@ spec:
datadog:
properties:
apiVersion:
default: v1
enum:
- v1
- v2
type: string
formula:
type: string
interval:
default: 5m
type: string
queries:
additionalProperties:
type: string
type: object
query:
type: string
required:
- query
type: object
graphite:
properties:
Expand Down
Loading

0 comments on commit ebf1a3e

Please sign in to comment.