Backwards Compatibility: Combine always_sample with a not #25190

Closed
sarincasm opened this issue Aug 14, 2023 · 14 comments
Labels: processor/tailsampling (Tail sampling processor), question (Further information is requested)

Comments

@sarincasm
Contributor

Component(s)

processor/tailsampling

Is your feature request related to a problem? Please describe.

We want to start implementing tail based sampling in our organization in a backwards compatible way.
Some services will use tail sampling, for the others we want to start by using the always_sample policy.

Is there a straightforward way to implement this policy:

if service.name not in ['list', 'of', 'services', 'already', 'using', 'tail', 'sampling']
then always_sample 

Here are a few things I'm trying:

Option 1

{
  name: backwards-compatibility-policy-v1,
  type: and,
  and: {
    and_sub_policy:
      [
        {
          name: services-using-tail_sampling-policy,
          type: string_attribute,
          string_attribute:
            {
              key: service.name,
              values:
                [
                  list,
                  of,
                  services,
                  already,
                  using,
                  tail_sampling
                ],
              invert_match: true,
            },
        },
        { name: sample-all-policy, type: always_sample },
      ],
  },
}

However, this might not work, because this will lead to an inverted not sample decision - meaning the actual policies for these services will have no impact. Did I understand this correctly?


Option 2

{
  name: backwards-compatibility-policy-v2,
  type: and,
  and: {
    and_sub_policy:
      [
        {
          name: services-not-using-tail_sampling-policy,
          type: string_attribute,
          string_attribute:
            {
              key: service.name,
              values:
                [
                  list,
                  of,
                  services,
                  not,
                  using,
                  tail_sampling
                ],
            },
        },
        { name: sample-all-policy, type: always_sample },
      ],
  },
}

This might work - however, it means that the entire list of legacy services would need to be added here, and any new service that is not using tail sampling would also need to be added. This unfortunately means we need to maintain a redundant list of services, shifting the burden onto teams that are not ready to use tail sampling and onto those that are just trying out OTel in the first place.


Option 3

{
  name: backwards-compatibility-policy-v3,
  type: and,
  and: {
    and_sub_policy:
      [
        {
          name: services-using-tail_sampling-policy,
          type: ottl_condition,
          ottl_condition:
            {
              span: [
                "attributes[\"service.name\"] != \"list\" ",
                "attributes[\"service.name\"] != \"of\" ",
                "attributes[\"service.name\"] != \"services\" ",
                "attributes[\"service.name\"] != \"already\" ",
                "attributes[\"service.name\"] != \"using\" ",
                "attributes[\"service.name\"] != \"tail_sampling\" ",
              ],
              spanevent: [
                "attributes[\"service.name\"] != \"list\" ",
                "attributes[\"service.name\"] != \"of\" ",
                "attributes[\"service.name\"] != \"services\" ",
                "attributes[\"service.name\"] != \"already\" ",
                "attributes[\"service.name\"] != \"using\" ",
                "attributes[\"service.name\"] != \"tail_sampling\" ",
              ],
            }
        },
        { name: sample-all-policy, type: always_sample },
      ],
  },
}

Does this approach work? Where can I find a reference for the syntax of OTTL conditions? I'm hoping this will create an OR between the list of services - is that correct?


Could someone help out with this problem? Is there a more straightforward way of solving it?
Thanks!

Describe the solution you'd like

Some help with the above 3 approaches would be appreciated.

Alternatively, it would also be useful if we could use a simple not statement with the string_attribute policy - one that leads to a sample decision for string values that do not match, rather than to an explicit inverted not sampled decision.

Something like

not: {
  string_attribute: {
    key: http.route,
    values: [/v1/providers/.+],
  },
},

Describe alternatives you've considered

Described above.

Additional context

No response

@sarincasm added the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Aug 14, 2023
@github-actions bot added the processor/tailsampling (Tail sampling processor) label on Aug 14, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bryan-aguilar
Contributor

Instead of trying to solve this in the tail sampling processor, could you use different collector functionality? I'm thinking that could reduce the complexity of the tail sampling processor configuration.

What about a separate trace pipeline for tail-sampled data? You could use the filter processor to drop data in the tail-sampled pipeline and vice versa.
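A rough sketch of what that could look like, with a made-up team_a_service name and backend endpoint, and a simple status_code policy standing in for the real tail sampling policies (the filter processor drops spans matching its OTTL conditions; note the per-span vs. per-trace caveat raised further down the thread):

receivers:
  otlp:
    protocols:
      grpc:

processors:
  # Drops spans that do NOT belong to the tail-sampled service, so only
  # team_a_service reaches the tail_sampling processor.
  filter/keep_tail_sampled:
    error_mode: ignore
    traces:
      span:
        - 'resource.attributes["service.name"] != "team_a_service"'
  # Inverse filter for the passthrough pipeline.
  filter/keep_passthrough:
    error_mode: ignore
    traces:
      span:
        - 'resource.attributes["service.name"] == "team_a_service"'
  tail_sampling:
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }

exporters:
  otlp:
    endpoint: backend:4317  # made-up endpoint

service:
  pipelines:
    traces/tail_sampled:
      receivers: [otlp]
      processors: [filter/keep_tail_sampled, tail_sampling]
      exporters: [otlp]
    traces/passthrough:
      receivers: [otlp]
      processors: [filter/keep_passthrough]
      exporters: [otlp]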

@bryan-aguilar added the question (Further information is requested) label and removed the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Aug 21, 2023
@jpkrohling
Member

Your option 1 might work. Re-reading the doc, this is what I documented back then:

When there's a "inverted sample" decision and no "not sample" decisions, the trace is sampled;

In your case, the services not included there will have an inverted sample decision and no other "not sample" decisions will be made. The services included in the list can still have their own policies listed before this one, and their "not sample" policies will be respected.
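For illustration only (the service names svc-a/svc-b and the error policy are made up), the ordering described above could look roughly like this:

policies:
  [
    # Per-service policies for the services already using tail sampling
    # (svc-a, svc-b here) are listed before the backwards-compatibility policy.
    {
      name: team-a-errors-policy,
      type: and,
      and:
        {
          and_sub_policy:
            [
              {
                name: team-a-services-policy,
                type: string_attribute,
                string_attribute: { key: service.name, values: [svc-a, svc-b] },
              },
              {
                name: errors-policy,
                type: status_code,
                status_code: { status_codes: [ERROR] },
              },
            ],
        },
    },
    # Option 1 policy: via invert_match, this matches every service that is
    # NOT in the list and combines it with always_sample, so those services
    # keep being sampled as before.
    {
      name: backwards-compatibility-policy-v1,
      type: and,
      and:
        {
          and_sub_policy:
            [
              {
                name: services-using-tail_sampling-policy,
                type: string_attribute,
                string_attribute:
                  { key: service.name, values: [svc-a, svc-b], invert_match: true },
              },
              { name: sample-all-policy, type: always_sample },
            ],
        },
    },
  ]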

@jpkrohling
Member

@bryan-aguilar, isn't the filter processor dropping only individual spans? In this case, the decision is made based on the entire trace: if one of the spans in the trace goes through the service "already", they want to use tail sampling; otherwise, do not sample.

@sarincasm
Contributor Author

Yes!
Option 1 worked perfectly! 🎉

Here is what confused me:

I was sure that this approach would work for services not listed there - since they would meet the always_sample condition.

However, I was afraid that it might not work for the services that are listed because of this condition:

When there's an "inverted not sample" decision, the trace is not sampled;

Perhaps I didn't understand what an inverted not sample decision is.

What I understood earlier:

{
  name: backwards-compatibility-policy-v1,
  type: and,
  and: {
    and_sub_policy:
      [
        {
          name: services-using-tail_sampling-policy,
          type: string_attribute,
          string_attribute:
            {
              key: service.name,
              values:
                [
                  list,
                  of,
                  services,
                  already,
                  using,
                  tail_sampling
                ],
              invert_match: true,
            },
        },
        { name: sample-all-policy, type: always_sample },
      ],
  },
}

If a span comes in where service.name = list, this would lead to an inverted not sample decision. Since it is an and policy, the overall decision would be not sampled. Therefore, any other policies that might lead to a sampled decision would have no impact, because there is an inverted not sample decision.

I suppose this is not correct, since the approach now works and is deployed to production 🚀

Though I would like to understand this a bit better.

@bryan-aguilar
Contributor

@jpkrohling you are absolutely right! Good call and thank you for the correction!

@sarincasm glad to see you got something to work!

@jpkrohling
Member

I have to admit I was confused myself as well, and agree that the document can be clearer on that. If you can come up with a way to make it clearer, perhaps with examples, I would be happy to review it!

To shed more light on your case: the matching of the attribute itself is what gets inverted, not the decision.
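As a small annotated sketch of that distinction (the key and values here are hypothetical):

{
  name: invert-match-example-policy,
  type: string_attribute,
  string_attribute:
    {
      key: service.name,
      values: [svc-a, svc-b],
      # invert_match inverts WHICH traces match, not the final decision:
      #   service.name not in [svc-a, svc-b] -> "inverted sample"
      #   service.name in [svc-a, svc-b]     -> "inverted not sample"
      invert_match: true,
    },
}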

@sarincasm
Contributor Author

Thank you very much!

I will close this issue as resolved, because the feature I wanted already exists and no further changes are needed.

I was thinking about how we can make the documentation clearer, and I think the best way is through a practical example.

I created a PR here: #26480

codeboten pushed a commit that referenced this issue Sep 14, 2023
Add documentation to clarify how the tail sampling processor can be used
in a complex case, by combining multiple policies.

Linked to #25190
Linked to #24606

---------

Co-authored-by: Alex Boten <[email protected]>
@rpriyanshu9
Contributor

rpriyanshu9 commented Oct 6, 2023

@sarincasm In the example you added,

Rule 1: Not all teams are ready to move to tail sampling. Therefore, sample all traces that are not from the team team_a.

^^ This means that team_a is the only team which is not ready to move to tail sampling. So we just need to not sample any spans sent by team_a, is it?

If yes, then how does it relate to Rule #1 in the example?

     {
        # Rule 1: use always_sample policy for services that don't belong to team_a and are not ready to use tail sampling
        name: backwards-compatibility-policy,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: services-using-tail_sampling-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values:
                        [
                          list,
                          of,
                          services,
                          using,
                          tail_sampling,
                        ],
                      invert_match: true,
                    },
                },
                { name: sample-all-policy, type: always_sample },
              ],
          },
      }

^^ Does it mean that in the values, we need to pass in the names of service belonging to team_a + any additional service that wants to opt out of tail sampling?

Also, if team_a is not using tail sampling, then why is it written before the other policies that:

# BEGIN: policies for team_a
      {
        # Rule 2: low sampling for readiness/liveness probes

I can't quite make sense of the example here - could you please help?

@rpriyanshu9
Contributor

@jpkrohling @sarincasm Since you have a clearer understanding of this, can you add some more descriptive info on:

Each policy will result in a decision, and the processor will evaluate them to make a final decision:

- When there's an "inverted not sample" decision, the trace is not sampled;
- When there's a "sample" decision, the trace is sampled;
- When there's a "inverted sample" decision and no "not sample" decisions, the trace is sampled;
- In all other cases, the trace is NOT sampled

Especially on what an "inverted not sample" and an "inverted sample" decision are.

@sarincasm
Contributor Author

@rpriyanshu9

This means that team_a is the only team which is not ready to move to tail sampling. So we just need to not sample any spans sent by team_a, is it?

It is the opposite: team_a is the one that wants to move to tail sampling. However, we don't want to disrupt the services from other teams.
Therefore, if the service does not belong to team_a, we sample all of its traces - in effect, 100 percent of those traces will be sampled.

Also, to answer this specific statement.

So we just need to not sample any spans sent by team_a, is it?

Again, an important thing to understand is that for teams that are not ready to move to tail sampling, we want to sample all traces - which in effect means that the tail sampling logic is not applied to them.

Does it mean that in the values, we need to pass in the names of service belonging to team_a + any additional service that wants to opt out of tail sampling?

Again, it is the opposite - here you would pass in the list of services that want to opt in to tail sampling.
(Since it is an invert match, it will end up applying to the services not in the list - achieving the desired effect of not using tail sampling for services that are not yet ready.)

Perhaps, it is most important to understand what this means:

Rule 1: Not all teams are ready to move to tail sampling. Therefore, sample all traces that are not from the team team_a.

Let's say there is a team_b which is not yet ready to move to tail sampling. For those services, we simply want to sample all traces. Hence we apply the always_sample policy to them - in effect, tail sampling has no effect on those traces, and they are simply forwarded by the collector.

For services belonging to team_a, we apply more complicated logic, as described in the subsequent rules below # BEGIN: policies for team_a.

@rpriyanshu9
Copy link
Contributor

@sarincasm thanks for explaining.

Btw, in rule no. 2, is the sampling percentage correct?

The description says that only 1% of traces are to be sampled, but the value given in the code is 0.1.

Should it be 1? I checked the code and it seems that the value is divided by 100. Please let me know if that is correct or not - I can fix this by opening a PR.

@sarincasm
Contributor Author

@rpriyanshu9 you're right. I think there's a typo in the rule. Rule #2 should say that the percentage is 0.1 percent. So the correct rule should read:

Sample only 0.1 percent of Readiness/liveness probes
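For illustration, this is how the value reads in a probabilistic sub-policy, where sampling_percentage is expressed as a percentage in the 0-100 range (the policy name is made up):

{
  name: probe-sampling-policy,
  type: probabilistic,
  # 0.1 here means 0.1 percent of the matching traces, not 10 percent.
  probabilistic: { sampling_percentage: 0.1 },
}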

@rahjesh-vunet

Consider this requirement.

  • For all services,
    • sample all error traces
  • For all services,
    • sample all traces which are having duration > 5000 ms
  • For service A
    • Sample 10% of the remaining traces
  • For service B
    • Sample 20% of the remaining traces
  • For all other services
    • Sample only 5% of the remaining traces.

Is this achievable?

I can have the following policies.

  1. status_codes policy for ERROR
  2. latency policy with threshold_ms > 5000
  3. AND policy for service.name=A and probabilistic=10%
  4. AND policy for service.name=B and probabilistic=20%
  5. AND policy for service.name=A,B inverted=true and probabilistic=5%

With these policies, due to policy #5, the traces from Service A and Service B would always end up not sampled, as they result in "Inverted Not Sampled" there. Is that right?

Is there a way to achieve the tail sampling requirements mentioned above?
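For reference, a sketch of what policy #5 from the list above might look like (the policy names are made up; the service names A and B and the 5% rate come from the description):

{
  name: other-services-policy,
  type: and,
  and:
    {
      and_sub_policy:
        [
          {
            name: not-service-a-or-b-policy,
            type: string_attribute,
            # invert_match makes this sub-policy match every service
            # other than A and B.
            string_attribute:
              { key: service.name, values: [A, B], invert_match: true },
          },
          {
            name: five-percent-policy,
            type: probabilistic,
            probabilistic: { sampling_percentage: 5 },
          },
        ],
    },
}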
