Backwards Compatibility: Combine always_sample with a not #25190

Closed
sarincasm opened this issue Aug 14, 2023 · 14 comments
Labels: processor/tailsampling (Tail sampling processor), question (Further information is requested)

Comments

@sarincasm
Contributor

Component(s)

processor/tailsampling

Is your feature request related to a problem? Please describe.

We want to start implementing tail based sampling in our organization in a backwards compatible way.
Some services will use tail sampling, for the others we want to start by using the always_sample policy.

Is there a straightforward way to implement this policy:

if service.name not in ['list', 'of', 'services', 'already', 'using', 'tail', 'sampling']
then always_sample 

Here are a few things I'm trying:

Option 1

{
  name: backwards-compatibility-policy-v1,
  type: and,
  and: {
    and_sub_policy:
      [
        {
          name: services-using-tail_sampling-policy,
          type: string_attribute,
          string_attribute:
            {
              key: service.name,
              values:
                [
                  list,
                  of,
                  services,
                  already,
                  using,
                  tail_sampling
                ],
              invert_match: true,
            },
        },
        { name: sample-all-policy, type: always_sample },
      ],
  },
}

However, this might not work, because this will lead to an inverted not sample decision - meaning the actual policies for these services will have no impact. Did I understand this correctly?


Option 2

{
  name: backwards-compatibility-policy-v2,
  type: and,
  and: {
    and_sub_policy:
      [
        {
          name: services-not-using-tail_sampling-policy,
          type: string_attribute,
          string_attribute:
            {
              key: service.name,
              values:
                [
                  list,
                  of,
                  services,
                  not,
                  using,
                  tail_sampling
                ],
            },
        },
        { name: sample-all-policy, type: always_sample },
      ],
  },
}

This might work - however, it means that the entire list of legacy services would need to be added here, and any new service that is not using tail sampling would also need to be added. This unfortunately means we need to maintain a redundant list of services, shifting the burden onto teams that are not ready to use tail sampling and onto those that are just trying out OTel in the first place.


Option 3

{
  name: backwards-compatibility-policy-v3,
  type: and,
  and: {
    and_sub_policy:
      [
        {
          name: services-using-tail_sampling-policy,
          type: ottl_condition,
          ottl_condition:
            {
              span: [
                "attributes[\"service.name\"] != \"list\" ",
                "attributes[\"service.name\"] != \"of\" ",
                "attributes[\"service.name\"] != \"services\" ",
                "attributes[\"service.name\"] != \"already\" ",
                "attributes[\"service.name\"] != \"using\" ",
                "attributes[\"service.name\"] != \"tail_sampling\" ",
              ],
              spanevent: [
                "attributes[\"service.name\"] != \"list\" ",
                "attributes[\"service.name\"] != \"of\" ",
                "attributes[\"service.name\"] != \"services\" ",
                "attributes[\"service.name\"] != \"already\" ",
                "attributes[\"service.name\"] != \"using\" ",
                "attributes[\"service.name\"] != \"tail_sampling\" ",
              ],
            }
        },
        { name: sample-all-policy, type: always_sample },
      ],
  },
}

Does this approach work? Where can I find a reference for the syntax of OTTL conditions? I'm hoping this will create an OR between the list of services - is that correct?


Could someone help out with this problem? Is there a more straightforward way of solving it?
Thanks!

Describe the solution you'd like

Some help with the above 3 approaches would be appreciated.

Alternatively, it would also be useful if we could use a simple not statement with the string_attribute policy - one that leads to a sample decision for string values that do not match, rather than to an explicit inverted not sampled decision.

Something like

not: {
  string_attribute: {
    key: http.route,
    values: [/v1/providers/.+],
  },
},

Describe alternatives you've considered

Described above.

Additional context

No response

@sarincasm added the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Aug 14, 2023
@github-actions bot added the processor/tailsampling (Tail sampling processor) label on Aug 14, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bryan-aguilar
Contributor

Instead of trying to solve this in the tail sampling processor, could you use different collector functionality? I'm thinking that could reduce the complexity of the tail sampling processor configuration.

What about a separate trace pipeline for tail-sampled data? You could use the filter processor to drop data in the tail-sampled pipeline and vice versa.
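A rough sketch of what that could look like, with a made-up team_a_service name and backend endpoint, and a simple status_code policy standing in for the real tail sampling policies (the filter processor drops spans matching its OTTL conditions; note the per-span vs. per-trace caveat raised further down the thread):

receivers:
  otlp:
    protocols:
      grpc:

processors:
  # Drops spans that do NOT belong to the tail-sampled service, so only
  # team_a_service reaches the tail_sampling processor.
  filter/keep_tail_sampled:
    error_mode: ignore
    traces:
      span:
        - 'resource.attributes["service.name"] != "team_a_service"'
  # Inverse filter for the passthrough pipeline.
  filter/keep_passthrough:
    error_mode: ignore
    traces:
      span:
        - 'resource.attributes["service.name"] == "team_a_service"'
  tail_sampling:
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }

exporters:
  otlp:
    endpoint: backend:4317  # made-up endpoint

service:
  pipelines:
    traces/tail_sampled:
      receivers: [otlp]
      processors: [filter/keep_tail_sampled, tail_sampling]
      exporters: [otlp]
    traces/passthrough:
      receivers: [otlp]
      processors: [filter/keep_passthrough]
      exporters: [otlp]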

@bryan-aguilar added the question (Further information is requested) label and removed the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Aug 21, 2023
@jpkrohling
Member

Your option 1 might work. Re-reading the doc, this is what I documented back then:

When there's a "inverted sample" decision and no "not sample" decisions, the trace is sampled;

In your case, the services not included there will have an inverted sample decision and no other "not sample" decisions will be made. The services included in the list can still have their own policies listed before this one, and their "not sample" policies will be respected.
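For illustration only (the service names svc-a/svc-b and the error policy are made up), the ordering described above could look roughly like this:

policies:
  [
    # Per-service policies for the services already using tail sampling
    # (svc-a, svc-b here) are listed before the backwards-compatibility policy.
    {
      name: team-a-errors-policy,
      type: and,
      and:
        {
          and_sub_policy:
            [
              {
                name: team-a-services-policy,
                type: string_attribute,
                string_attribute: { key: service.name, values: [svc-a, svc-b] },
              },
              {
                name: errors-policy,
                type: status_code,
                status_code: { status_codes: [ERROR] },
              },
            ],
        },
    },
    # Option 1 policy: via invert_match, this matches every service that is
    # NOT in the list and combines it with always_sample, so those services
    # keep being sampled as before.
    {
      name: backwards-compatibility-policy-v1,
      type: and,
      and:
        {
          and_sub_policy:
            [
              {
                name: services-using-tail_sampling-policy,
                type: string_attribute,
                string_attribute:
                  { key: service.name, values: [svc-a, svc-b], invert_match: true },
              },
              { name: sample-all-policy, type: always_sample },
            ],
        },
    },
  ]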

@jpkrohling
Member

@bryan-aguilar, isn't the filter processor dropping only individual spans? In this case, the decision is made based on the entire trace: if one of the spans in the trace goes through the service "already", they want to use tail sampling; otherwise, do not sample.

@sarincasm
Contributor Author

Yes!
Option 1 worked perfectly! 🎉

Here is what confused me:

I was sure that this approach would work for services not listed there - since they would meet the always_sample condition.

However, I was afraid that it might not work for the services that are listed because of this condition:

When there's an "inverted not sample" decision, the trace is not sampled;

Perhaps I didn't understand what an inverted not sample decision is.

What I understood earlier:

{
  name: backwards-compatibility-policy-v1,
  type: and,
  and: {
    and_sub_policy:
      [
        {
          name: services-using-tail_sampling-policy,
          type: string_attribute,
          string_attribute:
            {
              key: service.name,
              values:
                [
                  list,
                  of,
                  services,
                  already,
                  using,
                  tail_sampling
                ],
              invert_match: true,
            },
        },
        { name: sample-all-policy, type: always_sample },
      ],
  },
}

If a span comes in where service.name = list, this would lead to an inverted not sample decision. Since it is an and policy, the overall decision would be not sampled. Therefore, any other policies that might lead to a sampled decision would have no impact, because there is an inverted not sample decision.

I suppose this is not correct, since the approach now works and is deployed to production 🚀

Though I would like to understand this a bit better.

@bryan-aguilar
Contributor

@jpkrohling you are absolutely right! Good call and thank you for the correction!

@sarincasm glad to see you got something to work!

@jpkrohling
Member

I have to admit I was confused myself as well, and agree that the document can be clearer on that. If you can come up with a way to make it clearer, perhaps with examples, I would be happy to review it!

To shed more light on your case: the matching of the attribute itself is what gets inverted, not the decision.
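As a small annotated sketch of that distinction (the key and values here are hypothetical):

{
  name: invert-match-example-policy,
  type: string_attribute,
  string_attribute:
    {
      key: service.name,
      values: [svc-a, svc-b],
      # invert_match inverts WHICH traces match, not the final decision:
      #   service.name not in [svc-a, svc-b] -> "inverted sample"
      #   service.name in [svc-a, svc-b]     -> "inverted not sample"
      invert_match: true,
    },
}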

@sarincasm
Contributor Author

Thank you very much!

I will close this issue as resolved, because the feature I wanted already exists and no further changes are needed.

I was thinking about how we can make the documentation clearer, and I think the best way is through a practical example.

I created a PR here: #26480

codeboten pushed a commit that referenced this issue Sep 14, 2023
Add documentation to clarify how the tail sampling processor can be used
in a complex case, by combining multiple policies.

Linked to #25190
Linked to #24606

---------

Co-authored-by: Alex Boten <[email protected]>
@rpriyanshu9
Contributor

rpriyanshu9 commented Oct 6, 2023

@sarincasm In the example you added,

Rule 1: Not all teams are ready to move to tail sampling. Therefore, sample all traces that are not from the team team_a.

^^ This means that team_a is the only team which is not ready to move to tail sampling. So we just need to not sample any spans sent by team_a, is it?

If yes, then how does it relate to Rule #1 in the example?

     {
        # Rule 1: use always_sample policy for services that don't belong to team_a and are not ready to use tail sampling
        name: backwards-compatibility-policy,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: services-using-tail_sampling-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values:
                        [
                          list,
                          of,
                          services,
                          using,
                          tail_sampling,
                        ],
                      invert_match: true,
                    },
                },
                { name: sample-all-policy, type: always_sample },
              ],
          },
      }

^^ Does it mean that in the values, we need to pass in the names of service belonging to team_a + any additional service that wants to opt out of tail sampling?

Also, if team_a is not using tail sampling, then why is it written before the other policies that:

# BEGIN: policies for team_a
      {
        # Rule 2: low sampling for readiness/liveness probes

I can't quite make sense of the example here - could you please help?

@rpriyanshu9
Contributor

@jpkrohling @sarincasm Since you have a clearer understanding of this, can you add some more descriptive info on:

Each policy will result in a decision, and the processor will evaluate them to make a final decision:

- When there's an "inverted not sample" decision, the trace is not sampled;
- When there's a "sample" decision, the trace is sampled;
- When there's a "inverted sample" decision and no "not sample" decisions, the trace is sampled;
- In all other cases, the trace is NOT sampled

Especially on what an "inverted not sample" and an "inverted sample" decision are.

@sarincasm
Contributor Author

@rpriyanshu9

This means that team_a is the only team which is not ready to move to tail sampling. So we just need to not sample any spans sent by team_a, is it?

It is the opposite: team_a is the one that wants to move to tail sampling. However, we don't want to disrupt the services from other teams.
Therefore, if the service does not belong to team_a, we sample all of its traces - in effect, 100 percent of those traces will be sampled.

Also, to answer this specific statement.

So we just need to not sample any spans sent by team_a, is it?

Again, an important thing to understand is that for teams that are not ready to move to tail sampling, we want to sample all traces - which in effect means that the tail sampling logic is not applied to them.

Does it mean that in the values, we need to pass in the names of service belonging to team_a + any additional service that wants to opt out of tail sampling?

Again, it is the opposite - here you would pass in the list of services that want to opt in to tail sampling.
(Since it is an invert match, it will end up applying to the services not in the list - achieving the desired effect of not using tail sampling for services that are not yet ready.)

Perhaps, it is most important to understand what this means:

Rule 1: Not all teams are ready to move to tail sampling. Therefore, sample all traces that are not from the team team_a.

Let's say there is a team_b which is not yet ready to move to tail sampling. For those services, we simply want to sample all traces. Hence we apply the always_sample policy to them - in effect, tail sampling has no effect on those traces, and they are simply forwarded by the collector.

For services belonging to team_a, we apply more complicated logic, as described in the subsequent rules below # BEGIN: policies for team_a.

@rpriyanshu9
Copy link
Contributor

@sarincasm thanks for explaining.

Btw, in rule no. 2, is the sampling percentage correct?

The description says that only 1% of traces are to be sampled, but the value given in the code is 0.1.

Should it be 1? I checked the code and it seems that the value is divided by 100. Please let me know if that is correct or not - I can fix this by opening a PR.

@sarincasm
Contributor Author

@rpriyanshu9 you're right. I think there's a typo in the rule. Rule #2 should say that the percentage is 0.1 percent. So the correct rule should read:

Sample only 0.1 percent of Readiness/liveness probes
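For illustration, this is how the value reads in a probabilistic sub-policy, where sampling_percentage is expressed as a percentage in the 0-100 range (the policy name is made up):

{
  name: probe-sampling-policy,
  type: probabilistic,
  # 0.1 here means 0.1 percent of the matching traces, not 10 percent.
  probabilistic: { sampling_percentage: 0.1 },
}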

@rahjesh-vunet

Consider this requirement.

  • For all services,
    • sample all error traces
  • For all services,
    • sample all traces which are having duration > 5000 ms
  • For service A
    • Sample 10% of the remaining traces
  • For service B
    • Sample 20% of the remaining traces
  • For all other services
    • Sample only 5% of the remaining traces.

Is this achievable?

I can have the following policies.

  1. status_codes policy for ERROR
  2. latency policy with threshold_ms > 5000
  3. AND policy for service.name=A and probabilistic=10%
  4. AND policy for service.name=B and probabilistic=20%
  5. AND policy for service.name=A,B inverted=true and probabilistic=5%

With these policies, due to policy #5, the traces from Service A and Service B would always end up not sampled, as they result in "Inverted Not Sampled" there. Is that right?

Is there a way to achieve the tail sampling requirements mentioned above?
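For reference, a sketch of what policy #5 from the list above might look like (the policy names are made up; the service names A and B and the 5% rate come from the description):

{
  name: other-services-policy,
  type: and,
  and:
    {
      and_sub_policy:
        [
          {
            name: not-service-a-or-b-policy,
            type: string_attribute,
            # invert_match makes this sub-policy match every service
            # other than A and B.
            string_attribute:
              { key: service.name, values: [A, B], invert_match: true },
          },
          {
            name: five-percent-policy,
            type: probabilistic,
            probabilistic: { sampling_percentage: 5 },
          },
        ],
    },
}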
