Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore Traces Homepage #279

Open
wants to merge 27 commits into
base: main
Choose a base branch
from
Open

Explore Traces Homepage #279

wants to merge 27 commits into from

Conversation

joey-grafana
Copy link
Collaborator

In this PR we have added a home page for Explore Traces! 🥳 🚀

The homepage allows you to get a quick overview of your services without the need to add filters or select your desired RED metric.

Through this view you can easily see your services with errors and also your slowest services: two key areas that we constantly strive to monitor.

Keep in mind this is only a v1 for the homepage, there is much more planned! Thanks @nadinevehling for all your work!

Screenshot 2024-12-19 at 09 26 16

@joey-grafana joey-grafana added enhancement New feature or request area/frontend labels Dec 19, 2024
@joey-grafana joey-grafana self-assigned this Dec 19, 2024
@joey-grafana joey-grafana marked this pull request as ready for review January 2, 2025 09:16
@joey-grafana joey-grafana requested a review from a team January 2, 2025 09:16
@alexbikfalvi
Copy link

alexbikfalvi commented Jan 2, 2025

Great work, leaving here my comments that we discussed.

  • For Errored services, the the list of traces is not always the same when refreshing the page. Is it possible to be consistent and show always the last 10?
  • Trace name might confuse users on what is being shown (service name and resource name), as users have mentioned that sometimes they struggle to know what's being displayed. Would it make sense to be more specific (say "Service name" and "??? name")?
  • When there is no Tempo data source we show the following ("Errored services" has a "Data source error" and "Slow services" is missing, which looks like a bug). As a low effort fix, we could customize the error message (e.g. "There are no Tempo data sources" instead of "Data source error").
    Screenshot 2025-01-02 at 3 38 33 PM

To discuss, for the next iteration:

  • While the list of "Slow services" helps with getting an overview of top 10 slowest services/methods, I feel that the list of "Errored services" is much more limited: we currently show at most 10 relatively random errors (they are not even the most recent). When the error rate is high, it's doesn't feel like this list is meaningful beyond just being the starting point to see some errors.

@joey-grafana
Copy link
Collaborator Author

For Errored services, the the list of traces is not always the same when refreshing the page. Is it possible to be consistent and show always the last 10?

I think we will need the topk API for getting the top erroring services @joe-elliott @mdisibio (something that we add frontend support for in the next iteration).

Trace name might confuse users on what is being shown (service name and resource name), as users have mentioned that sometimes they struggle to know what's being displayed. Would it make sense to be more specific (say "Service name" and "??? name")?

Service name probably makes more sense here.

When there is no Tempo data source we show the following ("Errored services" has a "Data source error" and "Slow services" is missing, which looks like a bug). As a low effort fix, we could customize the error message (e.g. "There are no Tempo data sources" instead of "Data source error").

Yes I can update this.

@adrapereira
Copy link
Collaborator

adrapereira commented Jan 2, 2025

For Errored services, the the list of traces is not always the same when refreshing the page. Is it possible to be consistent and show always the last 10?

I think we will need the topk API for getting the top erroring services @joe-elliott @mdisibio (something that we add frontend support for in the next iteration).

@alexbikfalvi @joey-grafana

The results aren't consistent because these are search queries that are returning traces. An easy way to make these results consistent is to change it to an instant metric query, like this {nestedSetParent < 0 && status = error} | count_over_time() by (resource.service.name). The only downside is that we'll lose the data for the "Since" column, but it can be replaced with the count of errors for that service.
⚠️ Instant queries aren't currently supported by the datasource, so we would have to "hack" it and set the step field to be the same as the time range, but this is something that we already do elsewhere in the app.

Copy link
Collaborator

@adrapereira adrapereira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition to the app! Left a few comments in the code on some small things I noticed.

As Alex mentioned, the errors panel seems to react differently to the loading state than the duration panel, which seems to disappear.

As mentioned in another comment we're using search queries here and I'm not sure if that's the best option. Let's consider using metric queries to see what feels best before we make a final decision on what to ship.

src/pages/Home/Home.tsx Outdated Show resolved Hide resolved
src/pages/Home/Home.tsx Outdated Show resolved Hide resolved
src/components/Home/AttributePanelRows.tsx Outdated Show resolved Hide resolved
src/components/Home/AttributePanelScene.tsx Show resolved Hide resolved
src/components/Home/HeaderScene.tsx Outdated Show resolved Hide resolved
@joey-grafana
Copy link
Collaborator Author

For a query like {nestedSetParent < 0 && status = error} | count_over_time() by (resource.service.name, name, trace:id) there are thousands of frame results for instant metrics (what's the best way to limit this?). We could count the errors, but they're often 0 (as root span apparently not counted, only child spans of root), and then sort by errors but each trace always has 0 or 1 errors (with local & dev - not sure how often in prod there are more than one error).

I wonder would the best move be to group by span:id as well and then combine all errors from each trace.

@adrapereira
Copy link
Collaborator

For a query like {nestedSetParent < 0 && status = error} | count_over_time() by (resource.service.name, name, trace:id) there are thousands of frame results for instant metrics
I wonder would the best move be to group by span:id as well and then combine all errors from each trace.

Grouping by trace ID or span ID isn't ideal, it will increase the load on Tempo, make the return message huge and as you noticed won't be the easiest to work with on the app.
We only need the trace IDs to link to specific traces right? Here's two proposals:

  1. We don't use trace IDs anymore and when the user clicks the link we only filter the app for that service name / span name.
  2. We default to 1. but if there are exemplars we use them to link to a specific trace.

Since we're running a metrics query we're working on top of aggregated data, so it doesn't make sense to group by ID IMO, it defeats the purpose of the data aggregation in this case. The best we can use are the exemplars if we want examples of trace IDs.

src/components/Home/DurationAttributePanel.tsx Outdated Show resolved Hide resolved
src/utils/utils.ts Show resolved Hide resolved
@joey-grafana
Copy link
Collaborator Author

Thanks for all the suggestions (which should be covered in the last commits). This is what the UI looks like.

Screenshot 2025-01-08 at 14 23 40

Copy link
Collaborator

@adrapereira adrapereira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking better! Still have a few more comments before approving.

if (type === 'errors') {
const getLabel = (df: DataFrame) => {
const valuesField = df.fields.find((f) => f.name !== 'time');
return valuesField?.labels?.['resource.service.name'].slice(1, -1) /* remove quotes */ ?? 'Service name not found';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.slice(1,-1) is too brittle, if we remove the quotes somewhere else this will eat away the service name. replace('"', "") is safer.

Comment on lines +76 to +84
let yBuckets = data.data?.series.map((s) => parseFloat(s.fields[1].name)).sort((a, b) => a - b);
if (yBuckets?.length) {
const slowestBuckets = Math.floor(yBuckets.length / 4);
let minBucket = yBuckets.length - slowestBuckets - 1;
if (minBucket < 0) {
minBucket = 0;
}

const minDuration = yBucketToDuration(minBucket - 1, yBuckets);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied code right? Could likely be refactored to a function that could be reused to find the slowest threshold.

children: [
new AttributePanel({
query: {
query: `{nestedSetParent<0 && kind=server && duration > ${minDuration}} | by (resource.service.name)`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| by(resource.service.name) isn't doing anything since this is a search query and not a metrics query.
Also, this will suffer of the same issues as the errors table had, the results won't be consistent across refreshes since it's a search query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/frontend enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants