Explore Traces Homepage #279

joey-grafana · 2024-12-19T09:35:39Z

In this PR we have added a home page for Explore Traces! 🥳 🚀

The homepage allows you to get a quick overview of your services without the need to add filters or select your desired RED metric.

Through this view you can easily see your services with errors and also your slowest services: two key areas that we constantly strive to monitor.

Keep in mind this is only v1 of the homepage; there is much more planned! Thanks @nadinevehling for all your work!

alexbikfalvi · 2025-01-02T14:43:46Z

Great work! Leaving my comments from our discussion here.

For Errored services, the list of traces is not always the same when refreshing the page. Is it possible to make it consistent and always show the last 10?
The "Trace name" label might confuse users about what is being shown (service name and resource name); users have mentioned that they sometimes struggle to know what's being displayed. Would it make sense to be more specific (say "Service name" and "??? name")?
When there is no Tempo data source, "Errored services" shows a "Data source error" and "Slow services" is missing entirely, which looks like a bug. As a low-effort fix, we could customize the error message (e.g. "There are no Tempo data sources" instead of "Data source error").
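A minimal sketch of the low-effort fix, assuming both panels can derive their state from one shared helper. The `PanelState` type, the `getPanelState` name and the idea of passing in a Tempo data source count are hypothetical, not the app's actual API. Routing both panels through the same function is one way to stop "Errored services" and "Slow services" from diverging:

```typescript
// Hypothetical panel state shared by "Errored services" and "Slow services".
type PanelState =
  | { kind: 'error'; message: string }
  | { kind: 'loading' }
  | { kind: 'ready' };

// Both panels call this, so they render the same state for the same inputs.
function getPanelState(tempoDataSourceCount: number, isLoading: boolean, error?: string): PanelState {
  // Missing data source gets its own, more specific message.
  if (tempoDataSourceCount === 0) {
    return { kind: 'error', message: 'There are no Tempo data sources' };
  }
  // Any other query failure falls back to the generic message.
  if (error) {
    return { kind: 'error', message: 'Data source error' };
  }
  return isLoading ? { kind: 'loading' } : { kind: 'ready' };
}
```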

To discuss, for the next iteration:

While the list of "Slow services" helps with getting an overview of the top 10 slowest services/methods, I feel that the list of "Errored services" is much more limited: we currently show at most 10 relatively random errors (they are not even the most recent). When the error rate is high, this list doesn't feel meaningful beyond being a starting point to see some errors.

joey-grafana · 2025-01-02T16:02:22Z

> For Errored services, the list of traces is not always the same when refreshing the page. Is it possible to make it consistent and always show the last 10?

I think we will need the topk API for getting the top erroring services @joe-elliott @mdisibio (something we can add frontend support for in the next iteration).

> The "Trace name" label might confuse users about what is being shown (service name and resource name); users have mentioned that they sometimes struggle to know what's being displayed. Would it make sense to be more specific (say "Service name" and "??? name")?

Service name probably makes more sense here.

> When there is no Tempo data source, "Errored services" shows a "Data source error" and "Slow services" is missing entirely, which looks like a bug. As a low-effort fix, we could customize the error message (e.g. "There are no Tempo data sources" instead of "Data source error").

Yes I can update this.

adrapereira · 2025-01-02T16:13:46Z

> For Errored services, the list of traces is not always the same when refreshing the page. Is it possible to make it consistent and always show the last 10?

> I think we will need the topk API for getting the top erroring services @joe-elliott @mdisibio (something we can add frontend support for in the next iteration).

@alexbikfalvi @joey-grafana

The results aren't consistent because these are search queries that return traces. An easy way to make them consistent is to switch to an instant metrics query, like `{nestedSetParent < 0 && status = error} | count_over_time() by (resource.service.name)`. The only downside is that we'll lose the data for the "Since" column, but it can be replaced with the count of errors for that service.
⚠️ Instant queries aren't currently supported by the datasource, so we would have to "hack" it and set the step field to be the same as the time range, but this is something that we already do elsewhere in the app.
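A rough sketch of the step "hack", assuming the query object only needs a TraceQL string and a step duration. The `TempoMetricsQuery` shape, its field names, the `"3600s"` step format and `buildInstantErrorQuery` are illustrative assumptions, not the data source's confirmed types. Setting the step to the full time range should collapse each series to roughly one data point:

```typescript
// Hypothetical query shape: only the fields this sketch needs. The real Tempo
// data source query type may name or structure these differently.
interface TempoMetricsQuery {
  refId: string;
  query: string;
  step: string; // assumed duration string, e.g. "3600s"
}

const ERRORS_BY_SERVICE =
  '{nestedSetParent < 0 && status = error} | count_over_time() by (resource.service.name)';

// Emulates an instant query: a step as long as the whole range [fromMs, toMs]
// should leave (roughly) one data point per service series.
function buildInstantErrorQuery(fromMs: number, toMs: number): TempoMetricsQuery {
  // Round up to whole seconds and never go below 1s.
  const rangeSeconds = Math.max(1, Math.ceil((toMs - fromMs) / 1000));
  return { refId: 'A', query: ERRORS_BY_SERVICE, step: `${rangeSeconds}s` };
}
```

For a one-hour range (`fromMs = 0`, `toMs = 3600000`) this yields `step: "3600s"`.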

adrapereira

Nice addition to the app! Left a few comments in the code on some small things I noticed.

As Alex mentioned, the errors panel handles the loading state differently from the duration panel; the duration panel seems to disappear.

As mentioned in another comment, we're using search queries here and I'm not sure that's the best option. Let's consider using metrics queries to see what feels best before we make a final decision on what to ship.

src/pages/Home/Home.tsx

src/components/Home/AttributePanelRows.tsx

src/components/Home/AttributePanelScene.tsx

src/components/Home/HeaderScene.tsx

joey-grafana · 2025-01-06T08:53:33Z

For a query like `{nestedSetParent < 0 && status = error} | count_over_time() by (resource.service.name, name, trace:id)`, the instant metrics query returns thousands of frames (what's the best way to limit this?). We could count the errors, but the counts are often 0 (apparently the root span isn't counted, only its child spans), and sorting by errors doesn't help either, because each trace always has 0 or 1 errors (locally and in dev; I'm not sure how often there is more than one error in prod).

I wonder whether the best move would be to group by span:id as well and then combine all errors from each trace.
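One way to cap the result size on the client, sketched under the assumption that each returned frame has already been reduced to a service name and an error count (`ErrorSeries` and `topKByErrors` are hypothetical names). Sorting with a deterministic tie-breaker also keeps the top 10 stable across refreshes; a server-side topk would avoid transferring the extra series in the first place:

```typescript
// Simplified stand-in for one series returned by the instant metrics query.
interface ErrorSeries {
  serviceName: string;
  count: number;
}

// Returns the k series with the highest error count. Ties are broken by service
// name so the order is stable across refreshes. The input array is not mutated.
function topKByErrors(series: ErrorSeries[], k = 10): ErrorSeries[] {
  return [...series]
    .sort((a, b) => b.count - a.count || a.serviceName.localeCompare(b.serviceName))
    .slice(0, k);
}
```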

adrapereira · 2025-01-06T13:51:54Z

> For a query like `{nestedSetParent < 0 && status = error} | count_over_time() by (resource.service.name, name, trace:id)`, the instant metrics query returns thousands of frames
> I wonder whether the best move would be to group by span:id as well and then combine all errors from each trace.

Grouping by trace ID or span ID isn't ideal: it will increase the load on Tempo, make the response huge and, as you noticed, won't be the easiest to work with in the app.
We only need the trace IDs to link to specific traces, right? Here are two proposals:

1. We don't use trace IDs anymore, and when the user clicks the link we only filter the app by that service name / span name.
2. We default to option 1, but if there are exemplars we use them to link to a specific trace.

Since we're running a metrics query, we're working on top of aggregated data, so grouping by ID doesn't make sense IMO: it defeats the purpose of aggregating the data. If we want example trace IDs, exemplars are the best option.
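A minimal sketch of both proposals in one link builder. The base URL, the `service` and `traceId` parameter names and `buildErrorRowLink` itself are hypothetical and do not reflect the app's real URL scheme:

```typescript
// Hypothetical link builder: the base URL and the `service` / `traceId`
// parameter names are placeholders, not the app's real URL scheme.
function buildErrorRowLink(baseUrl: string, serviceName: string, exemplarTraceId?: string): string {
  // Proposal 1: always filter the app by the row's service name.
  const params = [`service=${encodeURIComponent(serviceName)}`];
  // Proposal 2: if the metrics response carried an exemplar, also link to that trace.
  if (exemplarTraceId) {
    params.push(`traceId=${encodeURIComponent(exemplarTraceId)}`);
  }
  return `${baseUrl}?${params.join('&')}`;
}
```

Without an exemplar, `buildErrorRowLink('/explore', 'checkout')` returns `/explore?service=checkout`; with `'abc123'` as the exemplar trace ID it returns `/explore?service=checkout&traceId=abc123`.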

src/components/Home/DurationAttributePanel.tsx

src/utils/utils.ts

joey-grafana · 2025-01-08T14:24:37Z

Thanks for all the suggestions (which should be covered in the latest commits). This is what the UI looks like.

adrapereira

Looking better! Still have a few more comments before approving.

adrapereira · 2025-01-08T17:02:11Z

src/components/Home/AttributePanelRows.tsx

+    if (type === 'errors') {
+      const getLabel = (df: DataFrame) => {
+        const valuesField = df.fields.find((f) => f.name !== 'time');
+        return valuesField?.labels?.['resource.service.name'].slice(1, -1) /* remove quotes */ ??  'Service name not found';


`.slice(1, -1)` is too brittle: if we remove the quotes somewhere else, this will eat into the service name. `.replace(/"/g, '')` is safer.
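A sketch of a safer version of the snippet above, using simplified stand-ins for the Grafana `DataFrame` and `Field` types (only the properties the snippet touches). Besides stripping quotes with a global regex, it optional-chains the label lookup itself: in the original, a missing `resource.service.name` label makes `.slice` throw before the `??` fallback can apply. `getServiceLabel` is a hypothetical name:

```typescript
// Minimal stand-ins for the Grafana DataFrame shapes used in the review snippet.
interface Field {
  name: string;
  labels?: Record<string, string>;
}
interface DataFrame {
  fields: Field[];
}

const SERVICE_NAME_LABEL = 'resource.service.name';

// Returns the service name label without quotes, or a fallback message.
function getServiceLabel(df: DataFrame): string {
  const valuesField = df.fields.find((f) => f.name !== 'time');
  // Optional-chain the label itself so a missing label yields undefined instead of throwing.
  const raw = valuesField?.labels?.[SERVICE_NAME_LABEL];
  // A global regex strips every double quote; an unquoted value is left intact.
  const cleaned = raw?.replace(/"/g, '');
  return cleaned || 'Service name not found';
}
```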

adrapereira · 2025-01-08T17:22:32Z

src/components/Home/AttributePanel.tsx

+                let yBuckets = data.data?.series.map((s) => parseFloat(s.fields[1].name)).sort((a, b) => a - b);
+                if (yBuckets?.length) {
+                  const slowestBuckets = Math.floor(yBuckets.length / 4);
+                  let minBucket = yBuckets.length - slowestBuckets - 1;
+                  if (minBucket < 0) {
+                    minBucket = 0;
+                  }
+
+                  const minDuration = yBucketToDuration(minBucket - 1, yBuckets);


This is copied code, right? It could likely be refactored into a reusable function that finds the slowest threshold.
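A possible extraction of the duplicated logic that keeps the snippet's arithmetic unchanged, including the `minBucket - 1` offset it passes on. `yBucketToDuration` is not shown in the diff, so it is taken as a callback here, and the caller would pass something like `series.map((s) => s.fields[1].name)` as the field names. `getMinBucketIndex`, `getSlowestThreshold` and the callback signature are assumptions:

```typescript
// Assumed to match how the snippet calls yBucketToDuration(index, yBuckets);
// the real helper's signature is not shown in the diff.
type BucketToDuration = (bucketIndex: number, yBuckets: number[]) => string;

// Index of the first bucket in the slowest quarter, clamped at 0 (same arithmetic as the snippet).
function getMinBucketIndex(bucketCount: number): number {
  const slowestBuckets = Math.floor(bucketCount / 4);
  return Math.max(bucketCount - slowestBuckets - 1, 0);
}

// Parses bucket boundaries from field names, sorts them ascending and converts
// the threshold bucket to a duration string; undefined when there are no buckets.
function getSlowestThreshold(fieldNames: string[], toDuration: BucketToDuration): string | undefined {
  const yBuckets = fieldNames.map((name) => parseFloat(name)).sort((a, b) => a - b);
  if (yBuckets.length === 0) {
    return undefined;
  }
  const minBucket = getMinBucketIndex(yBuckets.length);
  // Keeps the snippet's `minBucket - 1` offset as-is.
  return toDuration(minBucket - 1, yBuckets);
}
```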

adrapereira · 2025-01-08T17:24:11Z

src/components/Home/AttributePanel.tsx

+                      children: [
+                        new AttributePanel({ 
+                          query: {
+                            query: `{nestedSetParent<0 && kind=server && duration > ${minDuration}} | by (resource.service.name)`,


`| by (resource.service.name)` isn't doing anything, since this is a search query and not a metrics query.
Also, this will suffer from the same issue the errors table had: the results won't be consistent across refreshes because it's a search query.
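Following the same idea as the errors panel, a hypothetical helper that turns the slow-services search into a metrics query. With `count_over_time()` in the pipeline, `by (resource.service.name)` actually aggregates per service; it would still need the instant-query step workaround mentioned earlier to return one value per service. `buildSlowServicesQuery` is an illustrative name:

```typescript
// Hypothetical helper: rewrites the slow-services search as a metrics query.
function buildSlowServicesQuery(minDuration: string): string {
  // count_over_time() makes the `by (...)` clause meaningful: spans slower than
  // minDuration are counted per service instead of returned as individual traces.
  return `{nestedSetParent<0 && kind=server && duration > ${minDuration}} | count_over_time() by (resource.service.name)`;
}
```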

joey-grafana added 17 commits November 27, 2024 13:08

Homepage

ec44a14

Navigation and error checking

bd7838d

Several improvements

81ef65a

Card design

e819a64

Add skeleton component for loading state

a8ccf7c

Light theme styling

73f5097

Overall styling and responsiveness

3ae2c37

Fix conflcit in routes

0ee4a71

Lazy load home page

66fb603

Feature tracking

eee3048

Remove preview badge

39995be

Move rockets

be25180

AttributePanelRows

ba79ae9

Loading, error, empty states

9bfef1b

Update url link

5ef2afd

Improve styling

0e9f488

Improve skeleton styling

ff6e905

joey-grafana added the enhancement and area/frontend labels Dec 19, 2024

joey-grafana self-assigned this Dec 19, 2024

Fix cspell

15b171a

joey-grafana marked this pull request as ready for review January 2, 2025 09:16

joey-grafana requested a review from a team January 2, 2025 09:16

adrapereira requested changes Jan 2, 2025

joey-grafana added 3 commits January 2, 2025 18:54

Utils and styling

079e6b2

Update url

9d81be6

Update error messages and improvements

1fd557d

ifrost reviewed Jan 7, 2025

src/components/Home/DurationAttributePanel.tsx (outdated, resolved)

src/utils/utils.ts (resolved)

joey-grafana added 6 commits January 8, 2025 08:56

Reuse AttributePanel for duration

c58d17d

Update icon

cd77048

Remove DurationAttributePanel file

e3fe498

Add AttributePanelRow

42dc261

Update icon hover

fe0c848

Tests and improvements

2be0320

adrapereira requested changes Jan 8, 2025
