Explore adding table ids tags to some metrics #4511

keith-turner · 2024-05-01T14:28:37Z

Is your feature request related to a problem? Please describe.

The monitor currently has a custom metrics system. This custom system in the monitor tracks certain metrics per tablet server per table id. The monitor currently lacks any metrics for scan servers that are similar to the current monitor tablet server metrics. There are metrics being emitted by scan servers that almost have enough information to construct a scan server metrics dashboard in an external metrics system. However, the lack of table id information prevents having parity with the tablet server view in the monitor.

In general, if table ids were added to some subset of tablet server and scan server metrics then it could allow an external metrics system to achieve parity with the metrics offered by the monitor. The metrics in the monitor that do this are primarily concerned with table read, write, and compactions.

Describe the solution you'd like

For a subset of metrics related to reading and writing data, tag them with table ids. This would likely result in a Map<TableId, Meter> in the code. This map would need to be able to add new meters as new table ids are encountered and remove table ids that have not been used in a while. The meters in this map would have the table id tag.
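A minimal sketch of the map described above, using only the JDK. It is an illustration under stated assumptions, not the Accumulo implementation: the class name `PerTableCounters`, the idle timeout, and the use of a `LongAdder` as a stand-in for a tagged Micrometer meter are all hypothetical. Real code would register a meter with the table id tag when an entry is created and remove it from the registry when the entry is evicted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Per-table counters created on first use and dropped after an idle period. */
final class PerTableCounters {

  /** Stand-in for a tagged meter: a running count plus the time it was last touched. */
  private static final class Entry {
    final LongAdder count = new LongAdder();
    volatile long lastUsedMillis;
  }

  private final Map<String, Entry> byTableId = new ConcurrentHashMap<>();
  private final long idleTimeoutMillis;

  PerTableCounters(long idleTimeoutMillis) {
    this.idleTimeoutMillis = idleTimeoutMillis;
  }

  /** Records an event for a table, creating its entry the first time the id is seen. */
  void increment(String tableId, long nowMillis) {
    Entry e = byTableId.computeIfAbsent(tableId, id -> new Entry());
    e.count.increment();
    e.lastUsedMillis = nowMillis;
  }

  /** Returns the current count for a table, or 0 if the table has no entry. */
  long count(String tableId) {
    Entry e = byTableId.get(tableId);
    return e == null ? 0 : e.count.sum();
  }

  /**
   * Drops entries not touched within the idle timeout and returns how many were removed.
   * Assumes a single caller, such as one scheduled task; the returned count is only
   * exact when no other thread adds entries during the call.
   */
  int evictIdle(long nowMillis) {
    int before = byTableId.size();
    byTableId.values().removeIf(e -> nowMillis - e.lastUsedMillis > idleTimeoutMillis);
    return before - byTableId.size();
  }
}
```

An entry that is evicted at the same moment it is being incremented would lose that update, so a real implementation would need to decide whether that race matters for the metric in question.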

Describe alternatives you've considered

Do not try to achieve feature parity in external metrics and custom monitor metrics.

EdColeman · 2024-05-01T15:00:54Z

Other than well defined Accumulo system tables (root, metadata,...) it is probably not a good idea to add unbounded tags. Systems that have large numbers of dynamic tables will have issues.

keith-turner · 2024-05-01T15:22:41Z

The following may be relevant if something is done for this issue.

https://docs.micrometer.io/micrometer/reference/concepts/meter-provider.html

keith-turner · 2024-05-01T15:27:04Z

Other than well defined Accumulo system tables (root, metadata,...) it is probably not a good idea to add unbounded tags. Systems that have large numbers of dynamic tables will have issues.

Yeah, there is a big difference there. The monitor metrics only track the tables that currently exist; they do not track any deleted tables. Adding a table id tag for external metrics would cause the metrics system to have data for tables that no longer exist. So that is a huge difference. If anything is done towards this, we would probably need to only add tags for tables that are specified by a user. So we could provide a configurable list of tables to tag and only add tags for tables in that list.

dlmarion · 2024-05-01T15:29:59Z

https://docs.micrometer.io/micrometer/reference/concepts/meter-provider.html

I saw this as well. Additionally, there is a MultiGauge.

If we are going to do something that will dynamically create Meters, then we need to be sure to remove them (MeterRegistry.remove). Otherwise the process will continue to report the metrics for its lifetime.

keith-turner · 2024-05-01T15:31:10Z

It may be more useful to tag the metrics with a table name instead of a table id.

EdColeman · 2024-05-01T15:34:39Z

One way to think of it is that each tag will create a unique time-series in some (most?) back ends. So, in addition to the number of metrics that are reported each interval, there can be negative impact to the collection / storage / display systems.
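To make the growth concrete, here is a small worked calculation. The figures used in the note below (20 metrics, 50 servers, 200 tables) are hypothetical and chosen only for illustration. The formula is an upper bound that assumes every server reports every metric for every tag value.

```java
final class SeriesEstimate {

  /** Upper bound on time series when each server reports each metric for each tag value. */
  static long seriesCount(long metrics, long servers, long tagValues) {
    return Math.multiplyExact(Math.multiplyExact(metrics, servers), tagValues);
  }
}
```

Under these assumptions, `seriesCount(20, 50, 1)` is 1,000 series without a table tag, while `seriesCount(20, 50, 200)` is 200,000 series once each of 200 tables gets its own tag value.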

EdColeman · 2024-05-01T15:50:29Z

Would something like tracing work? If we could activate / de-activate tracing on-demand, then it seems that maybe we could collect the needed values for profiling and then turn them off when not being used? No idea what it would look like at this point.

dlmarion · 2024-05-01T17:33:20Z

Would something like tracing work?

If tracing provided the level of information needed, then it could be enabled on a subset of the scan servers / tablet servers via settings in accumulo-env.sh. If there are problems with specific queries, those queries could be directed to specific scan servers via the resource group mechanism.

keith-turner · 2024-05-01T18:02:39Z

One way to think of it is that each tag will create a unique time-series in some (most?) back ends. So, in addition to the number of metrics that are reported each interval, there can be negative impact to the collection / storage / display systems.

Yeah, I think for this to be workable would need a user provided list of tables to tag. This could be a metric property with a value of a list of table names or it could be a per table property that is a boolean. So this would allow control over which tables are tagged.

Would something like tracing work?

I think that would be out of scope of this issue. This issue could be closed if it's not something that seems workable. This issue was about supporting current functionality of the monitor in an external metrics system. Currently in the monitor, if a table is clicked on it will show metrics that are per table and per tserver. Also, in the tables view in the monitor you can see metrics like how many compactions are queued for a table and what the current ingest rate is for a table.

dlmarion · 2024-05-01T18:08:58Z

Yeah, I think for this to be workable would need a user provided list of tables to tag. This could be a metric property with a value of a list of table names or it could be a per table property that is a boolean. So this would allow control over which tables are tagged.

The code needs to be responsive to changes in the list so that processes don't need to be restarted for metrics to be enabled / disabled.
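One way this could look, sketched with only the JDK. The class name `TaggedTables` and the comma-separated property format are assumptions made for illustration; they are not existing Accumulo properties or classes. The set is swapped atomically on each update, so metric code that consults it picks up changes without a restart.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

/** Holds the user-provided list of tables to tag; can be updated while the process runs. */
final class TaggedTables {

  // Replaced wholesale on each update so readers always see a consistent set.
  private volatile Set<String> tagged = Set.of();

  /** Parses a comma-separated property value such as "orders, events". */
  void update(String propertyValue) {
    tagged = Arrays.stream(propertyValue.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toUnmodifiableSet());
  }

  /** Returns the tag value to use for a table, or empty if it should not be tagged. */
  Optional<String> tagFor(String tableName) {
    return tagged.contains(tableName) ? Optional.of(tableName) : Optional.empty();
  }
}
```

A caller would invoke `update` whenever the configuration changes. Meters already registered for tables that drop out of the list would still need to be removed from the registry separately.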

EdColeman · 2024-05-01T18:11:19Z

For example guidance, the Prometheus label guidelines provide the following:

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

If you have a metric that has a cardinality over 100 or the potential to grow that large, investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring and to a general-purpose processing system.

dlmarion · 2024-05-01T18:15:10Z

For example guidance, the Prometheus label guidelines provide the following:

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

If you have a metric that has a cardinality over 100 or the potential to grow that large, investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring and to a general-purpose processing system.

These guidelines aren't general time series database guidelines, right? They are based on the limitations of Prometheus?

EdColeman · 2024-05-01T18:19:54Z

While those are specific to Prometheus - my understanding is that they apply generally across various metric systems. There are similar limitations for things like InfluxDB and it all relates to unique tags (labels) creating a unique time-series.

dlmarion · 2024-05-01T18:23:20Z

Yeah, I get that total cardinality is an issue for most of the TSDBs, but I think limitations are different for each. My point was that we should not take the numbers from Prometheus as a hard limit.

keith-turner · 2024-06-04T16:29:34Z

Looking at #3608 prompted me to revisit this. Being able to drop the custom thrift stuff in the accumulo code that is only used by the monitor would improve the maintainability of the Accumulo code base IMO. Adding table tags does increase cardinality and that is a valid concern. Looking around there are multiple ways to deal w/ cardinality.

Micrometer supports filters that can be set on a registry. The filter name is a bit misleading because they also support transformation. By setting filters in a MeterRegistryFactory implementation, it would be possible to drop tableId tags or reduce their cardinality (for example, reduce them to two values, "system" or "user"). The LoggingMeterRegistryFactory could have an option for dropping table id tags as a form of documentation.
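The "system" or "user" reduction can be sketched as a plain function from tag value to tag value. The class name `TableIdCollapser` is hypothetical, and the set of system table ids is an assumption that lists only the root (`+r`) and metadata (`!0`) tables. A function of this shape could be passed to Micrometer's `MeterFilter.replaceTagValues` for the tableId tag, while `MeterFilter.ignoreTags` would drop the tag entirely.

```java
import java.util.Set;
import java.util.function.Function;

/** Collapses table ids into two tag values to bound cardinality. */
final class TableIdCollapser {

  // Assumption: only the root ("+r") and metadata ("!0") table ids are treated as system.
  private static final Set<String> SYSTEM_TABLE_IDS = Set.of("+r", "!0");

  /** Maps a table id to "system" or "user". */
  static final Function<String, String> COLLAPSE =
      tableId -> SYSTEM_TABLE_IDS.contains(tableId) ? "system" : "user";
}
```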

Another way too many table ids could be handled is on ingest into the metrics system. Prometheus has relabeling as described here and here that can reduce tag cardinality on ingest. Thought influxdb had a similar capability, but could not find it.

keith-turner · 2024-10-24T20:31:30Z

QueueMetrics could serve as a potential model for per table metrics.

keith-turner · 2024-10-26T19:02:30Z

Experimented with adding tableId tags to metrics in this branch. The branch is a mess because I got sidetracked by micrometer-metrics/micrometer#5607 during this experiment, which caused the following problem.

Tserver creates a registry that ignores tableId tags using the SPI plugin and then wraps this registry with a CompositeMeterRegistry
Tserver hosts a tablet for tableId=1 and creates meters for tableId=1
Tserver hosts a tablet for tableId=2 and creates meters for tableId=2
The tablet for tableId=1 is unloaded from the tserver and there are no other tablets for tableId=1, so the meters for tableId=1 are removed. Because of micrometer issue 5607, after this removal no metrics are ever seen again for tableId=2.

Trying to figure out how this could be worked around and/or fixed upstream. Want to wait and see if the bug is considered valid before attempting an upstream fix. If we can not figure out a workaround, this bug may prevent adding tableId tags to metrics because we must have the ability to filter out tableId tags. Would only run into this bug if filtering and/or collapsing the tableId tag using a MeterFilter.

keith-turner · 2024-10-29T02:20:54Z

Looked into finding a workaround for the behavior of CompositeMeterRegistry and have not found one so far. Did notice the Accumulo code could sometimes avoid using CompositeMeterRegistry and opened #5021. That may open up a hacky workaround of only adding tableId tags when not using CompositeMeterRegistry, which is far from optimal.

keith-turner · 2024-10-30T15:05:29Z

While researching this I found another problem with using MeterFilters to remove tableId tags. The problem is that Gauge meters are not properly aggregated. If there are two Gauge meters and tags are removed, collapsing them into a single meter, then the collapsed meter will only pull data from one of the gauges. Given these two problems with using meter filters to remove tableId tags, the only way forward I can see is to have an Accumulo property that enables/disables per table tags on metrics. I think the implementation of this could be pretty straightforward and I am going to try to prototype it.

dlmarion · 2024-10-30T15:28:05Z

What if you had a Thread in the tserver that maintained the aggregate Meters and removed them from the MeterRegistry when no tablets for the table are being hosted?

keith-turner · 2024-10-30T16:40:07Z

What if you had a Thread in the tserver that maintained the aggregate Meters and removed them from the MeterRegistry when no tablets for the table are being hosted?

I tried to do something like this and ran into problems with MeterFilters. I was trying to support the following functionality in my initial experiment.

Accumulo has some per table meters. When a tablet server or scan server no longer has tablets for a table it will eventually remove meters related to that table from the registry. This removal was done by a scheduled task in the experimental branch.
If a user does not want per table meters, then they could use MeterFilter.ignoreTags() to remove the tableId tags when setting up their registries.

While trying to support the above workflow I found two problems with MeterFilter.ignoreTags(). First was micrometer issue 5607 which caused metrics to be lost. The second issue I found is the one I mentioned with two or more gauges in an earlier comment where data is lost.

keith-turner · 2024-10-31T19:30:51Z

Opened micrometer-metrics/micrometer#5616 about the other problem I ran into while attempting to use MeterFilters to drop table ids. Looking into this a bit more we should be using Counter meters instead of Gauge meters in the code when possible, will open an issue about this.

keith-turner added the enhancement label May 1, 2024

ddanielr added this to Accumulo: Observability Jun 10, 2024

keith-turner added this to the 4.0.0 milestone Jul 12, 2024

keith-turner self-assigned this Oct 25, 2024

keith-turner added a commit to keith-turner/accumulo that referenced this issue Nov 1, 2024

fixes tablet update metrics incrementing

722f32e

Some tablet update metrics were incrementing a counter by zero. Changed the increment to one. Noticed this while working on apache#4511.

This was referenced Nov 1, 2024

fixes tablet update metrics incrementing #5029

Merged

adds optional per table metrics #5030

Open

keith-turner added a commit that referenced this issue Nov 4, 2024

fixes tablet update metrics incrementing (#5029)

68e0406

Some tablet update metrics were incrementing a counter by zero. Changed the increment to one. Noticed this while working on #4511.
