Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batch: Add metrics for tasks. #3832

Closed
Tracked by #1977
liurenjie1024 opened this issue Jul 13, 2022 · 10 comments
Closed
Tracked by #1977

batch: Add metrics for tasks. #3832

liurenjie1024 opened this issue Jul 13, 2022 · 10 comments
Labels
component/batch Batch related related issue. type/feature

Comments

@liurenjie1024
Copy link
Contributor

liurenjie1024 commented Jul 13, 2022

Add task level metrics, including incoming number of rows and data size of each exchange source, outgoing number of rows and data size of each output.

@liurenjie1024 liurenjie1024 changed the title Add metrics for tasks. batch: Add metrics for tasks. Jul 13, 2022
@liurenjie1024 liurenjie1024 added type/feature component/batch Batch related related issue. labels Jul 13, 2022
@BowenXiao1999
Copy link
Contributor

BowenXiao1999 commented Aug 8, 2022

My notes for this part:

We can collect metrics for batch executor like StreamingMetrics in streaming_stats.rs.

The metrics "exchange_recv_size" and "exchange_frag_recv_size" can be a good ref for collecting the input data rows of task and out data rows of task (accumulation).
Their prs: #3696

But a seems like we can not get the input recv size for fragment contains table scan: They do not have exchange source as input.

cc @ZENOTME

@ZENOTME
Copy link
Contributor

ZENOTME commented Aug 17, 2022

The problem with this registry is that we need to clean up the registry after query finished.
There are two place we should clean up:

  • metrics in cn
  • metrics in prometheus server

for clean up job, I think we can implement it using 'range search', such as: batch_exchange_recv_row_number{query_id="aaa"} can get all item which query_id equal "aaa".
for metrics in prometheus server, it supports 'range search'.
for metrics in cn, it can't support 'range search'.

So I think we can clean metrics in prometheus first.
Seems complicated to implement clean metrics in cn, we need to record all {queryID, source_stage_id, target_stage_id, source_task_id, target_task_id}. @BowenXiao1999 @liurenjie1024

@BowenXiao1999
Copy link
Contributor

WDYM by metrics in cn? I think they are all metrics in prometheus

@ZENOTME
Copy link
Contributor

ZENOTME commented Aug 17, 2022

WDYM by metrics in cn? I think they are all metrics in prometheus

I think metrics in cn is a local data structure like this:

pub struct BatchMetrics {
    pub row_seq_scan_next_duration: Histogram,
    pub exchange_recv_row_number: GenericCounterVec<AtomicU64>,
}

And we will send this metrics to prometheus server and then store at the prometheus. (It's metrics in prometheus server?

@skyzh
Copy link
Contributor

skyzh commented Aug 17, 2022

Metrics are stored in Prometheus server, but they are not sent. Prometheus will pull metrics from each compute node.

@skyzh
Copy link
Contributor

skyzh commented Aug 17, 2022

What's the difference between metrics in CN and metrics in Prometheus?

@BowenXiao1999
Copy link
Contributor

WDYM by metrics in cn? I think they are all metrics in prometheus

I think metrics in cn is a local data structure like this:

pub struct BatchMetrics {
    pub row_seq_scan_next_duration: Histogram,
    pub exchange_recv_row_number: GenericCounterVec<AtomicU64>,
}

And we will send this metrics to prometheus server and then store at the prometheus. (It's metrics in prometheus server?

I think call .delete_label_values will delete all? Metrics in CN and in Prometheus should both be take cared by the lib, and user do not need to care the detail/cache.

@ZENOTME
Copy link
Contributor

ZENOTME commented Aug 17, 2022

What's the difference between metrics in CN and metrics in Prometheus?

For example there is a metrics in CN, like this: , (or maybe I should call it a value in prometheus client

#[derive(Debug)]
pub(crate) struct Metric {
   value:u64,
}

Metrics are stored in Prometheus server, but they are not sent. Prometheus will pull metrics from each compute node.

And as you say, and this metrics(value) will pull by prometheus and store in prometheus server.

The metrics in prometheus is pull from the metrics in CN.


I think call .delete_label_values will delete all? Metrics in CN and in Prometheus should both be take cared by the lib, and user do not need to care the detail/cache.

I look up the implementation and find delete_label_values will only delete the metrics in CN. (children.remove(&h)
I'm not sure it will sync with metrics in Prometheus.

 pub fn delete(&self, labels: &HashMap<&str, &str>) -> Result<()> {
        let h = self.hash_labels(labels)?;

        let mut children = self.children.write();
        if children.remove(&h).is_none() {   <--------------------------------children is Hash<u64,T>
            return Err(Error::Msg(format!("missing labels {:?}", labels)));
        }

        Ok(())
    }

@liurenjie1024
Copy link
Contributor Author

  1. We don't need to care about deleting metrics in promethues, it will delete by promethues server after some preconfigured interval.
  2. For deleting task level metrics, here is the changes:
    a. Maintain one BatchMetrics for each BatchTaskExecution
    b. When a batch task finished/aborted, move it to a deletion queue, which executes deletion of metrics after several minutes. It's important not to delete it immediately since promethues pull data periodically.

@BowenXiao1999
Copy link
Contributor

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/batch Batch related related issue. type/feature
Projects
None yet
Development

No branches or pull requests

4 participants