-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[improve][txn] PIP-160 Metrics stats of Transaction buffered writer #16758
Conversation
f28469a
to
5bfbd31
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where the TxnLogBufferedWriterMetricsDefinition.java
's labelNames
and labelValues
has been set, except in the tests.
@asafm Could you please help review the PR? |
Hi @tjiuming
|
/pulsarbot rerun-failure-checks |
I started writing a few comments then I realized I have some profound suggestions I thought maybe it's a better idea to raise them first. I think currently this PR in my opinion quite complicated. It took me 2 hours to get the hang of it all.
BufferedWriter is used on two occasions: Pending Ack stores, and Transaction Log. How? BufferedWriterMetricsConfiguration is the same as the definition you have: Since you said the pending ack stores will share the same labels, but won't be differentiated by the label values, you can create a single BufferedWriterMetrics instance and use it whenever you create a new Buffered writer. When a ledger is closed, its buffered writer is closed.
I have many more comments, but I thought it's best to discuss the big ones first. |
Hi @asafm
I've taken care of all the suggestions, could you review this PR again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also update the metrics documentation
@@ -272,23 +290,43 @@ private void doTrigFlush(boolean force, boolean byScheduleThreads){ | |||
return; | |||
} | |||
if (force) { | |||
if (metricsStats != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a default TxnLogBufferedWriterMetricsStatsDisabled implementation to avoid many null checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a default TxnLogBufferedWriterMetricsStatsDisabled implementation to avoid many null checks.
Good Idea. In the complete design, we should have two implementations like UML blow, one for enabling the batch feature, and another for disabled:
Sorry, I should have added some code comments in this PR(It's now written in Modifications).
To reduce later maintenance costs, I'd like to ditch the 'DisabledMetricsStat' and we'll always use the implementation 'MetricsStatimpl' even if the Txn buffer writer disables batch feature. This constructor without 'param-metricsStats' and these' null checks' will be removed in the next PR( 7-2 in Modifications ). This is compatible only with split PR, making each PR have less code
* {@link TxnLogBufferedWriterMetricsStats}, such as The Transaction Coordinator using coordinatorId as label and | ||
* The Transaction Pending Ack Store using subscriptionName as label. | ||
*/ | ||
private static final HashMap<String, Collector> COLLECTOR_CACHE = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a map here?
Looks like it only used in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a map here? Looks like it is only used in the constructor.
Yes, we need it.
To build Metrics Stat, we need execute these two steps:
- Create
Collector
and register toCollectorRegistry
, perhaps the Collector isHistogram
orCounter
- Register labels to
Collector
and getCollector.child
(holds by Metrics Stat). This step can also be omitted, because we can executecollector.labels(labelValues)
to getCollector.child
.
In the Transaction log scenario, multiple Transaction Log share the same Collector
, and each has its own Collector.Child
, so when we build metrics stat for each Transaction Log, we call collector.labels(labelValues)
to get the Collector.Child
.
( High light)
However, the CollectorRegistry does not provide an API like below, and it will throw IllegalArgumentException when we registering collector with the same name more than once.
public Collector getRegistedCollector(String name);
Throws IllegalArgumentException at this line:
So we have to manage the registered collectors ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a way to solve this issue by making sure we're not defining metrics twice:
We know that we plan to create only one Metrics instance per metic-prefix. So in that case, both TxLog and PendingAckStoreProvider will create one with its own prefix, and that's it. No need to verify it was created before. In the event a future developer will make a mistake, it will fail in the constructor in some test right since CollectorRegistry.register() will fail on a duplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @asafm
That's a good way to do it, but this can make the code confuse:
- When the Transaction Log Provider opens a Transaction Log, passed the
Histogram
to Transaction Log. That is OK. - When the Transaction Log close, remove the labels. That is ok too.
But the Transaction Log Provider will hold all the Collector
of Txn Buffered Writer, this is confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the Transaction Log Provider will hold all the Collector of Txn Buffered Writer; this is confusing
Can you explain this further? I didn't understand how the provider would hold all the Collectors (metrics). It should only be to a single instance of the metric class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @asafm
Can you explain this further? I didn't understand how the provider would hold all the Collectors (metrics). It should only be to a single instance of the metric class.
When MLTransactionMetadataStoreProvider
initialized, we create Collector
like this:
public MLTransactionMetadataStoreProvider(){
this.recordsPerBatchMetric = ...
this.batchSizeBytesMetric = ...
this.oldestRecordInBatchDelayTimeSecondsMetric = ...
this.batchFlushTriggeredByMaxRecordsMetric = ...
this.batchFlushTriggeredByMaxSizeMetric = ...
this.batchFlushTriggeredByMaxDelayMetric = ...
}
And when creating MlTransactionLogImpl
, pass these Collector
to MlTransactionLogImpl
like this:
public class MLTransactionMetadataStoreProvider{
public TransactionMetadataStore openStore(...){
TransactionMetadataStore store = ...;
setMetrics(store);
return store;
}
private void setMetrics(TransactionMetadataStore store) {
store.recordsPerBatchMetric = this.recordsPerBatchMetric;
store.batchSizeBytesMetric = this.batchSizeBytesMetric;
store.oldestRecordInBatchDelayTimeSecondsMetric = this.oldestRecordInBatchDelayTimeSecondsMetric;
...
}
}
The MLTransactionMetadataStoreProvider
will hold all the Collector of Txn Buffered Writer, this is confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either what you wrote is completely the wrong direction or I completely misunderstood you :)
I thought that in MLTransactionLogImp, when open a ledger you also create a buffered writer, thus in here you will also pass an instance of TxnLogBufferedWriterMetricsStats
.
public void openLedgerComplete(ManagedLedger ledger, Object ctx) {
MLTransactionLogImpl.this.managedLedger = ledger;
MLTransactionLogImpl.this.bufferedWriter = new TxnLogBufferedWriter<>(
managedLedger, ((ManagedLedgerImpl) managedLedger).getExecutor(),
timer, TransactionLogDataSerializer.INSTANCE,
txnLogBufferedWriterConfig.getBatchedWriteMaxRecords(),
txnLogBufferedWriterConfig.getBatchedWriteMaxSize(),
txnLogBufferedWriterConfig.getBatchedWriteMaxDelayInMillis(),
txnLogBufferedWriterConfig.isBatchEnabled());
As I understand, this only happens once per instance of MLTransactionLogImpl
. So I don't understand why you want to pass metric by metric. Something doesn't add up here.
I guess my main question is: how many MLTransactionLogImpl
are there in a single broker process? More than 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe more than one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, if more than one, then the design must change. I thought the whole idea was that you have a single instance of metrics per metric prefix.
If not, after much thought I suggest the following:
abstract class BufferMetrics {
protected abstract void observeRecordsPerBatch(int)
protected abstract void incFlushTriggeredByMaxRecords(int)
}
MLTransactionMetadataStoreBufferedWriterMetrics extends BufferMetrics {
static private Histogram recordsPerBatchMetric = new Histogram.Builder()
.name("pulsar_tx_store_bufferedwriter_batch_record_count")
.labelNames(new String[]{"txCoordinatorId"})
.help("Records per batch histogram")
.buckets(RECORD_COUNT_PER_ENTRY_BUCKETS)
.register(registry));
private Histogram.Child recordsPerBatchHistogram;
public MLTransactionMetadataStoreBufferedWriterMetrics(String txCoordinatorId) {
recordsPerBatchHistogram = recordsPerBatchHistogram.labels(txCoordinatorId)
}
protected observeRecordsPerBatch(value) {
recordsPerBatchHistogram.observe(value)
}
}
Another approach which I disliked a bit, but it's still ok:
Add to Pulsar Common:
class PrometheusRegistryChecker {
static defaultMetricRegistryNameToCollector = new HashMap<String, Collector>()
static Collector registerIfNotExists(collector) {}
}
Like FunctionCollectorRegistryImpl
I prefer the second implementation because if we need to define many variables of Collector type. And Map
is really just another representation of multiple variables.
Now common
has not the dependency: prometheus-client
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I can leave with that but just to be on the safe side I asked also for @tjiuming advice here.
…ch record must large than 9
38fd979
to
8ab54f6
Compare
rebase master |
…trics (#17701) Master Issue: #15370 ### Modifications - Make transaction `MLTransactionMetadataStoreProvider` & `MLPendingAckStoreProvider` support buffered writer metrics. - Motivation: #15370 ---- - Delete constructor of `TxnLogBufferedWriter` without parameter `metrics`. - Motivation: it is unnecessary. ---- - Add a default `DisabledTxnLogBufferedWriterMetricsStats` implementation. ---- - Previous PR remaining code to optimize: remove the check code `if (metrics != null)`. The motivation see: - Motivation: #16758 (comment) ---- - Make transaction log buffered writer only create by the `MLTransactionMetadataStoreProvider` & `MLPendingAckStoreProvider`. - Motivation: #16758 (comment) ### Documentation - [ ] `doc-required` - [x] `doc-not-needed` - [ ] `doc` - [ ] `doc-complete` ### Matching PR in forked repository PR in forked repository: - poorbarcode#3
Master Issue: #15370
Motivation
see #15370
Modifications
I will complete proposal #15370 with these pull requests( current pull request is a part of step 7-1 ):
TxnLogBufferedWriter
GET /admin/v3/transactions/coordinatorStats
GET /admin/v3/transactions/pendingAckStats/:tenant/:namespace:/:topic:/:subName
7-1. Metrics of Txn Buffered Writer.
7-2.
TransactionLog
andPendingAckStore
enables the Metrics of Txn Buffered WriterThe desired effect
TransactionLog
should createTxnLogBufferedWriter
with params:The metrics output of
TransactionLog
will like this:PendingAckStore
is the same. But all the PendingackStores will not differentiate the Subscription labels (because there are too many)Manage the registered collectors ourselves.
To build Metrics Stat, we need to execute these two steps:
Collector
and register toCollectorRegistry
, perhaps the Collector isHistogram
orCounter
Collector
and getCollector.child
(holds by Metrics Stat). This step can also be omitted because we can executecollector.labels(labelValues)
to getCollector.child
.In the Transaction log scenario, multiple Transaction Logs share the same
Collector
, and each has its ownCollector.Child
, so when we build metrics stat for each Transaction Log, we callcollector.labels(labelValues)
to get theCollector.Child
. However, the CollectorRegistry does not provide an API like this:and it will throw IllegalArgumentException when we registering collector with the same name more than once, see:
https://github.com/prometheus/client_java/blob/1966186deaaa83aec496d88ff604a90795bad688/simpleclient/src/main/java/io/prometheus/client/CollectorRegistry.java#L49-L65
So we have to manage the registered collectors ourselves.
Holds the
Collector.child
by each Metrics stat instanceTo save the overhead of
collector.labels(labelValues)
, we make each Metrics Stat hold a reference ofCollector.child
, because this method is not light enough:https://github.com/prometheus/client_java/blob/1966186deaaa83aec496d88ff604a90795bad688/simpleclient/src/main/java/io/prometheus/client/SimpleCollector.java#L63-L80
Code will be removed in the next PR (7-2)
In the complete design, we should have two implementations like UML blow, one for enabling the batch feature, and another for disabled:
To reduce later maintenance costs, I'd like to ditch the 'DisabledMetricsStat' and we'll always use the implementation 'MetricsStatimpl' even if the Txn buffer writer disables the batch feature. This constructor without 'param-metricsStats' and these' null checks' will be removed in the next PR. This is compatible only with split PR, making each PR have less code
Documentation
doc-required
doc-not-needed
doc
doc-complete