feat: add monitoring #674
Conversation
Force-pushed from 10e8ac4 to 5fe72a5
server/clip_server/torch-flow.yml (outdated)

    monitoring: true
    port_monitoring: 9000
Both `monitoring` and `monitoring_port` are redundant: if `monitoring_port` is set, then monitoring is enabled, otherwise it is disabled. Hence, `monitoring` is not necessary here.
The issue is that a default value is always passed to `monitoring_port`, so we do need another parameter to say whether monitoring is enabled or not. What I could do here is just use the default monitoring port, so that the YAML does not contain redundant information.
The default value can be `None`, am I right?
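If that is the idea, here is a minimal sketch, assuming `port_monitoring` defaults to `None` and the enabled state is derived from it (the class and constructor are illustrative, not the actual PR code):

    from typing import Optional

    class MyExecutor:
        def __init__(self, port_monitoring: Optional[int] = None, **kwargs):
            # monitoring is enabled iff a port was explicitly given,
            # so no separate `monitoring` flag is needed
            self.monitoring = port_monitoring is not None
            self.port_monitoring = port_monitoring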
    self.summary_text = self.get_summary('text_preproc_second', 'Time to preprocess text')
    self.summary_image = self.get_summary('image_preproc_second', 'Time to preprocess image')
    self.summary_encode = self.get_summary('encode_second', 'Time to encode')

    def get_summary(self, title, details):
Can we use a global `Summary` context instance, which can be initialized in the `BaseExecutor`? Then the developer would not need to inject the above code into the executor implementation. For custom tracking names, we could borrow some ideas from TensorBoard's metric logger, e.g.:

    self.summary.log('text_docs', len(da), ...)
    ...
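A rough sketch of what such a logger-style API could look like, built on `prometheus_client` (the class name, `log` method, and lazy auto-registration are all assumptions for illustration, not an existing API):

    from prometheus_client import Summary

    class MetricLogger:
        """Hypothetical TensorBoard-style logger that lazily registers
        one prometheus Summary per metric name."""

        def __init__(self):
            self._metrics = {}

        def log(self, name: str, value: float):
            if name not in self._metrics:
                self._metrics[name] = Summary(name, f'user metric: {name}')
            self._metrics[name].observe(value)

With an instance of this stored as `self.summary` on the `BaseExecutor`, the `self.summary.log('text_docs', len(da))` call above would work without any per-executor setup.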
There are already some base summaries initialized in the `BaseExecutor`, but the text-preprocess and image-preprocess steps here are specific to this executor, so they would always need to be defined here. I have noted the TensorBoard idea for custom tracking names.
I agree with Felix on this one. I'd prefer:
Best: no change to the executor at all; injection happens inside the core. Think about Hub Executors: how could we change all of them, especially those not owned by us?
Okay: injection happens at `@requests(..., monitor=True, summary=True)` (a sketch of this option follows).
Bad: a large number of changed lines, like the current PR. This is unacceptable to me, and also for all of our Hub Executors.
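A sketch of the "Okay" option, showing where the injection would live. This is not the real jina `requests` signature; the `monitor` and `summary` kwargs are hypothetical and only illustrate the idea:

    import functools
    import time

    def requests(on: str = '/', monitor: bool = False, summary: str = ''):
        def decorator(method):
            @functools.wraps(method)
            def wrapper(self, *args, **kwargs):
                if not monitor:
                    return method(self, *args, **kwargs)
                start = time.perf_counter()
                try:
                    return method(self, *args, **kwargs)
                finally:
                    # report to whatever metrics backend the core wires in
                    elapsed = time.perf_counter() - start
                    print(f'{summary or method.__name__}: {elapsed:.4f}s')
            return wrapper
        return decorator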
I might have been unclear.
The core will support a whole set of metrics common to all Executors, like the latency on the gateway/head/worker side and the time taken by the function wrapped by `requests`. These will be common to all Executors, and not a single line of code will be needed on the Executor side.
On top of that, I want an extra feature that allows Executor developers to customize and enrich this monitoring by defining new metrics to expose themselves. This is highly relevant for CLIP-as-service, as I would like to know how much time is actually spent on GPU computation versus text/image preprocessing. The idea behind that extra feature is that you can monitor each sub-function called inside your request method. Those things cannot be abstracted away, as they are not common to all Executors. For example:
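For illustration, this is roughly what that enrichment could look like for an encoder, timing the preprocessing and GPU sub-steps separately with `prometheus_client` (the class and helper names are hypothetical):

    from jina import Executor, requests
    from prometheus_client import Summary

    class MyCLIPEncoder(Executor):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            # executor-specific metrics that the core cannot know about
            self.preproc_time = Summary(
                'preproc_seconds', 'Time spent in text/image preprocessing'
            )
            self.encode_time = Summary(
                'encode_seconds', 'Time spent on GPU computation'
            )

        @requests
        def encode(self, docs, **kwargs):
            with self.preproc_time.time():       # times this sub-step only
                batch = self._preprocess(docs)   # hypothetical helper
            with self.encode_time.time():
                self._embed(batch, docs)         # hypothetical helper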
This change is not acceptable, as I pointed out.
Summary of the sync between @numb3r3 and me: we are considering three possible interfaces for exposing metrics from an Executor.
I will create a draft feature for each proposal so that we can try them out with @numb3r3 and decide what we want to ship in the end.
@@ -52,6 +52,7 @@ def __init__(

        self._pool = ThreadPool(processes=num_worker_preprocess)

    @monitor('preproc_images_seconds', 'Time preprocessing images')
Does the `monitor` decorator work with an arbitrary function, or can it only work with an Executor function (i.e., a member function of the executor) whose inputs and outputs are `DocumentArray`s?
It works with any method of the Executor; it does not require any particular signature. It still needs to be a method and not an arbitrary function, though, as it needs to access the metrics registry that is stored inside the Executor.
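A minimal sketch of how a decorator along these lines could be implemented. The `_summaries` registry attribute and its lazy creation are assumptions; only the method-only constraint follows from the discussion:

    import functools
    import time

    from prometheus_client import Summary

    def monitor(name: str, documentation: str):
        def decorator(method):
            @functools.wraps(method)
            def wrapper(self, *args, **kwargs):
                # needs `self`: the metrics registry lives on the Executor
                # instance, so plain functions are not supported
                registry = getattr(self, '_summaries', None)
                if registry is None:
                    registry = self._summaries = {}
                if name not in registry:
                    registry[name] = Summary(name, documentation)
                start = time.perf_counter()
                try:
                    return method(self, *args, **kwargs)
                finally:
                    registry[name].observe(time.perf_counter() - start)
            return wrapper
        return decorator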
Force-pushed from 2d3c830 to 1b04c87
Codecov Report
@@            Coverage Diff             @@
##             main     #674      +/-   ##
==========================================
+ Coverage   80.56%   80.86%   +0.30%
==========================================
  Files          16       16
  Lines        1137     1155      +18
==========================================
+ Hits          916      934      +18
  Misses        221      221
WIP, blocked by jina-ai/serve#4526