Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add monitoring #674

Merged
merged 6 commits into from
May 27, 2022
Merged

feat: add monitoring #674

merged 6 commits into from
May 27, 2022

Conversation

samsja
Copy link
Contributor

@samsja samsja commented Apr 5, 2022

WIP, blocked by jina-ai/serve#4526

@samsja samsja marked this pull request as draft April 5, 2022 13:43
@samsja samsja force-pushed the feat-monitoring branch 3 times, most recently from 10e8ac4 to 5fe72a5 Compare April 5, 2022 14:45
Comment on lines 14 to 15
monitoring: true
port_monitoring: 9000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the both monitoring and monitoring_port are redundant. If monitor_port is set, then monitor is enabled, otherwise disabled. Hence, monitoring is not necessary here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that there is always a default value pass to monitoring_port, so we do need another parameter to check to say if monitoring is enable or no. What I could do here is just use the default monitoring port so that the yaml does not contain redundant information

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value can be None, am I right?

self.summary_text = self.get_summary('text_preproc_second', 'Time to preprocess text')
self.summary_image = self.get_summary('image_preproc_second', 'Time to preprocess image')
self.summary_encode = self.get_summary('encode_second', 'Time to encode')
def get_summary(self, title, details):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a global Summary context instance, which can be initialized in the BaseExecutor. Hence, the developer does not need to inject the above codes into the executor implementation. For custome tracking name, we can borrow some ideas from the metric logger from Tensorboard. e.g.,

self.summary.log('text_docs', len(da), ...)
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are already some base summary initialize in the BaseExecutor, but here proces text and proces image are specific so this executor so they would always need to be define here
I note for the Tensorboard for custom tracking name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with Felix on this one, I'd prefer:

Best: No change to the executor at all. Injection happens inside the core. Think about Hub Executors, how can we change all of them, especially those are not owned by us.
Okay: injection happens at @requests(..., monitor=True, summary=True)
Bad: need a lot of lines of change like the current PR. This is unacceptable to me, also to all of our Hub Executors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be unclear.

The core will support all bunch of metrics which will be common to all Executor. Like the latency on gateway,head,worker side and the time that the function wrapper by requests takes.
This will be common to all Executor and not a single line of code would be needed on the Executor side.

In addition of that I want to have an extra feature that allow Executor developer to customize and enrich this monitoring by defining by themself new metrics to expose. This is highly relevant for clip as a service as I would like to know how much time is actually used on gpu computation and how much time is used in the text/image preprocessing. The idea with that extra feature is that you can monitoring each sub function that is called inside your request method. And those things can't be abstracted away as it is not common to all executor.

Copy link
Member

@hanxiao hanxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change is not acceptable as I pointed out

@samsja
Copy link
Contributor Author

samsja commented Apr 14, 2022

Summary of the sync between @numb3r3 and me:

We think about three possible interfaces for exposing the metrics from an Executor:

  • Accessing the raw prometheus client.
    • Con: user need to define the summary by themself.
    • Pro: they can leverage the full power of the prometheus client. This solution will stay any way because the two others will rely on it
  • Having a decorator to monitor private method
  • Having a pytorch-like interface to do the monitoring: ex: self.summary.log('text_docs', len(da), ...)
    • Pro: easier to use as we just need to add one line of code
    • Con: we have to teach a new monitoring interface to the user

I will create a draft feature for each of the proposition so that we can try it out with @numb3r3 and decide what we want to ship at the end

@samsja
Copy link
Contributor Author

samsja commented May 3, 2022

Summary of the sync between @numb3r3 and me:

We think about three possible interfaces for exposing the metrics from an Executor:

* Accessing the raw prometheus client.
  
  * Con: user need to define the summary by themself.
  * Pro: they can leverage the full power of the prometheus client. This solution will stay any way because the two others will rely on it

* Having a decorator to monitor private method

* Having a pytorch-like interface to do the monitoring: ex: `self.summary.log('text_docs', len(da), ...)`
  
  * Pro: easier to use as we just need to add one line of code
  * Con: we have to teach a new monitoring interface to the user

I will create a draft feature for each of the proposition so that we can try it out with @numb3r3 and decide what we want to ship at the end

@numb3r3 @JoanFM

@@ -52,6 +52,7 @@ def __init__(

self._pool = ThreadPool(processes=num_worker_preprocess)

@monitor('preproc_images_seconds','Time preprocessing images')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the monitor decorator work with arbitrary function? or it can only work with the Executor function (i.e., member function of the executor) whose input and outputs are DocumentArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works with any method of the Executor, it does not require any particular signature. It still need to be a method and not a arbitrary function though, as it needs to access the metrics registry which is store inside the Executor

@samsja samsja force-pushed the feat-monitoring branch 2 times, most recently from 2d3c830 to 1b04c87 Compare May 16, 2022 08:05
@samsja samsja marked this pull request as ready for review May 16, 2022 08:05
@samsja samsja force-pushed the feat-monitoring branch from 1b04c87 to 993e58b Compare May 16, 2022 08:24
@numb3r3 numb3r3 force-pushed the feat-monitoring branch from 993e58b to a18cfad Compare May 26, 2022 04:03
@github-actions github-actions bot added size/s and removed size/xs labels May 26, 2022
@codecov
Copy link

codecov bot commented May 26, 2022

Codecov Report

Merging #674 (5f6c199) into main (e1acf9a) will increase coverage by 0.30%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #674      +/-   ##
==========================================
+ Coverage   80.56%   80.86%   +0.30%     
==========================================
  Files          16       16              
  Lines        1137     1155      +18     
==========================================
+ Hits          916      934      +18     
  Misses        221      221              
Flag Coverage Δ
cas 80.86% <100.00%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/clip_server/executors/clip_onnx.py 84.50% <100.00%> (+1.43%) ⬆️
server/clip_server/executors/clip_tensorrt.py 92.85% <100.00%> (+0.85%) ⬆️
server/clip_server/executors/clip_torch.py 85.71% <100.00%> (+1.50%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7c9625...5f6c199. Read the comment docs.

@numb3r3 numb3r3 merged commit 60a986a into main May 27, 2022
@numb3r3 numb3r3 deleted the feat-monitoring branch May 27, 2022 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants