feat: add traversal paths #750
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##             main     #750      +/-   ##
==========================================
+ Coverage   81.49%   81.74%   +0.24%
==========================================
  Files          17       17
  Lines        1232     1205      -27
==========================================
- Hits         1004      985      -19
+ Misses        228      220       -8
```
```python
num_worker_preprocess: int = 4,
minibatch_size: int = 32,
traversal_paths: str = '@r',
```
Suggested change:
```diff
- traversal_paths: str = '@r',
```
```python
:param traversal_paths: Default traversal paths for encoding, used if
    the traversal path is not passed as a parameter with the request.
```
Suggested change:
```diff
- :param traversal_paths: Default traversal paths for encoding, used if
-     the traversal path is not passed as a parameter with the request.
```
""" | ||
super().__init__(*args, **kwargs) | ||
self._minibatch_size = minibatch_size | ||
|
||
self._use_default_preprocessing = use_default_preprocessing | ||
self._max_length = max_length | ||
self._traversal_paths = traversal_paths |
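The pattern under review here, a construction-time default that a request parameter can override, can be sketched without any Jina dependency. The class and method names below are hypothetical stand-ins, not code from this PR:

```python
# Jina-free sketch of the default-with-override pattern in this diff.
# EncoderSketch and resolve_paths are hypothetical stand-in names.

class EncoderSketch:
    def __init__(self, traversal_paths: str = '@r'):
        # Per-deployment default, fixed at construction time.
        self._traversal_paths = traversal_paths

    def resolve_paths(self, parameters: dict) -> str:
        # A request parameter wins; otherwise fall back to the default.
        return parameters.get('traversal_paths', self._traversal_paths)
```

With this shape, each deployed encoder carries its own default while a client can still override it per request.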
Suggested change:
```diff
- self._traversal_paths = traversal_paths
```
""" | ||
|
||
traversal_paths = parameters.get('traversal_paths', self._traversal_paths) |
Suggested change:
```diff
- traversal_paths = parameters.get('traversal_paths', self._traversal_paths)
+ traversal_paths = parameters.get('traversal_paths', '@r')
```
@hanxiao I don't agree with this suggestion. It will break the following use case:

gateway -> encoder #1 (works on the root level) -> encoder #2 (works on the chunk level)

It is impossible to pass the proper parameters:

`client.post(on='/', parameters={'traversal_paths': '?????'})`
By defining the default traversal path in `__init__`, `client.post(on='/')` works:

gateway -> encoder #1 (traversal_paths='@r') -> encoder #2 (traversal_paths='@c')
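A minimal, Jina-free simulation of that flow (the stub class and variable names are hypothetical) shows both encoders resolving the right level when the request carries no `traversal_paths`:

```python
# Two encoders with different construction-time defaults; a request
# without 'traversal_paths' (i.e. client.post(on='/')) reaches both
# at the correct level. EncoderStub is a hypothetical stand-in.

class EncoderStub:
    def __init__(self, traversal_paths: str):
        self._traversal_paths = traversal_paths

    def resolve_paths(self, parameters: dict) -> str:
        return parameters.get('traversal_paths', self._traversal_paths)

encoder_1 = EncoderStub('@r')  # works on root-level documents
encoder_2 = EncoderStub('@c')  # works on chunk-level documents

request_parameters = {}  # the client sends no 'traversal_paths'
assert encoder_1.resolve_paths(request_parameters) == '@r'
assert encoder_2.resolve_paths(request_parameters) == '@c'
```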
Why is it impossible? I don't get it.
@hanxiao From the flow example above, the two encoders work on documents at different levels (one on the root level, the other on the chunk level):
- Use `@r` as the request parameter at the client, `client.post(on='/', parameters={'traversal_paths': '@r'})` -> encoder 2 cannot work.
- Use `@c` as the request parameter at the client, `client.post(on='/', parameters={'traversal_paths': '@c'})` -> encoder 1 cannot work.
But you should be able to send parameters to one particular Executor.
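Jina lets a client target a specific Executor by nesting parameters under that Executor's name. The function below is a simplified stand-in for that routing, not Jina's actual merge logic, and the executor names are hypothetical:

```python
# Simplified sketch of per-Executor parameter routing: top-level keys
# are shared by all Executors, and a dict keyed by an Executor's name
# overrides them for that Executor only. Not Jina's real implementation.

def resolve_params(parameters: dict, executor_name: str) -> dict:
    shared = {k: v for k, v in parameters.items() if not isinstance(v, dict)}
    shared.update(parameters.get(executor_name, {}))
    return shared

params = {
    'traversal_paths': '@r',                 # shared default for the flow
    'encoder_2': {'traversal_paths': '@c'},  # targets encoder_2 only
}
assert resolve_params(params, 'encoder_1')['traversal_paths'] == '@r'
assert resolve_params(params, 'encoder_2')['traversal_paths'] == '@c'
```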
client/clip_client/client.py
Outdated
```diff
@@ -184,10 +181,15 @@ def _iter_doc(self, content) -> Generator['Document', None, None]:
         )

     def _get_post_payload(self, content, kwargs):
         parameters = {}
         if 'batch_size' in kwargs:
             parameters['minibatch_size'] = kwargs['batch_size']
```
I disagree with exposing `minibatch_size` to the public client. It can easily overload a CAS server. Imagine the user now has the capability of controlling both `request_size` and `minibatch_size`: they can easily occupy the full GPU on our Berlin GPU server, and can make the GPU OOM by setting a large `request_size` and `minibatch_size`.
In a client-server architecture, one should not aim to expose every server arg to the client; it is very risky.
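One way to keep `minibatch_size` server-controlled even if a client supplies it, sketched here with an assumed cap (`MAX_MINIBATCH_SIZE` is hypothetical, not from this PR), is to clamp the value on the server side:

```python
# Server-side guard: even if a client smuggles in a huge
# minibatch_size, the effective batch is clamped to an assumed cap.
MAX_MINIBATCH_SIZE = 64  # hypothetical server-chosen limit

def effective_minibatch(parameters: dict, server_default: int = 32) -> int:
    requested = int(parameters.get('minibatch_size', server_default))
    return min(requested, MAX_MINIBATCH_SIZE)
```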
I see, that makes sense. Then we need to update the documentation about how to control the batch size.
client/clip_client/client.py
Outdated
```diff
@@ -187,7 +184,7 @@ def _get_post_payload(self, content, kwargs):
         return dict(
             on='/',
             inputs=self._iter_doc(content),
-            request_size=kwargs.get('batch_size', 8),
+            request_size=kwargs.get('batch_size', 32),
```
Suggested change:
```diff
- request_size=kwargs.get('batch_size', 32),
+ request_size=kwargs.get('batch_size', 8),
```
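After the suggested revert, the client-side payload maps `batch_size` only to `request_size` (default 8) and leaves `minibatch_size` server-controlled. A simplified sketch (standalone function, not the real method, and `inputs` here skips the Document wrapping the real client does):

```python
# Standalone sketch of _get_post_payload after the revert:
# batch_size controls only request_size; no minibatch_size is sent.
def get_post_payload(content, kwargs):
    return dict(
        on='/',
        inputs=content,  # the real method wraps content into Documents
        request_size=kwargs.get('batch_size', 8),
    )
```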