Improve speed of DBS3Reader APIs #11098

Closed
vkuznet opened this issue Apr 15, 2022 · 4 comments · Fixed by #11099 or #11101

Comments

@vkuznet
Contributor

vkuznet commented Apr 15, 2022

Impact of the new feature
The current implementation of the DBS3Reader API relies on sequential access to data in DBS. Although it works fine for heavily populated datasets/blocks, it can be very slow, especially when we need to fetch information from many blocks, e.g. parentage information. This ticket should address the speed of the DBS3Reader APIs by refactoring the codebase to take advantage of concurrency in API calls.

Is your feature request related to a problem? Please describe.
For heavily populated datasets (with lots of blocks) I found that certain APIs, e.g. listDatasetFileDetails, are very slow; see dmwm/dbs2go#5

Describe the solution you'd like
Refactor the code to take advantage of concurrent (parallel) execution of API calls.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
See this ticket: dmwm/dbs2go#5

@vkuznet
Contributor Author

vkuznet commented Apr 18, 2022

The proposed approach in PR#11099 can be applied to the following APIs:

  • listFilesInBlockWithParents
  • listFileBlockLocation
  • getParentFilesGivenParentDataset
  • getParentFilesByLumi
  • findAndInsertMissingParentage
  • fixMissingParentageDatasets

All of them use sequential calls to DBS APIs inside a for loop. These calls can be parallelized, which will significantly reduce the time spent in a given API. I suggest providing an individual PR for each listed API.
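As a rough illustration of the idea (not the code in PR #11099), the sketch below contrasts a per-block for loop with the same calls issued through a thread pool from Python's standard `concurrent.futures` module. The names `fetch_block_info`, `fetch_sequential`, and `fetch_concurrent` are hypothetical, and the DBS call is replaced by a stub that sleeps to mimic network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_block_info(block_name):
    """Hypothetical stand-in for one DBS call; sleeps to mimic network latency."""
    time.sleep(0.05)
    return {"block": block_name, "files": [f"{block_name}/file{i}.root" for i in range(2)]}


def fetch_sequential(block_names):
    """Sequential pattern: one call per block inside a for loop."""
    results = []
    for name in block_names:
        results.append(fetch_block_info(name))
    return results


def fetch_concurrent(block_names, max_workers=10):
    """Same calls issued through a thread pool; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_block_info, block_names))


if __name__ == "__main__":
    # Made-up block names, used only to drive the stub.
    blocks = [f"/Primary/Processed/TIER#block{i}" for i in range(20)]

    start = time.perf_counter()
    seq = fetch_sequential(blocks)
    seq_time = time.perf_counter() - start

    start = time.perf_counter()
    con = fetch_concurrent(blocks)
    con_time = time.perf_counter() - start

    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
    print("same results:", seq == con)
```

Threads suit this case because each call mostly waits on I/O. The actual gain would depend on how many concurrent requests the DBS server tolerates, so `max_workers` here is only a placeholder value.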

@vkuznet
Contributor Author

vkuznet commented Apr 19, 2022

@amaltaro I think this issue should stay open until we merge #11099, or even longer, until I provide other PRs for the remaining APIs.

@vkuznet
Contributor Author

vkuznet commented Apr 21, 2022

PR #11099 addresses the first three APIs:

  • listFilesInBlockWithParents
  • listFileBlockLocation
  • getParentFilesGivenParentDataset

getParentFilesByLumi is partially covered by parallel execution, and the last two APIs, findAndInsertMissingParentage and fixMissingParentageDatasets, will require more significant effort to make their code execute concurrently.

@amaltaro
Contributor

Reopening it because the right PR to fix it is being discussed in #11099
