Multithreading improvement #84

Merged
merged 336 commits into from
Aug 12, 2022

Conversation

DownstreamDataTeam
Contributor

Description of change

  • Implemented multithreading by:
    • offset: as before
    • bookmark: new method that keeps the maximum offset low and improves performance by moving through the stream with filters instead of offsets
    • bookmark: same as above, but filters by date instead of date-time
  • Child streams are extracted using multithreading.
  • Added unit tests for the multithreading implementation.
  • Changed the page_size default value from 500 to 200.
  • Hardcoded page_size and max_threads for some streams to improve performance and avoid causing extreme loads.
  • Removed the multithreading POC file.
  • Some streams are not using multithreading because of technical issues.
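
A minimal sketch of how the offset and bookmark strategies differ, assuming a hypothetical in-memory `fetch_page` in place of real API calls. None of the names below come from the tap; they only illustrate why advancing a filter keeps the maximum offset small.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "API": records sorted by a last-modified value.
RECORDS = [{"id": i, "modified": i} for i in range(1, 1001)]


def fetch_page(modified_after, offset, limit):
    """Stand-in for an HTTP request: filter first, then apply offset/limit."""
    matching = [r for r in RECORDS if r["modified"] > modified_after]
    return matching[offset:offset + limit]


def extract_by_offset(page_size=200, max_threads=4):
    """Offset strategy: the filter never moves, so the offset keeps growing."""
    records, offset = [], 0
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        while True:
            offsets = [offset + i * page_size for i in range(max_threads)]
            pages = list(pool.map(lambda o: fetch_page(0, o, page_size), offsets))
            for page in pages:
                records.extend(page)
            if any(len(page) < page_size for page in pages):
                return records  # a short page means the stream is exhausted
            offset += max_threads * page_size


def extract_by_bookmark(page_size=200, max_threads=4):
    """Bookmark strategy: after each batch, advance the filter and restart at offset 0."""
    records, bookmark = [], 0
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        while True:
            offsets = [i * page_size for i in range(max_threads)]
            pages = list(pool.map(lambda o: fetch_page(bookmark, o, page_size), offsets))
            batch = [r for page in pages for r in page]
            records.extend(batch)
            if any(len(page) < page_size for page in pages):
                return records
            bookmark = batch[-1]["modified"]  # next batch starts after this value
```

In this sketch the bookmark variant never requests an offset above `(max_threads - 1) * page_size`. Real data can contain several records with the same bookmark value, or records modified mid-sync, so a real implementation would also need the deduplication discussed in the review below.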

Rollback steps

  • Revert this branch.

Alexandru Rosca and others added 30 commits March 21, 2022 19:26
# Conflicts:
#	tap_mambu/sync.py
#	tap_mambu/tap_mambu_refactor/main.py
…into 'release/40'

[ECDDC-592] Refactored credit_arrangements stream, implemented unit tests

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!68
Refactored 7 more streams (singer-io#72)

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!71
…ion logic to a helper module

Refactored unit tests to use the new Generator/Processor selection logic, and also to reduce code duplication
…ease/40'

[ECDDC-653] Refactored main.py and unit tests

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!72
…_breakdown_stream

# Conflicts:
#	tap_mambu/tap_mambu_refactor/main.py
…to 'release/40'

[ECDDC-646] New Sonar config version

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!65
[ECDDC-591] Fixed issue with users deduplication key

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!78
…' into 'release/40'

[ECDDC-603] Added interest accrual breakdown stream

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!74
…se/40'

[ECDDC-649] Added task_link_key field to tasks stream

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!75
[ECDDC-652] Adjusted Snyk dev test

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!67
…ease/40'

[ECDDC-653] Finished implementing catalog automatic fields checker test

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!77
Radu Marinoiu and others added 17 commits June 22, 2022 07:28
…-extraction' into 'release/45'

[ECDDC-695] Implement multithreaded child streams extraction

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!118
# Conflicts:
#	tap_mambu/helpers/client.py
#	tap_mambu/sync.py
#	tap_mambu/tap_processors/processor.py
[ECDDC-726] Merge master into release 45

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!119
…se/45'

[ECDDC-727] Added unit tests for multithreading

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!115
…nto 'release/45'

[ECDDC-729] Refactored offset and bookmark multithreaded generators

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!120
… 'release/45'

[ECDDC-716] Deposit accounts missing records

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!123
Contributor

@dmosorast dmosorast left a comment

Some changes requested and comments. I think the biggest blocker here is the usage of set for deduping (for performance and consistency reasons).

tap_mambu/tap_generators/multithreaded_offset_generator.py (outdated)
tap_mambu/tap_generators/multithreaded_offset_generator.py (outdated)
tap_mambu/tap_generators/multithreaded_offset_generator.py (outdated)
setup.py (outdated)


class MultithreadedRequestsPool:
    _dispatcher = ThreadPoolExecutor(max_workers=100)
Contributor

Do you have any metrics on the CPU usage of this? It does look like it will only create threads as needed, and we think that most of them will just be waiting on network I/O, so CPU usage shouldn't be too high, but it'd be good to see some performance metrics here.

Contributor

Anecdotally, it's using quite a bit while running the tap-tester tests, so I'm a bit concerned about load in a multi-tenant environment. Would be nice to make the degree of parallelism configurable as a max through config or something.

Contributor Author

This could be an idea, yes. Maybe we should discuss it further once I finish the fixes from this review.
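
One way the configurable cap suggested in this thread could look, as a sketch only. The `max_threads` config key, the default of 10, and both helper names are assumptions for illustration and are not part of the tap; only the upper bound of 100 comes from the snippet under review.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_MAX_THREADS = 10  # hypothetical default, not taken from the PR
HARD_LIMIT = 100          # mirrors max_workers=100 in the reviewed snippet


def resolve_max_threads(config):
    """Read a hypothetical "max_threads" config key and clamp it to [1, HARD_LIMIT]."""
    requested = int(config.get("max_threads", DEFAULT_MAX_THREADS))
    return max(1, min(requested, HARD_LIMIT))


def build_dispatcher(config):
    """Create the shared executor with a degree of parallelism taken from config."""
    return ThreadPoolExecutor(max_workers=resolve_max_threads(config))
```

Clamping rather than rejecting out-of-range values keeps a misconfigured tenant from either disabling extraction or exceeding the hard limit.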

Radu Marinoiu added 6 commits August 4, 2022 14:54
… to one using a custom dict class "HashableDict" which implements a hash and unique keys for each dict (generated from the json dump data)
[ECDDC-726] Resolve issues in the PR

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!131
Contributor

@dmosorast dmosorast left a comment

Just a few more changes for performance requested, we can talk about this more in depth if you'd like.

tap_mambu/helpers/hashable_dict.py (outdated)
tap_mambu/helpers/hashable_dict.py (outdated)
tap_mambu/helpers/hashable_dict.py (outdated)

    def __eq__(self, other):
        if isinstance(other, HashableDict):
            return self.__key() == other.__key()
Contributor

This should probably use the hash instead of the key because equality is a required property of __hash__.

Contributor Author

Sorry, I'm not sure what you mean by equality being a required property of hash.
Also, it feels like changing equality from key to hash(key) could result in collisions for records that are not the same but result in the same hash.

Contributor

That's my mistake. I misread the statement in that Python documentation (and it seems like I need to review my Data Structures textbook 😅). At first read I interpreted this to also mean that hashes that are equal map to objects that are equal.

"The only required property is that objects which compare equal have the same hash value"
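
The point settled in this thread can be checked with a small sketch. The `HashableDict` below is an illustrative stand-in, not the class from `tap_mambu/helpers/hashable_dict.py`, and `ConstantHash` is a hypothetical subclass that forces every hash to collide.

```python
class HashableDict(dict):
    """Illustrative only: a dict usable in a set, compared by its sorted items."""

    def _key(self):
        # Assumes flat records with string keys and hashable values.
        return tuple(sorted(self.items()))

    def __hash__(self):
        return hash(self._key())

    def __eq__(self, other):
        if isinstance(other, HashableDict):
            return self._key() == other._key()
        return NotImplemented


class ConstantHash(HashableDict):
    """Hypothetical worst case: every instance has the same hash."""

    def __hash__(self):
        return 0


# Equal objects must share a hash, so a set keeps only one of them.
first = HashableDict(id=1, name="a")
second = HashableDict(name="a", id=1)
deduplicated = {first, second}

# Equal hashes do not imply equal objects: both records survive in the set
# because equality is still decided by the key, not by the hash.
left = ConstantHash(id=1)
right = ConstantHash(id=2)
colliding = {left, right}
```

Comparing by `hash(key)` instead of the key itself would merge `left` and `right` here, which is the collision risk raised by the author.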

@@ -130,11 +133,11 @@ def error_check_and_fix(self, final_buffer: set, temp_buffer: set, futures: list
            final_buffer = self.check_and_get_set_reunion(final_buffer, temp_buffer, self.artificial_limit)
        except RuntimeError:  # if errors are found
            LOGGER.exception("Discrepancies found in extracted data, and errors couldn't be corrected."
Contributor

Does this need to be exception level? I'd be concerned that a data set with a lot of activity on it during the sync could cause quite a bit of noise with all of the call stacks.

Contributor

Discussed in Slack, this can be included in the next PR.
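
For reference, a sketch of the noise difference raised above, using only the standard `logging` module. The logger name and the `report_discrepancy` helper are hypothetical; the follow-up PR may resolve this differently.

```python
import logging

LOGGER = logging.getLogger("dedup_sketch")  # hypothetical logger name


def report_discrepancy(verbose=False):
    """Log a failed consistency check with or without the call stack."""
    try:
        raise RuntimeError("buffers disagree")  # stand-in for the failed check
    except RuntimeError as error:
        if verbose:
            # ERROR level, with the full traceback appended to the message
            LOGGER.exception("Discrepancies found in extracted data")
        else:
            # WARNING level, a single line, no traceback
            LOGGER.warning("Discrepancies found in extracted data: %s", error)
```

`LOGGER.exception` always logs at ERROR level and attaches the traceback, so a stream with heavy activity during the sync would emit one call stack per failed check.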

Radu Marinoiu added 2 commits August 11, 2022 09:07
[ECDDC-726] Replaced all json.dumps with tuple conversions, as they are faster to compute a hash for

See merge request mambucom/product/ecosystem/mambu-marketplace/connectors/singer/tap-mambu!132
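
A sketch of what replacing `json.dumps` with a tuple conversion might look like. The `freeze` helper and the sample records are assumptions for illustration; the actual conversion in the tap is not shown in this conversation, and the speed claim from the commit message is not measured here.

```python
import json


def freeze(value):
    """Recursively convert dicts and lists into hashable, order-independent tuples."""
    if isinstance(value, dict):
        return tuple((key, freeze(value[key])) for key in sorted(value))
    if isinstance(value, list):
        return tuple(freeze(item) for item in value)
    return value


record = {"id": "a1", "tags": ["x", "y"], "balance": {"amount": 10, "currency": "EUR"}}
same_record = {"balance": {"currency": "EUR", "amount": 10}, "tags": ["x", "y"], "id": "a1"}

# Both approaches produce a key that ignores dict insertion order.
key_from_json = json.dumps(record, sort_keys=True)
key_from_tuple = freeze(record)
unique_records = {freeze(record), freeze(same_record)}
```

The tuple key can be hashed directly without building an intermediate string. One caveat of this sketch: a dict and a list of key/value pairs with the same content freeze to the same tuple.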
@dmosorast dmosorast merged commit f7d91ac into singer-io:master Aug 12, 2022
@dmosorast dmosorast mentioned this pull request Aug 12, 2022