This repository has been archived by the owner on May 30, 2022. It is now read-only.

Performance improvements #1

Open: laurentS wants to merge 3 commits into master

Conversation

@laurentS (Collaborator) commented on Feb 8, 2022

Initially posted as datamill-co#204.

This PR addresses the changes proposed in datamill-co#203

As described in the issue, loading data seemed unnecessarily slow. On a test CSV file containing 50k rows, each with 2 columns of fixed-length text, target-postgres was spending about 20 seconds loading the data.

Some profiling helped identify the bottlenecks, as described in the issue.

This PR proposes 3 distinct improvements, listed in decreasing order of time savings (based on my test case). With cProfile instrumentation enabled, loading took approximately 32 seconds instead of 20:

1. There is only a limited set of argument combinations to _serialize_table_record_field_name, so its results can be cached instead of being recomputed over and over. The method is called for every field of every row, including all 5 metadata fields; in my example that meant about 350k calls. With functools.lru_cache applied, cache_info showed that only 7 calls were actually executed, and every remaining call was a cache hit. My test data is extremely consistent, so this is probably a best case, but my understanding is that the cache should limit the real calls to at most number_of_fields * possible_data_types. If multiple batches are required to load the data, each batch recreates a fresh cache. My fix to enable caching feels a bit hackish, but it seems to work (improvement suggestions welcome!); the trick is sketched under the commit message below. This saves about 15 s of the 32 s (with profiling on).
2. The same applies to serialize_table_record_datetime_value (in my tests, every record has the same value in _sdc_batched_at). The call to arrow is fairly expensive, and the exact same value was being formatted once per row. Caching here is straightforward and saved about 14 s of the 32 s in my sample; see the sketch after this list.
3. The last bit of excessive time is spent in deepcopy, which is a surprisingly slow function. Replacing it with pickling/unpickling saved roughly another 1.5 s; this is also shown in the sketch after this list.
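To make improvements 2 and 3 concrete, here is a minimal sketch of the idea, not the actual target-postgres code: the helper names, the arrow format string, and the maxsize value are assumptions for illustration.

```python
import pickle
from functools import lru_cache

import arrow  # target-postgres already uses arrow for timestamp handling


@lru_cache(maxsize=128)
def format_datetime_cached(value: str) -> str:
    # Parsing and formatting with arrow is expensive; identical inputs (such
    # as the _sdc_batched_at timestamp shared by every record in a batch)
    # become cache hits instead of being re-parsed once per row.
    return arrow.get(value).format('YYYY-MM-DD HH:mm:ss.SSSSZZ')


def fast_copy(record: dict) -> dict:
    # A pickle round-trip behaves like copy.deepcopy for plain JSON-style
    # records (dicts, lists, strings, numbers) but is noticeably faster.
    return pickle.loads(pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL))
```

Both helpers are drop-in style replacements at the call sites that currently format the timestamp and deepcopy each record.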

In the end, with these speedups applied, the same data now loads in under 4s instead of the original 20s.

laurentS added 3 commits April 4, 2021 01:25
Using a partial function and tweaking how the arguments are passed
allows applying lru_cache on the method. The function is called for each
field of each row, but the possible argument values are far more
limited, so caching is very effective here.
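As a hedged sketch of the partial + lru_cache pattern this commit describes: the function name, separator, and cache size below are illustrative, not the actual target-postgres API.

```python
from functools import lru_cache, partial

SEPARATOR = '__'  # illustrative separator, not the real one


def _serialize_field_name(separator: str, field: str, type_suffix: str) -> str:
    # A pure function of a few short strings: the set of distinct argument
    # combinations is tiny compared to the number of calls (one per field
    # per row), which is what makes caching so effective.
    return f'{field}{separator}{type_suffix}' if type_suffix else field


# Binding the constant separator with partial leaves only hashable, per-call
# arguments, so lru_cache can wrap the partial object directly.
serialize_field_name = lru_cache(maxsize=128)(partial(_serialize_field_name, SEPARATOR))

# serialize_field_name('created_at', 't') is computed once; every later row
# is a cache hit, and serialize_field_name.cache_info() reports the counts.
```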
@laurentS laurentS requested a review from ericboucher February 8, 2022 21:12
@@ -19,6 +20,14 @@

RESERVED_NULL_DEFAULT = 'NULL'

@lru_cache(maxsize=128)
Member commented:

let's make maxsize a constant at the top maybe?
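
A minimal sketch of what that suggestion could look like (the constant name and the placeholder function are illustrative):

```python
from functools import lru_cache

# Module-level constant near the top of the file, as suggested.
SERIALIZE_CACHE_MAXSIZE = 128


@lru_cache(maxsize=SERIALIZE_CACHE_MAXSIZE)
def _serialize_table_record_field_name(field: str, type_suffix: str) -> str:
    # Placeholder body; the real method has a different signature.
    return f'{field}__{type_suffix}'
```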
