This repository has been archived by the owner on May 30, 2022. It is now read-only.

Performance improvements #1

Open: laurentS wants to merge 3 commits into master

Conversation

@laurentS (Collaborator) commented on Feb 8, 2022

Initially posted as datamill-co#204.

This PR addresses the changes proposed in datamill-co#203

As described in the issue, loading data seemed unnecessarily slow. On a test CSV file containing 50k rows, each with 2 columns of fixed-length text, target-postgres was spending about 20 seconds loading the data.

Some profiling helped identify the bottlenecks, as described in the issue.

This PR proposes 3 distinct improvements, listed in decreasing order of time savings (based on my test case). With cProfile instrumentation enabled, loading took approximately 32 seconds instead of 20:

1. There is only a limited set of argument combinations to _serialize_table_record_field_name, so its results can be cached instead of being recomputed over and over. The method is called for every field of every row, including all 5 metadata fields; in my example that meant about 350k calls. With functools.lru_cache applied, cache_info showed that only 7 calls were actually executed, and every remaining call was a cache hit. My test data is extremely consistent, so this is probably a best case, but my understanding is that the cache should limit the real calls to at most number_of_fields * possible_data_types. If multiple batches are required to load the data, each batch recreates a fresh cache. My fix to enable caching feels a bit hackish, but it seems to work (improvement suggestions welcome!); the trick is sketched under the commit message below. This saves about 15 s of the 32 s (with profiling on).
2. The same applies to serialize_table_record_datetime_value (in my tests, every record has the same value in _sdc_batched_at). The call to arrow is fairly expensive, and the exact same value was being formatted once per row. Caching here is straightforward and saved about 14 s of the 32 s in my sample; see the sketch after this list.
3. The last bit of excessive time is spent in deepcopy, which is a surprisingly slow function. Replacing it with pickling/unpickling saved roughly another 1.5 s; this is also shown in the sketch after this list.
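To make improvements 2 and 3 concrete, here is a minimal sketch of the idea, not the actual target-postgres code: the helper names, the arrow format string, and the maxsize value are assumptions for illustration.

```python
import pickle
from functools import lru_cache

import arrow  # target-postgres already uses arrow for timestamp handling


@lru_cache(maxsize=128)
def format_datetime_cached(value: str) -> str:
    # Parsing and formatting with arrow is expensive; identical inputs (such
    # as the _sdc_batched_at timestamp shared by every record in a batch)
    # become cache hits instead of being re-parsed once per row.
    return arrow.get(value).format('YYYY-MM-DD HH:mm:ss.SSSSZZ')


def fast_copy(record: dict) -> dict:
    # A pickle round-trip behaves like copy.deepcopy for plain JSON-style
    # records (dicts, lists, strings, numbers) but is noticeably faster.
    return pickle.loads(pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL))
```

Both helpers are drop-in style replacements at the call sites that currently format the timestamp and deepcopy each record.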

In the end, with these speedups applied, the same data now loads in under 4s instead of the original 20s.

laurentS added 3 commits April 4, 2021 01:25
Using a partial function and tweaking how the arguments are passed
allows applying lru_cache on the method. The function is called for each
field of each row, but the possible argument values are far more
limited, so caching is very effective here.
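As a hedged sketch of the partial + lru_cache pattern this commit describes: the function name, separator, and cache size below are illustrative, not the actual target-postgres API.

```python
from functools import lru_cache, partial

SEPARATOR = '__'  # illustrative separator, not the real one


def _serialize_field_name(separator: str, field: str, type_suffix: str) -> str:
    # A pure function of a few short strings: the set of distinct argument
    # combinations is tiny compared to the number of calls (one per field
    # per row), which is what makes caching so effective.
    return f'{field}{separator}{type_suffix}' if type_suffix else field


# Binding the constant separator with partial leaves only hashable, per-call
# arguments, so lru_cache can wrap the partial object directly.
serialize_field_name = lru_cache(maxsize=128)(partial(_serialize_field_name, SEPARATOR))

# serialize_field_name('created_at', 't') is computed once; every later row
# is a cache hit, and serialize_field_name.cache_info() reports the counts.
```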
@laurentS laurentS requested a review from ericboucher February 8, 2022 21:12
@@ -19,6 +20,14 @@

RESERVED_NULL_DEFAULT = 'NULL'

@lru_cache(maxsize=128)
Member commented:

let's make maxsize a constant at the top maybe?
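
A minimal sketch of what that suggestion could look like (the constant name and the placeholder function are illustrative):

```python
from functools import lru_cache

# Module-level constant near the top of the file, as suggested.
SERIALIZE_CACHE_MAXSIZE = 128


@lru_cache(maxsize=SERIALIZE_CACHE_MAXSIZE)
def _serialize_table_record_field_name(field: str, type_suffix: str) -> str:
    # Placeholder body; the real method has a different signature.
    return f'{field}__{type_suffix}'
```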
