
fix: reduce memory usage for export_csv #900

Merged

Conversation

@sylvansson (Contributor) commented Feb 2, 2025

Summary:

This PR attempts to help reduce memory usage for the export CSV function (#898) by writing to the CSV file incrementally instead of accumulating data in a DataCollector instance and writing at the very end.
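For illustration, a minimal sketch of the incremental-write pattern, with hypothetical function and parameter names (this is not the actual code in this PR):

```python
import csv

def export_csv(path, headers, rows):
    """Hypothetical sketch: `rows` is any iterable (ideally a generator)
    of dicts keyed by header name."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        for row in rows:
            # Each row is written as soon as it is produced, so only one
            # row needs to be held in memory at a time, instead of
            # accumulating everything in a collector and writing at the end.
            writer.writerow(row)
```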

I don't know how much data is read from the production database with the current queries so I can't really quantify the impact of my changes, but this is definitely not a long-term solution. We could probably reduce memory overhead much further by reading feed data in smaller batches using something like yield_per, but I preferred not touching that code because I'm not familiar with SQLAlchemy. I could address that in a subsequent PR.
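For reference, a rough sketch of what that yield_per batching could look like; the Session usage and the `Feed` model name are assumptions, not the project's actual code:

```python
from sqlalchemy.orm import Session

def iter_feeds(session: Session, batch_size: int = 500):
    # `Feed` is assumed to be a mapped ORM class in this sketch.
    # yield_per streams rows from the database cursor in batches of
    # `batch_size` instead of materializing the full result set at once.
    for feed in session.query(Feed).yield_per(batch_size):
        yield feed
```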

Expected behavior:

The export CSV function should use less memory but produce the same output.

Testing tips:

Based on #888, we should be able to run the api-dev workflow and test against the Dev environment, but as far as I can tell I (understandably) don't have the ability to do that :-)

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)


welcome bot commented Feb 2, 2025

Thanks for opening this pull request! You're awesome. We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should update your pull request title to start with a semantic prefix. Examples of titles with semantic prefixes:

  • fix: Bug with ssl network connections + Java module permissions.
  • feat: Initial support for multiple @PrimaryKey annotations.
  • docs: update RELEASE.md with new process
To get this PR to the finish line, please do the following:

  • Include tests when adding/changing behavior
  • Include screenshots


CLAassistant commented Feb 2, 2025

CLA assistant check
All committers have signed the CLA.

"location.bounding_box.minimum_longitude",
"location.bounding_box.maximum_longitude",
"location.bounding_box.extracted_on",
"status",
sylvansson (Contributor, Author) commented on the column list above:

@jcpitre I noticed we don't set the status for realtime feeds, is that intentional?

Member replied:

This was not intentional; the realtime feeds should also have the status column populated.

Member replied:

I added a specific issue to address this, as we are missing a few other column values. #906

@sylvansson sylvansson changed the title from "feat: reduce memory usage for export_csv" to "fix: reduce memory usage for export_csv" Feb 2, 2025
@emmambd emmambd linked an issue Feb 3, 2025 that may be closed by this pull request
@davidgamez davidgamez self-requested a review February 6, 2025 19:32
@davidgamez (Member) commented:

> Based on #888, we should be able to run the api-dev workflow and test against the Dev environment, but as far as I can tell I (understandably) don't have the ability to do that :-)

Yes, that specific requirement in the issue is for our internal team. For security and cost reasons, we don't share our environments' DB.
I did test this branch with QA/DEV and PROD data, and it looks like memory usage was reduced by ~45%. Great work!

@davidgamez (Member) commented:

> I don't know how much data is read from the production database with the current queries so I can't really quantify the impact of my changes, but this is definitely not a long-term solution. We could probably reduce memory overhead much further by reading feed data in smaller batches using something like yield_per, but I preferred not touching that code because I'm not familiar with SQLAlchemy. I could address that in a subsequent PR.

Aside from this PR's improvement (awesome work 🥳), the main driver of memory growth is the fact that we download datasets twice a week. This means that even if the feed total stays the same (it is actually increasing with newcomers every week/month), the data will still grow with the new datasets. This is why I will keep the original issue open until we address this cause. The next round of optimization should look into the query and fetch only the latest dataset from each feed (or a limited number of datasets ordered by date), along the lines sketched below.
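For illustration, a sketch of that query idea using a window function; the `Dataset` model and its `feed_id` / `downloaded_at` columns are assumptions, not the actual schema:

```python
from sqlalchemy import func, select

# Assumed mapped class and columns: Dataset.feed_id, Dataset.downloaded_at.
ranked = select(
    Dataset,
    func.row_number()
    .over(partition_by=Dataset.feed_id, order_by=Dataset.downloaded_at.desc())
    .label("rank"),
).subquery()

# Keep only the newest dataset per feed; bump the comparison to `<= n`
# to keep a limited number of datasets ordered by date instead.
latest_per_feed = select(ranked).where(ranked.c.rank == 1)
```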

@davidgamez (Member) commented:

Hi @sylvansson, thanks for this contribution! I'm ready to approve the PR. However, the original code had exactly 80% test coverage, and after your changes it went down to 78% (our minimum threshold is 80%). I know that might sound a little restrictive, but as of now, if the check fails on coverage, the CI blocks deployments to DEV/QA and PROD. I think the reduction in test coverage is due to the re-shuffle of the code rather than to any new branch of code that this PR adds.

Next step:

  1. If you can take a crack at bumping the coverage, we will approve and merge the PR.
  2. If you don't have the spare cycles to work on it, we (our internal team) will branch off this PR and add the tests to reach at least the minimum threshold.

Let us know how you would like to proceed, and thanks again for your continued support!
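For illustration, a pytest sketch of the kind of test that could cover the incremental path, written against the hypothetical export_csv sketched earlier (not the repository's real modules):

```python
import csv

def test_incremental_export_writes_all_rows(tmp_path):
    headers = ["id", "status"]
    rows = [{"id": "1", "status": "active"}, {"id": "2", "status": "inactive"}]
    out = tmp_path / "export.csv"

    # Pass an iterator so nothing is pre-materialized, exercising the
    # incremental code path end to end.
    export_csv(out, headers, iter(rows))

    with open(out, newline="", encoding="utf-8") as f:
        assert list(csv.DictReader(f)) == rows
```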

@sylvansson (Contributor, Author) commented:

Hi @davidgamez, thanks for the review. I'll fix the coverage % tonight. I'm also more than happy to tackle the other optimizations you've mentioned once this has gone out :-)

@davidgamez (Member) left a comment:

LGTM!

@davidgamez davidgamez merged commit 1035eed into MobilityData:main Feb 6, 2025
2 of 3 checks passed

welcome bot commented Feb 6, 2025

🥳 Congrats on getting your first pull request merged!

Successfully merging this pull request may close this issue: Export CSV function exceeds memory limit.