
fix: reduce memory usage for export_csv #900

Merged

Conversation

@sylvansson (Contributor) commented Feb 2, 2025

Summary:

This PR attempts to help reduce memory usage for the export CSV function (#898) by writing to the CSV file incrementally instead of accumulating data in a DataCollector instance and writing at the very end.
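For illustration, a minimal sketch of the incremental-write pattern, with hypothetical function and parameter names (this is not the actual code in this PR):

```python
import csv

def export_csv(path, headers, rows):
    """Hypothetical sketch: `rows` is any iterable (ideally a generator)
    of dicts keyed by header name."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        for row in rows:
            # Each row is written as soon as it is produced, so only one
            # row needs to be held in memory at a time, instead of
            # accumulating everything in a collector and writing at the end.
            writer.writerow(row)
```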

I don't know how much data is read from the production database with the current queries so I can't really quantify the impact of my changes, but this is definitely not a long-term solution. We could probably reduce memory overhead much further by reading feed data in smaller batches using something like yield_per, but I preferred not touching that code because I'm not familiar with SQLAlchemy. I could address that in a subsequent PR.
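For reference, a rough sketch of what that yield_per batching could look like; the Session usage and the `Feed` model name are assumptions, not the project's actual code:

```python
from sqlalchemy.orm import Session

def iter_feeds(session: Session, batch_size: int = 500):
    # `Feed` is assumed to be a mapped ORM class in this sketch.
    # yield_per streams rows from the database cursor in batches of
    # `batch_size` instead of materializing the full result set at once.
    for feed in session.query(Feed).yield_per(batch_size):
        yield feed
```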

Expected behavior:

The export CSV function should use less memory but produce the same output.

Testing tips:

Based on #888, we should be able to run the api-dev workflow and test against the Dev environment, but as far as I can tell I (understandably) don't have the ability to do that :-)

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)


welcome bot commented Feb 2, 2025

Thanks for opening this pull request! You're awesome. We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should update your pull request title to start with a semantic prefix. Examples of titles with semantic prefixes:

  • fix: Bug with ssl network connections + Java module permissions.
  • feat: Initial support for multiple @PrimaryKey annotations.
  • docs: update RELEASE.md with new process
To get this PR to the finish line, please do the following:

  • Include tests when adding/changing behavior
  • Include screenshots


CLAassistant commented Feb 2, 2025

CLA assistant check
All committers have signed the CLA.

"location.bounding_box.minimum_longitude",
"location.bounding_box.maximum_longitude",
"location.bounding_box.extracted_on",
"status",
sylvansson (Contributor, Author) commented on the column list above:

@jcpitre I noticed we don't set the status for realtime feeds, is that intentional?

Member replied:

This was not intentional; the realtime feeds should also have the status column populated.

Member replied:

I added a specific issue to address this, as we are missing a few other column values. #906

@sylvansson sylvansson changed the title from "feat: reduce memory usage for export_csv" to "fix: reduce memory usage for export_csv" Feb 2, 2025
@emmambd emmambd linked an issue Feb 3, 2025 that may be closed by this pull request
@davidgamez davidgamez self-requested a review February 6, 2025 19:32
@davidgamez (Member) commented:

> Based on #888, we should be able to run the api-dev workflow and test against the Dev environment, but as far as I can tell I (understandably) don't have the ability to do that :-)

Yes, that specific requirement in the issue is for our internal team. For security and cost reasons, we don't share our environments' DB.
I did test this branch with QA/DEV and PROD data, and it looks like memory usage was reduced by ~45%. Great work!

@davidgamez (Member) commented:

> I don't know how much data is read from the production database with the current queries so I can't really quantify the impact of my changes, but this is definitely not a long-term solution. We could probably reduce memory overhead much further by reading feed data in smaller batches using something like yield_per, but I preferred not touching that code because I'm not familiar with SQLAlchemy. I could address that in a subsequent PR.

Aside from this PR's improvement (awesome work 🥳), the main driver of memory growth is the fact that we download datasets twice a week. This means that even if the feed total stays the same (it is actually increasing with newcomers every week/month), the data will still grow with the new datasets. This is why I will keep the original issue open until we address this cause. The next round of optimization should look into the query and fetch only the latest dataset from each feed (or a limited number of datasets ordered by date), along the lines sketched below.
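For illustration, a sketch of that query idea using a window function; the `Dataset` model and its `feed_id` / `downloaded_at` columns are assumptions, not the actual schema:

```python
from sqlalchemy import func, select

# Assumed mapped class and columns: Dataset.feed_id, Dataset.downloaded_at.
ranked = select(
    Dataset,
    func.row_number()
    .over(partition_by=Dataset.feed_id, order_by=Dataset.downloaded_at.desc())
    .label("rank"),
).subquery()

# Keep only the newest dataset per feed; bump the comparison to `<= n`
# to keep a limited number of datasets ordered by date instead.
latest_per_feed = select(ranked).where(ranked.c.rank == 1)
```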

@davidgamez (Member) commented:

Hi @sylvansson, thanks for this contribution! I'm ready to approve the PR. However, the original code had exactly 80% test coverage, and after your changes it went down to 78% (our minimum threshold is 80%). I know that might sound a little restrictive, but as of now, if the check fails on coverage, the CI blocks deployments to DEV/QA and PROD. I think the reduction in test coverage is due to the re-shuffle of the code rather than to any new branch of code that this PR adds.

Next step:

  1. If you can take a crack at bumping the coverage, we will approve and merge the PR.
  2. If you don't have the spare cycles to work on it, we (our internal team) will branch off this PR and add the tests to reach at least the minimum threshold.

Let us know how you would like to proceed, and thanks again for your continued support!
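For illustration, a pytest sketch of the kind of test that could cover the incremental path, written against the hypothetical export_csv sketched earlier (not the repository's real modules):

```python
import csv

def test_incremental_export_writes_all_rows(tmp_path):
    headers = ["id", "status"]
    rows = [{"id": "1", "status": "active"}, {"id": "2", "status": "inactive"}]
    out = tmp_path / "export.csv"

    # Pass an iterator so nothing is pre-materialized, exercising the
    # incremental code path end to end.
    export_csv(out, headers, iter(rows))

    with open(out, newline="", encoding="utf-8") as f:
        assert list(csv.DictReader(f)) == rows
```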

@sylvansson (Contributor, Author) commented:

Hi @davidgamez, thanks for the review. I'll fix the coverage % tonight. I'm also more than happy to tackle the other optimizations you've mentioned once this has gone out :-)

@davidgamez (Member) left a comment:

LGTM!

@davidgamez davidgamez merged commit 1035eed into MobilityData:main Feb 6, 2025
2 of 3 checks passed

welcome bot commented Feb 6, 2025

🥳 Congrats on getting your first pull request merged!

Successfully merging this pull request may close this issue: Export CSV function exceeds memory limit.