fix: reduce memory usage for export_csv #900
Conversation
Thanks for opening this pull request! You're awesome. We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should update your pull request title to start with a semantic prefix. Examples of titles with semantic prefixes:
- `fix: …` for bug fixes
- `feat: …` for new features
- `docs: …` for documentation-only changes
"location.bounding_box.minimum_longitude", | ||
"location.bounding_box.maximum_longitude", | ||
"location.bounding_box.extracted_on", | ||
"status", |
@jcpitre I noticed we don't set the status for realtime feeds, is that intentional?
This was not intentional; the realtime feeds should also have the status column populated.
I added a specific issue to address this, as we are missing a few other column values. #906
Yes, that specific requirement in the issue is for our internal team. For security and cost reasons, we don't share our environments' DB.
Aside from this PR's improvement (awesome work 🥳), the main driver of the memory increase is the fact that we are downloading datasets twice a week. This means that even if the feed total stays the same (and it is actually increasing with newcomers every week/month), the data will still grow with the new datasets. This is why I will keep the original issue open until we address this cause. The next round of optimization should look into the query and get only the latest dataset from each feed (or a limited number of datasets ordered by date).
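For illustration, a rough sketch of what that query change could look like using a window function; the table and column names below are assumptions, not the project's actual schema:

```python
from sqlalchemy import Column, DateTime, Integer, MetaData, Table, func, select

metadata = MetaData()

# Hypothetical dataset table; the real schema likely differs.
gtfs_dataset = Table(
    "gtfsdataset",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("feed_id", Integer),
    Column("downloaded_at", DateTime),
)

# Rank each feed's datasets by download date, newest first.
ranked = select(
    gtfs_dataset,
    func.row_number()
    .over(
        partition_by=gtfs_dataset.c.feed_id,
        order_by=gtfs_dataset.c.downloaded_at.desc(),
    )
    .label("rn"),
).subquery()

# Keep only the newest dataset per feed (or use rn <= N for the N latest).
latest_per_feed = select(ranked).where(ranked.c.rn == 1)
```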
Hi @sylvansson, thanks for this contribution! I'm ready to approve the PR. However, the original code had exactly 80% test coverage, and after your changes it went down to 78% (our minimum threshold is 80%). I know that might sound a little restrictive, but as of now, if a pull request fails on coverage, the CI blocks deployments to DEV/QA and PROD. I think the reduction in test coverage is due to the reshuffling of the code rather than any new branch of code that this PR adds. Next step: let us know how you would like to proceed, and thanks again for your continued support!
Hi @davidgamez, thanks for the review. I'll fix the coverage % tonight. I'm also more than happy to tackle the other optimizations you've mentioned once this has gone out :-)
LGTM!
Summary:
This PR attempts to help reduce memory usage for the export CSV function (#898) by writing to the CSV file incrementally instead of accumulating data in a DataCollector instance and writing at the very end.
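As a minimal sketch of this approach (the `iter_rows` generator below is a hypothetical stand-in for the project's actual row-building logic):

```python
import csv

def export_csv(path, fieldnames, iter_rows):
    """Write rows to a CSV file as they are produced."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        # Each row is written immediately, so memory stays proportional
        # to a single row instead of the whole collected dataset.
        for row in iter_rows():
            writer.writerow(row)
```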
I don't know how much data is read from the production database with the current queries, so I can't really quantify the impact of my changes, but this is definitely not a long-term solution. We could probably reduce memory overhead much further by reading feed data in smaller batches using something like `yield_per`, but I preferred not to touch that code because I'm not familiar with SQLAlchemy. I could address that in a subsequent PR.
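For reference, a batched read with `yield_per` might look roughly like this; the model, columns, and connection string are placeholders, not the project's actual code:

```python
import csv

from sqlalchemy import create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


# Hypothetical model standing in for the project's actual feed entity.
class Feed(Base):
    __tablename__ = "feed"
    id: Mapped[int] = mapped_column(primary_key=True)
    stable_id: Mapped[str]
    status: Mapped[str]


engine = create_engine("postgresql+psycopg2://user:pass@host/db")

with Session(engine) as session, open("feeds.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "stable_id", "status"])
    # yield_per streams results from the database in fixed-size batches
    # instead of loading every row into memory at once.
    stmt = select(Feed).execution_options(yield_per=500)
    for feed in session.scalars(stmt):
        writer.writerow([feed.id, feed.stable_id, feed.status])
```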
Expected behavior:
The export CSV function should use less memory but produce the same output.
Testing tips:
Based on #888, we should be able to run the api-dev workflow and test against the Dev environment, but as far as I can tell, I (understandably) don't have the ability to do that :-)
Please make sure these boxes are checked before submitting your pull request - thanks!
Run `./scripts/api-tests.sh` to make sure you didn't break anything