RFE: store exported data even if blocked due to throttle violation
#1
Comments
RFE-2: I suggest adding a message in the |
RFE-3: An even better facility would be for the exporter to download the list of SE sites where the user has registered, and download only those sites (hoping that does not trigger the ban).
RFE-4: Is there some index on each site describing which types of content the user has submitted, so that the exporter could make only the necessary download requests and avoid reaching the ban limit? |
Ah yes indeed, it's pretty annoying. And yeah, that's why I added I think there are a few options, although need to think of all the pros and cons before implementing any (because it might be a fair amount of work):
|
Yeah, I think that makes a lot of sense! Can't remember whether there was such an endpoint, but maybe it's possible to reuse stexport/src/stexport/export.py Line 4 in e93ec39
Perhaps this? stexport/src/stexport/export.py Lines 3 to 64 in e93ec39
|
I see that RFE-4 does not make sense, it's just 5 urls-per-site. RFE-3 is the important stuff - I got my list of sites to scrape from this: https://stackexchange.com/users/263317/ankostis?tab=accounts |
oh btw about that
Yeah indeed, it's also kind of a problem -- a consequence of the way exporters work -- they output data to stdout (for simplicity) and then it's dumped atomically. Even if it was written during the process, since it's a single JSON structure (a dictionary in this case), it would be malformed unless it's complete, so it would require some manual intervention to make it well-formed. Maybe it won't be necessary in this case if we make fewer requests and get away with it -- but a more general way to do this might be to let the export files be JSONL. So for example, it could dump one JSON object per site on each line. It would allow it to be backed by a single text file (so we keep simplicity), but also be flexible enough and easy to assemble back by |
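A minimal sketch of the JSONL idea, not the exporter's actual code. The names write_site_lines and read_site_lines and the {"site": ..., "data": ...} record shape are assumptions made for illustration only. The point is that every completed line is valid JSON on its own, so a crash mid-export loses at most the line being written.

```python
import io
import json
from typing import Any, Dict, IO, Iterable, Tuple


def write_site_lines(results: Iterable[Tuple[str, Any]], out: IO[str]) -> None:
    """Write one JSON object per line, one line per site (hypothetical helper)."""
    for site, data in results:
        out.write(json.dumps({"site": site, "data": data}) + "\n")
        out.flush()  # push each finished line out before fetching the next site


def read_site_lines(src: IO[str]) -> Dict[str, Any]:
    """Assemble the per-site lines back into a single dictionary."""
    merged: Dict[str, Any] = {}
    for line in src:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # An incomplete line (e.g. the export was interrupted): stop here.
            break
        merged[record["site"]] = record["data"]
    return merged


# Simulate an export that was cut off while writing the third site.
buf = io.StringIO()
write_site_lines(
    [("stackoverflow", {"answers": [1, 2]}), ("superuser", {"answers": []})],
    buf,
)
partial = buf.getvalue() + '{"site": "askubuntu", "da'
recovered = read_site_lines(io.StringIO(partial))
```

Here recovered still holds the two sites that were fully written, which is the behaviour the single-dictionary format cannot offer.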
Weird that you got hit with a full day? The limits they disclose are:
And regarding RFE-3, looks like it could leverage the EDIT: Made PR for RFE-3 in #3 |
Wow, yeah, I've made maybe 300 requests total while testing and I just got the 24hr ban... Wasn't able to get a full export unfortunately |
Some things I noticed from my own experiments:
Quota is returned in the raw api response: stexport/src/stexport/export.py Lines 134 to 138 in e9e1129
items from it, so we could at least inspect it beforehand to avoid a ban). However, there is an issue: AWegnerGitHub/stackapi#41. After pip3 install --user stackapi --upgrade it now reports stats correctly:
So maybe at the very least it would be possible to warn the user if we're about to make too many requests (i.e. |
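A rough sketch of such a warning, assuming the raw response is a dict carrying the quota_remaining field from the Stack Exchange API response wrapper. The function name enough_quota and the planned_requests estimate are hypothetical and not part of stexport; treating a missing quota field as "go ahead" is also just an assumption of this sketch.

```python
import warnings
from typing import Any, Mapping


def enough_quota(response: Mapping[str, Any], planned_requests: int) -> bool:
    """Return False (and warn) if the planned requests exceed the remaining quota.

    `response` is assumed to be the raw API response wrapper as a dict.
    """
    remaining = response.get("quota_remaining")
    if remaining is None:
        # Quota not reported: nothing to compare against, so do not block.
        return True
    if planned_requests > remaining:
        warnings.warn(
            f"About to make {planned_requests} requests, "
            f"but only {remaining} remain in today's quota"
        )
        return False
    return True
```

A caller could check this once after the first response for a site and skip or postpone the rest of the export when it returns False.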
As a result, they get 10K daily limit instead of 300 requests. #1
And I think I figured it out :) The 'site apis' didn't get api parameters so they ended up with 300 requests default limit instead of 10K. Once we merge @Cobertos PR (which still makes sense nevertheless!) I can also merge it and hopefully it will resolve this? stexport/src/stexport/export.py Line 150 in 4a3687b
P.S. also noticed some weird |
After merging origin/fix locally, I was able to run the full export. fetch_backoff backed off about 6 times (each time it backed off like 7 times of up to 120s), but I got through all of the below sites without being banned :3
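For readers unfamiliar with the pattern, this is roughly what a capped retry-with-backoff loop looks like. It is a generic stdlib sketch, not the project's fetch_backoff: ThrottleError, retry_with_backoff, and the default values (7 tries, 120s cap, chosen to mirror the numbers above) are assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottleError(Exception):
    """Hypothetical stand-in for a throttle_violation error from the API."""


def retry_with_backoff(
    fetch: Callable[[], T],
    max_tries: int = 7,
    base_delay: float = 2.0,
    max_delay: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, doubling the wait after each throttle error, capped at `max_delay`."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(max_tries):
        try:
            return fetch()
        except ThrottleError:
            if attempt == max_tries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable only so the loop can be exercised without actually waiting.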
|
Ok, I guess this is fixed in master! I also updated readme about getting |
When authenticated with a stackexchange app and downloading all sites, stack-exchange blocked the export with a throttle_violation (502) error, and the exporter crashed without storing any of the downloaded data.
The expected behavior would be to save the data collected so far, e.g. in a try/finally block.
Reported against e93ec39 (Dec 3 2021)
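To illustrate the expected behavior, here is a minimal sketch of the try/finally approach. It is not the exporter's code: export_sites and the fetch_site callable are hypothetical names, and the output is a plain JSON dictionary keyed by site.

```python
import json
from typing import Any, Callable, Dict, IO, Iterable


def export_sites(
    sites: Iterable[str],
    fetch_site: Callable[[str], Any],
    out: IO[str],
) -> None:
    """Fetch each site and always dump whatever was collected, even on failure."""
    collected: Dict[str, Any] = {}
    try:
        for site in sites:
            collected[site] = fetch_site(site)
    finally:
        # Runs on success and on error; the original exception still propagates.
        json.dump(collected, out)
```

If fetch_site raises a throttle error on the third site, the first two sites are still written to out before the error reaches the caller.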
RFE-1: Add --version so that reports like this one can state the version of the tool each issue refers to.
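A possible shape for RFE-1 using argparse's built-in version action. This is only a sketch: the distribution name "stexport" passed to importlib.metadata is an assumption, as are the helper names tool_version and make_parser, and the "unknown" fallback covers running from a checkout that is not installed.

```python
import argparse
from importlib import metadata


def tool_version(dist_name: str = "stexport") -> str:
    """Look up the installed package version; fall back when not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "unknown"


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="stexport")
    # argparse prints the version string and exits with status 0.
    parser.add_argument(
        "--version",
        action="version",
        version=f"%(prog)s {tool_version()}",
    )
    return parser
```

Calling make_parser().parse_args(["--version"]) prints a line starting with "stexport" and exits, which is enough for a reporter to paste into an issue.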