
RFE: store exported data even if blocked due to throttle violation #1

Closed
ankostis opened this issue Mar 8, 2021 · 12 comments

ankostis commented Mar 8, 2021

When authenticated with a Stack Exchange app and downloading all sites, Stack Exchange blocked the export with a throttle_violation (502) error, and the exporter crashed without storing any of the data downloaded so far.

The expected behavior would be to save whatever data has already been collected, e.g. in a try/finally clause (see the sketch after the traceback below).

  • The behaviour is particularly grave because the penalty for the app is 24 hours!
  • The throttling error is practically guaranteed to happen, since there are too many sites.
./stexport.git $ python -m stexport.export \
        --user_id ... --key ... --access_token ... \
        --all-sites \
        stackexchange-$(date +%Y%m%d).json
[I 210308 10:16:14 export:149] exporting ['3dprinting', '3dprinting.meta', 'academia', 'academia.meta', 'ai', 'ai.meta', 'alcohol', 'alcohol.meta', 'android', 'android.meta', 'anime', 'anime.meta', 'apple', 'apple.meta', 'arduino', 'arduino.meta', 'askubuntu', 'astronomy', 'astronomy.meta', 'aviation', 'aviation.meta', 'bicycles', 'bicycles.meta', 'bioinformatics', 'bioinformatics.meta', 'biology', 'biology.meta', 'bitcoin', 'bitcoin.meta', 'blender', 'blender.meta', 'boardgames', 'boardgames.meta', 'bricks', 'bricks.meta', 'buddhism', 'buddhism.meta', 'chemistry', 'chemistry.meta', 'chess', 'chess.meta', 'chinese', 'chinese.meta', 'christianity', 'christianity.meta', 'civicrm', 'civicrm.meta', 'codegolf', 'codegolf.meta', 'codereview', 'codereview.meta', 'coffee', 'coffee.meta', 'communitybuilding', 'communitybuilding.meta', 'computergraphics', 'computergraphics.meta', 'conlang', 'conlang.meta', 'cooking', 'cooking.meta', 'craftcms', 'craftcms.meta', 'crafts', 'crafts.meta', 'crypto', 'crypto.meta', 'cs', 'cs.meta', 'cs50', 'cs50.meta', 'cseducators', 'cseducators.meta', 'cstheory', 'cstheory.meta', 'datascience', 'datascience.meta', 'dba', 'dba.meta', 'devops', 'devops.meta', 'diy', 'diy.meta', 'drones', 'drones.meta', 'drupal', 'drupal.meta', 'dsp', 'dsp.meta', 'earthscience', 'earthscience.meta', 'ebooks', 'ebooks.meta', 'economics', 'economics.meta', 'electronics', 'electronics.meta', 'elementaryos', 'elementaryos.meta', 'ell', 'ell.meta', 'emacs', 'emacs.meta', 'engineering', 'engineering.meta', 'english', 'english.meta', 'eosio', 'eosio.meta', 'es.meta.stackoverflow', 'es.stackoverflow', 'esperanto', 'esperanto.meta', 'ethereum', 'ethereum.meta', 'expatriates', 'expatriates.meta', 'expressionengine', 'expressionengine.meta', 'fitness', 'fitness.meta', 'freelancing', 'freelancing.meta', 'french', 'french.meta', 'gamedev', 'gamedev.meta', 'gaming', 'gaming.meta', 'gardening', 'gardening.meta', 'genealogy', 'genealogy.meta', 'german', 'german.meta', 'gis', 'gis.meta', 'graphicdesign', 'graphicdesign.meta', 'ham', 'ham.meta', 'hardwarerecs', 'hardwarerecs.meta', 'hermeneutics', 'hermeneutics.meta', 'hinduism', 'hinduism.meta', 'history', 'history.meta', 'homebrew', 'homebrew.meta', 'hsm', 'hsm.meta', 'interpersonal', 'interpersonal.meta', 'iot', 'iot.meta', 'iota', 'iota.meta', 'islam', 'islam.meta', 'italian', 'italian.meta', 'ja.meta.stackoverflow', 'ja.stackoverflow', 'japanese', 'japanese.meta', 'joomla', 'joomla.meta', 'judaism', 'judaism.meta', 'korean', 'korean.meta', 'languagelearning', 'languagelearning.meta', 'latin', 'latin.meta', 'law', 'law.meta', 'lifehacks', 'lifehacks.meta', 'linguistics', 'linguistics.meta', 'literature', 'literature.meta', 'magento', 'magento.meta', 'martialarts', 'martialarts.meta', 'math', 'math.meta', 'matheducators', 'matheducators.meta', 'mathematica', 'mathematica.meta', 'mathoverflow.net', 'mattermodeling', 'mattermodeling.meta', 'mechanics', 'mechanics.meta', 'medicalsciences', 'medicalsciences.meta', 'meta', 'meta.askubuntu', 'meta.mathoverflow.net', 'meta.serverfault', 'meta.stackoverflow', 'meta.superuser', 'monero', 'monero.meta', 'money', 'money.meta', 'movies', 'movies.meta', 'music', 'music.meta', 'musicfans', 'musicfans.meta', 'mythology', 'mythology.meta', 'networkengineering', 'networkengineering.meta', 'opendata', 'opendata.meta', 'opensource', 'opensource.meta', 'or', 'or.meta', 'outdoors', 'outdoors.meta', 'parenting', 'parenting.meta', 'patents', 'patents.meta', 'pets', 'pets.meta', 'philosophy', 'philosophy.meta', 'photo', 
'photo.meta', 'physics', 'physics.meta', 'pm', 'pm.meta', 'poker', 'poker.meta', 'politics', 'politics.meta', 'portuguese', 'portuguese.meta', 'psychology', 'psychology.meta', 'pt.meta.stackoverflow', 'pt.stackoverflow', 'puzzling', 'puzzling.meta', 'quant', 'quant.meta', 'quantumcomputing', 'quantumcomputing.meta', 'raspberrypi', 'raspberrypi.meta', 'retrocomputing', 'retrocomputing.meta', 'reverseengineering', 'reverseengineering.meta', 'robotics', 'robotics.meta', 'rpg', 'rpg.meta', 'ru.meta.stackoverflow', 'ru.stackoverflow', 'rus', 'rus.meta', 'russian', 'russian.meta', 'salesforce', 'salesforce.meta', 'scicomp', 'scicomp.meta', 'scifi', 'scifi.meta', 'security', 'security.meta', 'serverfault', 'sharepoint', 'sharepoint.meta', 'sitecore', 'sitecore.meta', 'skeptics', 'skeptics.meta', 'softwareengineering', 'softwareengineering.meta', 'softwarerecs', 'softwarerecs.meta', 'sound', 'sound.meta', 'space', 'space.meta', 'spanish', 'spanish.meta', 'sports', 'sports.meta', 'sqa', 'sqa.meta', 'stackapps', 'stackoverflow', 'stats', 'stats.meta', 'stellar', 'stellar.meta', 'superuser', 'sustainability', 'sustainability.meta', 'tex', 'tex.meta', 'tezos', 'tezos.meta', 'tor', 'tor.meta', 'travel', 'travel.meta', 'tridion', 'tridion.meta', 'ukrainian', 'ukrainian.meta', 'unix', 'unix.meta', 'ux', 'ux.meta', 'vegetarianism', 'vegetarianism.meta', 'vi', 'vi.meta', 'video', 'video.meta', 'webapps', 'webapps.meta', 'webmasters', 'webmasters.meta', 'windowsphone', 'windowsphone.meta', 'woodworking', 'woodworking.meta', 'wordpress', 'wordpress.meta', 'workplace', 'workplace.meta', 'worldbuilding', 'worldbuilding.meta', 'writing', 'writing.meta']
...
[I 210308 10:21:55 export:132] exporting askubuntu: users/{ids}/reputation
[I 210308 10:21:56 export:132] exporting askubuntu: users/{ids}/reputation-history
[I 210308 10:21:57 export:132] exporting askubuntu: users/{ids}/suggested-edits
[E 210308 10:21:58 _common:101] Giving up fetch_backoff(...) after 1 tries (stackapi.stackapi.StackAPIError: ('https://api.stackexchange.com/2.2/users/19769/suggested-edits/?pagesize=100&page=1&filter=%21LVBj2%28M0Wr1s_VedzkH%28VG&site=askubuntu', 502, 'throttle_violation', 'too many requests from this IP, more requests available in 83579 seconds'))
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "./stexport.git/src/stexport/export.py", line 188, in <module>
    main()
  File "./stexport.git/src/stexport/export.py", line 181, in main
    j = exporter.export_json(sites=sites)
  File "./stexport.git/src/stexport/export.py", line 153, in export_json
    all_data[site] = self.export_site(site=site)
  File "./stexport.git/src/stexport/export.py", line 134, in export_site
    data[ep] = fetch_backoff(
  File ".venv/lib/python3.9/site-packages/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
  File "./stexport.git/src/stexport/export.py", line 106, in fetch_backoff
    return api.fetch(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/stackapi/stackapi.py", line 198, in fetch
    raise StackAPIError(self._previous_call, error, code, message)
stackapi.stackapi.StackAPIError: ('https://api.stackexchange.com/2.2/users/19769/suggested-edits/?pagesize=100&page=1&filter=%21LVBj2%28M0Wr1s_VedzkH%28VG&site=askubuntu', 502, 'throttle_violation', 'too many requests from this IP, more requests available in 83579 seconds')

Reported against e93ec39 (Dec 3 2021)
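For illustration, a minimal sketch of the requested behaviour (the names mirror the traceback above, but this is not stexport's actual code): keep whatever was fetched so far and dump it even when a site fails with StackAPIError.

import json

from stackapi.stackapi import StackAPIError

def export_all(exporter, sites, out_path):
    all_data = {}
    try:
        for site in sites:
            all_data[site] = exporter.export_site(site=site)
    except StackAPIError as e:
        print(f'aborting early: {e}')
    finally:
        # whatever was collected before the throttle_violation is still saved
        with open(out_path, 'w') as f:
            json.dump(all_data, f)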

RFE-1

Add a --version option so that reports like this one can state which version of the tool the issue refers to.

@ankostis
Copy link
Author

ankostis commented Mar 8, 2021

RFE-2

I suggest adding a warning to the --help text about --all-sites and the IP ban.
The message should instead recommend using --site for each Stack Exchange site where the user has actually contributed posts, answers & comments (unfortunately, the user's profile does not list the sites where they have only cast votes).

@ankostis
Copy link
Author

ankostis commented Mar 8, 2021

RFE-3

An even better facility would be for the exporter to fetch the list of SE sites the user has registered on, and export only those sites (hoping this does not trigger the ban).

RFE-4

Is there some index on each site describing which types of content the user has submitted, so that only the relevant download requests are made and the ban limit is avoided?

@karlicoss (Owner)

Ah yes indeed, it's pretty annoying. And yeah, that's why I added the --site option -- I ended up only running it for a few sites (instead of the whole network).
But I had no idea it banned for 24h! Maybe that's recent, but anyway, it would be good to add to the readme, yeah.

I think there are a few options, although we need to think through the pros and cons before implementing any (because it might be a fair amount of work):

  • only fetch 'everything' once, and then on later runs retrieve just the last N items (making sure it fits within the limit). The individual 'sliced' exports could then be merged in the data access layer (so it would be kind of a synthetic export)
  • keep track of the last update in some sort of 'state file'. Then on the next run do a 'preflight' request to figure out whether anything new needs to be fetched at all. This would mean that in most cases the tool only needs to make one request per Stack Exchange site (roughly as sketched below)
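A rough sketch of the 'state file' idea (STATE_FILE and the helper names are hypothetical, not stexport code; it relies on StackAPI forwarding extra keyword arguments such as fromdate as query parameters, as its documentation shows for other endpoints):

import json
import time
from pathlib import Path

STATE_FILE = Path('~/.stexport-state.json').expanduser()  # hypothetical location

def load_state():
    # {site: epoch of last successful export}, or {} on the first run
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def needs_update(api, user_id, since):
    # preflight: one cheap request asking for a single timeline item
    # newer than the previous export (epoch seconds, which the API accepts)
    api.max_pages = 1
    api.page_size = 1
    res = api.fetch('users/{}/timeline'.format(user_id), fromdate=int(since))
    return bool(res['items'])

def mark_updated(state, site):
    state[site] = int(time.time())
    STATE_FILE.write_text(json.dumps(state, indent=2))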

@karlicoss (Owner)

> An even better facility would be for the exporter to fetch the list of SE sites the user has registered on, and export only those sites (hoping this does not trigger the ban).

Yeah, I think that makes a lot of sense! Can't remember whether there is such an endpoint, but maybe it's possible to reuse "users/{ids}" for it? Maybe there is some meta information in the result of that call that would tell whether any further calls are needed for this site at all.

> Is there some index on each site describing which types of content the user has submitted, so that only the relevant download requests are made and the ban limit is avoided?

Perhaps this?

ENDPOINTS = [
    "users/{ids}",
    "users/{ids}/answers",
    "users/{ids}/badges",
    "users/{ids}/comments",
    "users/{ids}/favorites",
    "users/{ids}/mentioned",
    ## these don't take 'site' parameter..
    # "users/{id}/network-activity",
    # "users/{id}/notifications",
    ##
    "users/{ids}/posts",
    "users/{id}/privileges",
    "users/{ids}/questions",
    ## these overlap with 'questions'
    # users/{ids}/questions/featured
    # users/{ids}/questions/no-answers
    # users/{ids}/questions/unaccepted
    # users/{ids}/questions/unanswered
    ##
    "users/{ids}/reputation",
    "users/{ids}/reputation-history",
    ## this needs auth token
    # users/{id}/reputation-history/full
    ##
    "users/{ids}/suggested-edits",
    "users/{ids}/tags",
    ## these overlap with 'tags'
    # users/{id}/tags/{tags}/top-answers
    # users/{id}/tags/{tags}/top-questions
    ##
    "users/{ids}/timeline",
    "users/{id}/top-answer-tags",
    "users/{id}/top-question-tags",
    "users/{id}/top-tags",
    ## TODO err, this was often resulting in internal server error...
    # "users/{id}/write-permissions",
    ##
    ## these need auth token, not sure how useful are they
    # users/{id}/inbox
    # users/{id}/inbox/unread
    ##
]
# FILTER = 'default'
FILTER = '!LVBj2(M0Wr1s_VedzkH(VG'
# check it out here https://api.stackexchange.com/docs/read-filter#filters=!SnL4e6G*07of2S.ynb&filter=default&run=true
# TODO eh, better make it explicit with 'filter' api call https://api.stackexchange.com/docs/create-filter
# private filters: answer.{accepted, downvoted, upvoted}; comment.upvoted . wonder why, accepted is clearly visible on the website..
#


ankostis commented Mar 8, 2021

I see that RFE-4 does not make sense; it's just 5 URLs per site.

RFE-3 is the important one. I got my list of sites to scrape from here: https://stackexchange.com/users/263317/ankostis?tab=accounts


karlicoss commented Mar 8, 2021

oh btw about that

> the exporter crashed without storing any of the downloaded data

Yeah indeed, it's also kind of a problem -- a consequence of the way the exporters work: they output data to stdout (for simplicity) and then it's dumped atomically. Even if the data were written out as the export progressed, it's a single JSON structure (a dictionary in this case), so it would be malformed unless complete and would require some manual intervention to make it well-formed.

Maybe that won't be necessary in this case if we make fewer requests and get away with it -- but a more general way to handle this might be to make the export files JSONL. For example, it could dump one JSON object per site on each line. That would still be backed by a single text file (so we keep the simplicity), but it's flexible enough and easy to assemble back in dal.py (it just needs to read the input file line by line instead of a single json.load as before). Also, with a single --site the output would be exactly the same as before, which is kind of nice I guess. A rough sketch:
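(Hypothetical helpers, not the current stexport code: one self-contained JSON object per line as each site finishes, so a crash mid-export still leaves the completed sites on disk, and dal.py can reassemble the dictionary by reading the file line by line.)

import json

def dump_site(fh, site, data):
    # flush after every site so an interrupted export keeps what it already has
    fh.write(json.dumps({'site': site, 'data': data}) + '\n')
    fh.flush()

def load_export(path):
    # dal.py-style reassembly of {site: data}
    result = {}
    with open(path) as fh:
        for line in fh:
            obj = json.loads(line)
            result[obj['site']] = obj['data']
    return result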


Cobertos commented Mar 9, 2021

Weird that you got hit with a full-day ban? The limits they disclose are:

  • more than 30 reqs/sec triggers a ban (of 30s to 2m, though they say this is subject to change)
  • 10,000 reqs per day per app/user (access_token or IP) pair
  • there are also per-method limits

And regarding RFE-3, it looks like it could leverage the me/associated-users endpoint.

EDIT: Made PR for RFE-3 in #3
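For reference, in the current API docs the network-wide method for this appears to be /me/associated; a rough sketch using plain requests (illustrative only, not necessarily what PR #3 does), which lists the sites where the authenticated account has a profile:

import requests

def associated_sites(access_token, key):
    # /me/associated is a network-wide method (no 'site' parameter);
    # each item is a network_user with fields like site_name / site_url
    resp = requests.get(
        'https://api.stackexchange.com/2.2/me/associated',
        params={'access_token': access_token, 'key': key, 'pagesize': 100},
    )
    resp.raise_for_status()
    data = resp.json()
    # NOTE: if the account spans more than 100 sites, follow data['has_more'] / 'page'
    return [item['site_url'] for item in data['items']]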


Cobertos commented Mar 9, 2021

Wow, yeah, I've made maybe 300 requests total while testing and I just got the 24hr ban... Wasn't able to get a full export unfortunately


karlicoss commented Mar 10, 2021

Some things I noticed from my own experiments:

Quota is returned in the raw API response:

data[ep] = fetch_backoff(
    api,
    endpoint=ep.format(ids=self.user_id, id=self.user_id),
    filter=FILTER,
)['items']
(we only extract items from it, so we could at least inspect the quota beforehand to avoid the ban). However, there was an issue: AWegnerGitHub/stackapi#41, which wasn't released on PyPI yet (AWegnerGitHub/stackapi#44); the author has kindly released it now! So after pip3 install --user stackapi --upgrade it reports the stats correctly:

 'quota_max': 300,
 'quota_remaining': 28,

So maybe at the very least it would be possible to warn the user when we're about to make too many requests (i.e. if len(ENDPOINTS) * len(sites) > 300), roughly as sketched below.
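Something along these lines (just a sketch; warn_if_over_quota is a made-up helper, and it assumes the fetched dict exposes quota_remaining as shown above):

import logging

def warn_if_over_quota(api, user_id, sites, endpoints):
    # one cheap call just to read the current quota off the raw response
    res = api.fetch('users/{}'.format(user_id))
    remaining = res.get('quota_remaining', 0)
    planned = len(endpoints) * len(sites)
    if planned > remaining:
        logging.warning(
            'about to make ~%d requests but only %d remain in the quota; '
            'consider passing fewer --site arguments', planned, remaining)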

karlicoss added a commit that referenced this issue Mar 10, 2021
As a result, they get 10K daily limit instead of 300 requests.

#1

karlicoss commented Mar 10, 2021

And I think I figured it out :) The per-site APIs didn't get the api parameters, so they ended up with the default 300-request limit instead of 10K. Once we merge @Cobertos's PR (which still makes sense regardless!) I can merge this too, and hopefully it will resolve the issue?

api = _get_api(**self.api_params)
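For illustration, the gist of the fix (a sketch; stexport's actual _get_api may differ): forward the authentication parameters to each per-site StackAPI instance, which is what bumps the daily quota from 300 to 10K.

from stackapi import StackAPI

def _get_api(site, **api_params):
    # api_params carries the key/access_token supplied on the command line
    return StackAPI(site, **api_params)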

P.S. I also noticed some weird "Expecting value: line 1 column 1 (char 0)" failures at times when exporting everything (it seems to happen only on specific communities somehow; I guess that's why it hasn't happened before).
Also worked around in c2ae2f7; will merge after #3.

@Cobertos (Contributor)

After merging origin/fix locally, I was able to run the full export. fetch_backoff kicked in about 6 times (each time retrying around 7 times with waits of up to 120s), but I got through all of the sites below without being banned :3

exporting ['academia', 'ai', 'alcohol', 'android', 'arduino', 'askubuntu', 'bicycles', 'blender', 'codegolf', 'codereview', 'computergraphics', 'diy', 'electronics', 'english', 'gamedev', 'gaming', 'gis', 'interpersonal', 'japanese', 'math', 'mechanics', 'money', 'movies', 'music', 'outdoors', 'parenting', 'physics', 'politics', 'raspberrypi', 'security', 'serverfault', 'skeptics', 'softwareengineering', 'softwarerecs', 'stackapps', 'stackoverflow', 'superuser', 'travel', 'unix', 'video', 'webapps', 'webmasters', 'workplace', 'worldbuilding', 'writing']

karlicoss added a commit that referenced this issue Mar 13, 2021
As a result, they get 10K daily limit instead of 300 requests.

#1
@karlicoss (Owner)

Ok, I guess this is fixed in master! I also updated the readme about getting an access_token (it's just a matter of copying a URL, so I figured it's not worth a --login feature).
Thanks everyone, now we can finally export all of it properly!
