
RFE: store exported data even if blocked due to throttle violation #1

Closed
ankostis opened this issue Mar 8, 2021 · 12 comments

ankostis commented Mar 8, 2021

When authenticated with a Stack Exchange app and downloading all sites, Stack Exchange blocked the export with a throttle_violation (502) error, and the exporter crashed without storing any of the data downloaded so far.

The expected behavior would be to save whatever data has already been collected, e.g. in a try/finally clause (see the sketch after the traceback below).

  • The behaviour is particularly grave because the penalty for the app is 24 hours!
  • The throttling error is practically guaranteed to happen, since there are too many sites.
./stexport.git $ python -m stexport.export \
        --user_id ... --key ... --access_token ... \
        --all-sites \
        stackexchange-$(date +%Y%m%d).json
[I 210308 10:16:14 export:149] exporting ['3dprinting', '3dprinting.meta', 'academia', 'academia.meta', 'ai', 'ai.meta', 'alcohol', 'alcohol.meta', 'android', 'android.meta', 'anime', 'anime.meta', 'apple', 'apple.meta', 'arduino', 'arduino.meta', 'askubuntu', 'astronomy', 'astronomy.meta', 'aviation', 'aviation.meta', 'bicycles', 'bicycles.meta', 'bioinformatics', 'bioinformatics.meta', 'biology', 'biology.meta', 'bitcoin', 'bitcoin.meta', 'blender', 'blender.meta', 'boardgames', 'boardgames.meta', 'bricks', 'bricks.meta', 'buddhism', 'buddhism.meta', 'chemistry', 'chemistry.meta', 'chess', 'chess.meta', 'chinese', 'chinese.meta', 'christianity', 'christianity.meta', 'civicrm', 'civicrm.meta', 'codegolf', 'codegolf.meta', 'codereview', 'codereview.meta', 'coffee', 'coffee.meta', 'communitybuilding', 'communitybuilding.meta', 'computergraphics', 'computergraphics.meta', 'conlang', 'conlang.meta', 'cooking', 'cooking.meta', 'craftcms', 'craftcms.meta', 'crafts', 'crafts.meta', 'crypto', 'crypto.meta', 'cs', 'cs.meta', 'cs50', 'cs50.meta', 'cseducators', 'cseducators.meta', 'cstheory', 'cstheory.meta', 'datascience', 'datascience.meta', 'dba', 'dba.meta', 'devops', 'devops.meta', 'diy', 'diy.meta', 'drones', 'drones.meta', 'drupal', 'drupal.meta', 'dsp', 'dsp.meta', 'earthscience', 'earthscience.meta', 'ebooks', 'ebooks.meta', 'economics', 'economics.meta', 'electronics', 'electronics.meta', 'elementaryos', 'elementaryos.meta', 'ell', 'ell.meta', 'emacs', 'emacs.meta', 'engineering', 'engineering.meta', 'english', 'english.meta', 'eosio', 'eosio.meta', 'es.meta.stackoverflow', 'es.stackoverflow', 'esperanto', 'esperanto.meta', 'ethereum', 'ethereum.meta', 'expatriates', 'expatriates.meta', 'expressionengine', 'expressionengine.meta', 'fitness', 'fitness.meta', 'freelancing', 'freelancing.meta', 'french', 'french.meta', 'gamedev', 'gamedev.meta', 'gaming', 'gaming.meta', 'gardening', 'gardening.meta', 'genealogy', 'genealogy.meta', 'german', 'german.meta', 'gis', 'gis.meta', 'graphicdesign', 'graphicdesign.meta', 'ham', 'ham.meta', 'hardwarerecs', 'hardwarerecs.meta', 'hermeneutics', 'hermeneutics.meta', 'hinduism', 'hinduism.meta', 'history', 'history.meta', 'homebrew', 'homebrew.meta', 'hsm', 'hsm.meta', 'interpersonal', 'interpersonal.meta', 'iot', 'iot.meta', 'iota', 'iota.meta', 'islam', 'islam.meta', 'italian', 'italian.meta', 'ja.meta.stackoverflow', 'ja.stackoverflow', 'japanese', 'japanese.meta', 'joomla', 'joomla.meta', 'judaism', 'judaism.meta', 'korean', 'korean.meta', 'languagelearning', 'languagelearning.meta', 'latin', 'latin.meta', 'law', 'law.meta', 'lifehacks', 'lifehacks.meta', 'linguistics', 'linguistics.meta', 'literature', 'literature.meta', 'magento', 'magento.meta', 'martialarts', 'martialarts.meta', 'math', 'math.meta', 'matheducators', 'matheducators.meta', 'mathematica', 'mathematica.meta', 'mathoverflow.net', 'mattermodeling', 'mattermodeling.meta', 'mechanics', 'mechanics.meta', 'medicalsciences', 'medicalsciences.meta', 'meta', 'meta.askubuntu', 'meta.mathoverflow.net', 'meta.serverfault', 'meta.stackoverflow', 'meta.superuser', 'monero', 'monero.meta', 'money', 'money.meta', 'movies', 'movies.meta', 'music', 'music.meta', 'musicfans', 'musicfans.meta', 'mythology', 'mythology.meta', 'networkengineering', 'networkengineering.meta', 'opendata', 'opendata.meta', 'opensource', 'opensource.meta', 'or', 'or.meta', 'outdoors', 'outdoors.meta', 'parenting', 'parenting.meta', 'patents', 'patents.meta', 'pets', 'pets.meta', 'philosophy', 'philosophy.meta', 'photo', 
'photo.meta', 'physics', 'physics.meta', 'pm', 'pm.meta', 'poker', 'poker.meta', 'politics', 'politics.meta', 'portuguese', 'portuguese.meta', 'psychology', 'psychology.meta', 'pt.meta.stackoverflow', 'pt.stackoverflow', 'puzzling', 'puzzling.meta', 'quant', 'quant.meta', 'quantumcomputing', 'quantumcomputing.meta', 'raspberrypi', 'raspberrypi.meta', 'retrocomputing', 'retrocomputing.meta', 'reverseengineering', 'reverseengineering.meta', 'robotics', 'robotics.meta', 'rpg', 'rpg.meta', 'ru.meta.stackoverflow', 'ru.stackoverflow', 'rus', 'rus.meta', 'russian', 'russian.meta', 'salesforce', 'salesforce.meta', 'scicomp', 'scicomp.meta', 'scifi', 'scifi.meta', 'security', 'security.meta', 'serverfault', 'sharepoint', 'sharepoint.meta', 'sitecore', 'sitecore.meta', 'skeptics', 'skeptics.meta', 'softwareengineering', 'softwareengineering.meta', 'softwarerecs', 'softwarerecs.meta', 'sound', 'sound.meta', 'space', 'space.meta', 'spanish', 'spanish.meta', 'sports', 'sports.meta', 'sqa', 'sqa.meta', 'stackapps', 'stackoverflow', 'stats', 'stats.meta', 'stellar', 'stellar.meta', 'superuser', 'sustainability', 'sustainability.meta', 'tex', 'tex.meta', 'tezos', 'tezos.meta', 'tor', 'tor.meta', 'travel', 'travel.meta', 'tridion', 'tridion.meta', 'ukrainian', 'ukrainian.meta', 'unix', 'unix.meta', 'ux', 'ux.meta', 'vegetarianism', 'vegetarianism.meta', 'vi', 'vi.meta', 'video', 'video.meta', 'webapps', 'webapps.meta', 'webmasters', 'webmasters.meta', 'windowsphone', 'windowsphone.meta', 'woodworking', 'woodworking.meta', 'wordpress', 'wordpress.meta', 'workplace', 'workplace.meta', 'worldbuilding', 'worldbuilding.meta', 'writing', 'writing.meta']
...
[I 210308 10:21:55 export:132] exporting askubuntu: users/{ids}/reputation
[I 210308 10:21:56 export:132] exporting askubuntu: users/{ids}/reputation-history
[I 210308 10:21:57 export:132] exporting askubuntu: users/{ids}/suggested-edits
[E 210308 10:21:58 _common:101] Giving up fetch_backoff(...) after 1 tries (stackapi.stackapi.StackAPIError: ('https://api.stackexchange.com/2.2/users/19769/suggested-edits/?pagesize=100&page=1&filter=%21LVBj2%28M0Wr1s_VedzkH%28VG&site=askubuntu', 502, 'throttle_violation', 'too many requests from this IP, more requests available in 83579 seconds'))
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "./stexport.git/src/stexport/export.py", line 188, in <module>
    main()
  File "./stexport.git/src/stexport/export.py", line 181, in main
    j = exporter.export_json(sites=sites)
  File "./stexport.git/src/stexport/export.py", line 153, in export_json
    all_data[site] = self.export_site(site=site)
  File "./stexport.git/src/stexport/export.py", line 134, in export_site
    data[ep] = fetch_backoff(
  File ".venv/lib/python3.9/site-packages/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
  File "./stexport.git/src/stexport/export.py", line 106, in fetch_backoff
    return api.fetch(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/stackapi/stackapi.py", line 198, in fetch
    raise StackAPIError(self._previous_call, error, code, message)
stackapi.stackapi.StackAPIError: ('https://api.stackexchange.com/2.2/users/19769/suggested-edits/?pagesize=100&page=1&filter=%21LVBj2%28M0Wr1s_VedzkH%28VG&site=askubuntu', 502, 'throttle_violation', 'too many requests from this IP, more requests available in 83579 seconds')

Reported against e93ec39 (Dec 3 2021)
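For illustration, a minimal sketch of the requested behaviour (the names mirror the traceback above, but this is not stexport's actual code): keep whatever was fetched so far and dump it even when a site fails with StackAPIError.

import json

from stackapi.stackapi import StackAPIError

def export_all(exporter, sites, out_path):
    all_data = {}
    try:
        for site in sites:
            all_data[site] = exporter.export_site(site=site)
    except StackAPIError as e:
        print(f'aborting early: {e}')
    finally:
        # whatever was collected before the throttle_violation is still saved
        with open(out_path, 'w') as f:
            json.dump(all_data, f)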

RFE-1

Add a --version option so that reports like this one can state which version of the tool the issue refers to.

@ankostis
Copy link
Author

ankostis commented Mar 8, 2021

RFE-2

I suggest adding a warning to the --help text about --all-sites and the IP ban.
The message should instead recommend using --site for each Stack Exchange site where the user has actually contributed posts, answers & comments (unfortunately, the user's profile does not list the sites where they have only cast votes).

@ankostis
Copy link
Author

ankostis commented Mar 8, 2021

RFE-3

An even better facility would be for the exporter to fetch the list of SE sites the user has registered on, and export only those sites (hoping this does not trigger the ban).

RFE-4

Is there some index on each site describing which types of content the user has submitted, so that only the relevant download requests are made and the ban limit is avoided?

@karlicoss (Owner)

Ah yes indeed, it's pretty annoying. And yeah, that's why I added the --site option -- I ended up only running it for a few sites (instead of the whole network).
But I had no idea it banned for 24h! Maybe that's recent, but anyway, it would be good to add to the readme, yeah.

I think there are a few options, although we need to think through the pros and cons before implementing any (because it might be a fair amount of work):

  • only fetch 'everything' once, and then on later runs retrieve just the last N items (making sure it fits within the limit). The individual 'sliced' exports could then be merged in the data access layer (so it would be kind of a synthetic export)
  • keep track of the last update in some sort of 'state file'. Then on the next run do a 'preflight' request to figure out whether anything new needs to be fetched at all. This would mean that in most cases the tool only needs to make one request per Stack Exchange site (roughly as sketched below)
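A rough sketch of the 'state file' idea (STATE_FILE and the helper names are hypothetical, not stexport code; it relies on StackAPI forwarding extra keyword arguments such as fromdate as query parameters, as its documentation shows for other endpoints):

import json
import time
from pathlib import Path

STATE_FILE = Path('~/.stexport-state.json').expanduser()  # hypothetical location

def load_state():
    # {site: epoch of last successful export}, or {} on the first run
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def needs_update(api, user_id, since):
    # preflight: one cheap request asking for a single timeline item
    # newer than the previous export (epoch seconds, which the API accepts)
    api.max_pages = 1
    api.page_size = 1
    res = api.fetch('users/{}/timeline'.format(user_id), fromdate=int(since))
    return bool(res['items'])

def mark_updated(state, site):
    state[site] = int(time.time())
    STATE_FILE.write_text(json.dumps(state, indent=2))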

@karlicoss (Owner)

> An even better facility would be for the exporter to fetch the list of SE sites the user has registered on, and export only those sites (hoping this does not trigger the ban).

Yeah, I think that makes a lot of sense! Can't remember whether there is such an endpoint, but maybe it's possible to reuse "users/{ids}" for it? Maybe there is some meta information in the result of that call that would tell whether any further calls are needed for this site at all.

> Is there some index on each site describing which types of content the user has submitted, so that only the relevant download requests are made and the ban limit is avoided?

Perhaps this?

ENDPOINTS = [
    "users/{ids}",
    "users/{ids}/answers",
    "users/{ids}/badges",
    "users/{ids}/comments",
    "users/{ids}/favorites",
    "users/{ids}/mentioned",
    ## these don't take 'site' parameter..
    # "users/{id}/network-activity",
    # "users/{id}/notifications",
    ##
    "users/{ids}/posts",
    "users/{id}/privileges",
    "users/{ids}/questions",
    ## these overlap with 'questions'
    # users/{ids}/questions/featured
    # users/{ids}/questions/no-answers
    # users/{ids}/questions/unaccepted
    # users/{ids}/questions/unanswered
    ##
    "users/{ids}/reputation",
    "users/{ids}/reputation-history",
    ## this needs auth token
    # users/{id}/reputation-history/full
    ##
    "users/{ids}/suggested-edits",
    "users/{ids}/tags",
    ## these overlap with 'tags'
    # users/{id}/tags/{tags}/top-answers
    # users/{id}/tags/{tags}/top-questions
    ##
    "users/{ids}/timeline",
    "users/{id}/top-answer-tags",
    "users/{id}/top-question-tags",
    "users/{id}/top-tags",
    ## TODO err, this was often resulting in internal server error...
    # "users/{id}/write-permissions",
    ##
    ## these need auth token, not sure how useful are they
    # users/{id}/inbox
    # users/{id}/inbox/unread
    ##
]
# FILTER = 'default'
FILTER = '!LVBj2(M0Wr1s_VedzkH(VG'
# check it out here https://api.stackexchange.com/docs/read-filter#filters=!SnL4e6G*07of2S.ynb&filter=default&run=true
# TODO eh, better make it explicit with 'filter' api call https://api.stackexchange.com/docs/create-filter
# private filters: answer.{accepted, downvoted, upvoted}; comment.upvoted . wonder why, accepted is clearly visible on the website..
#


ankostis commented Mar 8, 2021

I see that RFE-4 does not make sense; it's just 5 URLs per site.

RFE-3 is the important one. I got my list of sites to scrape from here: https://stackexchange.com/users/263317/ankostis?tab=accounts


karlicoss commented Mar 8, 2021

oh btw about that

> the exporter crashed without storing any of the downloaded data

Yeah indeed, it's also kind of a problem -- a consequence of the way the exporters work: they output data to stdout (for simplicity) and then it's dumped atomically. Even if the data were written out as the export progressed, it's a single JSON structure (a dictionary in this case), so it would be malformed unless complete and would require some manual intervention to make it well-formed.

Maybe that won't be necessary in this case if we make fewer requests and get away with it -- but a more general way to handle this might be to make the export files JSONL. For example, it could dump one JSON object per site on each line. That would still be backed by a single text file (so we keep the simplicity), but it's flexible enough and easy to assemble back in dal.py (it just needs to read the input file line by line instead of a single json.load as before). Also, with a single --site the output would be exactly the same as before, which is kind of nice I guess. A rough sketch:
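(Hypothetical helpers, not the current stexport code: one self-contained JSON object per line as each site finishes, so a crash mid-export still leaves the completed sites on disk, and dal.py can reassemble the dictionary by reading the file line by line.)

import json

def dump_site(fh, site, data):
    # flush after every site so an interrupted export keeps what it already has
    fh.write(json.dumps({'site': site, 'data': data}) + '\n')
    fh.flush()

def load_export(path):
    # dal.py-style reassembly of {site: data}
    result = {}
    with open(path) as fh:
        for line in fh:
            obj = json.loads(line)
            result[obj['site']] = obj['data']
    return result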


Cobertos commented Mar 9, 2021

Weird that you got hit with a full-day ban? The limits they disclose are:

  • more than 30 reqs/sec triggers a ban (of 30s to 2m, though they say this is subject to change)
  • 10,000 reqs per day per app/user (access_token or IP) pair
  • there are also per-method limits

And regarding RFE-3, it looks like it could leverage the me/associated-users endpoint.

EDIT: Made PR for RFE-3 in #3
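For reference, in the current API docs the network-wide method for this appears to be /me/associated; a rough sketch using plain requests (illustrative only, not necessarily what PR #3 does), which lists the sites where the authenticated account has a profile:

import requests

def associated_sites(access_token, key):
    # /me/associated is a network-wide method (no 'site' parameter);
    # each item is a network_user with fields like site_name / site_url
    resp = requests.get(
        'https://api.stackexchange.com/2.2/me/associated',
        params={'access_token': access_token, 'key': key, 'pagesize': 100},
    )
    resp.raise_for_status()
    data = resp.json()
    # NOTE: if the account spans more than 100 sites, follow data['has_more'] / 'page'
    return [item['site_url'] for item in data['items']]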


Cobertos commented Mar 9, 2021

Wow, yeah, I've made maybe 300 requests total while testing and I just got the 24hr ban... Wasn't able to get a full export unfortunately


karlicoss commented Mar 10, 2021

Some things I noticed from my own experiments:

Quota is returned in the raw API response:

data[ep] = fetch_backoff(
    api,
    endpoint=ep.format(ids=self.user_id, id=self.user_id),
    filter=FILTER,
)['items']
(we only extract items from it, so we could at least inspect the quota beforehand to avoid the ban). However, there was an issue: AWegnerGitHub/stackapi#41, which wasn't released on PyPI yet (AWegnerGitHub/stackapi#44); the author has kindly released it now! So after pip3 install --user stackapi --upgrade it reports the stats correctly:

 'quota_max': 300,
 'quota_remaining': 28,

So maybe at the very least it would be possible to warn the user when we're about to make too many requests (i.e. if len(ENDPOINTS) * len(sites) > 300), roughly as sketched below.
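Something along these lines (just a sketch; warn_if_over_quota is a made-up helper, and it assumes the fetched dict exposes quota_remaining as shown above):

import logging

def warn_if_over_quota(api, user_id, sites, endpoints):
    # one cheap call just to read the current quota off the raw response
    res = api.fetch('users/{}'.format(user_id))
    remaining = res.get('quota_remaining', 0)
    planned = len(endpoints) * len(sites)
    if planned > remaining:
        logging.warning(
            'about to make ~%d requests but only %d remain in the quota; '
            'consider passing fewer --site arguments', planned, remaining)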

karlicoss added a commit that referenced this issue Mar 10, 2021
As a result, they get 10K daily limit instead of 300 requests.

#1

karlicoss commented Mar 10, 2021

And I think I figured it out :) The per-site APIs didn't get the api parameters, so they ended up with the default 300-request limit instead of 10K. Once we merge @Cobertos's PR (which still makes sense regardless!) I can merge this too, and hopefully it will resolve the issue?

api = _get_api(**self.api_params)
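For illustration, the gist of the fix (a sketch; stexport's actual _get_api may differ): forward the authentication parameters to each per-site StackAPI instance, which is what bumps the daily quota from 300 to 10K.

from stackapi import StackAPI

def _get_api(site, **api_params):
    # api_params carries the key/access_token supplied on the command line
    return StackAPI(site, **api_params)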

P.S. I also noticed some weird "Expecting value: line 1 column 1 (char 0)" failures at times when exporting everything (it seems to happen only on specific communities somehow; I guess that's why it hasn't happened before).
Also worked around in c2ae2f7; will merge after #3.

@Cobertos (Contributor)

After merging origin/fix locally, I was able to run the full export. fetch_backoff kicked in about 6 times (each time retrying around 7 times with waits of up to 120s), but I got through all of the sites below without being banned :3

exporting ['academia', 'ai', 'alcohol', 'android', 'arduino', 'askubuntu', 'bicycles', 'blender', 'codegolf', 'codereview', 'computergraphics', 'diy', 'electronics', 'english', 'gamedev', 'gaming', 'gis', 'interpersonal', 'japanese', 'math', 'mechanics', 'money', 'movies', 'music', 'outdoors', 'parenting', 'physics', 'politics', 'raspberrypi', 'security', 'serverfault', 'skeptics', 'softwareengineering', 'softwarerecs', 'stackapps', 'stackoverflow', 'superuser', 'travel', 'unix', 'video', 'webapps', 'webmasters', 'workplace', 'worldbuilding', 'writing']

karlicoss added a commit that referenced this issue Mar 13, 2021
As a result, they get 10K daily limit instead of 300 requests.

#1
@karlicoss (Owner)

Ok, I guess this is fixed in master! I also updated the readme about getting an access_token (it's just a matter of copying a URL, so I figured it's not worth a --login feature).
Thanks everyone, now we can finally export all of it properly!
