This repository has been archived by the owner on Feb 28, 2023. It is now read-only.

DMArchiver is broken #83

Open
cajuncooks opened this issue Aug 16, 2020 · 24 comments

Comments

@cajuncooks
Contributor

I'm really just kind of hoping to open a dialogue here; I have no idea if there's anything we can reasonably do to solve this issue, but I mostly just want to hear that this is actually affecting somebody else. Earlier this week, this started happening to me:

$ dmarchiver
DMArchiver 0.2.6
Running on Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0]

[...]

Press Ctrl+C at anytime to write the current conversation and skip to the next one.
 Keep it pressed to exit the script.

Conversation ID not specified. Retrieving all the threads.
Expecting value: line 1 column 1 (char 0)

The last line there is the JSON parser failing, because the POST request doesn't give a valid response. This happens whether or not you specify a conversation ID; it seems that all messages URLs that were in use now fail. Authentication still works, but ~nothing else related to DMs, as far as I can tell. I tried several different changes to the headers passed into the request, but nothing that produced actual fruitful results.
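A minimal sketch of why that message appears: Python's `json` parser raises it whenever the body does not begin with a JSON value (an empty body or an HTML page, for example). The helper name `parse_json_body` and the sample body are illustrative assumptions, not code from DMArchiver.

```python
import json


def parse_json_body(body: str):
    """Return (data, error); error is None when the body is valid JSON."""
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        # Keep a short preview so the log shows what the server really sent.
        preview = body[:80].replace("\n", " ")
        return None, f"{exc} (body starts with: {preview!r})"


# Hypothetical non-JSON response, e.g. an HTML page returned instead of JSON.
data, error = parse_json_body("<!DOCTYPE html><html></html>")
print(error)
```

Logging the start of the body alongside the parser error makes it easier to tell an HTML fallback page from an empty response.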

I went through some of #79 seeking an alternate solution, but the API endpoints mentioned in there seem to not exist anymore, or are locked behind some kind of additional authentication, despite my API application having permissions for DM access.

$ twurl -X GET /1.1/dm/conversation/[conversation_id].json
{"errors":[{"message":"Your credentials do not allow access to this resource","code":220}]}

This is not specific to twurl, either... as I noted over in bear/python-twitter#665 (a PR which updates a deprecated DM endpoint in python-twitter), I get useless output there, an empty events array (when according to Twitter's own documentation, "[i]n rare cases the events array may be empty"). e.g.:

In [4]: api.GetDirectMessages(return_json=True)                                                                       
Out[4]: {'events': [], 'next_cursor': 'MTI5NDc2OTE5NTc1MDc3Mjc0Nw'}

This matches my experience using twurl as suggested in that documentation, too.

I'm hypothesizing that this all has something to do with the breach Twitter experienced last month and their development for the v2.0 API, but the gut punch is that API access to DMs is listed under "Nesting" (the least-developed column, it seems) on the roadmap, which means that we may be months from a solution if the methods used in this application are no longer viable. I'd love to contribute to a solution here that doesn't involve an always-running selenium webdriver or some other related nonsense, but I'm not sure how to approach it.

@Mincka
Owner

Mincka commented Aug 16, 2020

Hi,

Indeed, it looks like that's the end of DMArchiver and its HTML parsing method.
At some point, it was expected that they would drop this interface to use only JavaScript and the API.
I tried disabling JavaScript: https://mobile.twitter.com/ is still accessible but not usable (only a few messages are shown in conversations).

Now, it seems there is a difference between using the official API and using it "through the browser". If we were users of the API, we would face the same limitations and it would not be possible to retrieve all the DMs of a conversation. I just tried to stupidly scroll up in a conversation with thousands of messages and I was still able to retrieve everything. It does not mean however there is no limitation.

When inspecting a GET /1.1/dm/conversation/1001168196991373314.json request, we can see this in the headers:

access-control-expose-headers: X-Rate-Limit-Limit, X-Rate-Limit-Remaining, X-Rate-Limit-Reset
x-rate-limit-limit: 900
x-rate-limit-remaining: 788
x-rate-limit-reset: 1597555966

It looks like 900 requests with 20 messages per 15 minutes.
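As a rough sketch, the three headers above can be turned into a "how long until the window resets" figure. The function name `rate_limit_status` is an assumption for illustration; the header names and sample values are the ones quoted above, and the reset value is assumed to be a Unix timestamp in seconds.

```python
import time


def rate_limit_status(headers, now=None):
    """Summarise the x-rate-limit-* headers of a response.

    `headers` is any mapping with lowercase header names.
    """
    now = time.time() if now is None else now
    reset_at = int(headers["x-rate-limit-reset"])  # assumed Unix epoch seconds
    return {
        "limit": int(headers["x-rate-limit-limit"]),
        "remaining": int(headers["x-rate-limit-remaining"]),
        "seconds_until_reset": max(0, reset_at - now),
    }


sample_headers = {
    "x-rate-limit-limit": "900",
    "x-rate-limit-remaining": "788",
    "x-rate-limit-reset": "1597555966",
}
print(rate_limit_status(sample_headers, now=1597555066))
```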

If the API is called exactly as a browser, I think we could avoid Selenium.

@cajuncooks
Contributor Author

Aha... I see those GET requests now in the inspector -- they must just not be exposed via the normal API protocols. So maybe it's just a matter of passing the right authentication/cookies/headers? I will look into this tomorrow.

@Mincka
Owner

Mincka commented Aug 16, 2020

The basic flow is the following:

  1. Authenticate on POST /sessions
  2. Get conversations on GET /1.1/dm/inbox_initial_state.json or GET /1.1/dm/inbox_timeline/trusted.json for the conversations not loaded at logon. Not sure if looping is necessary for the latter. I don't have enough conversations.
  3. Get messages on GET /1.1/dm/conversation/2350610210-788777070428032928.json. The max_id=1073460340120322053 URL parameter is used to know the position in the thread.
  4. Loop based on the value of status, which can be HAS_MORE or AT_END:
{"conversation_timeline":{"status":"HAS_MORE","min_entry_id":"1202510056522140100","max_entry_id":"1251082889214502789",
{"conversation_timeline":{"status":"AT_END","min_entry_id":"952320302367985669","max_entry_id":"952981962960540420",
  5. Monitor the API limits with the headers
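
The paging part of the flow above (steps 3 and 4) could be sketched as follows. This is only an outline under assumptions: `fetch_page` is a hypothetical callable standing in for the authenticated GET request, and feeding the previous page's min_entry_id back as the next max_id is a guess at how the position parameter advances, not something confirmed here.

```python
def iter_pages(fetch_page, max_pages=10000):
    """Yield each conversation_timeline dict until status is no longer HAS_MORE.

    fetch_page(max_id) must return the decoded JSON of one response;
    max_id is None for the first (most recent) page.
    """
    max_id = None
    for _ in range(max_pages):  # hard cap so a bad response cannot loop forever
        timeline = fetch_page(max_id)["conversation_timeline"]
        yield timeline
        if timeline["status"] != "HAS_MORE":
            break
        # Assumption: the oldest entry of this page is the next position.
        max_id = timeline["min_entry_id"]


# Fake two-page conversation built from the status values shown above.
FAKE_PAGES = {
    None: {"conversation_timeline": {
        "status": "HAS_MORE", "min_entry_id": "1202510056522140100"}},
    "1202510056522140100": {"conversation_timeline": {
        "status": "AT_END", "min_entry_id": "952320302367985669"}},
}

statuses = [page["status"] for page in iter_pages(FAKE_PAGES.__getitem__)]
print(statuses)
```

Passing the fetch function in keeps the loop testable without touching the network or an account's rate limit.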

The requests must have the session cookies. This is not the hard part, requests can handle it.
E.g. auth_token is the identity of the user and is set with the response of POST /sessions

The requests must also have a Bearer that looks like this:
authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzXejRCOuH5E1I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHpKLTvJu4FA65AGWWjCpTnA

It's to authorize the browser, as a client, to access the API.

I found out that this bearer is returned in the response of
GET /responsive-web/client-web/main.05e1f885.js

Somewhere in the middle of the code:

const r="ACTION_FLUSH",i="ACTION_REFRESH",o="3033300",s="Web-12",a="AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzXejRCOuH5E1I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHpKLTvJu4FA65AGWWjCpTnA",c="14191373",l="ct0",u="x-csrf-token",d="eu_cn",p="ab_decider",h="fm",m="gt",f="responsive_web",b="_mb_tk",g="night_mode",_="rweb",y="m5",v="LiteNativeWrapper",w="/sw.js",E="_sl",O="tombstone://card",T="twid",I="TwitterAndroidLite",S=new Uint8Array([4,94,104,18,141,49,13,74,96,202,82,131,78,91,29,242,150,101,197,0,53,149,230,8,54,38,62,173,43,28,89,130,191,222,213,128,147,62,21,49,187,95,212,194,196,212,140,157,234,34,8,245,143,158,221,15,83,8,222,111,100,204,213,48,75])

I don't know how frequently this could change. From what I can find on Google, it looks like it has been an almost hardcoded value for some time now:
https://www.google.com/search?client=firefox-b-d&q=AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%253D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA

Anyway, this would require a complete rewrite of DMArchiver since the parsing of the JSON is completely different from the HTML parsing. Maybe there are already libraries that do this kind of thing. Text could be ok, but attachments (images, videos, tweets, links) require more work.

@Mincka
Owner

Mincka commented Aug 16, 2020

I confirm that, once the limit is reached, you may be unable to read your own private messages in the browser.
The loader will just spin and you will have to wait for the next 15-minute window.

The response will look like this:

HTTP/1.1 429 Too Many Requests
access-control-allow-credentials: true
access-control-allow-origin: https://twitter.com
access-control-expose-headers: X-Rate-Limit-Limit, X-Rate-Limit-Remaining, X-Rate-Limit-Reset
cache-control: no-cache, no-store, max-age=0
connection: close
Content-Length: 56
content-type: application/json;charset=utf-8
date: Sun, 16 Aug 2020 08:24:44 GMT
server: tsa_o
strict-transport-security: max-age=631138519
x-connection-hash: 156a8755395be1da0e6871fdeae75079
x-rate-limit-limit: 900
x-rate-limit-remaining: 0
x-rate-limit-reset: 1597566893
x-response-time: 111

{"errors":[{"message":"Rate limit exceeded","code":88}]}

@cajuncooks
Contributor Author

I tried setting up an app on the Twitter developer site and getting a bearer token using oauthlib, but I would get access denied with every OAuth2 token except for the public one, no matter what permissions I set. For now, GETting that .js file and parsing it with re.findall('(AAAAAA.*?)\"',str(response.content)) will do, until Twitter changes that, too.
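
A small sketch of that extraction, run against a shortened copy of the JavaScript fragment quoted earlier rather than a live download. The function name `extract_bearer` and the slightly stricter pattern (anchored on the surrounding double quotes) are assumptions; it is a variant of the `re.findall` call above, not the code in the branch.

```python
import re

# A quoted string literal starting with the "AAAAAA" prefix seen in the token.
BEARER_PATTERN = re.compile(r'"(AAAAAA[^"]+)"')


def extract_bearer(js_source: str):
    """Return the first bearer-looking literal in the script, or None."""
    match = BEARER_PATTERN.search(js_source)
    return match.group(1) if match else None


# Excerpt of the minified code quoted earlier in this thread.
sample_js = (
    'o="3033300",s="Web-12",'
    'a="AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzXejRCOuH5E1I8xnZz4puTs'
    '%3D1Zv7ttfk8LF81IUq16cHpKLTvJu4FA65AGWWjCpTnA",c="14191373"'
)
print(extract_bearer(sample_js))
```

Returning None when nothing matches lets the caller fail with a clear message if the script layout changes.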

Anyway, using your workflow as an outline, I was eventually able to properly set the headers and iterate through some GET calls to a conversation endpoint. We can randomly generate the x-csrf-token and corresponding ct0 value for the header's cookie field, and everything else in the headers is static or derived from either the requests state or the request itself.
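
For illustration, generating a matching x-csrf-token header and ct0 cookie might look like the sketch below. The helper name `make_csrf_pair` and the 32-hex-character length are assumptions; the only point taken from the description above is that the same random value is sent in both places.

```python
import secrets


def make_csrf_pair(n_bytes=16):
    """Return (headers, cookies) carrying the same random token.

    n_bytes=16 gives 32 hex characters; the expected length is an assumption.
    """
    token = secrets.token_hex(n_bytes)
    headers = {"x-csrf-token": token}
    cookies = {"ct0": token}
    return headers, cookies


csrf_headers, csrf_cookies = make_csrf_pair()
print(csrf_headers["x-csrf-token"] == csrf_cookies["ct0"])
```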

I don't mind working through some of the parser details; I think the general data delivered by the API is quite similar, it just needs to be parsed in JSON rather than through lxml.html, which should make things much simpler in the end. With some fiddling I was also able to login with 2FA enabled, so I can likely address #26 too.

Careful attention should be paid to rate limiting, for sure... the rate limit works out to 1 request per second, but as you note, the worst thing that happens from the user's perspective is that they're locked out of their DMs for 15 minutes, which isn't so bad. I definitely think this is worth pursuing, and I'm hopeful that I'll have a branch to share with you late this week or next weekend.

@Mincka
Owner

Mincka commented Aug 17, 2020

I don't think we can go through the standard OAuth2 flow for DMs. It's still best to simulate a user in the browser. The downside is having to enter a login and password. A slightly better solution would be to manually extract and enter the auth_token, but it would be too complicated for end users.

I checked the x-csrf-token and it's session-based so there is nothing to do if using sessions from requests, it will be transparent.

And sure, dropping lxml can only be good news. I think it was a poor decision for a multi-platform tool. Using the API may also prevent random parsing breaks due to HTML updates.

My "calculation" was completely wrong also. You're right, that's just 20 messages per second in the end.
I didn't benchmark DMArchiver but I am quite certain that the rate was higher, at least 2 requests (20 messages) per second, for generated HTML...

To prevent lockout I think we need to use something like this with a default rate slightly below the maximum one, like 850 per 15 minutes. It should do the trick for the majority of the users. Anyway, a warning will be required.
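
One way to sketch a client-side limit of 850 calls per 15 minutes is to space calls evenly (900 s / 850, roughly 1.06 s apart). The class name `Throttle` and its injectable `clock` and `sleep` parameters are assumptions made for this example; it is not the limiter used in the tool.

```python
import time


class Throttle:
    """Space calls so that at most max_calls happen per period seconds."""

    def __init__(self, max_calls=850, period=900.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Usage: call throttle.wait() immediately before every API request.
throttle = Throttle()
print(round(throttle.min_interval, 3))
```

Even spacing is simple but gives up burst capacity, and it cannot know about requests made from a browser on the same account, so a reaction to 429 responses is still needed.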

@Prisoner416

Well. Shoot. Accidentally locked myself out running the tool, got the account back only to have the tool broken. Thanks Twitter hackers. Realistically, how long are we looking at for a working rewrite?

cajuncooks added a commit to cajuncooks/DMArchiver that referenced this issue Aug 31, 2020
With Mincka#83, we need a new approach to allow this program to function.
This is the first attempt. Some features have been broken or removed,
and likely cannot be added back. Cards, embedded tweets, etc. have been
dropped, and it's possible that stickers are broken too (do they even
exist anymore?).

I can't promise that this works robustly, but it was tested in Python
3.7 with a saved session; I'm also not sure authentication isn't
totally broken as I tried to implement a 2FA fix, too, but locked
my account out from too many login attempts before I managed to get it
working.
@cajuncooks
Contributor Author

Sorry for the delay here, real life gets in the way sometimes. As you'll read in the commit message above, I have a very rough draft of the new interpreter, which hasn't really been extensively tested, but it does work flawlessly on a <24h old group DM with a few hundred messages. I also tried to implement the 2FA process that I've been using, but my account has been locked for several hours now... hoping to be able to try again tomorrow.

Some things that are gone:

  • Cards (for links)
  • Embedded tweet cards
  • Alt Text

The presentation of these, in the form the old interpreter parsed, was rendered by Twitter on-site, so they are fundamentally missing unless we manage to decode that API through their .js calls, too; I've replaced them with normal longform links in the text output. Listing conversations/grabbing from all conversations is probably still broken? Might be able to figure that one out, now. It's much, much easier to identify the conversation ID on the modern Twitter web interface, though.

Some other various comments... the twitter_handle option can actually be trivially re-implemented (change 'screen_name' to 'name' dependent on the flag), but I have dropped it in favor of defaulting to the handle. I don't think that the latest tweet ID check works at all, will need to look into that. In the 'entities' dict, hashtags and user handle mentions are also highlighted in addition to URLs, but I don't see the advantage of grabbing anything out of those, unless we wanted to embed links in the output (don't think this is a great idea, generally).

I don't know yet how well the rate-limiter works, will need to scrape a larger chat box that would take several hours to grab to see how well it behaves; right now it's set to the max (900 calls in 900 seconds). This would also test if all of the tweet_types have been accounted for, and (for an old enough chat box) if stickers still work.

Open to feedback on whatever; this is definitely not a finished product.

@cajuncooks
Contributor Author

Turns out that tweets are an embedded type, though only when the tweeter hasn't blocked you, interesting! Also figured out what I was missing on the latest ID thing. My branch should have more complete functionality now, as far as I can tell. Still needs more testing and cleanup, likely.

@cajuncooks
Contributor Author

One of the issues I'm encountering is that the image download links (https://ton.twitter.com/1.1/data/dm/[etc]) seem to require the API header call, and the limits on that are both not coming through on the response header and also appear to be much, much lower than the DM endpoint. Right now I'm just wrapping it in a while response.status_code == 429: sleep(60) and try again kind of block, but it would be good to find a better solution for this. The video/gif links are anonymously accessible, but the image links are not, for whatever reason.

@NuLL3rr0r

I have been facing the same issue since a few months ago. Is there any working version or a work in progress that could be checked out?

@cajuncooks
Contributor Author

Not sure what kind of testing @Mincka would want to do, but I've been running my forked branch (json_overhaul) within an existing larger framework (processing of the txt output into db + web front end) successfully for over a month now. I don't believe the hack I described for the image API rate limiting is in, but I can include it tomorrow. Think it needs some care before it would see an official release, though.

@NuLL3rr0r

NuLL3rr0r commented Oct 9, 2020

@cajuncooks thank you. Is it publicly available on GitHub?

@cajuncooks
Contributor Author

git clone https://github.com/cajuncooks/DMArchiver --branch json_overhaul, finally updated it with the silent hack around the image API.

Mincka pushed a commit that referenced this issue Oct 31, 2020
With #83, we need a new approach to allow this program to function.
This is the first attempt. Some features have been broken or removed,
and likely cannot be added back. Cards, embedded tweets, etc. have been
dropped, and it's possible that stickers are broken too (do they even
exist anymore?).

I can't promise that this works robustly, but it was tested in Python
3.7 with a saved session; I'm also not sure authentication isn't
totally broken as I tried to implement a 2FA fix, too, but locked
my account out from too many login attempts before I managed to get it
working.
@Mincka
Owner

Mincka commented Oct 31, 2020

Thanks for the overhaul @cajuncooks. 👍
I merged your branch as a new base for the tool.

Can you confirm that you did not implement the retrieval of all conversations?
It looks like I need to specify a conversation ID, otherwise it crashes with:

Conversation ID not specified. Retrieving all the threads.
Expecting value: line 1 column 1 (char 0)

With a conversation id specified, it worked for about 20000 processed tweets and then stopped without saving with:

An error occured during the parsing of the tweets.

Twitter error details below:
Code 88: Rate limit exceeded

Stopping execution due to parsing error while retrieving the tweets.

I need to check how you handle the rate limiting. I think that's because the API_LIMIT is the best case scenario at 900 but if you browsed a bit on Twitter before or at the same time, the counter is lower than 900 and it encounters the error before the local throttling. You implemented the silent wait on 429 for images only from what I see. The logic must be the same for all the API calls. Maybe it would be simpler to drop the local rate limiting and only use the 429 response code to wait when necessary.
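
A sketch of what "use the 429 response to wait" could look like for every call. Everything here is an assumption for illustration: `send` is a hypothetical callable returning `(status_code, headers, body)`, the wait is derived from x-rate-limit-reset when that header is present, and a fixed 60-second fallback is used when it is not (as reported for the image endpoint).

```python
import time


def call_with_retry(send, max_attempts=5, fallback_delay=60.0,
                    now=time.time, sleep=time.sleep):
    """Call send() until it stops returning HTTP 429, waiting in between."""
    for _ in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if "x-rate-limit-reset" in headers:
            # Header is assumed to hold the Unix time the window resets.
            delay = max(1.0, int(headers["x-rate-limit-reset"]) - now())
        else:
            delay = fallback_delay
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

With this in place a single code path covers both the DM endpoint and the image downloads, and the local counter becomes optional.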

Finally, I have two types of parsing error that I need to investigate:

The first looks related to handling of stickers.

Unexpected error ''sticker'' for tweet '814111061578313732', raw JSON will be used for the tweet.
Traceback (most recent call last):
  File "core.py", line 591, in _process_tweets
KeyError: 'sticker'

This second one looks related to a parsing error when the name of the conversation is updated.

Unexpected error '' for tweet '813823287868424192', raw JSON will be used for the tweet.
Traceback (most recent call last):
  File "core.py", line 595, in _process_tweets

Still, it's great work that brings new hope for DMArchiver! 🎉

@khawaisnaeem

Do something guys you are champ!!

@Joshua7896

Did DMArchiver die?

@jeffhuang

First of all, thanks @Mincka for all your work on this, plus other contributors. We used dmarchiver for a while to export DMs for analysis for a research project. The other option for us is exporting the messages via Twitter's export tool, but that can take a few days to get the email.

Now unfortunately it looks like Twitter has disabled crawling/scraping by requiring javascript to do anything, even to get the authenticity_token. I'm not sure if there's a way to get around that without some major rework, so even the recent update by @cajuncooks is broken now. I'm going to look into other options, but it seems like exporting messages older than 30 days (older than the API allows) might be tricky. I wonder if anyone has tried using a headless chrome browser for something like this.

@Mincka
Owner

Mincka commented Feb 1, 2021

Hi @jeffhuang, did you try to change the user agent?
https://twitter.com/magusnn/status/1339830611343679490?s=20

Maybe it could help in this case. Thank you for the heads up anyway. Indeed, it looks like a headless browser is the next hack.

@jeffhuang

That's an interesting finding @Mincka and thank you for the suggestion. I'll look into it, but have to be cautious since our project is for federally-funded research, so we might not be so comfortable with mimicking the googlebot user agent. But if I try it, I'll post an update here.

@scramblr

That's an interesting finding @Mincka and thank you for the suggestion. I'll look into it, but have to be cautious since our project is for federally-funded research, so we might not be so comfortable with mimicking the googlebot user agent. But if I try it, I'll post an update here.

Jeff could you possibly at least do a Proof of Concept on this and let other users decide if this falls within proper use of the boundaries of their programs? I mean no disrespect and fully understand where you're coming from, but this has some uses for reporters and others in very specific use-cases that supersede the stigma attached to "Spoofing" Googlebot, which isn't illegal or even unethical in my opinion.

Cheers, and thank you for your time.

@Bebetternow22

Has anyone found a way to fix this? I am not a coder, but I am trying to learn. I need to pull my own deleted DM's and I think this would really help if it still works. I will need help with this though - anyone willing to help a lady out?
Thanks!

@NuLL3rr0r

I don't think so. This used to work on the old Twitter front end. It has changed a lot since then, so one would have to write a totally new scraper. It definitely cannot archive deleted DMs, though. For that, I guess you could download your Twitter Archive and probably parse XML stuff from what I remember.

@Bebetternow22

Bebetternow22 commented Feb 28, 2023 via email
