v2 Output Format Options #112

Merged on Apr 13, 2021 (16 commits)

@igorbrigadir commented Feb 20, 2021

Following on from #102 (comment), I added some options to output tweets in an "atomic" format, with all expansions filled in inline in the tweet results.

I assumed the "constructed" format would be the one that's currently implemented, and the newly proposed "atomic" one has all includes inline. The "response" format I took to mean the raw response from the API.
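To make the three modes concrete, here is a minimal sketch of how one page of results might be written in each format. It is not the code from this pull request: `format_page`, the sample `page`, and the `author` key are hypothetical names chosen for illustration, and only the author expansion is shown.

```python
import json


def format_page(response, output_format="r"):
    """Yield the objects to write for one page of API results (illustrative only)."""
    includes = response.get("includes", {})
    if output_format == "r":
        # response: the raw API response, untouched
        yield response
    elif output_format == "c":
        # constructed: tweets first, then includes and meta as separate objects
        yield from response.get("data", [])
        if includes:
            yield {"includes": includes}
        if "meta" in response:
            yield {"meta": response["meta"]}
    elif output_format == "a":
        # atomic: each tweet carries its own expansions inline
        users = {user["id"]: user for user in includes.get("users", [])}
        for tweet in response.get("data", []):
            expanded = dict(tweet)  # shallow copy; original keys are kept
            author = users.get(tweet.get("author_id"))
            if author is not None:
                expanded["author"] = author  # hypothetical key name
            yield expanded
    else:
        raise ValueError(f"unknown output format: {output_format!r}")


# Made-up sample page, shaped like a v2 search response.
page = {
    "data": [{"id": "1", "author_id": "10", "text": "hello"}],
    "includes": {"users": [{"id": "10", "username": "example"}]},
    "meta": {"result_count": 1},
}
for fmt in ("r", "c", "a"):
    for obj in format_page(page, fmt):
        print(fmt, json.dumps(obj))
```

In this sketch the atomic tweet keeps `author_id` and gains an extra key, so tools that expect the non-atomic tweet shape would still find the fields they know.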

Let me know what you think.

@igorbrigadir
Author

I wasn't really sure if that's the way --atomic or --output-options should work, so I guessed.

@igorbrigadir
Author

The geo expansion was wrong: it's not a list. Fixed.

@igorbrigadir
Author

There's a more compact alternative implementation here, which I really like: https://github.com/geduldig/TwitterAPI/blob/master/TwitterAPI/TwitterAPI.py#L414. It replaces the JSON structure, whereas this PR appends extra data; I felt appending was important to keep the "atomic" output compatible with any other tools that expect the non-atomic format. (It does not expand user mentions, because those are referenced by username rather than by id, but those can be added in the same way.)

@igorbrigadir
Author

On second thought: tweet["attachments"]["polls"] is an array because poll_ids is an array in the JSON, and also in the spec (https://api.twitter.com/2/openapi.json) as far as I know. But this makes no sense, because only one poll can be attached to a tweet, not multiple ones. Maybe that should change here, contrary to the API response?
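The two options being weighed here can be sketched as follows. This is an illustration under assumptions, not the pull request's code: `expand_polls`, the `single` flag, and the singular `poll` key are hypothetical, while `attachments`, `poll_ids`, and `polls` are the names used in the comment above.

```python
import copy


def expand_polls(tweet, includes, single=False):
    """Return a copy of the tweet with any matching polls attached inline."""
    polls_by_id = {poll["id"]: poll for poll in includes.get("polls", [])}
    poll_ids = tweet.get("attachments", {}).get("poll_ids", [])
    found = [polls_by_id[pid] for pid in poll_ids if pid in polls_by_id]

    expanded = copy.deepcopy(tweet)  # never mutate the caller's tweet
    if not found:
        return expanded  # nothing to expand on this page
    if single:
        # Hypothetical variant: one poll per tweet, stored as a single object.
        expanded["attachments"]["poll"] = found[0]
    else:
        # Mirrors the API shape: poll_ids is an array, so polls is one too.
        expanded["attachments"]["polls"] = found
    return expanded
```

Keeping the array matches the API response, while the single-object variant matches the observation that a tweet has at most one poll; either way `poll_ids` is left in place.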

@jimmoffitt
Collaborator

OK, I need to take a look soon. Thanks for all the information.

@igorbrigadir
Author

I think I will change that polls thing after all, when I get back home later today.

@igorbrigadir
Author

Might have another try at the replacements part of it, I think I have a better way of doing that too, just going to double check that it works and doesn't blow up 😅

@igorbrigadir
Author

I cleaned up the code a bit more, and I think it's now working the way I imagined it: tweets are expanded with whatever extra info is available in that individual call.

One caveat that maybe should be documented somewhere: some objects, like geo places or pinned tweets, can be left unexpanded because the place id or tweet wasn't inside includes on the same "page". I thought keeping all the expansions in memory across calls would not be a good idea.
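The caveat above can be illustrated with a per-page lookup for places. This is a sketch under assumptions rather than the merged implementation: `expand_place` is a hypothetical helper, and merging the place fields into `geo` is one possible choice among several.

```python
def expand_place(tweet, includes):
    """Inline the place for a tweet, using only this page's includes."""
    places_by_id = {place["id"]: place for place in includes.get("places", [])}
    geo = tweet.get("geo", {})  # geo is a single object, not a list
    place = places_by_id.get(geo.get("place_id"))
    if place is None:
        # The place was not in includes on this page: leave the tweet as-is.
        return tweet
    # Merge the place fields into geo while keeping the original place_id.
    return {**tweet, "geo": {**geo, **place}}
```

Because the lookup table is rebuilt from each page's `includes`, nothing is held in memory across calls, and a tweet whose place arrived on a different page simply keeps its bare `place_id`.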

@igorbrigadir
Author

More thinking out loud: another possibility is specifying multiple output formats or, alternatively, including the raw request as an extra line of output in the "atomic" variant. Just as the current version writes out the meta and includes objects separately, the "atomic" one proposed here should maybe write out the raw request as a separate line too. Sure, this means writing nearly double the data, but I think it's worth preserving because errors are not processed, and I nearly always end up needing the original API requests for something: changing how data is parsed, resuming crawls after an error, or something like that.

@jimmoffitt
Collaborator

Hi Igor,

Pondering these options: 'a' - atomic, 'r' - response, 'c' - constructed.

I like 'atomic' and 'response', but 'constructed' is not resonating with me ;) Trying to think of a more descriptive name for the mode where the client writes out the "data", "includes" and "meta" arrays in series.

Can you provide more context for your comments on the "errors"? Are these embedded errors about supporting objects that could not be provided?

I am not convinced about providing for simultaneous output modes... If the forcing function for that is some loss of "errors" metadata, let's update the code to handle those, rather than support simultaneous modes...?

@igorbrigadir
Author

> but 'constructed' is not resonating with me ;)

Yeah - I just made that one the current default to be backwards compatible, so it wouldn't break anything that currently relies on this format. Maybe "decomposed" is a better term for it?

I actually think the "response" format should be the default: one JSON line per original API response makes a lot of sense as the default, with "atomic" (one line per tweet) as an option.

> Are these embedded errors about supporting objects that could not be provided?

Yes - these are the errors included in includes when you try to retrieve a suspended account or something. I just haven't managed to make calls that reliably return those, so I haven't tested them properly.

As for multiple output formats, it's more of an idea. I don't have strong enough opinions or requirements to warrant working on it more; the raw "response" format should be the one to use in that case.

@igorbrigadir
Author

With that last docs change I don't think I've anything else to add, unless there are more suggestions.

Separately, I've been working on twarc, and I'm going to add things to twarc-csv to support the different formats, to make things as interoperable as possible, since it's a very common use case.

@jimmoffitt
Collaborator

Digging back into the code this week. In the process of testing the three output options. The default 'r' option looks good so far; I like how it just echoes back the exact response.

Moving on to 'a' testing... finding that the command-line setting is not being passed through, so looking into that.

Also, it seems that the currently separate '--atomic' option could/should be dropped in favor of a single "--output-option" option (will rename it to be singular).

Also adding in a one-second pause when hitting the "all" endpoint.
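A pause like the one mentioned above might look like the following sketch. It assumes the "all" endpoint means a URL ending in `/search/all`; `pause_if_full_archive` and its parameters are hypothetical names, not part of the library.

```python
import time


def pause_if_full_archive(endpoint, delay=1.0, sleep=time.sleep):
    """Sleep before a request when the endpoint is the "all" search endpoint.

    Returns True if a pause was taken. The sleep function is injectable so
    the behaviour can be checked without actually waiting.
    """
    # Assumption: the "all" endpoint is identified by its URL suffix.
    if endpoint.rstrip("/").endswith("/search/all"):
        sleep(delay)
        return True
    return False
```

Calling this before each paged request would space requests to the "all" endpoint at least one second apart while leaving other endpoints unaffected.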

@jimmoffitt left a comment
Collaborator

After merging this pull request, will be testing more and making updates.

@jimmoffitt left a comment
Collaborator

Will take another pass while testing and update.

@jimmoffitt merged commit 5d58536 into xdevplatform:v2 on Apr 13, 2021

@igorbrigadir
Author

Great! I fixed that command line issue in #121

@jimmoffitt
Collaborator

@igorbrigadir So, in my testing (inside my IDE) I don't see the atomic Tweets until all are "ready." So it is not yielding/emitting the Tweets one by one, but rather there is nothing emitted until the entire 'plug' of Tweets is ready... Different behavior from a few iterations ago. Maybe it is just me. I need to test on the command line next.
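The difference described above, tweets arriving one by one versus all at once, is the difference between a generator and a function that builds a full list first. The sketch below is generic and does not reproduce the library's code; `expand_buffered`, `expand_streaming`, and `expand` are hypothetical names.

```python
def expand_buffered(tweets, expand):
    """Expand every tweet first; the caller sees nothing until all are ready."""
    return [expand(tweet) for tweet in tweets]


def expand_streaming(tweets, expand):
    """Expand lazily; each tweet is handed to the caller as soon as it is ready."""
    for tweet in tweets:
        yield expand(tweet)


def trace(producer):
    """Record the order in which tweets are expanded and emitted."""
    events = []

    def expand(tweet):
        events.append(("expand", tweet))
        return tweet

    for tweet in producer([1, 2], expand):
        events.append(("emit", tweet))
    return events


print(trace(expand_buffered))   # both expands happen before any emit
print(trace(expand_streaming))  # expand and emit alternate per tweet
```

If output appears only after a whole page is processed, one thing to check is whether some step between the API call and the writer collects results into a list before passing them on.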

3 participants