v2 Output Format Options #112

igorbrigadir · 2021-02-20T00:02:02Z

Following on from #102 (comment), I added some options to output tweets in an "atomic" format, with all expansions filled in inline in the tweet results.

I assumed the "constructed" format would be the one that's currently implemented, and the newly proposed "atomic" one has all includes inline. The "response" format I took to mean the raw response from the API.

Let me know what you think.

(This commit is just a shortcut to fork the v2 branch, as a reminder to myself.)

igorbrigadir · 2021-02-20T01:16:27Z

I wasn't really sure if that's the way --atomic or --output-options should work, so I guessed.

igorbrigadir · 2021-02-21T21:16:29Z

The geo expansion was wrong: it's not a list. Fixed.

igorbrigadir · 2021-02-21T21:25:20Z

There's a more compact, alternative implementation here which I really like: https://github.com/geduldig/TwitterAPI/blob/master/TwitterAPI/TwitterAPI.py#L414

It replaces the JSON structure as opposed to appending extra data. I felt appending was important to keep the "atomic" output compatible with any other tools that expect the non-atomic format. (It does not expand user mentions, because those are referenced by username and not by id, but those can be added in the same way.)

igorbrigadir · 2021-02-23T21:44:27Z

On second thought: tweet["attachments"]["polls"] is an array because poll_ids is an array in the JSON, and also in the spec https://api.twitter.com/2/openapi.json as far as I know. But this makes no sense, because there can only be one poll attached to a tweet, not multiple ones. Maybe that should change here, contrary to the API response?
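For illustration, here is a minimal sketch of the "one poll object instead of a list" idea. It is not the code from this PR: the helper name `expand_poll` and the output key `poll` are hypothetical, and it assumes the usual v2 layout where `attachments.poll_ids` references objects in `includes.polls`.

```python
import copy


def expand_poll(tweet, includes):
    """Return a copy of `tweet` with its poll attached as a single object.

    Hypothetical helper: the "poll" key name is an assumption, not the
    name used in the PR.
    """
    polls_by_id = {poll["id"]: poll for poll in includes.get("polls", [])}
    expanded = copy.deepcopy(tweet)  # leave the original tweet untouched
    poll_ids = expanded.get("attachments", {}).get("poll_ids", [])
    # A tweet can carry at most one poll, so only the first id is used.
    if poll_ids and poll_ids[0] in polls_by_id:
        expanded["attachments"]["poll"] = polls_by_id[poll_ids[0]]
    return expanded
```

The original `poll_ids` array is kept alongside the added object, so tools that only read the API's own fields are not affected.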

jimmoffitt · 2021-02-27T00:11:11Z

OK, I need to take a look soon. Thanks for all the information.

igorbrigadir · 2021-02-27T10:36:51Z

I think I will change that polls thing after all, when I get back home later today.

igorbrigadir · 2021-02-27T23:41:44Z

Might have another try at the replacements part of it, I think I have a better way of doing that too, just going to double check that it works and doesn't blow up 😅

igorbrigadir · 2021-02-28T22:11:23Z

I cleaned up the code a bit more, and I think it's working now the way I imagined it: tweets are expanded with whatever extra info is available in that individual call.

One caveat that maybe should be documented somewhere: some objects like geo places or pinned tweets can be left unexpanded because the place id or tweet wasn't inside includes on the same "page". I thought keeping all the expansions in memory across calls would not be a good idea.
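The caveat can be shown with a small sketch of per-page expansion. This is an assumption-laden illustration, not the PR's implementation: `expand_places` is a hypothetical name, and merging the place object into the tweet's `geo` field is one possible way to do the expansion.

```python
def expand_places(page):
    """Yield the tweets of one response page with geo places filled in.

    Hypothetical helper. Only the includes of this same page are
    consulted, so nothing is kept in memory across calls.
    """
    places = {p["id"]: p for p in page.get("includes", {}).get("places", [])}
    for tweet in page.get("data", []):
        expanded = dict(tweet)  # shallow copy; the page itself is not modified
        place_id = tweet.get("geo", {}).get("place_id")
        if place_id in places:
            # Merge the place object into the existing geo field.
            expanded["geo"] = {**tweet["geo"], **places[place_id]}
        # Otherwise the tweet is emitted as-is, with only the place_id.
        yield expanded
```

A tweet whose place was only included on an earlier page comes out with just its `place_id`, which is the unexpanded case described above.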

igorbrigadir · 2021-03-01T11:15:15Z

More thinking out loud: another thing I was thinking about was the possibility of specifying multiple output formats, or alternatively, including the raw request as an extra line of output in the "atomic" variant. Just like the current version writes out the meta and includes objects separately, the "atomic" one proposed here should maybe write out the raw request as a separate line too. Sure, this means that it's writing nearly double the data, but I think it's worth preserving because errors are not processed, and I nearly always seem to end up needing the original API requests for something: either changing how data is parsed, or resuming crawls after an error, or something like that.

jimmoffitt · 2021-03-02T04:32:54Z

Hi Igor,

Pondering these options: 'a' - atomic, 'r' - response, 'c' - constructed.

I like 'atomic' and 'response', but 'constructed' is not resonating with me ;) Trying to think of a more descriptive name for the mode where the client writes out the "data", "includes" and "meta" arrays in series.
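To make the three modes concrete, here is a rough sketch of how one response page might be turned into output objects under each option. It is illustrative only: `format_page` and the `author` key are hypothetical, the letters follow the 'a' / 'r' / 'c' options quoted above, and the atomic branch expands only the author to keep the example short.

```python
def format_page(response, output_format="r"):
    """Yield the JSON-serializable objects to write for one API response."""
    if output_format == "r":
        # response: echo the API payload unchanged, one object per call
        yield response
    elif output_format == "c":
        # constructed: the data, includes and meta parts written in series
        for key in ("data", "includes", "meta"):
            if key in response:
                yield {key: response[key]}
    elif output_format == "a":
        # atomic: one object per tweet, with expansions filled in inline
        includes = response.get("includes", {})
        users = {u["id"]: u for u in includes.get("users", [])}
        for tweet in response.get("data", []):
            expanded = dict(tweet)  # shallow copy so the page is not mutated
            author = users.get(tweet.get("author_id"))
            if author is not None:
                expanded["author"] = author  # hypothetical key name
            yield expanded
    else:
        raise ValueError(f"unknown output format: {output_format!r}")
```

Each yielded object would then be written as one JSON line, for example with `json.dumps`.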

Can you provide more context for your comments on the "errors"? Are these embedded errors about supporting objects that could not be provided?

I am not convinced about providing for simultaneous output modes... If the forcing function for that is some loss of "errors" metadata, let's update the code to handle those, rather than support simultaneous modes...?

igorbrigadir · 2021-03-02T10:56:38Z

> but 'constructed' is not resonating with me ;)

Yeah, I just made that one the current default to be backwards compatible I guess, so it wouldn't break anything that's currently relying on this format. Maybe "decomposed" is a better term for it?

I actually think the "response" format should be the default: 1 JSON line per original API response makes a lot of sense to have as the default, with "atomic", which outputs each tweet, as an option.

> Are these embedded errors about supporting objects that could not be provided?

Yes, these are the errors included in includes when you try to retrieve a suspended account or something. I just haven't managed to make some calls that would reliably return those, so I haven't tested them out properly.

As for multiple output formats, it's more of an idea. I don't have strong enough opinions or requirements to warrant working on it more; the raw "response" format should be the one to use in that case.

igorbrigadir · 2021-04-05T16:34:59Z

With that last docs change I don't think I've anything else to add, unless there are more suggestions.

Separately, I've been working on twarc, and I'm gonna add things to twarc-csv to support the different formats, to make things as interoperable as possible, since it's a very common use case.

jimmoffitt · 2021-04-13T19:51:14Z

Digging back into the code this week. In the process of testing the three output options. The default 'r' option looks good so far; I like how it just echoes back the exact response.

Moving on to 'a' testing... finding that the command-line setting is not being passed through, so looking into that.

Also, it seems that the currently separate '--atomic' option could/should be dropped in favor of a single "--output-option" (will rename it to be singular) option.

Also adding in a one-second pause when hitting the "all" endpoint.
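As a rough sketch of that pause (not the actual change): the helper below sleeps before a request only when the URL targets the full-archive "all" search endpoint. The function name, the URL suffix check and the injectable `sleep` parameter are assumptions made for illustration.

```python
import time


def pause_if_needed(endpoint, sleep=time.sleep, seconds=1.0):
    """Sleep before calling the full-archive ("all") search endpoint.

    Hypothetical helper; returns True if a pause was taken.
    """
    if endpoint.rstrip("/").endswith("/search/all"):
        sleep(seconds)
        return True
    return False
```

Passing `sleep` as a parameter keeps the helper testable without actually waiting.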

jimmoffitt

After merging this pull request, will be testing more and making updates.

jimmoffitt

Will take another pass while testing and update.

igorbrigadir · 2021-04-13T21:52:46Z

Great! I fixed that command line issue in #121

jimmoffitt · 2021-04-14T19:30:01Z

@igorbrigadir So, in my testing (inside my IDE) I don't see the atomic Tweets until all are "ready." So it is not yielding/emitting the Tweets one by one, but rather there is nothing emitted until the entire 'plug' of Tweets is ready... Different behavior from a few iterations ago. Maybe it is just me. I need to test on the command line next.
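The difference being described, emitting tweets one by one versus only after the whole batch is ready, can be sketched with two hypothetical helpers. This is not the project's result stream code, and whether this is the actual cause of the behavior seen in the IDE is not confirmed here; output buffering could look similar.

```python
def stream_atomic(pages):
    """Yield each tweet as soon as its page has been fetched (lazy)."""
    for page in pages:
        yield from page.get("data", [])


def batch_atomic(pages):
    """Collect every tweet first; nothing is available until all pages are read."""
    tweets = []
    for page in pages:
        tweets.extend(page.get("data", []))
    return tweets
```

With `stream_atomic`, the first tweet is available after a single page request, while `batch_atomic` has to consume every page before returning anything.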

igorbrigadir added 4 commits February 19, 2021 08:33

enable output options.

318eb26

this commit just as a shortcut to fork the v2 branch as a reminder to myself

add output formats to result stream

7d87d6f

simplify extracting expansions a bit

de965c7

add atomic and output format args options

7eb774a

fix geo expansion merge

4156565

make polls and geo 1 object, instead of a list

05c3d40

igorbrigadir added 5 commits February 28, 2021 21:44

alternative way to expand results

4a039b0

whitespace

d18d2cd

Update result_stream.py

769da93

whitespace

7c2b9de

Merge remote-tracking branch 'upstream/v2' into output-formats

3ac427d

igorbrigadir added 5 commits March 27, 2021 04:44

fix bug if no expansions are returned

35df66a

set default output to original API responses

cdf7d1e

rename output format options

53d78e1

rename output format options

465ff2e

add output formats docs

424a430

jiemakel reviewed Apr 13, 2021


jimmoffitt approved these changes Apr 13, 2021

jimmoffitt merged commit 5d58536 into xdevplatform:v2 Apr 13, 2021
