
Method: Wrangle Practical Walk through (flatten_dict)

This page is meant to provide a practical example of how you might use the flatten_dict function when cleaning Twitter data.
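
Before diving in, here is a minimal sketch of what flatten_dict does to a small, made-up nested dictionary. The dotted keys and the untouched list value below are based on the real output shown further down this page.

from osometweet.wrangle import flatten_dict

# A toy nested dictionary (hypothetical, for illustration only)
nested = {
    "id": "123",
    "public_metrics": {"like_count": 5, "reply_count": 1},
    "context_annotations": [{"domain": {"id": "46"}}]
}

flatten_dict(nested)
# Returns
{'id': '123',
 'public_metrics.like_count': 5,
 'public_metrics.reply_count': 1,
 'context_annotations': [{'domain': {'id': '46'}}]}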

If we have some tweets that we would like information on, we can use the tweet_lookup method to gather the data, including the everything=True flag to return all data fields and expansions. Once we get this data back, we'll walk through one approach to cleaning it.

import osometweet
from osometweet.wrangle import flatten_dict

# Initialize the OsomeTweet object
bearer_token = "TWITTER_BEARER_TOKEN"
oauth2 = osometweet.OAuth2(bearer_token=bearer_token)
ot = osometweet.OsomeTweet(oauth2)

tweet_ids = ['1323314485705297926', '1328838299419627525']

# Fetch the tweets information
response = ot.tweet_lookup(tweet_ids, everything=True)

Now we have a response object that includes the following keys...

response.keys()
# prints
dict_keys(['data', 'includes'])

and we can see that the data object includes two tweets, as expected...

response["data"]
# prints
[{'conversation_id': '1323314485705297926',
  'lang': 'en',
  'context_annotations': [{'domain': {'id': '46',
     'name': 'Brand Category',
     'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
    'entity': {'id': '781974596752842752', 'name': 'Services'}},
   {'domain': {'id': '47',
     'name': 'Brand',
     'description': 'Brands and Companies'},
    'entity': {'id': '10045225402', 'name': 'Twitter'}},
   {'domain': {'id': '47',
     'name': 'Brand',
     'description': 'Brands and Companies'},
    'entity': {'id': '10045225402', 'name': 'Twitter'}}],
  'possibly_sensitive': False,
  'text': 'breathe',
  'id': '1323314485705297926',
  'public_metrics': {'retweet_count': 45980,
   'reply_count': 16504,
   'like_count': 231648,
   'quote_count': 19712},
  'author_id': '783214',
  'source': 'Sprinklr',
  'reply_settings': 'everyone',
  'created_at': '2020-11-02T17:22:14.000Z'},
 {'conversation_id': '1328838299419627525',
  'lang': 'en',
  'context_annotations': [{'domain': {'id': '46',
     'name': 'Brand Category',
     'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
    'entity': {'id': '781974596752842752', 'name': 'Services'}},
   {'domain': {'id': '47',
     'name': 'Brand',
     'description': 'Brands and Companies'},
    'entity': {'id': '10045225402', 'name': 'Twitter'}},
   {'domain': {'id': '47',
     'name': 'Brand',
     'description': 'Brands and Companies'},
    'entity': {'id': '10045225402', 'name': 'Twitter'}}],
  'possibly_sensitive': False,
  'text': 'some of you hating...\n\nbut we see you Fleeting 🧐',
  'id': '1328838299419627525',
  'public_metrics': {'retweet_count': 33818,
   'reply_count': 18909,
   'like_count': 274062,
   'quote_count': 39285},
  'author_id': '783214',
  'source': 'Sprinklr',
  'reply_settings': 'everyone',
  'created_at': '2020-11-17T23:11:53.000Z'}]

Also, we can see that the users expansion within the includes object contains only one user object, because both tweets were posted by the same account (see author_id above).

response["includes"]
# prints
[{'name': 'Twitter',
  'public_metrics': {'followers_count': 59529407,
   'following_count': 0,
   'tweet_count': 14721,
   'listed_count': 87311},
  'description': 'what’s happening?!',
  'profile_image_url': 'https://pbs.twimg.com/profile_images/1354479643882004483/Btnfm47p_normal.jpg',
  'id': '783214',
  'entities': {'url': {'urls': [{'start': 0,
      'end': 23,
      'url': 'https://t.co/DAtOo6uuHk',
      'expanded_url': 'https://about.twitter.com/',
      'display_url': 'about.twitter.com'}]}},
  'username': 'Twitter',
  'location': 'everywhere',
  'url': 'https://t.co/DAtOo6uuHk',
  'created_at': '2007-02-20T14:35:54.000Z',
  'protected': False,
  'verified': True}]
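
A quick sanity check (not part of the original response handling) confirms that the single returned user's id matches the author_id attached to both tweets shown above.

# Gather the author_id from every tweet and compare it with the id of
# the single user object returned in the includes/users expansion
author_ids = {tweet["author_id"] for tweet in response["data"]}
print(author_ids)
# prints
{'783214'}

print(response["includes"]["users"][0]["id"] in author_ids)
# prints
True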

As we can see, both objects are lists and, for simplicity, we'd probably like to update each tweet object with the author information from the separate includes object.

Below is a simple function for doing exactly this.

Note: this function is NOT included within the wrangle package.

def update_tweet_with_author_data(response):
    """
    Retrieve author information from the 'includes' object and
    insert it into a tweet object with a matching `author_id` (if present).

    This function iterates through all tweets in a response, takes the
    tweet author's info (found at the response['includes']['users'] dict
    path), and adds it to that tweet's data.
    """
    try:
        # Iterate through each returned tweet
        for tweet in response["data"]:

            # Grab the author id of that tweet
            author_id = tweet["author_id"]

            # Add the user object to the tweet object if the author_id matches
            _ = [tweet.update({"author_info": user }) for user in response["includes"]["users"] if user["id"] == author_id]

        return response

    except Exception as e:
        raise Exception("Problem retrieving information!") from e

Basically, the above function tries to match each user's id (from all response["includes"]["users"] objects) to the author_id in each tweet; if they match, it inserts that user's information into the tweet.

So, if we now use this function with our response...

clean_response = update_tweet_with_author_data(response)

... clean_response's tweets will all include a new key-value pair called author_info.

# print first tweet
clean_response["data"][0]

# Returns
{'text': 'breathe',
 'public_metrics': {'retweet_count': 45981,
  'reply_count': 16504,
  'like_count': 231649,
  'quote_count': 19712},
 'id': '1323314485705297926',
 'context_annotations': [{'domain': {'id': '46',
    'name': 'Brand Category',
    'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
   'entity': {'id': '781974596752842752', 'name': 'Services'}},
  {'domain': {'id': '47',
    'name': 'Brand',
    'description': 'Brands and Companies'},
   'entity': {'id': '10045225402', 'name': 'Twitter'}},
  {'domain': {'id': '47',
    'name': 'Brand',
    'description': 'Brands and Companies'},
   'entity': {'id': '10045225402', 'name': 'Twitter'}}],
 'possibly_sensitive': False,
 'lang': 'en',
 'author_id': '783214',
 'created_at': '2020-11-02T17:22:14.000Z',
 'reply_settings': 'everyone',
 'conversation_id': '1323314485705297926',
 'source': 'Sprinklr',
 'author_info': {'name': 'Twitter',    ###< ~~~~~~ Here we see the newly inserted author_info object
  'url': 'https://t.co/DAtOo6uuHk',
  'profile_image_url': 'https://pbs.twimg.com/profile_images/1354479643882004483/Btnfm47p_normal.jpg',
  'entities': {'url': {'urls': [{'start': 0,
      'end': 23,
      'url': 'https://t.co/DAtOo6uuHk',
      'expanded_url': 'https://about.twitter.com/',
      'display_url': 'about.twitter.com'}]}},
  'created_at': '2007-02-20T14:35:54.000Z',
  'location': 'everywhere',
  'verified': True,
  'username': 'Twitter',
  'description': 'what’s happening?!',
  'protected': False,
  'public_metrics': {'followers_count': 59529600,
   'following_count': 0,
   'tweet_count': 14721,
   'listed_count': 87315},
  'id': '783214'}}

The above effect is true for all tweets in the response object.
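
If you would like to double check this, a quick one-liner (just a sketch, not part of the walkthrough itself) should return True:

# Every tweet in the cleaned response should now carry the author_info key
all("author_info" in tweet for tweet in clean_response["data"])
# Returns
True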

Now we can use the flatten_dict function to flatten the dictionary and see what we get.

flatten_dict(clean_response["data"][0])

# Returns
{'text': 'breathe',
 'public_metrics.retweet_count': 45981,
 'public_metrics.reply_count': 16504,
 'public_metrics.like_count': 231649,
 'public_metrics.quote_count': 19712,
 'id': '1323314485705297926',
 'context_annotations': [{'domain': {'id': '46',
    'name': 'Brand Category',
    'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
   'entity': {'id': '781974596752842752', 'name': 'Services'}},
  {'domain': {'id': '47',
    'name': 'Brand',
    'description': 'Brands and Companies'},
   'entity': {'id': '10045225402', 'name': 'Twitter'}},
  {'domain': {'id': '47',
    'name': 'Brand',
    'description': 'Brands and Companies'},
   'entity': {'id': '10045225402', 'name': 'Twitter'}}],
 'possibly_sensitive': False,
 'lang': 'en',
 'author_id': '783214',
 'created_at': '2020-11-02T17:22:14.000Z',
 'reply_settings': 'everyone',
 'conversation_id': '1323314485705297926',
 'source': 'Sprinklr',
 'author_info.name': 'Twitter',
 'author_info.url': 'https://t.co/DAtOo6uuHk',
 'author_info.profile_image_url': 'https://pbs.twimg.com/profile_images/1354479643882004483/Btnfm47p_normal.jpg',
 'author_info.entities.url.urls': [{'start': 0,
   'end': 23,
   'url': 'https://t.co/DAtOo6uuHk',
   'expanded_url': 'https://about.twitter.com/',
   'display_url': 'about.twitter.com'}],
 'author_info.created_at': '2007-02-20T14:35:54.000Z',
 'author_info.location': 'everywhere',
 'author_info.verified': True,
 'author_info.username': 'Twitter',
 'author_info.description': 'what’s happening?!',
 'author_info.protected': False,
 'author_info.public_metrics.followers_count': 59529600,
 'author_info.public_metrics.following_count': 0,
 'author_info.public_metrics.tweet_count': 14721,
 'author_info.public_metrics.listed_count': 87315,
 'author_info.id': '783214'}

As we can see, everything is flattened except for the keys whose values are lists - author_info.entities.url.urls and context_annotations. This is because individual tweets/users can have multiple urls and/or context annotations. One way of dealing with this is to create separate tables for each of these data objects, with some sort of ID value tying them together - for example, the tweet_id.

To make this easier, below is a simple function for cleaning these lists of dictionaries (which flatten_dict does not process) and inserting an identifier, such as the tweet_id, into each object.

def clean_list_of_dicts(data_obj, identifier, key_string):
    """
    Clean a list of dictionaries, inserting a labeled identifier into the list.

    The output will be a list of FLATTENED dictionaries with
    the proper identifier inserted (identified with whatever key_string
    we provide), which can be easily converted into a dataframe.
    
    Parameters
    - data_obj : list of dictionaries to clean
    - identifier : some identifier that will tie this data object
        to another table
    - key_string : the string that will represent the key of the identifier
    """
    
    data_obj = [flatten_dict(temp) for temp in data_obj]
    _ = [obj.update({key_string:identifier}) for obj in data_obj]
    return data_obj
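
For example, calling this function on a single, made-up context annotation with the first tweet's ID would produce something like the following.

# A toy annotation list mirroring the structure shown earlier on this page
annotation = [{'domain': {'id': '47', 'name': 'Brand'},
               'entity': {'id': '10045225402', 'name': 'Twitter'}}]

clean_list_of_dicts(annotation, '1323314485705297926', 'tweet_id')
# Returns
[{'domain.id': '47',
  'domain.name': 'Brand',
  'entity.id': '10045225402',
  'entity.name': 'Twitter',
  'tweet_id': '1323314485705297926'}]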

Now, with these functions, we can easily clean up our tweets with the following script.

import pandas as pd

# Create lists, which will house the cleaned up data 
# and be converted into pandas dataframes...
all_tweets = list()
all_urls = list()
all_context_annotations = list()

# Iterate through each tweet object, which now
# includes the `author_info` key as previously processed
# by the update_tweet_with_author_data() function
for tweet in clean_response["data"]:
    
    # Flatten the tweet
    flat_tweet = flatten_dict(tweet)
    
    # Get the tweet ID and author ID
    tweet_id = flat_tweet["id"]
    author_id = flat_tweet["author_id"]
    
    # Remove the list objects, storing them as new objects
    # and returning `None` if not present
    context_annotations = flat_tweet.pop("context_annotations", None)
    author_urls = flat_tweet.pop("author_info.entities.url.urls", None)
    
    # Add our flat tweet to the list
    all_tweets.append(flat_tweet)
    
    # If context annotation is present, clean it and
    # then add each list object to the larger list via list.extend()
    if context_annotations is not None:
        context_annotations = clean_list_of_dicts(context_annotations, tweet_id, "tweet_id")
        all_context_annotations.extend(context_annotations)
    
    # Same as above
    if author_urls is not None:
        author_urls = clean_list_of_dicts(author_urls, author_id, "author_id")
        all_urls.extend(author_urls)

# Now each of these objects can easily be converted to 
# pandas DataFrame for analysis.
tweets_frame = pd.DataFrame(all_tweets)
urls_frame = pd.DataFrame(all_urls)
con_annotations_frame = pd.DataFrame(all_context_annotations)

Now, what we've created are the following three tables:

  1. tweets_frame: Each row is a tweet. It also includes the user info of that tweet's author.
  2. urls_frame: All urls within author_info.entities.url.urls, along with the author_id so that we know which author includes each url within their profile.
  3. con_annotations_frame: All context annotations, with a tweet_id column showing us which tweet they came from.
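
Because each table carries an identifier column, the tables can easily be joined back together. For example, here is a rough sketch (assuming pandas is imported as pd, as above) of attaching the context annotations back onto the tweets they came from.

# Match each annotation's tweet_id column with the tweets' id column
merged = tweets_frame.merge(
    con_annotations_frame,
    left_on="id",
    right_on="tweet_id",
    how="left"
)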