The DMRC Academic Twitter Archive (DATA) Collector uses the Twarc2 Python library to collect tweets from Twitter's archive via the Twitter API 2.0 for Academic Research. It then processes the collected .json files and pushes the data to a designated Google BigQuery database. Tweets can be collected from as far back as 22 March 2006.
This tool can also be used to append archive data to an existing TCAT or TweetQuery dataset, and can serve as a backfill tool to supplement previously-collected datasets.
You can now also upload a previously collected Twitter API 2.0 .json file (e.g. from Twitter's Tweet Downloader) to be processed and pushed to Google BigQuery*.
This tool is intended for researchers who wish to collect data from the Twitter archive (more than one week in the past), and have these data processed and pushed automatically to Google BigQuery. It was designed with DMRC researchers in mind, but may be useful to researchers from other institutions.
- Python 3.8 to 3.10 (if using a work or university PC, ensure you comply with the organisation's installation guidelines)
- A valid Twitter Academic API bearer token
- A valid Google service account and json key
- x GB free on your local drive for .json file storage (the following are estimates and may differ depending on the data collected; you can store these files elsewhere after collection; see Managing Disk Space, below):

| n Tweets | Size (GB) |
|---|---|
| 250,000 | ~ 1.25 |
| 500,000 | 2 - 2.5 |
| 1,000,000 | 4 - 5 |
| 5,000,000 | 16 - 18 |
| 10,000,000 | 30 - 35 |
1. Navigate to a location with enough space to download tweets (refer to the What You Will Need section, above) and create a new folder: `mkdir DMRC_DATA`.
2. Navigate into your new directory (`cd DMRC_DATA`) and create your virtual environment: `python -m venv DATA_env`.
3. Activate your virtual environment: `cd DATA_env/Scripts`, then `activate`. You should now see `(DATA_env)` before your path.
4. Navigate up two levels: `cd ..`, followed by `cd ..`. You should now be in the directory you created in step 1.
5. Clone this repository: `git clone https://github.com/qut-dmrc/DMRC_Academic_Twitter_Archive_Collector.git`. ALTERNATIVELY: click the green 'Code' button at the top left of the repository pane, select 'Download ZIP' and extract the contents to your new DMRC_DATA directory. Then navigate into the cloned directory: `cd DMRC_Academic_Twitter_Archive_Collector`.
6. Install the venv requirements: `python -m pip install -r requirements.txt`.
7. Navigate to the collector: `cd DATA_collector`.
1. Navigate to the `DATA_collector` directory (e.g. `C:/Users/You/Desktop/DMRC_Academic_Twitter_Archive_Collector/DATA_collector`).
2. Place your Google BigQuery service key .json file into the `DATA_collector/access_key` directory.
3. Open `DATA_collector/config/config_template.yml`.
    - Set your query parameters:
        - query: EITHER a string containing keyword(s) and/or phrase(s) up to 1024 characters each, e.g. `'cats OR kittens'`, OR a list of search strings (up to 1024 characters each), e.g. `['cats OR kittens', 'dogs OR puppies', 'birds chicks from:albomp']`.
        - start_date: the earliest date to search, in UTC time.
        - end_date: the latest date to search, in UTC time.
    - Enter your bearer token:
        - bearer_token: your Twitter Academic API bearer token.
    - Set your Google BigQuery project and dataset:
        - project_id: the name of the relevant Google BigQuery billing project. This must match the provided service account key.
        - dataset: the name of your intended dataset, e.g. `'twitter_pets_2022'`. IMPORTANT: if the dataset already exists, the data will be appended to it; if it does not exist, a new dataset will be created.
    - Choose your schema type (DATA, TCAT, TweetQuery). `DATA = True` by default. Refer to Output, below, for schema details. An illustrative sketch of a completed config file appears after these steps.
4. Rename `config_template.yml` to `config.yml`.
5. Run `python ./run.py`.
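As a rough sketch only, a completed `config.yml` might look something like the block below. The field names follow the steps above, but the values are hypothetical, and the schema-type flags and date format are assumptions; defer to whatever `config_template.yml` actually contains.

```yaml
# Hypothetical example values only - follow the field names and date format
# used in config_template.yml.
query: ['cats OR kittens', 'dogs OR puppies']   # a string or a list of strings
start_date: '2022-01-01T00:00:00Z'              # earliest date to search (UTC)
end_date: '2022-06-30T23:59:59Z'                # latest date to search (UTC)

bearer_token: 'YOUR_TWITTER_ACADEMIC_API_BEARER_TOKEN'

project_id: 'your_billing_project'              # must match your service account key
dataset: 'twitter_pets_2022'                    # appended to if it already exists

# Schema type: set exactly one to True (DATA is the default)
DATA: True
TCAT: False
TweetQuery: False
```

Whichever layout the template uses, the essential point is that query, start_date, end_date, bearer_token, project_id and dataset are all set before you rename the file to `config.yml` and run the collector.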
After you run `run.py`, you will be prompted to verify your query config details. If everything is correct, type `y`; otherwise, type `n` to exit and change your input.
There is a very good chance that (beneficial!) changes have been made to this repository. Remember to update your local copy before you use DATA by running `git pull origin main`!
When you run DATA, you will be asked to select one of two options:
If your query is a single string (e.g. `cats dogs`), DATA will automatically get the counts for your query and ask whether you would like to proceed with your search.
If your query is a list of strings, e.g. `['cats dogs', 'cats OR dogs', 'cats OR dogs OR birds']`, you will be asked whether you would like to check the volume of each query. If you select 'y', a .csv file containing the counts for each query will be written to your directory. If you select 'n', you will be asked whether you would like to commence your search without running the counts first (i.e. if you already have the counts).
Option 2 allows the user to process tweets collected using Twarc2, provided the data were collected from the archive endpoint. Additionally, a file collected using DATA (and located in the `my_collections/your_directory/collected_json` directory) can be moved into `DATA_collector/json_input_files` and reprocessed. Files from Twitter's Tweet Downloader can be processed from this directory in the same way, but this function is still in testing. Refer to the Uploading a .json file to be processed section, below.
Depending on the schema type selected, the tool will produce data as shown below:
| Schema Type | Purpose | n Tables | Table Names |
|---|---|---|---|
| DATA | Standalone archive data analysis, where it is not necessary to append archive data to existing tables. | 13 | annotations, author_description, author_urls, context_annotations, hashtags, cashtags, interactions, media, mentions, poll_options, tweets, urls, edit_history (for tweets later than August 2022) |
| TCAT | Backfill/append archive data to an existing TCAT table | 3 | hashtags, mentions, tweets |
| TweetQuery | Backfill/append archive data to an existing TweetQuery table | 1 | tweets_flat |
A detailed overview of the tables and fields is located here.
Your query string should follow Twitter's rules for How to Build a Query.
Queries may be up to 1024 characters long.
Queries are case insensitive.
| Operator | Logic | Example | What it does |
|---|---|---|---|
| (space) | AND | cats kittens | searches for tweets that contain both keywords 'cats' AND 'kittens' |
| OR | OR | cats OR kittens | searches for tweets that contain keywords 'cats' OR 'kittens' |
| - | NOT | cats -kittens | searches for tweets that contain the keyword 'cats', but NOT 'kittens' |
| ( ) | Grouping | (cats OR kittens) (dogs puppies) | searches for tweets that contain keywords 'cats' OR 'kittens', AND 'dogs' AND 'puppies' |
| " " | Exact string | "cats are cute" | searches for the exact string as a keyword, e.g. "cats are cute" |
AND operators are evaluated before OR operators. For example, 'cats OR dogs puppies' will be evaluated as 'cats OR (dogs AND puppies)'.
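As a concrete illustration of how these operators can be combined in the `query` list in `config.yml` (the first two strings are made-up examples; the third is taken from the configuration example earlier in this document):

```yaml
# Illustrative only - each string is one standalone query of up to 1024 characters
query: ['(cats OR kittens) -dogs',       # cats or kittens, but not dogs
        '"adopt dont shop" puppies',     # the exact phrase AND the keyword 'puppies'
        'birds chicks from:albomp']      # both keywords, only in tweets from @albomp
```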
The DATA tool collects Twitter data using the Twarc2 library. During processing, it generates several extra fields to enrich these data. One such field is called 'reference_level'.
Tweets that match the search parameters sit at reference_level 0. If a tweet at reference level 0 references another tweet in any way (i.e. retweet, quote, reply), that referenced tweet is situated at reference_level 1, and so on. Unless you are interested in these referenced tweets, it can be useful to filter the data to tweets where reference_level = 0 at the time of analysis. This focuses your analysis on tweets that directly match your search parameters and reduces the size of your working dataset.
If you have a .json file from Twitter's Tweet Downloader, you can have this processed and pushed to a Google BigQuery dataset by following these steps:
TBC
\* Load from 'Tweet Downloader' .json is currently in the testing phase.
This tool is designed to run on a user's local computer. In order to keep collected file sizes manageable, collections are split into files containing (on average) 100,000 tweets. This means that for collections greater than this, files will average approximately 1GB in size.
If you need to clear some space on your hard drive, you can remove collected .json files from the `DATA_collector/my_collections/your_directory/collected_json` folder while the collector is running and move them to a backup location. If you do remove files from this location and need to stop/restart the collector, you will need to update the start_date parameter in `config.yml` to avoid re-collecting those files. Be sure not to remove the current collection file (the newest file in the directory).
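For example (the timestamp and its format below are purely illustrative; use whatever date format `config_template.yml` specifies), if the files you moved to backup cover everything collected up to mid-June, you might restart with a start_date just after the newest tweet you already hold:

```yaml
# Hypothetical restart point - set start_date to just after the newest tweet
# already collected; leave end_date as originally configured.
start_date: '2022-06-15T00:00:00Z'
```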
A timestamped log file will be generated in the `DATA_collector/logging` directory each time you run a collection. This log file contains all outputs and can aid in investigating any errors that may arise.
Refer to the FAQs.
This is an ongoing development project. If you have any further questions, please send an email to [email protected]. If you get any errors, please open an issue on GitHub.