A ‘whitelist’ data-gathering pipeline that collects ads directly from platform-provided transparency libraries into aggregate collections for downstream analysis.
- Homebrew: a package-management command-line interface used to install, update, and manage software on macOS.
- Git: a version control system, used to collaborate and to clone the public repos (or repos to which you have at least read access).
- Python 3.8.5.
- Anaconda/virtualenv: to create an environment in which all the dependencies/libraries for this CLI tool are installed.
- AWS: an account with S3 access, used to store the raw JSON output.
- GBQ: a Google BigQuery project, used to store the structured ad data.
- Create a Conda environment; remember its name, since you will need to activate it every time you use this tool.
```bash
conda init
conda create --name <replace_this> python=3.8
conda activate <replace_this>
```
- Clone the code from the GitHub repository and install the dependencies:
```bash
git clone https://github.com/qut-dmrc/adobserve_whitelist.git
cd adobserve_whitelist/source
pip install -r requirements.txt
```
- Get a list of FB pages to track
- There are two ways to gather FB page IDs.
- Get IDs from CrowdTangle (CT), given a list of FB profile URLs to public accounts. Access to CrowdTangle must be requested separately. This method is more robust than getting IDs directly from the Facebook Ad Library, but you can skip it if you do not have access to CT.
- a. Create a new empty list in one of the CT dashboards. E.g. whitelist_method_test
- b. Run CT_page_id/clearn_url_to_CT_template.py, batch upload the template, and keep a record of pages that failed to be tracked:
```python
filename = 'Ad monitoring targets.csv'  # the CSV file with FB page URLs
target_col = "Page ID"                  # the column with FB page URLs
list_name = "whitelist_method_test"
create_CT_list(filename, target_col, list_name)
```
- c. Run CT_page_id/facebook_page_info_CT.py to get account info from CT (mainly the page name, used to search the Ad Library, and the accountHandle, used to verify it is the right account); a hedged sketch of this lookup follows below.
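As a rough illustration, the lookup behind facebook_page_info_CT.py could look like the minimal sketch below. It assumes a CrowdTangle API token and the documented `/lists/{listId}/accounts` endpoint; the token and list ID are placeholders, and the repo script remains the authoritative version.

```python
import requests

# Hedged sketch of the CrowdTangle lookup; CT_TOKEN and LIST_ID are
# placeholders you must supply from your own CT dashboard.
CT_TOKEN = "<your-crowdtangle-api-token>"
LIST_ID = "<numeric id of whitelist_method_test>"

resp = requests.get(
    f"https://api.crowdtangle.com/lists/{LIST_ID}/accounts",
    params={"token": CT_TOKEN, "count": 100},
    timeout=30,
)
resp.raise_for_status()
for account in resp.json()["result"]["accounts"]:
    # The name is used to search the Ad Library; the handle verifies
    # that the matched account is the right one.
    print(account["platformId"], account["name"], account.get("handle"))
```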
- Get IDs from the Meta Ad Library directly, given a list of account names (Step 3).
- Get page IDs from the Facebook Ad Library (FB_page_id/facebook_ad_library_page_id.py):
- Replace credentials before running this script.
- Go to the Meta Ad Library.
- Right-click the page and select Inspect.
- Go to the Network tab.
- Interact with the website and find the server request whose response contains the page info.
- Right-click that request and copy it as cURL.
- Convert the cURL command to a Python request (a hedged sketch of the converted request appears after this list).
- Copy all the credentials and paste them into get_page_info.
- Replace `'q': '...'` in `params` with `'q': page["name"]`.
- Update the variables to point to your CSV file, and update the column names for page name and handle.
- Run FB_page_id/facebook_ad_library_page_id.py
- You can get the IDs from pages.csv and the page info from results/[whitelist_test].json. (Replace whitelist_test with a meaningful name for your case study.)
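For orientation, the converted request might look like the sketch below. Everything here is a placeholder: the endpoint, headers, and cookies must come from your own copy-as-cURL capture, and this is an illustration of the pattern rather than the script's actual code.

```python
import requests

# Hedged sketch of the converted cURL call. The URL, headers, and cookies
# below are placeholders; paste the real values from your own capture.
AD_LIBRARY_URL = "https://www.facebook.com/ads/library/<endpoint-from-curl>"
HEADERS = {"user-agent": "<paste from the converted cURL>"}
COOKIES = {"c_user": "<paste>", "xs": "<paste>"}

def get_page_info(page):
    """Search the Ad Library for one page and return the raw response body."""
    params = {
        # ...the other query parameters copied from the cURL conversion...
        "q": page["name"],  # the hard-coded 'q' replaced with the page name
    }
    resp = requests.get(AD_LIBRARY_URL, headers=HEADERS, cookies=COOKIES,
                        params=params, timeout=30)
    resp.raise_for_status()
    return resp.text
```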
- Run CT_page_id/merge_id_url.py (a hedged sketch of the merge it performs follows this list). Outputs and steps:
- a. The first iteration produces a CSV of page_url, name, and id; comment out section #3 onward in the code for this pass.
- b. Cross-check the result against the Facebook Ad Library or its report page (https://www.facebook.com/ads/library/report/?source=nav-header); you may need to gather IDs manually for pages that could not be retrieved programmatically. Do not edit the file in Excel and save it as CSV, or all the IDs will be truncated; edit it directly in a code editor.
- c. Once the IDs are complete, comment out section #1 in the code and run the second part (#3) of the script.
- d. ids.txt contains all the IDs to be added to the tracking list.
- e. The final Excel/CSV file, with the original data plus an id column, is kept for reference.
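The join performed by merge_id_url.py is roughly as follows; this is a sketch only, and the file names, column names, and output layout are assumptions rather than the script's actual identifiers.

```python
import pandas as pd

# Hedged sketch of the URL/name/ID merge; all names here are assumptions.
urls = pd.read_csv("account_info_CT.csv")  # page URL + name (from the CT step)
ids = pd.read_csv("pages.csv")             # page name + id (from the Ad Library step)

merged = urls.merge(ids, on="name", how="left")
merged.to_csv("merged_ids.csv", index=False)  # kept for reference, with the id column

# ids.txt: one ID per line, ready to paste into config.py's `pages` list.
merged["id"].dropna().astype("int64").astype(str).to_csv(
    "ids.txt", index=False, header=False)
```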
- Use the IDs to collect ads from the Ad Library. (See the regular collection steps below.)
- The JSON files in results/ can be uploaded as page data to a structured database of your choice (Google BigQuery in our example); a hedged load sketch follows these steps. [Optional]
- a. Run update_gbq_tables.py
- Update the page category data on GBQ too. [Optional]
- a. Run update_page_category.py
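If you need to adapt the upload, the core of a BigQuery load job looks like the sketch below, assuming the results files are newline-delimited JSON; the project, dataset, table, and file names are placeholders, and update_gbq_tables.py is the authoritative implementation.

```python
from google.cloud import bigquery

# Hedged sketch of a JSON load into BigQuery; all names are placeholders.
client = bigquery.Client(project="your-bq-project")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                     # infer the schema from the JSON
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
with open("results/whitelist_test.json", "rb") as f:
    job = client.load_table_from_file(
        f, "your-bq-project.whitelist_test.page_info", job_config=job_config)
job.result()  # block until the load job finishes
```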
- Create a bucket with the same name as the dataset in AWS S3 (a hedged sketch follows below).
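Bucket creation can be done in the AWS console or, as a minimal sketch, with boto3 as below; the region is a placeholder. Note that S3 bucket names cannot contain underscores, so a dataset called whitelist_test needs a compliant variant such as whitelist-test.

```python
import boto3

# Hedged sketch: create the S3 bucket the pipeline pushes raw JSON to.
# The region is a placeholder; outside us-east-1, S3 requires an explicit
# LocationConstraint. S3 bucket names may not contain underscores.
s3 = boto3.client("s3", region_name="ap-southeast-2")
s3.create_bucket(
    Bucket="whitelist-test",
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-2"},
)
```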
- Check that config.py assigns the IDs to track to the variable `pages` (e.g. `pages = ['123','456','789']`), and point `bq_project` and `bq_dataset` to your BigQuery project and dataset.
- Create folders media/ and raw_json/ inside source/.
- Modify the `if __name__ == '__main__':` block for the initial run:
```python
if __name__ == '__main__':
    tablenames = ['page_info', 'ad_main', 'ad_snapshot_card', 'page_category']
    dataset = "whitelist_test"
    mainLoop(dataset, True, False)
    # combine_new_data_with_existing(tablenames, dataset)
    # push_json_to_s3(dataset, "raw_json/", "raw_json/")
```
- Run the tool in the terminal from whitelist_test/source:
```bash
python __main__.py
```
- Update the session/cookies each time before a regular collection run.
- Modify the `if __name__ == '__main__':` block to use `mainLoop(dataset, False, False)` and enable the combine and push steps:
```python
if __name__ == '__main__':
    tablenames = ['page_info', 'ad_main', 'ad_snapshot_card', 'page_category']
    dataset = "whitelist_test"
    mainLoop(dataset, False, False)
    combine_new_data_with_existing(tablenames, dataset)
    push_json_to_s3(dataset, "raw_json/", "raw_json/")
```
- Run:
```bash
python __main__.py
```
- If the process gets interrupted midway through the list of pages, rename the latest JSON file bearing today's date to the name of your dataset (e.g. whitelist_test), set the call in the `if __name__ == '__main__':` block to `mainLoop(dataset, False, True)`, and rerun (a hedged rename sketch follows below).
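The rename can be done by hand, or with a sketch like the following; the raw_json/ location is an assumption based on the folders created earlier.

```python
import glob
import os

# Hedged sketch of the recovery rename: take the most recent dated JSON
# file and rename it to the dataset name so the resumed run picks it up.
latest = max(glob.glob("raw_json/*.json"), key=os.path.getmtime)
os.rename(latest, "raw_json/whitelist_test.json")
```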
Please do not hesitate to contact our team for any inquiries or assistance.
Jane Tan [email protected]
Daniel Angus [email protected]