These scripts were written in 2020 and no longer work with the new Facebook UI, so this approach is deprecated. You can still use them as a reference for similar implementations with Selenium.
The example scripts extract the ID, user info, content, date, comments, and replies of posts.
👉 Demo: https://www.youtube.com/watch?v=Fx0UWOzYsig
Note:
- These scripts only work on a Facebook page viewed while signed out, not on a group or any other kind of object.
- You may need to update some of the CSS selectors in the scripts, since Facebook may have changed them by the time you run them.
- Getting information about posts.
- Filtering comments.
- Checking for redirects.
- Can run in an Incognito window.
- Simplifying the browser to minimize loading time.
- Delaying by a random interval between "load more" actions to simulate human behavior.
- No sign-in required, which avoids Facebook's Checkpoint.
- Hiding the IP address to avoid being banned, by:
  - Collecting proxies with HTTP Request Randomizer and filtering out the slowest ones.
  - Using Tor relays, as used by Tor Browser: a network made up of thousands of volunteer-run servers.
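The random delay between "load more" actions can be sketched roughly as follows. This is only an illustration, not the code from the scripts: `random_delay`, `load_more`, the `click_load_more` callback, and the default bounds of 1 to 3 seconds are all hypothetical.

```python
import random
import time


def random_delay(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random interval and return the number of seconds waited."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    seconds = random.uniform(min_seconds, max_seconds)
    sleep(seconds)
    return seconds


def load_more(click_load_more, times, min_seconds=1.0, max_seconds=3.0,
              sleep=time.sleep):
    """Call click_load_more() `times` times, pausing randomly after each call."""
    waits = []
    for _ in range(times):
        click_load_more()  # hypothetical callback that clicks the button
        waits.append(random_delay(min_seconds, max_seconds, sleep))
    return waits
```

Passing `sleep` as a parameter keeps the pause logic easy to test without actually waiting.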
- Some failed responses cannot be detected automatically. Example: Rate limit exceeded (Facebook stops loading more content). ➔ Run with HEADLESS = False to spot them manually.
- Quite slow when loading more many times or when using IP-hiding techniques.
- In the output, each post is written on its own line.
- Most of my successful tests were on Firefox with the HTTP Request Randomizer proxy server.
- My latest run on Firefox with Incognito windows using HTTP Request Randomizer:
Example data fields for a post
{
  "url": "https://www.facebook.com/KTXDHQGConfessions/videos/352525915858361/",
  "id": "352525915858361",
  "utime": "1603770573",
  "text": "Diễn tập PCCC tại KTX khu B tòa E1. ----------- #ktx_cfs Nguồn : Trường Vũ",
  "reactions": ["308 Like", "119 Haha", "28 Wow"],
  "total_shares": "26 Shares",
  "total_cmts": "169 Comments",
  "crawled_cmts": [
    {
      "id": "Y29tbWVudDozNDM0NDI0OTk5OTcxMDgyXzM0MzQ0MzIyMTY2MzcwMjc%3D",
      "utime": "1603770714",
      "user_url": "https://www.facebook.com/KTXDHQGConfessions/",
      "user_id": "KTXDHQGConfessions",
      "user_name": "KTX ĐHQG Confessions",
      "text": "Toà t á bây :) #Lép",
      "replies": [
        {
          "id": "Y29tbWVudDozNDM0NDI0OTk5OTcxMDgyXzM0MzQ0OTc5MDk5NjM3OTE%3D",
          "utime": "1603772990",
          "user_url": "https://www.facebook.com/KTXDHQGConfessions/",
          "user_id": "KTXDHQGConfessions",
          "user_name": "KTX ĐHQG Confessions",
          "text": "Nguyễn Hoàng Đạt thật đáng tự hào :) #Lép"
        }
      ]
    }
  ]
}
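Because each post sits on its own line, the results can be read back one JSON object at a time. The sketch below assumes every non-empty line holds a post shaped like the example above. `load_posts`, `summarize`, and the file name `posts.json` are hypothetical names, not part of the scripts.

```python
import json
from datetime import datetime, timezone


def load_posts(lines):
    """Parse one JSON post per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(post):
    """Return the post id, its UTC timestamp, and comment/reply counts."""
    comments = post.get("crawled_cmts", [])
    replies = sum(len(comment.get("replies", [])) for comment in comments)
    # "utime" is stored as a string of Unix seconds in the example above
    posted_at = datetime.fromtimestamp(int(post["utime"]), tz=timezone.utc)
    return {
        "id": post["id"],
        "posted_at": posted_at.isoformat(),
        "comments": len(comments),
        "replies": replies,
    }


# Usage (posts.json is a hypothetical output file name):
# with open("posts.json", encoding="utf-8") as f:
#     for post in load_posts(f):
#         print(summarize(post))
```

Note that `comments` counts only the crawled comments, which may be fewer than the total shown in `total_cmts`.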
pip install -r requirements.txt
- Helium: a wrapper around Selenium that provides a higher-level API for web automation.
- HTTP Request Randomizer: used for collecting free proxies.
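Free proxies vary a lot in speed, so it helps to time each one and keep only the fastest. The sketch below shows one possible way to do that and is not taken from the scripts. `measure_latency`, `fastest_proxies`, and the `probe` callback are hypothetical; in practice `probe` might load a small page through the given proxy address.

```python
import time


def measure_latency(address, probe, timer=time.perf_counter):
    """Return how long probe(address) takes in seconds, or None if it raises."""
    start = timer()
    try:
        probe(address)
    except Exception:
        return None  # treat any failure as a dead proxy
    return timer() - start


def fastest_proxies(addresses, probe, keep=5, timer=time.perf_counter):
    """Keep the `keep` fastest addresses, dropping those whose probe failed."""
    timed = []
    for address in addresses:
        latency = measure_latency(address, probe, timer)
        if latency is not None:
            timed.append((latency, address))
    timed.sort()  # lowest latency first
    return [address for _, address in timed[:keep]]
```

The addresses are plain strings here, such as `"host:port"`.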
II. Customize CONFIG VARIABLES in crawler.py
- Running the Browser:
  - PAGE_URL: URL of the Facebook page.
  - TOR_PATH: use a proxy through Tor on WINDOWS / MAC / LINUX, or NONE to run without Tor.
  - BROWSER_OPTIONS: run the scripts using CHROME / FIREFOX.
  - PRIVATE: run in private mode or not:
    - Prevents Selenium detection ➔ navigator.webdriver must be undefined (check in Dev Tools).
    - Starts the browser in an Incognito / Private window.
  - USE_PROXY: run with a proxy or not. If True ➔ check:
    - IF TOR_PATH ≠ NONE ➔ use Tor's SOCKS proxy server.
    - ELSE ➔ randomize proxies with HTTP Request Randomizer.
  - HEADLESS: run with a headless browser or not.
  - SPEED_UP: simplify the browser to minimize loading time or not. If True ➔ use the following settings:
    - With Chrome:

          # Disable loading image, CSS, ...
          browser_options.add_experimental_option('prefs', {
              "profile.managed_default_content_settings.images": 2,
              "profile.managed_default_content_settings.stylesheets": 2,
              "profile.managed_default_content_settings.cookies": 2,
              "profile.managed_default_content_settings.geolocation": 2,
              "profile.managed_default_content_settings.media_stream": 2,
              "profile.managed_default_content_settings.plugins": 1,
              "profile.default_content_setting_values.notifications": 2,
          })

    - With Firefox:

          # Disable loading image, CSS, Flash
          browser_options.set_preference('permissions.default.image', 2)
          browser_options.set_preference('permissions.default.stylesheet', 2)
          browser_options.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
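The USE_PROXY rules above boil down to a small decision. The sketch below restates it in code. `TorPath` and `choose_proxy_mode` are hypothetical stand-ins, and the real definitions in crawler.py may look different.

```python
from enum import Enum


class TorPath(Enum):
    """Hypothetical stand-in for the TOR_PATH options listed above."""
    WINDOWS = "windows"
    MAC = "mac"
    LINUX = "linux"
    NONE = "none"


def choose_proxy_mode(use_proxy, tor_path):
    """Return "none", "tor", or "random" following the USE_PROXY rules."""
    if not use_proxy:
        return "none"    # connect directly
    if tor_path is not TorPath.NONE:
        return "tor"     # use Tor's SOCKS proxy server
    return "random"      # randomize proxies with HTTP Request Randomizer
```

For example, `choose_proxy_mode(True, TorPath.NONE)` selects the randomized free proxies.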
- Loading the Page:
python crawler.py
- Run while signed out, because some CSS selectors differ when signed in.
- With some proxies, loading might be quite slow, or Facebook might redirect to the sign-in page.
- To achieve higher speed:
  - If this is your first time using these scripts, you can run without Tor and proxies until Facebook requires you to sign in.
  - Use a popular VPN service (also running without Tor and proxies): NordVPN, ExpressVPN, ...
- With HTTP Request Randomizer:
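      import random

      # Assumed to export proxies, BROWSER_OPTIONS, setup_free_proxy and kill_browser
      from browser import *

      # Open the Tor check page through a randomly chosen free proxy
      page_url = 'http://check.torproject.org'
      proxy_server = random.choice(proxies).get_address()
      browser_options = BROWSER_OPTIONS.FIREFOX
      setup_free_proxy(page_url, proxy_server, browser_options)
      # kill_browser()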
- With Tor Relays:
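      # Assumed to export TOR_PATH, BROWSER_OPTIONS, setup_tor_proxy and kill_browser
      from browser import *

      # Open the Tor check page through Tor's SOCKS proxy server
      page_url = 'http://check.torproject.org'
      tor_path = TOR_PATH.WINDOWS
      browser_options = BROWSER_OPTIONS.FIREFOX
      setup_tor_proxy(page_url, tor_path, browser_options)
      # kill_browser()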
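For reference, routing Firefox through Tor's SOCKS proxy usually comes down to a few proxy preferences. The sketch below is an illustration and not the body of `setup_tor_proxy`. `tor_socks_preferences` and `apply_preferences` are hypothetical helpers. The default port 9150 (commonly used by Tor Browser, with 9050 common for a standalone tor daemon) is an assumption about the local setup.

```python
def tor_socks_preferences(host="127.0.0.1", port=9150):
    """Build Firefox preferences that route traffic through a SOCKS5 proxy.

    The host and port defaults are assumptions; adjust them to the local Tor setup.
    """
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.socks": host,
        "network.proxy.socks_port": port,
        "network.proxy.socks_version": 5,
        "network.proxy.socks_remote_dns": True,  # resolve DNS through the proxy
    }


def apply_preferences(browser_options, preferences):
    """Copy each preference onto an object exposing set_preference(name, value)."""
    for name, value in preferences.items():
        browser_options.set_preference(name, value)
    return browser_options
```

Here `browser_options` can be any object with a `set_preference` method, such as Selenium's Firefox options. Loading `http://check.torproject.org` afterwards shows whether traffic actually goes through Tor.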