Generic Web Scraper

Description

This project is a collection of scraping methods that I thought were cool, built on Selenium and Chrome. Driving a real browser helps with dynamically loaded content that uses AJAX or WebSockets to show information. The goal of this project is to build generic scraper tech that will work for a wide array of applications.
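
As a rough illustration of why a real browser helps with AJAX-driven pages, the sketch below loads a page in Chrome through Selenium and waits for an element to appear before reading the rendered HTML. The function names, the CSS selector argument, the Chrome flags, and the timeout are assumptions made for this example and are not taken from GenericWebScraper.py. The setup section installs xvfb and pyvirtualdisplay, which suggests the project runs Chrome inside a virtual display; the headless flag used here is a simpler stand-in for that.

```python
def chrome_flags(headless=True):
    """Return Chrome command-line flags for an unattended scraping session."""
    flags = ["--no-sandbox", "--disable-dev-shm-usage", "--window-size=1920,1080"]
    if headless:
        flags.insert(0, "--headless")
    return flags


def fetch_rendered_html(url, css_selector, timeout=15):
    """Load url in Chrome and return the HTML once css_selector is present.

    Raises selenium's TimeoutException if the element never appears.
    """
    # Selenium is imported here so chrome_flags() stays usable without it.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for flag in chrome_flags():
        options.add_argument(flag)

    # Older Selenium releases expect chromedriver to be on PATH.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the dynamically loaded element exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

With Chrome and ChromeDriver installed, a call such as fetch_rendered_html("https://example.com", "h1") should return the page source after the h1 element has loaded.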

Usage

-h, --help            show this help message and exit
-yt, --youtube        Grab some comments from a YouTube video
-cc CMCNT, --comment_count CMCNT
                      The number of pages of comments to get
-u URL, --url URL     The URL that will be grabbed
-db, -db_out          Output the data to SQLite
-c, --csv             Output the data as a CSV file
-o OUTPUT, --out_file OUTPUT
                      Output file
-v, --verbose         Make things verbose
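
The option list above looks like argparse help output, so a parser that would print something similar can be sketched as follows. This is a reconstruction from the help text only, not the code in GenericWebScraper.py: the dest names cmcnt and output are inferred from the CMCNT and OUTPUT placeholders, and the int type, the default of 1, and the store_true actions are assumptions. The video URL is a placeholder.

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the help text (reconstructed)."""
    parser = argparse.ArgumentParser(description="Generic Web Scraper")
    parser.add_argument("-yt", "--youtube", action="store_true",
                        help="Grab some comments from a YouTube video")
    # dest, type and default are assumptions based on the CMCNT placeholder.
    parser.add_argument("-cc", "--comment_count", dest="cmcnt", type=int,
                        default=1,
                        help="The number of pages of comments to get")
    parser.add_argument("-u", "--url", help="The URL that will be grabbed")
    # The help text lists both spellings with a single dash.
    parser.add_argument("-db", "-db_out", dest="db_out", action="store_true",
                        help="Output the data to SQLite")
    parser.add_argument("-c", "--csv", action="store_true",
                        help="Output the data as a CSV file")
    parser.add_argument("-o", "--out_file", dest="output", help="Output file")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Make things verbose")
    return parser


# Example: three pages of YouTube comments written to a CSV file.
args = build_parser().parse_args([
    "-yt",
    "-u", "https://www.youtube.com/watch?v=VIDEO_ID",
    "-cc", "3",
    "-c",
    "-o", "comments.csv",
])
print(args.youtube, args.cmcnt, args.csv, args.output)
```

The same arguments on the command line would be: python3 GenericWebScraper.py -yt -u URL -cc 3 -c -o comments.csv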

Setup Ubuntu 20.04

sudo apt-get install -y unzip xvfb libxi6 libgconf-2-4 libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

sudo dpkg -i google-chrome*.deb
sudo apt --fix-broken install
(You might have to run the dpkg command again if you had to fix broken dependencies.)

Python Dependencies

sudo apt install python3-pip
python3 -m pip install pyvirtualdisplay selenium pandas sqlalchemy
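
To confirm that the pip install worked, a small check like the one below can report which of the four packages Python still cannot find. The function name missing_packages and the REQUIRED list are made up for this sketch; the package names come from the pip command above.

```python
import importlib.util

# Import names of the packages installed by the pip command.
REQUIRED = ["pyvirtualdisplay", "selenium", "pandas", "sqlalchemy"]


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is listed as missing, rerunning the pip command with the same python3 interpreter is the first thing to try.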

Install ChromeDriver

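(The ChromeDriver version has to match the major version of the installed Chrome. Check it with google-chrome --version and change the version number in the URL if needed.)
wget https://chromedriver.storage.googleapis.com/85.0.4183.87/chromedriver_linux64.zip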
unzip chromedriver_linux64.zip
chmod +x chromedriver

sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

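(If the symlinks already exist from an earlier install, the ln commands above will fail. Remove the old links first, then run the ln commands again.)
sudo rm /usr/bin/chromedriver
sudo rm /usr/local/bin/chromedriver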
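
Since ChromeDriver only works with the Chrome major version it was built for, a quick check after installing both can save some debugging. The sketch below compares the major versions reported by the two binaries. The helper names are invented for this example, and it assumes the commands are called google-chrome and chromedriver and that both print a four-part version number for --version.

```python
import re
import shutil
import subprocess


def major_version(version_text):
    """Extract the major number from text such as 'ChromeDriver 85.0.4183.87'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    return int(match.group(1)) if match else None


def installed_major(binary):
    """Return the major version printed by `binary --version`, or None."""
    path = shutil.which(binary)
    if path is None:
        # The binary is not on PATH (for example, the symlink is missing).
        return None
    result = subprocess.run([path, "--version"], capture_output=True,
                            text=True, check=False)
    return major_version(result.stdout)


def check_install():
    """Describe whether Chrome and ChromeDriver are present and compatible."""
    chrome = installed_major("google-chrome")
    driver = installed_major("chromedriver")
    if chrome is None or driver is None:
        return "Chrome or ChromeDriver was not found on PATH"
    if chrome != driver:
        return f"Version mismatch: Chrome {chrome}, ChromeDriver {driver}"
    return f"OK: both are major version {chrome}"
```

Calling print(check_install()) after the setup steps shows whether the two binaries are on PATH and agree on the major version.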

