UNESP_PPGCC_Scraper is a web scraping tool designed to extract and process information from the IBILCE UNESP website (www.ibilce.unesp.br), with a specific focus on the Computer Science graduate program (PPGCC) pages. The project combines Scrapy and Selenium to handle dynamic, JavaScript-rendered content, ensuring comprehensive data collection. It also implements a Scrapy pipeline for post-scraping operations (a minimal sketch follows this list):

- HTML to Markdown conversion
- Relative to absolute link conversion
- External file download and conversion (e.g., PDF to PNG)
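To make the pipeline stage concrete, here is a minimal sketch of such a post-processing step. It assumes the `markdownify` and `beautifulsoup4` packages; the class name and the `html`/`url` item fields are illustrative, not necessarily those used in `selenium_scraper/pipelines.py`.

```python
# Illustrative sketch only; the real pipeline lives in
# selenium_scraper/pipelines.py and may differ in names and details.
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed
from markdownify import markdownify  # assumes markdownify is installed


class PostProcessingPipeline:
    """Convert scraped HTML to Markdown with absolute links."""

    def process_item(self, item, spider):
        soup = BeautifulSoup(item["html"], "html.parser")

        # Rewrite relative hrefs/srcs against the page URL.
        for tag, attr in (("a", "href"), ("img", "src")):
            for node in soup.find_all(tag, **{attr: True}):
                node[attr] = urljoin(item["url"], node[attr])

        # Convert the link-fixed HTML to Markdown.
        item["markdown"] = markdownify(str(soup))
        return item
```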
Before you begin, ensure you have the following installed:
- Python 3.11
- ChromeDriver
- LibreOffice (accessible via the `soffice` command)
- Clone the repository:

  ```bash
  git clone https://github.com/unesp-ppgcc-2024-02-ap-chatbot/UNESP_PPGCC_Scraper.git
  cd UNESP_PPGCC_Scraper/scraper
  ```
- Create and activate a virtual environment:

  ```bash
  # Linux
  python3 -m venv .venv
  source .venv/bin/activate

  # Windows
  python -m venv .venv
  .venv\Scripts\activate
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Download ChromeDriver and place the executable (e.g., `chromedriver.exe` on Windows) in the `./scraper/` folder.

- Replace the following files in your virtual environment:

  ```
  .venv/lib/python3.11/site-packages/scrapy_selenium/http.py
  .venv/lib/python3.11/site-packages/scrapy_selenium/middlewares.py
  ```

  with the files provided in `./scrapy-selenium-update/`.
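The patched middleware works together with scrapy-selenium's documented settings. As a reference, `selenium_scraper/settings.py` typically wires the driver in along these lines (values are illustrative; check the project's actual settings file):

```python
# Sketch of a scrapy-selenium configuration; the values are illustrative
# and may differ from selenium_scraper/settings.py.
SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_EXECUTABLE_PATH = "./chromedriver"  # the binary placed in ./scraper/
SELENIUM_DRIVER_ARGUMENTS = ["--headless"]  # run Chrome without opening a window

DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}
```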
All commands below should be run from the `scraper/` directory of the cloned repository.
```bash
scrapy crawl links_spider -o "./data/metadata/links_metadata.csv" -t csv
```
This step scrapes all links from UNESP's Graduate Program in Computer Science (PPGCC) pages and saves them to a CSV file.
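For readers new to scrapy-selenium, the spiders in `selenium_scraper/spiders/scraping.py` issue `SeleniumRequest`s so that pages are rendered by Chrome before parsing. The sketch below shows the general pattern; the start URL and parsing logic are simplified stand-ins for the real spider:

```python
# Minimal sketch of a Selenium-backed link spider; the project's actual
# spider in selenium_scraper/spiders/scraping.py is more elaborate.
import scrapy
from scrapy_selenium import SeleniumRequest


class LinksSpider(scrapy.Spider):
    name = "links_spider"

    def start_requests(self):
        # Render the page in Chrome so JavaScript-generated links appear.
        yield SeleniumRequest(
            url="https://www.ibilce.unesp.br/",  # illustrative start URL
            callback=self.parse,
        )

    def parse(self, response):
        # Collect every hyperlink on the rendered page.
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}
```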
```bash
scrapy crawl page_content_spider -o "./data/metadata/page_content_metadata.csv" -t csv
```
This step extracts the main content from each page, saves HTML snippets, and converts them to Markdown format.
```bash
scrapy crawl external_content_spider -o "./data/metadata/external_content_metadata.csv" -t csv
```
This step downloads external files referenced on the pages, converts them to Markdown, and generates PNG images of the content.
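The conversion step is why LibreOffice's `soffice` command must be available. A sketch of how such a headless conversion can be invoked from Python (the project's pipeline may pass different options):

```python
# Sketch of a headless LibreOffice conversion; the project's pipeline
# may invoke soffice with different options or output paths.
import subprocess


def convert_to_png(input_path: str, out_dir: str) -> None:
    """Render a downloaded document (e.g., a PDF) to PNG via LibreOffice."""
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "png",
         "--outdir", out_dir, input_path],
        check=True,  # raise CalledProcessError if the conversion fails
    )
```

Note that a single `--convert-to png` invocation renders one image, so multi-page documents may need extra handling (e.g., one image per page via another tool).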
The scraper produces the following output files:

- `links_metadata.csv`: Contains scraped links and their metadata.
- `page_content_metadata.csv`: Contains information about extracted page content.
- `external_content_metadata.csv`: Contains information about downloaded external files.
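These metadata files are plain CSVs written by Scrapy's feed export, so they can be inspected with standard tooling. For example (the column names depend on the item definitions in `selenium_scraper/items.py`):

```python
# Quick inspection of a metadata feed; column names depend on the
# item definitions in selenium_scraper/items.py.
import csv

with open("./data/metadata/links_metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)  # each row maps column name -> value
```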
The repository is organized as follows:

```
./
├── LICENSE.txt
├── README.MD
├── requirements.txt
├── scraper
│   ├── chromedriver
│   ├── data
│   │   ├── metadata
│   │   │   ├── external_content_metadata.csv
│   │   │   ├── links_metadata.csv
│   │   │   └── page_content_metadata.csv
│   │   ├── preprocessed
│   │   │   ├── external_content
│   │   │   │   ├── markdown
│   │   │   │   └── png
│   │   │   └── page_content
│   │   │       └── markdown
│   │   └── raw
│   │       ├── external_content
│   │       └── page_content
│   │           └── html
│   ├── scrapy.cfg
│   └── selenium_scraper
│       ├── items.py
│       ├── middlewares.py
│       ├── pipelines.py
│       ├── settings.py
│       ├── spiders
│       │   └── scraping.py
│       └── utils.py
└── scrapy-selenium-update
    ├── http.py
    └── middlewares.py
```
This project is licensed under the MIT License; see the `LICENSE.txt` file for details.