Amazon Data Science Books; Analysis & Visualizations

Folder Structure

  • csv_files contains the processed and unprocessed CSV files.
  • notebooks contains all the .ipynb files. The notebook used to preprocess the data can be found here.
  • scraper contains the scraper.py file, which was used to scrape the data from Amazon.

Problem statement

The goal of this project is to gather information about data science related books from Amazon. There are a total of 1351 entries in the csv_files/amazon_data_science_books.csv file.
We then used the scraped data to explore the following distributions and correlations in a Tableau dashboard:

  1. A doughnut chart showing the number of books published by each of the top 15 publishers and by all other publishers combined
  2. A bar chart of the top 15 publishers by number of books published
  3. Average price of books by the top 15 publishers
  4. Price range of books
  5. Pages vs Price trend
  6. Top books by user reviews (rating 4.0 - 5.0)
  7. Average reviews of Top 15 publishers
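The dashboard itself was built in Tableau, but the first three views can be approximated in plain Python. The following is only a minimal sketch: the column names `publisher` and `price` are assumptions (the real CSV headers may differ), and the sample rows are made up for illustration and are not taken from the dataset.

```python
from collections import Counter, defaultdict
from statistics import mean


def top_publishers(rows, n=15):
    """Return the n most frequent publishers as (publisher, book_count) pairs."""
    counts = Counter(row["publisher"] for row in rows if row.get("publisher"))
    return counts.most_common(n)


def average_price_by_publisher(rows, publishers):
    """Return {publisher: mean price} for the given publishers.

    Rows with a missing or empty price are skipped.
    """
    wanted = set(publishers)
    prices = defaultdict(list)
    for row in rows:
        if row.get("publisher") in wanted and row.get("price") not in (None, ""):
            prices[row["publisher"]].append(float(row["price"]))
    return {pub: round(mean(vals), 2) for pub, vals in prices.items()}


# Tiny made-up sample (hypothetical publishers and prices, NOT real data).
sample_rows = [
    {"publisher": "Publisher A", "price": "30.00"},
    {"publisher": "Publisher A", "price": "50.00"},
    {"publisher": "Publisher B", "price": "20.00"},
    {"publisher": "Publisher C", "price": ""},
]

top = top_publishers(sample_rows, n=2)
avg = average_price_by_publisher(sample_rows, [name for name, _ in top])
print(top)
print(avg)
```

To try this on the real file, the rows could be read with `csv.DictReader` from the processed CSV in csv_files, after adjusting the column names to match its header.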

Findings and Observations from the Dashboard

Note: Try viewing the Dashboard in Full Screen mode.

  1. Of the 1324 books that remain after preprocessing, 948 were published by just 15 publishers.
  2. Packt has published the most books.
  3. Springer has the highest average price.
  4. As the page count increases, the price of a book tends to increase.
  5. Most books are priced between 14.00 and 60.00 USD.
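The findings above come from the Tableau dashboard. As a rough way to check similar claims in code, the sketch below computes the share of prices inside a band and a Pearson correlation between page count and price. The helper names are hypothetical and the `pages` and `prices` lists are made-up numbers, not values from the dataset. Only the figures 948 and 1324 come from the findings.

```python
from math import sqrt


def share_in_range(prices, low=14.0, high=60.0):
    """Fraction of prices that fall within [low, high]."""
    prices = list(prices)
    if not prices:
        return 0.0
    return sum(low <= p <= high for p in prices) / len(prices)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustration values (NOT real data).
pages = [100, 200, 300, 400]
prices = [10.0, 20.0, 30.0, 80.0]

in_band = share_in_range(prices)  # share of prices between 14 and 60 USD
r = pearson(pages, prices)        # positive r means price rises with page count

# Share of books from the top 15 publishers, using the counts quoted above.
top15_share = 948 / 1324  # roughly 0.716
print(in_band, round(r, 3), round(top15_share, 3))
```

A positive correlation on the real data would be consistent with finding 4, although it would not by itself show that page count drives price.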

You can visit the public dashboard here

First look at the dashboard
Also, try clicking the bars in the bar plots to see how the dashboard changes.

Build from source and run the Selenium driver

  1. Clone the repo
git clone https://github.com/Tasfiq-K/amazon-data-science-books-analysis.git
  2. Initialize and activate a virtual environment
    If you are running Python 3.4+, you can use the venv module built into Python:
python -m venv <directory name>

For example, if you name your directory 'venv', run this command:

python -m venv venv

To activate the virtual environment, run:
On Windows

# In cmd.exe
venv\Scripts\activate.bat
# In PowerShell
venv\Scripts\Activate.ps1

On Linux or macOS

$ source venv/bin/activate
  3. Install dependencies
pip install -r requirements.txt
  4. Download the WebDriver
    Download the web driver of your choice. I used geckodriver, which works with the Firefox browser. You can download it from here

  5. Run the scraper

python scraper.py --geckodriver_path <path_to_geckodriver>
  6. You will get a file named amazon_data_science_books.csv containing all the required fields and data. Alternatively, check the scraped data here
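For readers curious how a script might consume the `--geckodriver_path` flag, here is a minimal sketch. It is not the repository's scraper.py: the function names are hypothetical, and the driver setup assumes the Selenium 4 style of passing a `Service` object to `webdriver.Firefox`.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Scrape data science book listings (illustrative sketch)."
    )
    parser.add_argument(
        "--geckodriver_path",
        required=True,
        help="Path to the geckodriver executable.",
    )
    return parser.parse_args(argv)


def make_firefox_driver(geckodriver_path):
    """Start a Firefox WebDriver session using the given geckodriver binary."""
    # Imported lazily so that argument parsing works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    return webdriver.Firefox(service=Service(executable_path=geckodriver_path))


# Example: parse an explicit argument list instead of the real command line.
args = parse_args(["--geckodriver_path", "/path/to/geckodriver"])
print(args.geckodriver_path)
# A real run would then call make_firefox_driver(args.geckodriver_path),
# load pages with driver.get(...), and finish with driver.quit().
```

Keeping argument parsing separate from driver creation makes the command-line handling easy to test without launching a browser.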

Analytics

Tableau Public View: https://public.tableau.com/app/profile/tasfiq.kamran/viz/AmazonDataScienceBooksDashboard/AmazonDataScienceBooks