Dublin bus project @ Data gathering and analysis project. Technion 094290 course - Winter 2020-2021

yotammarton/dublin-bus

Dublin bus

1st place - students final projects

Presentation - Youtube (click)


Features

In this project we help the Dublin bus control center and the citizens of the city by:

  • Flagging peaks in air pollution (data from BreezoMeter.com)

  • Recommending city closures - when air pollution is extremely high, we want to reduce emissions, so we advise preventing private cars from entering the city center.

  • Real-time transit alternatives - car drivers are offered an alternative bus route to the city center, based on real-time bus locations (Kafka stream)

Prerequisites

  1. Databricks machine with Databricks Runtime Version 6.4 (includes Apache Spark 2.4.5, Scala 2.11)

  2. Another machine that will serve as our warehouse, running an Apache Cassandra cluster

    • This warehouse is used to log extreme air quality events
    • See instructions below for setup

notebook.ipnyb

  • Contains the cells to run in order to reproduce our results
  • A Kafka stream with dublin-bus data is required
  • Air quality data is also required, but we do not supply it

In this code we:

  • Show the closure location that will be recommended upon a high air pollution event
  • Calculate bus ride alternatives between every two bus stops in Dublin, based on Kafka stream bus locations
  • Process air quality data
  • Read the Kafka data stream for dublin-bus data
  • Impute missing air quality index values by windowed mean imputation [1]
  • Create and update a live map for the dashboard, and deploy it to git to update the bus control center dashboard
  • Upload data on the times of high air pollution events to the warehouse
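
The windowed mean imputation step can be illustrated with a minimal sketch. This is not the notebook's code: the function name `windowed_mean_impute`, the `window` parameter, and the sample values are assumptions made for illustration, and the notebook itself may use a different window definition (for example a time-based Spark window).

```python
from statistics import mean
from typing import List, Optional


def windowed_mean_impute(values: List[Optional[float]], window: int = 2) -> List[Optional[float]]:
    """Fill each missing entry (None) with the mean of the observed values
    that lie within `window` positions on either side of it.

    Entries with no observed neighbour inside the window stay None.
    """
    imputed = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        # Only original observations are used, never previously imputed values.
        neighbours = [x for x in values[lo:hi] if x is not None]
        if neighbours:
            imputed[i] = mean(neighbours)
    return imputed


# Hypothetical air quality index readings with gaps.
aqi = [40.0, None, 44.0, 50.0, None, None, 62.0]
filled = windowed_mean_impute(aqi, window=1)
print(filled)
```

Using only the original observations inside the window keeps each imputed value independent of the order in which gaps are filled.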

Cassandra cluster setup

  • We will set up Cassandra 3.11.9 on our machine, configure the cluster and open communication with other machines on the same network
  1. If reinstalling, first run:
sudo apt-get remove cassandra
sudo apt-get autoremove cassandra
sudo rm -rf /var/lib/cassandra
sudo rm -rf /var/log/cassandra
sudo rm -rf /etc/cassandra
  • Then search for leftovers and delete everything it finds: sudo find / -name 'cassandra'
  2. Fresh installation of Cassandra:
sudo rm -rf /etc/apt/sources.list.d/cassandra.sources.list
echo "deb https://downloads.apache.org/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt-get update
sudo apt-get install cassandra
  3. Verify that you see cassandra.yaml: ls /etc/cassandra/

  4. Verify the node is running: nodetool status

  5. Run cqlsh, then exit

  6. Stop Cassandra: run nodetool drain and let it finish, then run systemctl stop cassandra

  7. Configure cassandra.yaml (configuration file): sudo nano /etc/cassandra/cassandra.yaml

  8. Create a new data dir in the right disk partition and move the existing Cassandra folders to that partition:
mkdir /StudentData/cassandra
sudo mv /var/lib/cassandra/data /StudentData/cassandra
sudo mv /var/lib/cassandra/commitlog /StudentData/cassandra
  9. Finally, start Cassandra: run systemctl start cassandra, then journalctl -f -u cassandra to watch the log, and exit.
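
Since the warehouse has to accept connections from another machine on the same network, a quick reachability check from that machine can help confirm the setup. The sketch below is an assumption-laden illustration, not part of the project: the helper name `is_port_open` is hypothetical, and it assumes the native transport port is left at Cassandra's default of 9042.

```python
import socket


def is_port_open(host: str, port: int = 9042, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    9042 is Cassandra's default native transport (CQL) port; pass a different
    value if cassandra.yaml was configured otherwise.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example call (the address is a placeholder for the warehouse machine):
# is_port_open("<warehouse-ip>")
```

A successful TCP connection only shows that the port is reachable; it does not verify that a CQL session can be opened.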

That's it!

References

[1] A Review of Missing Data Treatment Methods. Liu Peng, Lei Lei.


Yotam Martin, Gal Goldstein
