Presentation - Youtube (click)
In this project we help the Dublin bus control center and citizines of the city by:
-
Indicating about picks in air pollution (data by BreezoMeter.com)
-
Recommendations for city closure - when the air pollution is extremely high, we want to reduce the emissions and advice to prevent the entrance of private cars to the city center.
- Real time transit alternatives - car drivers are provided with an alternative transportation option with buses to the city center, based on real-time bus locations (Kafka stream)
- Databricks machine with Databricks Runtime Version 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
- Python 3.7 or higher
- spark-cassandra-connector v2.4 from: https://github.com/datastax/spark-cassandra-connector
- packages from requirements.txt in this repo
-
Another machine that will be our warehouse, together with Apache Cassandra cluster
- This warehouse is used to log extreme air quality events
- See instructions below for setup
- Contains the cells to run to reproduce the code
- Kafka stream with dublin-bus data is needed
- Air Quality data is needed but we will not supply
In this code we:
- Show the closure location that will be recommended upon high air pollution event.
- Calculate bus rides alternatives between every two bus stops in Dublin, based on kafka stram bus locations
- Process air quality data
- Read the Kafka data stream for dublin-bus data
- Impute missing data in air quality index values by windowed mean imputation [1]
- Create and updated live map for the dashboard, deploy it to git to update the bus control center dashboard.
- Upload data to warehouse regarding times of high air pollution events
- We will setup Cassandra 3.11.9 on our machine, configure the cluster and open communication with other machine on the same network
- If reinstalling first do:
sudo apt-get remove cassandra
sudo apt-get autoremove cassandra
sudo rm -rf /var/lib/cassandra
sudo rm -rf /var/log/cassandra
sudo rm -rf /etc/cassandra
- Delete everything it finds:
sudo find / -name 'cassandra'
- Fresh installation of Cassandra
sudo rm -rf /etc/apt/sources.list.d/cassandra.sources.list
echo "deb https://downloads.apache.org/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt-get update
sudo apt-get install cassandra
-
Verify you see cassandra.yaml
ls /etc/cassandra/
-
Verify node is running :
nodetool status
-
Run
cqlsh
than exit -
Stop cassandra:
nodetool drain
and let it finish, than runsystemctl stop cassandra
-
Configure cassandra.yaml (configuration file):
sudo nano /etc/cassandra/cassandra.yaml
- Change in the nano editor
-
rpc_address -> 0.0.0.0
-
broadcast_rpc_address: 1.2.3.4 -> uncomment this line
-
data_file_directories -> /StudentData/cassandra/data
-
commitlog_directory -> /StudentData/cassandra/commitlog
- later on if you encouter querying request timeout you should add few zeroes (2-6) to every number like in:
-
- Create new data dir in the right disk partition, move existing cassandra folders to the partition
mkdir /StudentData/cassandra
sudo mv /var/lib/cassandra/data /StudentData/cassandra
sudo mv /var/lib/cassandra/commitlog /StudentData/cassandra
- Finally, Start cassandra:
systemctl start cassandra
thanjournalctl -f -u cassandra
and exit.
Thats it!
[1] A Review of Missing Data Treatment Methods. Liu Peng, Lei Lei.