---
title: "Interesting Bike Usage"
author: "Mausam Duggal, WSP|PB, Systems Analysis"
date: "May 11, 2016"
output: html_document
---
```{r init, echo=FALSE, message=FALSE}
library(dplyr); library(ggplot2); library(knitr)
```
### Mining of UCI Bike Data
I came across this dataset on the **UCI Machine Learning Repository**, which I visit frequently. Some background on the dataset, quoted from the website, is noted below:
>Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.
>Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.
I found the above quite interesting, given that we are investing so much money in cycling facilities yet know so little about their usage and potential patronage. This is by **no means a comprehensive analysis**, and I don't intend it to be; it is more of a teaser. I have also not yet evaluated the wealth of information in the hour dataset. I will explore these datasets further as and when I have the time, and I might also write a paper or article about them. So if anyone is interested in **teaming up**, let me know. Otherwise, digest this and send me your comments!
#### INPUT DATA
We start by setting the **input directory** and loading the datasets (DAY, HOUR), then mine them in the hope of seeing some interesting patterns.
```{r Set Working Directory}
opts_knit$set(root.dir = 'c:/personal/r')
```
```{r batching in the DAY and HOUR dataset , echo=FALSE}
day <- read.csv(file = "c:/personal/r/day.csv", stringsAsFactors = FALSE)
hour <- read.csv(file = "c:/personal/r/hour.csv", stringsAsFactors = FALSE)
```
#### GO ALL IN AND SEE WHAT THE DATA SAYS
Now let's plot the variables and see what the data shows. The first attempt simply **plots the counts of rental bikes against the normalised feeling temperature (`atemp`), by season.**
```{r Now plot the data}
#' create a named vector because I don't like the seasons numbered from 1-4
w.names <- c(
  "1" = "Spring",
  "2" = "Summer",
  "3" = "Fall",
  "4" = "Winter"
)
#' Not surprisingly, as the X-scale approaches 1 (it is getting warmer), the usage of bikes
#' increases.
#' Zero degrees Celsius on the X-scale is represented by 0.24 (red line).
#' Interestingly enough, even when the temperature drops below 0, to as low as -8, there are
#' still bike users in Spring.
#' In Summer and Fall, roughly 35 degrees Celsius (0.8) seems to be the maximum temperature
#' bicycle users are willing to accept for cycling.
#' In Winter, cyclists are only active above 0 degrees Celsius.
ggplot(day, aes(x = atemp, y = cnt)) +
  geom_point(shape = 1) +  # use hollow circles
  geom_smooth() +
  facet_grid(. ~ season, labeller = as_labeller(w.names)) +
  geom_vline(xintercept = 0.24, color = "red")  # marks 0 degrees Celsius
```
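To read the X-scale in degrees, the normalised `atemp` values can be mapped back to Celsius. The sketch below assumes the scaling described in the dataset's README, `(t - t_min) / (t_max - t_min)` with `t_min = -16` and `t_max = 50` for `atemp`. The helper names are made up for illustration, and the constants are worth checking against your own copy of the README.

```{r atemp-to-celsius, eval=FALSE}
# Assumed scaling constants for atemp (feeling temperature), as given in the
# dataset README; verify them before relying on the converted values.
atemp_min <- -16
atemp_max <- 50

# Hypothetical helpers: convert between normalised atemp and degrees Celsius
atemp_to_celsius <- function(x) x * (atemp_max - atemp_min) + atemp_min
celsius_to_atemp <- function(deg) (deg - atemp_min) / (atemp_max - atemp_min)

celsius_to_atemp(0)    # position of the 0 degree line, about 0.24
atemp_to_celsius(0.8)  # about 36.8 degrees Celsius
```

Under these assumptions, `day$atemp` could be passed straight to `atemp_to_celsius()` to label the axis in degrees instead of the 0 to 1 scale.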
#### FOCUS ON WINTER AND SPRING
**Spring and Winter** are interesting. One would expect these two seasons to show more distinctive patterns, given the uncertain weather. So now let's look at them more carefully by evaluating them against **wind speeds.**
```{r cross check whether wind speeds make any difference in bike usage}
#' only keep spring (1) and winter (4) records
day.filter <- subset(day, season %in% c(1, 4))
#' plot windspeeds to see if they factor into the usage
#' windspeeds were divided by 67 (max wind). To give an
#' idea of the X-scale: a value of 0.3 indicates a wind
#' speed of about 20 km/hr
#' The data shows that 0.3 on the X-scale, or 20 km/hr, seems to be the
#' maximum speed cyclists are willing to endure during Spring and Winter
ggplot(day.filter, aes(x = windspeed, y = cnt)) +
  geom_point(shape = 1) +  # use hollow circles
  geom_smooth() +
  facet_grid(. ~ season, labeller = as_labeller(w.names)) +
  geom_vline(xintercept = 0.3, color = "red")  # about 20 km/hr
```
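The wind axis can be converted back to km/hr in the same spirit. A minimal sketch, assuming (as the chunk comments state) that `windspeed` was divided by a maximum of 67 km/hr; the helper name is hypothetical.

```{r windspeed-to-kmh, eval=FALSE}
# Assumed maximum wind speed (km/hr) used to normalise the windspeed column
wind_max <- 67

# Hypothetical helper: normalised windspeed back to km/hr
windspeed_to_kmh <- function(x) x * wind_max

windspeed_to_kmh(0.3)  # about 20 km/hr, the red line in the plot
```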
#### UNDERSTAND THE CASUAL VS REGISTERED IN SPRING AND WINTER
There still seem to be some serious bikers in here. It will be interesting to see how they are distributed between
the **casual and registered** users in the Spring and Winter months, and against wind speeds.
```{r now lets evaluate the differences between casual and registered users}
#' create plot of casual users
ggplot(day.filter, aes(x = windspeed, y = casual)) +
  geom_point(shape = 1) +  # use hollow circles
  geom_smooth() +
  facet_grid(. ~ season, labeller = as_labeller(w.names))
#' create plot of registered users
ggplot(day.filter, aes(x = windspeed, y = registered)) +
  geom_point(shape = 1) +  # use hollow circles
  geom_smooth() +
  facet_grid(. ~ season, labeller = as_labeller(w.names))
```
#### ENOUGH ALREADY
It is expected that registered users will be braver than casual users when it comes
to extreme weather conditions. However, the **scale of the difference** between the two
user groups is quite startling.
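One way to put a number on that difference is to compare average daily rentals per season for the two groups. The sketch below runs on made-up numbers (the `toy` data frame is purely illustrative and not taken from the dataset); with the real data, `day.filter` would take its place.

```{r casual-vs-registered-summary, eval=FALSE}
# Toy stand-in for day.filter: two seasons, three days each (values invented)
toy <- data.frame(
  season     = c(1, 1, 1, 4, 4, 4),
  casual     = c(100, 200, 300, 400, 500, 600),
  registered = c(1000, 2000, 3000, 3000, 4000, 5000)
)

# Mean daily rentals per season for each user group
avg <- aggregate(cbind(casual, registered) ~ season, data = toy, FUN = mean)

# Number of registered rentals for every casual rental
avg$ratio <- avg$registered / avg$casual
avg
```

A ratio per season keeps the comparison independent of the very different y-axis ranges in the two plots.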
This brief and simple data wrangling has helped quantify some interesting preferences in this dataset. By concentrating some of the analysis on Spring and Winter, we can try to relate it to **Southern Ontario** conditions. A next step would be to build a multivariate regression model that estimates bike demand directly and helps move active transportation (AT) planning onto a more mathematically robust footing. Or we could tailor our AT planning by checking whether users in our own cities exhibit similar preferences. The applications are endless, and this is just the tip of the iceberg, as they say!
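As a possible starting point for that next step, the sketch below fits an ordinary multiple linear regression (one response, several predictors) with base R's `lm()`. It runs on simulated data so that it is self-contained; the coefficients used in the simulation are arbitrary assumptions and say nothing about the real dataset. With the real data, the same formula could be tried on `day`.

```{r regression-sketch, eval=FALSE}
set.seed(42)
n <- 200

# Simulated stand-in for the day dataset (all values invented)
sim <- data.frame(
  atemp     = runif(n, 0.1, 0.9),
  windspeed = runif(n, 0, 0.5),
  season    = factor(sample(1:4, n, replace = TRUE))
)

# Invented relationship: demand rises with temperature and falls with wind
sim$cnt <- 1000 + 6000 * sim$atemp - 3000 * sim$windspeed + rnorm(n, sd = 500)

# Daily count explained by feeling temperature, wind speed and season
fit <- lm(cnt ~ atemp + windspeed + season, data = sim)
summary(fit)

# Predicted demand for one hypothetical day
new_day <- data.frame(
  atemp     = 0.5,
  windspeed = 0.2,
  season    = factor("1", levels = levels(sim$season))
)
predict(fit, newdata = new_day)
```

A linear model is only a first pass: the smoothed curves above suggest that usage flattens or drops at high temperatures, so a non-linear term for `atemp` may be worth testing.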