-
Notifications
You must be signed in to change notification settings - Fork 23
/
exercises_with_answers.Rmd
251 lines (191 loc) · 10.3 KB
/
exercises_with_answers.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
---
title: "R Database Exercises"
author: "Christina Maimone"
date: "`r Sys.Date()`"
output: html_document
params:
answers: TRUE
---
<!--
rmarkdown::render('exercises_with_answers.Rmd',
output_file = 'exercises.html',
params = list(answers=FALSE))
-->
```{r, echo=FALSE, eval=TRUE}
answers<-params$answers
```
```{r global_options, echo = FALSE, include = FALSE}
knitr::opts_chunk$set(echo=answers, eval=answers,
warning = FALSE, message = FALSE,
cache = FALSE, tidy = FALSE)
```
## Exercise: Selecting Random Data
One thing that isn't easy to do with SQL is selecting random rows. There are functions to generate random values, but generating a new random column on a large table, and then sorting by that column (or computing the max value and then selecting an observation) is costly. This is one scenario when working with a database from R (or another program) is useful.
Use the code below to create a table in the database. Then figure out how to select 3 random rows from that table (as if you didn't have access to the code or values that created the table). Do this without reading the entire table into R. Hint: you'll probably want to use some combination of sorting the table, limiting the number of rows you retrieve, and offsetting results (learn more [here](http://www.postgresqltutorial.com/postgresql-limit/) or [here](https://www.tutorialspoint.com/postgresql/postgresql_limit_clause.htm) if you need a refresher).
Remember to set the connection information as appropriate.
The new argument `bigint` in the connection function tells the package how to translate data from the database into R.
```{r, echo=TRUE, eval=TRUE, message=FALSE, results='hide'}
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="workshop",
bigint="numeric")
make_random_name <- function() {
paste(sample(LETTERS, 5), collapse="")
}
measure <- data.frame(id=sample(1:1000, 100),
name=replicate(100, make_random_name()))
dbWriteTable(con, "measure", measure,
row.names = FALSE,
overwrite=TRUE) # in case code is run multiple times
```
```{asis}
#### Solution
First, figure out how many rows are in the table.
```
```{r}
rowcount <- dbGetQuery(con, "select count(*) from measure")[1,1]
```
```{asis}
Then select 3 random values between 1 and the number of rows. Then for each, execute a query to get that row from the database.
```
```{r}
rownums <- sample.int(rowcount, 3) - 1
selectedrows <- NULL
for (offset in rownums) {
selectedrows <-
rbind(selectedrows,
dbGetQuery(con,
paste("select * from measure
order by id
limit 1
offset", offset)))
}
selectedrows
```
```{asis}
We used paste function above because we have control over offset -- it would be better to use a prepared query, but since we aren't getting input from a user, it's not super dangerous.
An alternative approach, which could work well if the table isn't too big, is to retrieve all of the IDs, and then randomly sample the IDs, and retrieve just those rows.
```
```{r, eval=TRUE, message=FALSE, results='hide'}
dbDisconnect(con)
```
## Exercise: `dplyr`
Connect to the dvdrental database. Repeat [Exercise: Joining and Grouping 2](https://github.com/nuitrcs/databases_workshop/blob/master/sql/part2_exercises_with_answers.md) from Part 2 using `dplyr`.
```{r, echo=TRUE, eval=TRUE}
library(dplyr)
```
```{asis}
#### Solution
Set your connection information as appropriate for the workshop:
```
```{r}
con <- dbConnect(RPostgres::Postgres(),
host="localhost", dbname="dvdrental")
custtbl <- tbl(con, "customer")
paytbl <- tbl(con, "payment")
custtbl %>%
inner_join(paytbl, by="customer_id") %>%
group_by(first_name, last_name) %>%
summarize(totalamt = sum(amount)) %>%
filter(totalamt > 150) %>%
arrange(desc(totalamt))
dbDisconnect(con)
```
## Exercise: Create and Populate Tables
Connect to a database where you have permission to create tables. (If you're doing this on your own, you could use SQLite instead of PostgreSQL and write a local database.)
Create three tables with appropriate columns:
* `person`: at least an ID and name, maybe other characteristics of a person
* `relationship`: links people together and labels it with a relationship type
* `relationship_type`: a table defining the allowed set of relationship types in the `relationship` table
Populate the tables with information about your friends and/or family. Hint: think about how you want to handle IDs for people so that you can use them in the relationship table. Hint 2: think about how to make it clear in the relationship table what the direction of the relationship is (e.g. who is the child and who is the parent?). Hint 3: you can use `dbWriteTable` to add data to an existing table -- set the `append` argument.
Once you have the tables created, select data from the database and print out sentences describing the family relationships.
Note: the process of creating database tables and populating them from R is a bit messy and doesn't have a good natural workflow. This exercise is to help you get used to doing it, but you generally want to avoid creating databases from R in most circumstances. Use R to select data from databases, not create databases (at least stable, production ones).
```{asis}
#### Solution
There are obviously multiple ways to do this. Here is one.
Set your connection information as appropriate for your own database
```
```{r}
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="workshop")
```
```{asis}
Create tables. We'll set the person id ourselves to keep track of them since we need them later. For relationship, we don't need the ids, so let the database generate them.
```
```{r, results='hide', echo=FALSE, message=FALSE, warning=FALSE, error=FALSE}
dbRemoveTable(con, "relationship", fail_if_missing = FALSE)
dbRemoveTable(con, "person", fail_if_missing = FALSE)
dbRemoveTable(con, "relationship_type", fail_if_missing = FALSE)
```
```{asis}
The code below will generate warnings, but they're ok.
```
```{r, message=FALSE, results='hide'}
dbSendQuery(con, "create table person (
id int primary key,
name text not null)")
dbSendQuery(con, "create table relationship_type (
type text primary key)")
dbSendQuery(con, "create table relationship (
id serial primary key,
subject int references person(id),
predicate int references person(id),
relationship text references relationship_type(type))")
```
```{asis}
Now populate tables. For person, we're going to define ID numbers ourselves in R. This works well when ID numbers have some meaning in the real world, but if you just want an auto generated sequence of integers, this isn't great practice. Instead, you could use a serial type, insert the names into the database, and retrieve the ID numbers back into R to use to define relationships. This would work ok for moderately sized database, but also isn't great practice.
```
```{r, message=FALSE, results='hide'}
person <- data.frame(id=1:6,
name=c('Christina','Casey','Henry',
'Jessica','Denise','Bob'),
stringsAsFactors = FALSE)
dbWriteTable(con, "person", person,
row.names=FALSE, append=TRUE)
```
```{asis}
Define relationship types.
```
```{r, message=FALSE, results='hide'}
relationship_type <- data.frame(type=c('spouse of','parent of','sibling of','child of'))
dbWriteTable(con, "relationship_type", relationship_type,
row.names=FALSE, append=TRUE)
```
```{asis}
And finally the relationships. In the database, we want to define relationships using peoples' ID numbers, not their names, since names may not be unique. The columns we'll want are ID, person 1 ID (call this person the subject), person 2 ID (call this person the predicate), and relationship type.
There isn't an easy way to create a substancial data set in R by typing, and since R data frames are created by column, instead of by row, it can get tricky to make sure that the data is lined up correctly. I'd suggest creating a CSV file with the information and reading it into R to use. Here, I'm going to use `dbSendQuery` to execute individual statements instead. I'm going to look up ID numbers using names so that the statements are more human readable, but this approach wouldn't work if I had any duplicate names. Just showing a few examples, since this is tedious.
I'm using double quotes below so that I can use single quotes in SQL queries.
```
```{r, message=FALSE, results='hide'}
dbSendQuery(con, paste0("insert into relationship (subject, predicate, relationship)
values (", person$id[person$name=="Christina"], ",",
person$id[person$name=="Casey"],
", 'spouse of');"))
dbSendQuery(con, paste0("insert into relationship (subject, predicate, relationship)
values (", person$id[person$name=="Christina"], ",",
person$id[person$name=="Henry"],
", 'parent of');"))
dbSendQuery(con, paste0("insert into relationship (subject, predicate, relationship)
values (", person$id[person$name=="Casey"], ",",
person$id[person$name=="Henry"],
", 'parent of');"))
dbSendQuery(con, paste0("insert into relationship (subject, predicate, relationship)
values (", person$id[person$name=="Christina"], ",",
person$id[person$name=="Jessica"],
", 'sibling of');"))
```
```{asis}
Look at results.
```
```{r}
relations <- dbGetQuery(con, "select a.name, b.name, relationship
from person a, person b, relationship
where a.id=subject and b.id=predicate;")
for (row in 1:nrow(relations)) {
print(paste(relations[row,1], "is the", relations[row,3], relations[row,2]))
}
```
```{asis}
Disconnect:
```
```{r, message=FALSE, results='hide'}
dbDisconnect(con)
```