---
title: 'R: Working with Databases'
author: "Christina Maimone"
date: '`r Sys.Date()`'
output:
  html_document:
    toc: yes
    toc_depth: '3'
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Overall Note
R is generally better suited to selecting data from databases than to creating database tables or entering data into a database, but there are functions for all of these operations.
# Connection
The `DBI` package has the core functionality for connecting R to database servers. Other packages then implement that core functionality for each specific database system; for PostgreSQL, one such package is `RPostgres`.
```{r, eval=FALSE}
if(!'RPostgres' %in% installed.packages()){
install.packages("RPostgres")
}
```
```{r}
library(RPostgres)
```
We connect with a function call like the following.
Note: this code was generated on my local machine connected to a local copy of the database. Your connection details will be different. Note I also have permissions to modify this database.
```{r}
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```
We will need a connection like this for any of the methods of connecting below. You can have multiple connection objects to different databases -- just call the variables something different.
Note that the above example doesn't have a username or password because the database is local and doesn't require one. If you're using a remote database, you don't want to hard-code your credentials into your code. There are ways to set environment variables or, for PostgreSQL specifically, to use a .pgpass file to store this information. But RStudio also gives you the option to pop up a box to enter the information:
```{r, eval=FALSE}
con2 <- dbConnect(RPostgres::Postgres(),
host="localhost",
dbname="dvdrental",
user=rstudioapi::askForPassword("Database username"),
password=rstudioapi::askForPassword("Database password"))
```
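Another option, sketched below, is to keep credentials in environment variables -- for example, set in a `.Renviron` file that isn't committed to version control -- and read them with `Sys.getenv()`. The variable names `DVDRENTAL_USER` and `DVDRENTAL_PASS` here are just examples:

```{r, eval=FALSE}
# In ~/.Renviron (one variable per line, no quotes):
# DVDRENTAL_USER=myusername
# DVDRENTAL_PASS=mysecretpassword
con3 <- dbConnect(RPostgres::Postgres(),
                  host="localhost",
                  dbname="dvdrental",
                  user=Sys.getenv("DVDRENTAL_USER"),
                  password=Sys.getenv("DVDRENTAL_PASS"))
```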
# Using DBI
We can use the basic functions in the DBI library:
## Get Database Information
Note that the following `db`-prefixed functions are exported from the DBI and RPostgres namespaces.
```{r}
dbListTables(con)
dbListFields(con, "actor")
```
## Execute Queries
```{r}
actor_subset <- dbGetQuery(con, "select * from actor where actor_id > 50")
head(actor_subset)
```
Note that we don't need a `;` at the end of the query.
If we want an entire table, there's a function for that:
```{r}
actor <- dbReadTable(con, "actor")
head(actor)
```
If you want part of your query to be determined by a variable -- especially one supplied as input by a user (not you) -- you should guard against SQL injection (someone trying to attack your database by tricking your code into running malicious SQL statements) by using a parameterized query:
```{r}
# YES
myquery <- dbSendQuery(con, "select * from actor where actor_id = $1")
dbBind(myquery, list(4))
dbFetch(myquery)
```
The `$1` stands in for a value you'll substitute in. For multiple parameters, use `$2`, `$3`, etc. (The `$1` notation is for PostgreSQL -- other types of databases use `?` or other symbols.)
When you're done with the results from a prepared query, clear the result:
```{r}
dbClearResult(myquery)
```
```{r, eval=FALSE}
# NO! - at least not if the variable can be manipulated by a user
dbSendQuery(con, paste0("select * from actor where actor_id=", myvar))
```
For more, see http://db.rstudio.com/best-practices/run-queries-safely/.
Parameterizing statements also lets you reuse them with different values.
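For example, a statement with two placeholders can be bound repeatedly with different values (a sketch using the `actor` table from above; the id ranges are arbitrary):

```{r, eval=FALSE}
myquery <- dbSendQuery(con, "select * from actor where actor_id > $1 and actor_id < $2")
dbBind(myquery, list(50, 55))    # first set of values
dbFetch(myquery)
dbBind(myquery, list(100, 105))  # reuse the same statement with new values
dbFetch(myquery)
dbClearResult(myquery)
```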
You may also need to use `dbClearResult()` after other calls to `dbSendQuery()` that return a result (select statements, table creations, inserts, updates) -- but not to `dbGetQuery()`. Otherwise you may get some warning messages like:
```{asis}
Warning message:
In result_create(conn@ptr, statement) :
Closing open result set, cancelling previous query
```
These warnings are harmless, but they can get annoying.
## Modifying a Database
If you're not a superuser on the `dvdrental` database, connect instead to a database you can modify. Then the basic function is `dbSendQuery` for any command you want to execute where you aren't retrieving results.
Note that by default, statements take effect immediately -- they are not in a transaction that you need to commit. To use transactions, see below.
```{r, eval=FALSE}
res <- dbSendQuery(con, statement="update actor set first_name='Jenn' where actor_id=4")
print(res) # contains info on result of update
dbClearResult(res) # prevent warning messages
```
To create a table, you can give `dbWriteTable` a data frame:
```{r, eval=FALSE}
mytbl <- data.frame(number=1:10, letter=LETTERS[1:10])
dbWriteTable(con, "mynewtable", mytbl)
```
or you could specify the table with SQL and execute it with `dbSendQuery`, but this can get cumbersome.
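`dbWriteTable` can also add rows to an existing table via its `append` argument (a sketch, reusing the `mytbl` example above):

```{r, eval=FALSE}
morerows <- data.frame(number=11:12, letter=LETTERS[11:12])
dbWriteTable(con, "mynewtable", morerows, append=TRUE)
```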
To remove a table:
```{r, eval=FALSE}
dbRemoveTable(con, "mynewtable")
```
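`dbRemoveTable` will error if the table doesn't exist, so you may want to check first with `dbExistsTable`:

```{r, eval=FALSE}
if (dbExistsTable(con, "mynewtable")) {
  dbRemoveTable(con, "mynewtable")
}
```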
## Transactions
There are also methods for managing transactions if you need: `dbBegin`, `dbRollback`, `dbCommit`. Transactions are key for when you need to be sure that a sequence of SQL commands (e.g. `UPDATE`, `CREATE`, `DROP`, `DELETE`, etc.) execute correctly before they're made permanent (i.e. "committed").
```{r, eval=FALSE}
dbBegin(con)
dbWriteTable(con, "mynewtable", mytbl)
dbRollback(con)
dbGetQuery(con, "SELECT * FROM mynewtable")
```
The above will produce an error:
```
Error in result_create(conn@ptr, statement) :
Failed to prepare query: ERROR: relation "mynewtable" does not exist
LINE 1: SELECT * FROM mynewtable
```
because the transaction was rolled back, not committed.
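If the commands had succeeded and you wanted to keep the changes, you would call `dbCommit(con)` instead of `dbRollback(con)`. DBI also provides `dbWithTransaction`, which wraps a block of code in a transaction for you: it commits if the code finishes cleanly and rolls back if it throws an error. A sketch:

```{r, eval=FALSE}
dbWithTransaction(con, {
  dbWriteTable(con, "mynewtable", mytbl)
  dbExecute(con, "update mynewtable set letter='Z' where number=1")
})
```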
## Close Connection
Connections will get closed when you quit R, but it's good practice to explicitly close them.
```{r}
dbDisconnect(con)
```
# Use `dplyr`
For more complete info, see the [RStudio databases site](http://db.rstudio.com/dplyr/).
```{r, eval=FALSE}
needToInstall <- c("tidyverse")
needToInstall <- needToInstall[which(!needToInstall %in% installed.packages())]
if(length(needToInstall) > 0){
sapply(needToInstall, install.packages)
}
```
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
```
First, connect like normal:
```{r, echo=TRUE}
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```
Get a reference to a table:
```{r, message=FALSE, warning=FALSE}
actortbl <- tbl(con, "actor")
```
If we look at this object, it doesn't have data in it:
```{r}
str(actortbl)
```
It just has connection information. `dplyr` will try to perform operations within the database where it can, instead of pulling all of the data into R.
Yet you can print the object and see observations:
```{r, echo=TRUE}
actortbl
```
It retrieves them as needed, and also gives you a nice display in notebooks (a special kind of RMarkdown file) -- output looks a little different in the console and in RMarkdown files like this.
You can use `dplyr` verbs to work with the table objects from the database, as if they were data frames (or tibbles).
```{r, echo=TRUE}
actortbl %>%
select(actor_id, first_name, last_name) %>%
filter(actor_id > 150)
```
The above generates and executes the SQL needed to get the result. It turns `filter` into a select statement with the appropriate where clause.
```{r, echo=TRUE}
rentaltbl <- tbl(con, "rental")
rentaltbl %>%
group_by(customer_id) %>%
summarize(count=n())
```
What does the above correspond to as a SQL query? `select customer_id, count(*) from rental group by customer_id;`
```{r, echo=TRUE}
rentaltbl %>%
group_by(customer_id) %>%
summarize(count=n()) %>%
show_query()
```
You can use `collect` to pull down all of the data (tell `dplyr` to stop being lazy).
```{r, echo=TRUE}
# First, without collecting
df1 <- rentaltbl %>%
group_by(customer_id) %>%
summarize(count=n())
df1
```
Looks OK, except:
```{r, eval=FALSE}
df1[1,]
```
Gives you:
`Error in df1[1, ] : incorrect number of dimensions`
It's the wrong dimensions because `df1` isn't actually a data.frame:
```{r}
str(df1)
```
It is telling us we need to collect the data first to actually pull it into R.
```{r, echo=TRUE}
# Then with collecting
df2 <- rentaltbl %>%
group_by(customer_id) %>%
summarize(count=n()) %>%
collect()
df2
```
```{r}
df2[1,]
```
You can also use `dplyr`'s commands to join:
```{r, echo=TRUE}
custtbl <- tbl(con, "customer")
addrtbl <- tbl(con, "address")
custtbl %>%
inner_join(addrtbl, by="address_id") %>%
filter(postal_code == '52137') %>%
select(first_name, last_name, postal_code)
```
You could create a table with `copy_to` (if you have the correct permissions):
```{r, echo=TRUE, eval=FALSE}
mytbl <- data.frame(number=1:10, letter=LETTERS[1:10])
copy_to(con, mytbl, "mynewtable")
```
By default, it creates a **temporary** table. But this is a setting you can change, and you can also specify what columns to index on the table.
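For example, `temporary=FALSE` makes the table permanent, and `indexes` specifies columns to index (a sketch reusing `mytbl` from above):

```{r, eval=FALSE}
copy_to(con, mytbl, "mynewtable",
        temporary = FALSE,
        indexes = list("number"))
```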
Disconnect like we normally do:
```{r, eval=FALSE}
dbDisconnect(con)
```
# R Markdown
R Markdown lets you execute SQL queries directly. You first set up a `DBI` connection like above, and then, instead of having R chunks of code, you can have SQL chunks of code:
````r
`r ''````{r}
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```
````
````sql
`r ''````{sql, connection=con}
select * from actor where actor_id > 75;
```
````
````r
`r ''````{r}
dbDisconnect(con)
```
````
Here is the above, actually executed in RMarkdown:
```{r, eval=FALSE}
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost", dbname="dvdrental")
```
```{sql, connection=con}
select * from actor where actor_id > 75;
```
```{r}
dbDisconnect(con)
```
For more details, see [knitr Language Engines: SQL](http://rmarkdown.rstudio.com/authoring_knitr_engines.html#sql).
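SQL chunks can also save their results into an R object via the `output.var` chunk option, so later R chunks can work with the query results (the name `actors75` here is just an example):

````sql
`r ''````{sql, connection=con, output.var="actors75"}
select * from actor where actor_id > 75;
```
````

A subsequent R chunk could then work with `actors75` as a regular data frame, e.g. `head(actors75)`.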
# PL/R
Database administrators can install functionality in a PostgreSQL database that allows you to write R functions directly in the database and then call them with normal SQL queries. This is done with [PL/R](https://github.com/postgres-plr/plr). Enabling this functionality can be risky, because R potentially gives users access to files on the database server, so database admins are usually conservative about allowing PL/R on a system. But it can be very useful in production systems: you can use it to generate reports, compute statistics, and even create plots.
We aren't covering PL/R (or even writing SQL functions more generally), but it's good to know this functionality exists if you're ever working with a large production system.