-
Notifications
You must be signed in to change notification settings - Fork 17
/
03-sql.Rmd
411 lines (277 loc) · 20.1 KB
/
03-sql.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
---
title: "Data Science for Economics and Finance"
subtitle: "Lecture 3: SQL and `dbplyr`"
author:
name: Dawie van Lill ([email protected]) | [Github](https://github.com/DawievLill)
date: Lecture 3 "`r format(Sys.time(), '%d %B %Y')`"
output:
html_document:
toc: true
toc_depth: 2
toc_float: true
theme: cosmo
highlight: pygments
always_allow_html: true
urlcolor: blue
mainfont: cochineal
sansfont: Fira Sans
monofont: Fira Code
---
<style type="text/css">
body{
font-size: 14pt;
}
</style>
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
library(knitr)
opts_chunk$set(
prompt = T, ## See hook below. I basically want a "$" prompt for every bash command in this lecture.
fig.align="center", #fig.width=6, fig.height=4.5,
# out.width="748px", #out.length="520.75px",
dpi=300, #fig.path='Figs/',
cache=F#, echo=F, warning=F, message=F
)
## Next hook based on this SO answer: https://stackoverflow.com/a/39025054
knit_hooks$set(
prompt = function(before, options, envir) {
options(
prompt = if (options$engine %in% c('sh','bash')) '$ ' else 'R> ',
continue = if (options$engine %in% c('sh','bash')) '$ ' else '+ '
)
})
```
# References
The primary references for today's discussion can be found [here](https://raw.githack.com/uo-ec607/lectures/master/16-databases/16-databases.html), [here](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html), [here](https://swcarpentry.github.io/sql-novice-survey/) and [here](https://db.rstudio.com/).
Once again, most of the material that we will cover for the next few weeks is **heavily** influenced by the amazing lecture notes by Grant McDermott. You can visit his [Github page](https://github.com/uo-ec607/lectures) for more information.
# Packages
General packages for SQL and R.
```{r, cache=FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, DBI, dbplyr, RSQLite, bigrquery, hrbrthemes, glue)
```
For the world inequality database.
```{r, cache=FALSE, eval = FALSE}
install.packages("devtools")
devtools::install_github("WIDworld/wid-r-tool")
```
# Introduction to databases
The session from this week is going to be about databases and a quick introduction to the basics of the SQL programming language. I am assuming that most of you have only worked with data that can easily fit in into your computer's memory, since `R` (by default) takes all data that you import and places it in your Random Access Memory (RAM) The benefit of doing this is that RAM is very fast and this makes operations on data fast. The downside is that most of you don't have massive amounts of RAM on your computers so it easy for data manipulations to overwhelm your RAM capacity. You will most surely encounter a situation, in your not so distant future, where you will have to deal with large datasets that are outside of the scope of the memory contained in your laptop or desktop PC. If this happens you might want to look into storing the data in a **relational database**.
Relational databases (or databases) exist either on your local computer, or on some remote server (storage). Generally, in the world of "big data", we will access data remotely. This means that the data won't be found on your physical drive on the computer. It is important to note that with large datasets, we are often not concerned with the entire dataset itself, but simply a small subset of a larger dataset. What we will focus on this class is connecting with a database to retrieve the chunks of data that you need for your analysis. Getting information that we want is referred to as submitting a `query` to the database. This means that if you have access to a really large dataset that is available in a public or private database then you can submit a query without having to download the entire dataset onto your computer. A query is our way of telling the database that we want to work with a subset of the data, which we can then draw into the memory on our computer. Once again, we don't have to download the data directly on the computer in the case of remote servers, we simply connect to these servers and then once that connection is established we submit our query and the selected data is drawn into our computer memory.
# Using SQL directly
In the next session we will be working directly with a SQLite database. We will some SQL commands directly on the database. This is not a detailed guide on how to use SQL, it simply introduces some of the core concepts. There are many resources online for learning SQL. This lecture just provides a quick survey so that you have an idea of what SQL is and what it does.
## World inequality database
We are going to import some data from the WID database and then insert that into a SQLite database for us to work with. We could also work with the data in memory, but we want to showcase some of the SQL commands. We are going to use the command line to interface with the SQLite database at first and then we will use the `dbplyr` package to work directly through `RStudio`.
## `wid` package
There is a package that allows you to directly interface with the WID database. If you want to download some data from there you could use the following code as an example,
```r
R> data <- download_wid(
indicators = "tptinc", # Thresholds of pre-tax national income
areas = c("US", "ZA"), # United States and South Africa
perc = c("p10p100", "p50p100", "p90p100", "p99p100", "p99.9p100")
)
R> df <- tibble(data)
```
We will not use this functionality for this lecture though.
## CSV to SQLite
We could also access the full dataset for South Africa by downloading it from the website in CSV format. There are many ways to get data from websites, but in this case I just downloaded it directly by clicking on the appropriate link and storing it on my computer. The CSV file is in the `data` directory for this lecture, so you don't have to download it if you have cloned the repository. It is called `wid-data-za.csv`.
In this section we will show you how to utilise the command line to convert the CSV to an SQLite database. We can use the `sqlite3` command to do this as follows,
```bash
$ cd ./data
$ sqlite3 -separator ';' wid_data_za.db ".import wid-dta-za.csv wid-dta-za"
```
This has already been done for you. The SQLite database is contained GitHub repository for this course.
## Access SQLite database
In order to use the SQLite commands *interactively*, we need to enter into the SQLite console.
So, open up a terminal, and run
```bash
$ cd ./data
$ sqlite3 wid_data_za.db
```
My screen then displays the following information.
```sql
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite>
```
The SQLite command is `sqlite3` and you are telling SQLite to open up the `wid_table_za.db`. You need to specify the `.db` file, otherwise `SQLite` will open up a temporary, empty database.
To get out of `SQLite`, type out `.exit` or `.quit`. For some terminals, `Ctrl-D` can also work. If you forget any SQLite `.` (dot) command, type `.help`.
> **Note**: If you want nice syntax highlighting in the terminal then you can use the LiteCLI package. I will be showing this in the lecture.
## Selecting data
Remember that for this section things are interactive, so we will not get output in the notebook. We need to type the code into our terminal. The first thing we would like to do is see the tables in the database. In our case there is only going to be one table.
```sql
sqlite> .tables
```
We can also see what the structure of the table looks like by using the `schema` command.
```sql
sqlite> .schema
```
We can select all of the columns in a table using `*`. However, I would not recommend doing this, since it will show all the columns on the screen. If you have a massive dataset this will show everything.
```sql
sqlite> SELECT *
FROM "wid-dta-za";
```
You can also select certain columns by replacing the `*` with the column name.
```sql
sqlite> SELECT year
FROM "wid-dta-za";
```
Once again, we don't want to display everything on screen, so don't run these commands.
Sometimes we want to look at unique values in a column, then we can use the `DISTINCT` command.
```sql
sqlite> SELECT DISTINCT year
FROM "wid-dta-za";
```
In this case you will see that the only country in our dataset is South Africa, given by `ZA`. We can also use the `DISTINCT` keyword on multiple columns,
```sql
sqlite> SELECT DISTINCT country, year
FROM "wid-dta-za";
```
Another thing that we can do is order all the variables by `year`.
```sql
sqlite> SELECT *
FROM "wid-dta-za"
ORDER by year;
```
We can do this in either `ASC` or `DESC` order.
```sql
sqlite> SELECT *
FROM "wid-dta-za"
ORDER by year DESC;
```
## Filtering
One of the most important features of a database is the ability to filter data. We have encountered this in `R` with the `filter` command. In the case of SQL we use the `WHERE` clause in a query,
```sql
sqlite> SELECT *
FROM "wid-dta-za"
WHERE year <= '1955';
```
If we want to find out what the values in the database after the second world war AND where `GDP` was above 16.5 billion rand per year, we can use the following,
```sql
sqlite> SELECT *
FROM "wid-dta-za"
WHERE gross_domest_product > '16500' AND year >= '1945';
```
Are there any interesting patterns that you can observe here?
## New values
We can also calculate new values from existing values. Perhaps we realise that our GDP values after the year 2000 need to be corrected by 5% due to some measurement error. Rather than modifying the stored data we can do this calculation on the fly as follows.
```sql
sqlite> SELECT gross_domest_product, 1.05 * gross_domest_product
FROM "wid-dta-za"
WHERE year > '2000';
```
We can rename the new field as follows,
```sql
sqlite> SELECT gross_domest_product as gdp, 1.05 * gross_domest_product as new_gdp
FROM "wid-dta-za"
WHERE year > '2000';
```
## Missing values
Databases use a special value called `NULL` to represent missing information. Almost all operations on `NULL` produce `NULL`. Queries can test for `NULL`s using `IS NULL` and `IS NOT NULL`. Below is an example where we check for a `NULL` value:
```sql
sqlite> SELECT year
FROM "wid-dta-za"
WHERE one_perc_share_income IS NULL;
```
This reveals the years where there is a `NULL` value. These values can cause headaches wherever they appear. For example, suppose we want to the years at which the one percent share of income is greater than 0.2 and percentage share of wealth is greater than 0.5. Then we could do the following,
```sql
sqlite> SELECT year
FROM "wid-dta-za"
WHERE one_perc_share_income > 0.2 AND one_perc_share_wealth > 0.5;
```
This reveals that this occurs on three different occasions, 2007, 2020 and 2021. However, if we specified that the country should also be South Africa, we get the following,
```sql
sqlite> SELECT year
FROM "wid-dta-za"
WHERE one_perc_share_income > 0.2 AND one_perc_share_wealth > 0.5 AND country = 'ZA';
```
The result is that we lose the value for 2021, since there is no country value allocated for 2021. This might simply be a mistake in the data, but it prevents us from seeing an obvious instance where there is a high level of inequality.
## Aggregation
In SQL we can aggregation functions such as `min` or `max`. Each of these functions takes a set of records as input, and produces a single record as output:
```sql
sqlite> SELECT max(gross_domest_product)
FROM "wid-dta-za";
```
`min` and `max` are just two of the aggregation functions built into SQL. Three others are `avg`, `count`, and `sum`:
```sql
sqlite> SELECT avg(gross_domest_product)
FROM "wid-dta-za";
```
Interestingly, when there are no values to aggregate — for example, where there are no rows satisfying the `WHERE` clause — aggregation’s result is “don’t know” rather than zero or some other arbitrary value:
```sql
sqlite> SELECT min(gross_domest_product)
FROM "wid-dta-za"
WHERE country = "USA";
```
We have covered some of the main commands that you will encounter in SQL. However, this has truly just been an introduction to get you started. We haven't even dealt with a fundamental concept such as `JOIN`, so there is much more to learn. The reason that we aren't exploring every dimension of SQL is that the `R` interface with SQL often means that you won't need to write much SQL code to execute your queries. We will see this below.
# Databases and R
After some experience with SQL you will agree that it is a bit of an archaic language that is much less intuitive than the `R` tools that we have using thus far in the course. There is some good news: Even if you aren't comfortable with SQL you already have all the programming skills you need to start working with databases. This is because the `tidyverse` — through `dplyr` — allows for direct communication with databases from your local `R` environment.
This means that you can interact with the vast datasets that are stored in relational databases using the same `tidyverse` verbs and syntax that we already know. All of this is possible thanks to the `dbplyr` package, which provides a database backend to `dplyr`. What’s happening even further behind the scenes is that, upon installation, `dbplyr` suggests the `DBI` package as a dependency. `DBI` provides a common interface that allows `dplyr` to work with many different databases using exactly the same code. You don’t even need to leave your `RStudio` session or learn SQL!
> Aside: You will probably want to learn SQL eventually! Luckily, `dplyr` and `dbplyr` come with several features that can really help to speed up the learning and translation process. We’ll get to these later in the lecture.
While `DBI` is automatically bundled with `dbplyr`, you’ll need to install a specific backend package for the type of database that you want to connect to. There are many backends to use. We have worked with SQLite since it is quite lightweight and installed on most operating systems already. We will talk later about `bigrquery` which connects to Google BigQuery when we have our session on cloud computing and big data.
Our goal for this section is to create a makeshift database on our local computers — using the excellent SQLite backend — and then connect to it from `R`. I’ll use this to demonstrate the ease with which we can execute queries from `R`, as well as underscore some principles for working with databases in general. The lessons that we learn here will carry over to more complicated cases and much larger datasets.
# Local database
Start by opening an (empty) database connection via the `DBI::dbConnect()` function, which we’ll call `con`.
```{r}
library(DBI) ## Already loaded
con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dbListTables(con)
```
The arguments to `DBI::dbConnect()` vary from database to database. However, the first argument is always the database backend, i.e. `RSQLite::SQLite()` in this case since we’re using SQLite. Again, while this differs depending on the database type that you’re connecting with, SQLite only needs one other argument: the path to the database. Here we use the special string, “:memory:”, which causes SQLite to make a temporary in-memory database. We’ll explore more complicated connections later on that will involve things like password prompts for remote databases.
Our makeshift database connection con is currently empty. So let’s copy across the world inequality dataset that we worked with before. There are a couple of ways to do this, but here I’ll use the `dplyr::copy_to()` convenience function. Note that we are specifying the table name (“wid_za”) that will exist within this database.
```{r}
wid_za <- read.csv(file = 'data/wid-dta-za.csv', sep = ',')
copy_to(con, wid_za)
```
Now that we’ve copied over the data, we can reference it from `R` via the `dplyr::tbl()` function. This will allow us to treat it as a normal data frame that be manipulated with `dplyr` commands.
```{r}
wid_za_tbl <- tbl(con, "wid_za")
head(wid_za_tbl)
```
It worked! Everything looks pretty good, although you may notice something slightly strange about the output. We’ll get to that in a minute.
## Generating queries
One of the best things about `dplyr` is that it automatically translates tidyverse-style code into SQL for you. In fact, many of the key `dplyr` verbs are based on SQL equivalents. Let’s try out a few queries using the typical `dplyr` syntax that we already know.
```{r}
## Select some columns
wid_za_tbl %>% select(year, one_perc_share_income, gross_domest_product)
```
```{r}
## Get the mean GDP
wid_za_tbl %>%
summarise(mean_gdp = mean(gross_domest_product), na.rm = TRUE)
```
## Laziness as a virtue
The modus operandi of `dplyr` is to be as lazy as possible. What this means in practice is that your `R` code is translated into SQL and executed in the database, not in `R`. This is a good thing, since:
1. It never pulls data into `R` unless you explicitly ask for it.
2. It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.
For example, consider an example where we are interested in values in the database after the second world war and where `GDP` was above 16.5 billion rand per year.
```{r}
gdp_post_ww2_db <- wid_za_tbl %>%
select(year, gross_domest_product) %>%
filter(year > 1945 & gross_domest_product > 16500)
```
Surprisingly, this sequence of operations never touches the database. It’s not until you actually ask for the data that `dplyr` generates the SQL and requests the results from the database. Even then it tries to do as little work as possible and only pulls down a few rows.
```{r}
gdp_post_ww2_db
```
## Collecting data
Typically, you’ll iterate a few times before you figure out what data you need from the database. Once you’ve figured it out, use `collect()` to pull all the data into a local data frame. I’m going to assign this collected data frame to a new object (), but only because I want to keep the queried data base object (`gdp_post_ww2`) separate for demonstrating some SQL translation principles in the next section.
```{r}
gdp_post_ww2 <- gdp_post_ww2_db %>%
collect()
gdp_post_ww2
```
Super. We have successfully pulled the queried database into our local `R` environment as a data frame. You can now proceed to use it in exactly the same way as you would any other data frame.
## Using `show_query`
Behind the scenes, `dplyr` is translating your `R` code into SQL. You can use the `show_query()` function to display the SQL code that was used to generate a queried table.
```{r}
gdp_post_ww2_db %>% show_query()
```
Note that the SQL call is much less appealing/intuitive than our piped `dplyr` code. In part, this is an artefact of the translation steps involved. The `dplyr` translation engine includes various safeguards that are designed to ensure that the resulting SQL code works. But this comes at the expense of code concision (e.g. those repeated `SELECT` commands at the top of the SQL string are redundant). However, it also reflects the simple fact that SQL is not an elegant language to work with. In particular, SQL imposes a lexical order of operations that doesn’t necessarily preserve the logical order of operations. This lexical ordering is also known as “order of execution” and is strict in the sense that every SQL query must follow the same hierarchy of commands.
I don’t really want to get into all of this now. But I do want to make you aware of the fact that SQL queries are not written in a way that you would think about them logically. Still, while it can take a while to wrap your head around, the good news is that SQL’s lexical ordering is certainly learnable.
Now, at this point, you may be wondering: Do we even need to learn SQL, given that it’s a pain and that the `dplyr` translation works so well?
That’s a fair question. The short answer is that, “yes”, at some point you will probably find yourself needing to write raw SQL code. Luckily, writing and submitting SQL queries directly from `R` and RStudio is easily done, thanks to the `DBI` package.
```{r}
dbDisconnect(con)
```
In the section on cloud computing we might have some time to discuss the ideas behind BigQuery, which is like SQL in the cloud.