# Web scraping
Imagine: you are invited to your significant other's parents' place for dinner. You know that their father is a big fan of Swiss wine, and you can exploit this fact to make a good impression. You are not a wine expert, but you can "show off" your data science skills and build a story around Chasselas grapes.^[[Chasselas or Chasselas blanc is a wine grape variety grown mostly in Switzerland.](https://en.wikipedia.org/wiki/Chasselas)] After intense googling you realize that the desired dataset simply does not exist, and therefore you have to collect it yourself. You find Vivino, a nice app that is a good representative of recent trends in wine. But how do you get the data?
Very often (if not always), the well-structured data you need for analysis do not exist, and you have to collect them yourself. Nowadays, the vast majority of data is user-generated content presented in *unstructured* HTML format. And that is not the worst part: the data are typically dynamic and change over time (e.g., the rating of a particular wine or your Facebook friends list). The good news is that all this information is located on the Internet in a human-readable format, that is, you already have access to the data. The problem, therefore, is not accessing the data, but converting this information into a structured (think tabular or spreadsheet-like) format. In this chapter, we will learn what web scraping is, how to scrape using `R`, and when it is legal.
## Web scraping overview
Web scraping is the process of collecting data from the World Wide Web and transforming it into a structured format. Typically, web scraping refers to an automated procedure, even though formally it also includes manual human scraping. We distinguish several web scraping techniques:
- Human manual copy-and-paste
The name is self-explanatory: a human copies the content of the page and pastes it into a well-formatted data structure. This method is very slow, but sometimes it is the only available solution due to legal issues. For instance, one can copy all prices and names from the Vivino list directly into a data frame.
- Text pattern matching
This approach is the next step in web scraping. After fetching the page, regular expressions (regex) are used to match patterns in the text. For example, you can extract every substring that starts with "Price:" and ends with "CHF" (see the sketch after this list).
- Using API/SDK (socket programming)
Web APIs are interfaces introduced by large websites to allow getting the content of web pages (think of databases) directly by sending HTTP requests. For example, Twitter has an API for searching tweets, account activities, etc., so that anyone can get these data by sending an HTTP request. Further, `R` has a package called [`twitteR`](http://geoffjentry.hexdump.org/twitteR.pdf) that generates those requests for you (i.e., it can be considered a Software Development Kit, or SDK for short).
- HTML parsing
Most web pages are generated dynamically from databases using similar templates and CSS selectors. Identifying these CSS selectors allows you to mimic the structure of the underlying database tables.
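As promised above, here is a minimal sketch of the text pattern matching technique in base `R`; the character vector `page_text` is a hypothetical example of raw text obtained after fetching a page:
```{r, eval = FALSE}
# hypothetical raw text obtained after fetching a page
page_text <- c(
  "Great Chasselas. Price: 25 CHF per bottle.",
  "No price information available."
)
# extract every substring that starts with "Price:" and ends with "CHF";
# only the first element matches, so the result is "Price: 25 CHF"
regmatches(
  page_text,
  regexpr("Price:.*?CHF", page_text, perl = TRUE)
)
```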
The focus of this chapter is on HTML parsing, and by the end of it you should be able to scrape data using `R`. But there is one more thing we should mention before getting to the nitty-gritty details of scraping.
## Denial-of-service (DoS), crawl rate, and robots.txt
Web pages are stored on servers, so if there are too many requests per unit of time, a server can become unavailable (i.e., shut down). When a server is not accessible to its intended users, this is called denial-of-service (DoS).^[[What is a denial-of-service attack (DoS)?](https://www.paloaltonetworks.com/cyberpedia/what-is-a-denial-of-service-attack-dos)] Often this property is exploited in DoS attacks, in which websites are flooded with traffic (e.g., buffer overflow attacks) and stop responding. However, it can also happen unintentionally (e.g., the launch of a new exciting website that was developed to handle only a limited number of users).
Another concept we need to introduce is *web crawling*, which has been used extensively by search engines (such as Google, Bing, etc.) for web indexing. This process is similar to web scraping, but after fetching the content of a web page, the crawler extracts only hyperlinks. This means that web crawling also accesses web pages by sending requests. The number of such requests per unit of time is called the crawl rate, and each crawler request is the same as a request from a regular user. Therefore, if your crawler sends too many requests (i.e., the crawl rate is too high), the server can experience a DoS.
That is exactly what happened to Martijn Koster's server when Charles Stross's web crawler caused a DoS (if the rumors are to be trusted). This event inspired Martijn Koster to propose a new standard: a file, `robots.txt`, that stores a set of rules defining how bots and crawlers should interact with a website, namely which parts should not be accessed, the allowed crawl rate, etc.
The same applies to web scraping: the number of requests should not be too high, and you should access only the parts that are not disallowed in *robots.txt*.
## Why could web scraping be bad?
Over the years, the reputation of web scraping has deteriorated significantly. More and more people ignore and violate not only *robots.txt*, but also the Terms of Service (ToS) of websites.^[[Web Scraping and Crawling Are Perfectly Legal, Right?](https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/)] However, that is not the only negative side of web scraping.
Remember the film "The Social Network"? At the beginning of the movie, the main character scrapes pictures of female students and then creates a service that ranks students by their attractiveness. This is an example of an unethical use of web scraping. The topic is very subjective and controversial; therefore, it is your own responsibility to decide what is ethical and what is not (there is no `robots.txt` to define ethics).
## How to avoid trouble?
The rules are simple:
1. Carefully read the Terms of Service of the web page you want to scrape.^[[Six Compelling Facts about Legality of Web Scraping](http://www.prowebscraper.com/blog/six-compelling-facts-about-legality-of-web-scraping/)]
2. Inspect `robots.txt` to identify which parts can be accessed.
3. Do not be too aggressive: keep the frequency of your requests low. If the crawl rate is not mentioned in `robots.txt` (the `Crawl-delay:` directive), then use something reasonable, e.g., one request per second, as illustrated in the sketch below.^[[How to prevent getting blacklisted while scraping](https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/)]
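A minimal sketch of such pacing is given below; the vector of URLs is purely hypothetical, and the actual fetching is replaced by a placeholder `message()`:
```{r, eval = FALSE}
# hypothetical list of pages you are allowed to scrape
urls <- c(
  "https://www.immoscout24.ch/en/real-estate/rent/city-basel",
  "https://www.immoscout24.ch/en/real-estate/rent/city-geneva"
)
for (url in urls) {
  message("Fetching ", url)  # replace with the actual fetching code
  Sys.sleep(1)               # pause so that there is at most ~one request per second
}
```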
## Workflow
This workflow is based on two famous works by @scrap and @sg with a couple of twists.
1. Make sure that you have no other choice than to scrape. That is, if the web resource has an API, then use it instead of scraping.
2. Make sure that scraping is legal and ethical for the given web page. You can inspect a particular subdomain (path) using `paths_allowed()` from the `robotstxt` package:^[[Scraping Responsibly with R](https://www.r-bloggers.com/scraping-responsibly-with-r/)]
```{r, eval = FALSE}
library("robotstxt")
# returns TRUE if bots are allowed to access this path
paths_allowed(
  paths = "/en/real-estate/rent/city-basel",
  domain = "https://www.immoscout24.ch/"
)
```
or manually go through the `robots.txt` of the domain (using the function `get_robotstxt()`):
```{r, eval = FALSE}
# print the raw robots.txt of the domain
get_robotstxt(domain = "https://www.immoscout24.ch/")
```
Sometimes `R` can fail to load this file, and it is a good idea to double-check by opening the `robots.txt` link in your browser (e.g., manually inspecting [https://www.immoscout24.ch/robots.txt](https://www.immoscout24.ch/robots.txt)).
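The same package can also parse the file into an object whose fields, such as the crawl delay and the allow/disallow rules, can be inspected directly; a brief sketch (field names as documented in `robotstxt` at the time of writing):
```{r, eval = FALSE}
library("robotstxt")
rtxt <- robotstxt(domain = "https://www.immoscout24.ch/")
rtxt$crawl_delay   # crawl-delay rules declared by the site, if any
rtxt$permissions   # Allow/Disallow rules per user agent
```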
3. Load (fetch) and parse the HTML file using the `read_html()` function from the `xml2` package:
```{r, eval = FALSE}
library("xml2")
# fetch the page and parse it into an HTML document
real_estate <- read_html(
  "https://www.immoscout24.ch/en/real-estate/rent/city-basel"
)
```
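It is worth a quick check of what `read_html()` returned; printing the object gives a short preview of the parsed page:
```{r, eval = FALSE}
class(real_estate)  # "xml_document" "xml_node"
real_estate         # a short preview of the parsed page
```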
4. Using [selectorgadget](https://selectorgadget.com), identify the CSS selector of the element of interest.
Selectorgadget is a tool for your browser (a JavaScript bookmarklet) that uses the idea of reverse engineering to determine which CSS selector contains the data you want to extract. Selectorgadget is like a kind librarian who tells you in which section to look for a particular book.
First off, we need to install it. Depending on your browser, you can install the Google Chrome [extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) or drag the link from the [https://selectorgadget.com page](https://selectorgadget.com) to your Bookmarks (shown in the picture below).
<!-- <img src="/images/sg_install.png" alt="map" width="400px"/> -->
```{r, echo = FALSE}
knitr::include_graphics("images/sg_install.png")
```
Once it has been installed, it is time to use it:
- Open the web page you want to scrape in your browser (e.g., [https://www.immoscout24.ch/en/real-estate/rent/city-basel](https://www.immoscout24.ch/en/real-estate/rent/city-basel))
- Run selectorgadget. If you use Safari (or a similar browser), click on it in your Bookmarks:
```{r, echo = FALSE}
knitr::include_graphics("images/sg_safari.png")
```
Or click on the icon of the selectorgadget extension in Chrome:
```{r, echo = FALSE}
knitr::include_graphics("images/sg_chrome.png")
```
- Click on the element you want to select. The element you have selected will be highlighted in green, and all other elements with the same CSS selector will be highlighted in yellow. If there are redundant elements highlighted in yellow, simply click on them, and they will turn red.
```{r, echo = FALSE}
knitr::include_graphics("images/sg_select.png")
```
- Copy the CSS selector from the text box to `R`.
```{r, echo = FALSE}
knitr::include_graphics("images/sg_textbox.png")
```
5. Extract the data from the web page using `html_nodes()` and `html_text()` from `rvest`:
```{r, eval = FALSE}
library("rvest")
library("magrittr")
# extract the text of every node that matches the CSS selector found above
flats <- real_estate %>%
  html_nodes(".chqpnT") %>%
  html_text()
```
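Note that a class name such as `.chqpnT` is generated by the site and may change over time, so it is worth checking that the selector actually matched something before moving on:
```{r, eval = FALSE}
length(flats)   # number of listings matched by the selector
head(flats, 3)  # the first few raw strings before any cleaning
```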
6. (Optional) Clean and tidy the extracted data. For instance, in the following code chunk we use regex to get the number of rooms and the prices:
```{r, eval = FALSE}
flats_df <- data.frame(
  # keep everything before " room" and convert it to a number
  rooms = gsub(pattern = " room.*", "", flats) %>%
    as.numeric(),
  # strip the text around the CHF amount and drop the thousands separator
  price = gsub(".*CHF |.—.*", "", flats) %>%
    gsub(pattern = ",", replacement = "") %>%
    as.numeric()
)
```
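As a quick, purely illustrative sanity check of the cleaned data (using the column names from the chunk above), one can inspect the structure of the data frame and compute a simple summary:
```{r, eval = FALSE}
str(flats_df)  # two numeric columns: rooms and price
# e.g., the average price per room across the matched listings
mean(flats_df$price / flats_df$rooms, na.rm = TRUE)
```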
In conclusion, web scraping is a very powerful tool for extracting data. However, it comes with a certain responsibility.