html_table: check column names? #116

r2evans · 2015-10-30T20:49:08Z

Hadley,

Thanks for the tools!

In html_table(), I find with some sites that using the defined header names can produce bad column names (including empty strings). What are your thoughts about including an option to run the column names through make.names?

Currently having to do:

library(rvest)
fix.names <- function(x, ...) {
    colnames(x) <- make.names(colnames(x), ...)
    x
}
HTML <- read_html("...")
HTML %>%
    html_node('table') %>%
    html_table() %>%
    fix.names()

Perhaps:

HTML %>%
    html_node('table') %>%
    html_table(check.names=TRUE)

Something like:

diff --git a/R/table.R b/R/table.R
index 8b98caa..73f216c 100644
--- a/R/table.R
+++ b/R/table.R
@@ -15,6 +15,8 @@
 #' @param fill If \code{TRUE}, automatically fill rows with fewer than
 #'   the maximum number of columns with \code{NA}s.
 #' @param dec The character used as decimal mark.
+#' @param check.names If \code{TRUE} and \code{header == TRUE}, verify
+#'   column names with \code{make.names}.
 #' @export
 #' @examples
 #' tdist <- read_html("http://en.wikipedia.org/wiki/Student%27s_t-distribution")
@@ -52,7 +54,7 @@ html_table.xml_nodeset <- function(x, header = NA, trim = TRUE, fill = FALSE,

 #' @export
 html_table.xml_node <- function(x, header = NA, trim = TRUE,
-                                              fill = FALSE, dec = ".") {
+                                fill = FALSE, dec = ".", check.names = FALSE) {

   stopifnot(html_name(x) == "table")

@@ -93,6 +95,9 @@ html_table.xml_node <- function(x, header = NA, trim = TRUE,
   }
   if (header) {
     col_names <- out[1, , drop = FALSE]
+    if (check.names) {
+      col_names <- make.names(col_names)
+    }
     out <- out[-1, , drop = FALSE]
   } else {
     col_names <- paste0("X", seq_len(ncol(out)))

I found no previous discussion of make.names, my apologies if this is a repeat.

Currently using R 3.2.2, rvest_0.3.0, win10_64.

The text was updated successfully, but these errors were encountered:

r2evans · 2015-10-31T06:19:17Z

Reproducible example:

library(dplyr)
library(rvest)
HTML <- read_html('<table><tr><th>Var1</th><th> </th></tr><tr><td>1</td><td>2</td></table>')
HTML %>%
    html_node('table') %>%
    html_table() %>%
    tbl_df()
## Error: All columns must be named
HTML %>%
    html_node('table') %>%
    html_table(check.names=TRUE) %>%
    tbl_df()
## Source: local data frame [1 x 2]
##    Var1     X
##   (int) (int)
## 1     1     2

hadley · 2015-12-12T16:35:47Z

I think it's a bad idea to munge column names because they can contain important information.

r2evans · 2015-12-16T08:54:34Z

I agree that the default behavior should be to not munge things. However, so many HTML tables -- with or without good column names -- do not try to adhere to good R naming conventions :-)

I certainly think the default should be to not change things, but I also think that when known-challenging tables are scraped, it would be better to offer an explicit "out" vice the unfortunate error with invalid column names. (This specific example may be masked with your new column-spanning feature, though I'm not certain that that should always be the right option.)

Is there a better and/or more elegant solution to HTML tables that fail like this one does? Doesn't keeping the default FALSE sufficiently satisfy the default purity of the source data?

hadley · 2015-12-16T13:44:25Z

Your explicit out is to make.names() yourself. We'll probably also make some changes to dplyr to make this slightly less frustrating.

r2evans · 2015-12-16T22:03:09Z

Ahh, I see, perhaps something in as_data_frame (passed from tbl_df())? Since neither has any arguments (or passing of ...), this is a bit more of a change to dplyr, but do you want me to migrate this issue to dplyr? If I'm seeing things correctly, it won't be as much difficult code as needing to be consistent with your vision of dplyr performance and semantics. Let me know if you want me to submit a PR, and I'll give it a go. If you have any preferences on where to put the change, I'd appreciate guidance now vice mid-coding :-)

hadley · 2015-12-16T22:44:08Z

Or maybe dplyr should have an explicit clean_df() or something like that (clean is a bit too strong). Please file an issue in dplyr, and I'll mull it over.

MichaelChirico · 2018-04-27T03:04:15Z

I don't think the can should be kicked down the road to dplyr as many users (e.g. me) may not have that in their workflow -- dplyr is not even in this package's DESCRIPTION.

html_table is the entrance point of the data into a standard data.frame format -- therefore having check.names argument (like read.table) seems very natural to me.

johncollins mentioned this issue Nov 15, 2015

Table parsing could be better #111

Closed

hadley closed this as completed Dec 12, 2015

r2evans mentioned this issue Dec 16, 2015

option to clean/normalize data.frames? tidyverse/dplyr#1587

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

html_table: check column names? #116

html_table: check column names? #116

r2evans commented Oct 30, 2015

r2evans commented Oct 31, 2015

hadley commented Dec 12, 2015

r2evans commented Dec 16, 2015

hadley commented Dec 16, 2015

r2evans commented Dec 16, 2015

hadley commented Dec 16, 2015

MichaelChirico commented Apr 27, 2018

html_table: check column names? #116

html_table: check column names? #116

Comments

r2evans commented Oct 30, 2015

r2evans commented Oct 31, 2015

hadley commented Dec 12, 2015

r2evans commented Dec 16, 2015

hadley commented Dec 16, 2015

r2evans commented Dec 16, 2015

hadley commented Dec 16, 2015

MichaelChirico commented Apr 27, 2018