Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

html_table: check column names? #116

Closed
r2evans opened this issue Oct 30, 2015 · 7 comments
Closed

html_table: check column names? #116

r2evans opened this issue Oct 30, 2015 · 7 comments

Comments

@r2evans
Copy link

r2evans commented Oct 30, 2015

Hadley,

Thanks for the tools!

In html_table(), I find with some sites that using the defined header names can produce bad column names (including empty strings). What are your thoughts about including an option to run the column names through make.names?

Currently having to do:

library(rvest)
fix.names <- function(x, ...) {
    colnames(x) <- make.names(colnames(x), ...)
    x
}
HTML <- read_html("...")
HTML %>%
    html_node('table') %>%
    html_table() %>%
    fix.names()

Perhaps:

HTML %>%
    html_node('table') %>%
    html_table(check.names=TRUE)

Something like:

diff --git a/R/table.R b/R/table.R
index 8b98caa..73f216c 100644
--- a/R/table.R
+++ b/R/table.R
@@ -15,6 +15,8 @@
 #' @param fill If \code{TRUE}, automatically fill rows with fewer than
 #'   the maximum number of columns with \code{NA}s.
 #' @param dec The character used as decimal mark.
+#' @param check.names If \code{TRUE} and \code{header == TRUE}, verify
+#'   column names with \code{make.names}.
 #' @export
 #' @examples
 #' tdist <- read_html("http://en.wikipedia.org/wiki/Student%27s_t-distribution")
@@ -52,7 +54,7 @@ html_table.xml_nodeset <- function(x, header = NA, trim = TRUE, fill = FALSE,

 #' @export
 html_table.xml_node <- function(x, header = NA, trim = TRUE,
-                                              fill = FALSE, dec = ".") {
+                                fill = FALSE, dec = ".", check.names = FALSE) {

   stopifnot(html_name(x) == "table")

@@ -93,6 +95,9 @@ html_table.xml_node <- function(x, header = NA, trim = TRUE,
   }
   if (header) {
     col_names <- out[1, , drop = FALSE]
+    if (check.names) {
+      col_names <- make.names(col_names)
+    }
     out <- out[-1, , drop = FALSE]
   } else {
     col_names <- paste0("X", seq_len(ncol(out)))

I found no previous discussion of make.names, my apologies if this is a repeat.

Currently using R 3.2.2, rvest_0.3.0, win10_64.

@r2evans
Copy link
Author

r2evans commented Oct 31, 2015

Reproducible example:

library(dplyr)
library(rvest)
HTML <- read_html('<table><tr><th>Var1</th><th> </th></tr><tr><td>1</td><td>2</td></table>')
HTML %>%
    html_node('table') %>%
    html_table() %>%
    tbl_df()
## Error: All columns must be named
HTML %>%
    html_node('table') %>%
    html_table(check.names=TRUE) %>%
    tbl_df()
## Source: local data frame [1 x 2]
##    Var1     X
##   (int) (int)
## 1     1     2

@hadley
Copy link
Member

hadley commented Dec 12, 2015

I think it's a bad idea to munge column names because they can contain important information.

@hadley hadley closed this as completed Dec 12, 2015
@r2evans
Copy link
Author

r2evans commented Dec 16, 2015

I agree that the default behavior should be to not munge things. However, so many HTML tables -- with or without good column names -- do not try to adhere to good R naming conventions :-)

I certainly think the default should be to not change things, but I also think that when known-challenging tables are scraped, it would be better to offer an explicit "out" vice the unfortunate error with invalid column names. (This specific example may be masked with your new column-spanning feature, though I'm not certain that that should always be the right option.)

Is there a better and/or more elegant solution to HTML tables that fail like this one does? Doesn't keeping the default FALSE sufficiently satisfy the default purity of the source data?

@hadley
Copy link
Member

hadley commented Dec 16, 2015

Your explicit out is to make.names() yourself. We'll probably also make some changes to dplyr to make this slightly less frustrating.

@r2evans
Copy link
Author

r2evans commented Dec 16, 2015

Ahh, I see, perhaps something in as_data_frame (passed from tbl_df())? Since neither has any arguments (or passing of ...), this is a bit more of a change to dplyr, but do you want me to migrate this issue to dplyr? If I'm seeing things correctly, it won't be as much difficult code as needing to be consistent with your vision of dplyr performance and semantics. Let me know if you want me to submit a PR, and I'll give it a go. If you have any preferences on where to put the change, I'd appreciate guidance now vice mid-coding :-)

@hadley
Copy link
Member

hadley commented Dec 16, 2015

Or maybe dplyr should have an explicit clean_df() or something like that (clean is a bit too strong). Please file an issue in dplyr, and I'll mull it over.

@MichaelChirico
Copy link
Contributor

I don't think the can should be kicked down the road to dplyr as many users (e.g. me) may not have that in their workflow -- dplyr is not even in this package's DESCRIPTION.

html_table is the entrance point of the data into a standard data.frame format -- therefore having check.names argument (like read.table) seems very natural to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants