Use insights from htmltab package #63

Closed
hadley opened this issue Jan 16, 2015 · 1 comment · Fixed by #293
Labels
feature a feature request or enhancement table 🏓

hadley commented Jan 16, 2015

  1. Expansion of row and column spans: I am confident that this part of the code (span_body, span_header) is working as expected for spans in the header or the body. E.g.,

    library(XML)
    library(stringr)
    library(magrittr)
    doc <- "http://en.wikipedia.org/wiki/Usage_share_of_web_browsers"
    # Strip a trailing '%' from each cell and turn empty strings into NA
    bFun <- function(node) {
      xmlValue(node) %>%
        str_replace(., '%$', '') %>%
        ifelse(equals(., ''), NA, .)
    }
    htmltable(doc = doc, which = "//table[5]", bodyFun = bFun)
    
    url  <- "http://de.wikipedia.org/wiki/Bundestagswahlkreis_Frankfurt_am_Main_II"
    htmltable(doc = url, which = 14, encoding = "UTF-8")

    I also managed to make these functions quite robust to certain misspecifications in the HTML code. For example, in this table the last column has a span of 8, but it should be 6:

    htmltable(doc = "http://en.wikipedia.org/wiki/Jamie_xx", 
              which = "/html/body/div[3]/div[3]/div[4]/table[2]", 
              encoding = "UTF-8", 
              header = 1:2, body = "tr[./td[not(@colspan = '9')]]")

  2. Identification of header and body elements: This is an issue that I still have not completely solved, and it is the cause of nearly all the failures right now. I tried to come up with reasonable heuristics for identifying these elements, but they are not finished: they need more work and more testing with 'real-life' HTML tables. Ultimately, I think it's necessary to give users more control over identifying these elements, or to just fall back on very simple decision rules.
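
For readers unfamiliar with what "expansion of row and column spans" means in point 1, here is a minimal base R sketch. It is not the package's span_body or span_header code. The helpers `cell()` and `expand_spans()` and the toy table are hypothetical, made up for illustration. Each cell is copied into every grid slot it spans, and a rowspan that is too large is clamped to the last row.

```r
# Hypothetical cell constructor: text plus HTML-style span attributes.
cell <- function(text, rowspan = 1L, colspan = 1L) {
  list(text = text, rowspan = rowspan, colspan = colspan)
}

# Expand a list of rows (each a list of cells) into a character matrix.
expand_spans <- function(rows) {
  n_row <- length(rows)
  # Safe upper bound on the width: every cell in the table laid side by side.
  n_col <- sum(vapply(unlist(rows, recursive = FALSE),
                      function(x) x$colspan, numeric(1)))
  grid   <- matrix(NA_character_, nrow = n_row, ncol = n_col)
  filled <- matrix(FALSE, nrow = n_row, ncol = n_col)

  for (i in seq_len(n_row)) {
    j <- 1L
    for (x in rows[[i]]) {
      # Skip slots already claimed by a rowspan from an earlier row.
      while (filled[i, j]) j <- j + 1L
      # Clamp the rowspan so an over-large value cannot run past the last row.
      r_idx <- i:min(n_row, i + x$rowspan - 1L)
      c_idx <- j:(j + x$colspan - 1L)
      grid[r_idx, c_idx]   <- x$text
      filled[r_idx, c_idx] <- TRUE
      j <- j + x$colspan
    }
  }
  # Drop the unused columns of the over-allocated grid.
  grid[, colSums(filled) > 0, drop = FALSE]
}

# Toy table: "Group" spans two rows, "Value" spans two columns.
tbl <- list(
  list(cell("Group", rowspan = 2L), cell("Value", colspan = 2L)),
  list(cell("a"), cell("b")),
  list(cell("g1"), cell("x"), cell("y"))
)
expand_spans(tbl)
```

The sketch only guards against over-long rowspans. An over-wide colspan, like the one in the Jamie_xx example, cannot be clamped this way because the true table width is not known up front.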

crubba commented Jan 16, 2015

I had to change the name to release it on CRAN; it's now htmltab. In the latest version, I changed the reference node for the header and body arguments: an XPath must now treat the table node as the root. E.g., for the above example, body = "//tr[./td[not(@colspan = '9')]]".
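
To illustrate selecting header and body rows relative to a table node, here is a small sketch using the XML package that the examples above already load. It is not htmltab's implementation. The inline HTML and the variable names are made up, and the sketch uses the relative form `.//tr` because an absolute `//tr` evaluated on a node with `getNodeSet()` searches the whole document rather than just that table.

```r
library(XML)

# Made-up table: one header row, two data rows, and one full-width note row.
html <- '<table>
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>a1</td><td>b1</td></tr>
  <tr><td colspan="2">note spanning the full width</td></tr>
  <tr><td>a2</td><td>b2</td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)
tab <- getNodeSet(doc, "//table")[[1]]

# Very simple decision rules, evaluated relative to the table node:
# header rows contain <th> cells; body rows contain at least one <td>
# that is not a full-width (colspan = 2) cell.
header_rows <- getNodeSet(tab, ".//tr[./th]")
body_rows   <- getNodeSet(tab, ".//tr[./td[not(@colspan = '2')]]")

header <- xpathSApply(header_rows[[1]], "./th", xmlValue)
body   <- do.call(rbind, lapply(body_rows, function(tr) {
  xpathSApply(tr, "./td", xmlValue)
}))
colnames(body) <- header
body
```

The full-width note row is excluded by the same kind of predicate as the `not(@colspan = '9')` filter in the example above.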

hadley changed the title from "htmltable package" to "Use insights from htmltab package" on Mar 17, 2019
hadley added the "feature" (a feature request or enhancement) and "table 🏓" labels on Mar 17, 2019
hadley added a commit that referenced this issue Dec 19, 2020
And make it return a tibble.

Fixes #63. Fixes #204. Fixes #215. Fixes #199.