Fixing of ragged fixed width format files and reading a subset of columns #353
Good solution! It gives the user the power to explicitly specify ragged data via the column position parameter. Serves the stated purpose of
Also, the single failing check is a positive sign, as it checks for the problematic
#' use \code{fwf_positions}. The width of the last column will be silently
#' extended to the next line break.
#' use \code{fwf_positions}. If the width of the last column is variable (a
#' ragged fwf file), supply the last end position as NA, Inf or simply omit it.
I think you only need to use a single sentinel value here, and I'd recommend sticking with Inf.
Actually, NA would be easier, because Inf is only available in doubles, not integers.
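The point about sentinels can be illustrated with a small C++ sketch. It is not taken from the readr source: `kNaInt` and `isRaggedEnd` are hypothetical names, and the sketch assumes the R convention of reserving the smallest `int` as the integer NA.

```cpp
#include <climits>
#include <limits>

// Integers have no representation for infinity, while doubles do.
static_assert(!std::numeric_limits<int>::has_infinity,
              "int cannot hold Inf");
static_assert(std::numeric_limits<double>::has_infinity,
              "double can hold Inf");

// Assumption for this sketch: integer NA is stored as INT_MIN, following
// the convention R uses for NA_integer_.
constexpr int kNaInt = INT_MIN;

// Hypothetical check: an end position of NA marks the last column as ragged.
bool isRaggedEnd(int end) {
  return end == kNaInt;
}
```

With an integer vector of end positions, a single NA sentinel is therefore enough, and no conversion to double is needed.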
I think this is a reasonable approach, although it still needs quite a bit of work. @ghaarsma are you interested in continuing to work on it?
@jschoeley A failing test is not a good sign for a PR - tests need to be updated too.
@@ -164,6 +171,14 @@ Token TokenizerFwf::nextToken() {
row_++;
col_ = 0;

// Proceed to the end of the line. This is needed in case the last column
// in the file is not being read.
while(fieldEnd != end_ && *fieldEnd != '\r' && *fieldEnd != '\n') {
Are you sure this won't end up accidentally skipping blank lines?
Yes, there is room for improvement.
- You don't have to proceed to the end of the line if the line is short (tooShort = true): you are already there.
- You don't have to proceed to the end of the line if the format is ragged (reading the ragged column will do this).
- However, if the line is neither short nor ragged, you will not be at the EOL when the user does not wish to read the final column. In that case you have to proceed to the EOL, and you don't know the width, hence the while loop.
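The three cases above can be sketched as follows. This is a minimal illustration, not the code in the PR: `needsSkipToEol` and `skipToEol` are hypothetical helpers, and the buffer is a plain `std::string` rather than the tokenizer's iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decide whether the cursor still has to be advanced
// to the end of the line after the last requested field was read.
bool needsSkipToEol(bool tooShort, bool isRagged) {
  // Short line: the cursor already sits on the line break.
  // Ragged last column: reading that column consumed the rest of the line.
  return !tooShort && !isRagged;
}

// Advance `pos` to the next '\r' or '\n', or to the end of the buffer.
// The remaining width is unknown, so scan one character at a time.
std::size_t skipToEol(const std::string& buf, std::size_t pos) {
  assert(pos <= buf.size());
  while (pos != buf.size() && buf[pos] != '\r' && buf[pos] != '\n') {
    ++pos;
  }
  return pos;
}
```

In this sketch the loop stops on the line break without consuming it, so a cursor that already sits on a blank line does not move.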
Hadley, the (only) way to indicate a ragged fixed width format is to have the last position of the end vector as NA. I have checked it with empty lines inside the file (it pushes in a row of NAs), which I assume is the intended behavior.
Some simple tests:
library(readr)
x <- '12345A\n67890BBBBBBBBB\n54321C'
col_names <- c('A', 'B', 'C')
start <- c(1, 3, 6)
end <- c(2, 5, 6)
# Read all columns, non-ragged
col_positions <- fwf_positions(start, end, col_names)
df1 <- read_fwf(x, col_positions = col_positions); df1
# Read a subset of columns, it works!
col_positions <- fwf_positions(start[1:2], end[1:2], col_names[1:2])
df2 <- read_fwf(x, col_positions = col_positions); df2
# Read ragged: the last end position is NA
col_positions <- fwf_positions(start, end = c(2, 5, NA), col_names)
df4 <- read_fwf(x, col_positions = col_positions); df4
# Read ragged, the alternate way with fwf_widths
col_positions <- fwf_widths(widths = c(2, 3, NA), col_names)
df5 <- read_fwf(x, col_positions = col_positions); df5
@@ -76,7 +77,9 @@ fwf_widths <- function(widths, col_names = NULL) {
#' @rdname read_fwf
#' @export
#' @param start,end Starting and ending (inclusive) positions of each field.
#' Use NA or Inf as last end field when reading a ragged fwf file.
I think you missed the Inf here.
This is looking good. Can you please add those examples as unit tests, and add a bullet point to NEWS? (It should include the original issue number and your GitHub user name.)
Added new unit tests and a bullet point to NEWS.md. Let me know if this is all correct. Some of the work process is a first time for me.
This looks great! There's one last thing for you to attempt - can you please rebase/merge to bring your branch up to date with the other changes? If you get stuck, don't worry too much, as I can do it by hand, but it's a good learning experience for future PRs if you want to give it a shot.
You can properly read a subset of columns out of any fwf file. You can also read a ragged fwf file when the last element in the end position is NA, Inf or simply omitted.
You can properly read a subset of columns out of any fwf file. You can also read a ragged fwf file when the last element in the end position is NA, Inf or simply omitted. Updated documentation in read_fwf to reflect changes.
…Hadley The only way to assume a fwf ragged file is for the last end position to be NA.
Hadley, I tried to follow https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request. I had some minor issues, but I think I got it. Please check when merging the pull request.
Current coverage is 70.00%
@@            master   #353   diff @@
==========================================
  Files         56     56
  Lines       2803   2807     +4
  Methods        0      0
  Messages       0      0
  Branches       0      0
==========================================
+ Hits        1961   1965     +4
  Misses       842    842
  Partials       0      0
Perfect - thanks!
This pull request should fix items 300 and 326. The current fwf format is kind of broken if you want to read a subset of columns.
Ragged fwf files (where the width of the last column is variable) do exist, but they should be a minority. This implementation assumes a ragged fwf file only when the last end position is NA, Inf or omitted.
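The rule in the description can be sketched in C++ on a single line of input. This is an illustration only, not readr's tokenizer: `Field`, `kRaggedEnd` and `sliceLine` are hypothetical names, positions are 1-based and inclusive as in fwf_positions, and `kRaggedEnd` stands in for an NA end position.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel standing in for an NA end position (ragged column).
constexpr std::size_t kRaggedEnd = std::string::npos;

// 1-based, inclusive field positions; start is assumed to be at least 1.
struct Field {
  std::size_t start;
  std::size_t end;
};

// Cut one line into the requested fields. Fields may be a subset of the
// columns; a ragged field runs to the end of the line.
std::vector<std::string> sliceLine(const std::string& line,
                                   const std::vector<Field>& fields) {
  std::vector<std::string> out;
  for (const Field& f : fields) {
    std::size_t begin = std::min(f.start - 1, line.size());
    std::size_t stop = (f.end == kRaggedEnd)
                           ? line.size()
                           : std::min(f.end, line.size());
    // A field that starts past the end of a short line comes back empty.
    out.push_back(stop > begin ? line.substr(begin, stop - begin)
                               : std::string());
  }
  return out;
}
```

Called with the fields {1, 2}, {3, 5}, {6, kRaggedEnd}, the last field takes everything from position 6 to the end of the line, while {6, 6} takes a single character.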