-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
read_excel() converts date to number #716
Comments
I will attempt to answer this question, having done my best to recreate your circumstances. This is not ideal because I don't have a copy of your Excel file, and most importantly, I don't know the Excel data types of cells B2 and B3. Whether they are General, Date, or Text affects what data type they are imported into R as by I created an imitation of your file by typing in the values that I see above by hand. I observed the following, and am making the assumption that these things are true for your file as well:
Moving now to what happens in R: Because column B contains at least one cell that cannot be interpreted as anything other than text (cell B3), Cell B2 is imported as "17995" because that is what is actually stored in the Excel spreadsheet. When Excel interpreted what you typed in as a date, it implicitly converted it to the number 17995, which is the number of days between December 30, 1899 and the date entered (this is Microsoft's convention for storing date data). With the Date type applied, Excel knows to display the value as a date, while still storing the number "under the hood". This article from the readxl site is highly relevant and is my go-to reference for understanding why my Excel data gets imported the way it does: https://readxl.tidyverse.org/articles/cell-and-column-types.html Here is my personal approach for ensuring that date data from Excel spreadsheets ends up as the intended date values in R: library(readxl)
library(dplyr)
library(stringr)
library(lubridate)
# a function that takes a character vector that may contain dates in various formats, and attempts to convert each format to a date value appropriately
convert_excel_dates <-
function(x){
case_when(
str_detect(x, "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$") ~ mdy(x), # handles values imported as text values in the format "MM/DD/YYYY"
str_detect(x, "^[0-9]{5}$") ~ x |> as.integer() |> as.Date(origin = as.Date("1899-12-30")), # handles values imported as numbers expressed as days since 1899-12-30 (Microsoft's convention)
TRUE ~ NA_Date_ # default case, no applicable date format, returns a missing date value
)
}
read_excel("test1.xlsx", col_types = "text") |> # instruct read_excel() to simply import all columns as text, so that I can make tricky type conversions myself
mutate(DOB_Converted_to_Date = convert_excel_dates(DOB))
#> # A tibble: 3 × 3
#> NAME DOB DOB_Converted_to_Date
#> <chr> <chr> <date>
#> 1 Jack 17995 1949-04-07
#> 2 Tom 12/37/38 NA
#> 3 George 02/17/80 1980-02-17 Handling of additional date formats can be added to |
Thank you for the explanation and walk around. I wonder how MS excel keeps
original character format for B2 ("4/7/1949") when saving test1.xslx as
.csv file. "4/7/1949" should be stored somewhere in excel. It would be nice
for R to retrieve it ("4/7/1949") when R realizes column B is character.
…On Sun, Jan 8, 2023, 8:46 AM Matt Nield ***@***.***> wrote:
I will attempt to answer this question, having done my best to recreate
your circumstances. This is not ideal because I don't have a copy of your
file, and most importantly, I don't know the Excel data types of cells B2
and B3. Whether they are General, Date, or Text affects what data type they
are imported into R as by read_excel(). Any of the above that you can
provide will help me give a more accurate explanation of your output.
I created an imitation of your file by typing in the values that I see
above by hand and observed the following; I am making the assumption that
these things are true for your file as well:
- Cell B2 was interpreted by Excel as a valid date (April 7, 1949) and
was automatically assigned the *Date type*
- Cell B3 could not be interpreted by Excel as anything other than
just text, since it does not represent a valid date or numeric value, and
so was kept as the default *General type*
Moving now to what happens in R: Because column B contains at least one
cell that cannot be interpreted as anything other than text (cell B3),
read_excel() guessed that the entire column should be treated as
character (text) data. This behavior avoids data loss; if it had guessed
differently and applied a date type, cell B3 would have to be NA (a
blank/missing value), and information that you had before would not make it
into your data.frame.
This article from the readxl site is highly relevant and is my go-to
reference for understanding why my Excel data gets imported the way it
does: https://readxl.tidyverse.org/articles/cell-and-column-types.html
Here is my personal approach for ensuring that date data from Excel
spreadsheets ends up as the intended date values in R:
library(readxl)
library(dplyr)
library(stringr)
library(lubridate)
# a function that takes a character vector that may contain dates in various formats, and attempts to convert each format to a date value appropriately
convert_excel_dates <-
function(x){
case_when(
str_detect(x, "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$") ~ mdy(x), # handles values imported as text values in the format "MM/DD/YYYY"
str_detect(x, "^[0-9]{5}$") ~ x |> as.integer() |> as.Date(origin = as.Date("1899-12-30")), # handles values imported as numbers expressed as days since 1899-12-30 (Microsoft's convention)
TRUE ~ NA_Date_ # default case, no applicable date format, returns a missing date value
)
}
read_excel("test1.xlsx", col_types = "text") |> # instruct read_exce() to simply import all columns as text, so that I can make tricky type conversions myself
mutate(DOB_Converted_to_Date = convert_excel_dates(DOB))
#> # A tibble: 3 × 3
#> NAME DOB DOB_Converted_to_Date
#> <chr> <chr> <date>
#> 1 Jack 17995 1949-04-07
#> 2 Tom 12/37/38 NA
#> 3 George 02/17/80 1980-02-17
Handling of additional date formats can be added to case_when() as
needed. I added one more row for this example to show that a value that is
imported as text but still follows the "MM/DD/YY" format but actually
represents a valid date still gets converted correctly. This can happen if
someone types in a valid date like "02/17/80" when the cell's Excel data
type is set to Text, and Excel makes no attempt to interpret the value as a
Date or number.
—
Reply to this email directly, view it on GitHub
<#716 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AC32KONZP6AFA36LCQX5RV3WRLHNBANCNFSM6AAAAAAR427YVI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
test1.xlsx
I try to covert test1.xlsx to .csv file by using read_excel(), then write.csv()
The excel file test1.xlsx contains date column (DOB) with some typo. After read, the correct DOB date become number (in character type). The read_excel() should at least keep DOB in original text format. If I use MS excel to save as .csv the date format is preserved.
The text was updated successfully, but these errors were encountered: