Float column type inferred as utf8 when reading csv #1488
Comments
Interestingly, sometimes it seems to work. With one of the tables I get the expected results.
|
Can you post some CSV data in this issue? |
creating
|
@liukun4515 the data is available for download at the link I posted. And I agree, that was my workaround to get the data type I needed, but it would be nice if that wasn't needed. I'm less familiar with the internals of reading CSVs. Maybe it has to do with how many rows are used to infer the schema. |
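To make the row-count hypothesis concrete, here is a minimal Python sketch of sample-based type inference. It is not DataFusion's or arrow-rs's implementation: `classify`, `infer_column_type`, the `max_records` parameter, and the sample values are all hypothetical, chosen only to show that inferring from a prefix of a column can give a different answer than inferring from the whole column.

```python
from typing import Optional, Sequence


def classify(value: str) -> str:
    """Return the narrowest type name that can hold one CSV field (hypothetical helper)."""
    try:
        int(value)
        return "Int64"
    except ValueError:
        pass
    try:
        float(value)
        return "Float64"
    except ValueError:
        return "Utf8"


# Widening order: a column takes the widest type seen among its sampled values.
_RANK = {"Int64": 0, "Float64": 1, "Utf8": 2}


def infer_column_type(values: Sequence[str], max_records: Optional[int] = None) -> str:
    """Infer a column type from the first `max_records` values (all values if None)."""
    sampled = values if max_records is None else values[:max_records]
    widest = "Int64"
    for value in sampled:
        candidate = classify(value)
        if _RANK[candidate] > _RANK[widest]:
            widest = candidate
    return widest


# Made-up column: the last value cannot be parsed as a number.
column = ["1.5", "2.25", "3.0", "not-a-number"]
print(infer_column_type(column, max_records=3))  # only the first three rows are sampled
print(infer_column_type(column))                 # the whole column is sampled
```

Under these assumptions, a small sample can look like a clean float column while the full file does not, which would be consistent with a 25-row extract behaving differently from the complete dataset.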
I think infer_schema_from_files and other |
The DECIMAL_RE IS @matthewmturner |
If your format of the float is like this |
@liukun4515 thanks for the help looking into this. I'm aligned with your thinking, but it does seem something else is still off. Below is the head of the CSV file. The float field is not quoted.
|
I copied your data to my laptop and repeated your steps.
The result of the query:
|
@liukun4515 to confirm: did you copy the entire dataset or just the 25 rows I pasted? When I make a CSV of just the 25 rows I get the same results as you. Also, can you confirm you are using datafusion-cli based on datafusion 6.0.0? |
I just copied the 25 rows into a new CSV file. |
@liukun4515 thanks. I can try with that commit. If it's not an issue, could you try with the full dataset? I can post it publicly if it would help. |
Yes @matthewmturner |
I can't open this link; there may be a problem with the network. |
@liukun4515 can you download with curl?
|
It looks like the type inference issues are caused by some numbers in scientific notation:
|
Maybe arrow-rs can't handle this situation. |
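The scientific-notation explanation can be illustrated with a small Python sketch. The patterns below are assumptions made for illustration and are not the `DECIMAL_RE` defined in arrow-rs; the sample values are likewise made up rather than taken from the dataset. The idea is that a float pattern with no exponent part rejects a value such as `9.1e-05`, so a regex-driven inference step would fall back to a string type for it.

```python
import re

# Assumed strict pattern: optional sign, digits, a dot, digits. No exponent allowed.
STRICT_DECIMAL = re.compile(r"^-?\d+\.\d+$")

# Hypothetical relaxed pattern that also accepts an optional exponent such as e-05.
RELAXED_DECIMAL = re.compile(r"^-?\d+(\.\d+)?([eE][-+]?\d+)?$")

INTEGER = re.compile(r"^-?\d+$")


def infer_field(value: str, decimal_re: "re.Pattern[str]") -> str:
    """Classify one CSV field, falling back to Utf8 when no numeric pattern matches."""
    if INTEGER.match(value):
        return "Int64"
    if decimal_re.match(value):
        return "Float64"
    return "Utf8"


for sample in ["12.5", "9.1e-05"]:
    print(sample, infer_field(sample, STRICT_DECIMAL), infer_field(sample, RELAXED_DECIMAL))
```

Because a column usually takes the widest type seen across its values, a single field classified as a string would be enough to turn an otherwise numeric column into utf8 in this model.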
I downloaded https://matthewmturner-oss.s3.amazonaws.com/public/db-benchmark/J1_1e7_NA_0_0.csv and ran this locally:
Given we have
I think this issue is now done! |
Describe the bug
I am working on adding datafusion to db-benchmarks (#147). As part of that I am using datafusion-cli to test writing queries on the db-benchmark data (it can be generated here: https://github.com/h2oai/db-benchmark/tree/master/_data).
While I was creating a table from one of the datasets, I noticed that one of the column types was inferred incorrectly. Specifically, column v1 was picked up as utf8 instead of float / decimal.
To Reproduce
Expected behavior
Column v1 should have a data_type of float or decimal.