
Float column type inferred as utf8 when reading csv #1488

Closed
matthewmturner opened this issue Dec 26, 2021 · 19 comments
Labels
bug Something isn't working

Comments

@matthewmturner
Contributor

Describe the bug
I am working on adding DataFusion to db-benchmarks (#147). As part of that I am using datafusion-cli to test writing queries on the db-benchmark data (it can be generated from https://github.com/h2oai/db-benchmark/tree/master/_data).

While creating a table from one of the datasets, I noticed that one of the column types was inferred incorrectly: column v1 was picked up as Utf8 instead of a float / decimal type.

To Reproduce

DataFusion CLI v5.1.0

❯ CREATE EXTERNAL TABLE x STORED AS CSV WITH HEADER ROW LOCATION "data/J1_1e7_NA_0_0.csv";
0 rows in set. Query took 5.536 seconds.
❯ SHOW COLUMNS FROM x;
+---------------+--------------+------------+-------------+-----------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------+-------------+
| datafusion    | public       | x          | id1         | Int64     | NO          |
| datafusion    | public       | x          | id2         | Int64     | NO          |
| datafusion    | public       | x          | id3         | Int64     | NO          |
| datafusion    | public       | x          | id4         | Utf8      | NO          |
| datafusion    | public       | x          | id5         | Utf8      | NO          |
| datafusion    | public       | x          | id6         | Utf8      | NO          |
| datafusion    | public       | x          | v1          | Utf8      | NO          |
+---------------+--------------+------------+-------------+-----------+-------------+
7 rows in set. Query took 0.002 seconds.

Expected behavior
Column v1 should have a data_type of float or decimal.


@matthewmturner matthewmturner added the bug Something isn't working label Dec 26, 2021
@matthewmturner
Contributor Author

Interestingly, it sometimes works: with one of the tables I get the expected result.

❯ CREATE EXTERNAL TABLE x STORED AS CSV WITH HEADER ROW LOCATION "data/J1_1e7_NA_0_0.csv";
0 rows in set. Query took 5.536 seconds.
❯ SHOW COLUMNS FROM x;
+---------------+--------------+------------+-------------+-----------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------+-------------+
| datafusion    | public       | x          | id1         | Int64     | NO          |
| datafusion    | public       | x          | id2         | Int64     | NO          |
| datafusion    | public       | x          | id3         | Int64     | NO          |
| datafusion    | public       | x          | id4         | Utf8      | NO          |
| datafusion    | public       | x          | id5         | Utf8      | NO          |
| datafusion    | public       | x          | id6         | Utf8      | NO          |
| datafusion    | public       | x          | v1          | Utf8      | NO          |
+---------------+--------------+------------+-------------+-----------+-------------+
7 rows in set. Query took 0.002 seconds.
❯ CREATE EXTERNAL TABLE large STORED AS CSV WITH HEADER ROW LOCATION "data/J1_1e7_1e7_0_0.csv";
0 rows in set. Query took 5.410 seconds.
❯ SHOW COLUMNS FROM large;
+---------------+--------------+------------+-------------+-----------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------+-------------+
| datafusion    | public       | large      | id1         | Int64     | NO          |
| datafusion    | public       | large      | id2         | Int64     | NO          |
| datafusion    | public       | large      | id3         | Int64     | NO          |
| datafusion    | public       | large      | id4         | Utf8      | NO          |
| datafusion    | public       | large      | id5         | Utf8      | NO          |
| datafusion    | public       | large      | id6         | Utf8      | NO          |
| datafusion    | public       | large      | v2          | Utf8      | NO          |
+---------------+--------------+------------+-------------+-----------+-------------+
7 rows in set. Query took 0.003 seconds.
❯ CREATE EXTERNAL TABLE medium STORED AS CSV WITH HEADER ROW LOCATION "data/J1_1e7_1e4_0_0.csv";
0 rows in set. Query took 0.010 seconds.
❯ SHOW COLUMNS FROM medium;
+---------------+--------------+------------+-------------+-----------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------+-------------+
| datafusion    | public       | medium     | id1         | Int64     | NO          |
| datafusion    | public       | medium     | id2         | Int64     | NO          |
| datafusion    | public       | medium     | id4         | Utf8      | NO          |
| datafusion    | public       | medium     | id5         | Utf8      | NO          |
| datafusion    | public       | medium     | v2          | Float64   | NO          |
+---------------+--------------+------------+-------------+-----------+-------------+
5 rows in set. Query took 0.002 seconds.

@liukun4515
Contributor

Can you post some CSV data in this issue?
@matthewmturner

@liukun4515
Contributor

When creating an external table from a CSV file, you can give the schema in SQL, like:

create external table test_table(c1 int, c2 float, c3 ....)  .... location '.........'
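
For reference, the same workaround expressed through the DataFusion Rust API, as a rough sketch. It assumes the DataFusion 6.x interface (ExecutionContext, async register_csv, and the CsvReadOptions::schema / has_header builders) plus a tokio runtime; the path and table name are just placeholders for this dataset.

use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Spell out the schema instead of relying on CSV type inference.
    let schema = Schema::new(vec![
        Field::new("id1", DataType::Int64, false),
        Field::new("id2", DataType::Int64, false),
        Field::new("id3", DataType::Int64, false),
        Field::new("id4", DataType::Utf8, false),
        Field::new("id5", DataType::Utf8, false),
        Field::new("id6", DataType::Utf8, false),
        Field::new("v1", DataType::Float64, false),
    ]);

    let mut ctx = ExecutionContext::new();
    ctx.register_csv(
        "x",
        "data/J1_1e7_NA_0_0.csv",
        CsvReadOptions::new().has_header(true).schema(&schema),
    )
    .await?;

    // v1 is now Float64 regardless of what inference would have picked.
    let df = ctx.sql("SELECT v1 FROM x LIMIT 5").await?;
    df.show().await?;
    Ok(())
}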

@matthewmturner
Contributor Author

@liukun4515 the data is available for download at the link I posted.

And I agree, that was my workaround to get the data type I needed. But it would be nice if that weren't necessary.

I'm less familiar with the internals of reading CSVs. Maybe it has to do with how many rows are used to infer the schema.

@alamb @houqp would you consider this issue a bug?
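
If the sample size is the culprit, one knob to check is how many records the CSV reader samples before fixing a column's type. A rough sketch, assuming DataFusion's CsvReadOptions::schema_infer_max_records builder (the name and default may differ between versions) and a tokio runtime:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();

    // Sample far more rows than the default before deciding each column's type.
    let options = CsvReadOptions::new()
        .has_header(true)
        .schema_infer_max_records(1_000_000);

    ctx.register_csv("x", "data/J1_1e7_NA_0_0.csv", options).await?;

    let df = ctx.sql("SELECT v1 FROM x LIMIT 5").await?;
    df.show().await?;
    Ok(())
}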

@liukun4515
Contributor


I think infer_schema_from_files and the other schema-inference functions can help you.
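
A small sketch of how that helper could be used to inspect the decision up front, assuming the arrow-rs signature of that era (list of files, delimiter, optional max records to read, header flag):

use arrow::csv::reader::infer_schema_from_files;

fn main() -> arrow::error::Result<()> {
    // None for max_read_records means "read all records" while inferring.
    let schema = infer_schema_from_files(
        &["data/J1_1e7_NA_0_0.csv".to_string()],
        b',',
        None,
        true,
    )?;

    for field in schema.fields() {
        println!("{}: {:?}", field.name(), field.data_type());
    }
    Ok(())
}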

@liukun4515
Contributor

/// Infer the data type of a record
fn infer_field_schema(string: &str) -> DataType {
    // when quoting is enabled in the reader, these quotes aren't escaped, we default to
    // Utf8 for them
    if string.starts_with('"') {
        return DataType::Utf8;
    }
    // match regex in a particular order
    if BOOLEAN_RE.is_match(string) {
        DataType::Boolean
    } else if DECIMAL_RE.is_match(string) {
        DataType::Float64
    } else if INTEGER_RE.is_match(string) {
        DataType::Int64
    } else if DATETIME_RE.is_match(string) {
        DataType::Date64
    } else if DATE_RE.is_match(string) {
        DataType::Date32
    } else {
        DataType::Utf8
    }
}

The DECIMAL_RE is: static ref DECIMAL_RE: Regex = Regex::new(r"^-?(\d*\.\d+|\d+\.\d*)$").unwrap();

@matthewmturner
I think the format of the float may not match the DECIMAL_RE.

@liukun4515
Contributor

If your float is formatted like "123.123" (quoted), it will be inferred as Utf8.
If it is formatted like 123.123 (unquoted), it will be inferred as Float64.
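
A standalone check of that behavior, mirroring just the quote test and DECIMAL_RE from the snippet above (not arrow's actual private function), using the regex crate:

use regex::Regex;

fn main() {
    // Quoted values short-circuit to Utf8; otherwise DECIMAL_RE decides Float64.
    let decimal_re = Regex::new(r"^-?(\d*\.\d+|\d+\.\d*)$").unwrap();

    for value in ["\"123.123\"", "123.123"] {
        let inferred = if value.starts_with('"') {
            "Utf8"
        } else if decimal_re.is_match(value) {
            "Float64"
        } else {
            "Utf8"
        };
        println!("{value} -> {inferred}");
    }
    // "123.123" -> Utf8
    // 123.123   -> Float64
}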

@matthewmturner
Contributor Author

@liukun4515 thanks for the help looking into this.

I'm aligned with your thinking, but it does seem something else is still off. Below is the head of the CSV file; the float field is not quoted.

$ head -n 25 J1_1e7_NA_0_0.csv
id1,id2,id3,id4,id5,id6,v1
10,5159,8771892,id10,id5159,id8771892,91.827108
7,8469,6597185,id7,id8469,id6597185,48.147306
6,7510,10742226,id6,id7510,id10742226,53.125234
10,3114,4468842,id10,id3114,id4468842,96.488041
3,6139,7725507,id3,id6139,id7725507,56.095139
8,5330,3935410,id8,id5330,id3935410,0.349572
10,9278,1617200,id10,id9278,id1617200,68.425456
3,2328,9892050,id3,id2328,id9892050,62.356196
4,4500,4713401,id4,id4500,id4713401,57.149024
3,1782,2296556,id3,id1782,id2296556,94.777187
3,8728,5951049,id3,id8728,id5951049,67.679287
8,4718,9297336,id8,id4718,id9297336,24.238334
11,8211,5318127,id11,id8211,id5318127,66.530629
2,2709,8769384,id2,id2709,id8769384,52.903976
3,5280,1083694,id3,id5280,id1083694,12.481477
8,2883,9801933,id8,id2883,id9801933,49.100595
2,8989,6380263,id2,id8989,id6380263,40.005564
11,9768,9715707,id11,id9768,id9715707,40.54798
5,10744,9149552,id5,id10744,id9149552,3.472602
6,1194,6141889,id6,id1194,id6141889,3.986312
5,1270,7381972,id5,id1270,id7381972,98.386565
4,8238,1702432,id4,id8238,id1702432,81.020402
4,8459,6190403,id4,id8459,id6190403,87.17891
11,205,9363336,id11,id205,id9363336,94.937556

@liukun4515
Contributor


I have copied your data to my laptop and repeated your steps:

  1. Copied your data and stored it as a CSV file named simple.csv
  2. Created the table: CREATE EXTERNAL TABLE food STORED AS CSV WITH HEADER ROW LOCATION 'simple.csv';
  3. Described the table with \d food

Below is my result:
❯ \d food
+---------------+--------------+------------+-------------+-----------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------+-------------+
| datafusion    | public       | food       | id1         | Int64     | NO          |
| datafusion    | public       | food       | id2         | Int64     | NO          |
| datafusion    | public       | food       | id3         | Int64     | NO          |
| datafusion    | public       | food       | id4         | Utf8      | NO          |
| datafusion    | public       | food       | id5         | Utf8      | NO          |
| datafusion    | public       | food       | id6         | Utf8      | NO          |
| datafusion    | public       | food       | v1          | Float64   | NO          |
+---------------+--------------+------------+-------------+-----------+-------------+
7 rows in set.

The result of the query select * from food limit 2:

❯ select * from food limit 2
;
+-----+------+---------+------+--------+-----------+-----------+
| id1 | id2  | id3     | id4  | id5    | id6       | v1        |
+-----+------+---------+------+--------+-----------+-----------+
| 10  | 5159 | 8771892 | id10 | id5159 | id8771892 | 91.827108 |
| 7   | 8469 | 6597185 | id7  | id8469 | id6597185 | 48.147306 |
+-----+------+---------+------+--------+-----------+-----------+
2 rows in set. Query took 0.015 seconds.
❯

@matthewmturner

@matthewmturner
Contributor Author

@liukun4515 to confirm, did you copy the entire dataset or just the 25 rows I pasted? When I make a CSV of just the 25 rows I get the same results as you.

Also, can you confirm you are using datafusion-cli based on DataFusion 6.0.0?

@liukun4515
Contributor


I just copied the 25 rows as a new CSV file.
I built datafusion-cli from master; the commit is 91ee5a4682f58b0aeca.

@matthewmturner
Contributor Author

@liukun4515 thanks. I can try with that commit. If it's not an issue, could you try with the full dataset? I can post it publicly if it would help.

@liukun4515
Contributor

Yes @matthewmturner.
It's better to upload the whole file, and I can try to debug it in my local env.
Thanks.

@matthewmturner
Contributor Author

@liukun4515 here you go: https://matthewmturner-oss.s3.amazonaws.com/public/db-benchmark/J1_1e7_NA_0_0.csv

@liukun4515
Contributor

I can't open this link; there may be a problem with the network.

@matthewmturner
Contributor Author

@liukun4515 can you download with curl?

curl https://matthewmturner-oss.s3.amazonaws.com/public/db-benchmark/J1_1e7_NA_0_0.csv -o test.csv

@pjmore
Contributor

pjmore commented Jan 20, 2022

It looks like the type inference issues are caused by some numbers in scientific notation:

❯ SELECT v1 from x where v1 not like '%.%';
+-------+
| v1    |
+-------+
| 26    |
| 2e-06 |
| 2     |
| 47    |
| 11    |
| 13    |
| 68    |
| 83    |
| 1e-04 |
| 8e-04 |
| 54    |
| 8e-06 |
| 96    |
+-------+
13 rows in set. Query took 0.981 seconds. 
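
That would explain it: values like 2e-06 don't match the DECIMAL_RE posted earlier (and have no decimal point or integer form the other patterns accept), so those rows fall through to Utf8, which is presumably why the whole column ends up Utf8. A quick standalone check against the same pattern (regex crate, pattern copied from the thread):

use regex::Regex;

fn main() {
    let decimal_re = Regex::new(r"^-?(\d*\.\d+|\d+\.\d*)$").unwrap();

    for value in ["91.827108", "2e-06", "1e-04", "8e-06"] {
        println!("{value}: matches DECIMAL_RE = {}", decimal_re.is_match(value));
    }
    // 91.827108: true
    // 2e-06:     false  (falls back to Utf8)
    // 1e-04:     false
    // 8e-06:     false
}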

@liukun4515
Contributor

Maybe arrow-rs can't handle this case (scientific notation) when inferring the schema.

@alamb
Contributor

alamb commented Oct 11, 2023

I downloaded https://matthewmturner-oss.s3.amazonaws.com/public/db-benchmark/J1_1e7_NA_0_0.csv and ran this locally:

datafusion-cli

❯ CREATE EXTERNAL TABLE x STORED AS CSV WITH HEADER ROW LOCATION "J1_1e7_NA_0_0.csv";
0 rows in set. Query took 0.026 seconds.

❯ SHOW COLUMNS FROM x;
+---------------+--------------+------------+-------------+-----------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------+-------------+
| datafusion    | public       | x          | id1         | Int64     | YES         |
| datafusion    | public       | x          | id2         | Int64     | YES         |
| datafusion    | public       | x          | id3         | Int64     | YES         |
| datafusion    | public       | x          | id4         | Utf8      | YES         |
| datafusion    | public       | x          | id5         | Utf8      | YES         |
| datafusion    | public       | x          | id6         | Utf8      | YES         |
| datafusion    | public       | x          | v1          | Float64   | YES         |
+---------------+--------------+------------+-------------+-----------+-------------+
7 rows in set. Query took 0.032 seconds.

Given we have

| datafusion    | public       | x          | v1          | Float64   | YES         |

I think this issue is now done!

@alamb alamb closed this as completed Oct 11, 2023