-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Legacy Presto NaN behavior
Rebecca Schlussel edited this page Jun 17, 2024
·
1 revision
Below we provide a reference of Presto's previous behavior with regard to ordering and comparison of NaN values. The new behavior is consistent definition of nan=nan and nan is greater than all other values. See https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md
for more details.
Function | Behavior | Examples |
---|---|---|
Top level operators | ||
order by | NaN is biggest | |
SELECT distinct, group by one column, group by multiple columns | deduplicates NaNs for double types. for real types, it does not (but if you do IS DISTINCT FROM directly it will say they are not distinct). this is because of a bug in the block version of real's isDistinctFrom method | select distinct x from (values 1, nan(), 2, nan(), null, null)t(x); x ------ NaN NULL 1.0 2.0 select count(x), y from (values(1, 1), (1, 1), (1,nan()), (1,nan())) t(x,y) group by y; _col0 | y -------+----- 2 | NaN 2 | 1.0 (2 rows) select x, y, count(*) as count from (values(1, 1), (1, 1), (1,nan()), (1,nan())) t(x,y) group by x, y; x | y | count ---+-----+------- 1 | NaN | 2 1 | 1.0 | 2 (2 rows) |
join on nan() key | does not match. Also note that dynamic filtering has special handling for nan() based on this behavior https://github.com/prestodb/presto/pull/16314/files | SELECT * FROM (values 1, 2, nan()) t1(x) join (values 1, 3, nan()) t2(x) on t1.x = t2.x; x | x -----+----- 1.0 | 1.0 (1 row) |
Simple Scalar Functions and operators | ||
greatest/least | Throws an invalid argument error for NaN in any position | select greatest(1, nan());Query 20240228_144800_58951_3rsv7 failed: Invalid argument to greatest(): NaN select least(1, nan());Query 20240228_144824_58965_3rsv7 failed: Invalid argument to least(): NaN |
>, <, >=, <= |
false for NaN in any position | select nan() > 0; _col0 ------- false (1 row)select 0 > nan(); _col0 ------- false (1 row)select nan() >= 0; _col0 ------- false (1 row)select nan() < 0; _col0 ------- false (1 row)select nan() <= 0; _col0 ------- false (1 row) select nan() <= nan(); _col0 ------- false (1 row) |
between | false for NaN in any position | select nan() between -Infinity() and infinity(); _col0 ------- false (1 row)select 3 between 0 and nan(); _col0 ------- false (1 row) |
= | false | select nan() = nan(); _col0 ------- false (1 row) |
<>, != | TRUE | select nan() <> nan(); _col0 ------- true |
nan() IN (...) | false | select nan() in (1, 2, nan()); _col0 ------- false (1 row) |
nan() NOT IN (....) | TRUE | select nan() not in (nan(), 1, 2); _col0 ------- true (1 row) |
Aggregation Functions | ||
count(distinct) | nan counted only once | SELECT count(distinct x) from (values 1, 2, nan(), nan()) t(x); _col0 ------- 3 (1 row)select y, count(distinct x) as count_distinct, count(x) as count from (values(nan(), 1), (nan(), 1), (1, 1), (1, 2)) t(x,y) group by y; y | count_distinct | count ---+----------------+------- 2 | 1 | 1 1 | 2 | 3 (2 rows) |
min, max, min_by, max_by | If NaN is first, then NaN. Else the min or max other than NaN (https://github.com/prestodb/presto/issues/21877) | |
set_agg() | Nan appears multiple times | select set_agg(x) from (Values 1, 2, nan(), nan(), 3) as t(x); _col0 --------------------------- [1.0, 2.0, NaN, NaN, 3.0] (1 row) |
set_union() | Nan appears multiple times | SELECT set_union(x) from (values array[1, 2, nan()], array[1, 3, nan()]) t(x); _col0 --------------------------- [1.0, 2.0, NaN, 3.0, NaN] (1 row) |
histogram | each nan appears seperately in the histogram. in display it looks like there's only one with an incorrent count, but if you get map_keys you can see it appears n times | SELECT histogram(x) from (values 1, 2, nan(), nan(), nan(), 1) t(x); _col0 ----------------------- {1.0=2, 2.0=1, NaN=1} (1 row)SELECT map_keys(histogram(x)) from (values 1, 2, nan(), nan(), nan(), 1) t(x); _col0 --------------------------- [1.0, 2.0, NaN, NaN, NaN] (1 row) |
map_agg | nan appears multiple times (in display looks like it only appears 1x, but if you get map_keys you can see multiple nans) | select map_agg(x,y) from (values(1, 'a'),( 2, 'b'), (nan(), 'c'),(1, 'd'), (nan(), 'e'))t(x,y); _col0 ----------------------- {1.0=a, 2.0=b, NaN=e} (1 row)select map_keys(map_agg(x,y)) from (values(1, 'a'),( 2, 'b'), (nan(), 'c'),(1, 'd'), (nan(), 'e'))t(x,y); _col0 ---------------------- [1.0, 2.0, NaN, NaN] |
multimap_agg | fails with an index out of bounds exception if there are any nan() keys https://github.com/prestodb/presto/issues/22679 | SELECT multimap_agg(x, y) FROM (VALUES (nan(), 2), (2, 3)) t(x,y); -> Query 20240506_161354_39491_uuek4 failed: Index -1 out of bounds for length 1024. |
map_union | nan keys appear multiple times. in display it looks deduplicated, but if you get map_keys you can see they are not | SELECT map_union(x) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ---------------------------------- {1.0=3, 2.0=4, NaN=30, 100.0=10} (1 row) SELECT map_keys(map_union(x)) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ----------------------------- [1.0, 2.0, NaN, 100.0, NaN] (1 row) |
map_union_sum | nan keys are duplicated and values not summed. in display it looks deduplicated, but if you get map_keys you can see they are not | SELECT map_union_sum(x) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ----------------------------------- {1.0=3, 2.0=24, NaN=30, 100.0=10} (1 row) SELECT map_keys(map_union_sum(x)) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ----------------------------- [1.0, 2.0, NaN, 100.0, NaN] (1 row) |
Array functions | ||
array_min | NaN if NaN is in any position in the array | select array_min(array[nan(),1, 2, 3]); _col0 ------- NaN (1 row) select array_min(array[1, nan(), 2, 3]); _col0 ------- NaN (1 row) |
array_max | NaN if NaN is in any position in the array | select array_max(array[nan(), 1, 2, 3]); _col0 ------- NaN (1 row)select array_max(array[1, nan(), 2, 3]); _col0 ------- NaN (1 row) |
array_sort | NaN is biggest | select array_sort(array[nan(), 1, 3, 2]); _col0 ---------------------- [1.0, 2.0, 3.0, NaN] (1 row)select array_sort(array[1, 3, nan(), 2]); _col0 ---------------------- [1.0, 2.0, 3.0, NaN] (1 row) |
array_distinct() | Doesn't deduplicate NaN elements | select array_distinct(array[1, nan(), 2, nan(), null, null]); _col0 ---------------------------- [1.0, NaN, 2.0, NaN, null] (1 row) |
array_duplicates() | Doesn't list NaN as duplicate elements | select array_duplicates(array[1, nan(), 2, nan(), null, null, 1]); _col0 ------------- [null, 1.0] (1 row) |
array_except() | Always includes NaN, even if it's in both arrays | select array_except(array[1, 2, nan()], array[3, 4, nan()]); _col0 ----------------- [1.0, 2.0, NaN] |
array_has_duplicates | returns false if NaN is the only duplicate | select array_has_duplicates(array[1, nan(), 2, nan()]); _col0 ------- false (1 row) |
array_intersect | nan() is excluded from the intersection even if it's in both | select array_intersect(array[1, 2, nan()], array[3, 4, nan()]); _col0 ------- [] (1 row) |
array_union() | nan() is included multiple times | select array_union(array[1, 2, nan()], array[3, 4, nan()]); _col0 -------------------------------- [1.0, 2.0, NaN, 3.0, 4.0, NaN] |
array_least_frequent | nan() are all considered different from each other, so even if they appear multiple times, each one is considered as appearing once. | select array_least_frequent(array[1,1, 2,2, nan(), nan(), nan()]); _col0 ------- [NaN] (1 row) select array_least_frequent(array[1, 2,2, nan(), nan(), nan()]); _col0 ------- [1.0] (1 row) |
array_least_frequent(x, n) | nan() appears however many times it appears in the array. For ties in frequency it is considered the largest number. | select array_least_frequent(array[1, 1, 2,2, nan(), nan(), nan()], 5); _col0 --------------------------- [NaN, NaN, NaN, 1.0, 2.0] (1 row)select array_least_frequent(array[2,2, nan(), nan(), nan(), 1.0], 5); _col0 --------------------------- [1.0, NaN, NaN, NaN, 2.0] (1 row) |
array_position | nan() is not found | presto:di> select array_position(array[2,2, nan(), nan(), nan(), 1.0], nan()); _col0 ------- 0 (1 row) |
array_remove | does not remove nan | select array_remove(array[1, 2, nan()], nan()); _col0 ----------------- [1.0, 2.0, NaN] (1 row) |
array_top_n | nan is biggest | select array_top_n(array[1, 2, nan(), infinity()],3); _col0 ---------------------- [NaN, Infinity, 2.0] (1 row) |
arrays_overlap | returns false if NaN is the only overlap | select arrays_overlap(array[1, 2, nan()], array[3, 4, nan()]); _col0 ------- false (1 row) |
contains(x, nan()) | checking for contains nan() always returns false | select contains(array[nan(), nan(), nan()], nan()); _col0 ------- false (1 row) |
Maps | ||
creating a map with duplicate nan keys | does not throw an error (unlike other duplicate values). The display shows only one of them, but if you try to get map_keys you'll see nan 2xNote that using double keys will always return a warning that floating point types can be non-deterministic | select map(array[1, nan(), nan()], array['a', 'b','c']); _col0 ---------------- {1.0=a, NaN=c} (1 row)vs.select map(array[1, nan(), 1], array['a', 'b','c'])[nan()]; Query 20240227_210132_03209_xptub failed: Duplicate map keys (1.0) are not allowed |
subscript, element_at, map_subset | returns null for nan keys (i.e. doesn't find them) | select map(array[1, nan(), nan()], array['a', 'b','c'])[nan()]; _col0 ------- NULL (1 row) select element_at(map(array[1, nan(), 2], array['a', 'b','c']), nan()); _col0 ------- NULL (1 row)select map_subset(map(array[1, nan(), 2], array['a', 'b','c']), array[nan(), 1]); _col0 --------- {1.0=a} (1 row) |
map_subset | doesn't include nan keys in the returned map (i.e. doesn't find them) | select map_subset(map(array[1, nan(), 2], array['a', 'b','c']), array[nan(), 1]); _col0 --------- {1.0=a} (1 row) |
map_keys | returns all keys including duplicate NaN keys | select map_keys(map(array[1, nan(), nan()], array['a', 'b','c'])); _col0 ----------------- [1.0, NaN, NaN] (1 row) |
map_top_n_keys | nan keys are biggest regardless of position | select map_top_n_keys(map(array[1, nan(), nan()], array['a', 'b','c']),3); _col0 ----------------- [NaN, NaN, 1.0] (1 row) |
map_top_n | [BUG] nan ordering is dependent on map order. elements are evaluated in order, and nan is larger than anything before it, and smaller than anything after it. That means something after nan may be considered biggest even if something before nan was bigger. https://github.com/prestodb/presto/issues/22040 | select map_top_n(map(array[1, 2, 3], array[nan(), 2, -infinity()]),2); _col0 ---------------------- {2=2.0, 3=-Infinity} (1 row)select map_top_n(map(array[1, 2, 3, 4], array[1, nan(), infinity(), -infinity()]),1); _col0 -------------- {3=Infinity}select map_top_n(map(array[1, 2, 3], array[1, infinity(), nan()]),1); _col0 --------- {3=NaN} (1 row)-- bug!! -infinity is reported as the largest here because it's the only thing after nan select map_top_n(map(array[1, 2, 3, 4], array[1, infinity(), nan(), -infinity()]),1); _col0 --------------- {4=-Infinity} (1 row) |
map_top_n_values | nan is the biggest regardless of position | select map_top_n_values(map(array[1, 2 ,3, 4], array[1,nan(), 2, 0]),2); _col0 ------------ [NaN, 2.0] select map_top_n_values(map(array[1, 2 ,3, 4], array[nan(), 1, 2, 0]),2); _col0 ------------ [NaN, 2.0] (1 row)select map_top_n_values(map(array[1, 2 ,3, 4], array[1, 2, 2, nan()]),2); _col0 ------------ [NaN, 2.0] (1 row) |
File format min/max (and sorted columns) | We can/do use min and max values supplied by the file format for the hive readers (e.g. partial aggregation pushdown, filter pushdown). The handling of nan in regards to min/max may be different across different file formats and may or may not align to whatever decision we make. Additionally, we may read from files that are not written by our readers/writers and therefore have different behavior. Changes may be required to these readers to make our behaviors fully consistent. I don't think Presto Java currently uses sortedness of files, but I think velox does. In addition to ensuring correctness/consistency for filtering related usages fo sortedness, we would also need to make sure e.g. for sort merge joins that the inputs sort nans in the same way | |
parquet | Presto Parquet writers do not include min/max stats when there are nans. Velox Parquet writers will write min/max stats but exclude NaN values. Other parquet implementations do otherwise, and there are also other proposals for what to do, e.g. see: https://github.com/apache/parquet-cpp/pull/444, https://github.com/apache/arrow-rs/issues/264, https://github.com/apache/parquet-format/pull/221, https://github.com/apache/parquet-format/pull/196 | |
orc/dwrf | The Presto orc/dwrf writer will exclude all statistics if any of the values are NaN. https://github.com/prestodb/presto/blob/master/presto-orc/src/main/java/com/facebook/presto/orc/metadata/statistics/DoubleStatisticsBuilder.java#L75-L78. The Velox dwrf writer does the same https://github.com/facebookincubator/velox/blob/main/velox/dwio/dwrf/writer/StatisticsBuilder.h#L270-L273. don't know about other implementations | |
other file formats | didn't check, needs more investigation | |
Tuple domains | TupleDomain.compareTo will use the compareTo method for the type, which for DoubleType is Double.compare(), so NaN is largest | |
TupleDomainFilter (filter pushdown) | tuple domain filters behave the same as <, >, =, and <> operators. They return false except for testing whether NaN is NOT IN the filter | |
Other stats | ||
metastore provided table/partition statistics | We do not currently use metastore stats for correctness. only for estimation, and do not consider them fully reliable, so this is not a concern. If we were to change that assumption, we would need to deal with this case (and how we evaluate them in the code). It is worth considering what the metastore expects when we provide min/max stats |