Read Decimal from Parquet File #444
Comments
Currently, here is some makeshift code I wrote to read decimals. It seems to work in my case, on many Parquet data files with non-nested decimal columns:
use io::parquet::read::{page_iter_to_array, ParquetType, PhysicalType};
...
match metadata.descriptor().type_() {
    ParquetType::PrimitiveType { physical_type, .. } => {
        match physical_type {
            PhysicalType::Int32 => page_iter_to_array(
                &mut pages,
                metadata,
                DataType::Int32,
            )
            .map(|ar| {
                Box::new(
                    PrimitiveArray::<i128>::from_trusted_len_iter(
                        ar.as_any()
                            .downcast_ref::<PrimitiveArray<i32>>()
                            .unwrap()
                            .iter()
                            .map(|e| e.map(|e| *e as i128)),
                    )
                    .to(data_type),
                )
                as Box<dyn Array>
            }),
            PhysicalType::Int64 => page_iter_to_array(
                &mut pages,
                metadata,
                DataType::Int64,
            )
            .map(|ar| {
                Box::new(
                    PrimitiveArray::<i128>::from_trusted_len_iter(
                        ar.as_any()
                            .downcast_ref::<PrimitiveArray<i64>>()
                            .unwrap()
                            .iter()
                            .map(|e| e.map(|e| *e as i128)),
                    )
                    .to(data_type),
                )
                as Box<dyn Array>
            }),
            PhysicalType::FixedLenByteArray(n) => {
                page_iter_to_array(
                    &mut pages,
                    metadata,
                    DataType::FixedSizeBinary(*n),
                )
                .map(
                    |ar| {
                        let v = ar
                            .as_any()
                            .downcast_ref::<FixedSizeBinaryArray>()
                            .unwrap()
                            .iter()
                            .map(|e| {
                                e.and_then(|e| {
                                    match e
                                        .into_iter()
                                        .rev()
                                        .map(|e| *e)
                                        .pad_using(16, |_| 0u8)
                                        .rev()
                                        .collect_vec()
                                        .try_into()
                                    {
                                        Ok(v) => {
                                            Some(i128::from_be_bytes(v))
                                        }
                                        Err(_) => None,
                                    }
                                })
                            }).collect_vec();
                        Box::new(
                            PrimitiveArray::<i128>::from_trusted_len_iter(v.into_iter()).to(data_type)
                        ) as Box<dyn Array>
                    },
                )
            }
            _ => unreachable!(),
        }
    }
    _ => unreachable!(),
},
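One detail worth checking in the snippet above: Parquet stores FIXED_LEN_BYTE_ARRAY decimals as big-endian two's complement, so padding with zero bytes decodes a negative value as a large positive one whenever the array is shorter than 16 bytes. Below is a minimal standalone sketch of the per-value widening for the three physical types, with sign extension for the byte-array case. `PhysicalDecimal` and `to_i128` are hypothetical names made up for illustration; they are not arrow2 or parquet2 APIs.

```rust
/// Stand-in for the three parquet physical encodings of one decimal value.
/// (Hypothetical type for this sketch, not part of arrow2 or parquet2.)
pub enum PhysicalDecimal<'a> {
    Int32(i32),
    Int64(i64),
    /// Big-endian two's complement bytes, as stored in FIXED_LEN_BYTE_ARRAY.
    FixedLen(&'a [u8]),
}

/// Widen a physical value to the i128 backing a Decimal128 array.
/// Returns None when the byte slice is empty or longer than 16 bytes.
pub fn to_i128(value: PhysicalDecimal) -> Option<i128> {
    match value {
        // Integer widening preserves the sign automatically.
        PhysicalDecimal::Int32(v) => Some(i128::from(v)),
        PhysicalDecimal::Int64(v) => Some(i128::from(v)),
        PhysicalDecimal::FixedLen(bytes) => {
            if bytes.is_empty() || bytes.len() > 16 {
                return None;
            }
            // Sign-extend: pad with 0xFF when the top bit is set, else 0x00.
            let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
            let mut buf = [fill; 16];
            buf[16 - bytes.len()..].copy_from_slice(bytes);
            Some(i128::from_be_bytes(buf))
        }
    }
}
```

With this, the two bytes `[0xFF, 0xFE]` decode to -2, whereas zero padding would give 65534.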
That snippet seems correct! I would add one roundtrip test on
Let me know if you would like to work on this or if you would like me to take it =)
Hello, sorry for the late reply, been a busy week. I would take this one, including the tests.
Hey @potter420, no worries. Exactly, so that we demonstrate interoperability with the 3 physical types. We have a script here where we generate parquet files as part of the tests. The data placed there is then replicated around here, which is compared in tests such as this one. So, something like
Note that we also test statistics, so those are also needed.
Ah, for testing locally, I usually create a venv as we do in the CI here
Hmm, my test failed miserably. According to
It seems we have to implement FIXED_LEN_BYTE_ARRAY statistics as well, or perhaps I can set the statistics to null for the moment?
Hi, upon further investigation, it seems to me that the
So if decimals are not the only type that comes from
Well, that is a great summary: there is indeed a missing descriptor on the statistics to differentiate between logical types. I totally forgot about using FixedLen in parquet for multiple logical types :(
Hello, I've been looking at the statistics. Should we implement a separate statistics struct for decimals?
pub struct DecimalStatistics {
    pub null_count: Option<i64>,
    pub distinct_count: Option<i64>,
    pub min_value: Option<i128>,
    pub max_value: Option<i128>,
    pub data_type: DataType
}
As I can see from the
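To make the proposal concrete, here is a rough sketch of how raw min/max statistics bytes could be turned into such a struct. It assumes the statistics values are PLAIN-encoded (little-endian for INT32 and INT64, raw big-endian two's complement bytes for FIXED_LEN_BYTE_ARRAY). The `Physical` enum, `decode_stat`, and `decimal_statistics` are hypothetical helpers, and `precision`/`scale` replace the `data_type: DataType` field only so the example compiles without arrow2.

```rust
use std::convert::TryFrom;

/// Physical parquet type backing a decimal column (stand-in enum for this sketch).
#[derive(Clone, Copy)]
pub enum Physical {
    Int32,
    Int64,
    FixedLen(usize),
}

/// Mirrors the proposed struct; `precision` and `scale` stand in for
/// the `data_type: DataType` field so the sketch stays self-contained.
#[derive(Debug, PartialEq)]
pub struct DecimalStatistics {
    pub null_count: Option<i64>,
    pub distinct_count: Option<i64>,
    pub min_value: Option<i128>,
    pub max_value: Option<i128>,
    pub precision: usize,
    pub scale: usize,
}

/// Decode one raw min/max value. Assumes PLAIN encoding: little-endian for
/// INT32/INT64, big-endian two's complement for FIXED_LEN_BYTE_ARRAY.
/// Returns None when the byte length does not match the physical type.
pub fn decode_stat(bytes: &[u8], physical: Physical) -> Option<i128> {
    match physical {
        Physical::Int32 => <[u8; 4]>::try_from(bytes)
            .ok()
            .map(|b| i128::from(i32::from_le_bytes(b))),
        Physical::Int64 => <[u8; 8]>::try_from(bytes)
            .ok()
            .map(|b| i128::from(i64::from_le_bytes(b))),
        Physical::FixedLen(n) => {
            if n == 0 || n > 16 || bytes.len() != n {
                return None;
            }
            // Sign-extend the big-endian bytes to 16 bytes.
            let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
            let mut buf = [fill; 16];
            buf[16 - n..].copy_from_slice(bytes);
            Some(i128::from_be_bytes(buf))
        }
    }
}

/// Build the statistics struct from optional raw min/max bytes.
pub fn decimal_statistics(
    raw_min: Option<&[u8]>,
    raw_max: Option<&[u8]>,
    null_count: Option<i64>,
    physical: Physical,
    precision: usize,
    scale: usize,
) -> DecimalStatistics {
    DecimalStatistics {
        null_count,
        distinct_count: None,
        min_value: raw_min.and_then(|b| decode_stat(b, physical)),
        max_value: raw_max.and_then(|b| decode_stat(b, physical)),
        precision,
        scale,
    }
}
```

Keeping the physical type as an explicit input is one way to handle the point raised earlier in the thread, that fixed-length byte statistics alone do not say which logical type they belong to.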
IMO the design of the stats in
This way,
I.e. the route to deserialize should be something like:
The first two items are done by the
Thanks, I will implement something like
And put
Is that correct, and will it help you with the rewrite later?
It seems to me the scope is getting bigger and bigger, and I'm not complaining 😄. Is there anywhere else I have to watch out for as well? Thanks
That is awesome! I agree that the deserialization is equivalent (it also works for
We do not support decimal256 yet, so we will have to panic, truncate or simply use
Let me know if you would like me to implement something and I will follow your lead. :)
I went ahead and created a pull request to add writing reduced
As for
Err(ArrowError::NotYetImplemented(format!(
    "Can't decode Decimal128 type from Fixed Size Byte Array of len {:?}",
    n
)))
I'll just put everything in
And the deserialization bit goes like this, since I don't want to introduce
let paddings = (0..(16-*n)).map(|_| 0u8).collect::<Vec<_>>();
fixed_size_binary::iter_to_array(iter, DataType::FixedSizeBinary(*n), metadata)
    .map(|e| {
        let a = e.into_iter().map(|v|
            v.and_then(|v1| {
                [&paddings, v1].concat().try_into().map(
                    |pad16| i128::from_be_bytes(pad16)
                ).ok()
            })
        ).collect::<Vec<_>>();
        Box::new(PrimitiveArray::<i128>::from(a).to(data_type)) as Box<dyn Array>
    })
The
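Since a roundtrip test was requested earlier in the thread, here is a small self-contained sketch of what a value-level roundtrip could look like: encode an i128 into `n` big-endian bytes, decode it back, and check both signs. `encode_fixed_len` and `decode_fixed_len` are hypothetical helpers, not arrow2 functions. Note that the decoder sign-extends rather than using a fixed zero `paddings` prefix, because zero padding would turn negative values positive for `n < 16`.

```rust
/// Encode `value` as `n` big-endian two's complement bytes.
/// Returns None when `n` is 0, above 16, or too small to hold the value.
pub fn encode_fixed_len(value: i128, n: usize) -> Option<Vec<u8>> {
    if n == 0 || n > 16 {
        return None;
    }
    let full = value.to_be_bytes();
    let (dropped, kept) = full.split_at(16 - n);
    let fill = if value < 0 { 0xFF } else { 0x00 };
    // The dropped bytes must be pure sign extension, and the top bit of the
    // kept bytes must still carry the sign of the value.
    let sign_ok = (kept[0] & 0x80 != 0) == (value < 0);
    if dropped.iter().all(|b| *b == fill) && sign_ok {
        Some(kept.to_vec())
    } else {
        None
    }
}

/// Decode big-endian two's complement bytes (1 to 16 of them) into an i128.
pub fn decode_fixed_len(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Sign-extend instead of zero-padding.
    let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}
```

A test can then assert that `decode_fixed_len(&encode_fixed_len(v, n)?)` returns `v` for a mix of positive and negative values at several widths.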
Resolved at #489
Hi,
So far we have the capability to write decimal to parquet. I wonder if we can implement reading decimal value from parquet file as well.
Thank you very much.