Panic when printing debug arrays with invalid timezones #903
Comments
Thanks a lot for reporting it 🙇 Do you have a suggestion on what the intended behavior should be? Some ideas:
ideas?
Based on the Arrow spec, should an empty string as timezone be considered an invalid datatype? It seems reasonable to panic or return a Result::Err when invalid data is encountered. The timestamp datatype validation probably should have happened much earlier in the pipeline, ideally when the timestamp array is being created.
Hmm I'm not sure - with the 0.8 API of However with the recent changes to use this code in I think I'd expect the debug output to include as "raw" a version of the data as possible (so not do any parsing of the timezone string at all - so if the timezone is set to an empty string or a 🦝 emoji or whatever, then, just display it)
Interesting. I'm not too familiar with the spec, but after a quick skim over the C data API docs (which is where this data is coming from in my case), it just says "The timezone string is appended as-is after the colon character :, without any quotes. If the timezone is empty, the colon : must still be included." - I guess this might be where the empty timezone string is coming from. I still need to trace exactly which code is doing the conversion of the Python datetime objects (it might potentially be a bug in the oldish version of pyarrow we are using 🤔 )
The challenge with the validation on the array is that although we may not support printing a timezone (e.g. The problem is that some of them can't be represented in debug. We could just make debug not print times and dates, but I also feel that seeing As a quick fix, we could make |
This would work for me; although I wonder if perhaps the debug output could just try to format the timestamp, and if it fails then fallback to the raw |
Good idea, I like it. Would you like to PR it?
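The fallback idea above (try to format the timestamp, and show the raw data if that fails) could look roughly like the sketch below. This is not arrow2's code: `parse_offset_seconds`, `render` and `debug_timestamp` are hypothetical names, the value is assumed to be in seconds, and only fixed offsets such as `+01:00` are understood. Anything else, including an empty string, falls through to the raw output instead of panicking.

```rust
/// Hypothetical helper: parse a fixed offset such as "+01:00" into seconds.
/// Returns None for anything it does not understand (e.g. "" or "🦝").
fn parse_offset_seconds(tz: &str) -> Option<i64> {
    let (sign, rest) = match tz.chars().next()? {
        '+' => (1, &tz[1..]),
        '-' => (-1, &tz[1..]),
        _ => return None,
    };
    let (hours, minutes) = rest.split_once(':')?;
    // u8 keeps the arithmetic below far away from overflow.
    let hours = i64::from(hours.parse::<u8>().ok()?);
    let minutes = i64::from(minutes.parse::<u8>().ok()?);
    Some(sign * (hours * 3600 + minutes * 60))
}

/// Render seconds since the epoch as a day count plus time of day.
/// A real implementation would print a calendar date instead.
fn render(seconds: i64) -> String {
    let days = seconds.div_euclid(86_400);
    let rem = seconds.rem_euclid(86_400);
    format!("day {} {:02}:{:02}:{:02}", days, rem / 3600, (rem % 3600) / 60, rem % 60)
}

/// Debug-style formatting that never panics on an unknown timezone.
pub fn debug_timestamp(value: i64, tz: Option<&str>) -> String {
    match tz {
        None => render(value),
        Some(tz) => {
            // Try the "nice" path first; checked_add avoids an overflow panic.
            let local = parse_offset_seconds(tz).and_then(|offset| value.checked_add(offset));
            match local {
                Some(local) => format!("{} {}", render(local), tz),
                // Fallback: show the raw value and the raw timezone string as-is.
                None => format!("{} (tz: {:?})", value, tz),
            }
        }
    }
}
```

With this shape, `debug_timestamp(0, Some(""))` yields the raw value and the quoted empty timezone rather than a panic, which keeps Debug usable for inspecting exactly the kind of data that triggered this issue.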
Sorry for the very belated update! I've made a few quick attempts at making this change, but it is tricky Changing The problem is it requires changing lots of other functions, e.g
..which is fine, but some of these methods are also used in boxed-fn's which return I suspect with a bit more persistence this approach will "almost" work, but also have a suspicion it might get stuck on the nested types or something. It also starts to make the string-formatting-code pretty hard to follow, and resulted in some questionable looking code like |
Looking into why the empty timezone string occurs in the first place, I think it is a bug in the If there is no timezone info, the format string is something like https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
However the ffi code treats this as a timezone named an empty string, instead of mapping it to
https://github.com/jorgecarleitao/arrow2/blob/v0.11.2/src/ffi/schema.rs#L291-L293
let parts = other.split(':').collect::<Vec<_>>();
if parts.len() == 2 && parts[0] == "tss" {
    DataType::Timestamp(TimeUnit::Second, Some(parts[1].to_string()))
Shall make a PR for this shortly - assuming I'm not mistaken, I think it's still worth removing the panicking pathways from Debug, but fixing the parsing is much simpler for now!
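A minimal sketch of the parsing fix described above, assuming simplified stand-in types: the `TimeUnit` and `DataType` enums here are cut-down illustrations rather than arrow2's real definitions, and `parse_timestamp_format` is a hypothetical name. The point is only that an empty timezone after the colon becomes `None` instead of `Some("")`.

```rust
/// Cut-down stand-ins for the arrow2 types (assumptions for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Timestamp(TimeUnit, Option<String>),
}

/// Parse a C data interface timestamp format string such as "tss:" or "tsu:UTC".
/// Returns None when the string is not a timestamp format.
pub fn parse_timestamp_format(format: &str) -> Option<DataType> {
    // Split on the first colon only, so an offset like "+01:00" stays intact.
    let (prefix, tz) = format.split_once(':')?;
    let unit = match prefix {
        "tss" => TimeUnit::Second,
        "tsm" => TimeUnit::Millisecond,
        "tsu" => TimeUnit::Microsecond,
        "tsn" => TimeUnit::Nanosecond,
        _ => return None,
    };
    // An empty timezone means "no timezone", not a timezone named "".
    let tz = if tz.is_empty() { None } else { Some(tz.to_string()) };
    Some(DataType::Timestamp(unit, tz))
}
```

Splitting on the first colon only is a deliberate choice in this sketch: with `split(':')` and a `parts.len() == 2` check, a format string like `tss:+01:00` produces three parts and would not match.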
Hey! Awesome analysis of the problem! I have a draft PR with a proposal to address this, #1013. Would you be willing to review it?
Bit tricky to untangle into a simple repro case, but hopefully enough info:
If I create a native datetime object via a convoluted path of Python into arrow2, something vaguely along these lines:
..then I get the following panic:
..which is happening on the 0.9 equivalent of this line:
arrow2/src/array/primitive/fmt.rs
Line 69 in 4b893b7
Oddly, this wasn't happening when we were using 0.8.1.. but the get_display code seems basically identical in that version.. That said I've changed quite a lot of our arrow/Python integration code between then, so I need to investigate further what caused this - regardless, it seems surprising the get_display method would panic like this