Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Panic when printing debug arrays with invalid timezones #903

Closed
dbr opened this issue Mar 11, 2022 · 9 comments · Fixed by #1013
Closed

Panic when printing debug arrays with invalid timezones #903

dbr opened this issue Mar 11, 2022 · 9 comments · Fixed by #1013
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@dbr
Copy link
Contributor

dbr commented Mar 11, 2022

Bit tricky to untangle into a simple repro case, but hopefully enough info:

If I create a native datetime object via a convoluted path of Python into arrow2, something vaguely along these lines:

import pandas
import datetime

df = pandas.DataFrame.from_dict({'a': [datetime.datetime.now()]})

custom_method_from_pandas_to_pyarrow_to_arrow2(df)

..then I get the following panic:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("timezone \"\" cannot be parsed")', /home/dbr/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2-0.9.2/src/array/display.rs:68:82

..which is happening on the 0.9 equivelant of this line:

let timezone = temporal_conversions::parse_offset_tz(tz).unwrap();

Oddly, this wasn't happening when we were using 0.8.1.. but the get_display code seems basically identical in that version.. That said I've changed quite a lot of our arrow/Python integration code between then, so I need to investigate further what caused this - regardless, it seems surprising the get_display method would panic like this

@jorgecarleitao
Copy link
Owner

Thanks a lot for reporting it 🙇

Do you have a suggestion on what the intended behavior should be? Some ideas:

  • write something like "ERROR: can't parse timezone "" in each array's item
  • present the value in rfc3339 but present an "ERROR" in the timezone

ideas?

@houqp
Copy link
Collaborator

houqp commented Mar 13, 2022

Based on arrow spec, empty string as timezone should be considered as invalid datatype? It seems reasonable to panic or return a Result::Err when invalid data are encountered. The timestamp datatype validation probably should have happened much earlier in the pipeline, ideally when the timestamp array is being created.

@dbr
Copy link
Contributor Author

dbr commented Mar 15, 2022

Do you have a suggestion on what the intended behavior should be?

Hmm I'm not sure - with the 0.8 API of arrow2::array::get_display(col)(idx) I would expect it to return a Result if it does any non-trivial parsing (as is done with the timezones)

However with the recent changes to use this code in Debug, it's harder to say. Panicing is not ideal at all for us (since the data is coming from user, and we are using the get_display call in a UI)

I think I'd expect the debug output to include as "raw" a version of the data as possible (so not do any parsing of the timezone string at all - so if the timezone is set to an empty string or a 🦝 emoji or whatever, then, just display it)

Based on arrow spec, empty string as timezone should be considered as invalid datatype?

Interesting. I'm not too familiar with the spec, but after a quick skim over the C data API docs (which is where this data is coming from in my case), it just says "The timezone string is appended as-is after the colon character :, without any quotes. If the timezone is empty, the colon : must still be included." - I guess this might be where the empty timezone string is coming from

I still need to trace exactly which code is doing the conversion of the Python datetime objects (it might potentially be a bug in oldish version of pyarrow we are using 🤔 )

@jorgecarleitao
Copy link
Owner

jorgecarleitao commented Mar 15, 2022

The challenge with the validation on the array is that although we may not support printing a timezone (e.g. chrono-tz feature is not active), users may still want to use it, e.g. for transparent conversion of parquet to arrow or something, or when they know that the string is invalid but they want to load it from parquet to cast it to another type or specific timezone.

The problem is that some of them can't be represented in debug. We could just make debug not print times and dates, but I also feel that seeing 1647328880 is quite a mental load when compared to 2022-03-15 08:22 CET.

As a quick fix, we could make get_display to return an error and only panic on Debug, until we find a way forward for how to represent in Debug.

@dbr
Copy link
Contributor Author

dbr commented Mar 15, 2022

As a quick fix, we could make get_display to return an error and only panic on Debug, until we find a way forward for how to represent in Debug.

This would work for me; although I wonder if perhaps the debug output could just try to format the timestamp, and if it fails then fallback to the raw 1647328880 output (or even Timestamp 1647328880 with unparable timestamp "" or something)

@jorgecarleitao
Copy link
Owner

Good idea, I like it. Would you like to PR it?

@jorgecarleitao jorgecarleitao added enhancement An improvement to an existing feature bug Something isn't working and removed enhancement An improvement to an existing feature labels Mar 22, 2022
@jorgecarleitao jorgecarleitao changed the title Panic with naive timezones Panic when printing debug arrays with invalid timezones Mar 22, 2022
@dbr
Copy link
Contributor Author

dbr commented May 25, 2022

Sorry for the very belated update! I've made a few quick attempts at making this change, but it is tricky

Changing get_display to return something like arrow2::error::Result<Box dyn Fn ... and using this in the impl Debug looks promising. This way, initial problems like "unparsable timezone" can be turned from panics into descriptive Err(ArrowError::...), while the boxed-fns containign all the write! calls write can return std::fmt::Error as they do now

The problem is it requires changing lots of other functions, e.g

pub fn get_value_display<'a, F: Write + 'a>(
    array: &'a dyn Array,
    null: &'static str,
) -> crate::error::Result<Box<dyn Fn(&mut F, usize) -> std::fmt::Result + 'a>> {

..which is fine, but some of these methods are also used in boxed-fn's which return std::fmt::Result (which then can't propogate the ArrowError)

I suspect with a bit more persistence this approach will "almost" work, but also have a suspicion it might get stuck on the nested types or something. It also starts to make the string-formatting-code pretty hard to follow, and resulted in some questionable looking code like Ok(Ok(())) - so I'm increasingly skeptical it's the right way to go about this.

@dbr
Copy link
Contributor Author

dbr commented May 26, 2022

Looking into why the empty timezone string occurs in the first place, I think it is a bug in the arrow2::ffi::schema::to_data_type method

If there is no timezone info, the format string is something like "tss:"

https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings

The timezone string is appended as-is after the colon character :, without any quotes. If the timezone is empty, the colon : must still be included

However the ffi code treats this as timezone named an empty string, instead of mapping it to None

https://github.com/jorgecarleitao/arrow2/blob/v0.11.2/src/ffi/schema.rs#L291-L293

let parts = other.split(':').collect::<Vec<_>>();
if parts.len() == 2 && parts[0] == "tss" {
    DataType::Timestamp(TimeUnit::Second, Some(parts[1].to_string()))

Shall make a PR for this shortly - assuming I'm not mistaken, I think it's still worth removing the panicing pathways from Debug, but fixing the parsing is much simpler for now!

@jorgecarleitao
Copy link
Owner

Hey! Awesome analysis of the problem!

I have a draft PR with a proposal to address this, #1013 . Would you be willing to review it?

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label May 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants