Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataType::Decimal Non-Compliant? #1779

Closed
tustvold opened this issue Jun 2, 2022 · 7 comments
Closed

DataType::Decimal Non-Compliant? #1779

tustvold opened this issue Jun 2, 2022 · 7 comments
Labels
arrow Changes to the arrow crate question Further information is requested

Comments

@tustvold
Copy link
Contributor

tustvold commented Jun 2, 2022

Describe the bug

DataType::Decimal is defined as

/// Exact decimal value with precision and scale
///
/// * precision is the total number of digits
/// * scale is the number of digits past the decimal
///
/// For example the number 123.45 has precision 5 and scale 2.
Decimal(usize, usize),

This appears to be at odds with both the C++ and python implementations (I can't actually find the specification).

These define it as

Arrow decimals are fixed-point decimal numbers encoded as a scaled integer. The precision is the number of significant digits that the decimal type can represent; the scale is the number of digits after the decimal point (note the scale can be negative).

i.e. unscaledValue * 10^(-scale)

In particular with the current rust definition it is unclear how to represent numbers with more than 38 digits, either because of leading or trailing 0s.

To Reproduce

Inspect code

Expected behavior

We should be conforming to the other arrow implementations

Additional context

Noticed whilst reviewing apache/datafusion#2680

The parquet logical type is similarly defined - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

@tustvold
Copy link
Contributor Author

tustvold commented Jun 2, 2022

To add to the confusion value and value_as_string seem to interpret the data differently, with value_as_string inline with the arrow specification, and value interpreting it as defined on DataType.

#[test]
fn test_decimal_array() {
    // let val_8887: [u8; 16] = [192, 219, 180, 17, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
    // let val_neg_8887: [u8; 16] = [64, 36, 75, 238, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255];
    let values: [u8; 32] = [
        192, 219, 180, 17, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 36, 75, 238, 253,
        255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    ];
    let array_data = ArrayData::builder(DataType::Decimal(23, 6))
        .len(2)
        .add_buffer(Buffer::from(&values[..]))
        .build()
        .unwrap();
    let decimal_array = DecimalArray::from(array_data);
    assert_eq!(8_887_000_000, decimal_array.value(0));
    assert_eq!("8887.000000", decimal_array.value_as_string(0));
    assert_eq!(-8_887_000_000, decimal_array.value(1));
    assert_eq!(16, decimal_array.value_length());
}

Decimal support appears to have been added back in apache/arrow#8640, perhaps @nevi-me, @alamb or @jorgecarleitao have some recollection of what is going on here

@alamb
Copy link
Contributor

alamb commented Jun 3, 2022

I don't remember any details -- the discrepancy is likely my ignorance

@jorgecarleitao
Copy link
Member

Sorry, I am probably misreading this - I am not following what is the non-compliance.

The Python documentation exemplifies:

As an example, decimal128(7, 3) can exactly represent the numbers 1234.567 and -1234.567 (encoded internally as the 128-bit integers 1234567 and -1234567, respectively)

(so, 7 digits, 3 of them are decimal)

we offer the example

For example the number 123.45 has precision 5 and scale 2.

The spec is https://github.com/apache/arrow/blob/master/format/Schema.fbs#L185 (copied for readability below)

/// Exact decimal value represented as an integer value in two's
/// complement. Currently only 128-bit (16-byte) and 256-bit (32-byte) integers
/// are used. The representation uses the endianness indicated
/// in the Schema.
table Decimal {
  /// Total number of decimal digits
  precision: int;

  /// Number of digits after the decimal point "."
  scale: int;

  /// Number of bits per value. The only accepted widths are 128 and 256.
  /// We use bitWidth for consistency with Int::bitWidth.
  bitWidth: int = 128;
}

Aren't they all aligned?

In particular with the current rust definition it is unclear how to represent numbers with more than 38 digits, either because of leading or trailing 0s.

The Python docs continue

Decimal128Type has a maximum precision of 38 significant digits

which also seems aligned?

@tustvold
Copy link
Contributor Author

tustvold commented Jun 3, 2022

The spec is

Thank you for the link, that is very unfortunately worded but explains where this confusion has originated from. My interpretation of those comments would be that the represented values would have precision digits, with the decimal point located somewhere in that range. This also was the interpretation initially taken on apache/datafusion#2680, and appears also to at least initially have been the interpretation of the decimal implementation in arrow-rs.

If one takes the python and C++ implementations as correct, what the are actually saying is that the mantissa has precision digits, and scale is the exponent, but then talks about digits past a decimal point, which is somewhat inconsistent....

To highlight the distinction, consider the number 1000 or 0.0001. Do these have 4 digits or 1 digit? If they have 1 digit, how many digits do they have past the decimal point?

Ultimately I think there are at least these concrete issues that should be fixed:

  • The scale field within DataType::Decimal should be signed
  • The documentation should refer to scale as an exponent, and not potentially misleading notions of decimal points and digits
  • DecimalArray::value shouldn't be applying the scale to the value at all
  • We should probably audit the other places that interpret decimal values for consistency

@alamb
Copy link
Contributor

alamb commented Jun 4, 2022

I think "Fixed Width Decimal" is a fairly common concept and is used for some specialized usecases (like currency), and I think the arrow spec basically follow the standard practice. https://en.wikipedia.org/wiki/Fixed-point_arithmetic#Representation

Like @jorgecarleitao , I don't see the problem.

1000 or 0.0001. Do these have 4 digits or 1 digit? If they have 1 digit, how many digits do they have past the decimal point?

I think the better question is "how would these be represented in different Decimal types"

For Decimal(5, 1) they would be represented by i128s with the values 10000 and 00001 respectively

You could also represent 1000 as Decimal(5, 0) with i128 value 10000 or Decimal(4,0) with i128 value 1000.

In particular with the current rust definition it is unclear how to represent numbers with more than 38 digits

I do not think that is possible. The reason being that i128::MAX is 170141183460469231731687303715884105727, which can represent only 38 decimal digits

@alamb
Copy link
Contributor

alamb commented Jun 4, 2022

The scale field within DataType::Decimal should be signed

I agree

The documentation should refer to scale as an exponent, and not potentially misleading notions of decimal points and digits

I am not sure about this -- I think the documentation could be clarified but I think talking about scale/exponent/mantissa for fixed width representation will be very confusing as they are floating point concepts.

DecimalArray::value shouldn't be applying the scale to the value at all

Definitely worth discussing -- I suspect some people would like DecimalArray to convert to f64 and some would like to do the conversion themselves. I can see a need for access to the raw i128 as well as a conversion

We should probably audit the other places that interpret decimal values for consistency

👍

@tustvold
Copy link
Contributor Author

tustvold commented Jun 4, 2022

Apologies this appears to have been my confusion, we should create a separate ticket to track making scale negative and clarifying the docs

@tustvold tustvold closed this as completed Jun 4, 2022
@alamb alamb changed the title DataType::Decimal Non-Compliant DataType::Decimal Non-Compliant Jun 9, 2022
@alamb alamb added arrow Changes to the arrow crate and removed bug labels Jun 9, 2022
@alamb alamb changed the title DataType::Decimal Non-Compliant DataType::Decimal Non-Compliant? Jun 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants