Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entropy #24

Merged
merged 51 commits into from
Mar 10, 2019
Merged

Entropy #24

merged 51 commits into from
Mar 10, 2019

Conversation

LukeMathWalker
Copy link
Member

This PR adds a new extension trait for Information Theory quantities - so far, entropy, cross entropy and Kullback-Leibler divergence.

Open question: returning an Option for cross entropy and KL divergence conflates two different "failure" modes - None when the arrays are empty, None when we have a dimension mismatch.
Would it be better to return a Result<Option<A>, E>, with a custom DimensionMismatch error type?

@LukeMathWalker LukeMathWalker mentioned this pull request Jan 27, 2019
17 tasks
Copy link
Member

@jturner314 jturner314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like useful functionality. I've added some questions and comments.

One other thought – in general, I like matching the behavior of NumPy/SciPy unless we have a specific reason to do something different, because doing so makes it easier to port code from Python to Rust.

src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Show resolved Hide resolved
src/entropy.rs Show resolved Hide resolved
src/entropy.rs Outdated
/// i=1
/// ```
///
/// If the arrays are empty or their lengths are not equal, `None` is returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's not good to combine these two cases. A few options:

  • Result<Option<A>, DimensionMismatch> would be fine.

  • I think Result<A, ShapeError> is a bit cleaner (where ShapeError is an enum with Empty and ShapeMismatch variants).

  • Another approach is Option<A>, returning None in the empty case and panicking if q cannot be broadcast to the shape of p. I generally avoid panicking, but I think this is the best option in this case because:

    • It's more consistent with ndarray's methods that operate on two arrays.
    • If the caller passes in arrays of mismatched shapes, it's clearly a bug in their code.
    • If the arrays have mismatched shapes, it's hard to see how the caller would recover from this error case.

    The question then is why we should return None in the empty case instead of panicking. I think this makes sense because it's much more likely than a shape mismatch, doesn't really indicate a bug in the caller, and is more likely to be recoverable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather go for Result<A, ShapeError> - in the end, the caller can choose how to recover from errors (free to call unwrap or expect).
I can foresee various scenarios for both failure modes:

  • empty because you filtered some values out of another array/vec, but now nothing is left;
  • shape mismatch, because you read two different samples from files on disk which you expected to have the same number of values but... they don't (real life stories here 😅).

Copy link
Member Author

@LukeMathWalker LukeMathWalker Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To provide a plausible use case: it might be impossible to recover (let the program resume its expected execution flow) but there might be some actions one might want to perform before panicking for example (logging is the first that comes to my mind or sending a request somewhere in a web application context).
One can always catch the panic using stuff like this but it's less pleasant and not clear at a first glance looking at the function signature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that makes sense. We have to return an Option/Result anyway, so we might as well handle all the error cases without panicking. Along the same lines, should we return an error on negative values instead of panicking? I'd expect negative values to be a bigger issue than shape mismatches, and we're already checking if values are zero.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On that one I am torn, because the cost of checking for negative values scales with the number of elements given that it's an additional check. Do you think it's worth it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking over it again, Result<Option<A>, DimensionMismatch> is probably the best signature, given that most of our methods return an Option<A> with None when arrays are empty. Consistency is probably worth the double wrap (unless we want to use DimensionMismatch::Empty instead of Option in the rest of the API).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize earlier that the ln of a negative number is NaN (except for the noisy float types, which panic). That behavior is fine with me.

Fwiw, I did a quick benchmark of the cost of adding a check for negative numbers, and it's pretty negligible (<2%) (the horizontal axis is arr.len(), and vertical axis is time to compute arr.entropy(); 'entropy2' is the checked implementation):

lines

    fn entropy2(&self) -> Option<A>
        where
            A: Float
    {
        if self.len() == 0 {
            None
        } else {
            let mut negative = false;
            let entropy = self.mapv(
                |x| {
                    if x < A::zero() {
                        negative = true;
                        A::zero()
                    } else if x == A::zero() {
                        A::zero()
                    } else {
                        x * x.ln()
                    }
                }
            ).sum();
            if negative {
                None
            } else {
                Some(-entropy)
            }
        }
    }

(Note that I wouldn't actually return an Option in this case; I'd return a Result instead.)

Of course, if we check for negative numbers, someone might wonder why we don't check for numbers greater than 1 too.

I could go either way on the negative number checks. The explicit check is nice, but it adds complexity and we aren't checking other things such as values greater than 1 or the sum of values not being 1, so I guess I'd lean towards leaving off negative number checks for right now.

Thinking over it again, Result<Option<A>, DimensionMismatch> is probably the best signature, given that most of our methods return an Option<A> with None when arrays are empty.

Okay, that's fine.

unless we want to use DimensionMismatch::Empty instead of Option in the rest of the API

That wouldn't be too bad, although I don't really like returning an error enum when only one of the variants is possible.

src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Outdated Show resolved Hide resolved
@LukeMathWalker
Copy link
Member Author

I have addressed all issues I think - it should be ready for merging @jturner314

Copy link
Member

@jturner314 jturner314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few minor comments and one more important one (using Error instead of Fail). Otherwise, everything looks good.

src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Outdated Show resolved Hide resolved
src/entropy.rs Show resolved Hide resolved
src/errors.rs Outdated Show resolved Hide resolved
@LukeMathWalker
Copy link
Member Author

All fixed, I'll squash and merge 👍

@LukeMathWalker LukeMathWalker merged commit 8e0e1cf into rust-ndarray:master Mar 10, 2019
@LukeMathWalker LukeMathWalker deleted the entropy branch March 10, 2019 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants