-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Entropy #24
Entropy #24
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like useful functionality. I've added some questions and comments.
One other thought – in general, I like matching the behavior of NumPy/SciPy unless we have a specific reason to do something different, because doing so makes it easier to port code from Python to Rust.
src/entropy.rs
Outdated
/// i=1 | ||
/// ``` | ||
/// | ||
/// If the arrays are empty or their lengths are not equal, `None` is returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's not good to combine these two cases. A few options:
-
Result<Option<A>, DimensionMismatch>
would be fine. -
I think
Result<A, ShapeError>
is a bit cleaner (whereShapeError
is an enum withEmpty
andShapeMismatch
variants). -
Another approach is
Option<A>
, returningNone
in the empty case and panicking ifq
cannot be broadcast to the shape ofp
. I generally avoid panicking, but I think this is the best option in this case because:- It's more consistent with
ndarray
's methods that operate on two arrays. - If the caller passes in arrays of mismatched shapes, it's clearly a bug in their code.
- If the arrays have mismatched shapes, it's hard to see how the caller would recover from this error case.
The question then is why we should return
None
in the empty case instead of panicking. I think this makes sense because it's much more likely than a shape mismatch, doesn't really indicate a bug in the caller, and is more likely to be recoverable. - It's more consistent with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather go for Result<A, ShapeError>
- in the end, the caller can choose how to recover from errors (free to call unwrap
or expect
).
I can foresee various scenarios for both failure modes:
- empty because you filtered some values out of another array/vec, but now nothing is left;
- shape mismatch, because you read two different samples from files on disk which you expected to have the same number of values but... they don't (real life stories here 😅).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To provide a plausible use case: it might be impossible to recover (let the program resume its expected execution flow) but there might be some actions one might want to perform before panicking for example (logging is the first that comes to my mind or sending a request somewhere in a web application context).
One can always catch the panic using stuff like this but it's less pleasant and not clear at a first glance looking at the function signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that makes sense. We have to return an Option
/Result
anyway, so we might as well handle all the error cases without panicking. Along the same lines, should we return an error on negative values instead of panicking? I'd expect negative values to be a bigger issue than shape mismatches, and we're already checking if values are zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On that one I am torn, because the cost of checking for negative values scales with the number of elements given that it's an additional check. Do you think it's worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking over it again, Result<Option<A>, DimensionMismatch>
is probably the best signature, given that most of our methods return an Option<A>
with None
when arrays are empty. Consistency is probably worth the double wrap (unless we want to use DimensionMismatch::Empty
instead of Option
in the rest of the API).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize earlier that the ln
of a negative number is NaN
(except for the noisy float types, which panic). That behavior is fine with me.
Fwiw, I did a quick benchmark of the cost of adding a check for negative numbers, and it's pretty negligible (<2%) (the horizontal axis is arr.len()
, and vertical axis is time to compute arr.entropy()
; 'entropy2' is the checked implementation):
fn entropy2(&self) -> Option<A>
where
A: Float
{
if self.len() == 0 {
None
} else {
let mut negative = false;
let entropy = self.mapv(
|x| {
if x < A::zero() {
negative = true;
A::zero()
} else if x == A::zero() {
A::zero()
} else {
x * x.ln()
}
}
).sum();
if negative {
None
} else {
Some(-entropy)
}
}
}
(Note that I wouldn't actually return an Option
in this case; I'd return a Result
instead.)
Of course, if we check for negative numbers, someone might wonder why we don't check for numbers greater than 1 too.
I could go either way on the negative number checks. The explicit check is nice, but it adds complexity and we aren't checking other things such as values greater than 1 or the sum of values not being 1, so I guess I'd lean towards leaving off negative number checks for right now.
Thinking over it again,
Result<Option<A>, DimensionMismatch>
is probably the best signature, given that most of our methods return anOption<A>
withNone
when arrays are empty.
Okay, that's fine.
unless we want to use
DimensionMismatch::Empty
instead ofOption
in the rest of the API
That wouldn't be too bad, although I don't really like returning an error enum when only one of the variants is possible.
Co-Authored-By: LukeMathWalker <[email protected]>
I have addressed all issues I think - it should be ready for merging @jturner314 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a few minor comments and one more important one (using Error
instead of Fail
). Otherwise, everything looks good.
Co-Authored-By: LukeMathWalker <[email protected]>
Co-Authored-By: LukeMathWalker <[email protected]>
Co-Authored-By: LukeMathWalker <[email protected]>
All fixed, I'll squash and merge 👍 |
This PR adds a new extension trait for Information Theory quantities - so far, entropy, cross entropy and Kullback-Leibler divergence.
Open question: returning an
Option
for cross entropy and KL divergence conflates two different "failure" modes -None
when the arrays are empty,None
when we have a dimension mismatch.Would it be better to return a
Result<Option<A>, E>
, with a customDimensionMismatch
error type?