Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Change indexin sentinel to nothing #25662

Merged
merged 4 commits into from
Jan 29, 2018
Merged

Conversation

garrison
Copy link
Member

@garrison garrison commented Jan 20, 2018

This changes indexin's sentinel to nothing, following #25472 (comment). Also includes the improvements in #23845 (which this pull request now supercedes).

At the moment, inference seems to be broken for this, as the first example in the doctest results in Array{Any,1}, hence the WIP title.

@garrison garrison added the search & find The find* family of functions label Jan 20, 2018
@nalimilan
Copy link
Member

nalimilan commented Jan 20, 2018

It's not an inference issue, it's due to #25553. I guess an intermediate solution would be to pre-allocate the resulting array with the correct type.

@nalimilan
Copy link
Member

BTW, regarding the support for any iterators evoked at #23845, you could use the recently-introduced _pairs function (see e.g. at #25655).

@garrison
Copy link
Member Author

This PR (and #23845) generalize indexin so the first argument can be any iterable. If I understand you correctly, you are suggesting using _pairs to also allow the second argument to be any iterable (i.e., bdict = Dict(j => i for (i, j) in _pairs(b))). This could indeed be a possibility. One thing to note is that the phrase "highest index" in the docstring should be changed to "last index" as well, since what really matters is the order the indices are iterated in, not the relative ordering of their values.

@nalimilan
Copy link
Member

If I understand you correctly, you are suggesting using _pairs to also allow the second argument to be any iterable (i.e., bdict = Dict(j => i for (i, j) in _pairs(b))).

Yes, but the other/main advantage of using _pairs instead of eachindex would be that the most "natural" index type would be returned (e.g. cartesian indices for matrices and higher-dimensional arrays instead of linear indices). This would be consistent with what other find* functions now do.

@nalimilan
Copy link
Member

Also, it would make sense to add a third argument indicating the value to return when there is no match. It would default to nothing, but it could be set e.g. to 0, which is sometimes useful. R's equivalent match function supports it (and defaults to NA, which is kind of equivalent to nothing in the present case given that R doesn't have a scalar nothing). That would also make the replacement trivial to write in Compat.

While we're at it, I wonder whether returning the smallest/first index wouldn't be more natural than the highest/last. The only reason to return the last index seems to be that the implementation is somewhat easier (indeed it's really simple), but it shouldn't be hard nor slow to keep the first index if it's already in the dict. Thoughts?

@garrison
Copy link
Member Author

It looks like indexin was originally meant to be a substitute for Matlab's ismember function, which itself returns the lowest index. I agree that it is more logical, the main downside being that it is an additional breaking change.

@yurivish
Copy link
Contributor

Is it conceivable that an indexable container can be indexed by nothing?

@garrison
Copy link
Member Author

Is it conceivable that an indexable container can be indexed by nothing?

I believe this would be addressed by @nalimilan's proposed third argument.

@nalimilan
Copy link
Member

A dict could have nothing as a valid key, but I'm not sure what's the question.

@yurivish
Copy link
Contributor

yurivish commented Jan 21, 2018

Sorry, I should have been clearer but didn't want to come across as presumptuous. 😄 I was trying to understand if there was any reason why using an Optional (Union{Nothing, Some{T}}) wouldn't make sense.

The idea would be to force the user to handle the case where the value isn't present explicitly via unwrapping, rather implicitly by getting an error at the point when a value they get is passed to a method that doesn't accept Nothing as an argument.

Unlike propagation with missing, the software engineer's null typically wants to be handled close to its origin, and forcing the user to unwrap a Some means that they've also explicitly considered what to do in the nothing case.

The idea of a third argument would also work, but it would be nice if all of the methods that have this kind of potential-for-null were handled in similar ways.

@nalimilan
Copy link
Member

OK. The use of Some has been discussed at JuliaLang/Juleps#47, and we've decided against it because it would be too annoying when working with arrays. In general we're quite consistent in using simply Union{T, Nothing} everywhere, and Union{Some{T}, Nothing} only in cases where Nothing<:T.

@yurivish
Copy link
Contributor

The conversation in that issue was enlightening. Thanks for the explanation and the link!

@JeffBezanson
Copy link
Member

Should #23845 be closed?

base/array.jl Outdated

julia> indexin(a,b)
julia> indexin(a, b)
6-element Array{Int64,1}:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Element type needs to change.

base/array.jl Outdated
bdict = Dict(zip(b, 1:length(b)))
[get(bdict, i, 0) for i in a]
function indexin(a, b::AbstractArray)
bdict = Dict(zip(b, eachindex(b)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use keys(b).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even _pairs, see my comment above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pairs returns index=>value pairs; this needs them the other way around.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. I guess we could define _keys similar to _pairs, but for now using keys should be enough, we can always support non-AbstractVector arguments later.

@JeffBezanson JeffBezanson added this to the 1.0 milestone Jan 23, 2018
@garrison garrison force-pushed the jrg/indexin-sentinel branch from 8d97a94 to ca1423c Compare January 26, 2018 03:01
@garrison
Copy link
Member Author

Also, it would make sense to add a third argument indicating the value to return when there is no match.

Why do this for indexin but not for findfirst, findlast, findnext, and findprev? Perhaps it would make sense to add an optional final argument to all of these in a subsequent pull request?

While we're at it, I wonder whether returning the smallest/first index wouldn't be more natural than the highest/last.

In the interest of keeping this PR non-breaking (and non-controversial), I'm going to keep the current behavior here. But I would have no objection to this in a follow-up pull request.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's go with that and discuss other changes after. I think the first match vs. last match change has the highest priority since it's breaking.

base/array.jl Outdated
[get(bdict, i, 0) for i in a]
function indexin(a, b::AbstractArray)
bdict = Dict(zip(b, keys(b)))
map(i -> get(bdict, i, nothing), a)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until #25553 this is going to return an Array{Any}, but I guess it's OK as a temporary situation. We could specify the element type in advance, but then it would always be Union{T, Nothing} even in the absence of nothing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One benefit of map is that it returns a scalar if passed a scalar argument (see here for previous discussion). So I think it makes sense to keep this as is and rely on #25553.

Also, I would have expected tests to fail due to the (currently) inaccurate types in the doctests, but it seems these doctests are not actually being tested.

We could specify the element type in advance, but then it would always be Union{T, Nothing} even in the absence of nothing.

I (obviously) don't understand inference, but I would have expected/hoped that the return type would always be Union{T, Nothing}. (In particular, I changed the type of the second doctest as well so this would be the case, despite lack of nothing in the output.) If this is not expected, then the return type of this function will not be inferred, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inference isn't involved here, we just use typejoin to widen the element type as new elements are encountered. So no, the element type will always reflect what the array actually contains. This decision has been made so that inference can change without affecting user-visible behavior. We could force always returning a Union{T, Nothing} array, but I'm not sure that's a good idea given that map behaves differently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this may be an appropriate place to type-hint it as Union{valuetype(bdict), Nothing}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean, just adding a type assertion? Wouldn't inference be able to figure that anyway?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though @StefanKarpinski seemed to have a different opinion. Do you really want that behavior? At first returning a scalar made sense to me too, but thinking a bit more about the purpose of this function it's not so clear.

Anyway it would be easy to add a special method for Number if we want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into this more, I'm honestly not sure why Number is treated differently than other data types for map, collect, etc. First, there is (at least in my opinion) an impedance mismatch between map and collect:

julia> map(i->i, 3)
3

julia> collect(3)
0-dimensional Array{Int64,0}:
3

Additionally, I find this funny because either of these operations on a Char would result in a 1-d Array.

Also pinging @mbauman, as he had the opinion as well that it should return a scalar. As for me, I'm no longer convinced that it should (or is worth worrying about).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Char is not HasShape even though ndims is defined for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for me, I'm no longer convinced that it should (or is worth worrying about).

After re-reading that issue and this one, I think I would agree. It stinks that we've never managed to remove iteration from numbers, but that's the real problem here. We've papered over that in piecemeal cases as we've noticed it. So my 👍 was just to continue that "papering over," but I'm becoming less convinced that's the direction we should go. And of course, this ties into the big "what's a scalar" broadcast issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better discuss this in a separate issue? I'd rather do whatever sounds reasonable here and see how we can find a consistent general pattern later.

@garrison garrison changed the title WIP: Change indexin sentinel to nothing RFC: Change indexin sentinel to nothing Jan 26, 2018
@garrison garrison force-pushed the jrg/indexin-sentinel branch from ca1423c to 412dffa Compare January 28, 2018 22:04
@garrison
Copy link
Member Author

Sure, let's further discuss the above outdated-diff thread in a separate issue.

@JeffBezanson JeffBezanson merged commit da3b862 into master Jan 29, 2018
@JeffBezanson JeffBezanson deleted the jrg/indexin-sentinel branch January 29, 2018 01:50
@nalimilan
Copy link
Member

I've filed #25998 to return the first rather than the last matching index.

@andreasnoack andreasnoack added the needs news A NEWS entry is required for this change label Feb 15, 2018
@andreasnoack
Copy link
Member

We need a NEWS entry for this silently breaking change

@ararslan
Copy link
Member

#26067 for news

@KristofferC KristofferC removed the needs news A NEWS entry is required for this change label Nov 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
search & find The find* family of functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants