Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP/RFC: Refactored dict.jl #7348

Closed
wants to merge 5 commits into from
Closed

Conversation

mauro3
Copy link
Contributor

@mauro3 mauro3 commented Jun 21, 2014

This is a refactor of dict.jl which also takes into account
DataStructures.jl/OrderedDict. The reason I did this is because I
needed a WeakObjectIdDict for PR #5572 (which I did there with an ugly
copy-paste). PR #5572 is for issue #3988 which should lay the
foundation for user-defined help.

Dictionaries have quite extensive internal mechanics and small changes
in those internals make new types of dicts. For instance, whether to
take a straight hash or use an object-id in hashindex makes a Dict
vs ObjectIdDict. This means that when making new dictionaries by
simply wrapping the standard Dict and defining new setindex!,
getindex, etc. one has to replicate lots of those inner mechanics
because dispatch cannot do its magic on inner functions. This PR is a
go at writing dict.jl in a more composable manner.

This refactor can accommodate quite a few different hash-table based
dicts. Implemented are Dict, ObjectIdDict2 (a new variation),
WeakKeyDict (now works for immutables too), WeakObjectIdDict (new),
and OrderedDict (essentially as in DataStructures.jl package).

Performance is as before except for ObjectIdDict2 which is 25% slower
than ObjectIdDict (but has the complete interface). This is because
ObjectIdDict is implemented differently to Dict whereas ObjectIdDict2
is within the same framework.

I'm not sure the style is in line with standard practices. The
concrete subtypes of HashDictionary and their constructors are made
with a macro. This is reminiscent of the approach of @kmsquire in
PR #2548 which was not liked by @JeffBezanson...

If this is not the way to cleanly implement a WeakObjectIdDict (and it
turns out its needed to solve #3988), then any comments on how to
implement it instead are welcome!

What I did:

  • Introduced a new abstract type HashDictionary which is the
    supertype of all implemented dicts at the moment and ought to be
    the supertype of all hash-table based dicts.
  • I moved some of the Dict specific methods to Associative.
  • I introduced key and value getters and setters (gkey, skey!,
    ...). These allow, for instance, keys to be transformed to
    WeakRef's and back.
  • hashindex now is dispatched on the type so, for instance,
    ObjectIdDict can change it to use object_id instead.
  • isequal is also dispatched on with the method isequalkey.
  • I added a purging of WeakRef, even for immutables, if their object
    get gc-ed with the function topurge.
  • the concrete types are then constructed with @makeHashDictionary
    and some of above internals are specialized.

Then to produce the standard Dict:

@makeHashDictionary(Dict, K, K, V, V, Unordered)

To make an object-id dict only a few extras are needed:

@makeHashDictionary(ObjectIdDict2, K, K, V, V, Unordered)
hashindex(::ObjectIdDict2, key, sz) = (int(object_id(key)) & (sz-1)) + 1
function keyconvert{K,V}(h::OIdDicts{K,V}, key0) # no conversion as that can create a new object.
    !isa(key0, K) ? error(key0, " is not a valid Object-Id-Dict key for type ", K) : key0
end
isequal(::OIdDicts, key1, key2) = key1===key2

Note 95% of this is a refactor of good code others wrote, thanks to all those others!

TODO:

  • check code-coverage and add tests
  • add performance tests
  • think about constructors
  • see whether ObjectIdDict2 can be made compatible with ObjectIdDict

TODO irrespective of this PR:

  • currently there is an inconsistency between key conversion in get
    and getindex. (some do an explicit conversion others don't).
    (see below for discussion)
  • at the moment the top-level abstract type is called Associative.
    Maybe Dictionary would be more consistent?
  • equality: == when keys and values are equal or also when type of
    dict is equal? ObjectIdDict seems to get special treatment at the moment.
  • could the standard Dict be made as performant as ObjectIdDict,
    i.e. 25% faster? (almost there now)

(edited 24 June)

@StefanKarpinski
Copy link
Member

Very cool!

@kmsquire
Copy link
Member

Hi Mauro, thanks for taking this on! I'll add a few comments inline.

const ISEMPTY = 0x0
const ISFILLED = 0x1
const ISMISSING = 0x2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider just using EMPTY, FILLED, MISSING. Most other uses of is* in Julia are predicates.

(To prevent leakage of common names, you could put this file in its own module.)

@kmsquire
Copy link
Member

This looks quite nice, @mauro3! I don't have any substantive comments.

@mauro3
Copy link
Contributor Author

mauro3 commented Jun 21, 2014

Thanks Stefan and Kevin for the encouraging comments. I changed those ISEMPTY. (I also did some minor edits in my description above).

end
end

function deserialize{K,V}(s, T::Type{Associative{K,V}})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work, due to invariance. If T is Dict{K,V}, this method does not match.

@JeffBezanson
Copy link
Member

How is performance compared to the current Dict? We have a bit of a performance regression in Dicts in 0.3, which I am very concerned about.

@mauro3
Copy link
Contributor Author

mauro3 commented Jun 21, 2014

Performance of Dict seems to be the same as before, maybe even a bit better. The results so far are from tests based on an adaption of the current test/dict.jl and tests Kevin put together a wee while ago. Here the script: https://gist.github.com/mauro3/00ff8016bf909de3f4d1

(I'll put proper tests together which can go into test/perf next week and report.)

Dict and ObjectIdDict are the originals, TOREP* are the new ones. First line is timing the test/dict.jl-adaption, second line are Kevin's tests:

45.7 Dict{K,V}
499.8 Dict{K,V}

26.4 ObjectIdDict
334.0 ObjectIdDict

38.5 TOREPDict{K,V}
500.2 TOREPDict{K,V}

35.2 TOREPObjectIdDict{K,V}
435.3 TOREPObjectIdDict{K,V}

The new Dict looks fine, but we should definitely keep the old ObjectIdDict.

@JeffBezanson
Copy link
Member

The explicit conversion question is interesting. If it's the case that the input and output of convert are always equal, then of course it's not necessary, but my impression is that requiring that would be too strict. It should be enough to convert and check only on assignment, which ensures that doing a lookup with the same object as key will work.

@JeffBezanson
Copy link
Member

Since get! might assign new keys, the easiest thing for it to do is convert first. It could look up the key without conversion first to see if it's found, but if the key's not found it would have to convert and hash it again.

@JeffBezanson
Copy link
Member

What would you attribute the slight performance increase to? (from Dict to TOREPDict)
If anything the new version seems to do slightly more operations. Or is the performance difference not statistically significant?

@mauro3
Copy link
Contributor Author

mauro3 commented Jun 23, 2014

I put some performance test together in test/perf/dict/perf.jl and ran them for this PR, upstream and 0.2.1.: https://gist.github.com/mauro3/daddb89c29b7f5e5cd59

Performance of this PR vs upstream is the same within error. I think that is because the extra function in this PR are no-opts for Dicts and thus are probably optimized away.

Somewhat puzzling/interesting is the performance difference between ObjectIdDict2 vs Dict (ObjectIdDict2 is identical to Dict except for the hashindex function): the "_unitt" test runs 30% faster, the "_del" deletion test 15% faster.

OrderedDict is as fast as Dict. Weak-dicts are slower in insertion and iteration. I'll see what this is about.

Upstream vs 0.2.1: about the same except for the _unitt and iteration test, where there is about a 20% performance regression.

@mauro3
Copy link
Contributor Author

mauro3 commented Jun 23, 2014

@JeffBezanson: yes, you're right about key conversion because isequal implies the same hash, which I was not aware of: http://docs.julialang.org/en/latest/stdlib/base/?highlight=concrete#Base.hash.

@mauro3
Copy link
Contributor Author

mauro3 commented Jun 24, 2014

Some random info about the not-key-conversion in the getters in upstream Dict:

julia> d = Dict{Symbol, Int}()
Dict{Symbol,Int64} with 0 entries

julia> get(d, 84, 8)
8

julia> get!(d, 84, 8)
ERROR: no method convert(Type{Symbol}, Int64)
 in get! at dict.jl:564

julia> get(d, :t, :a)
:a

julia> get!(d, :t, :a)
ERROR: no method convert(Type{Int64}, Symbol)
 in get! at dict.jl:573

julia> d[6]
ERROR: key not found: 6
 in getindex at dict.jl:615

So, the get does not throw any errors when the key or default is of the wrong type, whereas get! does. Also getindex thows a key not found error and not a ERROR: no method convert(Type{Int64}, Symbol).

I guess this is fine either way. In fact, base relies on the current behavior and it took me ages to figure out why adding key-conversion to the getters leads to a build error.

@mauro3
Copy link
Contributor Author

mauro3 commented Jun 24, 2014

I ran some more perfomance tests but now also using Ints as key as well as ASCIIStrings. They are much faster, making Dict faster than ObjectIdDict. Profiling shows that the strings' isequal is the culprit. Here this results:
https://gist.github.com/mauro3/daddb89c29b7f5e5cd59#file-dict-perf-str-int

Note iteration: Dict{Int} is 10x faster than Dict{ASCIIString}, which is again 2x faster than ObjectIdDict.

Based on these tests I made the isequal function replacable too. Now my ObjectIdDict2 uses === for comparison (which is in fact also correct!). Now performance of my ObjectIdDict2{ASCIIString,Int} is almost on par with ObjectIdDict:
https://gist.github.com/mauro3/daddb89c29b7f5e5cd59#file-obid-dicts

@JeffBezanson
Copy link
Member

Iterating over an ObjectIdDict is way too slow. I can fix that.

@mauro3
Copy link
Contributor Author

mauro3 commented Jun 25, 2014

Here the new perfomance test for @JeffBezanson's updated ObjectIdDict: https://gist.github.com/mauro3/daddb89c29b7f5e5cd59#file-jjeffs-new-objectiddict

Much faster now. Looping over (ASSCIString, Int) dict is about as fast as the normal Dict, over (Int, Int) still about 10x slower. Which is, I think, because type inference does not work so well for the ObjectIdDict as it is not parameterized on the types of the keys and values.

@mauro3
Copy link
Contributor Author

mauro3 commented Sep 30, 2014

Any interest in this? Should I rebase it?

@JeffBezanson
Copy link
Member

We can probably get rid of the macro now that it's possible to define constructors for types with any number of unspecified parameters. We can have Dict{K,V,Hash,Weak,Order} and hide the last 3 parameters.

Needs to be updated for the new style of Dict constructors/literals.

This also looks like a good opportunity to implement #9028 (comment) . Making a closure and/or finalizer for each element of a weak Dict can't be good for performance.

@mauro3
Copy link
Contributor Author

mauro3 commented Dec 8, 2014

Cool, I'll have a look at it. Probably will be after Christmas though.

@mauro3
Copy link
Contributor Author

mauro3 commented Jan 8, 2015

I had a look at issue #8712 to figure out what "now that it's possible to define constructors for types with any number of unspecified parameters" means but could not figure it out. Waiting for documentation: #9680

@hayd
Copy link
Member

hayd commented May 27, 2015

How does this sit with/after #10116? Presumably this'll look different after that...

Perhaps some of these perf tests are useful for #10116?

@mauro3
Copy link
Contributor Author

mauro3 commented Jun 28, 2015

Closing. See #10116. Performance test I ran over the ordered dict: #10116 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants