Can we speed up objectid using faster hashing #52440

Zentrik · 2023-12-07T13:29:14Z

I have an IdDict with SimpleVector keys, I see a lot of time is spent calling jl_object_id_ to hash the keys, specifically int64hash, which is called to hash any 64 bit elements in the key and it's also called in bitmix.

It seems jl_object_id_ is mainly used for hashtables like IdDict and IdSet in which case would it be fine to switch to something like FxHasher (used in the rust compiler) to replace both bitmix and int64hash?
Replacing bitmix in hash_svec with FxHasher gives a 10-100x improvement on @benchmark hash(x) setup=x=Core.svec(rand(Int64, 10^n)...) for n in 1 to 6.

However, FxHasher for 64 bit ints is just a multiplication by a constant, though if objectid is only used in internal hashtables, like IdDict, I think this should be fine. Whilst objectid is used to hash general objects on the Julia side, its passed through a hash so I don't think it should affect hash in Julia too much.

Alternatively, we could try using the non AES intrinsic version of ahash, this should still be much faster than what we have currently and higher quality than FxHasher. Again, ahash says it's specifically for hashtables.

Alternatively, to replace only int64hash, Squirrel3 is quite simple and seems to be reasonably fast (about 1.5-2x faster than int64hash) and should have higher quality hashes than FxHasher.

Does it seem reasonable to make a change along these lines? I see that currently int64hash is invertible, do we care about this? I don't see it being used anywhere.

This would also speed up my robinhood implementation of IdDict.

The text was updated successfully, but these errors were encountered:

vtjnash · 2023-12-07T14:24:12Z

FxHasher sounds reasonable to me to improve performance here. The current hash seems a bit of a local minima, since it is neither a great hash for cryptography (it is reversible, and thus trivially DOS-able) nor particularly fast

Tokazama · 2023-12-07T19:37:45Z

Could we have an additional option where a hashing function can be passed to IdDict so there's some choice between speed and cryptography?

Zentrik · 2023-12-07T23:12:15Z

It should be pretty easy to add that functionality to Dict. Not sure if that's feasible for IdDict.

Tokazama · 2023-12-08T01:05:03Z

It should be pretty easy to add that functionality to Dict. Not sure if that's feasible for IdDict.

Since the hashing function used can also effect the likelihood of collisions, I assume people would want to control things like growth/shrinkage of the hash table and lookup settings like max probe. I've wanted to have more direct control over all of these things at some point but I'm not sure how much that would overcomplicate things here.

Tokazama · 2023-12-11T18:20:52Z

I think the function is as simple as

hash_object_id(id::UInt64) = bitrotate(id * 0x517cc1b727220a95, 5)
hash_object_id(id::UInt32) = bitrotate(id * 0x9e3779b9 , 5)

Zentrik · 2023-12-11T18:30:35Z

There's no bitrotate, on 32/64 bit values FxHasher is just a multiplication by the constants you used. https://nnethercote.github.io/2021/12/08/a-brutally-effective-hash-function-in-rust.html is a nice reference.

Tokazama · 2023-12-11T19:48:46Z

I probably should have read the impl block too. The rotation might be a useful mechanism for some other hash table designs, since it appears that hits the sweet spot for hash speed and avoiding collisions.

stevengj · 2023-12-12T14:10:27Z

A custom hash function for Dict can be implemented already by wrapping the keys in a new key type, as described here.

We could make a nicer API for this by defining some wrapper objects.

Tokazama · 2023-12-12T14:44:52Z

We could make a nicer API for this by defining some wrapper objects.

FWIW, I would love to see this happen.

Zentrik · 2023-12-12T22:23:40Z

I think it would be nicer to allow creating a custom Dict type with a custom hash function, as then you can just replace any Dict in your code with your new type and everything should work. Having to wrap all the keys seems clunky.
Anyways, I was thinking something like this Zentrik@f3d8d0c.

stevengj · 2023-12-12T22:24:05Z

FWIW, I would love to see this happen.

It could easily be done in a package, e.g. CustomKeys.jl

PallHaraldsson · 2023-12-14T14:35:01Z

The current hash seems a bit of a local minima, since it is neither a great hash for cryptography (it is reversible, and thus trivially DOS-able) nor particularly fast

I'm not sure I understand, are you saying current Dict, or Set or hash or whatever is a security risk/DoS attachable when used by the user (as opposed to the compiler)?

I think we should make Julia internally/the compiler as fast as possible, likely meaning using FxHasher. Also, I proposed an ordered Dict by default but my PR didn't work out, but would be great for 1.10. I think it's potentially slower, when growing, but also potentially faster if the Set/Dict's are small, and most often they would be for the compiler?

I think we should revisit and make Dict ordered for usability, use DefaultOrderedDict as the new Dict, and compatibility with Python, they switched to ordered, all current versions do that. If because of the cryptographic issue it means we should switch too to such, also making slower by default, that seems like a good argument: we can't have fastest possible by default., anyway. Then if you need fastest non-ordered or non-cryptographic hash is ok, then you opt into that from a package.

It should be pretty easy to add that functionality to Dict. Not sure if that's feasible for IdDict.
We can't for Dict? Except for some internal-only type?

There's no bitrotate, on 32/64 bit values FxHasher is just a multiplication by the constants you used.

Well, no:

fn add_to_hash(&mut self, i: usize) {
self.hash = self.hash.rotate_left(5).bitxor(i).wrapping_mul(K);
[..]

It is brutally simple. Initialisation sets a single variable to zero. Hashing a value is just a rotate, an xor, and a multiply. Finalisation is a no-op.

(Are you wondering where the constant 0x517cc1b727220a95 comes from? 0xffff_ffff_ffff_ffff / 0x517c_c1b7_2722_0a95 = π.)

In terms of hashing quality, it is mediocre. If you run it through a hash quality tester it will fail a number of the tests. For example, if you hash any sequence of N zeroes, you get zero. And yet, for use in hash tables within the Rust compiler, it’s hard to beat. (Fortunately, the compiler is not an interesting target for HashDoS attacks.)

Tokazama · 2023-12-14T16:25:50Z

I think we should revisit and make Dict ordered for usability, use DefaultOrderedDict as the new Dict...

There's a lot of things we could/should do differently with dictionaries in the future. If we can agree that speed is the objective for dictionaries within Base, then I think this is the way to go (at least for IdDict). Other decisions probably warrant further discussion elsewhere

Tokazama · 2023-12-14T19:39:57Z

I think it would be nicer to allow creating a custom Dict type with a custom hash function, as then you can just replace any Dict in your code with your new type and everything should work. Having to wrap all the keys seems clunky. Anyways, I was thinking something like this Zentrik@f3d8d0c.

That seems like a pretty big change to Base, maybe we could have a HashTables.jl package for experimenting with this in JuliaCollections?

Zentrik mentioned this issue Dec 12, 2023

Switch to FxHasher hash function in int64hash and bitmix #52496

Closed

stevengj added the performance Must go faster label Dec 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can we speed up objectid using faster hashing #52440

Can we speed up objectid using faster hashing #52440

Zentrik commented Dec 7, 2023

vtjnash commented Dec 7, 2023

Tokazama commented Dec 7, 2023

Zentrik commented Dec 7, 2023

Tokazama commented Dec 8, 2023

Tokazama commented Dec 11, 2023

Zentrik commented Dec 11, 2023 •

edited

Loading

Tokazama commented Dec 11, 2023

stevengj commented Dec 12, 2023 •

edited

Loading

Tokazama commented Dec 12, 2023

Zentrik commented Dec 12, 2023

stevengj commented Dec 12, 2023

PallHaraldsson commented Dec 14, 2023 •

edited

Loading

Tokazama commented Dec 14, 2023

Tokazama commented Dec 14, 2023

Can we speed up objectid using faster hashing #52440

Can we speed up objectid using faster hashing #52440

Comments

Zentrik commented Dec 7, 2023

vtjnash commented Dec 7, 2023

Tokazama commented Dec 7, 2023

Zentrik commented Dec 7, 2023

Tokazama commented Dec 8, 2023

Tokazama commented Dec 11, 2023

Zentrik commented Dec 11, 2023 • edited Loading

Tokazama commented Dec 11, 2023

stevengj commented Dec 12, 2023 • edited Loading

Tokazama commented Dec 12, 2023

Zentrik commented Dec 12, 2023

stevengj commented Dec 12, 2023

PallHaraldsson commented Dec 14, 2023 • edited Loading

Tokazama commented Dec 14, 2023

Tokazama commented Dec 14, 2023

Zentrik commented Dec 11, 2023 •

edited

Loading

stevengj commented Dec 12, 2023 •

edited

Loading

PallHaraldsson commented Dec 14, 2023 •

edited

Loading