
Usage of the PRF (in the draft) and hash (in the original paper) diverges in non-trivial ways #26

Open
cipriancraciun opened this issue Nov 16, 2024 · 3 comments


@cipriancraciun

If one looks at the original paper, and how the hash function (PRF in our case) is used:

  • buf[0] = hash(cnt++, passwd, salt)
  • buf[m] = hash(cnt++, buf[m-1])
  • buf[m] = hash(cnt++, prev, buf[m])
  • int other = to_int(hash(cnt++, salt, idx_block)) mod s_cost (where block_t idx_block = ints_to_block(t, m, i))
  • buf[m] = hash(cnt++, buf[m], buf[other])

With the exception of the first usage (and perhaps partially the second), one could conclude that the signature of hash is roughly block_t hash(counter: usize, block_1: block_t, block_2: block_t), i.e. it always takes a counter and two blocks, and compresses them into a single block.

In fact, the paper states:

Since H maps blocks of 2k bits down to blocks of k bits, we sometimes refer to H as a cryptographic compression function.
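The paper's fixed-arity usage can be sketched as follows; SHA-256 is purely an assumption here (the paper leaves H abstract), and the counter/block encodings are illustrative:

```python
import hashlib
import struct

BLOCK = 32  # one "block" = one hash output; H maps 2k bits down to k bits

def h(counter: int, block_1: bytes, block_2: bytes = b"") -> bytes:
    # Fixed-arity compression as in the paper: a counter plus (up to) two
    # blocks in, one block out. SHA-256 is a stand-in for the abstract H.
    return hashlib.sha256(struct.pack("<Q", counter) + block_1 + block_2).digest()

cnt = 0
buf_0 = h(cnt, b"passwd", b"salt"); cnt += 1  # buf[0] = hash(cnt++, passwd, salt)
buf_1 = h(cnt, buf_0); cnt += 1               # buf[m] = hash(cnt++, buf[m-1])
buf_1 = h(cnt, buf_0, buf_1); cnt += 1        # buf[m] = hash(cnt++, prev, buf[m])
```

Every call after the first consumes a counter plus one or two fixed-size blocks, which is what makes the "compression function" framing apt.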

(And personally, I find the simplicity of the initial Balloon algorithm, and of its building blocks, namely a single hash function, quite appealing.)


However, the draft has the following usages of PRF:

  • key = PRF(key, password || salt || personalization || associatedData || LE32(pepper.Length) || LE32(password.Length) || LE32(salt.Length) || LE32(personalization.Length) || LE32(associatedData.Length))
  • previous = PRF(key, previous || UTF8("bkdf") || LE32(counter++))
  • pseudorandom = pseudorandom || PRF(emptyKey, LE32(VERSION) || personalization || LE32(spaceCost) || LE32(timeCost) || LE32(parallelism) || LE32(iteration) || LE64(counter++))
  • buffer[0] = PRF(key, LE32(VERSION) || LE32(spaceCost) || LE32(timeCost) || LE32(parallelism) || LE32(iteration) || LE64(counter++))
  • buffer[m] = PRF(key, buffer[m - 1] || LE64(counter++))
  • buffer[m] = PRF(key, previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3] || LE64(counter++))

Namely:

  • it introduces the concept of key, which doesn't exist in the initial algorithm;
  • sometimes there is a counter, sometimes there isn't one;
  • sometimes the VERSION is included, sometimes not;
  • sometimes the order of included parameters is one way, sometimes it is not;
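For concreteness, two of the draft's calls above can be sketched like this, hypothetically instantiating the PRF as HMAC-SHA-256 (the key and previous-block values are placeholders):

```python
import hmac, hashlib, struct

LE32 = lambda n: struct.pack("<I", n)
LE64 = lambda n: struct.pack("<Q", n)

def prf(key: bytes, data: bytes) -> bytes:
    # The draft's PRF instantiated, as an assumption, with HMAC-SHA-256.
    return hmac.new(key, data, hashlib.sha256).digest()

key = b"\x11" * 32       # placeholder for the derived key
previous = b"\x00" * 32  # placeholder for the previous block
counter = 0

# previous = PRF(key, previous || UTF8("bkdf") || LE32(counter++))
previous = prf(key, previous + "bkdf".encode("utf-8") + LE32(counter)); counter += 1

# buffer[m] = PRF(key, buffer[m - 1] || LE64(counter++))
block = prf(key, previous + LE64(counter)); counter += 1
```

Note how the two calls already differ in both the width of the counter encoding and the presence of the "bkdf" label.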

Doesn't it seem that the draft drifts quite a bit from the original paper?

I'm not saying it's wrong to have a different take, but I do see a few problems:

  • the major one being that the paper gives proofs for what was specified in the paper, while the draft makes changes that I'm not sure are equivalent;
  • the way PRF is used (with a lot of inline canonicalization) is error-prone with respect to these types of problems;
  • the way the inputs of the PRF are computed might have performance impacts compared to the initial version (more in a later paragraph);

Many hash functions, especially "modern" ones (I don't know about BLAKE, but for example XXH3, although it is not a cryptographic hash function), have optimized assembly implementations when they work on fixed-size blocks.

Thus, from this point of view, the original paper might yield a performance boost (because it uses fixed-size inputs) that the current draft's PRF usage, which has to concatenate a lot of data, does not.

@samuel-lucas6
Owner

(And personally, I find the simplicity of the initial Balloon algorithm, and of its building blocks, namely a single hash function, quite appealing.)

There are some benefits to using an unkeyed algorithm. The reason this was changed was purely to allow HMAC for NIST compliance since that's closer to PBKDF2. By supporting a keyed algorithm, you can also support unkeyed algorithms, whereas this isn't the case the other way around. The trouble is trying to support all these different algorithms is messy because of the design differences.

sometimes there is a counter, sometimes there isn't one;

There's a counter in most places. There's no need for a counter when deriving the PRF key because it's a single output and basically HKDF-Extract. You could get rid of the counter for the final key derivation as well, but I think it's better to keep one, aligning with HKDF-Expand.

sometimes the VERSION is included, sometimes not;

Where else do you think this should be included? When deriving the PRF key or for the output?

sometimes the order of included parameters is one way, sometimes it is not;

I'm not sure what you mean by this.

the major one being that the paper gives some proofs for what was specified in the paper, meanwhile the draft makes some changes that I'm not sure they are equivalent;

The fundamental approach to memory hardness should be the same. If that's the case, you can think of the rest more as cosmetic changes that have some pros and cons.

the way PRF is used (with a lot of inline canonicalization) is error-prone with respect to these types of problems;

What type of problem? This type of canonicalization is common with KDFs/MACs. If something is fixed length, you don't need to encode the length. Encoding the length of fixed length parameters is kind of like encoding the length of the length encodings.
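The rationale for length encodings on variable-length inputs can be demonstrated concretely; HMAC-SHA-256 as the PRF is an assumption, and canon is a hypothetical helper:

```python
import hmac, hashlib, struct

LE32 = lambda n: struct.pack("<I", n)

def prf(key: bytes, data: bytes) -> bytes:
    # Hypothetical PRF instantiation (HMAC-SHA-256), for illustration only.
    return hmac.new(key, data, hashlib.sha256).digest()

key = b"\x00" * 32

# Without length encodings, distinct variable-length input pairs can collide,
# because bare concatenation is ambiguous:
assert prf(key, b"ab" + b"c") == prf(key, b"a" + b"bc")

# With trailing length encodings (the draft's style), the ambiguity is gone:
def canon(password: bytes, salt: bytes) -> bytes:
    return prf(key, password + salt + LE32(len(password)) + LE32(len(salt)))

assert canon(b"ab", b"c") != canon(b"a", b"bc")
```

Fixed-length fields, by contrast, can never shift a field boundary, which is why encoding their lengths adds nothing.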

Thus, from this point of view, the original paper might yield a performance boost (because it uses fixed-size inputs) that the current draft's PRF usage, which has to concatenate a lot of data, does not.

Some of the concatenation is a one off cost or can be cached. You can generally figure out how many blocks there are going to be rather than it being a variable number. The personalization is the main thing that's interfered with that, but it can be cached. I have debated whether to prehash everything but the counter and use that as the PRF key for the pseudorandom bytes derivation. I'm not convinced that's worth it though.

I would think more calls to the hash functions is slower, but I haven't implemented this version of the draft to do benchmarks.
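The caching point can be sketched with Python's stdlib hmac, whose state objects support copy(); the key and static prefix are placeholder values:

```python
import hmac, hashlib, struct

key = b"\x22" * 32
static_prefix = b"version-and-parameters"  # hypothetical cached concatenation

# Absorb the static part of the input once...
base = hmac.new(key, static_prefix, hashlib.sha256)

def prf_with_counter(counter: int) -> bytes:
    # ...then clone the state per call and feed only the varying counter.
    state = base.copy()
    state.update(struct.pack("<Q", counter))
    return state.digest()

# Identical to hashing the full concatenation from scratch each time:
full = hmac.new(key, static_prefix + struct.pack("<Q", 7), hashlib.sha256).digest()
assert prf_with_counter(7) == full
```

So the repeated concatenation cost can largely be amortized even without changing the draft's input layout.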

@cipriancraciun
Author

Citing all the places where PRF is used:

key = PRF(key, password || salt || personalization || associatedData || LE32(pepper.Length) || LE32(password.Length) || LE32(salt.Length) || LE32(personalization.Length) || LE32(associatedData.Length))
## uses "parameters" (i.e. salt, personalization, associated data, etc.)
## misses the VERSION, timeCost, parallelism, etc.

pseudorandom = pseudorandom || PRF(emptyKey, LE32(VERSION) || personalization || LE32(spaceCost) || LE32(timeCost) || LE32(parallelism) || LE32(iteration) || LE64(counter++))
## uses VERSION
## misses cost, parallelism, etc.

buffer[0] = PRF(key, LE32(VERSION) || LE32(spaceCost) || LE32(timeCost) || LE32(parallelism) || LE32(iteration) || LE64(counter++))
## uses VERSION, uses cost
## misses personalization

sometimes the VERSION is included, sometimes not;

sometimes the order of included parameters is one way, sometimes it is not;

Personally, I would take all possible context that is either public (personalization, version, etc.) or mostly constant (costs, parallelism, etc.) and mix it into a single value to be used wherever "parameters" (or part of them) might be needed.

For example, rewriting the above:

context = PRF(emptyKey,
    salt || LE32(salt.Length)      ## let's assume it's public
 || pepper || LE32(pepper.Length)  ## let's also assume it's public
 || associatedData || LE32(associatedData.Length)
 || personalization || LE32(personalization.Length)
 || LE32(VERSION)
 || LE32(parallelism)
 || LE32(timeCost)
 || LE32(spaceCost)
)
 

key = PRF(key, context || password)

pseudorandom = pseudorandom
         || PRF(emptyKey, context || LE32(iteration) || LE64(counter++))
buffer[0] = PRF(key     , context || LE32(iteration) || LE64(counter++))

See how "nice" the PRF usage looks?

  • there is some "key";
  • the first input is some "context";
  • then follows the data;
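The proposal above can be sketched end to end; HMAC-SHA-256 as the PRF and all parameter values are assumptions for illustration:

```python
import hmac, hashlib, struct

LE32 = lambda n: struct.pack("<I", n)
LE64 = lambda n: struct.pack("<Q", n)

def prf(key: bytes, data: bytes) -> bytes:
    # Hypothetical PRF instantiation (HMAC-SHA-256).
    return hmac.new(key, data, hashlib.sha256).digest()

# Hypothetical parameter values, purely for illustration:
VERSION, parallelism, timeCost, spaceCost = 1, 1, 3, 1024
salt, pepper, associatedData, personalization = b"salt", b"", b"", b""
emptyKey = b"\x00" * 32

# Mix all public / mostly-constant context into a single value, once:
context = prf(emptyKey,
              salt + LE32(len(salt))
              + pepper + LE32(len(pepper))
              + associatedData + LE32(len(associatedData))
              + personalization + LE32(len(personalization))
              + LE32(VERSION) + LE32(parallelism)
              + LE32(timeCost) + LE32(spaceCost))

# Every subsequent call is then key || context || data:
key = prf(emptyKey, context + b"password")
counter, iteration = 0, 0
pseudorandom = prf(emptyKey, context + LE32(iteration) + LE64(counter)); counter += 1
buffer_0 = prf(key, context + LE32(iteration) + LE64(counter)); counter += 1
```

The context is computed exactly once per derivation, so the per-call inputs become uniform in both shape and size.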

the way PRF is used (with a lot of inline canonicalization) is error-prone with respect to these types of problems;

What type of problem? This type of canonicalization is common with KDFs/MACs. If something is fixed length, you don't need to encode the length. Encoding the length of fixed length parameters is kind of like encoding the length of the length encodings.

Indeed canonicalization is common; however, the usual flavour of canonicalization puts each length before the data it delimits (i.e. length-prefixed, TLV-style fields).

Meanwhile the current draft puts the lengths at the end, thus it's error-prone to implement correctly. (Perhaps it doesn't break the cryptography, but it does offer enough opportunities to mix things up and have broken outputs that don't match the test vectors.)


I would think more calls to the hash functions is slower, but I haven't implemented this version of the draft to do benchmarks.

Looking in the Blake3 paper there is the following graph:
[Figure from the BLAKE3 paper: hashing throughput versus input length]

As you can see, for small inputs (the graph shows full hash computations, but perhaps it also extends to discrete individual update(some_data) calls), there is a small cost at 64 bytes, the cost goes up between 64 and 128 bytes, drops at 128, rises again up to 256 bytes, and then plateaus between 256 bytes and 2 KiB (perhaps some CPU pipelining or branch prediction kicks in).

Thus, assuming one uses BLAKE3 or another algorithm with similar behaviour (strangely enough, SHA-2 follows a similar pattern), if the PRF is used as I've proposed above (plus the other two usages), i.e.:

PRF(key, context || LE32(iteration) || LE64(counter++))
## let's assume we pad iteration and counter to fit in 16 bytes each
=> PRF(key, 32 bytes || 32 bytes)
=> PRF(key, 64 bytes)

buffer[m] = PRF(key, buffer[m - 1] || LE64(counter++))
## let's assume we pad counter to 16 bytes, and we imagine we have an iteration of 0
=> PRF(key, 32 bytes || 32 bytes)
=> PRF(key, 64 bytes)

buffer[m] = PRF(key, previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3] || LE64(counter++))
## let's also pad counter to 16 bytes, plus iteration of 0
=> PRF(key, 32 bytes × 6)
=> PRF(key, 192 bytes)

We get almost the optimal behaviour for Blake3.
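The size arithmetic above, under the stated assumptions (32-byte PRF outputs, counter and iteration each padded to 16 bytes), checks out:

```python
# Input sizes under the proposal, assuming 32-byte PRF outputs and
# counter/iteration encodings each padded to 16 bytes (an assumption):
BLOCK, PAD = 32, 16

# PRF(key, context || iteration || counter)
assert BLOCK + PAD + PAD == 64

# PRF(key, buffer[m - 1] || iteration || counter), same padding
assert BLOCK + PAD + PAD == 64

# PRF(key, previous || buffer[m] || other1 || other2 || other3 || iteration || counter)
assert 5 * BLOCK + PAD + PAD == 192
```

Both 64 and 192 bytes land on the favourable region of the throughput curve.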

@samuel-lucas6
Owner

misses the VERSION, timeCost, parallelism, etc.

The parallelism loop iteration can't be included in the key derivation since the key is static.

misses cost, parallelism, etc.

Those are included when computing the pseudorandom bytes.

misses personalization

This is unnecessary if it's in the key.

Personally, I would take all possible context that is either public (personalization, version, etc.) or mostly constant (costs, parallelism, etc.) and mix it into a single value to be used wherever "parameters" (or part of them) might be needed.

It's unfortunately not that simple because you don't want things like the salt/pepper/associated data in the pseudorandom bytes derivation. There's also no need to process those multiple times.

You could do it for the VERSION, personalization, parallelism, timeCost, spaceCost, and parallelism iteration, but it's pretty ugly because you need to use an empty key. You also end up potentially hashing more data because a hash is larger than those encoded parameters.

But what you're talking about is what I was thinking about with prehashing the personalization so the pseudorandom bytes input fits into a block (that isn't guaranteed if the personalization length is variable).

Meanwhile the current draft puts the lengths at the end, thus it's error-prone to implement correctly. (Perhaps it doesn't break the cryptography, but it does offer enough opportunities to mix things up and have broken outputs that don't match the test vectors.)

It's also normal to put the lengths at the end. For example, see the ChaCha20-Poly1305 RFC. It allows you to process inputs without knowing their length in advance, which admittedly isn't relevant here. If this is done incorrectly, the test vectors shouldn't pass.
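The ChaCha20-Poly1305 precedent is easy to show: RFC 8439 places both length encodings after all the data being authenticated.

```python
import struct

def poly1305_message(aad: bytes, ciphertext: bytes) -> bytes:
    # RFC 8439 authenticates:
    #   aad || pad16(aad) || ct || pad16(ct) || LE64(len(aad)) || LE64(len(ct))
    # i.e. both lengths come at the very end, after the data.
    pad16 = lambda b: b"\x00" * (-len(b) % 16)
    return (aad + pad16(aad) + ciphertext + pad16(ciphertext)
            + struct.pack("<Q", len(aad)) + struct.pack("<Q", len(ciphertext)))

msg = poly1305_message(b"header", b"\xaa" * 20)
assert msg[-16:] == struct.pack("<Q", 6) + struct.pack("<Q", 20)
```

This only sketches the Poly1305 input layout, not the full AEAD construction.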

Looking in the Blake3 paper there is the following graph

Yes, I've seen that graph. The question is how do repeated small calls do vs one slightly longer call.

I definitely don't want to include padding because it depends on the algorithm and that gets messy.
