The first character of shortuuid.random() strings is not fully random #101
Comments
Hm, this is interesting, thank you. What solution do you recommend? The first thing that comes to my mind is to generate a string of N+1 length and remove the first character. The rest should be uniformly distributed.
Unless the alphabet is >256 chars I think you're already generating more digits than you need. For example
is either 20 or 21 when the caller has asked for an output of length 15 and random() already discards the final 5-6 chars. Would it make sense to stop the iteration in int_to_string once the target length is reached? I think this would still return correct results for UUID data because the _length calculation makes the string just the right size, but when the length is specified manually in random() it would return the string generated for the less-significant characters fully randomly. Or if you'd prefer not to modify int_to_string then I think an even simpler fix would be to switch random() from returning the requested length from the beginning of the string to slicing from the end instead.
would become
You can take a look at the
Very nice writeup! I assume that this is not an issue if the alphabet is 32 or 64 characters either, since both divide into 256. I had assumed that these would be cryptographically secure, but it seems not (yet). Edit: Thinking about this, the easiest possible fix is probably to use
Yes I think that's right... except the first char of the alphabet can still never be selected.
This seemed to be the recommended method from the Python docs when I looked. It will be cryptographically secure, if that's needed.
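The comment proposing the fix is truncated in this copy, so the exact call it recommends is not shown. As a hedged sketch of one standard-library way to pick each character independently, assuming `secrets.choice` is acceptable here (the function name `random_string` and the alphabet constant are hypothetical, chosen only for illustration):

```python
import secrets

# Assumed to match the 57-character alphabet discussed in this issue.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def random_string(length, alphabet=ALPHABET):
    """Pick every character independently with the secrets module.

    Each position is drawn on its own, so there is no leftover
    "partial" character and alphabet[0] is as likely as any other.
    """
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(random_string(15))
```

Because no integer-to-string conversion is involved, the byte/alphabet alignment discussed in this thread does not come into play with this approach.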
* fix: Improve randomness (#101)
Fixes: #101
* Update shortuuid/main.py
* Appease pre-commit
Co-authored-by: Stavros Korokithakis <[email protected]>
Released 1.0.13 with the fix!
Generating random strings via shortuuid's random method gives very unbalanced first chars.
I am not sure if generating ID strings purely via random() and not via a UUID is an expected use-case, but it was surprising to me based on what I could find in the docs (and from the code on a first look!).
I'm not sure what you'll want to do with it so I've just written up what I've found. Hopefully it's useful! We ended up going for a different solution in the end, so this is not blocking me.
Demo:
`2` is impossible as a first character, and `3` is approx 50x more likely than most other characters.
Thinking there might be something going on with the byte/char alignment when using the full length from `_length`, I also tried some shorter lengths. Setting different lengths changed how many of the earlier characters got higher probabilities but the overall pattern was similar.
For example here's the output for length=15.
`2` is still impossible, but now everything in the range `3`-`B` is a much more likely first char than `C`-`z`.
The reason is that for a given output length n random() loads n bytes from os.urandom() and converts it to an int. The int is passed to int_to_string which generates the string one factor of the alphabet size at a time, least significant chars first (before reversing the string to give MSB-first output). If there is any misalignment between the byte boundaries and the alphabet-factor boundaries, then a final remainder of a limited size will fill the final output character, which becomes the first character returned from int_to_string, which is therefore not uniformly distributed.
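The mechanism described above can be reproduced at a small scale. The sketch below is a simplified reimplementation written for illustration, not the library's code: `encode`, `prefix_from_bytes` and `first_char_counts` are hypothetical names, the alphabet string is assumed to match the 57-character alphabet, and an input of zero is special-cased to a single `alphabet[0]`. It enumerates every 2-byte input instead of sampling, so the counts are exact rather than statistical.

```python
from collections import Counter

# Assumed to match the 57-character alphabet discussed in this issue.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def encode(number, alphabet):
    """Base-len(alphabet) encoding, most significant character first."""
    output = ""
    while number:
        number, digit = divmod(number, len(alphabet))
        output += alphabet[digit]  # least significant character first
    # Reverse to MSB-first; zero is special-cased (simplification).
    return output[::-1] or alphabet[0]


def prefix_from_bytes(raw, alphabet=ALPHABET):
    """n bytes -> int -> string -> first n characters, as described above."""
    return encode(int.from_bytes(raw, "big"), alphabet)[: len(raw)]


def first_char_counts(n_bytes, alphabet=ALPHABET):
    """Count first characters over every possible n_bytes input."""
    counts = Counter()
    for value in range(256 ** n_bytes):
        raw = value.to_bytes(n_bytes, "big")
        counts[prefix_from_bytes(raw, alphabet)[0]] += 1
    return counts


counts = first_char_counts(2)
# 57**2 = 3249 and 65536 // 3249 = 20, so the leading (third) digit can
# only be 1..20: those characters dominate the first position.
for char in "23NPz":
    print(char, counts[char])
```

With 2 bytes the skew falls on the first 20 non-zero alphabet positions (`3` through `N`); with other lengths the boundary moves, but the shape is the one reported in this issue.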
With the standard 57-char alphabet and length=15:
You can see the effect goes away with an alphabet size of 16 where the alphabet size divides evenly into 256.
Even with the 16-char alphabet the problem that the first alphabet character is impossible to generate as the first string character remains. That's because if `number > 0` then `divmod(number, alpha_len)` can't ever return `(0, 0)`. If the div part is non-zero then we're not on the final character. If the mod part is non-zero then we'll output a character but it won't be `alphabet[0]`. We'd only be able to generate a 0 in the final iteration if the iteration count was fixed to match the expected width of the number.
Finally, random() returns a prefix of the returned value from int_to_string, so it always includes this partial character. One simple fix for common alphabet sizes would be to either return a suffix instead, or to not reverse the output of int_to_string when called from random(). Obviously the reversal is important for the MSB-first behaviour of the main UUID functionality, but I don't think it is required when calling random().
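The point about a fixed iteration count can be checked exhaustively on a single byte with a 16-character alphabet. This is a minimal sketch under stated assumptions, not the library's code: `variable_width` and `fixed_width` are hypothetical helpers, and the hex alphabet is chosen only because 16 divides 256 evenly.

```python
from collections import Counter

HEX = "0123456789abcdef"  # 16 symbols, so two digits cover one byte exactly


def variable_width(number, alphabet):
    """Loop until the number is exhausted, then reverse (MSB first)."""
    out = ""
    while number:
        number, digit = divmod(number, len(alphabet))
        out += alphabet[digit]
    return out[::-1] or alphabet[0]  # zero special-cased (simplification)


def fixed_width(number, alphabet, width):
    """Always emit exactly `width` digits, then reverse (MSB first)."""
    out = ""
    for _ in range(width):
        number, digit = divmod(number, len(alphabet))
        out += alphabet[digit]
    return out[::-1]


# First-character counts over every possible byte value.
variable = Counter(variable_width(v, HEX)[0] for v in range(256))
fixed = Counter(fixed_width(v, HEX, 2)[0] for v in range(256))

print(variable["0"], variable["1"])  # alphabet[0] only appears for input 0
print(fixed["0"], fixed["1"])        # every character appears equally often
```

With the loop length fixed, a leading `alphabet[0]` is emitted whenever the high-order digit is zero, which is what makes the first position uniform for this alphabet size.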
Looking at the code I think there would be a different problem if the size of the alphabet was more than 256 characters. In that case random() would generate n bytes of random data from os.urandom() even though more than that is required to fill an n-character string from that alphabet. I assume such a massive alphabet is rare... but maybe worth guarding against just in case?
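As a rough illustration of the large-alphabet concern, the minimum number of random bytes needed to cover every string of a given length can be computed with integer arithmetic. The helper name `min_bytes_needed` is hypothetical, and this only gives a lower bound on the input size; it does not by itself make the output uniform.

```python
def min_bytes_needed(length, alphabet_size):
    """Smallest byte count whose value range covers alphabet_size ** length."""
    distinct = alphabet_size ** length      # number of possible strings
    bits = (distinct - 1).bit_length()      # bits needed to index them all
    return (bits + 7) // 8                  # round up to whole bytes


for size in (16, 57, 256, 300):
    print(size, min_bytes_needed(15, size))
```

For a length of 15, an alphabet of up to 256 characters needs at most 15 bytes, while a 300-character alphabet needs 16, which is more than the n bytes described above would supply.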