Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The first character of shortuuid.random() strings is not fully random #101

Closed
BenSpencer opened this issue Jan 25, 2024 · 6 comments · Fixed by #103
Closed

The first character of shortuuid.random() strings is not fully random #101

BenSpencer opened this issue Jan 25, 2024 · 6 comments · Fixed by #103

Comments

@BenSpencer
Copy link

Generating random strings via shortuuid's random method gives very unbalanced first chars.

I am not sure if generating ID strings purely via random() and not via a UUID is an expected use-case, but it was surprising to me based on what I could find in the docs (and from the code on a first look!).

I'm not sure what you'll want to do with it so I've just written up what I've found. Hopefuully it's useful! We ended up going for a different solution in the end, so this is not blocking me.

Demo:

>>> import shortuuid
>>> from collections import Counter
>>> c = Counter()
>>> for x in shortuuid.get_alphabet(): c[x] = 0
...
>>> for _ in range(10000): c[shortuuid.random()[0]] += 1
...
>>> print("\n".join([f"{k}: {v}" for k, v in sorted(c.items())]))
2: 0
3: 5021
4: 174
5: 92
6: 77
7: 75
8: 72
9: 91
A: 82
B: 109
C: 82
D: 100
E: 111
F: 94
G: 103
H: 108
J: 90
K: 88
L: 77
M: 82
N: 75
P: 87
Q: 82
R: 100
S: 74
T: 98
U: 93
V: 97
W: 82
X: 83
Y: 94
Z: 86
a: 82
b: 98
c: 85
d: 90
e: 84
f: 71
g: 107
h: 84
i: 82
j: 89
k: 66
m: 78
n: 95
o: 85
p: 91
q: 96
r: 98
s: 94
t: 100
u: 73
v: 96
w: 76
x: 102
y: 103
z: 96

2 is impossible as a first character, and 3 is approx 50x more likely than most other characters.

Thinking there might be something going on with the byte/char alignment when using the full length from _length I also tried some shorter lengths. Setting different lengths changed how many of the earlier characters got higher probabilities but the overall pattern was similar.

For example here's the output for length=15. 2 is still impossible, but now everything in the range 3-B is a much more likely first char than C-z.

2: 0
3: 963
4: 1088
5: 1026
6: 961
7: 977
8: 1014
9: 1017
A: 1030
B: 971
C: 159
D: 18
E: 12
F: 20
G: 24
H: 16
J: 14
K: 23
L: 20
M: 13
N: 24
P: 15
Q: 13
R: 21
S: 20
T: 13
U: 16
V: 13
W: 14
X: 20
Y: 23
Z: 16
a: 18
b: 19
c: 21
d: 18
e: 14
f: 23
g: 11
h: 19
i: 16
j: 24
k: 26
m: 10
n: 12
o: 16
p: 25
q: 12
r: 22
s: 10
t: 25
u: 11
v: 13
w: 17
x: 17
y: 14
z: 13

The reason is that for a given output length n random() loads n bytes from os.urandom() and converts it to an int. The int is passed to int_to_string which generates the string one factor of the alphabet size at a time, least significant chars first (before reversing the string to give MSB-first output). If there is any misalignment between the byte boundaries and the alphabet-factor boundaries, then a final remainder of a limited size will fill the final output character, which becomes the first character returned from int_to_string, which is therefore not uniformly distributed.

With the standard 57-char alphabet and length=15:

  • We load 15 bytes of data from urandom, and convert it to an integer.
  • The integer has 256**15 possible values and 120 bits of information
  • 120 bits of information is 120 / log2(57) = 20.57... characters from our alphabet
  • Therefore the 21st character (which becomes the first/MSB in the output) is only containing about half a charater's worth of information.

You can see the effect goes away with an alphabet size of 16 where the alphabet size divides evenly into 256.

>>> shortuuid.set_alphabet("0123456789ABCDEF")
[...]

0: 0
1: 687
2: 659
3: 636
4: 715
5: 723
6: 655
7: 686
8: 665
9: 645
A: 600
B: 640
C: 685
D: 654
E: 670
F: 680

Even with the 16-char alphabet the problem that the first alphabet character is impossible to generate as the first string character remains. That's because if number > 0 then divmod(number, alpha_len) can't ever return (0, 0). If the div part is non-zero then we're not on the final character. If the mod part is non-zero then we'll output a character but it won't be alphabet[0]. We'd only be able to generate a 0 in the final iteration if the iteration count was fixed to match the expected width of the number.

Finally, random() returns a prefix of the returned value from int_to_string, so it always includes this partial character. One simple fix for common alphabet sizes would be to either return a suffix instead, or to not reverse the output of int_to_string when called from random(). Obviously the reversal is important for the MSB-first behaviour of the main UUID functionality, but I don't think it is required when calling random().


Looking at the code I think there would be a different problem if the size of the alphabet was more than 256 characters. In that case random() would generate n bytes of random data from os.urandom() even though more than that is required to fill an n-character string from that alphabet. I assume such a massive alphabet is rare... but maybe worth guarding against just in case?

@skorokithakis
Copy link
Owner

skorokithakis commented Jan 25, 2024

Hm, this is interesting, thank you. What solution do you recommend? The first thing that comes to my mind is to generate a string of N+1 length and remove the first character. The rest should be uniformly distributed.

@BenSpencer
Copy link
Author

Unless the alphabet is >256 chars I think you're already generating more digits than you need. For example

len(int_to_string(int(binascii.b2a_hex(os.urandom(15)), 16), shortuuid.get_alphabet()))

is either 20 or 21 when the caller has asked for an output of length 15 and random() already discards the final 5-6 chars.

Would it make sense to stop the iteration in int_to_string once the target length is reached? I think this would still return correct results for UUID data because the _length calculation makes the string just the right size, but when the length is specified manually in random() it would return the string generated for the less-significant characters fully randomly.

Or if you'd prefer not to modify int_to_string then I think an even simpler fix would be to switch random() from returning the requested length from the beginning of the string to slicing from the end instead.

return int_to_string(random_num, self._alphabet, padding=length)[:length]

would become

return int_to_string(random_num, self._alphabet, padding=length)[-length:]

@jsonvot
Copy link

jsonvot commented Jan 31, 2024

Hm, this is interesting, thank you. What solution do you recommend? The first thing that comes to my mind is to generate a string of N+1 length and remove the first character. The rest should be uniformly distributed.嗯,这很有趣,谢谢。您推荐什么解决方案?我想到的第一件事是生成一个长度为 N+1 的字符串并删除第一个字符。其余的应该均匀分布。

You can take a look at the secrets and string module introduced in Python 3.6, and maybe you can find what you need or inspiration

@hhartzer
Copy link
Contributor

hhartzer commented Mar 7, 2024

Very nice writeup!

I assume that this is not an issue if alphabet is 32 or 64 characters either since both divide into 256. I had assumed that these would be cryptographically secure, but it seems not (yet.)

Edit: Thinking about this, the easiest possible fix is probably to use secrets.choice() on the alphabet until we populate the length of string we want. I know this is hacky, slow, etc, but I'm guessing it would not have any of these issues and may be "fast enough" until a better solution is made.

@BenSpencer
Copy link
Author

I assume that this is not an issue if alphabet is 32 or 64 characters either since both divide into 256.

Yes I think that's right... except the first char of the alphabet can still never be selected.

Thinking about this, the easiest possible fix is probably to use secrets.choice() on the alphabet until we populate the length of string we want.

This seemed to be the recommended method from the Python docs when I looked. It will be cryptographically secure, if that's needed.

hhartzer added a commit to hhartzer/shortuuid that referenced this issue Mar 7, 2024
hhartzer added a commit to hhartzer/shortuuid that referenced this issue Mar 7, 2024
skorokithakis added a commit that referenced this issue Mar 8, 2024
* fix: Improve randomness (#101)

Fixes: #101

* Update shortuuid/main.py

* Appease pre-commit

---------

Co-authored-by: Stavros Korokithakis <[email protected]>
@skorokithakis
Copy link
Owner

Released 1.0.13 with the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants