The first character of shortuuid.random() strings is not fully random #101

BenSpencer · 2024-01-25T18:29:50Z

Generating random strings via shortuuid's random method gives very unbalanced first chars.

I am not sure if generating ID strings purely via random() and not via a UUID is an expected use-case, but it was surprising to me based on what I could find in the docs (and from the code on a first look!).

I'm not sure what you'll want to do with it so I've just written up what I've found. Hopefuully it's useful! We ended up going for a different solution in the end, so this is not blocking me.

Demo:

>>> import shortuuid
>>> from collections import Counter
>>> c = Counter()
>>> for x in shortuuid.get_alphabet(): c[x] = 0
...
>>> for _ in range(10000): c[shortuuid.random()[0]] += 1
...
>>> print("\n".join([f"{k}: {v}" for k, v in sorted(c.items())]))
2: 0
3: 5021
4: 174
5: 92
6: 77
7: 75
8: 72
9: 91
A: 82
B: 109
C: 82
D: 100
E: 111
F: 94
G: 103
H: 108
J: 90
K: 88
L: 77
M: 82
N: 75
P: 87
Q: 82
R: 100
S: 74
T: 98
U: 93
V: 97
W: 82
X: 83
Y: 94
Z: 86
a: 82
b: 98
c: 85
d: 90
e: 84
f: 71
g: 107
h: 84
i: 82
j: 89
k: 66
m: 78
n: 95
o: 85
p: 91
q: 96
r: 98
s: 94
t: 100
u: 73
v: 96
w: 76
x: 102
y: 103
z: 96

2 is impossible as a first character, and 3 is approx 50x more likely than most other characters.

Thinking there might be something going on with the byte/char alignment when using the full length from _length I also tried some shorter lengths. Setting different lengths changed how many of the earlier characters got higher probabilities but the overall pattern was similar.

For example here's the output for length=15. 2 is still impossible, but now everything in the range 3-B is a much more likely first char than C-z.

2: 0
3: 963
4: 1088
5: 1026
6: 961
7: 977
8: 1014
9: 1017
A: 1030
B: 971
C: 159
D: 18
E: 12
F: 20
G: 24
H: 16
J: 14
K: 23
L: 20
M: 13
N: 24
P: 15
Q: 13
R: 21
S: 20
T: 13
U: 16
V: 13
W: 14
X: 20
Y: 23
Z: 16
a: 18
b: 19
c: 21
d: 18
e: 14
f: 23
g: 11
h: 19
i: 16
j: 24
k: 26
m: 10
n: 12
o: 16
p: 25
q: 12
r: 22
s: 10
t: 25
u: 11
v: 13
w: 17
x: 17
y: 14
z: 13

The reason is that for a given output length n random() loads n bytes from os.urandom() and converts it to an int. The int is passed to int_to_string which generates the string one factor of the alphabet size at a time, least significant chars first (before reversing the string to give MSB-first output). If there is any misalignment between the byte boundaries and the alphabet-factor boundaries, then a final remainder of a limited size will fill the final output character, which becomes the first character returned from int_to_string, which is therefore not uniformly distributed.

With the standard 57-char alphabet and length=15:

We load 15 bytes of data from urandom, and convert it to an integer.
The integer has 256**15 possible values and 120 bits of information
120 bits of information is 120 / log2(57) = 20.57... characters from our alphabet
Therefore the 21st character (which becomes the first/MSB in the output) is only containing about half a charater's worth of information.

You can see the effect goes away with an alphabet size of 16 where the alphabet size divides evenly into 256.

>>> shortuuid.set_alphabet("0123456789ABCDEF")
[...]

0: 0
1: 687
2: 659
3: 636
4: 715
5: 723
6: 655
7: 686
8: 665
9: 645
A: 600
B: 640
C: 685
D: 654
E: 670
F: 680

Even with the 16-char alphabet the problem that the first alphabet character is impossible to generate as the first string character remains. That's because if number > 0 then divmod(number, alpha_len) can't ever return (0, 0). If the div part is non-zero then we're not on the final character. If the mod part is non-zero then we'll output a character but it won't be alphabet[0]. We'd only be able to generate a 0 in the final iteration if the iteration count was fixed to match the expected width of the number.

Finally, random() returns a prefix of the returned value from int_to_string, so it always includes this partial character. One simple fix for common alphabet sizes would be to either return a suffix instead, or to not reverse the output of int_to_string when called from random(). Obviously the reversal is important for the MSB-first behaviour of the main UUID functionality, but I don't think it is required when calling random().

Looking at the code I think there would be a different problem if the size of the alphabet was more than 256 characters. In that case random() would generate n bytes of random data from os.urandom() even though more than that is required to fill an n-character string from that alphabet. I assume such a massive alphabet is rare... but maybe worth guarding against just in case?

The text was updated successfully, but these errors were encountered:

skorokithakis · 2024-01-25T18:34:30Z

Hm, this is interesting, thank you. What solution do you recommend? The first thing that comes to my mind is to generate a string of N+1 length and remove the first character. The rest should be uniformly distributed.

BenSpencer · 2024-01-26T10:06:48Z

Unless the alphabet is >256 chars I think you're already generating more digits than you need. For example

len(int_to_string(int(binascii.b2a_hex(os.urandom(15)), 16), shortuuid.get_alphabet()))

is either 20 or 21 when the caller has asked for an output of length 15 and random() already discards the final 5-6 chars.

Would it make sense to stop the iteration in int_to_string once the target length is reached? I think this would still return correct results for UUID data because the _length calculation makes the string just the right size, but when the length is specified manually in random() it would return the string generated for the less-significant characters fully randomly.

Or if you'd prefer not to modify int_to_string then I think an even simpler fix would be to switch random() from returning the requested length from the beginning of the string to slicing from the end instead.

return int_to_string(random_num, self._alphabet, padding=length)[:length]

would become

return int_to_string(random_num, self._alphabet, padding=length)[-length:]

jsonvot · 2024-01-31T02:52:35Z

Hm, this is interesting, thank you. What solution do you recommend? The first thing that comes to my mind is to generate a string of N+1 length and remove the first character. The rest should be uniformly distributed.嗯，这很有趣，谢谢。您推荐什么解决方案？我想到的第一件事是生成一个长度为 N+1 的字符串并删除第一个字符。其余的应该均匀分布。

You can take a look at the secrets and string module introduced in Python 3.6, and maybe you can find what you need or inspiration

hhartzer · 2024-03-07T18:43:34Z

Very nice writeup!

I assume that this is not an issue if alphabet is 32 or 64 characters either since both divide into 256. I had assumed that these would be cryptographically secure, but it seems not (yet.)

Edit: Thinking about this, the easiest possible fix is probably to use secrets.choice() on the alphabet until we populate the length of string we want. I know this is hacky, slow, etc, but I'm guessing it would not have any of these issues and may be "fast enough" until a better solution is made.

BenSpencer · 2024-03-07T21:50:39Z

I assume that this is not an issue if alphabet is 32 or 64 characters either since both divide into 256.

Yes I think that's right... except the first char of the alphabet can still never be selected.

Thinking about this, the easiest possible fix is probably to use secrets.choice() on the alphabet until we populate the length of string we want.

This seemed to be the recommended method from the Python docs when I looked. It will be cryptographically secure, if that's needed.

Fixes: skorokithakis#101

* fix: Improve randomness (#101) Fixes: #101 * Update shortuuid/main.py * Appease pre-commit --------- Co-authored-by: Stavros Korokithakis <[email protected]>

skorokithakis · 2024-03-11T20:14:07Z

Released 1.0.13 with the fix!

hhartzer added a commit to hhartzer/shortuuid that referenced this issue Mar 7, 2024

fix: Improve randomness (skorokithakis#101)

9253dde

hhartzer added a commit to hhartzer/shortuuid that referenced this issue Mar 7, 2024

fix: Improve randomness (skorokithakis#101)

455293b

Fixes: skorokithakis#101

hhartzer mentioned this issue Mar 7, 2024

fix: Improve randomness (#101) #103

Merged

skorokithakis closed this as completed in #103 Mar 8, 2024

skorokithakis added a commit that referenced this issue Mar 8, 2024

fix: Improve randomness (#101) (#103)

5dda33e

* fix: Improve randomness (#101) Fixes: #101 * Update shortuuid/main.py * Appease pre-commit --------- Co-authored-by: Stavros Korokithakis <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The first character of shortuuid.random() strings is not fully random #101

The first character of shortuuid.random() strings is not fully random #101

BenSpencer commented Jan 25, 2024

skorokithakis commented Jan 25, 2024 •

edited

Loading

BenSpencer commented Jan 26, 2024

jsonvot commented Jan 31, 2024 •

edited

Loading

hhartzer commented Mar 7, 2024 •

edited

Loading

BenSpencer commented Mar 7, 2024

skorokithakis commented Mar 11, 2024

The first character of shortuuid.random() strings is not fully random #101

The first character of shortuuid.random() strings is not fully random #101

Comments

BenSpencer commented Jan 25, 2024

skorokithakis commented Jan 25, 2024 • edited Loading

BenSpencer commented Jan 26, 2024

jsonvot commented Jan 31, 2024 • edited Loading

hhartzer commented Mar 7, 2024 • edited Loading

BenSpencer commented Mar 7, 2024

skorokithakis commented Mar 11, 2024

skorokithakis commented Jan 25, 2024 •

edited

Loading

jsonvot commented Jan 31, 2024 •

edited

Loading

hhartzer commented Mar 7, 2024 •

edited

Loading