Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Billion laughs attack #235

Closed
jonasw234 opened this issue Nov 23, 2018 · 10 comments
Closed

Billion laughs attack #235

jonasw234 opened this issue Nov 23, 2018 · 10 comments

Comments

@jonasw234
Copy link

Is there a way to disable anchors and aliases or cap the number of characters that can be created through expansions?
Right now PyYAML seems to be susceptible to billion laughs attacks.
@guyskk created a new version in #37 that prevents that but it also uses OrderedDict and SafeLoader, so it might be a good idea to implement just this functionality like the ignore_aliases=True flag in #104 for yaml.load/yaml.safe_load.

@guyskk
Copy link

guyskk commented Nov 23, 2018

My workaround: https://github.com/guyskk/simple-yaml

@perlpunk
Copy link
Member

I think the loading part of PyYAML itself is not a problem.
All the items are actually references, so they don't eat up additional memory.
If you load the document and then change data["a"] = "XXX", you will see that the new value appears in all the aliases, too.
Only if you process the data by walking through the whole tree and concatenate the content to a string, for example, you would get a problem. Or if you simply use print(data).

That said, ignoring aliases can still be an option. I don't see it as critical, though.

@jonasw234
Copy link
Author

I tested with Python 3.6.6 under WSL, PyYAML 3.13 with the following code:

import yaml
yaml.safe_load("""a: &a ["lol","lol","lol","lol","lol","lol","lol","lol","lol"]
b: &b [*a,*a,*a,*a,*a,*a,*a,*a,*a]
c: &c [*b,*b,*b,*b,*b,*b,*b,*b,*b]
d: &d [*c,*c,*c,*c,*c,*c,*c,*c,*c]
e: &e [*d,*d,*d,*d,*d,*d,*d,*d,*d]
f: &f [*e,*e,*e,*e,*e,*e,*e,*e,*e]
g: &g [*f,*f,*f,*f,*f,*f,*f,*f,*f]
h: &h [*g,*g,*g,*g,*g,*g,*g,*g,*g]
i: &i [*h,*h,*h,*h,*h,*h,*h,*h,*h]""")

Output from pmap -x $(pgrep python3) is

459:   python3
Address           Kbytes     RSS   Dirty Mode  Mapping
---------------- ------- ------- -------
total kB         6223040       0       0

at peak memory consumption and loading takes quite a while.

@perlpunk
Copy link
Member

@jonasw234 interesting. I get different results but I don't know how to get memory consumption at the peak.
If I load it into a variable and then measure at the end, I get 63196 kB for python 2.7.14 and 35360 kB for python 3.6.4.

I don't see any significant runtime and memory differences if I comment out lines f-i, for example.
I'm on debian.
Will post detailed results later.

@jonasw234
Copy link
Author

I think I see the problem now, I was testing in an interactive console, which tried to output the results immediately. My bad, thanks for the help!

@perlpunk
Copy link
Member

@jonasw234 oh good ;-)
phew

@DEMON1A
Copy link

DEMON1A commented May 3, 2021

Hey there, I think there's something isn't fine here. I know the problem isn't from PyYaml itself. but it's from python because of the memory usage when storing strings and integers. but as you can see most of the python codes is using this data and interacting with it and you can guess what's the point of getting the user data when the developer won't use it.

Setting a limit for the data getting over there will be great or that means you can freeze most of the python codes using PyYaml now since most of them are storing this data on strings then passing them to functions. or at least you can mention that on the docs.

@nitzmahone
Copy link
Member

nitzmahone commented May 3, 2021

Off the top of my head, customizing a Composer to completely disable anchor support and propagation would probably be pretty straightforward even in the current code. Any other kind of memory limiting behavior would be better accomplished by the runtime environment (eg, OS memory quotas, cgroups), as the details can vary significantly between Python versions, implementations (eg CPython vs PyPy), and even build parameters. Attempting to track and limit memory usage within a library with so many variables in play doesn't seem like a recipe for success.

If it is indeed as easy as I think it might be to disable anchor support at the Composer level, that might be a customization we could include in the box with some of the new work that's going on... I've added this to the 6.0 project as something to consider and play with (but no promises).

EDIT: and TBC, what I'm thinking of here would not address the ask in #37- anchors and refs would still be syntactically significant to the parser, they'd just be errors when actually composing the final representation of the doc.

@DEMON1A
Copy link

DEMON1A commented May 4, 2021

That looks great, I guess that will get it from the root case itself.

@pantoniou
Copy link
Contributor

For what is worth, this attack is harmless (and perhaps a valid use case) if one were to do lazy anchor resolution.
That is, one would never resolve references, merely follow them until the resulting end node.

Admittedly this is perhaps a hard thing to do in many languages.

Doing it that way, makes YAML suitable for representing any arbitrary hierarchy of objects, including ones with cyclic references.

For future versions of the spec I would argue that this is the default behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants