-
Notifications
You must be signed in to change notification settings - Fork 789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Track PVF preparation memory with a global allocator wrapper #718
Comments
Very interesting. Having consistent figures would be perfect for deterministically determining preparation memory. I am not sure if we can safely disregard the allocator overhead, though -- it would mean that, potentially, preparation can pass on one validator and error out (e.g. with OOM) on a different validator, with a different allocator. The enforced memory limit for preparation would have to be low enough where we would have some certainty that no reasonable allocator could trigger OOM. I do like this idea -- do you know of any prior art for this technique @koute? |
In theory an OOM on only one set of validators can already happen due to factors outside of our control. We don't control how much RAM the validator machines have; we don't control what other processes they're running; we don't control the system libraries and how much memory they take, etc. Still, this is a valid attack vector. If the attacker can directly allocate and deallocate memory and write to that memory then it's 100% possible and very easy to trigger an OOM which a global allocator wrapper would be unable to detect. But for metering non-malicious actors it should probably be fine, and technically we could sidestep the problem by also having a second absolute global limit - either a cgroups-based one (which would be ideal since it's OS enforced) or an RSS-based one (worse and potentially unreliable, but would work on more OSes). The deterministic** limit would work well for normal payloads, while this alternative one would be high enough to never trigger on normal payloads but would be a failsafe against potential bad actors. There's also an option of writing our own allocator which could work the same across OSes, allow us to track memory usage more precisely and would be in pure Rust. If we have our own database then maybe we could also have our own allocator? :P (I could probably write one with jemalloc-like performance in less than a month, but, honestly, it's probably not a good use of our resources?) ** - although even this is probably not going to be entirely deterministic due to e.g. multithreading where different amounts of memory will get allocated depending on how the OS schedules everything etc., but it should still be roughly in the same ballpark.
This is used relatively common for debugging purposes, but probably not for metering. |
Thanks, great info! And agreed with everything.
Haha, not sure. Maybe a good candidate for a grant from W3F? 🙂 |
Potentially also interesting are low-fragmentation allocators like e.g. https://github.com/plasma-umass/Mesh. Would it have any benefits in avoiding OOM? Would we get more deterministic memory measurements? There is for sure some performance overhead in keeping the fragmentation low. But I'd be interested in seeing how it performs in practice -- by using less memory the application could utilize cache more effectively, who knows? Worth exploring further? |
I believe it's not the time to put some significant effort into developing a deterministic cross-platform allocator yet. We are still so far away from cross-platform determinism in many other aspects that it would make little difference. From my perspective, what we really need is to develop a wrapper allocator which would do some bookkeeping for us and then forward allocation/deallocation requests to a real allocator. That would be enough for now; enforcing those limits somewhat differently on different platforms is bad but not worse than non-deterministic native stack behavior and non-deterministic timeouts on different platforms. We should not worry about multithreading right now as well, as Wasmtime does all the PVF preparation in single-threaded mode, and I don't see any clear pathway to overcoming that shortly.
As I see it, it adds one more headache (to ensure not only deterministic measurement but also deterministic defragmentation logic) with no sensible positive outcome. If our wrapper allocator enforces OOM, it doesn't matter if the allocated memory is fragmented or not, we'll enforce OOM after the allocation of X bytes anyway. If the kernel enforces OOM, it doesn't give us any leverage. If one host has 1 Gb of available memory, another host has 8 Gb, and our limit is 1.5 Gb, then trying to allocate 1.2 Gb will result in OS OOM condition on the first host and not on the second, regardless of the memory is fragmented or not. |
Agreed about the wrapper. Seems like it would help us track allocation metrics and enforce limits with minimal effort.
Why is it a headache to try a new allocator and see how it performs? If we write the wrapper properly we should be able to swap allocators in and out pretty easily. If we try the low-fragmentation allocator and it turns out to be too slow (which I suspect it probably will), then we just don't use it. But if it turns out to have acceptable performance, why not use it? It would help us avoid OOM in practice, I think.
Yes, but we could maybe practically set the limit X lower. From what I've seen there can be huge amount of waste (gaps/fragments on the heap) with traditional allocators. Any limit X implicitly accounts for this wasted space. |
Ah, I see what you mean. The profitability of defragmentation highly depends on what we're limiting, and we have yet to agree on that.
I agree it's not a problem to try another allocator, but we should understand what we're trying to achieve with it to be able to evaluate the outcome :) |
Someone should probably look at the metrics we added and see what they say. 😂 So far I would agree that RSS is not very useful since the measurements I have seen were very high. Here is a post summarizing the advantages/disadvantages of each approach. Edit: concerning the issues with RSS and swap, @eskimor there argues that most validators will not have swap enabled or significant enough memory pressure for it to be an issue. Tracking and limiting allocations seems sensible and simple to me. Additionally taking into consideration fragmentation instead of just the raw allocation stat would probably actually be less deterministic. Was just a fun idea though.
What do you mean? Wikipedia: "The size of this segment is determined by the size of the values in the program's source code, and does not change at run time." |
I agree, with one amendment: "under current conditions". @bkchr always reminds me that it's not us who are in control of the network. We cannot predict what validators, in what hardware configurations, and even with which concrete implementations of Polkadot spec will participate on the network in the future. I wouldn't like the determinism of the measurements could be ruined by honest actors who decided to enable swap space on their hosts (btw, some of them, hosted on shared platforms, don't have a choice). From the point of view of an OS process, we don't have any memory beyond the virtual one, so I'd say it's exactly what has to be measured (or its volatile part, at least, to be more precise).
Well, yes, there's a lot of ambiguity between abstract concepts and real life :) That's what is called "heap segment" in the article, but in a real Linux process, there's no separate heap segment, it's a continuation of the data segment, which grows towards the stack when you allocate more memory, and it represents the heap. Look at the |
Yeah, we have raw accesses to jmalloc-ctl sprinkled around the codebase. It'd probably make sense to create e.g.
FYI, AFAIK only glibc's crappy allocator uses sbrk; good allocators in general don't. (e.g. jemalloc by default only uses it to allocate extents if it can't do so through mmap, so in practice should be pretty much never)
This reminds me, when tracking RSS it'd probably make sense to track RSS and swap. In general if there's swapping going on then that does open an entirely new DoS vector. (e.g. imagine the PVF process triggering enough memory allocations to not OOM anything yet, but to trigger excessive swapping in the host validator process; we could rectify this by running both in separate cgroups which gives us tools to control this) |
That means I'm somewhat rusty on modern allocators. If it's all Cgroups would be an excellent opportunity to take control over everything we need, but are they available anywhere beyond Linux? |
They aren't, but do we care? We could just require people to run Linux to securely run a validator (and 99% of nodes are running Linux according to our telemetry) and for other OSes have minimal/no sandboxing and require e.g. a command line parameter |
Also, this is totally unrelated, but another benefit of running PVF stuff under cgroups is that we could give the cgroup e.g. only 2 cores and then safely enable parallel compilation in |
That's a valid point for now, but doesn't it look like we're trying just to postpone the problem? For example, we're talking a lot about cross-platform PVF execution determinism, like between x64 and arm64, when we could just require validators to run on x64 and forget about that, but for how long? I mean, a global platform revolution wiping out Linux servers on Intel in favor of, say, AIX servers on ARM is unlikely to happen tomorrow, and the question is how far ahead should we look? |
We can not and we should not require anything. However, we can put out recommendations and tell people on what the current implementation is not supporting. If we only support |
But telling people "we only support Anyway, as we discussed before, the preparation memory limit is not something we can apply deterministically across platforms and not something that can be included in the spec for that reason. I remember you agreed it's a hacky thing, and we only need it to mitigate possible malicious behavior until we find some better solution. So if nobody has any objections against sticking to a single platform for now, I will vote for cgroups then, that looks like the most straightforward solution. |
Came here from another issue. Just my 2 cents. Apologies if it doesn't make sense.
But isn't it more than that? If your node can't limit the memory the binary can crash with OOM and the validator will need some time to restart. An attacker would then craft a parathread that crashes all non Linux validators. And that could affect the overall network, right?
I think it is. Probably no one would run a validator and risk the binary being crashed by a malicious parathread. But as you said: It is already a pretty bad idea to run on a non-Linux system for other reasons. I don't get the Intel hardware part, though. Cgroups should be supported on different hardware architectures as long as it is Linux.
I don't think it causes any harm except that adds another thing to the list of things that need to be reworked for cross platform support in the future. It seems unlikely but I remember some talks about leveraging Apple hardware for its insane memory bandwith. But it seems much more likely that Intel will just adapt the feature and we continue using it on Linux. |
I believe it was in another discussion where I insisted that any memory limits cannot be applied deterministically across platforms. That is, we can enforce it with cgroups, wrapper allocator, or whatever, but the Wasmtime binary itself, compiled on different platforms, wouldn't use precisely the same amount of memory on each of them. Somewhere due to alignment rules, somewhere due to different code generation, but they shall differ. And if we set the memory limit to 1024 bytes, and Intel binary uses 1023 and ARM binary uses 1025, it is still not deterministic at all. The same goes for the stack overflow condition, for example. |
In general, when I said "we" I meant "Parity" aka the Parity Polkadot node implementation. We need to face the reality and the reality currently is that most validators run on Linux. Let's take the instance reusing which was slow on MacOs, because we only had optimized it for Linux until you @athei came and optimized it for lovely MacOs ;) Here we can go again the same way, we will only start with
That is only for research to find out if we may want this when this is commodity/not only apple exclusive.
We can continue discussing about all these small edge cases until the end of the universe :P We will need some framework to find out these problems and then come up with some smart ways to solve them. I already proposed some RFP to work on some conformance testing framework: https://github.com/w3f/Grants-Program/blob/master/docs/RFPs/Open/parachain_validation_conformance_testing.md |
Or we write our own wasm compiler that is build around all our requirements from the beginning. |
Vote up with both hands. I'm currently trying to come up with a proof of concept for my very simplistic approach to that problem, we'll see if it pays out, but regardless, having deterministic code generation is the only straightforward way to deterministic code execution. |
Writing a linear time compiler for wasm that also produces fast code is hard. There were many failed attempts in trying to make this work for contracts. Given, the requirements for contracts are stricter and you might come up with something decent when only needing to be linear in memory usage but not time. But it seems like a colossal task if you try to compete with the performance of wasmtime. I think cranelift (the wasmtime backend) is already pretty memory conscious. So vetting it for linear memory complexity and disabling non linear (in terms of memory) optimizations might be the way to go. They might be open to merge a "linear memory complexity mode". Performance will probably suffer but it will with a custom solution, too. Some optimization just don't have linear algorithms. But the upshot is that you don't need to worry about limiting memory. For example, in contracts our current thinking is that it might be easier to design tailor made instruction set that can be trivially compiled and has other useful properties for our use case. My 2 cents: Stick with wasmtime and memory metering / cgroups. It seems to be good enough. As long as you don't need linear time compilation that is (it seems that you don't). |
I should have written it, but this was a JOKE :D For sure I don't want to encourage us to write our own wasm compiler, we have more important things to fix for now :P And yeah, this would be an endless task that would take some years to be finished. |
Leaving this here in case it's useful for the allocator wrapper (just skimmed it so far): |
That was my first thought, and that's what I was experimenting with for the last couple of days, and now I doubt if it's the way to go. The problem is that Wasmtime is multi-threaded by itself. So in the end we get a bunch of statistic counters from different threads only identified by their id, and we don't know what should be summed up and what shouldn't. I'm evaluating other solutions. As the preparation worker is a separate process, it seems that per-process measurement, if accompanied by adjustable allocator checkpoints and limits, should give a very good approximation, and when worker binaries are separated it will be close to perfect. Any thoughts and ideas are highly appreciated. |
Sounds reasonable, even with both workers being part of the same process it should work? As each process is always only doing one task? |
Currently, they are different OS processes (spawned by different calls to Are we going to change that at some point? |
Change what? I mean each validation/preparation always needs to be done in its own process. We need to be able to kill the workers if something is taking more time than allowed. |
Yeah, agreed, makes perfect sense, I just didn't understand that "even with both workers being part of the same process" part of your question :) |
* reference erc20 contract from new location * skipped xcm tests
…aritytech#718) Bumps [tar](https://github.com/alexcrichton/tar-rs) from 0.4.35 to 0.4.36. - [Release notes](https://github.com/alexcrichton/tar-rs/releases) - [Commits](alexcrichton/tar-rs@0.4.35...0.4.36) --- updated-dependencies: - dependency-name: tar dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fn increase_message_fee() * benchmarks + weights * - extra lines * split error
We could perhaps emulate this by having a global allocator wrapper and just tracking the sizes ourselves. Won't be as accurate since it wouldn't include the allocator's overhead/metadata, but on other hand that'd make it more consistent (the figures reported would be the same on different allocators) so maybe it's not a bad thing.
Originally posted by @koute in paritytech/polkadot#6675 (comment)
Related issue
#719
The text was updated successfully, but these errors were encountered: