Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run tests many times #2011

Closed
wants to merge 4 commits into from
Closed

run tests many times #2011

wants to merge 4 commits into from

Conversation

faddat
Copy link
Contributor

@faddat faddat commented Oct 21, 2024

This pull request runs the tests twenty times with a cache and twenty times without a cache on linux and mac, for a grand total of eighty runs of the test suite.

The observed error is always:

runtime: bad pointer in frame github.com/CosmWasm/wasmvm/v2/internal/api.AnalyzeCode at 0xc001268738: 0x1
fatal error: invalid pointer found on stack

This pull request makes no changes to the code in this repository, and simply runs its tests eighty times.

@faddat
Copy link
Contributor Author

faddat commented Oct 21, 2024

As can be seen in the test results, we get a pointer error stemming from the internal api of wasmvm a significant portion of the time.

I can also say that after fixing the gas issues that using v2.1.3 poses, this problem still occurs about 5% of the time.

If I had to guess: CGO.

Copy link

codecov bot commented Oct 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 42.87%. Comparing base (b2b6abe) to head (55d930a).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2011   +/-   ##
=======================================
  Coverage   42.87%   42.87%           
=======================================
  Files          85       85           
  Lines       11587    11587           
=======================================
  Hits         4968     4968           
  Misses       6179     6179           
  Partials      440      440           

@chipshort
Copy link
Collaborator

Good find! I'm also able to reproduce this locally on my M1 Macbook. That's really bad.
Could it be similar to CosmWasm/wasmvm#536? Except this one seems to happen with the default optimization level.
We'll look into this.

@faddat
Copy link
Contributor Author

faddat commented Nov 2, 2024

The problem is not the continuous integration system. It is all of the binary blobs that sit in the internal folder of wasm VM, and the fact that:

  • CGO is nearly impossible to debug
  • CGO is known to leak ram like whoa
  • CGO interrupts a bunch of the basic low level features of golang
  • The current design pass go <-> c <-> rust (rust)

And yes I agree that it is very bad. It means that when merging prs, the process has been to just click retry even though at best there are flaky tests and at worst there's a very serious logic error, but it was chronically ignored.

And the reality is that can actually be seen in the CI system.

@chipshort
Copy link
Collaborator

  • I agree that debuggability is a bit of a pain point here.
  • I don't know much about the internals of CGO, but I would be very surprised if it leaked a lot of memory. And FWIW we haven't gotten any reports about memory leaks that could be traced back to CGO as a culprit. Memory leaks are most likely caused by the underlying C / Rust library that is called through CGO.
  • It doesn't really pass through C as a language, but we do need to use C ABI on the Rust side. We are planning to investigate using uniffi to simplify all these bindings and make them more robust.

It means that when merging prs, the process has been to just click retry even though at best there are flaky tests and at worst there's a very serious logic error, but it was chronically ignored

You are making some very strong claims here that are entirely based on the assumption that we found this bug already before your PR. To my knowledge, this is not the case. Do you have any proof that someone just reran CI to hide such a bug? Because if so, I would like to talk to them about it.

@faddat
Copy link
Contributor Author

faddat commented Nov 8, 2024

Concerning strong claims -- some of them are by necessity true.

The only one I'm not sure of concerns weather or not the issue is in the tests or

I can assure you (you can believe me or not) that cgo is basically the weak point of any app it is in.

@faddat
Copy link
Contributor Author

faddat commented Dec 3, 2024

Guys, I noticed that a fix to this had been merged to main.

Unfortunately CI is not set up to run tests 20 times.

The result of this is, to quote Do Kwon, "your patch is not patch"

... I am going to make a new PR that targets main, just to confirm this result, but I also wanted to make sure that the main branch is patched?

I am of course referencing lthe PRs/Issues linked to this PR.

@chipshort
Copy link
Collaborator

I have already confirmed this in a previous PR, but feel free to confirm this yourself.
Just keep in mind that we won't merge it because running all tests that often all the time is just wasteful.

@aumetra
Copy link
Member

aumetra commented Dec 4, 2024

The current design pass go <-> c <-> rust (rust)

If you have the money to fund a few months of full-time dev and all the resulting maintenance, feel free to contact us, then I can aid in implementing the Go calling convention using assembly trampolines to call Rust without CGO. Otherwise we will have to keep this design.

C is the lingua franca of programming languages. Not because it's a good choice, but because it's abundant. C doesn't even have a well-defined ABI, it's just "what compiler people agreed on doing at some point"(tm), and we all have to live with how terrible C is at representing anything complex (see: https://faultlore.com/blah/c-isnt-a-language/).


Sidenote on that "what compiler people agreed on" point: Not even that is true. There are multiple instances of painful interop difficulties between different compilers because they do subtly different things on the ABI level: https://faultlore.com/blah/abi-puns/

But enough of this tangent.


there are flaky tests and at worst there's a very serious logic error

Yes. And these are probably Heisenbugs due to Go having a garbage collector, and garbage collectors are usually non-deterministic. This would be a non-issue if we had deterministic memory reclamation, but unfortunately we don't live in that world.

This is probably an issue of Go not keeping the object around quite long enough, but due to the GC not running immediately it will work in most cases. But in some cases where the GC happens to reclaim memory right before the pointer is dereferenced, the thing will crash.

What's additionally painful is that all of this is very much implicit and has no compiler guidance, so you're basically poking in the dark. It's as if you were writing C, but a chaos-monkey inserts free calls wherever it thinks it could fit at runtime. Not optimal. Quite suboptimal, actually.


Unfortunately CI is not set up to run tests 20 times.

There could be a middleground here. Running the tests scheduled once a day 20 times. Should still catch bad regressions, but it's not as wasteful as running it on every commit.

Otherwise, again, feel free to pay for the CircleCI credits. Then we can run it as often as you want.

@faddat
Copy link
Contributor Author

faddat commented Dec 4, 2024

Hey :)

I don't know who you are, but work is underway to remove cgo.

I am not paying for it, nor am I being paid for it.

I am jfdi it.

This shouldn't cost a penny on circleci: these are GitHub actions jobs.

Personally not a big fan of circleci.

https://GitHub.com/cosmwasgo/cosmwasgo is what I am working on here, and basically all I want to achieve is a drop-in replacement to wasmvm because it is in my opinion very problematic.

There could be a middleground here. Running the tests scheduled once a day 20 times. Should still catch bad regressions, but it's not as wasteful as running it on every commit.

Could you lmk what these tests are costing?

Again -- for me/orgs I set up on github -- typically I keep them 100% public, then there's no need to pay github. I know that sometimes, GH services can be horiffically expensive.

@aumetra
Copy link
Member

aumetra commented Dec 4, 2024

I'm an engineer at Confio. This comment was made purely to explain our stack choices and technical reasons behind it. Not for you specifically but for people coming across this, to get across our reasoning here.

I felt this was needed since your comments display a lack of knowledge on why these things work how they work right now, and make quite bold claims.

Whether this is done on purpose, due to a language barrier, or otherwise: I don't really care.

And the comments about insinuating that "you should pay" have a simple reason: you want us to foot the bill for your CI workflows, that you are seemingly insistent on adding.

This is not something that should need adding and is a defect of tooling around Go, since every other language in existence has address sanitizers.

@faddat
Copy link
Contributor Author

faddat commented Dec 4, 2024

the jobs I added work just fine with the totally free github actions, which is actually all I personally use, and indeed I personally use that exact option.

I 100% get you on the GC and heisenbug thing and that is the precise reason that I am working on pure Go VM to run compiled rust contracts in.

I don't think that it should be controversial -- or bold at all -- to say that it is a bad idea to use unsafe in rust, or report -- as I have for years -- that chains using x/wasm immediately begin to leak RAM. They do, although confio has consistently denied that... but in the end, well, I was able to find the issue, and much to Confio's credit, the issue is now resolved. Thank you and your team for that.

Really, that is about all I am really aiming at here.

It is unfortunate that I (placing blame on my side) was not able to get the point across during the first couple of years.

Weather or not this changes your thoughts on what I am doing here -- I think CosmWasm is a great way to build blockchain applications, but that WasmVM isn't real safe, due to the use of CGO -- in fact... until I made this PR to run the tests many many times, I didn't realize that unsafe was being used.

There's nothing at all wrong with the original virtual machine design in CosmWasm. But today there are better options avail.able.

Thanks to you and anyone else at confio for assisting in the resolution of this issue. To me, that's the key.

If there is anything further that you feel I do not understand here, please, let me know.

PS: about costs -- at no time was I attempting to impose costs on Confio -- me, I just use all the free github stuff, works really well. If you are using the paid stuff, well, I totally get it, and sorry if it came off as a demand that Confio pay.

@aumetra I missed your last sentence

This is not something that should need adding and is a defect of tooling around Go, since every other language in existence has address sanitizers.

yes absolutely, cgo is one of the worst, very very worst parts of go!

That's why I'd like to write CosmWasm contracts in Rust, and run it without cgo.

@faddat
Copy link
Contributor Author

faddat commented Dec 4, 2024

Yes. And these are probably Heisenbugs due to Go having a garbage collector, and garbage collectors are usually non-deterministic. This would be a non-issue if we had deterministic memory reclamation, but unfortunately we don't live in that world.

this is why I want to stop using cgo in chains written in Go. I feel like we are in fact on the same page.

@faddat
Copy link
Contributor Author

faddat commented Dec 4, 2024

we haven't gotten any reports about memory leaks that could be traced back to CGO as a culprit. Memory leaks are most likely caused by the underlying C / Rust library that is called through CGO.

You have, but that is okay!

Look there's one key thing here and the others don't matter: we are fixing stuff.

@pinosu
Copy link
Contributor

pinosu commented Dec 12, 2024

Hey @faddat , the fix is finally in main. Here you can see the tests passing #2057 .
I will close this PR because the issue is now solved in main.
Thanks for your contribution 👍

@pinosu pinosu closed this Dec 12, 2024
@pinosu pinosu mentioned this pull request Dec 12, 2024
@faddat
Copy link
Contributor Author

faddat commented Dec 16, 2024

I'm just really happy to see this get addressed. Thanks for your work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants