Simulation Failure/Timeout #13088

alexanderbez · 2022-08-30T00:34:13Z

So it has been raised to my attention that non-PR CI simulations are timing out. Namely, they were passing/running to completion up and until this PR, however, they started to timeout after this PR which I implemented.

Thanks for raising this @kocubinski. I initially thought it was a bug or something in my PR, however, after digging into it a bit more, I don't think it is. Namely, if you run the sim with seed=11 for example, you'll get to a point where it's stuck on EndBlock, specifically, x/crisis EndBlock. If you look at that, you'll see it gets stuck on:

ASSERTING CRISIS INVARIANTS; INV: 8/12, NAME: group/Group-TotalWeight

Taking a look at this invariant, you see it's a naked for loop (very bad!). In any case, I don't see how this has anything to do with my PR and I think my PR just surfaced a mutation in the simulation execution branch path that simply uncovers a pre-existing issue in the x/group Group-TotalWeight invariant.

Take a look: https://github.com/cosmos/cosmos-sdk/blob/main/x/group/keeper/invariants.go#L41-L98

I'm convinced this has nothing to do with PR. Someone with x/group expertise should look into this. Here is the output with seed 11 for the invariant:

...
groupWeight: 18, membersWeight: 18
groupWeight: 23, membersWeight: 23
groupWeight: 20, membersWeight: 20
groupWeight: 5, membersWeight: 5
groupWeight: 9, membersWeight: 9

Also, I noticed the last message to be executed prior to this getting stuck was the following:

OP MSG: {"route":"/cosmos.authz.v1beta1.MsgRevoke","name":"/cosmos.authz.v1beta1.MsgRevoke","comment":"","ok":true,"msg":{"type":"cosmos-sdk/MsgRevoke","value":{"grantee":"cosmos1l3wl6d57krm4ujdgtg4p5pdh9fw9evhn3t8e2a","granter":"cosmos18ph9tar8en6z8f9tn529utk2xlmuk6waw0ltm0","msg_type_url":"/cosmos.bank.v1beta1.MsgSend"}}}

The text was updated successfully, but these errors were encountered:

alexanderbez · 2022-08-30T00:50:31Z

I also don't see any errors. I printed the group IDs and member addresses and it seems to get stuck after getting the group ID:

...
GROUP ID: 708
GROUP MEMBER ID: cosmos1tljq9xse7wvfuuajee80ukch7lu3x8rc8fs7kw
GROUP MEMBER ID: cosmos13f2hqukegfz75dqzpdc36uy7aq9elhal72emmc
GROUP MEMBER ID: cosmos1ed2j0xrdaattf82xc6rnxhsdel2gxeea8r2zqw
GROUP MEMBER ID: cosmos16nfnckv83a7n34l624flkqfcmvqn43mg8tlyhe
GROUP ID: 709
GROUP MEMBER ID: cosmos13f2hqukegfz75dqzpdc36uy7aq9elhal72emmc
GROUP ID: 710
GROUP MEMBER ID: cosmos1tljq9xse7wvfuuajee80ukch7lu3x8rc8fs7kw
GROUP MEMBER ID: cosmos13f2hqukegfz75dqzpdc36uy7aq9elhal72emmc
GROUP ID: 711
<HANGS HERE>

This tell smells like an ORM bug maybe?

amaury1093 · 2022-08-30T08:34:26Z

I already debugged some group sims in the past, can take a look at this.

alexanderbez · 2022-08-30T18:51:52Z

@AmauryM it's either this call:

			var groupMember group.GroupMember
			_, err = memIt.LoadNext(&groupMember)
			if errors.ErrORMIteratorDone.Is(err) {
				break
			}

Or this call:

		memIt, err := groupMemberByGroupIndex.Get(ctx.KVStore(key), groupInfo.Id)
		if err != nil {
			msg += fmt.Sprintf("error while returning group member iterator for group with ID %d\n%v\n", groupInfo.Id, err)
			return msg, broken
		}
		defer memIt.Close()

amaury1093 · 2022-08-31T10:49:53Z

I confirm via git bisect that this bug appeared in #12852. @alexanderbez It seems that your analysis is correct, it's hanging on the second call. However, it's not hanging because of group/ORM logic, but in the cachekv store.

See my branch with some logging: main..am/13088-group-sims. Using the same seed 11, I get:

ABC Running GroupTotalWeightInvariant on block 45
ABC Running GroupTotalWeightInvariant on block 50
ABC clearUnsortedCacheSubset BEFORE sortedCache.Set (HERE IT HANGS...)

which means it hangs on this line

cosmos-sdk/store/cachekv/store.go

Line 392 in 8704983

store.sortedCache.Set(item.Key, []byte{})

This brings directly to tm-db code. Before I dig into this deeper, @alexanderbez do you have an idea why tm-db might hang here?

alexanderbez · 2022-08-31T18:38:40Z

Interesting. I have zero clue lol. All my PR did was remove a few types and add a new message type.

As for the cache store, I do know there is a mutex and that iterator gets a write lock, which could maybe cause contention? I'm not really sure. Seems like we need to dig into the cache KVStore more on this one.

EDIT: NVM, the deadlock would be coming from tm-db actually

alexanderbez · 2022-08-31T18:39:35Z

Perhaps we add some logging to iterator to see if there could be any thread contention that would cause a deadlock? Or we could even run the simulation with the race detector on?

alexanderbez · 2022-08-31T20:43:47Z

Ok I think I got the output to display the lock contention: https://pastebin.com/TA70tkz7

But this seems to be coming from levelDB? Or at least our usage of it? Not clear...

I also tweaked the local version of tm-db when running with https://github.com/sasha-s/go-deadlock and I got the following output (which seems to be more useful and relevant): https://pastebin.com/ZNDW4ADN

Looks like we got some recursive locking in tm-db :/

alexanderbez · 2022-08-31T20:49:42Z

@AmauryM take a look at that last pastebin...it's not clear to me if it's an issue/bug in tm-db mem's iterator or the ORM's potentially incorrect usage of it.

It looks like to me and others took a look too, that it's in ORM's incorrect usage -- creating an iterator within a loop. Or maybe it's the invariant...two for loops?

alexanderbez added C:Simulations C:x/group labels Aug 30, 2022

amaury1093 self-assigned this Aug 30, 2022

amaury1093 added this to Cosmos-SDK Aug 31, 2022

amaury1093 moved this to 📝 Todo in Cosmos-SDK Aug 31, 2022

amaury1093 added the T:Sprint label Aug 31, 2022

amaury1093 moved this from 📝 Todo to 💪 In Progress in Cosmos-SDK Aug 31, 2022

alexanderbez mentioned this issue Aug 31, 2022

fix: fix Group-TotalWeight invariant #13116

Merged

19 tasks

julienrbrt closed this as completed in #13116 Sep 1, 2022

Repository owner moved this from 💪 In Progress to 👏 Done in Cosmos-SDK Sep 1, 2022

tac0turtle removed this from Cosmos-SDK Sep 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simulation Failure/Timeout #13088

Simulation Failure/Timeout #13088

alexanderbez commented Aug 30, 2022

alexanderbez commented Aug 30, 2022

amaury1093 commented Aug 30, 2022

alexanderbez commented Aug 30, 2022

amaury1093 commented Aug 31, 2022 •

edited

Loading

alexanderbez commented Aug 31, 2022 •

edited

Loading

alexanderbez commented Aug 31, 2022 •

edited

Loading

alexanderbez commented Aug 31, 2022 •

edited

Loading

alexanderbez commented Aug 31, 2022 •

edited

Loading

Simulation Failure/Timeout #13088

Simulation Failure/Timeout #13088

Comments

alexanderbez commented Aug 30, 2022

alexanderbez commented Aug 30, 2022

amaury1093 commented Aug 30, 2022

alexanderbez commented Aug 30, 2022

amaury1093 commented Aug 31, 2022 • edited Loading

alexanderbez commented Aug 31, 2022 • edited Loading

alexanderbez commented Aug 31, 2022 • edited Loading

alexanderbez commented Aug 31, 2022 • edited Loading

alexanderbez commented Aug 31, 2022 • edited Loading

amaury1093 commented Aug 31, 2022 •

edited

Loading

alexanderbez commented Aug 31, 2022 •

edited

Loading

alexanderbez commented Aug 31, 2022 •

edited

Loading

alexanderbez commented Aug 31, 2022 •

edited

Loading

alexanderbez commented Aug 31, 2022 •

edited

Loading