
core: move TxPool reorg and events to background goroutine #19705

Merged 8 commits into ethereum:master from txpool-event-sync on Jun 21, 2019

Conversation

@fjl (Contributor) commented Jun 12, 2019

This change moves internal queue re-shuffling work in TxPool to a background
goroutine, TxPool.runReorg. Requests to execute runReorg are accumulated
by the new scheduleReorgLoop. The new loop also accumulates transaction events.

The motivation for this change is making sends to txFeed synchronous.

Fixes #19192
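As a rough sketch of the coalescing pattern (assumed names and wiring; the real implementation is in core/tx_pool.go), the scheduling loop ensures at most one reorg runs at a time and folds any requests that arrive in the meantime into the next run:

    // Minimal sketch of the request-coalescing loop. One background
    // goroutine owns all reorg work; callers post a request instead of
    // spawning goroutines themselves.
    func scheduleReorgLoop(requests <-chan struct{}, quit <-chan struct{}) {
        var (
            curDone chan struct{} // non-nil while a runReorg is in flight
            pending bool          // a request arrived while one was running
        )
        for {
            if curDone == nil && pending {
                curDone = make(chan struct{})
                go runReorg(curDone) // closes curDone when finished
                pending = false
            }
            select {
            case <-requests:
                pending = true // accumulate; never more than one reorg at a time
            case <-curDone:
                curDone = nil
            case <-quit:
                return
            }
        }
    }

    func runReorg(done chan struct{}) {
        defer close(done)
        // take the pool lock, promote/demote transactions, release the
        // lock, then send the accumulated events on txFeed synchronously
    }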

@fjl (Contributor Author) commented Jun 12, 2019

This isn't quite done yet. I still need to figure out how to remove the go txFeed.Send(...) in TxPool.add.

@fjl (Contributor Author) commented Jun 12, 2019

The 'aggregation' of promoteExecutables calls probably won't work in practice with this change alone because runReorg and add compete for the central lock.

Consider AddRemote: we take the lock, validate the transaction, move the TX into pending or queue and drop the lock. Then we kick off a background reorg, which takes the lock again, moves TXs around, drops the lock and finally sends events. Since the next call to AddRemote will have to wait for the lock held by reorg before adding its TX, concurrency doesn't improve much. Things are slightly better than before this change when it comes to events though. If a downstream consumer of txFeed is blocked for a while, reorg calls will aggregate instead of launching many goroutines to deliver events.

To really improve high load behavior, we'd need a way to quickly verify the TX without the big lock, submit the request to add it and move on. The reorg stuff could then take batches of transactions periodically and integrate them all at once, keeping the central lock free for readers to take most of the time.
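A hypothetical sketch of that direction (addAsync, validateCheap, incoming and integrateLoop are illustrative names, not the pool's real API; assumes the usual imports):

    // Cheap, lock-free admission: validate, hand off, return immediately.
    func (pool *TxPool) addAsync(tx *types.Transaction) error {
        // Cheap validation without pool.mu: signature recovery, size, gas caps.
        if err := pool.validateCheap(tx); err != nil {
            return err
        }
        pool.incoming <- tx // no pool.mu held here
        return nil
    }

    // Background integration: take batches periodically so the central
    // lock stays free for readers most of the time.
    func (pool *TxPool) integrateLoop() {
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()
        var batch []*types.Transaction
        for {
            select {
            case tx := <-pool.incoming:
                batch = append(batch, tx)
            case <-ticker.C:
                if len(batch) == 0 {
                    continue
                }
                pool.mu.Lock() // one lock acquisition per batch, not per tx
                pool.addTxsLocked(batch, false)
                pool.mu.Unlock()
                batch = nil
            }
        }
    }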

@fjl (Contributor Author) commented Jun 12, 2019

An idea: if we split up the lock so there is one for the queue and one for pending we could just always insert TXs into the queue and then always promote them in the background.
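A hypothetical layout for that split (not implemented in this PR; txList and common.Address are the pool's existing types):

    type TxPool struct {
        queueMu sync.Mutex                 // guards queue only
        queue   map[common.Address]*txList // all incoming txs land here first
        pendMu  sync.Mutex                 // guards pending only
        pending map[common.Address]*txList // filled by the background promoter
        // ... remaining fields elided ...
    }

A background promoter would then briefly take both locks to move executable transactions from queue to pending, so writers inserting new transactions never contend with readers of pending.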

@fjl force-pushed the txpool-event-sync branch from 8f9b7da to 31d523b on June 13, 2019 10:45
@fjl marked this pull request as ready for review on June 13, 2019 10:45
@fjl changed the title from "core: move transaction pool reorgs to background goroutine" to "core: move TxPool reorg and events to background goroutine" on Jun 13, 2019
Review thread on core/tx_pool.go (outdated, resolved):

	- return pool.addTxsLocked(txs, local)
	+ done := pool.requestPromoteExecutables(dirtyAddrs)
Contributor:

The previous code only called promoteExecutables if len(dirty) > 0 (in addTxsLocked). Was there a reason to not carry that over?

@fjl (Contributor Author):

The request is definitely needed to trigger sending of events even if there isn't anything to promote. Furthermore, promoteExecutables is pretty smart about avoiding work.

@fjl force-pushed the txpool-event-sync branch 2 times, most recently from 72e57d4 to 096e413 on June 13, 2019 12:55
@fjl (Contributor Author) commented Jun 13, 2019

@holiman PTAL

Review threads on core/tx_pool.go and core/tx_pool_test.go (outdated, resolved).
@fjl force-pushed the txpool-event-sync branch 2 times, most recently from 096e413 to 1f90cc7 on June 14, 2019 10:04
@holiman (Contributor) commented Jun 14, 2019

This PR (specifically instance=Geth/v1.9.0-unstable-87409422-20190614/linux-amd64/go1.12.6) is now running on:

bootnode-aws-us-east-1-001
bootnode-aws-ap-southeast-1-001
bootnode-azure-australiaeast-001
bootnode-azure-westus-001

@holiman (Contributor) commented Jun 16, 2019

Here are the charts after a few days of running. The machines are LES servers with around 490 peers each. The upper chart shows the one running this PR; the lower ones show a fairly recent older version that has neither this PR nor some other changes in the LES area.

[Screenshot: Mainnet Bootnodes Datadog, 2019-06-16]

The blip on Friday evening is when they were restarted with this PR.

The non-updated machines have a lifetime of around 24-36 hours before they crash, with stack traces that point to extreme goroutine overload. So yeah, this PR looks good to me from that perspective!

@holiman (Contributor) commented Jun 17, 2019

Seeing quite a lot of these in the logs:

ERROR[06-17|06:25:47.909] Demoting invalidated transaction         hash=d51870…cd6d3d
ERROR[06-17|06:25:47.946] Demoting invalidated transaction         hash=560c07…7c2506
ERROR[06-17|06:25:47.965] Demoting invalidated transaction         hash=d74ba3…9629c2
ERROR[06-17|06:25:47.991] Demoting invalidated transaction         hash=75b0b8…950046

If it is truly an ERROR, we need to look into it. If it's expected, it shouldn't be at ERROR level.

@holiman (Contributor) commented Jun 17, 2019

The error messages come from demoteUnexecutables, which is called from reset.

	// Inject any transactions discarded due to reorgs
	log.Debug("Reinjecting stale transactions", "count", len(reinject))
	senderCacher.recover(pool.signer, reinject)
	pool.addTxsLocked(reinject, false)

	// validate the pool of pending transactions, this will remove
	// any transactions that have been included in the block or
	// have been invalidated because of another transaction (e.g.
	// higher gas price)
	pool.demoteUnexecutables()

The reset tries to take transactions that were in a reorged-out block and did not make it into the new block, and shove them back into pending. However, addTxsLocked in this PR is async, so they don't immediately wind up in pending.

Afterwards, it tries to demote unexecutables and finds that the subsequent transactions are still there, but is surprised that the reinjected ones are not. The effect is that the subsequent transactions get shoved out of pending and into queued.
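A sketch of the ordering that restores the invariant (resetInternal is an illustrative helper name, not the PR's final code): reinjection must complete before demotion runs, under the same lock, so demoteUnexecutables can see the reinjected transactions.

    func (pool *TxPool) runReorg(oldHead, newHead *types.Header) {
        pool.mu.Lock()
        defer pool.mu.Unlock()

        reinject := pool.resetInternal(oldHead, newHead) // collect txs dropped by the reorg
        pool.addTxsLocked(reinject, false)               // synchronous: txs are in the pool now
        pool.demoteUnexecutables()                       // no longer surprised by missing txs
    }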

@fjl force-pushed the txpool-event-sync branch from 740d18c to ee68955 on June 17, 2019 09:39
@holiman (Contributor) commented Jun 17, 2019

[Screenshot: Mainnet Bootnodes Datadog, 2019-06-17]

This PR against master. The four machines running this PR are all around 2.5K goroutines, glued to the bottom of the chart. The ones running master are going haywire.

fjl added 4 commits June 20, 2019 10:57
This change moves internal queue re-shuffling work in TxPool to a
background goroutine, TxPool.runReorg. Requests to execute runReorg are
accumulated by the new scheduleReorgLoop. The new loop also accumulates
transaction events.

The motivation for this change is making sends to txFeed synchronous
instead of sending them in one-off goroutines launched by 'add' and
'promoteExecutables'. If a downstream consumer of txFeed is blocked for
a while, reorg requests and events will queue up.
This change removes tracking of the homestead block number from TxPool.
The homestead field was used to enforce a minimum gas of 53000 for
contract creations after the homestead fork, but not before it. Since
nobody would want to configure a non-homestead chain nowadays and contract
creations usually take more than 53000 gas, the extra correctness is
redundant and can be removed.
This is useless now because there is no separate code path for
individual transactions anymore.
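For context on the homestead commit above: homestead raised the minimum (intrinsic) gas of a contract-creating transaction from 21000 to 53000, which is the rule the removed field used to gate. A minimal sketch of that floor, assuming the go-ethereum params constants:

    // Simplified intrinsic-gas floor (sketch; the real rule in core also
    // charges per byte of calldata).
    func minimumGas(isContractCreation, homestead bool) uint64 {
        if isContractCreation && homestead {
            return params.TxGasContractCreation // 53000 since homestead
        }
        return params.TxGas // 21000
    }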
@karalabe force-pushed the txpool-event-sync branch from e4bf834 to 0dbe8bd on June 20, 2019 07:58
@holiman (Contributor) commented Jun 20, 2019

Future bootnodes are now updated to run this PR (0dbe8bd) against master.

@karalabe added this to the 1.9.0 milestone on Jun 21, 2019
@karalabe merged commit 60c062e into ethereum:master on Jun 21, 2019
@gzliudan mentioned this pull request on May 10, 2024, and added a commit to gzliudan/XDPoSChain that referenced it.
Successfully merging this pull request may close these issues: goroutine memory leak (#19192).