
Selecting positioning ATX might finish after the poet round starts #6041

Closed
Tracked by #328
poszu opened this issue Jun 13, 2024 · 1 comment


poszu commented Jun 13, 2024

Description

The selection of a positioning ATX involves two steps:

  1. finding a high-tick ATX
  2. verifying PoSTs in the chain of a candidate
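A minimal sketch of the two-step selection, assuming candidates are tried in descending tick-height order and the golden ATX is returned when none of them verifies. `ATXID`, `Candidate`, `selectPositioning` and `errInvalid` are hypothetical stand-ins, not go-spacemesh types, and the `verifyChain` callback stands in for step 2.

```go
package main

import (
	"errors"
	"sort"
)

// ATXID is a hypothetical stand-in for types.ATXID.
type ATXID string

// Candidate is a hypothetical positioning ATX candidate with its tick height.
type Candidate struct {
	ID         ATXID
	TickHeight uint64
}

// errInvalid is what a failing chain verification might return.
var errInvalid = errors.New("invalid PoST chain")

// selectPositioning sketches the two steps:
//  1. order candidates by tick height, highest first;
//  2. return the first candidate whose PoST chain verifies.
// If no candidate verifies, it returns the golden ATX.
func selectPositioning(cands []Candidate, golden ATXID, verifyChain func(ATXID) error) ATXID {
	// Copy so the caller's slice is not reordered.
	sorted := append([]Candidate(nil), cands...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].TickHeight > sorted[j].TickHeight
	})
	for _, c := range sorted {
		if err := verifyChain(c.ID); err == nil {
			return c.ID
		}
	}
	return golden
}
```

Step 2 is the expensive part: every rejected candidate costs a full chain verification before the next one is tried.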

The first step was significantly sped up in #5952. The second step can also take a significant amount of time, for two reasons:

  1. It fetches the dependencies (positioning and previous ATXs) from the DB all the way back to the genesis (or to a "valid" ATX, if one was fully validated before), and the DB is usually heavily loaded at that time ("the ATX storm").
  2. It verifies the PoST of every dependency, and the PoST verification workers are also heavily loaded at that time.

Actual Behavior

Verification takes a very long time and, in some cases, finishes after the cycle gap (CG) has ended and the poet round has started. The miner then fails to submit its poet challenge in time.

Expected Behavior

The selection should be guaranteed to always finish on time. As a fallback, it is always valid to pick the golden ATX or the positioning ATX selected in the previous epoch (which can be read from the miner's previous ATX).

Additional Resources

Related code:

```go
func (b *Builder) searchPositioningAtx(
	ctx context.Context,
	nodeID types.NodeID,
	publish types.EpochID,
) (types.ATXID, error) {
	logger := b.logger.With(log.ZShortStringer("smesherID", nodeID), zap.Uint32("publish epoch", publish.Uint32()))

	b.posAtxFinder.finding.Lock()
	defer b.posAtxFinder.finding.Unlock()
	if found := b.posAtxFinder.found; found != nil && found.forPublish == publish {
		logger.Debug("using cached positioning atx", log.ZShortStringer("atx_id", found.id))
		return found.id, nil
	}

	latestPublished, err := atxs.LatestEpoch(b.db)
	if err != nil {
		return types.EmptyATXID, fmt.Errorf("get latest epoch: %w", err)
	}
	logger.Info("searching for positioning atx", zap.Uint32("latest_epoch", latestPublished.Uint32()))

	// positioning ATX publish epoch must be lower than the publish epoch of built ATX
	positioningAtxPublished := min(latestPublished, publish-1)
	id, err := findFullyValidHighTickAtx(
		ctx,
		b.atxsdata,
		positioningAtxPublished,
		b.conf.GoldenATXID,
		b.validator,
		logger,
		VerifyChainOpts.AssumeValidBefore(time.Now().Add(-b.postValidityDelay)),
		VerifyChainOpts.WithTrustedID(nodeID),
		VerifyChainOpts.WithLogger(b.logger),
	)
	if err != nil {
		logger.Info("search failed - using golden atx as positioning atx", zap.Error(err))
		id = b.conf.GoldenATXID
	}
	b.posAtxFinder.found = &struct {
		id         types.ATXID
		forPublish types.EpochID
	}{id, publish}
	return id, nil
}
```

Implementation hints

searchPositioningAtx() should pass a context with a deadline that leaves enough time to register in the poet, for example, 5 minutes on mainnet, possibly configurable so that users can tweak it. The deadline should be shorter for tests, systests, and testnets. If the deadline is exceeded, it should fall back to the miner's previous ATX, or to the golden ATX if this is the miner's initial ATX.

Additionally, the PoST verification workers could prioritize verifying the PoSTs of the candidate's dependencies. This could be achieved by changing the Post() and PostV2() methods of Validator to allow prioritized verification. The existing PoST verifier already supports prioritization, but only by node ID; it would need to be extended to prioritize an individual call. See:

```go
func (v *offloadingPostVerifier) Verify(
	ctx context.Context,
	p *shared.Proof,
	m *shared.ProofMetadata,
	opts ...verifying.OptionFunc,
) error {
	job := &verifyPostJob{
		ctx:      ctx,
		proof:    p,
		metadata: m,
		opts:     opts,
		result:   make(chan error, 1),
	}
	metrics.PostVerificationQueue.Inc()
	defer metrics.PostVerificationQueue.Dec()

	var jobChannel chan<- *verifyPostJob
	_, prioritize := v.prioritizedIds[types.BytesToNodeID(m.NodeId)]
	switch {
	case prioritize:
		v.log.Debug("prioritizing post verification", zap.Stringer("proof_node_id", types.BytesToNodeID(m.NodeId)))
		jobChannel = v.prioritized
	default:
		jobChannel = v.jobs
	}
	// ...
}
```


poszu commented Jul 3, 2024

Fixed in #6053

poszu closed this as completed Jul 3, 2024