Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AtxBuilder should validate initial PoST loaded from DB #5921

Closed
poszu opened this issue May 9, 2024 · 4 comments
Closed

AtxBuilder should validate initial PoST loaded from DB #5921

poszu opened this issue May 9, 2024 · 4 comments
Assignees

Comments

@poszu
Copy link
Contributor

poszu commented May 9, 2024

Description

When ATX Builder starts, it first determines whether it needs to build an initial PoST proof. It doesn't need to when the smesher already has previous ATXs or the proof is found in the local database. However, in the latter case (PoST in local DB), it could be that the proof is not valid anymore for the current configuration. It could happen if the user increases Space Units.

Steps to reproduce

  1. initialize X SU
  2. wait for the initial PoST to be generated
  3. increase to X+N SU
  4. the initial PoST proof in the DB is no longer valid

Actual Behavior

Atx Builder trusts the initial PoST found in the local DB. It later repeatedly fails to publish an ATX because of invalid initial PoST proof. It cannot break out of the cycle on its own, the only way to get out is to manually remove the local.sql DB.

Expected Behavior

Atx Builder should verify the initial PoST found in the local DB and recreate it if it's not valid anymore.

Affected code

func (b *Builder) buildInitialPost(ctx context.Context, nodeID types.NodeID) error {
// Generate the initial POST if we don't have an ATX...
if _, err := atxs.GetLastIDByNodeID(b.db, nodeID); err == nil {
return nil
}
// ...and if we haven't stored an initial post yet.
_, err := nipost.GetPost(b.localDB, nodeID)
switch {
case err == nil:
b.logger.Info("load initial post from db")
return nil
case errors.Is(err, sql.ErrNotFound):
b.logger.Info("creating initial post")
default:
return fmt.Errorf("get initial post: %w", err)
}

@poszu poszu added the bug label May 9, 2024
@poszu poszu moved this to 📋 Backlog in Dev team kanban May 9, 2024
@poszu poszu added the area/atx label May 9, 2024
@acud acud self-assigned this Jun 3, 2024
@acud
Copy link
Contributor

acud commented Jun 5, 2024

@poszu there are several issues here:

  1. the node was started with different configurations, and now needs to reinitialize the storage and redo the post
  2. the user changed the data over the disk by using the initialization tool and now the data doesn't match on the disk and the proof can never be made either (the initial post needs to be made again). For this case I'm not sure we can do anything, because it involves verifying that the labels are correct (regenerate them) for the nodes involved in the proof.

My approach to handling #1 was to try to use the prover to run the existing proof that we find in the DB with the parameters of smeshing-opts-numunits we get in the CLI. That should at least cover the case where there's a mismatch in the storage units. Just the only things is that I just can't seem to find in any relevant place in the activation package the usages of this value. It is not clear whether the grpc node on the other end (which handles communicating with the rust library) has this configuration somehow implicitly. It is also not clear why this very specific parameter isn't passed on the actual protobuf message which asks for the initial post to be generated (it only specifies the identity and the challenge).

@acud
Copy link
Contributor

acud commented Jun 6, 2024

Updating after discussing further with @poszu, that this approach may work fine for cases where the spacemesh node is run in standalone mode, but won't in cases where it manages more identities.

In that case, my suggestion would be the following:

  1. that configuration for every identity (what is now managed by the rust process in the postdata_metadata.json file) should be moved to the go node's responsibility
  2. that the rpc method of get_metadata should be inverted - that the rust process should ask the go process for that metadata, not the other way around (my understanding is that the rust process is not expected to be running always, but only when the PoST window comes into view). the rust process would do so when it starts, and keep that in-mem for it's entire lifecycle (as it does now with the data in the file).
  3. once we agree to these changes, we can then add a simple post verification call in the go node's startup sequence that checks whether the proof we have matches up with the number of SUs a certain identity has configured

Since this is infers a bunch of changes to CLI flags in both the rust application and the go application, we need to plan on how to phase this and allow users time to migrate.
The other issue is that it may affect the initialization process as it is generated there.
I see that the node ID is included in the file, however, there's absolutely no assurance that the location of the metadata file is actually accessible from the go node. Therefore, it may have implications for node operators.

EDIT: the other alternative is to have reliance that the post service is always available from the go node on startup. This will save on all these changes (but might complicate the startup sequence a bit)

@fasmat
Copy link
Member

fasmat commented Jun 7, 2024

I went through the code and think these are the steps required to fix the core of the issue:

  1. Update the nipostBuilder interface:

    type nipostBuilder interface {
       BuildNIPost(
     	  ctx context.Context,
     	  sig *signing.EdSigner,
     	  challenge *types.NIPostChallenge,
     	  challengeHash types.Hash32,
       ) (*nipost.NIPostState, error)
       Proof(ctx context.Context, nodeID types.NodeID, challenge []byte, initialPost *types.Post) (*types.Post, *types.PostInfo, error)
       ResetState(types.NodeID) error
    }

    We need to pass the full challenge to the builder instead of just its hash. It contains the publish epoch and the (optional) initial post that needs to be checked if present.

    The Proof method also needs to be adjusted to receive the (optional) initial PoST that needs to be verified to still be valid when generating a proof.

  2. In the nipostBuilder in the Proof method the initial proof (if not nil) needs to be verified with the post info from the connected post-service before the new proof is generated:

     	retries = 0
     	if initialPost != nil {
     		info, err := client.Info(ctx)
     		if err != nil {
     			events.EmitPostFailure(nodeID)
     			return nil, nil, fmt.Errorf("failed to get post info: %w", err)
     		}
     		// verify that the initial post is valid with the given `info` parameters
     		// if not return an `ErrInvalidInitialPost`
     	}
     	post, postInfo, err := client.Proof(ctx, challenge)
  3. BuildNipost in nipost.go needs to be updated to pass the initial post from the challenge to the Proof method - it is not nil when this is the first ATX and is nil otherwise.

  4. Other callers of Proof just pass nil for the initial post

  5. createAtx needs to be updated to pass the full challange and its hash instead of just its hash and the publish epoch

  6. The run method in activation.go needs to check for the new error returned by nipostbuilder.Proof and do the same as in the case of ErrATXChallengeExpired plus building a new initial post by removing it from the local database and calling buildInitialPost again.

  7. Update existing tests to the new interfaces, add new (unit) tests for the updated behavior.

spacemesh-bors bot pushed a commit that referenced this issue Jun 13, 2024
## Motivation
In #5921, an edge case is described where a node may start with a certain amount of SUs, generate an initial PoST but for possible reasons, the number of SUs may change before the first ATX is broadcasted. This means that the initial PoST is no longer valid and needs to be regenerated (costing the miner another epoch before being able to submit the initial ATX).
spacemesh-bors bot pushed a commit that referenced this issue Jun 13, 2024
## Motivation
In #5921, an edge case is described where a node may start with a certain amount of SUs, generate an initial PoST but for possible reasons, the number of SUs may change before the first ATX is broadcasted. This means that the initial PoST is no longer valid and needs to be regenerated (costing the miner another epoch before being able to submit the initial ATX).
@acud
Copy link
Contributor

acud commented Jun 20, 2024

fixed with #6031

@acud acud closed this as completed Jun 20, 2024
@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Dev team kanban Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants