sync2: ATX integration #6448
base: develop
Conversation
This adds the possibility to take a connection from the pool, use it via the Executor interface, and return it later when it's no longer needed. This avoids connection pool overhead in cases where a lot of queries need to be made but read transactions are not needed. Using read transactions instead of plain connections has the side effect of blocking WAL checkpoints.
Using a single connection for the many SQL queries executed during sync avoids noticeable overhead due to SQLite connection pool delays. This change also fixes memory overuse in DBSet: when initializing a DBSet from a database table, there's no need to use an FPTree with a big preallocated pool for the new entries that are added during recent sync.
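To make the pattern concrete, here is a minimal sketch using the standard library's database/sql rather than the project's own sql.Executor API; the table/column names and helper names are assumptions for illustration only, but the shape is the same: borrow one connection, run the whole burst of queries on it, then hand it back.

package dbsketch

import (
	"context"
	"database/sql"
	"fmt"
)

// withConn borrows a single connection from the pool, runs fn on it, and
// returns it to the pool afterwards. A burst of queries then pays the pool
// overhead once instead of once per query, and no read transaction is held
// open (which, with SQLite in WAL mode, would block checkpoints).
func withConn(ctx context.Context, db *sql.DB, fn func(*sql.Conn) error) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return fmt.Errorf("acquire connection: %w", err)
	}
	defer conn.Close() // hands the connection back to the pool
	return fn(conn)
}

// countAtxsPerEpoch runs one query per epoch on the same borrowed connection.
// Table and column names are placeholders for this sketch.
func countAtxsPerEpoch(ctx context.Context, db *sql.DB, epochs []uint32) (map[uint32]int, error) {
	counts := make(map[uint32]int, len(epochs))
	err := withConn(ctx, db, func(conn *sql.Conn) error {
		for _, epoch := range epochs {
			var n int
			if err := conn.QueryRowContext(ctx,
				"SELECT COUNT(*) FROM atxs WHERE epoch = ?", epoch).Scan(&n); err != nil {
				return fmt.Errorf("count ATXs for epoch %d: %w", epoch, err)
			}
			counts[epoch] = n
		}
		return nil
	})
	return counts, err
}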
This adds set reconciliation for ATXs. There are per-epoch syncers, with a lower FPTree depth (16 by default) used for older epochs and a greater FPTree depth (21 by default) used for the current epoch. Both active syncv2 and passive (server-only) syncv2 are disabled by default. It is possible to enable syncv2 in server-only or full (active) mode.
Split sync could become blocked by slow peers. Their subranges are reassigned to other peers, and there were bugs causing indefinite blocking and panics in these cases. Moreover, after other peers have synced the slow peers' subranges ahead of them, syncing against the slow peers is interrupted, as it's no longer needed. In multipeer sync, when every peer has failed to sync, e.g. due to a temporary connection interruption, we don't wait for the full sync interval but use a shorter wait time between retries.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff            @@
##           develop   #6448   +/-   ##
=========================================
- Coverage     79.9%   79.8%   -0.1%
=========================================
  Files          353     354      +1
  Lines        46602   46885    +283
=========================================
+ Hits         37245   37429    +184
- Misses        7247    7330     +83
- Partials      2110    2126     +16
It turned out that sync interactions happen rather quickly, and thus it is not really practical to try to begin handling arriving keys (e.g. ATX IDs) during sync itself. Moreover, on-the-fly key handling was actually only used to register peer IDs for each ATX ID (to fetch the actual blob from), and that can be done in the handler's Commit() method just as well.
Reviewing this PR took me quite some time because it builds on code that I'm not yet very familiar with. There are quite a few things that I noticed when going through the code:
We use callbacks in places where they make the code more complex (and less efficient) than it needs to be. Why are errors propagated using callbacks?
I'm also unsure about the DB abstraction that syncv2 builds on. Why do we have to implement something like this ourselves? Are there no existing solutions that we can use? Is it even needed? It's another layer of abstraction for mostly simple queries as far as I can tell, and it overlaps with the existing sql/builder package, which exists to solve mostly the same challenges and was newly implemented only a few months ago.
I'm becoming unsure about the extensive use of iterators in syncv2: they are inconsistent; some are actually rings, others are just slices, which leads to them behaving differently, and the only way to know is to dig into the implementation. Some uses need to know the number of elements in the iterator before working through it. Wouldn't a slice do just as well at that point? I thought we needed iterators because a) we might not know in advance how many elements will be returned and b) collecting elements eagerly would be much less efficient than using an iterator.
Especially regarding the inefficiency of using slices over iterators: I now believe that just behaving similarly to io.Reader (but for hashes instead of bytes) would remove a lot of the complexity and the need for rangesync.Seq. Instead we could replace it with a rangesync.Reader that looks like this:
type Reader interface {
Read([]KeyBytes) (int, error) // what is the need for `KeyBytes` btw? isn't that always a types.Hash32?
}
The caller can still iterate over the elements (by reading in a loop), error propagation is more straightforward, and rangesync.SeqResult is then probably not needed at all? The underlying implementation can be a mock, a DB, some in-memory structure, even a network connection - just like for the o.g. io.Reader 🙂 wdyt?
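As a sketch of what the calling side could look like (assuming the usual io.Reader convention of returning io.EOF at the end of the stream; the batch size and the drain helper are made up for illustration):

// drain consumes all keys from r in fixed-size batches, mirroring the
// standard io.Reader loop. Uses the errors, fmt, and io packages.
func drain(r Reader, handle func(KeyBytes) error) error {
	buf := make([]KeyBytes, 128)
	for {
		n, err := r.Read(buf)
		for _, k := range buf[:n] {
			if handleErr := handle(k); handleErr != nil {
				return handleErr
			}
		}
		switch {
		case errors.Is(err, io.EOF):
			return nil
		case err != nil:
			return fmt.Errorf("reading keys: %w", err)
		}
	}
}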
syncer/syncer.go
Outdated
func (s *Syncer) ensureMalfeasanceInSync(ctx context.Context) error {
	// TODO: use syncv2 for malfeasance proofs:
	// https://github.com/spacemeshos/go-spacemesh/issues/3987
I do not understand how this issue relates to Malfeasance sync?
See this comment: #3987 (comment)
I still don't understand how these two are related. The issue talks about sync issues in general and only mentions malfeasance proofs in passing. Your comment lists options for different sync methods.
- When and how is this issue considered resolved?
- How is this method in particular related to the issue?
Removed the link to the issue to avoid confusion. It does relate to state sync somewhat, but is indeed more general.
err = fmt.Errorf("acquiring slot to get hash: %w", err)
for _, h := range hashes[i:] {
	options.callback(h, err)
}
return err
That seems very inefficient: possibly millions of callbacks are invoked to communicate the same error, while the caller probably doesn't even care about any error besides the first one?
It's rather likely to be a small subset of ATX IDs, e.g. ones not downloaded due to hs/1 request throttling, and these requests will be retried later.
The point still stands: getHashes might be called with 5 million hashes at once; if the limit cannot be acquired because throttling is active, then millions of callbacks are called.
As far as I can see this only affects this section of the code:
Lines 110 to 129 in ad4924b
err := h.f.GetAtxs(ctx, cs.items, system.WithATXCallback(func(id types.ATXID, err error) {
	mtx.Lock()
	defer mtx.Unlock()
	switch {
	case err == nil:
		cs.numDownloaded++
		someSucceeded = true
		delete(cs.state, id)
	case errors.Is(err, pubsub.ErrValidationReject):
		h.logger.Debug("failed to download ATX",
			zap.String("atx", id.ShortString()), zap.Error(err))
		delete(cs.state, id)
	case cs.state[id] >= h.maxAttempts-1:
		h.logger.Debug("failed to download ATX: max attempts reached",
			zap.String("atx", id.ShortString()))
		delete(cs.state, id)
	default:
		cs.state[id]++
	}
}))
This will print a debug log with the exact same error for every ATX and increment every element in cs.state. This could be handled much more simply (and arguably more efficiently) without having to keep track of the retries of every single hash.
Arguably this is out of the scope of this PR, but it should be addressed. It makes no sense to register a hash for a peer, then request that hash in a batch, and let the fetcher again reconstruct which peer to fetch that hash from. Error handling is also bad, because for every fetched hash an error has to be returned via callback or aggregated in a &fetcher.BatchError{}. Instead, imo it would be much simpler to just have a (blocking) getHash method that fetches a hash from a given peer and returns an error if something went wrong. Then the caller can easily parallelize requests and match errors to peers & hashes.
Internally the fetcher can still group requests into single batches, request those from individual peers, and deserialize the batched result. This also makes it easier to figure out what went wrong if something did go wrong. Right now we have a lot of log errors of the kind "validation ran for unknown hash" and "hash missing from ongoing requests" because, with how it is structured at the moment, the fetcher fails to match requests of hashes to peers and/or callers.
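A rough sketch of that shape (hashFetcher, getHash and fetchFromPeers are hypothetical names, not the existing API; errgroup is golang.org/x/sync/errgroup), showing how the caller could parallelize per-peer requests and keep errors attached to the peer and hash they belong to:

// hashFetcher is a hypothetical narrow interface with the suggested blocking
// per-peer method; the real fetcher could still batch these requests
// internally per peer.
type hashFetcher interface {
	getHash(ctx context.Context, peer p2p.Peer, h types.Hash32) error
}

// fetchFromPeers requests each peer's hashes concurrently and returns errors
// keyed by hash, so nothing needs to be reconstructed afterwards.
func fetchFromPeers(ctx context.Context, f hashFetcher, byPeer map[p2p.Peer][]types.Hash32) map[types.Hash32]error {
	var (
		mtx  sync.Mutex
		errs = make(map[types.Hash32]error)
		eg   errgroup.Group
	)
	for peer, hashes := range byPeer {
		peer, hashes := peer, hashes // capture loop vars (pre-Go 1.22)
		eg.Go(func() error {
			for _, h := range hashes {
				if err := f.getHash(ctx, peer, h); err != nil {
					mtx.Lock()
					errs[h] = fmt.Errorf("peer %s: %w", peer, err)
					mtx.Unlock()
				}
			}
			return nil
		})
	}
	_ = eg.Wait() // goroutines only return nil; per-hash errors are collected above
	return errs
}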
We should do a serious fetcher cleanup after we switch to syncv2, at least for ATXs and malfeasance proofs. Right now it's probably a bit too early, as we'd have to update the v1 syncers that are on their way out.
Simplifying the blob fetch logic that currently uses promises etc. should be one of the goals, another being the removal of the non-streaming fetcher client and server code.
for _, h := range hashes[i:] {
	options.callback(h, err)
}
see above
As above
Thanks a lot for the review!
You mean the fetcher interaction (
Passing just
The idea was that, given that we planned to introduce the rqlite/sql dependency in any case (it is to be used for migration code cleanup and clean schema comparison), we could use its SQLite AST representation for SQL query generation as well.
The tests contain the generated SQL so it can also be seen clearly. Hand-written SQL is often error-prone, and thus it probably doesn't hurt to state the intent twice in different ways :) Other than that,
Re the need for iterators themselves, there are places in the code where the pattern of invoking
The need for cyclic iterators arises from the currently used range representation: https://github.com/spacemeshos/go-spacemesh/blob/develop/dev-docs/sync2-set-reconciliation.md#range-representation
Returning temporary slices instead of iterators, which can sometimes turn out to contain millions of items, may cause substantial GC pressure and consequently unnecessarily high CPU usage, which is rather pronounced with the old sync implementation.
I'm afraid that will make e.g. FPTree traversal quite a bit more complicated, as well as the aforementioned iterator combining code. It can of course wrap an iter.Seq under the hood, but then I'm unsure whether it's worth the effort.
Made the sequences non-cyclic in most cases, except for the case of
I added more comments to the current PR, sorry for taking a while to finish reviewing it but it is a rather large change...
Regarding your comment:
Returning temporary slices instead of iterators, which can sometimes turn out to contain millions of items, may cause substantial GC pressure and consequently unnecessarily high CPU usage, which is rather pronounced with the old sync implementation.
I do understand this, but using iterators doesn't necessarily solve the problem when what is iterated is a slice (which it is in many cases). Slices - as long as they are not copied - are just pointers that are returned or passed as arguments. Wrapping them into an iterator doesn't necessarily mean there are fewer memory allocations or less pressure on the GC.
I do understand that especially with a DB backend it can be much more efficient to use iterators than slices. That's why I suggested an "io.Reader-inspired" alternative, where not all data has to be read into memory at once, but only as much as is really needed - similar to how the iterators behave in that case.
I'm afraid that will make e.g. FPTree traversal quite a bit more complicated, as well as the aforementioned iterator combining code. It can of course wrap an iter.Seq under the hood, but then I'm unsure whether it's worth the effort.
Yes, it would still be possible to use iter.Seq where it would simplify code, by just wrapping the interface in an iterator.
With the removal of passing along the length of the iterator and the removal of (most) cyclic cases it is probably OK now, but I still feel like in many cases it neither improves readability nor performance but just adds a layer of abstraction 🤷
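For the places where range-over-func genuinely reads better, the adapter would be small. A sketch (the keys helper, the io.EOF end-of-stream convention and the batch size are assumptions; iter.Seq2 is the Go 1.23 iterator type):

// keys adapts the proposed Reader to a Go iterator so call sites that prefer
// range-over-func can keep using it.
func keys(r Reader) iter.Seq2[KeyBytes, error] {
	return func(yield func(KeyBytes, error) bool) {
		buf := make([]KeyBytes, 128)
		for {
			n, err := r.Read(buf)
			for _, k := range buf[:n] {
				if !yield(k, nil) {
					return
				}
			}
			if err != nil {
				if !errors.Is(err, io.EOF) {
					var zero KeyBytes
					yield(zero, err) // surface the read error to the caller
				}
				return
			}
		}
	}
}

Call sites would then read: for k, err := range keys(r) { ... }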
sync2/atxs.go
Outdated
someSucceeded, err := h.getAtxs(ctx, cs)
switch {
case err == nil:
case errors.Is(err, context.Canceled):
	return err
case !errors.Is(err, &fetch.BatchError{}):
	h.logger.Debug("failed to download ATXs", zap.Error(err))
}
Sorry, I had this wrong in my previous suggestion. errors.Is is used to check for value equality, not type equality. We probably want to do this here instead:
someSucceeded, err := h.getAtxs(ctx, cs)
batchErr := &fetch.BatchError{}
switch {
case err == nil:
case errors.Is(err, context.Canceled):
	return err
case !errors.As(err, batchErr):
	h.logger.Debug("failed to download ATXs", zap.Error(batchErr))
}
The reason being that errors.As will check (and assign) err to batchErr if the type allows it. errors.Is does an equality check, which by default will always fail unless err == &fetch.BatchError{}. If the type implements interface{ Is(error) bool }, then errors.Is will use the Is method instead. fetch.BatchError has such a method that correctly checks whether errors.Is would be true for any of the contained errors, so that a caller can treat a batch error as if it were an error of a single fetched item.
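A tiny self-contained illustration of the difference (using a stand-in error type, not the real fetch.BatchError), including the pointer-receiver detail that comes up below:

package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// batchError stands in for fetch.BatchError; Error has a pointer receiver,
// so only *batchError implements the error interface.
type batchError struct{ errs []error }

func (b *batchError) Error() string { return fmt.Sprintf("batch: %d item errors", len(b.errs)) }

func classify(err error) string {
	// Because of the pointer receiver, the errors.As target must be a
	// pointer to the *pointer* type (&batchErr, where batchErr is *batchError).
	var batchErr *batchError
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, context.Canceled):
		return "canceled" // value comparison along the wrap chain
	case errors.As(err, &batchErr):
		return fmt.Sprintf("batch failure with %d item errors", len(batchErr.errs))
	default:
		return "other"
	}
}

func main() {
	wrapped := fmt.Errorf("sync: %w", &batchError{errs: []error{io.EOF, io.ErrUnexpectedEOF}})
	fmt.Println(classify(wrapped))                               // batch failure with 2 item errors
	fmt.Println(classify(fmt.Errorf("x: %w", context.Canceled))) // canceled
	fmt.Println(classify(errors.New("boom")))                    // other
}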
Fixed, but it had to be case !errors.As(err, &batchErr): b/c BatchError methods have a pointer receiver.
sync2/interface.go
Outdated
type Fetcher interface {
	system.AtxFetcher
	Host() host.Host
	Peers() *peers.Peers
	RegisterPeerHash(peer p2p.Peer, hash types.Hash32)
}
I think you can just inline system.AtxFetcher:

type Fetcher interface {
	GetAtxs(context.Context, []types.ATXID, ...GetAtxOpt) error
	Host() host.Host
	Peers() *peers.Peers
	RegisterPeerHash(peer p2p.Peer, hash types.Hash32)
}
Interfaces should be defined on the user side, not in a system package.
With my other suggestions this can be further simplified to:
type Fetcher interface {
GetAtxs(context.Context, []types.ATXID, ...system.GetAtxOpt) error
RegisterPeerHashes(peer p2p.Peer, hash []types.Hash32)
}
Fixed by merging your branch
sync2/interface.go
Outdated
	system.AtxFetcher
	Host() host.Host
	Peers() *peers.Peers
	RegisterPeerHash(peer p2p.Peer, hash types.Hash32)
Also, this method exists twice on the fetcher implementation:
RegisterPeerHashes(peer p2p.Peer, hashes []types.Hash32)
RegisterPeerHash(peer p2p.Peer, hash types.Hash32)
Would it make sense to drop the second and just use the first instead?
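If the single-hash variant needs to stay around for other callers, it could be reduced to a thin wrapper on the existing fetch.Fetch type (a sketch of the suggested deduplication):

// RegisterPeerHash delegates to the slice-based variant so that only one
// code path maintains the hash-to-peer mapping.
func (f *Fetch) RegisterPeerHash(peer p2p.Peer, hash types.Hash32) {
	f.RegisterPeerHashes(peer, []types.Hash32{hash})
}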
Fixed by merging your branch
@@ -363,6 +363,7 @@ func (ft *FPTree) traverseFrom(
}

// All returns all the items currently in the tree (including those in the IDStore).
// The sequence in SeqResult is either empty or infinite.
My linter complains that there is a meaningless assertion in sync2/fptree/nodepool_test.go:
require.Nil(t, nil, idx3)
A leftover from some older code, removed
syncer/interface.go
Outdated
@@ -47,7 +48,7 @@ type fetcher interface {
	GetLayerOpinions(context.Context, p2p.Peer, types.LayerID) ([]byte, error)
	GetCert(context.Context, types.LayerID, types.BlockID, []p2p.Peer) (*types.Certificate, error)

-	system.AtxFetcher
+	sync2.Fetcher
This interface can be simplified to:
// fetcher is the interface to the low-level fetching.
type fetcher interface {
GetLayerData(context.Context, p2p.Peer, types.LayerID) ([]byte, error)
GetLayerOpinions(context.Context, p2p.Peer, types.LayerID) ([]byte, error)
GetCert(context.Context, types.LayerID, types.BlockID, []p2p.Peer) (*types.Certificate, error)
sync2.Fetcher
GetMalfeasanceProofs(context.Context, []types.NodeID) error
GetBallots(context.Context, []types.BallotID) error
GetBlocks(context.Context, []types.BlockID) error
RegisterPeerHashes(peer p2p.Peer, hashes []types.Hash32)
SelectBestShuffled(int) []p2p.Peer
PeerEpochInfo(context.Context, p2p.Peer, types.EpochID) (*fetch.EpochData, error)
PeerMeshHashes(context.Context, p2p.Peer, *fetch.MeshHashRequest) (*fetch.MeshHashes, error)
}
With my other suggestions this can be further simplified to:
type fetcher interface {
GetLayerData(context.Context, p2p.Peer, types.LayerID) ([]byte, error)
GetLayerOpinions(context.Context, p2p.Peer, types.LayerID) ([]byte, error)
GetCert(context.Context, types.LayerID, types.BlockID, []p2p.Peer) (*types.Certificate, error)
GetAtxs(context.Context, []types.ATXID, ...system.GetAtxOpt) error
GetMalfeasanceProofs(context.Context, []types.NodeID) error
GetBallots(context.Context, []types.BallotID) error
GetBlocks(context.Context, []types.BlockID) error
RegisterPeerHashes(peer p2p.Peer, hashes []types.Hash32)
SelectBestShuffled(int) []p2p.Peer
PeerEpochInfo(context.Context, p2p.Peer, types.EpochID) (*fetch.EpochData, error)
PeerMeshHashes(context.Context, p2p.Peer, *fetch.MeshHashRequest) (*fetch.MeshHashes, error)
}
Fixed by merging your branch
syncer/syncer.go
Outdated
serverOpts := append(
	s.cfg.ReconcSync.ServerConfig.ToOpts(),
	server.WithHardTimeout(s.cfg.ReconcSync.HardTimeout))
s.dispatcher = sync2.NewDispatcher(s.logger, fetcher.(sync2.Fetcher), serverOpts)
This cast shouldn't be necessary:
-	s.dispatcher = sync2.NewDispatcher(s.logger, fetcher.(sync2.Fetcher), serverOpts)
+	s.dispatcher = sync2.NewDispatcher(s.logger, fetcher, serverOpts)
Also, this here is the reason why the sync2.Fetcher interface needs to expose the Host method. I think instead it would be better to pass the host explicitly to NewSyncer and remove the method from the interface and from the fetch.Fetch implementation.
Fixed by merging your branch
sync2/atxs.go
Outdated
curSet := dbset.NewDBSet(db, atxsTable(epoch), 32, cfg.MaxDepth)
handler := NewATXHandler(logger, f, cfg.BatchSize, cfg.MaxAttempts,
	cfg.MaxBatchRetries, cfg.FailedBatchDelay, nil)
return NewP2PHashSync(logger, d, name, curSet, 32, f.Peers(), handler, cfg, enableActiveSync)
This is the only use of the fetcher.Peers() method. Just like with host, the peers.Peers service should be passed explicitly to NewATXSyncer and not indirectly via the fetcher. For this it needs to be initialised in node.go and passed to both the fetcher and here, instead of being constructed in the fetcher.
Fixed by merging your branch
syncer/syncer.go
Outdated
hss := sync2.NewATXSyncSource(
	s.logger, s.dispatcher, cdb.Database.(sql.StateDatabase),
	fetcher.(sync2.Fetcher), s.cfg.ReconcSync.EnableActiveSync)
See my other comments about the sync2.Fetcher and syncer.fetcher interfaces.
Motivation
We need more efficient ATX sync.
Description
This adds set reconciliation for ATXs.
There are per-epoch syncers, with a lower FPTree depth (16 by default) used for older epochs and a greater FPTree depth (21 by default) used for the current epoch.
Both active syncv2 and passive (server-only) syncv2 are disabled by default. It is possible to enable syncv2 in server-only or full (active) mode.
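For illustration, enabling it from code might look roughly like this; ReconcSync.EnableActiveSync is visible in this PR's diff, while DefaultConfig and the Enable field name for server-only mode are assumptions of this sketch:

cfg := syncer.DefaultConfig()
// Server-only (passive) mode: answer set-reconciliation requests from peers
// without actively reconciling our own ATX set. Field name assumed.
cfg.ReconcSync.Enable = true
// Full (active) mode: additionally run the per-epoch syncers against peers.
cfg.ReconcSync.EnableActiveSync = true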
Test Plan
Test on testnet and mainnet nodes
TODO