Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEP-522: Optional Transaction Ordering #522

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

DavidM-D
Copy link
Contributor

@DavidM-D DavidM-D commented Dec 6, 2023

@DavidM-D DavidM-D requested a review from a team as a code owner December 6, 2023 08:02
Copy link

render bot commented Dec 6, 2023

@DavidM-D DavidM-D changed the title Stateless Transactions NEP-522: Stateless Transactions Dec 6, 2023
@DavidM-D DavidM-D added A-NEP A NEAR Enhancement Proposal (NEP). WG-tools Tools Working Group should be accountable. WG-protocol Protocol Standards Work Group should be accountable S-draft/needs-implementation A NEP in the DRAFT stage that needs implementation. S-review/needs-wg-to-assign-sme A NEP that needs working group to assign two SMEs. labels Dec 7, 2023
@walnut-the-cat
Copy link
Contributor

@DavidM-D , is this NEP ready to be reviewed?

Copy link
Contributor

@ewiner ewiner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these problems are solvable with sufficiently smart clients, enough (100s-1,000s) of keys per account and rock solid infrastructure, but we haven't managed that so far and it's probably a lot easier to just simplify how we call the network.

But it's also a thoroughly solved problem in the blockchain world at this point. That doesn't mean we shouldn't make it easier for users, but it raises the bar to do so.

However, IMO this proposal is more compelling for DelegateAction than for Transaction. Nonce management is all about having a single party controlling each nonce, but once you're using a metatransaction, there's already two parties juggling a single nonce (the original user and the relayer).


Far more perniciously, since no client I'm aware of verifies the RPC response using a light client before retrying, RPC nodes can prompt the client to provide them with transactions with different nonces. This allows rogue RPC nodes to perform replay attacks.

We have found that often users are managing a single key using multiple clients or are running multiple servers controlling the same key. As such clients fetch a nonce from an indexer every time they need to make a transaction. Indexers tend to fall behind or fail under load and this causes our wallets, relayers and other services to fail.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They fetch a nonce from an indexer? Why is that? Fetching nonces from RPC (view_access_key) should be less likely to fall behind. That wouldn't prevent the nonce collision issues you're talking about, but should help for when a key is rarely used but the network is under high load.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, @MaximusHaximus, why don't we fetch directly from an RPC node or do we?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DavidM-D Actually, we do use RPC calls to fetch access key details (including the nonce) - here's where it happens inside of near-api-js just FYI: https://github.com/near/near-api-js/blob/master/packages/accounts/src/account.ts#L590-L598

We could have the expiry time measured in seconds since Unix epoch and have it be a u32. This saves some space, but it's [inconsistent with our contract standards](https://github.com/near/NEPs/blob/random-nonces/neps/nep-0393.md) and [2038](https://en.wikipedia.org/wiki/Year_2038_problem) isn't that far away in the great span of things.

Why not do this without the expiry time?
We could but it would mean that there'd still be a dependency on an indexer. The current block height is somewhat predictable, but it's not like you can set your watch to it especially under load.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my last question, why use an indexer for this instead of a status RPC call?


We also add an optional `expires_at` field to the `Transaction` and `DelegateAction` messages. The protocol guarantees that this message will not be run in any block with a block timestamp later than the one contained in `expires_at`.

`expires_at` and `random_nonce` exist as alternatives to `max_block_height` and `nonce` respectively. They will cause this field to be ignored and the mechanisms connected to them to not be used. We put limits on the validity period of messages using `random_nonce` to avoid a large cost of checking their uniqueness.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does max_block_height already exist on Transaction? It's not in the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to and it's mandatory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean for Transaction, not DelegateAction. For Transaction, it appears that expires_at would be the only expiration option, not an alternative to max_block_height.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit late here but in case this is still relevant...

Today, transactions don't have an explicit max_block_height, as @ewiner correctly observed. Instead, they expire at the height of the given block_hash + transaction_validity_period. In mainnet, this is currently at 86400 blocks, which is 1 day if block time is 1s.

In theory, this means by picking a block hash at a specific height, one can define an expiration block height. But at the same time, the block hash is used to pick the canonical hash in case of forks. So I would argue, you really want to use the most recent block_hash you know, to ensure your changes build on top of the blockchain state that is known to you. But then you always get the fixed validity of 86400 blocks.

So yes, in my understanding, the proposal is the first proper way to limit a transaction validity period by the user.

Here is the check in the code:

https://github.com/near/nearcore/blob/7bad46aa5a9fe12b6d2b6e28800c7623365dd92f/chain/chain/src/store.rs#L707-L711


There is no requirement that the `random_nonce` field is unique or random. A client could (inadvisably) decide to always provide a `random_nonce` with a value of 0 and it would work as expected until they tried to send two completely identical transactions.

When `random_nonce` is present the protocol **may** reject transactions with an `expires_at` more than 120 seconds after the most recent block timestamp or a `max_block_height` more than 100 blocks after the most recent block height and the error `ExpiryTooLate` will be thrown.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-delegate transactions, I believe this also implies that if you specify random_nonce, you must also specify either expires_at or max_block_height.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct I will make that clearer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "may reject" is too weak due to the cost to all nodes if a long-lived transaction is accepted. Too long-lived transactions should be a chunk validity condition.


### Solution

Our new solution is simple from a client side. Generate a random nonce, set an `expires_at` 100 seconds in the future and send it to the network. If it fails spuriously retry with the same nonce, if it fails consistently prompt the user to make a decision on whether to resend the transaction.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many systems that use a random nonce recommend using a timestamp, or a timestamp concatenated with a random number, like the first 64 bits of a UUIDv7. Would you recommend or discourage that approach, and why? I think that's worth calling out here either way.

Copy link
Contributor Author

@DavidM-D DavidM-D Jan 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend a random number and it doesn't need to be cryptographically secure.

random_nonce is only used to disambiguate identical short lived transactions with identical expires_at/max_block_height. Therefore the transactions requiring disambiguation will be created at the roughly the same time making the timestamp a poor additional source of entropy to the expiry.

We don't require a cryptographically secure number as we don't require non predictability, the client could reasonably increment a single value per transaction and it would remain secure.


If a client attempts to send an identical transaction with an identical `random_nonce` we preserve the existing behavior for responding to an already sent transaction. Provided it still exists they will receive the previous transactions receipt and no action will be taken on chain.

There is no requirement that the `random_nonce` field is unique or random. A client could (inadvisably) decide to always provide a `random_nonce` with a value of 0 and it would work as expected until they tried to send two completely identical transactions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to make sure this doesn't conflict with plans around transaction priority / gas wars. On many other chains at least, there's an explicit feature to allow the sender to overwrite a pending transaction at a given key+nonce with a new one, either to attempt to proactively 'cancel' a transaction that isn't likely to be included in a block or to increase it's priority so it'll get included faster.

That first use case isn't as important here, because a pending transaction with a given random_nonce isn't preventing other transactions with different random_nonces like it would for the incrementing nonces. But transaction priority features might require that we change this part of the spec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't prevent any of these mechanisms when not using a random_nonce.

Also if we want people to be able to cancel/re-submit outstanding transactions, wouldn't it be better to do it for individual transactions rather than all outstanding transactions using nonce contention? It lead to a nicer API and fewer re-submissions during peak load.


Great care must be taken to ensure that the `CryptoHash` of any message that has been executed must exist anywhere that message may be valid and could be executed. If this is not the case then it's possible to launch replay attacks on transactions.

Since these transactions store the `CryptoHash` in the working trie for the duration of their validity, the validity period of these transactions must always be small enough to prevent slow lookups or excessive space use.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this trie is unbounded. Even though no single entry will last in the trie for more than 2 minutes or so, I'm worried about the attack surface here.

An obvious possible solution is to limit based on the number of working trie entries for a given key.

I don't know much about the current mechanics of NEAR under high load – for normal incrementing nonces, is there a point where the network will reject transactions because there's too many pending for that key? If so, we could use the same limit here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's bounded by the number of transactions that can be executed by a shard per expiry window (~100 seconds). Assuming a shard can one day do 10,000 transactions a second (I've only observed up to ~300) that'd be 32MBs in the trie.

I don't see much point in having a per key limit since an attacker could easily create an account with 100,000 keys. It would also require us to store the keys that each cryptohash belongs to which would double the amount of data stored in the trie under certain circumstances.

@bowenwang1996
Copy link
Collaborator

During the protocol working group meeting today, @mm-near came up with an idea to address the nonce issue for parallel transaction submission without modifying the state:

  • Transaction nonces are a 32-byte hash of block height concatenated with a salt. For each height, there are 2048 predefined valid salts (could be 1,2,...,2048).
  • Transactions are valid for a short period of time (100 blocks for example)
  • Validators precompute the hashes for each height and store them in memory. Assuming transactions are valid for 100 blocks, validators need to keep 204,800 hashes for each shard in memory, which is roughly 6.5MB
  • When validators process a chunk, they check whether each transaction is valid by checking whether its nonce belongs to the list of valid nonces.
  • Forks do not matter since the hash computation is based on block height.

This scheme does not require changing anything in the state and is therefore simpler to implement. However, it is not compatible with the existing design. To unify both mechanisms, we can make transaction nonce 256 bit and if the highest bit is set to 0 then the current mechanism is used (i.e, it is treated as a 64 bit number). Otherwise the new mechanism is used.

@ewiner
Copy link
Contributor

ewiner commented Jan 8, 2024

I'm sorry, I don't fully understand your proposal yet or the benefits of it.

Is this correct?

  1. The transaction author picks two numbers: one between currentBlockHeight and currentBlockHeight-99, and another between 1 and 2048.
  2. The nonce is determined by some hash function with those two numbers as inputs.
  3. Validators keep track of those 204,800 total allowed hashes and ensure that the given nonce is in the list.

How does this protect against transaction replay?

Also, for offline signers there's a delay between when a transaction is signed and when it can be broadcast. If I understand this proposal correctly, someone using this new nonce scheme would have to anticipate and estimate the block height for a short window when they broadcast the transaction.
There's already a 24-hour time limit between tx creation and broadcast, enforced by the blockHash parameter. I'd be worried about restricting that even further.

@DavidM-D
Copy link
Contributor Author

DavidM-D commented Jan 8, 2024

2048 possible salts per block seem too small. If you're selecting those salts randomly and you want to have a >99% chance of not having a collision you can only submit 6 transactions per block.

What is the hashing being used for? The small set of possible inputs make it both trivial to reveal the input (the validators do it every block) and larger than the data it's hashing.

Copy link
Contributor

@MaximusHaximus MaximusHaximus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just like to put my voice behind this NEP also -- nonce behaviour has been a regular frustration point in cases where it is desirable to use a single account to execute many transactions concurrently, which is a very common pattern.

A real-world example of how this impacts throughput is how we provide new account creation on the testnet chain. This functionality is provided by a helper service which used to hold a single access key for the .testnet account. I discovered that even with eager nonce incrementation, the highest throughput that I could accomplish with a single key was only ~6 new account transactions per second. To work around this and provide a higher throughput for testnet new account creation, I added 50 keys to the .testnet top-level account, and made the service use those keys in a round-robin fashion, which is what is currently live on testnet -- 300 new accounts per second is the effective limit (6 per key * 50 keys) of new account creation on testnet today.

While it is true that we can scale this higher by adding more access keys to the account and spreading the transactions across more keys, it's a 'gotcha' that catches many developers and not a great development experience or efficient use of development time. Putting hundreds of keys on a single account to scale up throughput makes it harder to manage things like key rotation and makes secret management frustrating.

neps/nep-0522.md Outdated Show resolved Hide resolved
neps/nep-0522.md Outdated Show resolved Hide resolved
neps/nep-0522.md Outdated Show resolved Hide resolved
@DavidM-D DavidM-D changed the title NEP-522: Stateless Transactions NEP-522: Optional Transaction Ordering Jan 9, 2024
@bowenwang1996
Copy link
Collaborator

As @ewiner pointed out, only allowing 2048 salts per block is too small when multiple users want to submit transactions at the same time. While we can increase that number, it may create other issues such as the hashes that validators need to compute. Instead, we can do the following:

  • Transaction nonce remains a u64.
  • Nonces are only valid for a small number of blocks N (100 for example). More specifically, when a transaction is submitted, it is invalid if the difference in height between base_block_hash and the tip of the chain is larger than N.
  • Validators locally keep track of all the nonces used in the past N blocks and keep updating them. This can be done by storing a hashset of nonces and separately a deque of nonces. Assuming that each block has 100,000 transactions, the total amount of memory usage is roughly 1,600,000 * N. If N is 100, this is roughly 160MB, which is reasonable. When a node restarts, it needs to read the past N blocks to build the in-memory indices. Forks are also fine since all operations here are in memory.
  • When validators receive a transaction of this form, after checking that its base block is within the last N blocks, they then check that the nonce is unique, i.e, not in the set of nonces already used. Note that it is okay for nonces to repeat after N blocks because it is guaranteed that the base block hash will not be the same and therefore the transactions cannot be the same.
  • For a user to send a transaction, it is sufficient to generate a random nonce. The probability of collision given N*100,000 transactions is extremely low when N is small (100 for example). The requirement that the base block hash has to be quite recent is not really an additional burden since client side APIs (near-api-js for example) already queries the chain for the latest block when creating a transaction.

This mechanism, however, is again not compatible with the current format. One idea to unify them is to check the highest bit of the nonce and only use the new mechanism is the highest bit is set to 1. However, due to a protocol feature introduced long time ago to prevent transaction duplication, there are already access keys with nonces larger than 2^32. As a result, we may need to introduce a new version of Transaction or play with serialization to allow this new type of nonce.

@DavidM-D
Copy link
Contributor Author

DavidM-D commented Jan 9, 2024

So this is relatively similar to the current proposal with some differences.

Rather than storing the nonces we store hashes of the transactions which contain the nonces. This prevents other entities from frontrunning a transaction, copying the nonce and then preventing the other transaction from being accepted. It also allows us to have a smaller space of nonces since we need unique nonces for otherwise identical transactions, in a pinch it could perhaps be as low as 2^16.

We also use an expiry rather than a block hash and check that isn't too far in the future. This requires a short validity period of a transaction but doesn't require that the signature occurred during that validity period, unlike requiring a block hash. That's somewhat useful for some of the colder wallet setups although a little on the margins.

Our implementation uses a Trie of Expiry => Hash which makes removal simple, but I'm sure a deque works just fine too.

I'm glad to see that it appears we're converging to some extent!

@birchmd
Copy link
Contributor

birchmd commented Jan 9, 2024

Thanks for the active work on this @DavidM-D ! At Aurora we also feel the troubles of nonce contention and a protocol change to address this is a good idea!

Some comments:

  • To prevent replay, a transaction hash must be remembered until its expiry time elapses. Therefore, I think the protocol should enforce a maximum expiry time. In addition to the may reject rule you have in the proposal, I think there should be a new block validity rule that says for all transactions included in a chunk, their expiry time must be in the range [chunk.timestamp, chunk.timestamp + 100s) (the upper bound would be a new protocol-level parameter). This prevents lax nodes that accept all transactions with all expiries from putting a long-lived transaction into a chunk that all nodes tracking the shard then need to remember.
  • Separately, I like @bowenwang1996 's suggestion of using the already present block_hash field to give an implicit expiry (by enforcing a fixed maximum number of blocks between the specified hash and transaction inclusion). I think this is nice because it opens an alternative to the trie for tracking which transaction hashes have already been included. For each block you create a Bloom filter containing the included transactions and then store those in a deque. This is a fixed-sized data structure regardless of the number of transactions, and it is smaller than the worst case size of other storage options (e.g. using 256 kiB Bloom filters and 128 block expiry is only 32 MiB). During times of high transaction volume there will be an increased chance of a false positive rejection, but with 256 kiB filters the probability of rejection is less than 50% even for 1 million transactions.

@DavidM-D
Copy link
Contributor Author

@bowenwang1996 it seems like we've converged on a design and there appears to be a need. How do we move this into review?

@walnut-the-cat
Copy link
Contributor

walnut-the-cat commented Jan 24, 2024

Thank you @DavidM-D for submitting this NEP.

As a moderator, I reviewed the NEP meets the proposed template guidelines, so I am moving this NEP to the REVIEW stage. Having said that, I believe we need to make sure the NEP reflects all the discussions happened among stakeholders so if you can make appropriate changes, that will be great.

Meanwhile, I would like to ask the @near/wg-protocol working group members to assign 2 Technical Reviewers to complete a technical review (see expectations below).

Just for clarity, Technical Reviewers play a crucial role in scaling NEAR ecosystem as they provide their in-depth expertise in the niche topic while work group members can stay on guard of the NEAR ecosystem. The discussions may get too deep and it would be inefficient for each WG member to dive into every single comment, so NEAR Developer Governance designed this process that includes subject matter experts helping us to scale by writing a summary with the raised concerns and how they were addressed.

Technical Review Guidelines
  • First, review the proposal within one week. If you have any suggestions that could be fixed, leave them as comments to the author. It may take a couple of iterations to resolve any open comments.

  • Second, once all the suggestions are addressed, produce a Technical Summary, which helps the working group members make a weighted decision faster. Without the summary, the working group will have to read the whole discussion and potentially miss some details.

Technical Summary guidelines:

  • A recommendation for the working group if the NEP is ready for voting (it could be approving or rejecting recommendation). Please note that this is the reviewer's personal recommendation.

  • A summary of benefits that surfaced in previous discussions. This should include a concise list of all the benefits that others raised, not just the ones that the reviewer personally agrees with.

  • A summary of concerns or blockers, along with their current status and resolution. Again, this should reflect the collective view of all commenters, not just the reviewer's perspective.

Here is a nice example and a template for your convenience:


### Recommendation

Add recommendation

### Benefits

* Benefit

* Benefit

### Concerns

| # | Concern | Resolution | Status |

| - | - | - | - |

| 1 | Concern | Resolution | Status |

| 2 | Concern | Resolution | Status |

Please tag the @near/nep-moderators once you are done, so we can move this NEP to the voting stage. Thanks again.

@DavidM-D
Copy link
Contributor Author

DavidM-D commented Feb 7, 2024

@walnut-the-cat anything I need to do to keep this moving?

@walnut-the-cat
Copy link
Contributor

walnut-the-cat commented Feb 21, 2024

Hey @DavidM-D , as long as you made sure points discussed as comments are reflected (as needed) in the latest NEP, no. Once @near/wg-protocol nominates SMEs for the NEP review, we will move forward with review process.

@marcelo-gonzalez
Copy link

this looks like a cool proposal! I have had some annoying experiences with nonces in the past so would def welcome something like this. One thing I'm wondering about, is what kind of assumptions we would have to make about the guarantees you get from expires_at field. As far as I know, in practice block timestamps do look roughly like unix timestamps separated by about one second or so, but that isn't enforced by the protocol. If you make the below change to a validator, its blocks will continue to be accepted as valid:

diff --git a/core/primitives/src/block.rs b/core/primitives/src/block.rs
index 49f1c9503..fb8b01125 100644
--- a/core/primitives/src/block.rs
+++ b/core/primitives/src/block.rs
@@ -12,9 +12,7 @@ use crate::sharding::{
     ChunkHashHeight, EncodedShardChunk, ReedSolomonWrapper, ShardChunk, ShardChunkHeader,
     ShardChunkHeaderV1,
 };
-use crate::static_clock::StaticClock;
 use crate::types::{Balance, BlockHeight, EpochId, Gas, NumBlocks, StateRoot};
-use crate::utils::to_timestamp;
 use crate::validator_signer::{EmptyValidatorSigner, ValidatorSigner};
 use crate::version::{ProtocolVersion, SHARD_CHUNK_HEADER_UPGRADE_VERSION};
 use borsh::{BorshDeserialize, BorshSerialize};
@@ -240,7 +238,7 @@ impl Block {
         signer: &dyn ValidatorSigner,
         next_bp_hash: CryptoHash,
         block_merkle_root: CryptoHash,
-        timestamp_override: Option<DateTime<chrono::Utc>>,
+        _timestamp_override: Option<DateTime<chrono::Utc>>,
     ) -> Self {
         // Collect aggregate of validators and gas usage/limits from chunks.
         let mut prev_validator_proposals = vec![];
@@ -270,9 +268,7 @@ impl Block {
         );
 
         let new_total_supply = prev.total_supply() + minted_amount.unwrap_or(0) - balance_burnt;
-        let now = to_timestamp(timestamp_override.unwrap_or_else(StaticClock::utc));
-        let time = if now <= prev.raw_timestamp() { prev.raw_timestamp() + 1 } else { now };
-
+        let time = prev.raw_timestamp()+1;
         let (vrf_value, vrf_proof) = signer.compute_vrf_with_proof(prev.random_value().as_ref());
         let random_value = hash(vrf_value.0.as_ref());

so here we are just incrementing the unix timestamp by one nanosecond every time instead of taking a "real" timestamp. So that would imply that the expires_at field isn't guaranteed by the protocol to match up to anything like what you would expect. But maybe it's okay since in practice validators aren't going to do that and nothing seriously bad will happen if they do?

@bowenwang1996
Copy link
Collaborator

As a working group member, I'd like to nominate @jakmeier and @birchmd as subject matter experts to review this NEP.

@marcelo-gonzalez
Copy link

^ about my last message, I wasnt aware of this but it looks like there's a transaction field in bitcoin that makes reference to block timestamps: https://en.bitcoin.it/wiki/NLockTime. So I would guess if my concern raised there is a problem for this proposal, then surely it would somehow have been a problem for this bitcoin nLockTime field, too... But it appears actually this nLockTime field is compared (when it is chosen to refer to timestamps and not block heights) to a median of the timestamps of the last 11 blocks. some explanation here, and in the inked BIP: https://github.com/bitcoin/bitcoin/blob/master/doc/release-notes/release-notes-0.11.2.md#bip113-mempool-only-locktime-enforcement-using-getmediantimepast

Maybe the choices there are relevant to this discussion too, even though the details are not exactly the same of course

Copy link
Contributor

@birchmd birchmd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a SME reviewer, I recognize the troubles with nonces this proposal is attempting to solve. I agree that this is an issue which deserves being addressed, however the current proposal is under-developed. I think it is important to consider technicalities such as whether the used nonces are stored separately in memory or in the state trie; as well as if a simple mapping is sufficient or if another data structure should be used (eg Bloom filter). I think it is important that any solution explicitly bound the amount of resources nodes can be forced to use in tracking nonces.


Our new solution is simple from a client side. Generate a random nonce, set an `expires_at` 100 seconds in the future and send it to the network. If it fails spuriously retry with the same nonce, if it fails consistently prompt the user to make a decision on whether to resend the transaction.

The client doesn't need to query the key's nonce, and it doesn't need to know what the current block height is, removing a source of instability. They can send as many messages as they like from a single key without worrying about any form of contention.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need to know what the current block height is

The client will still need to know a recent block hash to specify the block_hash field in the transaction.


## Reference Implementation

A draft PR of the protocol change is a WIP and should land on Friday 8th Dec 2023.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the status of the reference implementation?


Since these transactions store the `CryptoHash` in the working trie for the duration of their validity, the validity period of these transactions must always be small enough to prevent slow lookups or excessive space use.

Node providers may decide to mess with the block time. The protocol ensures that block time is always monotonically increasing, so invalid messages can't become valid, but they could make messages last much longer than one might expect. That being said that's already possible if you can slow down the networks block speed causing similar issues.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there are minimal protocol-level guarantees on timestamp, I wonder if we should not add the expires_at field and instead always use max_block_height for expiry. Wallets could still provide an expires_at UI where it translates the user's expiry time into an amount of blocks based on an average block time.

- Simpler more secure clients
- Better application reliability
- No more reliance on an indexer to send transactions
- Read RPC [will work better](https://pagodaplatform.atlassian.net/browse/ND-536)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue should be made public (and not require any atlassian login) since NEPs are public documentation.

We then ensure that the message is unique using the following pseudocode:

```rust
fn ensure_unique(trie: &mut Trie, message: Transaction | DelegateAction) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are alternatives which leave the nonce storage in memory instead of in the trie. For example Bowen's proposal. I also proposed that we could bound the amount of space taken for nonce checking by using a Bloom filter (though the trade-off is the possibility for false rejections).

I think these proposals should be integrated into this NEP either as alternatives that were considered and rejected or as part of the main design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bloom filters probably wouldn't be appropriate because it's difficult to remove an element from them, which you'd have to do when things expire.


There is no requirement that the `random_nonce` field is unique or random. A client could (inadvisably) decide to always provide a `random_nonce` with a value of 0 and it would work as expected until they tried to send two completely identical transactions.

When `random_nonce` is present the protocol **may** reject transactions with an `expires_at` more than 120 seconds after the most recent block timestamp or a `max_block_height` more than 100 blocks after the most recent block height and the error `ExpiryTooLate` will be thrown.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "may reject" is too weak due to the cost to all nodes if a long-lived transaction is accepted. Too long-lived transactions should be a chunk validity condition.

@birchmd
Copy link
Contributor

birchmd commented Feb 27, 2024

Another potential idea for this (originally from @JonathanLogan):

Conceptually, we imagine a new kind of access key has an infinite bit-set associated with it. When a transaction is submitted, the nonce represents what index in the bit-set is being set. If the bit is already set then we know the nonce has already been used and we can reject the transaction, otherwise the transaction can be accepted. However, an infinite bit-set is not realistic, therefore we will only expose a finite window of the bit-set at any time. This will be a sliding window; it changes as transactions are submitted.

Let W be the width of the window. As we shall see, the width of the window will define the maximum number of transactions that can be in-flight at the same time. The on-chain state associated with this kind of access key will be represented by an offset (eg a 64-bit number) and a bit-set of size W (e.g. 256 bit). The offset represents the starting index of the window in the hypothetical infinite bit-set and the finite bit-set represents the bits that are currently visible in the window. The logic for accepting or rejecting transactions is given by the following pseudocode:

fn check(tx_nonce: u64, mut offset: u64, mut window: [bool; W]) -> Accept | Reject {
    if tx_nonce < offset {
        return Reject; // Nonce is obviously too low
    }

    let window_index = tx_nonce - offset;

    if window_index >= W {
        // Nonce is outside current window, so shift the window
        let shift = window_index - W + 1;
        offset += shift;
        window <<= shift;
        window[W - 1] = true;
        return Accept;
    }

    // Reject used nonces
    if window[window_index] { return Reject; }

    window[window_index] = true;
    Accept
}

Let's work through an example. Suppose W = 8 (let's keep it small to make our example easy) and the key is brand new so that offset = 0 and window is an empty bitset. In parallel a client can send transactions with nonces 3 and 7, which will set the corresponding bits in the window. A block later, a client can send transactions with nonces 0, 1, 2 and all these will be accepted also. Even though these nonces are smaller than 3 and 7, they are accepted because the bits in the window are not set and the window has not yet shifted. Some time later a client now sends a transaction with nonce 12. This is outside the current window (it currently exposes indices 0-7 inclusive), so the window shifts by 5 and the offset is now set to 5. Now if a client sends the nonce 4 it will be rejected as too low (even though 4 was not actually used).

Hopefully it is clear now why the window width represents the number of transactions that can be in-flight at the same time. For example, with W = 8, we can send transactions with nonces 0-7 as a batch and they will all be accepted (regardless of the order they arrive), then once those are all final we could send a batch with nonces 8-15 and again they will all be accepted regardless of the order they arrive (the window will shift as higher nonces arrive).

If a user wanted to recover the same behaviour as the current nonce system for some reason then they could do so by sending nonces separated by multiples of the window size. For example, sending nonces 0, 8, 16, 24; if these arrive in the wrong order then not all the transactions will execute (eg if 16 arrives before 8 then the window will shift too far and 8 will be rejected).

In practice we would choose a value like W = 256 so that for practical purposes a single client should not need to rotate through multiple keys (though each distinct client should have its own key; for example each server behind a load balancer in a relayer service).

Advantages:

  • No change to transaction structure; nonce is sent as normal
  • Minimal validator overhead (no new in-memory structures)
  • Constant size data structure per key means this scheme fits easily within current cost models (accounts may store any number of the new kind of access key as long as they have $NEAR to cover the storage staking)

Disadvantages:

  • Some synchronization still needed if two clients share a key (for example, one could alway use the lower bits of the window and the other could use the upper bits, but they would need to coordinate to make sure the window does not shift too quickly)
  • Adding a new kind of access key is a big change (tooling needs to be able to parse the new variant for showing access key details, and needs to allow users to add this new variant to their accounts)

Copy link
Contributor

@jakmeier jakmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an SME reviewer, I have read through the proposal and discussions so far. I will provide my first impression and what I think needs to be done to move forward in this report.

The problem description is very through and convincing, thank you for writing it!

Overall, I want to support this NEP because I believe this is an important need worthy of a protocol change.
And I think the proposed design resolvess the core issue with a relatively low impact on validator hardware requirements.

However, I agree with @birchmd that the NEP is not quite ready for a voting phase.
All the comments of their review should be addressed, especially these two:

  • The integration with the alternative proposals in the NEP. (And a clear bottom line of what the currently proposed design is, in case it changed due to the discussion.)
  • The reference implementation.

I would also like to see the ambiguities in the specification removed.

  • How long exactly would a random nonce need to be stored at most?
  • What is an upper bound on extra storage and IOPS requirements per shard? (Should be part of the NEP description, not just a comment)
  • Will transaction gas costs change in any way?

Once all these points have been addressed, I can do another review and provide a more in-depth analysis from my side.


We describe `Transaction` and `DelegateAction`s collectively as messages in this specification.

We propose to add the optional fields to the messages:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it could be a bit of work to get backwards-compatibility right for adding these fields. It's probably not worth describing it here in the NEP itself. But I'm looking forward to see how this is done in the reference implementation.

Comment on lines +110 to +113
for t in expired_times {
// Remove all hashes that expired at this time
trie.remove_all(ExpiresAt(t));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This potentially produces a lot of IOPS for the modified trie nodes, which has been a bottleneck in the past.

I guess it should be fine because of the rounding to full seconds for timestamps. But it would be great if you could derive an explicit upper bound for how many trie values would be added / removed per chunk and include it in the NEP description.

@bowenwang1996
Copy link
Collaborator

@birchmd do I understand correctly that in #522 (comment), there can be at most W transactions in flight at the same time?

@ilblackdragon
Copy link
Member

ilblackdragon commented Mar 31, 2024

There are two sets of problems that users have as far as I see:

  1. Submitting batch of transactions with strict ordering. Currently solved by waiting for a transaction to get fully submitted before sending the second transaction.
  2. Submitting a set of transactions without ordering as fast as possible. Currently solved by using separate keys but limited in usability in wallet use cases which don't have a lot of keys.

For the (1) wallets already send one transaction at time. Submitting a batch of transactions at the same time will not work in many cases (e.g. ft_transfer_call + unwrap gets submitted at the same time - there is nothing to unwrap yet).

For this to work effectively, there needs to be a new approach to attaching more payload after one transaction executed to kick off another one even if method doesn't support call forward.

Given above, why would we not switch all access keys to use the "nonce set" approach described in #522 (comment)? If there is a need to submit transactions in order, one can use + window_size to previous nonce to always require window to move.

We also need a way to tell wallet if transactions need to be sent in order or as a set.
Probably some format of [(tx1, tx2), (tx3, tx4)] is required where (tx1, tx2) sent in parallel and two groups are sent sequentially.

@nujabes403
Copy link
Contributor

nujabes403 commented Apr 1, 2024

We also need a way to tell wallet if transactions need to be sent in order or as a set.
Probably some format of [(tx1, tx2), (tx3, tx4)] is required where (tx1, tx2) sent in parallel and two groups are sent sequentially.

Yes, it would be good to have developers choose the transaction ordering.
So if the order of transactions from the dapp is not critical, they can improve UX much better than now

@birchmd
Copy link
Contributor

birchmd commented Apr 1, 2024

@birchmd do I understand correctly that in #522 (comment), there can be at most W transactions in flight at the same time?

@bowenwang1996 , yes that is correct.

@birchmd
Copy link
Contributor

birchmd commented Apr 1, 2024

why would we not switch all access keys to use the "nonce set" approach

It would be a large state migration because all access keys would take up 256 bits more space (if we choose a window size of 256). This might be ok, but generally the core team tends to shy away from these kinds of state migrations because they are risky.

@victorchimakanu
Copy link

victorchimakanu commented May 7, 2024

NEP Status (Updated by NEP Moderators)

Status: Not eligible for Voting

SME reviews:

Protocol Work Group voting indications (❔ | 👍 | 👎 ):

Based on the reviews above, This proposal did not pass the SME reviews, therefore it cannot enter the final voting stage.

@DavidM-D
Copy link
Contributor Author

DavidM-D commented May 8, 2024

@birchmd that's an interesting proposal, but it seems like there'd be the same number of IO ops (1 per transaction), more storage use I think (32 bytes per key, ~2 million key per shard vs 8 bytes per transaction in the last 100 seconds, ~33,000 txs per shard) and a slightly more finicky interface (clients remembering what nonce's they've used).

Is the main advantage that it's easier to implement?

@birchmd
Copy link
Contributor

birchmd commented May 15, 2024

Is the main advantage that it's easier to implement?

Yes, being easier to implement is a key advantage of the "nonce set" proposal. Specifically, since it addresses the problem at the access key level instead of the transaction level, there is no need to add new logic for tracking expiry. Notably, this means the cost of increased storage usage is already accounted for without any additional change.

the same number of IO ops (1 per transaction)

I think the nonce set proposal is better in terms of IO ops because it is bounded by the amount of gas that can be spent in a single chunk (since state changes happen eagerly when the transaction is converted to a receipt). For the random nonce proposal the bound is more complicated due to the future expiry. Suppose that for 100 blocks in a row a malicious user submits transactions with the same expires_at value; once that time passes all transactions from all of those blocks suddenly need to be removed from the trie at the same time. As Jakob pointed out above, needing to do many IO ops all at once has been a performance bottleneck in the past so this presents an issue. It's possible this could be avoided by lazily removing elements from the expiry mapping in a background task but that increases the complexity of the random nonce solution.

more storage

I don't think the total amount of storage used is a problem as long as it's properly costed and only small parts of the total storage need to be used at a single time. As I discussed above, the nonce set proposal fits into Near's current cost model and the amount needed at a given time is bounded.

slightly more finicky interface

I agree the nonce set interface is not as nice as the random nonce one, but I still think it is a big improvement over the current status quo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-NEP A NEAR Enhancement Proposal (NEP). S-draft/needs-implementation A NEP in the DRAFT stage that needs implementation. S-review/needs-wg-to-assign-sme A NEP that needs working group to assign two SMEs. WG-protocol Protocol Standards Work Group should be accountable WG-tools Tools Working Group should be accountable.
Projects
Status: NEW❗
Status: REJECTED
Development

Successfully merging this pull request may close these issues.