
Betanet release at March 11 2020 #2247

Merged
merged 41 commits into from
Mar 11, 2020

Conversation

ailisp
Member

@ailisp ailisp commented Mar 11, 2020

This is the version currently running at rpc.devnet.nearprotcol.com

  • applayer tests pass
  • nightly isn't ready, so it won't be tested for this release

k06a and others added 30 commits February 4, 2020 08:22
* Enable floats but prohibit some CPU architectures
* Merge branch 'staging' into enable_floats
* Merge refs/heads/staging into enable_floats
* Avoid overflowing u32 during contract preparation (#1946)
* Update runtime/near-vm-runner/src/runner.rs

Co-Authored-By: Evgeny Kuzyakov <[email protected]>
* Merge branch 'staging' into enable_floats
* Nit

Co-authored-by: Maksym Zavershynskyi <[email protected]>
* Move sysinfo fix from staging
* Introduce keccak256 and keccak512 native support

* modify genesis

Co-authored-by: Bowen Wang <[email protected]>
* Create code-of-conduct.md

* Move code of conduct to proper name

Co-authored-by: Illia Polosukhin <[email protected]>
* Fix #2067: querying contract state

* Change format to {key, value, proof} for state view. Proofs are empty for now
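The new state-view item shape described above can be sketched as follows (a simplified stand-in, not the actual nearcore type; field names follow the `{key, value, proof}` description, with `proof` empty for now):

```rust
// Sketch of the described view-state response item; simplified types.
#[derive(Debug, PartialEq)]
struct StateItem {
    key: Vec<u8>,
    value: Vec<u8>,
    proof: Vec<Vec<u8>>, // empty for now, per the change description
}

fn view_state_item(key: &[u8], value: &[u8]) -> StateItem {
    StateItem {
        key: key.to_vec(),
        value: value.to_vec(),
        proof: Vec::new(),
    }
}
```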
Instead of dumping state in a JSON file, which is slow, this PR changes it to dump state in a binary format. Fixes #2070.

Test plan
---------
Manually test this with a local node to make sure that the state is preserved correctly after state dumping.
* Add json state dump for debugging

* state_dump.json

* fix
Fixes: #2183

# Test plan:
- Extracted the code into utils crate and added unit tests.
- Revert binary state dump
- #1922 for separating config and records
- scripts/new-genesis-from-existing-state.sh that dumps state, calculates the new genesis hash, and uploads it to s3
- testnet genesis records separate from the near binary
- download the testnet genesis from s3 in python start_testnet
- check the genesis hash when running testnet
 
Co-authored-by: Evgeny Kuzyakov <[email protected]>
Co-authored-by: Bo Yao <[email protected]>

Test Plans
-------------
- It can run `near init --chain-id=testnet --genesis-config near/res/testnet_genesis_config.json --genesis-records <records download from s3> --genesis-hash <expected-hash>` to initialize the testnet config from external genesis records
- When running `near run` with an incorrect `~/.near/genesis_hash`, it panics
- It can start testnet with the updated start_testnet.py, which downloads genesis_records from s3
- After stopping a node, we can call `state-viewer dump_genesis` to dump genesis_records, config, and genesis_hash.
- If no state changes happened since the genesis block, loading the dumped genesis_records/config gives the same genesis_hash, and dumping again generates the same genesis_records, config & genesis_hash.
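The last item is the key invariant: dump → load → dump must be a fixed point. A minimal sketch (with a stand-in hash function; the real genesis hash is a cryptographic digest computed differently):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the real genesis hash computation; this only
// illustrates the round-trip invariant described above.
fn genesis_hash(records: &str, config: &str) -> u64 {
    let mut h = DefaultHasher::new();
    records.hash(&mut h);
    config.hash(&mut h);
    h.finish()
}

// dump -> load -> dump must be a fixed point when no state changed:
// reloaded records/config produce the same hash, and dumping again
// produces the same records, config, and hash.
fn dump(records: &str, config: &str) -> (String, String, u64) {
    (
        records.to_string(),
        config.to_string(),
        genesis_hash(records, config),
    )
}
```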
Some of our tests don't have actix running. @Kouprin found that when an assert in such tests failed, it wasn't possible to see which test failed, only `thread panicked while panicking`. Checking whether actix is running, and only shutting it down if so, fixes this.

Test Plan
---------
```
#[test]
fn test_assert() {
    init_stop_on_panic();
    assert!(false);
}
```
should report the assertion failure instead of `thread panicked while panicking`.
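A minimal sketch of the described guard, assuming a flag that tracks whether an actix System was started (the names here are hypothetical; the actual nearcore helper differs, and the actix call is elided to keep the sketch dependency-free):

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical flag; set to true once the actix System has been started.
static ACTIX_RUNNING: AtomicBool = AtomicBool::new(false);

fn init_stop_on_panic() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Always print the real panic message first, so the failing
        // test is visible.
        default_hook(info);
        // Only attempt to stop actix if a System is actually running;
        // stopping a non-existent System panics inside the hook,
        // which is what produced "thread panicked while panicking".
        if ACTIX_RUNNING.load(Ordering::SeqCst) {
            // Stop the actix System here, e.g. actix::System::current().stop();
            // elided so this sketch has no non-std dependencies.
        }
    }));
}
```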
If an invalid chunk made it into a block, the invalid incoming receipts might be forwarded to different shards. Ideally, the chunk should be challenged and the block reverted, but until that happens we need to handle invalid receipts in the Runtime.

Another possibility is to create invalid receipts in the state directly and then create a challenge on this invalid state, so any field in the state can potentially contain an invalid value. Therefore, if an invalid receipt is present in the delayed receipts state, we consider it a StorageError.
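The treatment described above can be sketched like this (hypothetical, heavily simplified types; the real nearcore Runtime and its validation are far richer):

```rust
// Hypothetical, simplified types for illustration only.
#[derive(Debug, PartialEq)]
enum RuntimeError {
    StorageError(String),
}

#[derive(Debug)]
struct Receipt {
    receiver_id: String,
}

// Stand-in validity check.
fn validate_receipt(r: &Receipt) -> bool {
    !r.receiver_id.is_empty()
}

// Receipts read back from the delayed-receipts state are part of the
// merkelized state: if one is invalid, the state itself is corrupt or
// crafted, so surface it as a StorageError instead of panicking.
fn pop_delayed_receipt(r: Receipt) -> Result<Receipt, RuntimeError> {
    if validate_receipt(&r) {
        Ok(r)
    } else {
        Err(RuntimeError::StorageError(
            "invalid delayed receipt in state".to_string(),
        ))
    }
}
```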

Fixes: #1850

# Test plan

- Added 2 tests to cover handling of invalid receipts in the Runtime
- Filed an issue to handle error in NightshadeRuntime #2152
* `block_production.py` started failing because blocks are now
produced faster, and by the first time it checked the heights, they
exceeded 2, causing a poorly stated assert to trigger
* `staking*.py` were never adjusted for yocto-near
* `state_sync*.py` both header sync and state sync were broken
Header sync: we were cleaning up mapping from heights to headers during
GC, which is necessary for header sync.
State sync: we changed the state sync to always happen at the epoch
boundary, but did it in the wrong way: the hash we stored locally was
pre-pushing to the epoch boundary, and thus the node was rejecting any
incoming state responses
* `state_sync1.py` was also failing because it was relying on one out of
two validating nodes being able to produce blocks, which it cannot,
because of Doomslug
* `skip_epoch.py` was expecting one validator to be producing blocks,
which with doomslug requires such validator to have more than half the
stake. Also was not updated for yocto-near
* `one_val.py` similarly was expecting one validator to be producing
blocks, and thus needed that validator's stake bumped to work with DS
* `lightclnt.py` with doomslug, blocks are produced faster, and by the
time the test started, a few blocks were already produced. Made the
test expect this behavior
* `block_sync.py` was failing because Doomslug requires
`max_block_production_delay` to be at least 2x
`min_block_production_delay`.

Also disabling `network_stress`, because the current runner crashes
trying to launch it and skips all the consecutive tests

Test Plan
---------
All the above python tests pass.
Similar to #1985, but now we changed back to logging to stderr

Test Plans
-------------
Ran pytest locally; it no longer has the `AssertionError: node dirs: 0 num_nodes: 4 num_observers: 0` error
…tly failures (#2204)

There were several issues in the test infra:
1. The peer info in the client test infra was the largest height and the
largest score ever observed. If a block with a higher score but lower
height than the previous tip was created, it would report incorrect peer
info, and peers would attempt header sync believing the peer has higher
header head height, and such header sync would fail.
2. The tests tamper with the FG, and the last final block could be way
more than 5 epochs in the past. That makes creating light client blocks
potentially require blocks from 5 epoch lengths ago. I'm just making all
nodes in cross_shard_tx archival. In practice if one epoch has been
lasting for five epoch lengths, we have bigger problems.
3. We historically see cross_shard_tx tests fail with
`InvalidBlockHeight` error when a block is more than epoch length ahead
of the previous block. Since that check is a heuristic anyway, I'm
doubling the distance, to reduce the flakiness of the test.

Separately, increasing the timeouts for NFG tests, they take more than
15 minutes.

Also bumping timeouts for the `test_all_chunks_accepted_1000*` tests,
it's clear that they need at least 2000 / 4000 / 1000 seconds to
complete, I set the timeouts to 3600 / 7200 / 1800 for some extra room.
Also the one that requires 7200 (`*_slow`) seems to provide no value
compared to the base test, and is the slowest test in our entire suite,
so I completely disable it.

Separately, fixing the issue with state sync tests, where the transition
to state sync happens before the log tracker is initialized, and the
later check in the log for the transition fails.

Slightly bumping block production time in
`test_catchup_sanity_blocks_produced`, it works on local machine, but on
the gcloud runner doesn't keep up.

Test plan
---------
All cross_shard_tx* tests passed at least three runs.
If they are flaky, nightly will catch that.
…rds via RPC

Resolves #2007 and #2025

We needed to make Near config query-able through node RPC. Specifically,
one of our clients wanted to know how many blocks remain until an
account is going to be evicted due to rent. This information can be
derived from the account balance and the config.

In this commit, two new RPC endpoints are exposed:
EXPERIMENTAL_genesis_config and EXPERIMENTAL_genesis_records. Learn more
in PR #2109.
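A hypothetical client-side request body for one of these endpoints, framed as standard JSON-RPC 2.0 (the endpoint name comes from this PR; the helper function here is purely illustrative):

```rust
// Build a JSON-RPC 2.0 request body for the new genesis-config
// endpoint. String-formatted by hand to stay dependency-free;
// a real client would use a JSON library.
fn genesis_config_request(id: u32) -> String {
    format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{},\"method\":\"EXPERIMENTAL_genesis_config\",\"params\":[]}}",
        id
    )
}
```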

# Test plan

Added tests to query the endpoints with a happy path and also invalid
parameters.
Many timeout tweaks and small typo fixes.
`cross_shard_tx_with_validator_rotation` with 150 block time has
extremely long forks (doomslug is disabled), and takes lots of time per
iteration. I split it into two: one with 150ms block time, but only 16
iterations, and one with 400ms block time, but all 64 iterations. Both
locally take ~45 minutes, will see how long it takes on gcloud.
Fixing the following issues:
1. `ThreadNode` was not properly setting its state on kill
2. `ThreadNode` doesn't properly free up its port, but instead of
figuring out why, I just replaced it with `ProcessNode` in the tests
that are affected
3. `test_4_20_kill1` wasn't accounting for fees. Disabling fees.
4. In the same test, the same chunk producer was always mapped to the
same block producer, and thus killing the second node was making the
0-th chunk producer (who happens to be attached to the 2nd BP) to not be
able to have their transactions included. Address it by having 17 seats.
Also split the test in two, with one shard and with two shards
5. In multiple places we were using the wrong node to get the access key

Test plan
---------
Locally `test_4_20_kill1` and `test_*_multiple_nodes` pass, let's see
how the next nightly looks
Disabling old tests that fail due to the runtime cache until they are
fixed.

Changing the `cross_shard_tx_with_validator_rotation` slightly based on
its performance on gcloud.
(increasing the block prod time speeds up test, because it results in
fewer forks, so the 150->200 change is to make more iterations fit.
With 150ms it fits ~6 iterations into one hour)
The actual pipeline definition is saved in the Buildkite UI; this way, it is shared between stable, beta, and master. There'll be some master-only builds (nightly release)

Test Plans
--------------
Buildkite CI should pass
Remove doctest that was running incorrectly.
The test was not updated after some config parameters were renamed.
Also because of #2195, tx status of lost transactions times out, so
adding a workaround into the test for now

Disabling the version of stress that messes up with network, because the
nightly runner is currently not configured to support the utility I use
to stop network between processes

Couple other changes:
1. Made `node_restart` worker not restart the node if no blocks were
produced in the meantime. It is needed because with only two nodes after
a long restart doomslug can take a while to recover (it is equivalent to
half the network shutting down and restarting after some delay).
Block production worker will fail the test if the block production
actually stalls
2. For the same reason increased the tolerated delays for block
production
3. Limited how many txs are sent per iteration of tx worker, since due
to #2195 it takes one second to query one transaction, and if the test
finished in the middle of querying, the allowed one minute for workers
to stop is not sufficient for the tx worker.
4. Also generally increasing the allowance from 1m to 2m at the end,
since the tx worker at the end of the test might take some time before
it even starts querying the transaction outcomes
* Try setting up GitPod

* Try pre-building nearcore in Docker image

* Remove rustup command  (nightly should already be available)

* Update location for nearcore prebuild

* Fix the way cargo is executed in Dockerfile

* Do cargo test as well when building docker image
* Fix fishermen unstake

* fix kickout set

* add test
Equivalent coverage and Docker image release from GitLab CI, except the release is a Docker image release; an S3 release will also be added in the future, since gcloud storage requires a Google login, which is inconvenient for use in scripts

Test Plans
--------------
Coverage only in the master, beta, and stable branches, and Docker release only in the master branch (for test purposes also this cov-release branch, but it will be removed). Badge updated. Docker release in the beta/stable branches will be added but is not ready yet, as tests in beta/stable are more strict
Kouprin and others added 11 commits March 5, 2020 18:36
* Fix validator kickout set

* fix stake change
Some fixes, refactoring and comments.

Test plan
---------
Run existing tests
Add more asserts
Further investigation of `stress.py` failures in nightly shows that all
the workers appear to work as expected, but frequent restarts cause
block production to be delayed more than the current tolerance.
Locally, sporadically more than 50% of transactions also get lost if the
node gets restarted too frequently.
Increasing both tolerances.

Also making the prints unbuffered, otherwise several workers in the
nightly runs don't flush their outputs.

Finally, fixing a typo in the nightly.txt

Test plan
---------
Locally `stress.py` doesn't fail, so there's no easy way to test whether
the new tolerances would be sufficient.
It also doesn't completely fix all the known issues, there are some
other failures in stress.py that I haven't gotten to yet
Add doc test runs in CI

Test Plan
------------
Should see the doc test log at the beginning of the other tests
Fixes #2226 
The current behavior of a view call is to prohibit state changes. This change moves the check from the end of the state viewer to VMLogic by prohibiting state-changing functions during view calls.
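The guard described above can be sketched like this (hypothetical, heavily simplified; real `VMLogic` host functions such as `storage_write` take keys, values, and register arguments):

```rust
// Simplified sketch: state-mutating host functions check a view-call
// flag up front, instead of a post-hoc check in the state viewer.
#[derive(Debug, PartialEq)]
enum HostError {
    ProhibitedInView(String),
}

struct VMLogic {
    is_view: bool,
}

impl VMLogic {
    // Reject a state-mutating method when executing a view call.
    fn forbid_in_view(&self, method: &str) -> Result<(), HostError> {
        if self.is_view {
            return Err(HostError::ProhibitedInView(method.to_string()));
        }
        Ok(())
    }

    fn storage_write(&mut self /* key, value, ... */) -> Result<(), HostError> {
        self.forbid_in_view("storage_write")?;
        // ... perform the actual write ...
        Ok(())
    }
}
```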

## Test plan:
- Fixed state change tests. Previously the test didn't verify the error type; the error was `MethodNotFound`, and the alice account was also wrong.
- Added unit tests for the new prohibited methods.
* Bump Borsh version

* Nit

* Bump Borsh versions

* Bump borsh more
scripts for binary release

Test Plan
------------
Downloaded the uploaded binary in a few popular Linux VMs and verified that it works
Two people running 100 nodes together is often blocked by the firewall-rule limit and the static IP address limit (the recent failure by @mfornet is this case). Making the nodes not reserve IP addresses and instead use organization-wide global firewall rules fixes this.

Test Plans
--------------
The same setting has been proven to work when creating devnet nodes.
Resolves: #2034 and #2048

The changes RPC API was broken implementation-wise (#2048), and design-wise (#2034).

This version has a meaningful API design for all the exposed data, and it is also tested better. This PR is massive since, initially, we missed the point of exposing deserialized internal data (e.g. account info and access keys used to be returned as Borsh-serialized blobs, which is useless for API users, as they don't have easy access to the schema of those structures).

## Test plan

I did not succeed in writing Rust tests since we use a mocked runtime there. Thus, I extended the end-to-end tests (pytest) with:

* Test for account changes on account creation
* Test for access key changes on account creation and access key removal
* Test for code changes
* Test for several transactions on the same block/chunk
@codecov

codecov bot commented Mar 11, 2020

Codecov Report

Merging #2247 into beta will decrease coverage by 0.59%.
The diff coverage is 78.33%.


@@           Coverage Diff           @@
##            beta   #2247     +/-   ##
=======================================
- Coverage   87.2%   86.6%   -0.6%     
=======================================
  Files        183     184      +1     
  Lines      34977   35260    +283     
=======================================
+ Hits       30502   30538     +36     
- Misses      4475    4722    +247
Impacted Files Coverage Δ
core/primitives/src/block.rs 96.33% <ø> (+0.22%) ⬆️
core/primitives/src/telemetry.rs 100% <ø> (ø) ⬆️
core/primitives/src/merkle.rs 100% <ø> (ø) ⬆️
core/primitives/src/transaction.rs 81.81% <ø> (ø) ⬆️
test-utils/testlib/src/node/mod.rs 100% <ø> (ø) ⬆️
core/primitives/src/challenge.rs 85.71% <ø> (ø) ⬆️
core/primitives/src/lib.rs 100% <ø> (ø) ⬆️
core/primitives/src/sharding.rs 99.47% <ø> (ø) ⬆️
chain/chain/src/doomslug.rs 99.52% <ø> (ø) ⬆️
core/primitives/src/network.rs 91.83% <ø> (ø) ⬆️
... and 131 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d4a07fa...6d07a2d.

@ailisp ailisp merged commit 2e5f367 into beta Mar 11, 2020
@ailisp ailisp deleted the betanet-release branch March 11, 2020 23:04