
Add BorshSchema #56

Merged: 7 commits merged into master on Feb 22, 2020

Conversation

@MaksymZavershynskyi (Contributor) commented Feb 21, 2020

We need a Borsh schema for two things:

  • Rust contracts API;
  • Node RPC API, e.g. error messages;

#[derive(BorshSchema)] implements the BorshSchema trait that describes the schema of the type and of all its dependencies. See borsh-rs/borsh/tests/test_schema* for examples.
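
As a quick illustration, here is a minimal sketch of the derive in use (a hypothetical example; the schema_container() accessor is an assumption about the exact API and may differ from what this PR exposes):

use borsh::BorshSchema;

// Deriving BorshSchema describes this struct and, recursively, all the types
// it depends on (here: String and u64).
#[derive(BorshSchema)]
struct Transfer {
    receiver: String,
    amount: u64,
}

fn main() {
    // Assumed accessor: a container holding the declaration of `Transfer`
    // plus the definitions of all of its dependencies.
    let container = Transfer::schema_container();
    println!("{:?}", container);
}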

The key differences from the current implementation of the Borsh schema in nearlib:

  • This schema is auto-derivable for most Rust types;
  • We can now have arrays of types other than u8;
  • It works with generics. A name like MyType<string, u64> is just a string, so languages without generics can simply map it to a standard non-generic type (see the sketch after this list);
  • The schema is language agnostic (enum is a tagged union, which is a common language construct that can be emulated in many languages);
  • Uniform structure of the definitions, with no inline definitions like "fields": [ "field_a", ["u64"]]. Instead we have "fields": [ "field_a", "Array<u64>"].
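
A hedged sketch of the generics point above (the declaration() method name and the exact rendering of the name are assumptions, not something this PR spells out):

use borsh::BorshSchema;

#[derive(BorshSchema)]
struct MyType<K, V> {
    key: K,
    value: V,
}

fn main() {
    // The declaration of a monomorphized generic type is just a string such as
    // "MyType<string, u64>", so a consumer without generics can treat it as an
    // ordinary, non-generic type name.
    println!("{}", <MyType<String, u64>>::declaration());
}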

@MaksymZavershynskyi MaksymZavershynskyi requested review from willemneal and vgrichina and removed request for mikhailOK February 21, 2020 04:23
@MaksymZavershynskyi (Contributor Author)

FYI @willemneal and @vgrichina, you might want to familiarize yourselves with the format, since it is likely that you will be implementing support for it in nearlib (this would require fixing the nearlib Borsh serialization) and AssemblyScript.

@MaksymZavershynskyi MaksymZavershynskyi requested review from frol and removed request for willemneal and vgrichina February 21, 2020 04:25
@MaksymZavershynskyi (Contributor Author)

Assigning to @frol since he is our API guardian and will be encountering Borsh schema a lot.

@MaksymZavershynskyi MaksymZavershynskyi added enhancement New feature or request P1 Priority 1 labels Feb 21, 2020
Comment on lines +311 to +312
"Vec<u64>" => r#"{ "kind": "sequence", "params": [ "u64" ] }"#,
"Vec<Vec<u64>>" => r#"{ "kind": "sequence", "params": [ "Vec<u64>" ] }"#
Collaborator:

This looks inconsistent to me. Why do we encode types with kind (sequence / tuple / enum) at the top level, but then in params we use Vec<...>?

Contributor Author:

Kind does not encode the type; it is a type constructor: knowing the value of the kind and the params, we know how to construct this type in memory. Arguably, we could keep just four kinds: enum (a.k.a. tagged union), sequence, array, and tuple (merging struct into tuple). enum requires reading the tag byte that tells which variant follows. sequence requires reading the length prefix that tells how many elements there are. array does not require reading additional bytes, but all of its elements have the same type. tuple and struct have non-uniform element types. However, I decided to keep struct because it gives us explicit field names, which will be useful when we convert from JSON to Borsh.
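
A rough sketch of that idea (the names below are illustrative, not the exact types introduced by this PR): each kind tells a reader how to decode a value given its params.

// Hypothetical, simplified model of the schema kinds discussed above.
enum Definition {
    // Read a tag first, then decode the corresponding variant.
    Enum { variants: Vec<(String, String)> },
    // Read a length prefix, then that many elements of a single type.
    Sequence { element: String },
    // Fixed number of elements, no length prefix, uniform element type.
    Array { length: u32, element: String },
    // Fixed number of elements, possibly different types, no field names.
    Tuple { elements: Vec<String> },
    // Like a tuple, but with explicit field names (useful for JSON <-> Borsh).
    Struct { fields: Vec<(String, String)> },
}

fn main() {
    // "Vec<u64>" from the snippet above, expressed with this model:
    let _vec_u64 = Definition::Sequence { element: "u64".to_string() };
}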

Contributor Author:

Relevant, but not exactly the same: https://serde.rs/data-model.html

Collaborator:

I am having a hard time understanding why we need this top-level type [constructor] to be different from the inner items. To achieve consistency, I see two ways:

// 1
"Vec<Vec<u64>>" => r#"{ "kind": "sequence", "params": [ {"kind": "sequence", "params": ["u64"] } ] }"#

// 2
"Vec<Vec<u64>>" => r#"Vec<Vec<u64>>"#

I may just be missing something, so I will approve the PR (it looks solid from the code and tests point of view) and let you make the decision. I am ready to continue the discussion, though, if you feel there is a benefit to an ELI5.

Contributor Author:

The idea is that every entry of the map is the following: <name of the type> => <how to construct it>. We cannot have

"Vec<Vec<u64>>" => r#"{ "kind": "sequence", "params": [ {"kind": "sequence", "params": ["u64"] } ] }"#

and should instead have:

"Vec<Vec<u64>>" => r#"{ "kind": "sequence", "params": [ "Vec<u64>" ] }"#,
"Vec<u64>" => r#"{ "kind": "sequence", "params": [ "u64" ] }"#

because we can have recursive definitions like:

use std::rc::Rc;

struct Node {
    parent: Rc<Node>,
    children: Vec<Rc<Node>>,
}

that cannot be described using the proposed way.

Also we cannot have

"Vec<Vec<u64>>" => r#"Vec<Vec<u64>>"#

for the same reason, and also because processing it would require parsing the string, which is much harder than simply working with JSON.
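
To make that concrete, here is a hedged sketch of such a flat map for the Node example (the JSON layout of the struct entry is an assumption; the sequence entries mirror the snippets above):

use std::collections::HashMap;

fn main() {
    // Every entry is `<name of the type> => <how to construct it>`; nested and
    // recursive types are referenced by name instead of being inlined.
    let mut definitions: HashMap<&str, &str> = HashMap::new();
    definitions.insert(
        "Node",
        r#"{ "kind": "struct", "fields": [ ["parent", "Rc<Node>"], ["children", "Vec<Rc<Node>>"] ] }"#,
    );
    definitions.insert(
        "Vec<Rc<Node>>",
        r#"{ "kind": "sequence", "params": [ "Rc<Node>" ] }"#,
    );
    // "Rc<Node>" would itself be declared in terms of "Node", closing the cycle
    // without any definition having to be inlined inside another one.
    for (name, definition) in &definitions {
        println!("{} => {}", name, definition);
    }
}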

Contributor Author (Feb 21, 2020):

Another reason is that if we have Vec<A> and HashMap<A, B>, we don't want A to be defined twice; keeping a single entry per name avoids that redundancy.

Collaborator:

Aaaah, I got it! Vec<u64> is essentially an alias, which we use inside the Vec<Vec<u64>> definition! Now it makes perfect sense. Looks good to me!

@vgrichina (Contributor)

@nearmax @frol let's make sure we actually have a solution for https://commonwealth.im/near/proposal/discussion/314-dealing-borsh-arguments-on-hsm

@MaksymZavershynskyi (Contributor Author)

> @nearmax @frol let's make sure we actually have a solution for https://commonwealth.im/near/proposal/discussion/314-dealing-borsh-arguments-on-hsm

It seems to be an orthogonal problem that we can solve in parallel.

@frol (Collaborator) commented Feb 21, 2020

@nearmax I am not very deeply involved in the topic @vgrichina brought up, so I may be wrong, but it seems that the problem of parsing JSON is still somewhat relevant, in the sense that even if the transaction is serialized with Borsh, you would still need a JSON parser to parse the type definitions.

@vgrichina (Contributor) commented Feb 22, 2020

@nearmax it's not an orthogonal problem, because it's really hard to solve the problem discussed in https://commonwealth.im/near/proposal/discussion/314-dealing-borsh-arguments-on-hsm if we go with schemaless Borsh for contract args + separate metadata.

We cannot make any queries to the node from a hardware wallet to fetch any kind of metadata, and we don't have any reason to trust any metadata passed to the wallet.

So if we need to support some schemaless format – we basically have to be able to determine if it matches some hardcoded metadata based only on the method name (as we don't have any trusted info about the contract).

Why are we so concerned about using a schema vs. having a schemaless binary format (a la MessagePack, but it can be based on Borsh)?

How much extra gas would passing argument names in e.g. a fungible token contract cost?

How does it compare to the minimum gas cost per call?

We need to think carefully about the whole stack when designing this, or we'll run into issues similar to the ones with Rust Serde JSON support (e.g. 128-bit integers serialized as numbers, which literally no other JSON parser supports).

@MaksymZavershynskyi (Contributor Author)

I am not enabling Borsh as the default serializer for contract arguments yet. We need to fix the Borsh serializer in nearlib and reconcile it with the three other JS implementations of Borsh anyway, because we need it for the node RPC. So most of the work mentioned here: https://app.zenhub.com/workspaces/chainmiddleware-5cea2bcf78297c385cf0ec81/issues/nearprotocol/nearcore/2167 has to be done independently of whether we use Borsh for args or not. Also, exposing metadata for contracts is not predicated on using Borsh for args.

Below I address your comments:

@frol @vgrichina I see only one case where a schema-less format creates a problem: HSM, and I have already suggested a solution in that thread.

  • For cross-contract calls, the contracts need to be aware of the argument structure at development time, independently of whether it is JSON or not;
  • For calling a contract from nearlib, the plan is to fetch the schema from the node and serialize the arguments provided as JSON, TS objects, or JS objects into Borsh format. Nearlib needs a Borsh serializer anyway, and it needs to work with the Borsh schema too, for RPC errors and such.

> So if we need to support some schemaless format – we basically have to be able to determine if it matches some hardcoded metadata based only on the method name (as we don't have any trusted info about the contract).

I have described another solution in the thread that allows validating the schema based on the transaction result. So the metadata is verifiable; the only reason it is hard on an HSM is its limited computational power and space, which makes it a corner case.

> Why are we so concerned about using a schema vs. having a schemaless binary format (a la MessagePack, but it can be based on Borsh)?

Developing a serialization format is difficult. We developed Borsh by fixing https://github.com/servo/bincode and tried to keep it as simple as possible. We already know that JSON is problematic because of integers, blobs, and other issues. Trying a new format is difficult; we cannot just take some other format and check that it works for everything. MessagePack was not designed to be used for smart contracts, and verifying that it works for everything is a large effort that we cannot afford right now.

> How much extra gas would passing argument names in e.g. a fungible token contract cost?

I can tell that JSON serialization costs 80% of the gas for eth-bridge, which made it unfeasible to run. We don't have benchmarks for fungible tokens (cc @evgenykuzyakov), but I am sure it would be worse, because the remaining 20% of the eth-bridge gas went to expensive crypto operations, which fungible tokens do not have.

> We need to think carefully about the whole stack when designing this, or we'll run into issues similar to the ones with Rust Serde JSON support (e.g. 128-bit integers serialized as numbers, which literally no other JSON parser supports).

We did put a lot of thought into Borsh. If you think something needs to change in Borsh, let us know.

@MaksymZavershynskyi MaksymZavershynskyi merged commit 52f7605 into master Feb 22, 2020
@vgrichina (Contributor)

@nearmax we don't have to change Borsh or use an alternate format like MessagePack to go schemaless.

What I suggest is that we build a schema-less format using Borsh itself. It should be easy to come up with a convention for how to store generic JSON objects serialized as Borsh. That would be introspectable and easy to parse anywhere.
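
A hedged sketch of what such a convention could look like (hypothetical, not part of this PR): a generic, JSON-like value type that is itself Borsh-serializable, so payloads stay introspectable without a separate schema.

use std::collections::HashMap;
use borsh::{BorshDeserialize, BorshSerialize};

// Hypothetical self-describing value, serialized with plain Borsh.
#[derive(BorshSerialize, BorshDeserialize, Debug)]
enum Value {
    Null,
    Bool(bool),
    Number(i64),
    String(String),
    Binary(Vec<u8>),
    Array(Vec<Value>),
    Object(HashMap<String, Value>),
}

fn main() -> std::io::Result<()> {
    let args = Value::Object(HashMap::from([
        ("receiver".to_string(), Value::String("alice.near".to_string())),
        ("amount".to_string(), Value::Number(1_000)),
    ]));
    // `try_to_vec` is an assumption about the serialization entry point.
    let bytes = args.try_to_vec()?;
    println!("{} bytes", bytes.len());
    Ok(())
}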

> I can tell that JSON serialization costs 80% of the gas for eth-bridge, which made it unfeasible to run.

This is actually a far more special case than signing a transaction using a hardware wallet. Most contracts don't implement something like the bridge and don't have to pass a lot of data in arguments. Serializing binary data as an array of JSON integers is expected to be slow everywhere and isn't a good way to use JSON or to evaluate its performance (you'd at least use a base64 string if you want reasonable performance).

@MaksymZavershynskyi (Contributor Author)

> @nearmax we don't have to change Borsh or use an alternate format like MessagePack to go schemaless.
>
> What I suggest is that we build a schema-less format using Borsh itself. It should be easy to come up with a convention for how to store generic JSON objects serialized as Borsh. That would be introspectable and easy to parse anywhere.

@vgrichina Are you suggesting that some Borsh-serialized objects be preceded by their Borsh-serialized schema in binary representation? I think this is a really great idea, and the code in this PR can easily be modified to do that.
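
A hedged sketch of that idea, assuming a schema_container() accessor exists and that the container type itself implements BorshSerialize (both are assumptions about the API rather than guarantees of this PR):

use borsh::{BorshSchema, BorshSerialize};

#[derive(BorshSerialize, BorshSchema)]
struct Transfer {
    receiver: String,
    amount: u64,
}

// Produce a self-describing payload: first the Borsh-serialized schema of T,
// then the Borsh-serialized value itself.
fn to_self_describing_bytes<T: BorshSerialize + BorshSchema>(value: &T) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    T::schema_container().serialize(&mut out)?; // assumed: the container is BorshSerialize
    value.serialize(&mut out)?;
    Ok(out)
}

fn main() -> std::io::Result<()> {
    let transfer = Transfer { receiver: "alice.near".to_string(), amount: 1 };
    let bytes = to_self_describing_bytes(&transfer)?;
    println!("{} bytes (schema + value)", bytes.len());
    Ok(())
}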
