Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle unparsable multiaddrs #70

Open
Stebalien opened this issue Jul 18, 2018 · 12 comments
Open

Handle unparsable multiaddrs #70

Stebalien opened this issue Jul 18, 2018 · 12 comments

Comments

@Stebalien
Copy link
Member

Stebalien commented Jul 18, 2018

Due to the fact that binary multiaddrs don't include protocol definitions, we can't parse them unless we know all the relevant protocols.

However, we can parse a prefix. Therefore, I'd like to propose a special, string-only "unknown" protocol that takes a single multibase encoded argument. That is: `/ip4/1.2.3.4/tcp/123/unknown/bxyz". This would only exist in the string format and would allow us to keep and use multiaddrs we don't fully understand.

This should fix some of the problems described in #6.

@alanshaw
Copy link
Member

A few things I'd like to explore:

  1. How can we retain the original string? In terms of printing these addrs out, it's more useful to be able to print the representation that we recieved, not a /unknown/whatever even if it cannot be used. The use case is an old js-ipfs-api requesting swarm.addrs to be printed out on a webpage. It might not understand how to parse QUIC addrs but it's preferable to print the QUIC addr than an "unknown" addr
  2. How can we distinguish between a multiaddr that is valid or not? If we can pass it any string protocol - how do we determine if it's valid (at least in our version of the codec table)? What would valid mean now? What do we do when passing a string/buffer to the multiaddr constructor?
  3. If we create a "string only" multiaddr - what happens when we request the buffer version - throw error?

@alanshaw
Copy link
Member

@jacobheun it might be useful to have your opinion on this?

@olizilla
Copy link
Contributor

The requirement that a multiaddr can be represented as a buffer is tricky. Every user of multiaddr has a stale copy of the buffer-to-protocol mapping. It's trivial to syntacitically validate a multiaddr string, but hard to semantically validate that protocols are valid if the definition of valid is their existance in the mapping table.

We will keep running into situations where peers and apps and api clients are running with different versions of multiaddr, with different mapping tables, and older ones unable to parse newer ones.

Having a string only format for multiaddr that doesn't guarantee it can be converted to a buffer would at least tick the points 1 to 4 on the problem statement https://github.com/multiformats/multiaddr#introduction

@Stebalien
Copy link
Member Author

I think we're talking past each other a bit here. My primary concern is relay dialing (which should have been in the issue description...). That is, I need to be able to use a multiaddr prefix (encoded) without the rest.

For example: /ip4/1.2.3.4/tcp/1234/p2p/QmRelay/p2p-circuit/ip4/tcp/1234/special_protocol/p2p/QmTarget.

QmRelay and QmTarget may understand special_protocol while I may not.

Basically, I'm worried about creating unnecessary network partitions because we're being too strict in parsing our inputs.

We still need to solve the UX issues.

How can we retain the original string?

We usually don't and can't know the original string (e.g., binary peer address records from the DHT).

How can we distinguish between a multiaddr that is valid or not? If we can pass it any string protocol - how do we determine if it's valid (at least in our version of the codec table)? What would valid mean now? What do we do when passing a string/buffer to the multiaddr constructor?

This proposal tries to cover the use-case where I don't really care about the rest of the multiaddr. Ideally, I'd be able to validate it, but even if I could, I'm just going to pass it off to someone else anyways.

If we create a "string only" multiaddr - what happens when we request the buffer version - throw error?

By string-only, I meant that /unknown/... would only appear in the string representation. When converting back to the binary representation, we'd parse $x/unknown/$y back into ParseMultiaddr($x) + CastMultiaddrBytes(DecodeMultibase($y)). That is, everything after unknown would just be the part of the multiaddr we failed to parse (binary), multibase-encoded.

Now, there is the case where a user gives me /ip4/.../...../someproto and I don't know what someproto is. This proposal doesn't handle that case but I guess we could:

  1. Allocate a special "string" path multicodec.
  2. Use that multicodec and encode the rest of the string as the argument.

That is, if we ran across $x/someproto/$y, we'd encode this to BinaryMultiaddr($x) + StringMultiaddrCodec + Length + $y. When parsing such a multiaddr, implementations would do their best to lower it to a the correct binary format.

However, really, "upgrade your client" may also be a valid solution here.

@alanshaw
Copy link
Member

Thanks for your thoughts!

That is, if we ran across $x/someproto/$y, we'd encode this to BinaryMultiaddr($x) + StringMultiaddrCodec + Length + $y. When parsing such a multiaddr, implementations would do their best to lower it to a the correct binary format.

Do we need a StringMultiaddrCodec - could we just use the unknown codec?

That aside, I'm interested to see if/how this would work in code - I'll see if I can work up a PR to js-multiaddr and report back :D

@Stebalien
Copy link
Member Author

Do we need a StringMultiaddrCodec - could we just use the unknown codec?

So, there wouldn't really be an unknown "codec". Basically, we're trying to handle two inverse cases:

  1. We have a string and need to turn it into binary. In this case, we need a protocol that only exists in binary form. When converting back to a string, the protocol would disappear.
  2. We have a binary address and need to turn it into a string. In this case, we need a protocol that only exists in string form. When encoding to the binary representation, it would disappear as we already know the binary protocol code.

Basically, we need a StringInBinary protocol (StringMultiaddrCodec) and a BinaryInString protocol (the "unknown" protocol).

That aside, I'm interested to see if/how this would work in code - I'll see if I can work up a PR to js-multiaddr and report back :D

The second half (binary in strings) has a PR here: multiformats/go-multiaddr#74.

However, I'm still worried about multiple binary representations of multiaddrs. That's why I'm less sure we want StringInBinary.

Stebalien added a commit to multiformats/multicodec that referenced this issue Nov 26, 2018
That way, we can always tell if something is a path or something else.

We may also be able to take advantage of this later to combine a few
concepts and get rid of the "multiaddrs look like paths but are totally not"
problem. However, we can think about that later. This PR just reserves the code
so we don't run into problems later.

* Remove the distinction between string/binary multiaddrs. Instead, the "string"
  will *also* be a valid binary multiaddr.
* Define a new multipath spec to combine multiaddrs and other paths.

Related to: multiformats/multiaddr#70
Stebalien added a commit to multiformats/multicodec that referenced this issue Mar 18, 2019
That way, we can always tell if something is a path or something else.

We may also be able to take advantage of this later to combine a few
concepts and get rid of the "multiaddrs look like paths but are totally not"
problem. However, we can think about that later. This PR just reserves the code
so we don't run into problems later.

* Remove the distinction between string/binary multiaddrs. Instead, the "string"
  will *also* be a valid binary multiaddr.
* Define a new multipath spec to combine multiaddrs and other paths.

Related to: multiformats/multiaddr#70
@ghost
Copy link

ghost commented Mar 27, 2019

Could the problem of unknown/new address protocols be solved by the following steps?

  1. require that all new protocols are length-prefixed
  2. require that all implementations support the existing non-length-prefixed and non-value protocols

That'd seem like a simple way forward to me.

@ntninja
Copy link
Contributor

ntninja commented Mar 27, 2019

@lgierth: How do we know whether a protocol even requires an argument at all? /utp, /quic and about 10 others do not, for instance.
Also why not simply require all protocols to be length-prefixed then? While it would increase many binary representations by 1 byte per protocol, it would make parsing byte representations much simpler in general and solve the “unknown protocol” issue almost as a side-effect. (Although it would not help with the problem parsing unknown string representations, of course.)

@ghost
Copy link

ghost commented Mar 27, 2019

How do we know whether a protocol even requires an argument at all? /utp, /quic and about 10 others do not, for instance.

Yeah, actually, point 2 above should include most or all existing protocols, not just the length-prefixed ones.

Also why not simply require all protocols to be length-prefixed then?

Backward compatibility. It'd be a huuuuge hassle to fundamentally change existing multiaddr protocols, while it's easy to freeze them as-is and introduce new rules for new protocols.

We must avoid breaking changes at all cost.

@Stebalien
Copy link
Member Author

We must avoid breaking changes at all cost.

Note: We can also transition. That is, we can say introduce a new "multiaddr-2" protocol and parse everything after that using the new system. Eventually, we can remove support for multiaddr-1 (the network will forget old multiaddrs pretty quickly). This would give us a chance to revisit this from scratch.

@saul-jb
Copy link

saul-jb commented Mar 13, 2022

Has there been any more progress made on this?

This would make multiaddr much more useful for projects that need to add complexity to what is already defined by multiaddr without requiring per-application changes to the table.

@Stebalien
Copy link
Member Author

Unfortunately, no. If you're interested, I'd start by forking go-multiaddr and adding an implementation there. I say "fork" because adding support for this will require a non-trivial refactor of everything depending on multiaddrs, so I expect we'll end up with two "flavors" of multiaddrs: partially validated ones and fully validated ones".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants