Why not just disallow duplicate keys? #47
A more pragmatic reason is that programmers suck at checking for errors up-front; we tend to be lazy. So, let's assume that some programmers write a few IPLD libraries that accept and produce incorrect objects (e.g., duplicate keys in some edge cases). Now, say a detail-oriented programmer comes along and writes an application that actually checks these edge cases and rejects malformed objects. No one will use that new program because it will reject quite a few IPLD objects made by the sloppy programs, even if this new program is technically correct. At this point, nobody will be happy. If, on the other hand, we say "don't produce objects with these properties but you must be able to handle them as follows...", we force programmers to actually handle these edge cases instead of just saying "I'll deal with error handling later" (queue a low-priority TODO that gets deleted in the next refactor).
FYI, objects that don't contain links are still valid IPLD objects.
Sorry for the delay; here are my thoughts.
I'm not terribly convinced this is a good idea anymore, given all the things that have come out of it: HTML, JS, OpenGL, etc. Underspecified things never seem to end well.

Also, I'm not terribly convinced that we need it anymore; these days we can create official test suites for a protocol that help ensure that an implementation is valid. Having code we can run and at least say "this implementation is valid for all the cases the test suite covers" is far more useful than someone thinking "well, a client MIGHT conceivably do X, and it's not COMPLETE nonsense, so I'll try to handle it", only to get caught off guard when the client does Y instead. It's also useful for the programmer, since it lets them see concretely what they do right and wrong, instead of guessing "well, I THINK this is okay". Errors should be noisy!

And "humans are lazy" is not a valid reason for designing lousy systems. Strict systems are much easier to program for than loosey-goosey systems anyway! When writing a client, you know exactly what will happen with a strict system. With one that tries to be permissive, you're never sure whether you're doing things correctly or not. Good engineers seldom want to be doing the wrong thing.
Is it? It should be O(n) in time and memory with the size of the object, if you use a hash set or something to store the keys. More to the point, I would think that ANY object you receive from an unknown source MUST be parsed and verified at some point anyway, just to see whether it's valid. And if you want to look up a key, you currently have to make sure the object is in canonical form at least once anyway, so you can ensure you're looking up the right key.
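(Illustrative aside, not part of the original thread: a minimal sketch of the hash-set check described above, in Go. The `keys` slice is a stand-in for whatever raw key sequence a CBOR/JSON decoder would hand back; real codec plumbing is omitted.)

```go
package main

import "fmt"

// hasDuplicateKeys reports whether any key appears more than once.
// Using a set of seen keys makes this O(n) time and O(n) extra memory.
// The keys slice stands in for whatever raw key sequence a decoder
// would produce for a map; it is an assumed input for illustration.
func hasDuplicateKeys(keys []string) bool {
	seen := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		if _, dup := seen[k]; dup {
			return true
		}
		seen[k] = struct{}{}
	}
	return false
}

func main() {
	fmt.Println(hasDuplicateKeys([]string{"a", "b", "a"})) // true
	fmt.Println(hasDuplicateKeys([]string{"a", "b", "c"})) // false
}
```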
I agree. That's why this is fully specified. Or, more specifically, the IPLD spec says that all formats must specify a way to handle duplicate keys, but implementations that write objects in those formats shouldn't produce objects with duplicate keys. Note: we still need to fully specify our formats (some of them can conceivably forbid duplicate keys by spec; that's up to the individual format).
That's definitely not how implementations should be written. They should follow the spec as written and the spec MUST cover all cases.
Slightly off topic: if you're a theoretical computer scientist, yes. Otherwise, no; random memory access is not O(1) in practice.

On topic: that doesn't mean we can't do this. Personally, I'd just say that all keys must be in lexicographical order, which makes verification truly O(n) time with O(1) extra memory.
I assume that, by "canonical", you mean "well-formed" (we generally use "canonical" to refer to objects that have correctly sorted keys, etc.). If so, yes, you have to do a quick pass to validate this (although, honestly, our code isn't the best when it comes to validating things...). Honestly, I'm not the right person to argue with you about this... I agree with you in principle and hate the HTML/JS browser mess. I don't think this situation is quite as dire as you seem to believe but, in general, you've convinced me. Providing a bunch of "accept this" and "don't accept this" objects as test cases should help alleviate my immediate concern (and break all of our implementations, but oh well...). I'll bring this up and mention you when I update the spec (we're working on finalizing it now). I believe the objections I'm likely to get are:
Concrete reasons to do this:
Concrete reasons not to do this:
Touché. ;-)
Good point, though we're wandering into realms I don't have practical experience with. I guess the simple question is: is checking that keys are in order truly faster than checking that there are no duplicates?
So, this sounds like "yes". You've convinced me too! Though it does mean that if we require lexicographic ordering and no duplicate keys, checking for duplicates becomes trivial; we only ever have to remember the last key we saw while checking the current one. It sounds like you are drifting towards this but haven't said so concretely... Could that be a solution that gets us the best of both worlds, i.e., fast verification and zero ambiguity? JSON is a tricky question though, that is a good point. I don't have a solution for that.
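(Again purely illustrative, not from the thread: a sketch of that "remember only the last key" check, assuming plain byte-wise lexicographic ordering; the spec might well pick a different canonical ordering.)

```go
package main

import "fmt"

// keysStrictlyOrdered reports whether keys are in strictly increasing
// lexicographic (byte-wise) order. Strictness also rules out duplicates,
// so a single pass that only remembers the previous key gives both
// checks at once: O(n) time, O(1) extra memory.
func keysStrictlyOrdered(keys []string) bool {
	for i := 1; i < len(keys); i++ {
		if keys[i-1] >= keys[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(keysStrictlyOrdered([]string{"a", "b", "c"})) // true
	fmt.Println(keysStrictlyOrdered([]string{"a", "a", "b"})) // false: duplicate
	fmt.Println(keysStrictlyOrdered([]string{"b", "a"}))      // false: out of order
}
```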
It would be nice, yes, though again, you have to turn it into canonical form to get the correct IPFS block hash for it, if I understand correctly. So, is "any valid CBOR object is a valid IPLD object" actually useful? How often do IPLD objects get used that don't come from or get put into IPFS blocks? I actually don't know the answer to this question.
Currently, no. We suggest you canonicalize it but it's not mandatory (IIRC, that's why hammering out the spec is so important).
That's a broad question, but I assume you mean "how often do IPLD DagCBOR objects get used that weren't created by either go-ipfs or js-ipfs". I believe the answer to this is almost never.

For a more complete answer (that may repeat things you already know)... IPLD is a metaformat; it's a way of understanding hash-linked, structured data. For example, both git objects and ethereum blocks are valid IPLD objects (although they're not general-purpose formats and they can't express all valid IPLD data structures). This is where we get the practice of being compatible with existing data. However, this discussion is really about DagJSON and DagCBOR (we usually prefix these formats with "Dag").
Closing due to staleness as per team agreement to clean up the issue tracker a bit (ipld/team-mgmt#28). This doesn't mean this issue is off the table entirely; it's just not on the current active stack, but it may be revisited in the near future. If you feel there is something pertinent here, please speak up, reopen, or open a new issue.
Duplicate keys are invalid, so why not just disallow duplicate keys and force parsers/handlers to reject them as invalid instead of saying "duplicate keys magically vanish in a pre-defined and consistent way that everything must take care to implement correctly"?
Sure, you can synthesize data with duplicate keys in CBOR or JSON or whatever, but that just makes it a malformed object. You can synthesize data with no link in it at all; that doesn't mean it is a valid IPLD object.
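(An illustrative aside, not part of the original issue: many everyday decoders accept such data without complaint. For example, Go's encoding/json silently keeps the last value when a key is repeated, which is exactly the kind of divergent behaviour a spec would otherwise have to pin down.)

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Syntactically valid JSON with a duplicate key.
	raw := []byte(`{"a": 1, "a": 2}`)

	var m map[string]int
	if err := json.Unmarshal(raw, &m); err != nil {
		panic(err) // not reached: the duplicate key is not rejected
	}
	// The decoder keeps the last value, so the duplication is
	// invisible to the application.
	fmt.Println(m["a"]) // 2
}
```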