Why not just disallow duplicate keys? #47
A more pragmatic reason is that programmers suck at checking for errors up-front; we tend to be lazy. So, let's assume that some programmers write a few IPLD libraries that accept and produce incorrect objects (e.g., duplicate keys in some edge cases). Now, say a detail-oriented programmer comes along and writes an application that actually checks these edge cases and rejects malformed objects. No one will use that new program because it will reject quite a few IPLD objects made by the sloppy programs, even if this new program is technically correct. At this point, nobody will be happy. If, on the other hand, we say "don't produce objects with these properties but you must be able to handle them as follows...", we force programmers to actually handle these edge cases instead of just saying "I'll deal with error handling later" (queue a low-priority TODO that gets deleted in the next refactor).
FYI, objects that don't contain links are still valid IPLD objects.
Sorry for the delay; here are my thoughts.
I'm not terribly convinced this is a good idea anymore, given all the things that have come out of it: HTML, JS, OpenGL, etc. Underspecified things never seem to end well.

Also, I'm not terribly convinced that we need it anymore; these days we can create official test suites for a protocol that help ensure that an implementation is valid. Having code we can run and at least say "this implementation is valid for all the cases the test suite covers" is far more useful than someone thinking "well, a client MIGHT conceivably do X, and it's not COMPLETE nonsense, so I'll try to handle it", only to get caught off guard when the client does Y instead. It's also useful for the programmer, since it lets them see concretely what they do right and wrong, instead of guessing "well, I THINK this is okay". Errors should be noisy!

And "humans are lazy" is not a valid reason for designing lousy systems. Strict systems are much easier to program for than loosey-goosey systems anyway! When writing a client, you know exactly what will happen with a strict system. With one that tries to be permissive, you're never sure whether you're doing things correctly or not. Good engineers seldom want to be doing the wrong thing.
Is it? It should be O(n) in time and memory with the size of the object, if you use a hash set or something to store the keys. More to the point, I would think that ANY object you receive from an unknown source MUST be parsed and verified at some point anyway, just to see whether it's valid. And if you want to look up a key, you currently have to make sure the object is in canonical form at least once anyway, so you can ensure you're looking up the right key.
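(Illustrative aside, not part of the original thread: a minimal sketch of the hash-set check described above, in Go. The `keys` slice is a stand-in for whatever raw key sequence a CBOR/JSON decoder would hand back; real codec plumbing is omitted.)

```go
package main

import "fmt"

// hasDuplicateKeys reports whether any key appears more than once.
// Using a set of seen keys makes this O(n) time and O(n) extra memory.
// The keys slice stands in for whatever raw key sequence a decoder
// would produce for a map; it is an assumed input for illustration.
func hasDuplicateKeys(keys []string) bool {
	seen := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		if _, dup := seen[k]; dup {
			return true
		}
		seen[k] = struct{}{}
	}
	return false
}

func main() {
	fmt.Println(hasDuplicateKeys([]string{"a", "b", "a"})) // true
	fmt.Println(hasDuplicateKeys([]string{"a", "b", "c"})) // false
}
```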
I agree. That's why this is fully specified. Or, more specifically, the IPLD spec says that all formats must specify a way to handle duplicate keys, but implementations that write objects in those formats shouldn't produce objects with duplicate keys. Note: we still need to fully specify our formats (some of them can conceivably forbid duplicate keys by spec; that's up to the individual format).
That's definitely not how implementations should be written. They should follow the spec as written and the spec MUST cover all cases.
Slightly off topic: if you're a theoretical computer scientist, yes. Otherwise, no; random memory access is not O(1) in practice.

On topic: that doesn't mean we can't do this. Personally, I'd just say that all keys must be in lexicographical order, which makes verification truly O(n) time with O(1) extra memory.
I assume that, by "canonical", you mean "well-formed" (we generally use "canonical" to refer to objects that have correctly sorted keys, etc.). If so, yes, you have to do a quick pass to validate this (although, honestly, our code isn't the best when it comes to validating things...). Honestly, I'm not the right person to argue with you about this... I agree with you in principle and hate the HTML/JS browser mess. I don't think this situation is quite as dire as you seem to believe but, in general, you've convinced me. Providing a bunch of "accept this" and "don't accept this" objects as test cases should help alleviate my immediate concern (and break all of our implementations, but oh well...). I'll bring this up and mention you when I update the spec (we're working on finalizing it now). I believe the objections I'm likely to get are:
Concrete reasons to do this:
Concrete reasons not to do this:
Touché. ;-)
Good point, though we're wandering into realms I don't have practical experience with. I guess the simple question is: is checking that keys are in order truly faster than checking that there are no duplicates?
So, this sounds like "yes". You've convinced me too! Though it does mean that if we require lexicographic ordering and no duplicate keys, checking for duplicates becomes trivial; we only ever have to remember the last key we saw while checking the current one. It sounds like you are drifting towards this but haven't said so concretely... Could that be a solution that gets us the best of both worlds, i.e., fast verification and zero ambiguity? JSON is a tricky question though, that is a good point. I don't have a solution for that.
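(Again purely illustrative, not from the thread: a sketch of that "remember only the last key" check, assuming plain byte-wise lexicographic ordering; the spec might well pick a different canonical ordering.)

```go
package main

import "fmt"

// keysStrictlyOrdered reports whether keys are in strictly increasing
// lexicographic (byte-wise) order. Strictness also rules out duplicates,
// so a single pass that only remembers the previous key gives both
// checks at once: O(n) time, O(1) extra memory.
func keysStrictlyOrdered(keys []string) bool {
	for i := 1; i < len(keys); i++ {
		if keys[i-1] >= keys[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(keysStrictlyOrdered([]string{"a", "b", "c"})) // true
	fmt.Println(keysStrictlyOrdered([]string{"a", "a", "b"})) // false: duplicate
	fmt.Println(keysStrictlyOrdered([]string{"b", "a"}))      // false: out of order
}
```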
It would be nice, yes, though again, you have to turn it into canonical form to get the correct IPFS block hash for it, if I understand correctly. So, is "any valid CBOR object is a valid IPLD object" actually useful? How often do IPLD objects get used that don't come from or get put into IPFS blocks? I actually don't know the answer to this question.
Currently, no. We suggest you canonicalize it but it's not mandatory (IIRC, that's why hammering out the spec is so important).
That's a broad question, but I assume you mean "how often do IPLD DagCBOR objects get used that weren't created by either go-ipfs or js-ipfs". I believe the answer to this is almost never.

For a more complete answer (that may repeat things you already know)... IPLD is a metaformat; it's a way of understanding hash-linked, structured data. For example, both git objects and ethereum blocks are valid IPLD objects (although they're not general-purpose formats and they can't express all valid IPLD data structures). This is where we get the practice of being compatible with existing data. However, this discussion is really about DagJSON and DagCBOR (we usually prefix these formats with "Dag").
Closing due to staleness as per team agreement to clean up the issue tracker a bit (ipld/team-mgmt#28). This doesn't mean this issue is off the table entirely; it's just not on the current active stack, but it may be revisited in the near future. If you feel there is something pertinent here, please speak up, reopen, or open a new issue.
Duplicate keys are invalid, so why not just disallow duplicate keys and force parsers/handlers to reject them as invalid instead of saying "duplicate keys magically vanish in a pre-defined and consistent way that everything must take care to implement correctly"?
Sure, you can synthesize data with duplicate keys in CBOR or JSON or whatever, but that just makes it a malformed object. You can synthesize data with no link in it at all; that doesn't mean it is a valid IPLD object.
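(An illustrative aside, not part of the original issue: many everyday decoders accept such data without complaint. For example, Go's encoding/json silently keeps the last value when a key is repeated, which is exactly the kind of divergent behaviour a spec would otherwise have to pin down.)

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Syntactically valid JSON with a duplicate key.
	raw := []byte(`{"a": 1, "a": 2}`)

	var m map[string]int
	if err := json.Unmarshal(raw, &m); err != nil {
		panic(err) // not reached: the duplicate key is not rejected
	}
	// The decoder keeps the last value, so the duplication is
	// invisible to the application.
	fmt.Println(m["a"]) // 2
}
```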