Skip to content

Commit

Permalink
Add exploration report about merkle segments
Browse files Browse the repository at this point in the history
This report was originally issue ipld#67.

Closes ipld#67.
  • Loading branch information
vmx committed Aug 20, 2019
1 parent 1fab4cc commit c49769a
Showing 1 changed file with 66 additions and 0 deletions.
66 changes: 66 additions & 0 deletions design/history/exploration-reports/2018.07-merkle-segments.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
This document was archived from https://github.com/ipld/specs/issues/67. There was agreement that UTF-8 NFC should be the canonical encoding. Below is the original text of the issue written by @warpfork.

---

## Mission

It's important to specify precisely what is a valid merklepath segement in IPLD.

The spec currently contains a "TODO: list path resolving restrictions" and this could be improved :)

---

## Why

First, a quick clarification: "merklepath segments" are a distinct concept from "IPLD Selectors". Merklepaths are a specific and limited implementation of IPLD Selectors; they can only specify a traversal to a single object; and importantly, we want them to be serializable in a way that's easy for humans to operate. To quote the current spec for motivations:

> IPLD paths MUST layer cleanly over UNIX and The Web (use `/`, have deterministic transforms for ASCII systems).
(Perhaps "ASCII" is a little over-constrained there. The spec *also* says "
IPLD paths MUST be universal and avoid oppressing non-english societies (e.g. use UTF-8, not ASCII)" -- we might want to refine those two lines after we tackle the rest of this issue.)

Second of all, just a *list* of other issues that are fairly closely related to a need for clarity on this subject:

- https://github.com/ipfs/go-ipfs/issues/1710 -- "**IPFS permits undesirable paths**" -- lots of good discussion on encodings, and examples of valid but problematic characters
- https://github.com/ipld/specs/issues/59 -- "**Document restrictions on keys of maps**"
- https://github.com/ipld/specs/issues/58 -- "**Non-string keys**" -- this one has a particularly interesting detail quoted: "The original intention was to actually be quite restrictive: map keys must be unicode strings with no slashes. However, we've loosened that so that they can contain slashes, it's just that those keys can't be pathed through.". (n.b. *this* issue here is named "merklepath segments", not "IPLD keys", specifically to note this same distinction.)
- https://github.com/ipld/specs/issues/55 -- "**Spec out DagPB path resolution**"
- https://github.com/ipld/specs/issues/37 -- "**Spec refining: make sure that an attribute cannot be named `.`**"
- https://github.com/ipld/specs/issues/20 -- "**Spec refining: Merkle-paths to an array's index**"
- https://github.com/ipld/specs/issues/15 -- "**Spec refining: Terminology IPLD vs Merkle**" -- basically, am I titling this issue correctly by saying "merklepath"? Perhaps not ;)
- https://github.com/ipld/specs/issues/1 -- "**Spec refining: Relative paths in IPLD**" -- may require reserving more character sequences as special
- https://github.com/ipfs/unixfs-v2/issues/3 -- "**Handing of non-utf-8 posix filenames**"
- https://github.com/ipfs/go-ipfs/issues/4292 -- "**WebUI should (somehow) indicate a problematic directory entry**"
- perhaps more!

As this list makes evident... we really need to get this nailed down.

---

## Mission, refined

Okay, motivations and intro done. What do we need to *do*?

(1) Update the spec to be consistently clear about IPLD *keys* versus *valid merklepath segments*. This distinction seems to exist already, but it's tricky, so we should hammer it.

(2) Define normal character encoding. (I think it's now well established that this is necessary -- merklepath segments are *absolutely* for direct human use, so we're certainly speaking of chars rather than bytes; and also unicode is complex and ignoring normalization is not viable.)

(3) Define any blacklisting of any further byte sequences which are valid normalized characters but we nonetheless don't want to see in merklepath segments.

(4) Ensure we're clear about what happens when an IPLD key is valid but as a key but not a merklepath segment (e.g. it's unpathable).

(And one more quick note: a lot of this has been in discussion already as part of sussing out the unixfsv2 spec. In unixfsv2, we've come to the conclusion that some of our path handling rules are quantum-entangled with the IPLD spec for merklepaths. Unixfsv2 may apply *more* blacklistings of byte sequences which are problematic than IPLD merklepath segements, so we don't have to worry about *everything* up here; but we do want to spec this *first*, so we can make sure the Unixfsv2 behavior normalizers are a nice subset of the IPLD merklepath rules.)

---

## Progress

**Regarding (1)**: "just a small matter of writing" once we nail the rest...

**Regarding (2)**: **We have an answer [and the answer is "NFC"](https://www.unicode.org/reports/tr15/#Norm_Forms)**. (At least, I think we have an answer with reasonable consensus.) We had a long thread about this [in the context of unixfsv2](https://github.com/ipfs/unixfs-v2/issues/3), but entirely applicable here in general. Everyone seems to agree that UTF8 is a sensible place to be and [NFC encoding is a sensible, already-well-specified normalization](https://github.com/ipfs/unixfs-v2/issues/3#issuecomment-404760564) to use. And importantly, in practice, NFC is the encoding seen in practically all documents everywhere, so choosing NFC means we accept the majority of strings unchanged. Whew. *dusts off hands*

**Regarding (3)**: Lots of example choices in https://github.com/ipfs/go-ipfs/issues/1710 . We need to reify that into a list in the spec.

**Regarding (4)**: Open field?

(I'll update this "progress" section as discussion... progresses.)

0 comments on commit c49769a

Please sign in to comment.