WIP feat(patterns): pattern-based compression take2 #1584

Draft
wants to merge 1 commit into
base: markm-prepare-for-extended-matchers

Conversation

erights

@erights erights commented May 10, 2023

Staged on #2248

closes: #2112
refs: #1564 Agoric/agoric-sdk#6432

Description

Adds two new exports to @endo/patterns

mustCompress(
  specimen: Passable, 
  pattern: Pattern, 
  label?: string|number
) => Passable

and its "inverse"

mustDecompress(
  compressed: Passable,
  pattern: Pattern,
  label?: string|number
) => Passable

(From Agoric/agoric-sdk#6432 (comment) ):

For example without compression, the Zoe proposal

    {
      want: {
        Winnings: {
          brand: moolaBrand,
          value: makeCopyBagFromElements([
            { foo: 'a' },
            { foo: 'b' },
            { foo: 'c' },
          ]),
        },
      },
      give: { Bid: { brand, value: 37n } },
      exit: { afterDeadline: { deadline: 11n, timer } },
    },

is stored with a smallcaps body of

'#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}'

But it compresses with the proposalShape

    harden({
      want: {
        Winnings: {
          brand: moolaBrand,
          value: M.bagOf(harden({ foo: M.string() }), 1n),
        },
      },
      give: { Bid: { brand, value: M.nat() } },
      exit: { afterDeadline: { deadline: M.gte(10n), timer } },
    })

to

[[['c'], ['b'], ['a']], 37n, 11n]

whose smallcaps body is

'#[[["c"],["b"],["a"]],"+37","+11"]'

which is 12% as long.
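The size figure can be rechecked directly from the two smallcaps bodies quoted above. This is only a character count of those two literal strings, not a measurement of any real store:

```js
// Smallcaps bodies copied verbatim from the example above.
const uncompressedBody =
  '#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}';
const compressedBody = '#[[["c"],["b"],["a"]],"+37","+11"]';

// Ratio of string lengths, rounded to a whole-number percentage.
const percent = Math.round(
  (100 * compressedBody.length) / uncompressedBody.length,
);
console.log(
  `${compressedBody.length} of ${uncompressedBody.length} characters: ${percent}%`,
);
```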


It would take much more work, but if we were able to use matching interface guards on the sending and receiving sides, we'd get similar savings for messages. Agoric/agoric-sdk#6355 may help get there. But note the difficulties explained in "Upgrade Considerations" below.

mustCompress is analogous to mustMatch, which as a reminder is

mustMatch(
  specimen: Passable,
  pattern: Pattern,
  label?: string|number
) => void

The following equivalences must hold

  • For all s,p,l1,l2 mustMatch(s,p,l1?) must succeed iff mustCompress(s,p,l2?) succeeds. When they succeed, the label does not matter.
  • Both signal failure by throwing an error whose diagnostic might use label to be more informative. Thus, one throws iff the other throws. The diagnostics are not necessarily the same.
  • mustMatch(s,p,l1?) and therefore mustCompress(s,p,l2?) succeeds iff matches(s,p) === true.
  • For all s,p,l,c,s2 mustCompress(s,p,l?) === c iff mustDecompress(c,p,l?) === s2, where s and s2 have the same distributed object semantics: compareRank(s, s2) === 0, isKey(s) === isKey(s2), and isKey(s) => keyEQ(s,s2).
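The equivalences above could be checked mechanically. The following is a minimal sketch, assuming the functions under test are passed in as arguments. `checkEquivalences`, `sameSemantics`, and `toyFns` are hypothetical names for illustration only. The toy stand-ins treat a "pattern" as a required string prefix and are not the real matchers.

```js
// Returns true iff calling thunk throws.
const throws = thunk => {
  try {
    thunk();
    return false;
  } catch (_err) {
    return true;
  }
};

// Hypothetical helper: the functions under test are injected, so this sketch
// assumes nothing about how @endo/patterns exports or implements them.
const checkEquivalences = (fns, specimen, pattern, sameSemantics) => {
  const { mustMatch, mustCompress, mustDecompress } = fns;
  // The labels are deliberately different; they must not affect the outcome.
  const matchThrows = throws(() => mustMatch(specimen, pattern, 'l1'));
  const compressThrows = throws(() => mustCompress(specimen, pattern, 'l2'));
  if (matchThrows !== compressThrows) {
    return false; // one throws iff the other throws
  }
  if (compressThrows) {
    return true; // both rejected the specimen, so there is nothing to round-trip
  }
  const compressed = mustCompress(specimen, pattern);
  const restored = mustDecompress(compressed, pattern);
  return sameSemantics(specimen, restored);
};

// Toy stand-ins (assumptions): the pattern is a string prefix that every
// matching specimen must start with, and compression strips that prefix.
const toyFns = {
  mustMatch: (s, p) => {
    if (!s.startsWith(p)) throw Error(`${s} does not match ${p}`);
  },
  mustCompress: (s, p) => {
    if (!s.startsWith(p)) throw Error(`${s} does not match ${p}`);
    return s.slice(p.length);
  },
  mustDecompress: (c, p) => p + c,
};

const sameString = (a, b) => a === b;
console.log(checkEquivalences(toyFns, 'brand:moola', 'brand:', sameString));
```

With the real functions, `sameSemantics` would presumably combine compareRank, isKey, and keyEQ as listed in the last bullet.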

The point is that typically c is smaller than s, though in some cases it may be larger. The space savings should typically be similar to the space savings from schema-based encodings like protobuf or Cap'n Proto. The pattern is analogous to the schema. Anything that must be in all specimens that match a given pattern can be omitted from the compressed form, since those parts can be recovered from the pattern on decompression. Unlike schema-based compression, this can include dynamic elements like brand identity, potentially resulting in greater savings and tighter error checking.
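To illustrate the idea of omitting whatever the pattern already determines, here is a toy model. It is not the algorithm in this PR: `ANY`, `toyCompress`, and `toyDecompress` are hypothetical names, brands and timers are represented by plain strings, and the output is a flat array rather than the nested arrays shown in the example above.

```js
// Hypothetical marker standing in for a matcher such as M.string() or M.nat().
const ANY = Symbol('ANY');

const isRecord = x =>
  typeof x === 'object' && x !== null && !Array.isArray(x);
const sameKeys = (a, b) =>
  a.length === b.length && a.every((k, i) => k === b[i]);

// Collect only the parts of the specimen that the pattern leaves open,
// visiting record properties in sorted key order.
const toyCompress = (specimen, pattern, out = []) => {
  if (pattern === ANY) {
    out.push(specimen);
  } else if (isRecord(pattern)) {
    if (!isRecord(specimen)) throw Error('specimen must be a record');
    const keys = Object.keys(pattern).sort();
    if (!sameKeys(keys, Object.keys(specimen).sort())) {
      throw Error('record keys differ');
    }
    for (const key of keys) {
      toyCompress(specimen[key], pattern[key], out);
    }
  } else if (specimen !== pattern) {
    throw Error(`literal mismatch: ${String(specimen)}`);
  }
  return out;
};

// Rebuild the specimen by walking the pattern in the same order and filling
// each ANY hole with the next compressed value.
const toyDecompress = (compressed, pattern) => {
  let next = 0;
  const rebuild = p => {
    if (p === ANY) {
      const value = compressed[next];
      next += 1;
      return value;
    }
    if (isRecord(p)) {
      return Object.fromEntries(
        Object.keys(p).sort().map(key => [key, rebuild(p[key])]),
      );
    }
    return p; // literal part, recovered from the pattern alone
  };
  return rebuild(pattern);
};

const pattern = {
  give: { Bid: { brand: 'simoleans', value: ANY } },
  exit: { afterDeadline: { deadline: ANY, timer: 'timer' } },
};
const specimen = {
  give: { Bid: { brand: 'simoleans', value: 37 } },
  exit: { afterDeadline: { deadline: 11, timer: 'timer' } },
};

const compressed = toyCompress(specimen, pattern);
const restored = toyDecompress(compressed, pattern);
console.log(compressed, restored);
```

Only the two values behind `ANY` survive compression. The brand, the timer, and all property names come back from the pattern.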

Unlike schema-based compression schemes like protobuf or Cap'n Proto, the layering here makes compression mostly independent of encoding/serialization, as shown by the above example: The compression is independent of whether the result will be encoded with smallcaps, and the smallcaps encoding is independent of whether its input was a compressed or uncompressed specimen. Or rather, mostly independent. We chose a nested-array compression because of its compact JSON representation, preserved by smallcaps.

Security Considerations

If sender and receiver can be led into compressing and decompressing with different patterns, or with different compression/decompression algorithms associated with that pattern's matchers, then compressed data might be decompressed into something arbitrarily different from what the sender meant to send. See "Upgrade Considerations" below.

Aside from that, none.

Scaling Considerations

The whole point. Compression could result in tremendously less data stored, sent, and received. Unfortunately, informal measurements so far of the time taken to compress are not encouraging. This needs to be measured carefully, and probably improved tremendously, before this PR is ready for production use. Ideally:

  • encode(mustCompress(data, pattern)) typically takes both less time and less space than
    mustMatch(data, pattern) && encode(data).
  • mustDecompress(decode(encodedCompressedData)) typically takes less time than
    decode(encodedUncompressedData).

This will depend, of course, on which encoding scheme is used.
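A careful measurement might start from a harness like the following sketch. Everything here is an assumption for illustration: `timeIt` and `compareStrategies` are hypothetical helpers, the functions to measure are injected, and the stand-ins at the bottom (a no-op mustMatch, a trivial mustCompress, JSON.stringify as the encoder) exist only to make the sketch runnable.

```js
// Run thunk repeatedly and report elapsed milliseconds plus the last result.
const timeIt = (thunk, rounds = 1000) => {
  const start = performance.now();
  let result;
  for (let i = 0; i < rounds; i += 1) {
    result = thunk();
  }
  return { ms: performance.now() - start, result };
};

// Compare the two strategies from the list above in both time and space.
const compareStrategies = (fns, data, pattern) => {
  const { mustMatch, mustCompress, encode } = fns;
  const plain = timeIt(() => {
    mustMatch(data, pattern);
    return encode(data);
  });
  const compressed = timeIt(() => encode(mustCompress(data, pattern)));
  return {
    plainMs: plain.ms,
    compressedMs: compressed.ms,
    plainSize: plain.result.length,
    compressedSize: compressed.result.length,
  };
};

// Stand-ins only (assumptions): they do not reflect real matching cost.
const standInFns = {
  mustMatch: (_data, _pattern) => {},
  mustCompress: (data, _pattern) => [data.value],
  encode: x => JSON.stringify(x),
};

const report = compareStrategies(
  standInFns,
  { brand: 'moola', value: 37 },
  undefined,
);
console.log(report);
```

Wall-clock numbers from such a loop are noisy, so a real benchmark would want warm-up rounds and repeated runs.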

Documentation Considerations

  • Most of this PR note is worth capturing in documentation in the PR itself

Testing Considerations

Already includes good manual tests.

  • Should additionally do fuzzing tests, probably using fast-check.

Compatibility Considerations

A big advantage of the smallcaps encoding of an uncompressed specimen is that the result is still mostly human readable and processable using JSON-oriented tooling like jq. The compressed form loses both of these benefits, which also calls into question whether there is any point in smallcaps encoding the compressed form rather than using an unreadable binary encoding like compactOrdered, syrup, or cbor.

compactOrdered is both rank equality preserving and rank order preserving. Holding the pattern constant, compactOrdered of the compressed form would still be rank equality preserving, but not rank order preserving. Thus, stores will probably continue to encode their keys using compactOrdered on the uncompressed form, forfeiting the opportunity to use keyShape for compression.

Upgrade Considerations

When the compressed form is communicated instead of the uncompressed form, the sender and receiver must agree precisely on the pattern. If a different pattern is used to decompress than was used to compress, the compressed data might silently decompress into data arbitrarily different from the original specimen. The best way to ensure agreement is to somehow send the pattern as well from the sender to the receiver. For small data, this may cost more space than it saves.
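The silent-mismatch hazard can be seen even in a toy positional scheme. The sketch below is not this PR's algorithm: `HOLE` and `toyDecompress` are hypothetical, and a pattern is just a flat record whose `HOLE` entries are filled from the compressed array in property order.

```js
// Hypothetical marker for the parts of a pattern that the data must supply.
const HOLE = Symbol('HOLE');

// Fill each HOLE, in property order, with the next compressed value.
const toyDecompress = (compressed, pattern) => {
  let next = 0;
  return Object.fromEntries(
    Object.entries(pattern).map(([key, value]) => {
      if (value !== HOLE) return [key, value];
      const filled = compressed[next];
      next += 1;
      return [key, filled];
    }),
  );
};

const senderPattern = { brand: 'moola', deadline: HOLE, value: HOLE };
// The receiver's pattern differs in both the literal brand and the hole order.
const receiverPattern = { brand: 'simoleans', value: HOLE, deadline: HOLE };

// Produced by the sender as [deadline, value] under senderPattern.
const compressed = [11, 37];

const intended = toyDecompress(compressed, senderPattern);
const actual = toyDecompress(compressed, receiverPattern);
console.log(intended, actual);
```

No error is thrown on the receiving side. The receiver simply ends up with a different brand and with value and deadline swapped.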

SwingSet already stores optional patterns with some large data stores, with an error check to ensure that the data matches the pattern: keyShape, valueShape, and stateShape. Agoric/agoric-sdk#6432 modifies SwingSet to also use the valueShape and stateShape for compression.

A pattern is a tree of copy-data to be matched literally (the key-like parts), and Matchers, typically expressed in code like M.bagOf(keyShape, countShape) in the example above. The overall compression/decompression algorithms are composed from compression/decompression algorithms for each matcher kind. Not only must the sender and receiver agree exactly on the pattern, they must agree exactly on the algorithms associated with each matcher in the pattern. But we'd also like to improve these over time. Thus, this PR includes in each matcher kind definition an optional version number for the compression algorithm it uses. If omitted, that matcher does not compress. Version numbers are assigned in increasing sequence starting with 1. The algorithm associated with a given version number must never change. If a given version of Endo supports version N of matcher M, then it should also support all versions prior to N, unless there is a compelling reason to retire an old one.

The M.something(...) matcher makers should generally produce a matcher with the latest locally supported version number. Thus, this system supports older senders sending to newer receivers. This works fine for intra-vat storage, as in Agoric/agoric-sdk#6432, since intra-vat storage communicates data only forward in time/versions. However, inter-vat communication must tolerate some version slippage in both directions, which will require designing some kind of pattern negotiation.
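The versioning rules described above can be sketched as a small registry. This is a hypothetical illustration, not the data structure used in this PR: `makeAlgorithmRegistry`, `register`, `latestVersion`, and `lookup` are invented names, and the sketch does not model retiring an old version.

```js
// Hypothetical registry of per-matcher-kind compression algorithms.
const makeAlgorithmRegistry = () => {
  const byKind = new Map(); // matcher kind -> Map(version -> algorithm)

  const register = (kind, version, algorithm) => {
    if (!byKind.has(kind)) byKind.set(kind, new Map());
    const versions = byKind.get(kind);
    // The algorithm associated with a given version must never change.
    if (versions.has(version)) {
      throw Error(`${kind} v${version} is already defined`);
    }
    // Versions are assigned in increasing sequence starting with 1.
    if (version !== versions.size + 1) {
      throw Error(`${kind} expected v${versions.size + 1}`);
    }
    versions.set(version, Object.freeze({ ...algorithm }));
  };

  // Matcher makers use the latest locally supported version.
  // undefined means this matcher kind does not compress.
  const latestVersion = kind =>
    byKind.has(kind) ? byKind.get(kind).size : undefined;

  // A receiver looks up whichever version the sender's matcher names.
  const lookup = (kind, version) => {
    const algorithm = byKind.get(kind)?.get(version);
    if (!algorithm) throw Error(`${kind} v${version} is not supported here`);
    return algorithm;
  };

  return { register, latestVersion, lookup };
};

// An older sender knows only v1; a newer receiver knows v1 and v2.
const identity = { compress: x => x, decompress: x => x };
const olderSender = makeAlgorithmRegistry();
olderSender.register('bagOf', 1, identity);
const newerReceiver = makeAlgorithmRegistry();
newerReceiver.register('bagOf', 1, identity);
newerReceiver.register('bagOf', 2, identity);

// Older sender to newer receiver works: the receiver still supports v1.
const sentVersion = olderSender.latestVersion('bagOf');
const receiverAlgorithm = newerReceiver.lookup('bagOf', sentVersion);
console.log(sentVersion, typeof receiverAlgorithm.decompress);
```

The reverse direction fails in this sketch: `olderSender.lookup('bagOf', 2)` throws, which is the version slippage that pattern negotiation would have to handle.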

  • [ ] Includes *BREAKING*: in the commit message with migration instructions for any breaking change.

This PR itself does not introduce any breaking changes. But PRs based on it will have more hazards of breaking changes as explained above.

  • Updates NEWS.md for user-facing changes.

Many of the points made in this PR note should be summarized in a NEWS.md entry.

@erights erights self-assigned this May 10, 2023
@erights erights changed the base branch from master to markm-tag-guards May 10, 2023 06:38
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 241b2d3 to f57ac4b Compare May 10, 2023 06:44
@erights erights force-pushed the markm-tag-guards branch from 358d9fa to 9c24c19 Compare May 20, 2023 21:43
@erights erights force-pushed the markm-pattern-based-compression-2 branch from f57ac4b to 533d62a Compare May 20, 2023 21:45
@erights erights force-pushed the markm-tag-guards branch from 9c24c19 to 91d36e7 Compare June 6, 2023 03:20
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 533d62a to 7ce2d16 Compare June 6, 2023 03:22
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 7ce2d16 to 1025466 Compare August 8, 2023 02:23
@erights erights changed the base branch from markm-tag-guards to markm-tag-guards-2 August 8, 2023 02:24
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 18db466 to accc77c Compare August 8, 2023 02:36
@erights erights force-pushed the markm-tag-guards-2 branch 3 times, most recently from b05871a to 2a13b3d Compare August 9, 2023 02:27
@erights erights force-pushed the markm-pattern-based-compression-2 branch from accc77c to 2e6810f Compare August 9, 2023 02:34
@erights erights changed the base branch from markm-tag-guards-2 to markm-type-guards August 9, 2023 02:35
@erights erights force-pushed the markm-type-guards branch from a0170df to 505f81f Compare August 15, 2023 22:53
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 2e6810f to 99b58d6 Compare August 15, 2023 23:02
@erights erights force-pushed the markm-type-guards branch from 505f81f to c2cd034 Compare August 21, 2023 22:48
Base automatically changed from markm-type-guards to master August 21, 2023 22:58
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 282fd46 to b77b6f7 Compare August 28, 2023 05:22
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from be5d3aa to 3a169ed Compare August 30, 2023 01:23
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 7125ac7 to 061c7e6 Compare September 16, 2023 02:45
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 5497b03 to ce825a7 Compare September 26, 2023 03:13
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 9f14fe9 to 7c42f56 Compare June 9, 2024 20:44
@erights erights force-pushed the markm-pattern-based-compression-2 branch from e1eb82d to 1c9dc8e Compare June 9, 2024 20:44
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 7c42f56 to 2d20d8e Compare June 13, 2024 13:31
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 1c9dc8e to bb79e79 Compare June 13, 2024 13:32
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 2d20d8e to f013614 Compare June 22, 2024 03:34
@erights erights force-pushed the markm-pattern-based-compression-2 branch from bb79e79 to c079763 Compare June 22, 2024 03:35
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from f013614 to 92befa7 Compare July 3, 2024 00:22
@erights erights force-pushed the markm-pattern-based-compression-2 branch from c079763 to 7af6f89 Compare July 3, 2024 00:23
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 92befa7 to b4b09cd Compare July 13, 2024 23:06
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 7af6f89 to c6d0e20 Compare July 13, 2024 23:08
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from b4b09cd to a71dd8f Compare July 22, 2024 01:11
@erights erights force-pushed the markm-pattern-based-compression-2 branch from c6d0e20 to 5566832 Compare July 22, 2024 01:11
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from a71dd8f to 552cdca Compare August 3, 2024 00:18
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 5566832 to da664f9 Compare August 3, 2024 00:18
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 552cdca to b6ab0e1 Compare August 13, 2024 17:38
@erights erights force-pushed the markm-pattern-based-compression-2 branch from da664f9 to ce1dac5 Compare August 13, 2024 17:39
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from b6ab0e1 to bd279f6 Compare August 14, 2024 20:53
@erights erights force-pushed the markm-pattern-based-compression-2 branch from ce1dac5 to a222c71 Compare August 14, 2024 20:54
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from bd279f6 to 711ef1c Compare September 2, 2024 21:16
@erights erights force-pushed the markm-pattern-based-compression-2 branch from a222c71 to 0c316b9 Compare September 2, 2024 21:16
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 711ef1c to 1e4653e Compare September 7, 2024 20:25
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 0c316b9 to 8e72f8c Compare September 7, 2024 20:26
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 1e4653e to c164404 Compare October 14, 2024 19:16
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 8e72f8c to ce96699 Compare October 14, 2024 19:18
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from c164404 to 21e35d9 Compare October 28, 2024 23:35
@erights erights force-pushed the markm-pattern-based-compression-2 branch from ce96699 to a4dd6c9 Compare October 28, 2024 23:37
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 21e35d9 to 9be1bfe Compare November 17, 2024 00:49
@erights erights force-pushed the markm-pattern-based-compression-2 branch from a4dd6c9 to c933684 Compare November 17, 2024 00:49
Successfully merging this pull request may close these issues.

Need schema-like compression to avoid storing and transmitting redundant data.