rewrite representation to a sorted binary list and embed it #33

Merged
Jorropo merged 2 commits into master from compact-binary-representation on Dec 19, 2023

@Jorropo Jorropo commented Dec 19, 2023

Required for libp2p/go-libp2p#2664
Closes #32
Closes #31
Closes #8
Closes #5
Closes #4

This is now a binary format embedded as a string, which removes the need to run code at init or on lazy initialization. It also fixes a bug: the old code assumed the dataset only reports CIDR-aligned ranges, but it does not.


Embedded size: 1.6MB
In-memory size: 0MB
Load into memory: 0ms
Lookup Performance: ~50ns (2.25x improvement).
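
To make the CIDR alignment point concrete, here is a minimal sketch (not code from this PR) of how one might test whether an inclusive range of 64-bit network prefixes is exactly one CIDR block. `isCIDRAligned` is a hypothetical helper introduced only for illustration.

```go
package main

// isCIDRAligned reports whether the inclusive range [start, end] is exactly
// one CIDR block: its size must be a power of two and start must be a
// multiple of that size. Hypothetical helper, not part of the package.
func isCIDRAligned(start, end uint64) bool {
	if end < start {
		return false
	}
	// span is size-1, so the full 2^64 range does not overflow.
	span := end - start
	// A power-of-two size means span is a run of low one bits (0b0..01..1),
	// in which case span and span+1 share no bits.
	if span&(span+1) != 0 {
		return false
	}
	// Alignment: start must be zero wherever span has ones.
	return start&span == 0
}
```

Assuming the records in the hex dump posted later in this thread are little-endian (start, end, ASN) triples, the range `0x2001020000000000..0x2001020005ffffff` fails the power-of-two test, and `0x2001020007000000..0x2001020008ffffff` has a power-of-two size but a misaligned start. Neither can be stored as a single prefix, which is why the new format keeps explicit start and end values.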

@Jorropo Jorropo force-pushed the compact-binary-representation branch from aef8000 to 4324ff2 Compare December 19, 2023 01:35
@Jorropo Jorropo force-pushed the compact-binary-representation branch 2 times, most recently from 6dace2c to 6e80636 Compare December 19, 2023 01:45

Jorropo commented Dec 19, 2023

This results in a 1.56MiB binary file for the dataset.
It is embedded as a string in the rodata section of the executable. This removes the runtime memory cost (the section is usually mmapped directly from disk) and all the init-time execution. It also removes some dependencies. In practice, here is the difference (tested with go test -c):

> ls -l old new
-rwxr-xr-x 1 hugo hugo 10398323 Dec 19 02:40 old*
-rwxr-xr-x 1 hugo hugo  6954814 Dec 19 02:39 new*

I think the checks keep failing because the source is serving versions of the dataset randomly. Trying with a locally curled file shows no change.

@Jorropo Jorropo force-pushed the compact-binary-representation branch from 6e80636 to 323ec8c Compare December 19, 2023 01:46
@Jorropo Jorropo changed the title rewrite representation to a sorted list and embed it rewrite representation to a sorted binary list and embed it Dec 19, 2023
@Jorropo Jorropo force-pushed the compact-binary-representation branch from 323ec8c to 5e1ecaf Compare December 19, 2023 01:48

Jorropo commented Dec 19, 2023

The binary file isn't very optimized. It still compresses well (~4.5x using zstd -19e) because the representation has many duplicated bits in neighboring regions, since it can't do delta encoding:

00000000: 0000 1201 0400 0120 ffff 1201 0400 0120  ....... ....... 
00000010: 7000 0000 0000 0000 0002 0120 ffff ff05  p.......... ....
00000020: 0002 0120 c409 0000 0000 0006 0002 0120  ... ........... 
00000030: ffff ff06 0002 0120 f31d 0000 0000 0007  ....... ........
00000040: 0002 0120 ffff ff08 0002 0120 c409 0000  ... ....... ....
00000050: 0000 0009 0002 0120 ffff ff09 0002 0120  ....... ....... 
00000060: ec1d 0000 0000 000a 0002 0120 ffff ff0d  ........... ....
00000070: 0002 0120 c409 0000 0000 000e 0002 0120  ... ........... 
00000080: ffff ff0e 0002 0120 5212 0000 0000 000f  ....... R.......
00000090: 0002 0120 ffff ffbf 0002 0120 c409 0000  ... ....... ....
000000a0: 0000 00c0 0002 0120 ffff ffdf 0002 0120  ....... ....... 
000000b0: 525c 0000 0000 00e0 0002 0120 ffff ffff  R\......... ....
000000c0: 0002 0120 ec1d 0000 0000 0000 1802 0120  ... ........... 
000000d0: 0000 0020 1802 0120 620b 0000 0100 0020  ... ... b...... 
000000e0: 1802 0120 0100 0020 1802 0120 6f27 0200  ... ... ... o'..
000000f0: 0200 0020 1802 0120 ffff ff21 1802 0120  ... ... ...!... 

We could solve this with some kind of tree which could dedup leading parts of the network, but this is also significantly more complicated.
Deduplication can't be done in the current format since it is a flat list which needs to be O(1) indexable for the binary search.
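
As an illustration of why the flat list is O(1) indexable, here is a minimal sketch of decoding one record by index. The layout is inferred from the hex dump above and is an assumption, not something stated in this thread: 20-byte records made of a little-endian uint64 start, a little-endian uint64 end and a little-endian uint32 ASN. The `sample` constant and this `readEntry` signature are hypothetical; in the real package the string would presumably come from a `//go:embed` directive on a `string` variable rather than a literal.

```go
package main

import "encoding/binary"

// entrySize is inferred from the hex dump: 8 bytes start, 8 bytes end,
// 4 bytes ASN. This is an assumption about the format.
const entrySize = 8 + 8 + 4

// sample holds the first two records of the dump, written out by hand.
const sample = "\x00\x00\x12\x01\x04\x00\x01\x20\xff\xff\x12\x01\x04\x00\x01\x20\x70\x00\x00\x00" +
	"\x00\x00\x00\x00\x00\x02\x01\x20\xff\xff\xff\x05\x00\x02\x01\x20\xc4\x09\x00\x00"

// readEntry decodes record i of data. Because every record has the same
// size, its offset is a single multiplication, which is what lets a binary
// search jump straight to the middle of the list.
func readEntry(data string, i uint) (start, end uint64, asn uint32) {
	rec := []byte(data[i*entrySize : (i+1)*entrySize])
	start = binary.LittleEndian.Uint64(rec[0:8])
	end = binary.LittleEndian.Uint64(rec[8:16])
	asn = binary.LittleEndian.Uint32(rec[16:20])
	return start, end, asn
}
```

Under that assumed layout, record 0 decodes to the range `0x2001000401120000..0x200100040112ffff` with ASN 112, and record 1 to `0x2001020000000000..0x2001020005ffffff` with ASN 2500. A delta-encoded or tree-shaped format would lose this direct offset computation.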

I think it's fine to assume distributors (dist.ipfs.tech, docker, ...) will want to compress the executable. These savings then show up at the whole-executable compression stage.

@Jorropo Jorropo force-pushed the compact-binary-representation branch from 5e1ecaf to ba0de38 Compare December 19, 2023 01:53

Jorropo commented Dec 19, 2023

@prestonvanloon I would love for you to review this PR 🙏.


Jorropo commented Dec 19, 2023

We could also maybe optimize this further by truncating networks to 48 bits, since /48 is the smallest routable prefix in the IPv6 table. Or go even lower by also truncating leading bits.
We could also maybe truncate the AS numbers in the main table and use a side array to map our indexes to real ASes.

These are all possible places to win a bit of space if you (future reader) feel this is needed.
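
For a future reader who wants to try the side-array idea, here is a minimal sketch under stated assumptions. `buildASNTable` is a hypothetical helper, and the choice of 16-bit indexes is only an example: whether the real dataset has few enough distinct ASNs to fit is not established in this thread, so the function reports failure instead of assuming it.

```go
package main

import "sort"

// buildASNTable returns the sorted list of distinct ASNs (the side array)
// and, for each input ASN, its index into that list. ok is false when there
// are more distinct ASNs than a uint16 index can address.
// Hypothetical helper, not part of the package.
func buildASNTable(asns []uint32) (table []uint32, indexes []uint16, ok bool) {
	seen := make(map[uint32]struct{}, len(asns))
	for _, a := range asns {
		if _, dup := seen[a]; !dup {
			seen[a] = struct{}{}
			table = append(table, a)
		}
	}
	if len(table) > 1<<16 {
		return nil, nil, false
	}
	sort.Slice(table, func(i, j int) bool { return table[i] < table[j] })

	indexes = make([]uint16, len(asns))
	for i, a := range asns {
		// Binary search the sorted side array for this ASN's position.
		k := sort.Search(len(table), func(k int) bool { return table[k] >= a })
		indexes[i] = uint16(k)
	}
	return table, indexes, true
}
```

Each main-table record would then store the 2-byte index instead of the 4-byte ASN, and a lookup would finish with `table[index]`. The saving is 2 bytes per record minus the size of the side array.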


Jorropo commented Dec 19, 2023

This is also significantly faster, new:

goos: linux
goarch: amd64
pkg: github.com/libp2p/go-libp2p-asn-util
cpu: AMD Ryzen 5 3600 6-Core Processor
BenchmarkAsnForIPv6-12          21599720                56.85 ns/op

old:

BenchmarkAsnForIPv6-12           9417902               126.3 ns/op

@Jorropo Jorropo force-pushed the compact-binary-representation branch 3 times, most recently from 6cba0a8 to 5c0269e Compare December 19, 2023 02:47

@willscott willscott left a comment

  • agree this is probably the right level of compression / complexity in generation to aim for here now.
  • a test to validate some sanity-checked network lookups on the generated map would be good.

For reference, when playing with an IPv4 mapping of similar complexity, some of the tree pruning was able to get things down another ~3x: this level alone had the v4 table at ~6MB, and aggressive tree work gets it down to 2MB before gzip.

generate/main.go (review thread resolved)
@Jorropo Jorropo force-pushed the compact-binary-representation branch from 5c0269e to e458e30 Compare December 19, 2023 10:57

Jorropo commented Dec 19, 2023

> a test to validate some sanity-checked network lookups on the generated map would be good.

@willscott there are already tests: https://github.com/libp2p/go-libp2p-asn-util/pull/33/files#diff-94c5a8df386d715f15a01b4c61e58fe44944002da5513c966a602c93aad118c2R11-R32

Is this what you had in mind?


Jorropo commented Dec 19, 2023

> For reference, when playing with an IPv4 mapping of similar complexity, some of the tree pruning was able to get things down another ~3x: this level alone had the v4 table at ~6MB, and aggressive tree work gets it down to 2MB before gzip.

I have another branch which gets it down to the ~500KiB range before zstd, but it is significantly more complicated.
It is also very buggy.
It abuses varuints and builds a really tightly packed tree using the same string embedding strategy.
However, fun fact: this more compact representation is bigger when compressed than this naive one; it has extremely high entropy and uses sub-byte indexing.
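
To illustrate the general trade-off (this is not the code from that other branch, which is not shown in this thread), here is a minimal sketch of delta encoding a sorted list with unsigned varints from Go's `encoding/binary`. `encodeDeltas` and `decodeDeltas` are hypothetical names.

```go
package main

import "encoding/binary"

// encodeDeltas writes a sorted list as uvarint-encoded gaps between
// consecutive values. The first value is encoded relative to zero.
func encodeDeltas(sorted []uint64) []byte {
	out := make([]byte, 0, len(sorted)*2)
	var tmp [binary.MaxVarintLen64]byte
	var prev uint64
	for _, v := range sorted {
		n := binary.PutUvarint(tmp[:], v-prev)
		out = append(out, tmp[:n]...)
		prev = v
	}
	return out
}

// decodeDeltas reverses encodeDeltas. Reaching element i requires decoding
// every element before it, so the O(1) indexing a binary search relies on
// is gone unless an extra index or tree is layered on top.
func decodeDeltas(buf []byte) []uint64 {
	var out []uint64
	var prev uint64
	for len(buf) > 0 {
		d, n := binary.Uvarint(buf)
		if n <= 0 {
			return out // malformed or truncated input: stop decoding
		}
		prev += d
		out = append(out, prev)
		buf = buf[n:]
	}
	return out
}
```

Neighboring starts in the dataset differ only in their low bits, so the gaps are small and encode in a few bytes each instead of eight. The price is variable-width records, and the resulting bytes look close to random, which fits the observation that a general-purpose compressor gains little on top.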


Jorropo commented Dec 19, 2023

Here are the compression ratios with the various levels of zstd:

Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0     517 KiB      1.57 MiB  3.108  XXH64  sorted-network-list.bin.1.zst
     1      0     502 KiB      1.57 MiB  3.196  XXH64  sorted-network-list.bin.2.zst
     1      0     483 KiB      1.57 MiB  3.321  XXH64  sorted-network-list.bin.3.zst
     1      0     487 KiB      1.57 MiB  3.298  XXH64  sorted-network-list.bin.4.zst
     1      0     465 KiB      1.57 MiB  3.451  XXH64  sorted-network-list.bin.5.zst
     1      0     451 KiB      1.57 MiB  3.558  XXH64  sorted-network-list.bin.6.zst
     1      0     452 KiB      1.57 MiB  3.552  XXH64  sorted-network-list.bin.7.zst
     1      0     456 KiB      1.57 MiB  3.522  XXH64  sorted-network-list.bin.8.zst
     1      0     456 KiB      1.57 MiB  3.522  XXH64  sorted-network-list.bin.9.zst
     1      0     457 KiB      1.57 MiB  3.516  XXH64  sorted-network-list.bin.10.zst
     1      0     458 KiB      1.57 MiB  3.505  XXH64  sorted-network-list.bin.11.zst
     1      0     458 KiB      1.57 MiB  3.505  XXH64  sorted-network-list.bin.12.zst
     1      0     455 KiB      1.57 MiB  3.526  XXH64  sorted-network-list.bin.13.zst
     1      0     455 KiB      1.57 MiB  3.526  XXH64  sorted-network-list.bin.14.zst
     1      0     455 KiB      1.57 MiB  3.525  XXH64  sorted-network-list.bin.15.zst
     1      0     384 KiB      1.57 MiB  4.183  XXH64  sorted-network-list.bin.16.zst
     1      0     372 KiB      1.57 MiB  4.314  XXH64  sorted-network-list.bin.17.zst
     1      0     359 KiB      1.57 MiB  4.468  XXH64  sorted-network-list.bin.18.zst
     1      0     360 KiB      1.57 MiB  4.461  XXH64  sorted-network-list.bin.19.zst
     1      0     360 KiB      1.57 MiB  4.461  XXH64  sorted-network-list.bin.20.zst
     1      0     360 KiB      1.57 MiB  4.461  XXH64  sorted-network-list.bin.21.zst
     1      0     360 KiB      1.57 MiB  4.461  XXH64  sorted-network-list.bin.22.zst

I think this is fine and we don't need more aggressive tree packing.

@Jorropo Jorropo force-pushed the compact-binary-representation branch from e458e30 to f65e57d Compare December 19, 2023 12:35
@Jorropo Jorropo force-pushed the compact-binary-representation branch from f65e57d to c11e9d5 Compare December 19, 2023 13:25

Suggested version: v0.4.0

Comparing to: v0.3.0 (diff)

Changes in go.mod file(s):

diff --git a/go.mod b/go.mod
index eb3df07..7ac3e35 100644
--- a/go.mod
+++ b/go.mod
@@ -17,7 +17,7 @@ require (
 	github.com/multiformats/go-varint v0.0.5 // indirect
 	github.com/pmezard/go-difflib v1.0.0 // indirect
 	github.com/spaolacci/murmur3 v1.1.0 // indirect
-	golang.org/x/crypto v0.0.0-20190611184440-5c40567a22f8 // indirect
-	golang.org/x/sys v0.0.0-20190412213103-97732733099d // indirect
+	golang.org/x/crypto v0.1.0 // indirect
+	golang.org/x/sys v0.1.0 // indirect
 	gopkg.in/yaml.v2 v2.2.2 // indirect
 )

gorelease says:

# summary
Suggested version: v0.4.0

gocompat says:

Your branch is up to date with 'origin/master'.

Cutting a Release (and modifying non-markdown files)

This PR is modifying both version.json and non-markdown files.
The Release Checker is not able to analyse files that are not checked in to master. This might cause the above analysis to be inaccurate.
Please consider performing all the code changes in a separate PR before cutting the release.

Automatically created GitHub Release

A draft GitHub Release has been created.
It is going to be published when this PR is merged.
You can modify its body to include any release notes you wish to include with the release.

@Jorropo Jorropo merged commit 661da59 into master Dec 19, 2023
14 of 16 checks passed
@Jorropo Jorropo deleted the compact-binary-representation branch December 19, 2023 13:28
Jorropo added a commit that referenced this pull request Dec 19, 2023
I broke the API even though I didn't mean to. I thought I was safe because the gorelease bot didn't report anything, but it seems to be wrong:
#33 (comment)

> gocompat says:
> ```
> Your branch is up to date with 'origin/master'.
> ```
Comment on lines +47 to +69
// AsnForIPv6Network binary-searches the sorted dataset for the range that
// contains network and returns its ASN, or 0 if no range matches.
func AsnForIPv6Network(network uint64) (asn uint32) {
	n := uint(len(dataset)) / entrySize
	var i, j uint = 0, n
	for i < j {
		h := (i + j) / 2 // won't overflow since the list can't be that large
		start, end, asn := readEntry(h)
		if start <= network {
			if network <= end {
				return asn
			}
			i = h + 1
		} else {
			j = h
		}
	}

	if i >= n {
		return 0
	}
	start, end, asn := readEntry(i)
	if start <= network && network <= end {
		return asn
	}
	return 0
}
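
For readers who want to try the lookup end to end, here is a self-contained sketch. It swaps the hand-written loop above for `sort.Search` from the standard library, and it assumes the record layout inferred from the hex dump earlier in this thread (20-byte records of little-endian start, end and ASN), that the list is sorted with non-overlapping ranges, and that the `network` key is the upper 64 bits of the IPv6 address. `sample`, `lookup` and `lookupAddr` are hypothetical names, not the package's API.

```go
package main

import (
	"encoding/binary"
	"net/netip"
	"sort"
)

// entrySize and the layout are assumptions inferred from the hex dump.
const entrySize = 20

// sample holds the first three records of the dump, written out by hand.
const sample = "\x00\x00\x12\x01\x04\x00\x01\x20\xff\xff\x12\x01\x04\x00\x01\x20\x70\x00\x00\x00" +
	"\x00\x00\x00\x00\x00\x02\x01\x20\xff\xff\xff\x05\x00\x02\x01\x20\xc4\x09\x00\x00" +
	"\x00\x00\x00\x06\x00\x02\x01\x20\xff\xff\xff\x06\x00\x02\x01\x20\xf3\x1d\x00\x00"

// readEntry decodes record i of sample.
func readEntry(i int) (start, end uint64, asn uint32) {
	rec := []byte(sample[i*entrySize : (i+1)*entrySize])
	return binary.LittleEndian.Uint64(rec[0:8]),
		binary.LittleEndian.Uint64(rec[8:16]),
		binary.LittleEndian.Uint32(rec[16:20])
}

// lookup returns the ASN whose range contains network, or 0 if none does.
func lookup(network uint64) uint32 {
	n := len(sample) / entrySize
	// Find the first record whose end is at or past network.
	i := sort.Search(n, func(k int) bool {
		_, end, _ := readEntry(k)
		return end >= network
	})
	if i == n {
		return 0
	}
	if start, _, asn := readEntry(i); start <= network {
		return asn
	}
	return 0
}

// lookupAddr parses an IPv6 address and looks up its upper 64 bits.
// It returns 0 for unparsable input.
func lookupAddr(s string) uint32 {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return 0
	}
	b := addr.As16()
	return lookup(binary.BigEndian.Uint64(b[:8]))
}
```

With the three sample records, `lookupAddr("2001:200:5ff:ffff::1")` falls in the second range and `lookupAddr("2001:200:600::1")` in the third, while `lookupAddr("2001:4:113::1")` lands in the gap between the first two ranges and yields 0.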

This looks great!

@prestonvanloon prestonvanloon left a comment

LGTM - Thank you!

Jorropo added a commit to ipfs/kubo that referenced this pull request Dec 29, 2023