Can't access InRelease files #7373

Closed
NatoBoram opened this issue May 26, 2020 · 21 comments
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization

Comments

@NatoBoram
Contributor

NatoBoram commented May 26, 2020

Version information:

go-ipfs version: 0.6.0-dev-413ab315b
Repo version: 9
System version: arm64/linux
Golang version: go1.14.3
OS: Ubuntu 20.04 LTS aarch64 
Host: Raspberry Pi 4 Model B Rev 1.2 
Kernel: 5.4.0-1011-raspi 
Uptime: 2 days, 43 mins 
Packages: 669 (dpkg), 6 (snap) 
Shell: bash 5.0.16 
Terminal: /dev/pts/0 
CPU: BCM2835 (4) @ 1.500GHz 
Memory: 885MiB / 3793MiB 

Description:

I'm trying to build a mirror of Ubuntu Archives on IPNS using a Raspberry Pi and a 2 TB external HDD. So far, things are going pretty well, but I think I've encountered a breaking bug.

deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal           main restricted universe multiverse # IPNS
deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-updates   main restricted universe multiverse # IPNS
deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-backports main restricted universe multiverse # IPNS
deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-security  main restricted universe multiverse # IPNS
deb http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-proposed  main restricted universe multiverse # IPNS
Err:11 http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-updates InRelease
  Connection failed [IP: 127.0.0.1 8080]
Err:12 http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-backports InRelease
  Connection failed [IP: 127.0.0.1 8080]
Err:13 http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-security InRelease
  Connection failed [IP: 127.0.0.1 8080]
Err:14 http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu focal-proposed InRelease
  Connection failed [IP: 127.0.0.1 8080]
Fetched 265 kB in 4min 0s (1 102 B/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
12 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: Failed to fetch http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal-updates/InRelease  Connection failed [IP: 127.0.0.1 8080]
W: Failed to fetch http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal-backports/InRelease  Connection failed [IP: 127.0.0.1 8080]
W: Failed to fetch http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal-security/InRelease  Connection failed [IP: 127.0.0.1 8080]
W: Failed to fetch http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists/focal-proposed/InRelease  Connection failed [IP: 127.0.0.1 8080]
W: Some index files failed to download. They have been ignored, or old ones used instead.

According to those logs, the problem occurs at http://localhost:8080/ipns/QmSbCLwYuqBGQYTG4PBHaFunsKcpLLn97ApNn1wf6cV8jd/ubuntu/dists.

I'm using this to query multiple public gateways to know if they can access the file.

To speed up discovery, run ipfs swarm connect /p2p/QmV8TePNsdZiXUpq62739hp5MJLSk8SdpSWcpLxaqhRQdR.
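
For reference, the failing URLs in the log follow apt's repository layout, `<base>/dists/<suite>/InRelease`. The sketch below derives those URLs from `deb` lines so they can be probed one by one. It assumes plain one-line entries without bracketed options, and the helper name `inrelease_urls` is hypothetical.

```python
def inrelease_urls(sources_lines):
    """Return the InRelease URL that apt would fetch for each `deb` line."""
    urls = []
    for line in sources_lines:
        # Drop trailing comments such as "# IPNS".
        line = line.split("#", 1)[0].strip()
        fields = line.split()
        # Assumption: no "[arch=...]" options between "deb" and the URL.
        if len(fields) < 3 or fields[0] != "deb":
            continue
        base, suite = fields[1].rstrip("/"), fields[2]
        urls.append(f"{base}/dists/{suite}/InRelease")
    return urls
```

Each returned URL can then be requested from the local gateway or a public one to see which suites are unreachable.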

@NatoBoram NatoBoram added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels May 26, 2020
@Stebalien
Member

Is it a symlink?

@NatoBoram
Contributor Author

NatoBoram commented May 27, 2020

There's a high probability that it is; there are far too many links in there. I noticed that some of them were just downloaded as files, and it looks like some others just aren't reachable.

@Stebalien
Member

Could you give me your full multiaddr? I can't find your node.

@Stebalien
Member

(but yeah, we need to follow symlinks on the gateway)

@Stebalien
Member

Stebalien commented May 27, 2020 via email

@NatoBoram
Contributor Author

NatoBoram commented May 27, 2020

Oh. I think we found the problem.

ipfs get bafybeihocm6ufvyz44kde6fewu2wsj4qfiecfzbjubbekvcnw3hr7u3smq/ubuntu/dists/focal-updates/
Saving file(s) to focal-updates
 311.33 MiB / 311.33 MiB [==================================================================================] 100.00% 1s
Error: data in file did not match. mirrors/ubuntu/dists/focal-updates/InRelease offset 0

Because rsync takes 10 minutes to re-sync and IPFS takes multiple hours to re-sync, there's no way the InRelease file can match.
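
To illustrate why a changed file breaks a `--nocopy` add, here is a toy model in Python. It is not go-ipfs code: it assumes a reference is just a path, an offset, a size and a SHA-256 digest, and the names `make_ref` and `read_ref` are hypothetical.

```python
import hashlib


def make_ref(path, offset, size):
    """Record where a block lives on disk instead of copying its bytes."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    return {
        "path": path,
        "offset": offset,
        "size": size,
        "digest": hashlib.sha256(data).hexdigest(),
    }


def read_ref(ref):
    """Re-read the block from disk and verify it still matches its digest."""
    with open(ref["path"], "rb") as f:
        f.seek(ref["offset"])
        data = f.read(ref["size"])
    if hashlib.sha256(data).hexdigest() != ref["digest"]:
        raise ValueError(
            f"data in file did not match. {ref['path']} offset {ref['offset']}"
        )
    return data
```

If rsync rewrites InRelease between `make_ref` and `read_ref`, the digest check fails, which mirrors the error above.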

Is there a way to make the adding process faster? Right now, the command I'm using is ipfs add --recursive --hidden --quieter --wrap-with-directory --chunker=rabin --nocopy --fscache --cid-version=1.

I saw in ipfs-inactive/package-managers#18 that removing --nocopy yielded huge improvements, but that's kinda hard when Ubuntu Archives is 1.24 TB and I have only 2 TB available 🤔

@Stebalien
Member

Removing --fscache may help. Other than that, which datastore are you using? Could you post the output of ipfs config show?

@NatoBoram
Contributor Author

NatoBoram commented Jun 4, 2020

ipfs config show
{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [],
    "Gateway": "/ip4/0.0.0.0/tcp/8080",
    "NoAnnounce": [],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic",
      "/ip6/::/udp/4001/quic"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"
  ],
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "child": {
        "path": "badgerds",
        "syncWrites": false,
        "truncate": true,
        "type": "badgerds"
      },
      "prefix": "badger.datastore",
      "type": "measure"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "10GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": true,
      "Interval": 10
    }
  },
  "Experimental": {
    "FilestoreEnabled": true,
    "GraphsyncEnabled": true,
    "Libp2pStreamMounting": true,
    "P2pHttpProxy": true,
    "ShardingEnabled": true,
    "StrategicProviding": true,
    "UrlstoreEnabled": true
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "QmV8TePNsdZiXUpq62739hp5MJLSk8SdpSWcpLxaqhRQdR"
  },
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "Routing": {
    "Type": "dht"
  },
  "Swarm": {
    "AddrFilters": null,
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 900,
      "LowWater": 600,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": false,
    "DisableRelay": false,
    "EnableAutoRelay": true,
    "EnableRelayHop": true
  }
}

Since I got the "data in file did not match" error, I removed the --nocopy option, but now I need 2.48 TB of storage and I only have 1.80 TB. I think this project will sink for me ^^

Right now, I'm using Btrfs and duperemove to save on the duplication, but it looks like not much of the Badger Datastore can be deduplicated. If I could deduplicate just enough to not go over my 1.8 TB budget, I would be able to publish this mirror and actually use it.

apt show duperemove
Package: duperemove
Version: 0.11.1-3
Priority: optional
Section: universe/admin
Origin: Ubuntu
Maintainer: Ubuntu Developers <[email protected]>
Original-Maintainer: Peter Záhradník <[email protected]>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 260 kB
Depends: libc6 (>= 2.14), libglib2.0-0 (>= 2.31.8), libsqlite3-0 (>= 3.7.15)
Enhances: btrfs-progs
Homepage: https://markfasheh.github.io/duperemove/
Download-Size: 70.6 kB
APT-Sources: http://archive.ubuntu.com/ubuntu focal/universe amd64 Packages
Description: extent-based deduplicator for file systems
 Duperemove is a tool for finding duplicated extents and submitting them for
 deduplication.  When given a list of files it will hash their contents on a
 block by block basis and compare those hashes to each other, finding and
 categorizing extents that match each other.
 .
 On BTRFS and, experimentally, XFS, it can then reflink such extents in a
 race-free way.  Unlike hardlink-based solutions, affected files appear
 independent in any way other than reduced disk space used.

@Stebalien
Member

Got it. I wanted to make sure you were using badger without sync writes enabled.
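
As a rough way to run that check automatically, the Python sketch below walks `Datastore.Spec` from the `ipfs config show` output and lists each leaf datastore with its `syncWrites` value. The function name is hypothetical, and it assumes nested specs use only the `child` and `mounts` keys.

```python
import json


def datastore_backends(config):
    """Yield (type, syncWrites) for every leaf datastore in Datastore.Spec."""

    def walk(spec):
        children = []
        if "child" in spec:
            children.append(spec["child"])
        children.extend(spec.get("mounts", []))
        if not children:
            # A leaf: report its type and its sync setting (None if absent).
            yield spec.get("type"), spec.get("syncWrites")
        for child in children:
            yield from walk(child)

    yield from walk(config["Datastore"]["Spec"])


# Trimmed-down version of the config posted above.
SAMPLE = """
{"Datastore": {"Spec": {
  "child": {"path": "badgerds", "syncWrites": false,
            "truncate": true, "type": "badgerds"},
  "prefix": "badger.datastore", "type": "measure"}}}
"""

for backend, sync_writes in datastore_backends(json.loads(SAMPLE)):
    print(backend, sync_writes)
```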

I'm not sure why removing --nocopy helps and I'm not entirely sure that that's still true after some optimizations we've made.

Note: I'd consider using snapshots to decouple these. That is, you can:

  1. Rsync in one loop every 10? minutes.
  2. In a separate loop:
    1. Take a btrfs snapshot (use flock to make sure an rsync run isn't running?).
    2. Add this btrfs snapshot to IPFS with nocopy.

That will mean that the IPFS mirror will always be a bit behind but you'll never have to stall the HTTP mirror to wait on the IPFS mirror. This will also ensure that you never modify files after adding them to IPFS.
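
The second loop could be sketched in Python as below. This is only an outline under assumptions: the paths, the lock file and the function names are hypothetical, the rsync loop is expected to take the same lock (for example via `flock`), and the `ipfs add` flags are limited to ones already used in this thread.

```python
import fcntl
import subprocess
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def mirror_lock(lock_path):
    """Hold an exclusive flock so rsync and snapshotting never overlap."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def snapshot_command(subvolume, snapshot_dir, now=None):
    """Build the command for a read-only snapshot named after the UTC time."""
    now = now or datetime.now(timezone.utc)
    target = f"{snapshot_dir}/{now:%Y%m%dT%H%M%SZ}"
    return ["btrfs", "subvolume", "snapshot", "-r", subvolume, target], target


def add_command(snapshot_path):
    """Build the ipfs add invocation for one snapshot."""
    return ["ipfs", "add", "--recursive", "--quieter", "--nocopy",
            "--cid-version=1", snapshot_path]


def snapshot_and_add(subvolume, snapshot_dir, lock_path, run=subprocess.run):
    # Only the (fast) snapshot is taken under the lock.
    with mirror_lock(lock_path):
        command, target = snapshot_command(subvolume, snapshot_dir)
        run(command, check=True)
    # The slow add runs on the frozen snapshot while rsync continues.
    run(add_command(target), check=True)
    return target
```

Because the snapshot is read-only, the files referenced by the filestore cannot change after they are added.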

@NatoBoram
Contributor Author

Oh, that's very interesting. For the --nocopy option to work, new files have to be in a different path than the old files, and unchanged files mustn't be removed. That means I'll end up with an ever-growing amount of snapshots, roughly once per ipfs add.

Is there a way to cleanup the snapshots? What happens if I add using --nocopy a file that already exists elsewhere?

@Stebalien
Member

> That means I'll end up with an ever-growing amount of snapshots, roughly once per ipfs add.

Yes, but the snapshots should dedup.

> Is there a way to cleanup the snapshots? What happens if I add using --nocopy a file that already exists elsewhere?

Unfortunately, I don't think it's possible to override old files with new files. I believe for performance reasons, we don't bother replacing old "filestore no copy" records with ones pointing to new files.

Honestly, I think the best approach here would be to create a new repo, add a new snapshot, then delete the old repos and the old snapshots (once every few days). I assume the repos (with --nocopy) aren't too large, right?
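
That rotation could be planned along these lines. The Python sketch only builds the list of steps and does not run anything; the function name and paths are hypothetical. One caveat to keep in mind is that `ipfs init` generates a new identity for the fresh repo, so publishing under the same IPNS name would need the key carried over, which is not shown here.

```python
def rotation_plan(snapshot, new_repo, old_repos, old_snapshots):
    """Return (environment, command) steps for one repo rotation."""
    env = {"IPFS_PATH": new_repo}
    steps = [
        # Create the fresh repo and allow --nocopy adds in it.
        (env, ["ipfs", "init"]),
        (env, ["ipfs", "config", "--json",
               "Experimental.FilestoreEnabled", "true"]),
        # Add the newest snapshot by reference.
        (env, ["ipfs", "add", "--recursive", "--nocopy",
               "--cid-version=1", snapshot]),
    ]
    # Only after the new repo is complete, drop the old repos and snapshots.
    steps += [({}, ["rm", "-rf", repo]) for repo in old_repos]
    steps += [({}, ["btrfs", "subvolume", "delete", snap])
              for snap in old_snapshots]
    return steps
```

Keeping the deletions at the end means a failed add leaves the previous repo and snapshot usable.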

Otherwise, we may be able to find a way to bypass the "do I already have this block check" by adding yet another flag (but I'd prefer not to if possible).

@NatoBoram
Contributor Author

> Otherwise, we may be able to find a way to bypass the "do I already have this block check" by adding yet another flag (but I'd prefer not to if possible).

This seems very useful. In fact, it's confusing that it's not already the case: if I add a new file using --nocopy, then I expect the unpinned ones to be replaced. Another approach could be to allow multiple sources for --nocopy files, but I'm not sure that's very useful. I think I prefer just overriding the previous link.

I believe the benefits are real. Should I raise an issue for that?

@Stebalien
Member

It deserves an issue, but I'm not sure about the best approach. A really nice property of the current blockstore is that it's idempotent. This change would break that.

@ivan386
Contributor

ivan386 commented Jun 6, 2020

@Stebalien
1. Check whether we already have the block.
2. Validate the existing block.
3. Replace it if the old block is bad.
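
A toy Python model of those three steps, to make the proposal concrete. It is not how go-ipfs stores blocks: `index` maps a SHA-256 digest to a path, `files` stands in for the disk, and all names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def put_reference(index, files, path):
    """Add a no-copy reference, replacing a stale one for the same block."""
    key = digest(files[path])
    old_path = index.get(key)
    # 1. Check whether we already have the block.
    if old_path is None:
        index[key] = path
        return "added"
    # 2. Validate: does the old location still hold the same bytes?
    old_data = files.get(old_path)
    if old_data is not None and digest(old_data) == key:
        return "kept"
    # 3. Replace the bad reference with the new location.
    index[key] = path
    return "replaced"
```

The validation step costs an extra read and hash for every block that is already known.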

@Stebalien
Member

I'm closing this as it's not really a bug. Removing/changing a file on disk after adding it to go-ipfs with the --nocopy flag isn't allowed.

@NatoBoram
Contributor Author

NatoBoram commented Jun 18, 2020

Hey! I just wanted to add that I've updated my script to manage snapshots as you suggested.

image

I had to create a Btrfs subvolume and move the mirror over, but that finished overnight, and I'm now adding it back to IPFS using a fresh badgerds. It seems to take a very long time.

The problem with the program I made is that it's now dependent on Btrfs. While I do love Btrfs, I'm not sure if it's a great idea for my ipfs-mirror-manager to be tied to a filesystem. Moreover, the --nocopy option makes it mandatory for the node to boot from the same drive as the mirror itself. It would be nice to be able to separate them.

Nonetheless, successfully pulling off an IPFS mirror of the Ubuntu archive on a Raspberry Pi would be very impressive, and I'm extremely proud that IPFS has come this far.

At this time, the .ipfs folder is only 1.3G. The total disk usage is 1.3T.

@Stebalien
Member

So, my ideal solution here would be to just not use the go-ipfs daemon, but instead write a custom Dropbox-like IPFS service by cobbling together bitswap, libp2p, a datastore, and the DHT. It would:

  1. Monitor a directory for changes.
  2. When a file is added, it would chunk, hash, and index (but not copy) the file. You could even store the results in an sql database instead of using a datastore.
  3. When a file is removed/changed, it would remove references to the file.

The database schema would be:

  • Table: files
    • filename (primary key)
    • modtime
  • Table: blocks
    • id (primary key)
    • cid (indexed)
    • filename (indexed)
    • offset

On start:

  • Scan for changed files, comparing with the mod times in the database.

On add/update:

  • Add the file to the files table.
  • Run DELETE FROM blocks where filename=filename (just in case)
  • Chunk the file, adding each block to the blocks table.

On remove:

  • Run DELETE FROM blocks where filename=filename (just in case)
  • Remove the file from the files table.
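
The schema and the start/add/remove steps above can be sketched with Python's `sqlite3` module. This covers only the index, not bitswap, libp2p or the DHT. Assumptions: fixed 256 KiB chunks, a hex SHA-256 digest standing in for a real CID, and hypothetical function names.

```python
import hashlib
import os
import sqlite3

CHUNK_SIZE = 256 * 1024  # assumption: fixed-size chunks

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    filename TEXT PRIMARY KEY,
    modtime  REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS blocks (
    id       INTEGER PRIMARY KEY,
    cid      TEXT NOT NULL,
    filename TEXT NOT NULL,
    "offset" INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS blocks_cid ON blocks (cid);
CREATE INDEX IF NOT EXISTS blocks_filename ON blocks (filename);
"""


def open_index(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def index_file(db, filename):
    """On add/update: record the file, drop old blocks, index new chunks."""
    modtime = os.stat(filename).st_mtime
    with db:  # one transaction per file
        db.execute("INSERT OR REPLACE INTO files (filename, modtime) "
                   "VALUES (?, ?)", (filename, modtime))
        db.execute("DELETE FROM blocks WHERE filename = ?", (filename,))
        with open(filename, "rb") as f:
            offset = 0
            while chunk := f.read(CHUNK_SIZE):
                cid = hashlib.sha256(chunk).hexdigest()  # stand-in for a CID
                db.execute('INSERT INTO blocks (cid, filename, "offset") '
                           'VALUES (?, ?, ?)', (cid, filename, offset))
                offset += len(chunk)


def remove_file(db, filename):
    """On remove: drop the file's blocks, then the file itself."""
    with db:
        db.execute("DELETE FROM blocks WHERE filename = ?", (filename,))
        db.execute("DELETE FROM files WHERE filename = ?", (filename,))


def scan(db, root):
    """On start: index new or modified files and forget vanished ones."""
    seen = set()
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            seen.add(path)
            row = db.execute("SELECT modtime FROM files WHERE filename = ?",
                             (path,)).fetchone()
            if row is None or row[0] != os.stat(path).st_mtime:
                index_file(db, path)
    for (path,) in db.execute("SELECT filename FROM files").fetchall():
        if path not in seen:
            remove_file(db, path)


def find_block(db, cid):
    """Return (filename, offset) for a block, or None if it is unknown."""
    return db.execute('SELECT filename, "offset" FROM blocks '
                      'WHERE cid = ? LIMIT 1', (cid,)).fetchone()
```

Serving a block then amounts to `find_block` followed by a seek and read in the original file, so nothing is copied into a datastore.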

@rpodgorny

> So, my ideal solution here would be to just not use the go-ipfs daemon, but instead write a custom Dropbox-like IPFS service by cobbling together bitswap, libp2p, a datastore, and the DHT. It would:
>
> 1. Monitor a directory for changes.
> 2. When a file is added, it would chunk, hash, and _index_ (but not copy) the file. You could even store the results in an sql database instead of using a datastore.
> 3. When a file is removed/changed, it would remove references to the file.
>
> The database schema would be:
>
> * Table: files
>   * filename (primary key)
>   * modtime
> * Table: blocks
>   * id (primary key)
>   * cid (indexed)
>   * filename (indexed)
>   * offset
>
> On start:
>
> * Scan for changed files, comparing with the mod times in the database.
>
> On add/update:
>
> * Add the file to the files table.
> * Run `DELETE FROM blocks where filename=filename` (just in case)
> * Chunk the file, adding each block to the blocks table.
>
> On remove:
>
> * Run `DELETE FROM blocks where filename=filename` (just in case)
> * Remove the file from the files table.

perhaps you should create a new issue to track the development of this idea

@Stebalien
Member

Stebalien commented Jun 19, 2020 via email

@NatoBoram
Contributor Author

I can't really afford the time it would take to build a custom IPFS daemon, so I have to make do with what I have. And right now, what I have is a mirror that takes around 2 days per update. I posted it on Reddit.

In the meantime, is there any way to optimize it?

Right now, the command I'm using is ipfs add --recursive --hidden --quieter --progress --chunker=rabin --nocopy --cid-version=1.

CPU usage is about 40% and HDD read speeds are at about 15-30 Mbps.

@Stebalien
Member

Don't use --chunker=rabin. Our rabin implementation is terrible. For now, I recommend --chunker=buzhash. You could also try passing --inline to inline small (<=32 bytes) files into directory entries.
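
Applied to the command quoted earlier, that advice might look like the Python helper below. The helper name and its parameters are hypothetical; the flags are the ones mentioned in this thread. How `--inline` interacts with `--nocopy` is not covered here, so treat the combination as untested.

```python
def build_add_command(path, chunker="buzhash", inline=True, nocopy=True):
    """Build an `ipfs add` invocation with a selectable chunker."""
    command = [
        "ipfs", "add",
        "--recursive", "--hidden", "--quieter", "--progress",
        f"--chunker={chunker}",
        "--cid-version=1",
    ]
    if nocopy:
        command.append("--nocopy")
    if inline:
        # Inline very small files into their directory entries.
        command.append("--inline")
    command.append(path)
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`; timing one run per chunker on a subset of the mirror would show whether the change helps on this hardware.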
