This repository has been archived by the owner on Jun 11, 2024. It is now read-only.

Update Hydras to new HTTP Delegated Routing #180

Closed
2 tasks
BigLep opened this issue Nov 18, 2022 · 9 comments

@BigLep

BigLep commented Nov 18, 2022

Done Criteria

Hydras are using the HTTP Delegated Routing version compatible with ipfs/specs#337 in production.

Why Important

See motivation in ipfs/specs#337

Notes

@guseggert
Contributor

guseggert commented Dec 8, 2022

Main code change in #185

I have also turned off OpenSSL in the Docker build since it keeps causing problems; it's now using Go's crypto. I'll monitor perf around that.

I've deployed this to the test instance, see libp2p/hydra-booster-infra#14. I've also updated the dashboards with the new metrics.

I'll let it bake overnight; if everything looks good tomorrow, I'll deploy to the whole fleet.

@BigLep
Author

BigLep commented Dec 9, 2022

Hi @guseggert. Did the prod deployment happen? Are there client-side (Hydra) and server-side (cid.contact) graphs you're monitoring?

@guseggert
Contributor

No, not yet. It was getting late on Friday and I didn't want to deploy late on a Friday. Today I looked into why CPU usage was much higher than expected (almost 2x). I expected something related to disabling OpenSSL, but CPU profiles showed most time spent in GC, and allocation profiles showed the top allocations were in the libp2p resource manager's metric publishing, which generates a ton of garbage in the tags that it adds to metrics. So I disabled that. We don't use it anyway, since hydra calculates its own resource manager metrics. That's now deployed to test, and CPU usage looks much better, as does long-tail latency on cid.contact requests.

This became an issue now because I also upgraded libp2p to the latest version to pick up all the security updates.

Letting this bake again tonight and will take a look in the AM. Will also open an issue w/ go-libp2p to reduce the garbage generated by the resource manager metrics.
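
For readers who want to run a similar investigation, here is a minimal Go sketch of the two measurements mentioned above: how much CPU time the garbage collector is taking, and an allocation profile that shows where garbage comes from. It uses only the standard library. It is not the tooling used in this thread; the function names and the `churn` workload (which builds short-lived tag slices) are hypothetical stand-ins.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// gcCPUFraction reports the fraction of the program's available CPU time
// that the garbage collector has used since the program started.
func gcCPUFraction() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.GCCPUFraction
}

// allocProfile returns the cumulative allocation profile in pprof's
// binary format, suitable for inspection with `go tool pprof`.
func allocProfile() ([]byte, error) {
	runtime.GC() // run a collection so recent allocations appear in the profile
	var buf bytes.Buffer
	if err := pprof.Lookup("allocs").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// churn is a hypothetical workload: it builds a fresh tag slice on every
// iteration, loosely imitating per-metric tag garbage.
func churn(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		tags := []string{"scope", "direction", fmt.Sprint(i)}
		total += len(tags)
	}
	return total
}

func main() {
	churn(100000)

	prof, err := allocProfile()
	if err != nil {
		fmt.Println("could not capture allocation profile:", err)
		return
	}
	fmt.Printf("GC CPU fraction: %.4f\n", gcCPUFraction())
	fmt.Printf("allocation profile: %d bytes\n", len(prof))
}
```

In a long-running service the same profiles are usually exposed over HTTP with `net/http/pprof` rather than captured in-process like this.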

@BigLep
Author

BigLep commented Dec 14, 2022

@guseggert : how is this looking?

Also, please share the issue with go-libp2p when you have it.

@guseggert
Contributor

I was able to grab another profile showing the tag allocations from OpenCensus, and opened an issue with go-libp2p here: libp2p/go-libp2p#1955

I've been fighting with resource manager and I have given up on it and turned it off, and things are looking better now. Every time I would fix one limit, another would pop up and cause some degenerate behavior somewhere else, and chasing down the root cause of throttles is non-trivial. We need to move forward here so I am just disabling resource manager for now.

@BigLep
Author

BigLep commented Dec 14, 2022

@guseggert : can you also point to how you were configuring the resource manager? (I'm asking so I can learn what pain another integrator experienced.) I would have expected us to only have limits like Kubo's strategy.

@guseggert
Contributor

guseggert commented Dec 14, 2022

Each hydra host is effectively running many Kubo nodes at the same time, and they also don't handle Bitswap traffic, so the traffic pattern is pretty different from a single Kubo node. We have high-traffic gateway hosts to compare with but they are even more different (eg accelerated DHT client).

The RM config currently deployed to prod hydras is here: https://github.com/libp2p/hydra-booster/blob/master/head/head.go#L82 . Note that those are per-head limits. After upgrading from go-libp2p v0.21 to v0.24, there was significantly more throttling, so I've been tweaking them locally and in a branch. As part of that, I pulled resource manager and connection manager out to be shared across heads instead, which makes reasoning about limits easier. When RM throttling was interfering, far fewer requests were processed by the DHT, but memory usage and goroutine counts were much higher, with most goroutines stuck on the identify handshake. I didn't trace through the code, but I suspect they were somehow stuck due to RM throttling, since everything's running fine now with RM off.

@guseggert
Contributor

guseggert commented Dec 16, 2022

Coordinated with @masih this morning to flip the full Hydra fleet over to the HTTP API. Things are looking fine. The p50 cid.contact latency has dropped from ~36 ms (via reframe) to ~18 ms (via HTTP API).

@BigLep
Author

BigLep commented Jan 20, 2023

Resolving since the done criteria are satisfied.

@BigLep BigLep closed this as completed Jan 20, 2023
@github-project-automation github-project-automation bot moved this from 🏃‍♀️ In Progress to 🎉 Done in IPFS Shipyard Team Jan 20, 2023