Wait failed pieces greatly increase image download time #1083

Closed
likunbyl opened this issue Feb 18, 2022 · 8 comments

@likunbyl

Bug report:

I deployed dragonfly v2.0.2-rc.4 through Helm chart 0.5.38. While distributing an image to 16 nodes, the image download time was higher than before.

Previously, running dragonfly v2.0.2-alpha.6, the average image download time was 7.06s; now it is 8.98s.

I checked the logs and found that several dfdaemon instances had more than a 2s delay while waiting for failed pieces:

{"level":"info","ts":"2022-02-17 21:36:00.347","caller":"peer/peertask_conductor.go:888","msg":"get piece 34 from 10.21
8.45.144:65002/10.218.45.144-33797-ff76c585-7097-4f02-befe-464c15969c20, digest: 78d276a903dcd881a34c667172341c78, star
t: 142606336, size: 4194304","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7
ab985de93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:00.507","caller":"peer/peertask_conductor.go:888","msg":"get piece 35 from 10.21
8.45.144:65002/10.218.45.144-33797-ff76c585-7097-4f02-befe-464c15969c20, digest: 9f733a2d81932d68cc8df22f8fc49899, star
t: 146800640, size: 3093475","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7
ab985de93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:00.667","caller":"peer/peertask_conductor.go:779","msg":"all pieces requests sen
t, just wait failed pieces","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7a
b985de93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:03.272","caller":"peer/peertask_conductor.go:1160","msg":"peer task done, cost: 
4881ms","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7ab985de93c9f618e76fb3
6c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:03.272","caller":"peer/peertask_conductor.go:920","msg":"peer task success, stop
 to wait failed piece","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7ab985d
e93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:03.273","caller":"peer/peertask_conductor.go:1224","msg":"step 3: report success
ful peer result ok","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7ab985de93
c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:03.273","caller":"peer/peertask_conductor.go:976","msg":"peer task success, peer
 download worker #0 exit","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7ab9
85de93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}

From 21:36:00.667 to 21:36:03.272, there was a 2.6s delay.

Not all dfdaemon instances had the 2s delay, only 3 or 4 instances did, but most of the other instances downloaded the image from these delayed instances, so the average download time increased a lot.
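
To quantify this gap across instances, the logs can be scanned mechanically. Below is a minimal sketch (my own helper, not part of Dragonfly; the field names and message strings are taken from the log excerpt above) that reports, per peer, the time spent between "just wait failed pieces" and "peer task done":

// loggap.go: a minimal sketch, not part of Dragonfly. It scans dfdaemon
// JSON logs on stdin and reports, per peer, the time spent between
// "just wait failed pieces" and "peer task done".
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"time"
)

// logLine holds only the fields we need from each JSON log entry.
type logLine struct {
	TS   string `json:"ts"`
	Msg  string `json:"msg"`
	Peer string `json:"peer"`
}

func main() {
	const layout = "2006-01-02 15:04:05.000" // matches the "ts" format above
	waitStart := map[string]time.Time{}      // peer ID -> time the wait phase began

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
	for sc.Scan() {
		var l logLine
		if json.Unmarshal(sc.Bytes(), &l) != nil {
			continue // skip non-JSON lines
		}
		t, err := time.Parse(layout, l.TS)
		if err != nil {
			continue
		}
		switch {
		case strings.Contains(l.Msg, "just wait failed pieces"):
			waitStart[l.Peer] = t
		case strings.Contains(l.Msg, "peer task done"):
			if s, ok := waitStart[l.Peer]; ok {
				fmt.Printf("peer %s waited %v for failed pieces\n", l.Peer, t.Sub(s))
				delete(waitStart, l.Peer)
			}
		}
	}
}

Run as go run loggap.go < core-dfdaemon.log; on the excerpt above it would report roughly 2.605s for peer 10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2.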

Expected behavior:

Avoid this kind of delay and decrease the image download time.

How to reproduce it:

Use Helm chart 0.5.38 to deploy dragonfly v2.0.2-rc.4, then distribute an image to several nodes.

Environment:

  • Dragonfly version: v2.0.2-rc.4
  • OS: CentOS 7
  • Kernel (e.g. uname -a): Linux 3.10.0-1160.31.1.el7.x86_64
  • Others: Helm chart 0.5.38
@gaius-qi
Member

Use the latest Helm chart version (v0.5.46) with optimized image scheduling.

@jim3ma
Member

jim3ma commented Feb 18, 2022

Can you provide the full log of 10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2, with verbose: true set for debug logging?

@likunbyl
Author

The dfdaemon log:

core-dfdaemon-hq265.log

@likunbyl
Author

I tried Helm chart 0.5.46, which installed dragonfly v2.0.2-rc.9. It was amazing: the delay waiting for failed pieces was not that high, normally below 0.5s, which gave a stable image download time of about 6.3s.

And I noticed that all dfdaemons download the image from a CDN instance, which is different from the previous version, v2.0.2-rc.4, where they always tried to download the image from another dfdaemon. I guess that's part of the optimized image scheduling.

@jim3ma
Member

jim3ma commented Feb 23, 2022

the dfdaemon log:

core-dfdaemon-hq265.log

In this log:

grep 10.218.44.130-22037-0180d10d-9203-49fa-b624-b7bd10bf114a /Users/jim/Downloads/core-dfdaemon-hq265.log | grep -P "(wrote piece|peertask_stream.go:216|wait failed pieces|peer download worker|get piece \d+ from)"

The message "all pieces requests sent, just wait failed pieces" means that all piece requests have been sent to the piece download workers.
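
For context on that phase, here is an illustrative sketch (not Dragonfly's actual code; the names are mine) of the pattern the log messages describe: once every piece request has been handed to the download workers, the conductor only waits, retrying failed pieces until the task finishes:

package main

import (
	"fmt"
	"time"
)

// waitFailedPieces is illustrative only, not Dragonfly's implementation.
// After all piece requests are sent, the conductor blocks here until the
// task completes, retrying any piece whose download failed in the meantime.
func waitFailedPieces(failed <-chan int32, done <-chan struct{}, retry func(int32)) {
	for {
		select {
		case num := <-failed:
			retry(num) // re-schedule the failed piece, e.g. with another peer
		case <-done:
			return // "peer task success, stop to wait failed piece"
		}
	}
}

func main() {
	failed := make(chan int32, 1)
	done := make(chan struct{})
	failed <- 7 // pretend piece 7 failed once
	go func() {
		time.Sleep(10 * time.Millisecond) // pretend the remaining pieces finish
		close(done)
	}()
	waitFailedPieces(failed, done, func(n int32) { fmt.Println("retrying piece", n) })
}

Note that if a peer is merely slow rather than failing outright, nothing arrives on the failed channel and such a loop simply sits until the slow piece completes, which would match the gap seen here.
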
The last piece info was received at:

{"level":"info","ts":"2022-02-18 21:15:44.572","caller":"peer/peertask_conductor.go:888","msg":"get piece 35 from 10.218.43.158:65002/10.218.43.158-621-a5d0f644-4b4e-459e-b823-3fdce77c56dc, digest: c8b07f67516daf58e697b3d1f3320543, start: 146800640, size: 3093486","peer":"10.218.44.130-22037-0180d10d-9203-49fa-b624-b7bd10bf114a","task":"ffdf7bc24b28847f601b4b240662dbf551f4eeb015cc5583c97ff2e6e242f151","component":"PeerTask"}

But the pieces downloaded slowly from other peers, ending at:

{"level":"debug","ts":"2022-02-18 21:15:46.129","caller":"peer/peertask_stream.go:216","msg":"all 36 pieces wrote to pipe","peer":"10.218.44.130-22037-0180d10d-9203-49fa-b624-b7bd10bf114a","task":"ffdf7bc24b28847f601b4b240662dbf551f4eeb015cc5583c97ff2e6e242f151","component":"PeerTask"}

Do you limit upload or download to a low speed?

@likunbyl
Author

I'm using the default values.yaml from Helm chart 0.5.38:

dfdaemon:
  config:
    download:
      # -- Total download limit per second
      totalRateLimit: 200Mi
      # -- Per peer task limit per second
      perPeerRateLimit: 100Mi
      # -- Calculate digest; when only pulling images, this can be false to save CPU and memory
      calculateDigest: true
      downloadGRPC:
        # -- Download grpc security option
        security:
          insecure: true
        # -- Download service listen address
        # currently, only unix domain sockets are supported
        unixListen:
          socket: /tmp/dfdamon.sock
      peerGRPC:
        # -- Peer grpc security option
        security:
          insecure: true
        tcpListen:
          # -- Listen address
          listen: 0.0.0.0
          # -- Listen port
          port: 65000
    upload:
      # -- Upload limit per second
      rateLimit: 100Mi
      # -- Upload grpc security option
      security:
        insecure: true
      tcpListen:
        # -- Listen address
        listen: 0.0.0.0
        # -- Listen port
        port: 65002
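
For what it's worth, if these caps (200Mi total download, 100Mi per peer task, 100Mi upload) were the bottleneck, raising them should shrink the gap. A quick way to rule that out, reusing the keys from the chart values above (the numbers here are illustrative, not a recommendation):

dfdaemon:
  config:
    download:
      totalRateLimit: 1024Mi   # illustrative: lift the per-node download cap
      perPeerRateLimit: 512Mi  # illustrative: lift the per-task cap
    upload:
      rateLimit: 1024Mi        # illustrative: lift the upload cap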

@likunbyl
Author

So, although I got an ideal result from dragonfly v2.0.2-rc.9, I think that's because all these dfdaemons download the image from a CDN instance, so the wait-failed-pieces delay is not that big. Obviously, when the number of nodes increases to hundreds, a lot of downloading will happen between dfdaemons, and the wait-failed-pieces delay may increase again. How can we decrease this delay?

@gaius-qi
Member

Please try the latest version.

gaius-qi closed this as completed Dec 9, 2022