Wait failed pieces greatly increase image download time #1083

Closed
likunbyl opened this issue Feb 18, 2022 · 8 comments

@likunbyl

Bug report:

I deployed dragonfly v2.0.2-rc.4 through Helm chart 0.5.38. While distributing an image to 16 nodes, the image download time was higher than before.

Previously, running dragonfly v2.0.2-alpha.6, the average image download time was 7.06s; now it is 8.98s.

I checked the logs and found that several dfdaemon instances had more than a 2s delay while waiting for failed pieces:

{"level":"info","ts":"2022-02-17 21:36:00.347","caller":"peer/peertask_conductor.go:888","msg":"get piece 34 from 10.21
8.45.144:65002/10.218.45.144-33797-ff76c585-7097-4f02-befe-464c15969c20, digest: 78d276a903dcd881a34c667172341c78, star
t: 142606336, size: 4194304","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7
ab985de93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:00.507","caller":"peer/peertask_conductor.go:888","msg":"get piece 35 from 10.21
8.45.144:65002/10.218.45.144-33797-ff76c585-7097-4f02-befe-464c15969c20, digest: 9f733a2d81932d68cc8df22f8fc49899, star
t: 146800640, size: 3093475","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7
ab985de93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:00.667","caller":"peer/peertask_conductor.go:779","msg":"all pieces requests sen
t, just wait failed pieces","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7a
b985de93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:03.272","caller":"peer/peertask_conductor.go:1160","msg":"peer task done, cost: 
4881ms","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7ab985de93c9f618e76fb3
6c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:03.272","caller":"peer/peertask_conductor.go:920","msg":"peer task success, stop
 to wait failed piece","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7ab985d
e93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:03.273","caller":"peer/peertask_conductor.go:1224","msg":"step 3: report success
ful peer result ok","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7ab985de93
c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}
{"level":"info","ts":"2022-02-17 21:36:03.273","caller":"peer/peertask_conductor.go:976","msg":"peer task success, peer
 download worker #0 exit","peer":"10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2","task":"73c8a51826d036f0f7ab9
85de93c9f618e76fb36c52a64706c7a51de353c8f2e","component":"PeerTask"}

From 21:36:00.667 to 21:36:03.272, there was a 2.6s delay.

Not all dfdaemon instances had the 2s delay, only 3 or 4 instances did, but most of the other instances downloaded the image from these delayed instances, so the average download time increased a lot.
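
To quantify this gap across instances, the logs can be scanned mechanically. Below is a minimal sketch (my own helper, not part of Dragonfly; the field names and message strings are taken from the log excerpt above) that reports, per peer, the time spent between "just wait failed pieces" and "peer task done":

// loggap.go: a minimal sketch, not part of Dragonfly. It scans dfdaemon
// JSON logs on stdin and reports, per peer, the time spent between
// "just wait failed pieces" and "peer task done".
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"time"
)

// logLine holds only the fields we need from each JSON log entry.
type logLine struct {
	TS   string `json:"ts"`
	Msg  string `json:"msg"`
	Peer string `json:"peer"`
}

func main() {
	const layout = "2006-01-02 15:04:05.000" // matches the "ts" format above
	waitStart := map[string]time.Time{}      // peer ID -> time the wait phase began

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
	for sc.Scan() {
		var l logLine
		if json.Unmarshal(sc.Bytes(), &l) != nil {
			continue // skip non-JSON lines
		}
		t, err := time.Parse(layout, l.TS)
		if err != nil {
			continue
		}
		switch {
		case strings.Contains(l.Msg, "just wait failed pieces"):
			waitStart[l.Peer] = t
		case strings.Contains(l.Msg, "peer task done"):
			if s, ok := waitStart[l.Peer]; ok {
				fmt.Printf("peer %s waited %v for failed pieces\n", l.Peer, t.Sub(s))
				delete(waitStart, l.Peer)
			}
		}
	}
}

Run as go run loggap.go < core-dfdaemon.log; on the excerpt above it would report roughly 2.605s for peer 10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2.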

Expected behavior:

Avoid this kind of delay and decrease the image download time.

How to reproduce it:

Use Helm chart 0.5.38 to deploy dragonfly v2.0.2-rc.4, then distribute an image to several nodes.

Environment:

  • Dragonfly version: v2.0.2-rc.4
  • OS: CentOS 7
  • Kernel (e.g. uname -a): Linux 3.10.0-1160.31.1.el7.x86_64
  • Others: Helm chart 0.5.38
@gaius-qi
Member

Use the latest Helm chart version (v0.5.46) with optimized image scheduling.

@jim3ma
Member

jim3ma commented Feb 18, 2022

Can you provide the full log of 10.218.43.61-6756-88d62873-1a33-46c6-b8cd-508e1855b1a2, with verbose: true set for debug logging?

@likunbyl
Author

The dfdaemon log:

core-dfdaemon-hq265.log

@likunbyl
Author

I tried Helm chart 0.5.46, which installed dragonfly v2.0.2-rc.9. It was amazing: the delay waiting for failed pieces was not that high, normally below 0.5s, which gave a stable image download time of about 6.3s.

And I noticed that all dfdaemons download the image from a CDN instance, which is different from the previous version, v2.0.2-rc.4, where they always tried to download the image from another dfdaemon. I guess that's part of the optimized image scheduling.

@jim3ma
Member

jim3ma commented Feb 23, 2022

the dfdaemon log:

core-dfdaemon-hq265.log

In this log:

grep 10.218.44.130-22037-0180d10d-9203-49fa-b624-b7bd10bf114a /Users/jim/Downloads/core-dfdaemon-hq265.log | grep -P "(wrote piece|peertask_stream.go:216|wait failed pieces|peer download worker|get piece \d+ from)"

The message "all pieces requests sent, just wait failed pieces" means that all piece requests have been sent to the piece download workers.
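
For context on that phase, here is an illustrative sketch (not Dragonfly's actual code; the names are mine) of the pattern the log messages describe: once every piece request has been handed to the download workers, the conductor only waits, retrying failed pieces until the task finishes:

package main

import (
	"fmt"
	"time"
)

// waitFailedPieces is illustrative only, not Dragonfly's implementation.
// After all piece requests are sent, the conductor blocks here until the
// task completes, retrying any piece whose download failed in the meantime.
func waitFailedPieces(failed <-chan int32, done <-chan struct{}, retry func(int32)) {
	for {
		select {
		case num := <-failed:
			retry(num) // re-schedule the failed piece, e.g. with another peer
		case <-done:
			return // "peer task success, stop to wait failed piece"
		}
	}
}

func main() {
	failed := make(chan int32, 1)
	done := make(chan struct{})
	failed <- 7 // pretend piece 7 failed once
	go func() {
		time.Sleep(10 * time.Millisecond) // pretend the remaining pieces finish
		close(done)
	}()
	waitFailedPieces(failed, done, func(n int32) { fmt.Println("retrying piece", n) })
}

Note that if a peer is merely slow rather than failing outright, nothing arrives on the failed channel and such a loop simply sits until the slow piece completes, which would match the gap seen here.
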
The last piece info was received at:

{"level":"info","ts":"2022-02-18 21:15:44.572","caller":"peer/peertask_conductor.go:888","msg":"get piece 35 from 10.218.43.158:65002/10.218.43.158-621-a5d0f644-4b4e-459e-b823-3fdce77c56dc, digest: c8b07f67516daf58e697b3d1f3320543, start: 146800640, size: 3093486","peer":"10.218.44.130-22037-0180d10d-9203-49fa-b624-b7bd10bf114a","task":"ffdf7bc24b28847f601b4b240662dbf551f4eeb015cc5583c97ff2e6e242f151","component":"PeerTask"}

But the pieces downloaded slowly from other peers, ending at:

{"level":"debug","ts":"2022-02-18 21:15:46.129","caller":"peer/peertask_stream.go:216","msg":"all 36 pieces wrote to pipe","peer":"10.218.44.130-22037-0180d10d-9203-49fa-b624-b7bd10bf114a","task":"ffdf7bc24b28847f601b4b240662dbf551f4eeb015cc5583c97ff2e6e242f151","component":"PeerTask"}

Do you limit upload or download to a low speed?

@likunbyl
Author

I'm using the default values.yaml from Helm chart 0.5.38:

dfdaemon:
  config:
    download:
      # -- Total download limit per second
      totalRateLimit: 200Mi
      # -- Per peer task limit per second
      perPeerRateLimit: 100Mi
      # -- Calculate digest; when only pulling images, this can be false to save CPU and memory
      calculateDigest: true
      downloadGRPC:
        # -- Download grpc security option
        security:
          insecure: true
        # -- Download service listen address
        # currently, only unix domain sockets are supported
        unixListen:
          socket: /tmp/dfdamon.sock
      peerGRPC:
        # -- Peer grpc security option
        security:
          insecure: true
        tcpListen:
          # -- Listen address
          listen: 0.0.0.0
          # -- Listen port
          port: 65000
    upload:
      # -- Upload limit per second
      rateLimit: 100Mi
      # -- Upload grpc security option
      security:
        insecure: true
      tcpListen:
        # -- Listen address
        listen: 0.0.0.0
        # -- Listen port
        port: 65002
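
For what it's worth, if these caps (200Mi total download, 100Mi per peer task, 100Mi upload) were the bottleneck, raising them should shrink the gap. A quick way to rule that out, reusing the keys from the chart values above (the numbers here are illustrative, not a recommendation):

dfdaemon:
  config:
    download:
      totalRateLimit: 1024Mi   # illustrative: lift the per-node download cap
      perPeerRateLimit: 512Mi  # illustrative: lift the per-task cap
    upload:
      rateLimit: 1024Mi        # illustrative: lift the upload cap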

@likunbyl
Author

So, although I got an ideal result from dragonfly v2.0.2-rc.9, I think that's because all these dfdaemons download the image from a CDN instance, so the wait-failed-pieces delay is not that big. Obviously, when the number of nodes increases to hundreds, a lot of downloading will happen between dfdaemons, and the wait-failed-pieces delay may increase again. How can we decrease this delay?

@gaius-qi
Member

Please try the latest version.

gaius-qi closed this as completed Dec 9, 2022