No resource sharing observed in peers with Helm-based release #k8s #1344

Open
nitinpatil1992 opened this issue May 30, 2022 · 14 comments
@nitinpatil1992

nitinpatil1992 commented May 30, 2022

Bug report:

We have deployed Dragonfly on containerd-based host machines using Helm.

~ kgp -n dragonfly-system
NAME                                      READY   STATUS    RESTARTS   AGE
dragonfly-dfdaemon-2mhvp                  1/1     Running   4          5h48m
dragonfly-dfdaemon-jpsz8                  1/1     Running   4          5h48m
dragonfly-dfdaemon-lczff                  1/1     Running   4          5h48m
dragonfly-dfdaemon-lpdxq                  1/1     Running   4          5h48m
dragonfly-dfdaemon-qshhn                  1/1     Running   4          5h48m
dragonfly-dfdaemon-svgjj                  1/1     Running   3          5h48m
dragonfly-dfdaemon-wfwd2                  1/1     Running   4          5h48m
dragonfly-manager-5794bdfff-d6hzv         1/1     Running   0          2d15h
dragonfly-manager-5794bdfff-m244v         1/1     Running   0          2d15h
dragonfly-manager-5794bdfff-vzj6w         1/1     Running   0          2d15h
dragonfly-mysql-688dc67dcf-28fg6          1/1     Running   0          2d15h
dragonfly-redis-master-654c7d645b-mm29v   1/1     Running   0          2d15h
dragonfly-scheduler-0                     1/1     Running   0          2d15h
dragonfly-scheduler-1                     1/1     Running   0          5h45m
dragonfly-scheduler-2                     1/1     Running   0          5h34m
dragonfly-seed-peer-0                     1/1     Running   3          2d15h
dragonfly-seed-peer-1                     1/1     Running   0          5h43m
dragonfly-seed-peer-2                     1/1     Running   0          5h45m

But when we pull an image on one of the nodes, no peer-to-peer sharing is observed on the sibling nodes.

# ctr image pull docker.io/library/alpine:3.9
docker.io/library/alpine:3.9:                                                     resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:414e0518bb9228d35e4cd5165567fb91d26c6a214e9c95899e1e056fcd349011:    done           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:65b3a80ebe7471beecbc090c5b2cdd0aafeaefa0715f8f12e40dc918a3a70e32: done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:31603596830fc7e56753139f9c2c6bd3759e48a850659506ebfb885d1cf3aef5:    done           |++++++++++++++++++++++++++++++++++++++|
config-sha256:78a2ce922f8665f5a227dc5cd9fda87221acba8a7a952b9665f99bc771a29963:   done           |++++++++++++++++++++++++++++++++++++++|
elapsed: 2.2 s                                                                    total:  3.6 Ki (1.6 KiB/s)
unpacking linux/amd64 sha256:414e0518bb9228d35e4cd5165567fb91d26c6a214e9c95899e1e056fcd349011...
done
~ k exec -it dragonfly-dfdaemon-jpsz8  -n dragonfly-system -c dfdaemon -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Here is the containerd config:

# cat /etc/containerd/config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "<aws-registry>/eks/pause:3.1-eksbuild.1"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"

There are no peer logs for the image pull in the dfdaemon DaemonSet pods.

Expected behavior:

Logs should be present when grepping for "peer task done":

grep "peer task done" /var/log/dragonfly/daemon/core.log

How to reproduce it:

  1. Deploy the Helm chart in an EKS environment
  2. Wait for the Dragonfly resources to be running
  3. Grep the core logs in the dfdaemon DaemonSet pods

Environment:

  • Dragonfly version: v2.0.2
  • OS: Amazon Linux centos
  • Kernel (e.g. uname -a): Linux 5.10.112-108.499.amzn2.x86_64 #1 SMP Wed Apr 27 23:39:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Others:
@czomo

czomo commented May 30, 2022

Unfortunately, I am facing the same issue. Mentioned it here

@jim3ma
Member

jim3ma commented Jun 1, 2022

Can you paste the files in /etc/containerd/certs.d?
This directory contains the image registry mirror configuration.

Example: https://d7y.io/docs/setup/runtime/containerd/mirror#option-2-multiple-registries

For docker.io,

/etc/containerd/certs.d/docker.io/hosts.toml

server = "https://index.docker.io"

[host."http://127.0.0.1:65001"]
  capabilities = ["pull"]
  [host."http://127.0.0.1:65001".header]
    X-Dragonfly-Registry = ["https://index.docker.io"]
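A file like the one above can be dropped in place with a small script — a sketch only; `CERTS_DIR` defaults to a local demo directory here, while on a real node it would be `/etc/containerd/certs.d` (i.e. the `config_path` from config.toml), where containerd re-reads `certs.d` on each pull:

```shell
# Sketch: write the docker.io hosts.toml shown above.
# CERTS_DIR defaults to a local demo directory; on a real node
# it would be /etc/containerd/certs.d (matching config_path).
CERTS_DIR="${CERTS_DIR:-./certs.d-demo}"
mkdir -p "$CERTS_DIR/docker.io"
cat > "$CERTS_DIR/docker.io/hosts.toml" <<'EOF'
server = "https://index.docker.io"

[host."http://127.0.0.1:65001"]
  capabilities = ["pull"]
  [host."http://127.0.0.1:65001".header]
    X-Dragonfly-Registry = ["https://index.docker.io"]
EOF
echo "wrote $CERTS_DIR/docker.io/hosts.toml"
```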

@jim3ma jim3ma self-assigned this Jun 1, 2022
@czomo

czomo commented Jun 1, 2022

How about the single-registry option, i.e. Version 2 config without config_path? Is it supported?
In my case there is nothing under /etc/containerd/ other than config.toml (config-kops.yaml)

@jim3ma
Member

jim3ma commented Jun 1, 2022

How about the single-registry option, i.e. Version 2 config without config_path? Is it supported? In my case there is nothing under /etc/containerd/ other than config.toml (config-kops.yaml)

Yes, follow this https://d7y.io/docs/setup/runtime/containerd/mirror/#option-1-single-registry
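For the single-registry route described in those docs, the edit amounts to appending a mirrors section to config.toml and restarting containerd — roughly like this sketch; `CONFIG` defaults to a demo file here, the real path being `/etc/containerd/config.toml`:

```shell
# Sketch: append a docker.io mirror that tries the local dfdaemon
# proxy (port 65001) first and falls back to the upstream registry.
CONFIG="${CONFIG:-./config-demo.toml}"
cat >> "$CONFIG" <<'EOF'
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["http://127.0.0.1:65001", "https://registry-1.docker.io"]
EOF
# Unlike the certs.d route, this change needs a containerd restart:
#   systemctl restart containerd
echo "patched $CONFIG"
```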

@czomo

czomo commented Jun 1, 2022

I did that. The effects are similar to what @nitinpatil1992 wrote. Also deployed with Helm. Here is my config.toml:

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "k8s.gcr.io/pause:3.6"

    [plugins."io.containerd.grpc.v1.cri".containerd]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry-1.docker.io"]
  endpoint = ["http://127.0.0.1:65001","https://registry-1.docker.io"]

Dragonfly version: v2.0.2/v2.0.3 
OS: Ubuntu 20.04.3 LTS 
Kernel (e.g. uname -a): 5.11.0-1021-aws #22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Other: containerd://1.4.12

My Helm values:

containerRuntime:
  containerd:
    enable: true
    configFileName: "config-kops.toml"
manager:
  ingress:
    enable: true
    className: private
    hosts:
      - "dragonfly.example.com"
    tls:
      - secretName: secure-tls
        hosts:
          - "dragonfly.example.com"
cdn:
  enable: true

Is there anything I can provide to redirect us to correct path?

@jim3ma
Member

jim3ma commented Jun 1, 2022

Did you restart the containerd daemon ?

@jim3ma
Member

jim3ma commented Jun 1, 2022

In https://github.com/containerd/containerd/blob/main/docs/cri/registry.md, the mirror config is:

[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
...

not registry-1.docker.io

@czomo

czomo commented Jun 1, 2022

Did you restart the containerd daemon ?

Yeah, it is done by the Helm chart itself:

            if [[ "$need_restart" -gt 0 ]]; then
              nsenter -t 1 -m systemctl -- restart containerd.service
            fi

https://github.com/dragonflyoss/helm-charts/blob/b0bd87eeecb56da480161b8ba491acc3573be835/charts/dragonfly/templates/dfdaemon/dfdaemon-daemonset.yaml#L413
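The chart's restart gate boils down to a change check before that `nsenter` call — something like this sketch (file names here are stand-ins; the real chart tracks its generated config files and restarts containerd on the host via nsenter):

```shell
# Sketch of the chart's restart gate: only restart containerd
# when the rendered config actually differs from what is on disk.
CONFIG="./config-demo.toml"        # stand-in for the node's config.toml
NEW_CONFIG="./config-new.toml"     # stand-in for the chart-rendered config
printf 'version = 2\n' > "$CONFIG"
printf 'version = 2\n# mirror added\n' > "$NEW_CONFIG"

need_restart=0
if ! cmp -s "$CONFIG" "$NEW_CONFIG"; then
  cp "$NEW_CONFIG" "$CONFIG"
  need_restart=$((need_restart + 1))
fi

if [ "$need_restart" -gt 0 ]; then
  # On a real node this is: nsenter -t 1 -m systemctl -- restart containerd.service
  echo "restarting containerd"
fi
```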

not registry-1.docker.io

Bingo! It seems that the nodes have started exchanging blobs. 🤦‍♂️ Now I need to set up auth by providing Docker credentials.

Regarding the missing "peer task done" logs: how about documenting this better in the official Helm chart tutorial? I can add a note here. WDYT @jim3ma?

@jim3ma
Member

jim3ma commented Jun 2, 2022

It seems that containerd did not restart. You can check the logs of the update-containerd container in the dfdaemon pod.

@greenhandatsjtu
Contributor

I met a similar issue and followed the Containerd > Version 2 config with config_path instructions to set up the registry, and it then worked well.
This is what my /etc/containerd/certs.d/docker.io/hosts.toml looks like:

server = "https://registry-1.docker.io"
[host."http://localhost:65001"]
  capabilities = ["pull"]
  skip_verify = true

Then I pulled the image using ctr, specifying --hosts-dir:

ctr images pull --hosts-dir "/etc/containerd/certs.d" docker.io/library/alpine:latest

When the pull finished, I could find the related logs in dfdaemon (screenshot of the matching log lines omitted).

This issue comment may be helpful: containerd/containerd#5407 (comment)

@nitinpatil1992
Author

@jim3ma here is what our certs.d looks like:

ls -al /etc/containerd/certs.d
total 0
drwxr-xr-x 5 root root 62 Jun  9 09:52 .
drwxr--r-- 3 root root 40 Jun  9 09:52 ..
drwxr-xr-x 2 root root 24 Jun  9 09:52 ghcr.io
drwxr-xr-x 2 root root 24 Jun  9 09:52 harbor.example.com
drwxr-xr-x 2 root root 24 Jun  9 09:52 quay.io
$ cat /etc/containerd/certs.d/quay.io/hosts.toml
server = "https://quay.io"
[host."http://127.0.0.1:65001"]
  capabilities = ["pull", "resolve"]
  [host."http://127.0.0.1:65001".header]
  X-Dragonfly-Registry = ["https://quay.io"]

@czomo can you please share your full containerd config?
Also, did you use just the localhost endpoint to pull the image, or the actual registry host name?

@nitinpatil1992
Author

Also noticed in the dfget daemon config that the download settings have port 65000, but I couldn't find out where this is being exposed/used.

download:
  calculateDigest: true
  downloadGRPC:
    security:
      insecure: true
    unixListen:
      socket: /tmp/dfdamon.sock
  peerGRPC:
    security:
      insecure: true
    tcpListen:
      listen: 0.0.0.0
      port: 65000
  perPeerRateLimit: 100Mi
  totalRateLimit: 200Mi

@czomo

czomo commented Jun 11, 2022

@jim3ma here is what our certs.d looks like:

ls -al /etc/containerd/certs.d
total 0
drwxr-xr-x 5 root root 62 Jun  9 09:52 .
drwxr--r-- 3 root root 40 Jun  9 09:52 ..
drwxr-xr-x 2 root root 24 Jun  9 09:52 ghcr.io
drwxr-xr-x 2 root root 24 Jun  9 09:52 harbor.example.com
drwxr-xr-x 2 root root 24 Jun  9 09:52 quay.io
$ cat /etc/containerd/certs.d/quay.io/hosts.toml
server = "https://quay.io"
[host."http://127.0.0.1:65001"]
  capabilities = ["pull", "resolve"]
  [host."http://127.0.0.1:65001".header]
  X-Dragonfly-Registry = ["https://quay.io"]

@czomo can you please share your full containerd config? Also, did you use just the localhost endpoint to pull the image, or the actual registry host name?

I am using containerd 1.4.12 (1.5+ has a slightly different structure), hence there is no hosts.toml/certs.d and I am restricted to mirroring only one registry. This is what my final and full config looks like. It works, however I am hitting the pull rate limit (~35 nodes, 5k pods). I will be working on adding auth to it in the following week.

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "k8s.gcr.io/pause:3.6"

    [plugins."io.containerd.grpc.v1.cri".containerd]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["http://127.0.0.1:65001","https://docker.io"]

Also, did you use just the localhost endpoint to pull the image, or the actual registry host name?

Not sure about this one, but rather the localhost endpoint 127.0.0.1:65001 as above.

@TomasKohout

dragonfly version: 2.0.7
helm chart: 0.8.7

I don't know if I've hit the same issue, but I was able to make the image pull work for a private registry; unfortunately, the tasks are not distributed across the dfdaemon agents. Peer tasks only occur in the dfdaemon agent where I trigger the pull via crictl, and I'm thoroughly stuck on this.

My config for containerd:

[plugins."io.containerd.grpc.v1.cri".registry.configs."127.0.0.1:65001".auth]
  auth = "********"
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."my-private-registry.example.com"]
  endpoint = ["http://127.0.0.1:65001","https://my-private-registry.example.com"]

dfdaemon conf:

aliveTime: 0s
gcInterval: 1m0s
keepStorage: false
workHome: 
logDir: 
cacheDir: 
pluginDir: 
dataDir: /var/lib/dragonfly
console: true
health:
 path: /server/ping
 tcpListen:
   port: 40901
verbose: false
jaeger: http://dragonfly-jaeger-collector.dragonfly-system.svc.cluster.local:14268/api/traces
scheduler:
 manager:
   enable: true
   netAddrs:
     - addr: dragonfly-manager.dragonfly-system.svc.cluster.local:65003
       type: tcp
   refreshInterval: 5m
 scheduleTimeout: 30s
 disableAutoBackSource: false
 seedPeer:
   clusterID: 1
   enable: false
   type: super
host:
 idc: ""
 location: ""
 netTopology: ""
 securityDomain: ""
download:
 calculateDigest: true
 concurrent:
   goroutineCount: 10
   initBackoff: 0.5
   maxAttempts: 3
   maxBackoff: 3
   thresholdSize: 100M
   thresholdSpeed: 200M
 downloadGRPC:
   security:
     insecure: true
     tlsVerify: false
   unixListen:
     socket: /run/dragonfly/dfdaemon.sock
 peerGRPC:
   security:
     insecure: true
   tcpListen:
     port: 65000
 perPeerRateLimit: 512Mi
 prefetch: false
 totalRateLimit: 1024Mi
upload:
 rateLimit: 1024Mi
 security:
   insecure: true
   tlsVerify: false
 tcpListen:
   port: 65002
objectStorage:
 enable: false
 filter: Expires&Signature&ns
 maxReplicas: 3
 security:
   insecure: true
   tlsVerify: true
 tcpListen:
   port: 65004
storage:
 diskGCThreshold: 50Gi
 multiplex: true
 strategy: io.d7y.storage.v2.simple
 taskExpireTime: 6h
proxy:
 defaultFilter: Expires&Signature&ns
 defaultTag: 
 tcpListen:
   port: 65001
 security:
   insecure: true
   tlsVerify: false
 registryMirror:
   dynamic: false
   insecure: false
   url: https://my-private-registry.example.com
 proxies:
   - regx: blobs/sha256.*
security:
 autoIssueCert: false
 caCert: ""
 certSpec:
   validityPeriod: 4320h
 tlsPolicy: prefer
 tlsVerify: false
network:
 enableIPv6: false
announcer:
 schedulerInterval: 30s

scheduler conf:

server:
  port: 8002
  workHome: 
  logDir: 
  cacheDir: 
  pluginDir: 
  dataDir: 
scheduler:
  algorithm: default
  backSourceCount: 3
  gc:
    hostGCInterval: 1h
    peerGCInterval: 10s
    peerTTL: 24h
    taskGCInterval: 30m
  retryBackSourceLimit: 5
  retryInterval: 50ms
  retryLimit: 10
dynconfig:
  refreshInterval: 10s
  type: manager
host:
  idc: ""
  location: ""
  netTopology: ""
manager:
  addr: dragonfly-manager.dragonfly-system.svc.cluster.local:65003
  schedulerClusterID: 1
  keepAlive:
    interval: 5s
seedPeer:
  enable: true
job:
  redis:
    addrs:
    - dragonfly-redis-master.dragonfly-system.svc.cluster.local:6379
    host: dragonfly-redis-master.dragonfly-system.svc.cluster.local
    port: 6379
    password: dragonfly
storage:
  bufferSize: 100
  maxBackups: 10
  maxSize: 100
security:
  autoIssueCert: false
  caCert: ""
  certSpec:
    validityPeriod: 4320h
  tlsPolicy: prefer
  tlsVerify: false
network:
  enableIPv6: false
metrics:
  enable: false
  addr: ":8000"
  enablePeerHost: false
console: true
verbose: false
jaeger: http://dragonfly-jaeger-collector.dragonfly-system.svc.cluster.local:14268/api/traces

manager conf:

server:
 rest:
   addr: :8080
 grpc:
   port:
     start: 65003
     end: 65003
 workHome: 
 logDir: 
 cacheDir: 
 pluginDir: 
database:
 mysql:
   user: dragonfly
   password: dragonfly
   host: dragonfly-mysql.dragonfly-system.svc.cluster.local
   port: 3306
   dbname: manager
   migrate: true
 redis:
   addrs:
   - dragonfly-redis-master.dragonfly-system.svc.cluster.local:6379
   host: dragonfly-redis-master.dragonfly-system.svc.cluster.local
   port: 6379
   password: dragonfly
cache:
 local:
   size: 10000
   ttl: 10s
 redis:
   ttl: 30s
objectStorage:
 accessKey: ""
 enable: false
 endpoint: ""
 name: s3
 region: ""
 secretKey: ""
security:
 autoIssueCert: false
 caCert: ""
 caKey: ""
 certSpec:
   dnsNames:
   - dragonfly-manager
   - dragonfly-manager.dragonfly-system.svc
   - dragonfly-manager.dragonfly-system.svc.cluster.local
   ipAddresses: null
   validityPeriod: 87600h
 tlsPolicy: prefer
network:
 enableIPv6: false
metrics:
 enable: false
 addr: ":8000"
console: true
verbose: false
jaeger: http://dragonfly-jaeger-collector.dragonfly-system.svc.cluster.local:14268/api/traces

@gaius-qi gaius-qi removed the kind/bug label Feb 6, 2023