
Pod recovery: pod IPs are not synced to nodes.conf #20

Closed
brucemei opened this issue Aug 24, 2020 · 10 comments

Comments

@brucemei

When one pod fails and is recreated, its own IP (the "myself" entry) in the persisted nodes.conf becomes invalid.

I suggest refreshing it at container start-up.
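Roughly what I have in mind, as a sketch only (the /data/nodes.conf path and the POD_IP environment variable are assumptions for illustration, not necessarily what the operator image uses):

refresh_nodes_conf() {
    # Replace the stale IP on this node's own ("myself") line with the pod's
    # current IP before redis-server reads the persisted cluster config.
    if [ -f /data/nodes.conf ] && [ -n "${POD_IP}" ]; then
        sed -i -e "/myself/ s/[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}/${POD_IP}/" /data/nodes.conf
    fi
}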

@iamabhishek-dubey
Member

Can you please share the logs and a screenshot? Also, are you using the latest version?

@brucemei
Author

brucemei commented Aug 28, 2020

The operator was deployed from the master branch source code.

When I apply redis.yaml (based on example/redis-cluster-example.yaml), the Redis master and slave come up successfully. If I then delete the Redis cluster with the same redis.yaml and re-apply it a few minutes later, the new pods run fine, but the Redis cluster is in a failed state: all pod IPs have changed, while the persisted nodes.conf in the PVCs still holds the old, now invalid, entries.
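In short, the steps I ran were roughly (sketch):

kubectl apply -f redis.yaml     # cluster forms; master and slave come up fine
kubectl delete -f redis.yaml    # pods and services are removed, but the PVCs (and nodes.conf in them) are retained
# a few minutes later
kubectl apply -f redis.yaml     # new pods get new IPs, while nodes.conf in the PVCs still lists the old ones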

My redis.yaml is below:

apiVersion: redis.opstreelabs.in/v1alpha1
kind: Redis
metadata:
  name: redis
spec:
  mode: cluster
  size: 3
  global:
    image: opstree/redis:v2.0
    imagePullPolicy: Always
    password: "N1A8mhMAVqxx"
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 100m
        memory: 128Mi
  master:
    service:
      type: ClusterIP
  slave:
    service:
      type: ClusterIP
  redisExporter:
    enabled: true
    image: quay.io/opstree/redis-exporter:1.0
    imagePullPolicy: Always
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 100m
        memory: 128Mi
  storage:
    VolumeClaimTemplates:
      spec:
        accessModes: 
          - ReadWriteOnce
        storageClassName: dev-ceph-block
        resources:
          requests:
            storage: 500M
      selector: {}

(screenshots attached)

@adevjoe

adevjoe commented Sep 9, 2020

I have the same problem sometimes. I tried restarting all the pods; my manifest and the resulting cluster state are below.

apiVersion: redis.opstreelabs.in/v1alpha1
kind: Redis
metadata:
  name: redis
spec:
  global:
    image: 'quay.io/opstree/redis:v2.0'
    imagePullPolicy: IfNotPresent
    password: Opstree@12345
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
  master:
    service:
      type: ClusterIP
  mode: cluster
  redisExporter:
    enabled: true
    image: 'quay.io/opstree/redis-exporter:1.0'
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
  size: 3
  slave:
    service:
      type: ClusterIP
  storage:
    volumeClaimTemplate:
      selector: {}
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
10.102.211.247:6379> cluster nodes
3af5f2fad054e7a898ce9604dab5f3b904d9872c 10.199.2.25:6379@16379 myself,master - 0 1599641626075 1 connected 0-5460
79ed1a5f8c8a57912d424ffa109798492a44f997 10.199.2.13:6379@16379 master,fail? - 1599641628581 1599641626075 3 connected 10923-16383
c48000914c0fdc73370fdb6832db0f9a7f616b86 10.199.2.17:6379@16379 slave,fail? 3af5f2fad054e7a898ce9604dab5f3b904d9872c 1599641627980 1599641626075 1 connected
dbb656d05bc5ace787665a268e533c9f625425b9 10.199.0.73:6379@16379 master,fail? - 1599641626978 1599641626075 4 connected 5461-10922
16e18678ec1ccf976014f192121dbb8e487e7b32 10.199.2.18:6379@16379 slave,fail? 79ed1a5f8c8a57912d424ffa109798492a44f997 1599641628581 1599641626075 3 connected
50148f4dd9e271803f21858951d3f5d4bd51b1d6 10.199.2.16:6379@16379 slave,fail? dbb656d05bc5ace787665a268e533c9f625425b9 1599641628581 1599641626075 4 connected
10.102.211.247:6379> cluster info
cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:5461
cluster_slots_pfail:10923
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:4
cluster_my_epoch:1
cluster_stats_messages_ping_sent:5
cluster_stats_messages_sent:5
cluster_stats_messages_received:0

@ianwatsonrh

ianwatsonrh commented Sep 16, 2020

Experiencing the same problem. Stale entries in nodes.conf are redirecting Redis clients to IPs that no longer exist.

I've worked around it quickly by extending the Redis image and amending the start_redis() command:

start_redis() {
    echo "Starting redis service....."
    # Announce this pod's current eth0 address so peers and clients are not pointed at a stale IP
    redis-server /etc/redis/redis.conf --cluster-announce-ip "$(ip addr show eth0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)"
}

A more elegant solution is referenced in redis/redis#4289, but it was easier for me to amend the image than the operator.

@ianwatsonrh

It seems this is partially fixed on GitHub but not yet on OperatorHub.

@iamabhishek-dubey
Member

Fixed in #26

@egorksv

egorksv commented Jan 30, 2024

/reopen

@egorksv

egorksv commented Jan 30, 2024

This is still an issue for new clusters: an old version of nodes.conf (i.e. from before a cluster rebuild) is retained in the PVC.

Suggestion: add functionality to the RedisCluster reconciler that runs CLUSTER RESET HARD on new nodes when the cluster state is "Bootstrap".
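As a manual stopgap, the same thing can be done by hand against each freshly created pod before it tries to rejoin (a sketch only; the pod names and password handling here are assumptions for illustration):

# Wipe the stale cluster state (node ID, slots, known peers) on each new pod
for pod in redis-cluster-0 redis-cluster-1 redis-cluster-2; do
    kubectl exec "$pod" -- redis-cli -a "$REDIS_PASSWORD" CLUSTER RESET HARD
done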

@drivebyer
Collaborator

> This is still an issue for new clusters: an old version of nodes.conf (i.e. from before a cluster rebuild) is retained in the PVC.

In my environment, this issue only occurs when all Redis nodes are recreated simultaneously, and they cannot distinguish themselves from one another. This does not happen during a rolling update.

@egorksv

egorksv commented Jan 31, 2024

That's exactly what I'm talking about: when you deploy a new cluster and don't get all of the configuration right the first time (ESPECIALLY SSL/TLS, which is notoriously error-prone), a full cluster restart is required, and the fastest way to achieve that is to kill all running pods, which leads directly to this problem.
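When that happens, the only clean rebuild I've found is to drop the retained PVCs along with the cluster, so no stale nodes.conf survives into the new deployment (a sketch; the label selector is a guess and needs to match your resources):

kubectl delete -f redis.yaml         # remove the cluster so the operator stops recreating pods
kubectl delete pvc -l app=redis      # drop the retained volumes that still hold the old nodes.conf
kubectl apply -f redis.yaml          # recreate the cluster from a clean state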
