Inconsistent log #6316
Comments
@Zelldon asked to look for append failures prior to this occurring. Here is what I found (it is not clear yet whether it is actually related):
The first occurrence was also close to an incident: https://docs.google.com/document/d/1_8mUHzksCWEn5vsJ3datYe2bt3e90RYZvvTRtWCgFnI/edit#heading=h.e7don26yefrg |
Created snapshots for the zeebe-brokers: |
What I can see in the log before the inconsistency is detected is that we have a lot of started elections, which correlates with the metrics I commented on here: #6316 (comment) It is interesting that we took a snapshot with positions processedPosition=204621068, exporterPosition=204621065, and that the dispatcher seems to clear an append job from its actor queue.
If we look at the first occurrence of the inconsistency detection, it complains about:
|
If we take a closer look near the inconsistency detection, we see that the broker for that partition became INACTIVE, which normally indicates an error (e.g. an uncaught exception) in the raft implementation.
After searching for the cause I found the following:
|
Interestingly, only Broker one and two are really participating in the raft. Broker zero seems to be partitioned away. Plus, it was down for around 30 min, which might be related to the Google resource outage we experienced on this day.
|
After discussing and debugging this further together with @deepthidevaki, our current assumptions are the following:
The cluster seems to have recovered and keeps going. I can try to reproduce this via a chaos experiment where I introduce some network partitions. |
I tried to reproduce this with a cluster (one partition, three nodes, replication factor three) on stable nodes, but without success. I used the following script to cause repeated network partitions:
#!/bin/bash
set -exuo pipefail
# this scripts expects a setup of 5 nodes with replication factor 5 or higher
source utils.sh
partition=1
namespace=$(getNamespace)
gateway=$(getGateway)
broker0=$(getBroker "0")
broker0Ip=$(kubectl get pod "$broker0" -n "$namespace" --template="{{.status.podIP}}")
broker1=$(getBroker "1")
broker1Ip=$(kubectl get pod "$broker1" -n "$namespace" --template="{{.status.podIP}}")
broker2=$(getBroker "2")
broker2Ip=$(kubectl get pod "$broker2" -n "$namespace" --template="{{.status.podIP}}")
# To print the topology in the journal
kubectl exec "$gateway" -n "$namespace" -- zbctl status --insecure
# we put all into one function because we need to make sure that even after preemption the
# dependency is installed
function disconnect() {
toChangedPod="$1"
targetIp="$2"
# update to have access to ip
kubectl exec -n "$namespace" "$toChangedPod" -- apt update
kubectl exec -n "$namespace" "$toChangedPod" -- apt install -y iproute2
kubectl exec "$toChangedPod" -n "$namespace" -- ip route add unreachable "$targetIp"
}
function connect() {
toChangedPod="$1"
targetIp="$2"
# update to have access to ip
kubectl exec -n "$namespace" "$toChangedPod" -- apt update
kubectl exec -n "$namespace" "$toChangedPod" -- apt install -y iproute2
kubectl exec "$toChangedPod" -n "$namespace" -- ip route del unreachable "$targetIp"
}
function netloss() {
toChangedPod="$1"
# update to have access to ip
kubectl exec -n "$namespace" "$toChangedPod" -- apt update
kubectl exec -n "$namespace" "$toChangedPod" -- apt install -y iproute2
kubectl exec "$toChangedPod" -n "$namespace" -- tc qdisc add dev eth0 root netem loss 5%
}
#retryUntilSuccess netloss "$broker1"
#retryUntilSuccess netloss "$broker2"
retryUntilSuccess netloss "$broker0"
# We disconnect Broker 0 from the others; they can still send requests to it
#retryUntilSuccess disconnect "$broker0" "$broker1Ip"
#retryUntilSuccess disconnect "$broker0" "$broker2Ip"
# We disconnect Broker 1 from Broker 2, to cause some disruption in the cluster
retryUntilSuccess disconnect "$broker1" "$broker2Ip"
previousCoin=1
while true;
do
echo "Disconnected..."
sleep 5
if [ $previousCoin -eq 1 ]
then
retryUntilSuccess connect "$broker1" "$broker2Ip"
else
retryUntilSuccess connect "$broker2" "$broker1Ip"
fi
sleep 145
coin=$(($RANDOM%2))
if [ $coin -eq 1 ]
then
retryUntilSuccess disconnect "$broker1" "$broker2Ip"
else
retryUntilSuccess disconnect "$broker2" "$broker1Ip"
fi
previousCoin=$coin
echo "set prev coin: $previousCoin"
coin=$(($RANDOM%2))
if [ $coin -eq 1 ]
then
retryUntilSuccess connect "$broker0" "$broker1Ip"
sleep 45
retryUntilSuccess disconnect "$broker0" "$broker1Ip"
else
retryUntilSuccess connect "$broker0" "$broker2Ip"
sleep 45
retryUntilSuccess disconnect "$broker0" "$broker2Ip"
fi
done
I think what comes into play in the failing cluster is that we have multiple partitions and BIG state (~6 GiB of snapshots), which can consume a lot of resources when we try to replicate that. I can imagine that this also plays a role here. If we take a look at the metrics of the broken cluster we see, for example, quite high Java heap usage. |
I checked the data of Zeebe 1 with zdb:
We can see it reports for Broker 1 the same positions as above, and the index corresponds to the index where it tried to compact. When I check the entries we can see the following:
[zell data-1/ cluster: zeebe-long-running ns:zell-inconsistent-log]$ zdb log search -idx 123186195 -p raft-partition/partitions/4/
Searching log raft-partition/partitions/4
Found entry with index '123186195'
Indexed Entry: Indexed{index=123186195, entry=ZeebeEntry{term=2970, timestamp=1612966726056, lowestPosition=204620965, highestPosition=204620965}}
Searched log in 57 ms
[zell data-1/ cluster: zeebe-long-running ns:zell-inconsistent-log]$ zdb log search -idx 123186194 -p raft-partition/partitions/4/
Searching log raft-partition/partitions/4
Found entry with index '123186194'
Indexed Entry: Indexed{index=123186194, entry=ZeebeEntry{term=2984, timestamp=1612967753896, lowestPosition=204621050, highestPosition=204621051}}
Searched log in 58 ms
The Zeebe entry on index
Interesting is also when I search for
If we check the status of the log, the initial entries are printed and we see that their term is out of order.
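To make the "term out of order" observation concrete, here is a minimal Java sketch (illustrative only, not zdb code) of the invariant that is violated: in a raft log the term must never decrease as the index grows. It uses the two indexed entries reported by zdb above.

import java.util.List;

public final class TermOrderCheck {
  record Entry(long index, long term) {}

  static void checkTermOrder(final List<Entry> entries) {
    Entry previous = null;
    for (final Entry entry : entries) {
      if (previous != null && entry.term() < previous.term()) {
        System.out.printf(
            "inconsistent: index %d has term %d, but previous index %d has term %d%n",
            entry.index(), entry.term(), previous.index(), previous.term());
      }
      previous = entry;
    }
  }

  public static void main(final String[] args) {
    // The two indexed entries from the zdb output above: the later index carries the older term.
    checkTermOrder(List.of(new Entry(123186194, 2984), new Entry(123186195, 2970)));
  }
}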
It seems that the log is the same (and inconsistent) on the other nodes \cc @deepthidevaki |
It turned out that the logs were the same because the disks hadn't been properly mounted on the VM. After we fixed that, we now have access to the different logs, but the log still seems to be inconsistent on all nodes.
Status
Inconsistency:
|
The order of initial entries doesn't make any sense. The timestamps actually match actual leader transitions. |
I enriched the output to also contain the index of the initial entries:
Broker 0:
Broker-1
Broker-2
|
I have set up a benchmark on stable nodes with network partitioning and packet loss as described above, plus large state, so there are no workers to complete the jobs. It uses the 0.26.0 version, ran over the weekend and died, since it reached 14 GiB snapshot size. Still, the log seems to be consistent, so I was not able to reproduce it. What we need here is more information about what actually happened in Atomix, so more logs (Stackdriver), but unfortunately the Atomix log level is by default too high (warn). I will open an issue on the controller to lower the log level.
|
If you guys @npepinpe @deepthidevaki @miguelpires have any ideas just speak up, otherwise I will stop the further investigation until it happens again. I have created an issue on the controller to lower the Atomix log level: https://github.com/camunda-cloud/zeebe-controller-k8s/issues/415 |
I checked the logs from node 1 of the benchmark and it seems that one log is inconsistent 🚨
|
I investigated with @deepthidevaki a bit; it looks a bit like a different error. It seems that this entry is zeroed out, but it is still readable. We checked all nodes: Broker zero seems to be consistent, but on Broker 1 and 2 the partition is inconsistent, though at different indexes. It looks like it only affects the last entry.
Broker-1
Broker-2
Broker-0
Checking the logs in Stackdriver, I found several errors, for example a SIGSEGV on closing the broker (@npepinpe mentions it is known?). I also found some errors from the log appender which show that there are problems with appending
|
Together with @deepthidevaki we checked the logs and the related partition data of the benchmark further, but we found no real cause for this. It seems that this happens after becoming leader and after some appends have succeeded, or at least we see no initial entry before and after. It might also be that the log has been truncated, because the entries haven't been committed. We are not able to see what the first position is that the dispatcher is initialized to, which would help here; that might be something I can add as a log statement. In general it looks different from the error in the production system: here it is prevented from being written to the log, the inconsistency detected (by zdb) is only at the end of the log, and it looks like the entry is being zeroed. Furthermore, only two nodes have this inconsistency, at different indexes, since the log lengths are different. It might be due to the several out-of-direct-memory and SIGSEGV errors which we can see in the log; it is not clear. In the production system we can see that the inconsistency is really persisted in the log on all nodes and multiple records follow with the lower positions. The position is not zero, it is just decreased. The inconsistent partition contains only one segment, so it can't be the case that segments are read in the wrong order by the reader. There are several open questions:
|
I checked the latest KW benchmarks and was wondering why the KW-03 is failing. It seems it has the same issue:
It seems it is related again to big state and multiple leader changes. 🤔 |
OK, I think I have it. TL;DR: It seems that it can happen that the partition is not closed fast enough, which means we have a concurrent log appender on becoming leader again, which is writing an event to the stream. It is detected, but it will still be written to the stream.
Long version: Remember, each step is closed separately and one by one, so we wait until everything is stopped. At the end of the closing we close the dispatcher and log appender etc. The problem is that if something takes too long to close, the dispatcher/log appender is still open.
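To illustrate the close sequence just described, here is a minimal sketch with assumed names (not the actual ZeebePartition code): the steps are closed strictly one after another, so a slow early step keeps the dispatcher and log appender, which sit at the end of the chain, open in the meantime.

import java.util.List;
import java.util.concurrent.CompletableFuture;

public final class SequentialClose {
  interface PartitionStep {
    CompletableFuture<Void> close();
  }

  // Each step only starts closing after the previous one has finished, so the
  // log appender (last in the list) stays open while earlier steps are slow.
  static CompletableFuture<Void> closePartition(final List<PartitionStep> steps) {
    CompletableFuture<Void> closed = CompletableFuture.completedFuture(null);
    for (final PartitionStep step : steps) {
      closed = closed.thenCompose(ignored -> step.close());
    }
    return closed;
  }
}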
It seems that the appender has a lot of blocks to append, because this is logged quite often.
At some point we see that the partition switches the role to Candidate again on Broker-1; unfortunately, this is only half of the truth. Actually, inside Raft we already switch to leader again, but we are not able to see that in the log. This means that our appender is not able to append to the log again.
This is detected by our verification, which throws an exception:
If we check the actual code we can see why this is the case. In the ZeebeEntryValidator we check the positions: https://github.com/zeebe-io/zeebe/blob/develop/logstreams/src/main/java/io/zeebe/logstreams/impl/log/ZeebeEntryValidator.java#L24-L29
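For context, here is a simplified sketch of the kind of position check described above; the names and exact rule are assumptions, the linked ZeebeEntryValidator is authoritative. Each new entry has to continue directly after the last position we have seen, so an entry whose lowest position jumps backwards (as in the log above) is rejected.

public final class PositionCheck {
  private long lastPosition = -1;

  // Returns false for a gap or for positions going backwards, as seen in the
  // inconsistent log above; otherwise remembers the highest position.
  boolean isValid(final long lowestPosition, final long highestPosition) {
    if (lastPosition != -1 && lowestPosition != lastPosition + 1) {
      return false;
    }
    lastPosition = highestPosition;
    return true;
  }
}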
In the LeaderRole we use this code to validate the entry BEFORE we append it.
The issue here is that we are not returning, which means that after the role change to follower is done we still append the entry to the log. |
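A sketch of the control flow described above (type and method names are illustrative, they do not match the real LeaderRole code exactly): the validation failure is reported to the listener, but without an early return the entry is still appended.

public final class AppendSketch {
  interface AppendListener {
    void onWriteError(Throwable error);
  }

  private long lastPosition = -1;

  void safeAppend(final long lowestPosition, final long highestPosition,
      final AppendListener listener) {
    if (lastPosition != -1 && lowestPosition != lastPosition + 1) {
      listener.onWriteError(new IllegalStateException(
          "inconsistent positions: expected " + (lastPosition + 1) + " but got " + lowestPosition));
      // BUG: a "return;" is missing here, so execution falls through and the
      // invalid entry is still appended below.
    }
    appendToLog(lowestPosition, highestPosition);
  }

  private void appendToLog(final long lowestPosition, final long highestPosition) {
    // ... hand the entry to the raft log; in this sketch we only track the position ...
    lastPosition = highestPosition;
  }
}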
I think it is quite likely that this is also how it becomes inconsistent in the production env. We can also see these append failures in the log. We see the transition to candidate in the Zeebe partition, so it is likely that we had already become leader again in raft. Furthermore, it looks like we also replicate the entry if we successfully append it to the log, see here: https://github.com/zeebe-io/zeebe/blob/develop/atomix/cluster/src/main/java/io/atomix/raft/roles/LeaderRole.java#L589 This might cause further disturbance. I assume this is also related to the truncation error we see later on this node, where we were not able to truncate the log. I will fix this and then we can see whether the issues are gone or not. |
Good job @Zell 👍 😌 I think a part of the problem is also that we get the LeaderRole object every time we append. This means that we are trying to append with a LeaderRole object at a newer term, while the LogStorageAppender is from a previous term. If it were using the old LeaderRole object to append, it would not be able to append because the role is already closed. |
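To illustrate that point, a sketch under assumed names (not the actual Atomix API): an appender that keeps the LeaderRole it was created for, instead of looking up the current one on every append, fails fast after a role change because that role is already closed.

public final class TermBoundAppender {
  interface Role {
    boolean isClosed();
    void append(byte[] entry);
  }

  private final Role roleAtCreation;

  TermBoundAppender(final Role roleAtCreation) {
    this.roleAtCreation = roleAtCreation;
  }

  void append(final byte[] entry) {
    if (roleAtCreation.isClosed()) {
      // The role was closed by a leader transition; this appender belongs to an
      // older term and must not write into the log of the new leadership.
      throw new IllegalStateException("leader role of this appender is already closed");
    }
    roleAtCreation.append(entry);
  }
}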
Thanks @deepthidevaki and thanks for your help on investigating this 🙇♂️ I think you're right and we should solve this as well. I created a PR to prevent appending invalid entries: #6345 |
Describe the bug
Observed in logs: https://console.cloud.google.com/errors/CMyAu52UiKPKggE?service=zeebe&time=P7D&project=camunda-cloud-240911
Log/Stacktrace
Full Stacktrace
Logs
Environment: