
Follower index is sometimes not cleaned up correctly #6237

Closed
ChrisKujawa opened this issue Jan 31, 2021 · 2 comments · Fixed by #6253
Labels: kind/bug, scope/broker, severity/mid

Comments

@ChrisKujawa (Member)

Describe the bug

TL;DR: It seems that the growing heap usage is related to the journal index of a follower.

We observed growing heap memory usage in the stable cluster benchmark.

[Screenshot: growing heap usage]

I took heap dumps at different times on different nodes, and I saw that when memory usage was quite high we often had a lot of long-lived Long objects. These objects are referenced, for example, by a TreeMap, which is used by the SparseJournalIndex, but some are also referenced by the ConcurrentSkipList which is used by the ZeebeIndexAdapter.
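For context, here is how such a sparse index can accumulate boxed Longs. This is a minimal sketch under assumed names and an assumed density value, not the actual Zeebe implementation:

```java
import java.util.TreeMap;

// Minimal sketch (class/method names and the density value are assumptions,
// not the real Zeebe code): a sparse index stores every Nth record index
// together with its position in the log segment.
class SparseIndexSketch {
  private final TreeMap<Long, Integer> indexToPosition = new TreeMap<>();
  private final int density = 100; // only every 100th record index is stored

  void index(long recordIndex, int position) {
    // Each put() boxes the record index into a java.lang.Long; these are the
    // kind of long-lived Long objects that show up in the heap dumps.
    if (recordIndex % density == 0) {
      indexToPosition.put(recordIndex, position);
    }
  }

  // Compaction should drop all entries below the given index. If this never
  // happens on a follower, the map only ever grows.
  void deleteUntil(long index) {
    indexToPosition.headMap(index).clear();
  }
}
```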

One of the heaps I checked was from node two, which was leader for partitions one and two and follower for partition three. This node had quite high heap usage.

[Screenshot: zb2-global]

We can see it had 15 million Long objects and a lot of TreeMap and ConcurrentSkipList nodes. Based on an OQL query, I was able to find relatively big TreeMaps.

[Screenshot: OQL query]

Based on that, I was able to find the SparseJournalIndex

[Screenshot: SparseJournalIndex]

and the related partition.

[Screenshot: partition]

If we check the ZeebeIndexAdapter, we see the usage of the ConcurrentSkipList.

[Screenshot: zb2-zeebeindexadapter]

Node 0

If we check another node (broker 0), which is a follower for all partitions, we see around 60k entries in the journal indexes.

[Screenshot: node0-sparseindex]

[Screenshot: node0-sparseindex2]

It seems to be a problem with compacting. After a leader change happened, we saw different heap usage for a short time.

I also checked the logs, and it seems that we are able to compact the log on node 2 for partition 3.

{
  insertId: "469zgyg1exlr5f"
  jsonPayload: {
    context: {
      actor-name: "Broker-2-DeletionService-3"
      loggerName: "io.zeebe.broker.logstreams.delete"
      threadId: 35
      threadName: "Broker-2-zb-actors-3"
      threadPriority: 5
    }
    logger: "io.zeebe.broker.logstreams.delete"
    message: "Compacting Atomix log up to index 741501199"
  }
}
D 2021-01-30T16:16:18.668634Z Compacting Atomix log up to index 741389699 
D 2021-01-30T16:17:18.921243Z Compacting Atomix log up to index 741501199 
D 2021-01-30T16:18:18.339542Z Compacting Atomix log up to index 741613099 
D 2021-01-30T16:19:18.670935Z Compacting Atomix log up to index 741725699 
D 2021-01-30T16:20:18.937142Z Compacting Atomix log up to index 741838299 
D 2021-01-30T16:21:19.167123Z Compacting Atomix log up to index 741945699 
D 2021-01-30T16:22:18.529814Z Compacting Atomix log up to index 742056799 

To Reproduce
Run a benchmark on stable nodes.

Expected behavior
The journal index does not grow that big and can be compacted.

Environment:

  • OS: Kubernetes
  • Zeebe Version: 1.0.0-SNAPSHOT
  • Configuration: [e.g. exporters etc.]
@ChrisKujawa added the labels kind/bug, scope/broker, severity/mid, and Status: Needs Priority on Jan 31, 2021
@ChrisKujawa (Member, Author) commented Feb 1, 2021

It seems the follower quite often receives a snapshot whose index is much bigger than the last index of the journal. This means it deletes the log and replaces it with the snapshot. The problem here is that the journal index is not reset and keeps growing.

See here:

D 2021-02-01T05:19:49.474454Z Purging snapshots older than FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/857218199-3-1612156788967-1061394335-1061392986, metadata=FileBasedSnapshotMetadata{index=857218199, term=3, timestamp=2021-02-01 05:19:48,967, processedPosition=1061394335, exporterPosition=1061392986}}  
D 2021-02-01T05:19:49.474697Z Deleting snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/857125799-3-1612156727609-1061277959-1061276960, metadata=FileBasedSnapshotMetadata{index=857125799, term=3, timestamp=2021-02-01 05:18:47,609, processedPosition=1061277959, exporterPosition=1061276960}}  
D 2021-02-01T05:19:49.476761Z Search for orphaned snapshots below oldest valid snapshot with index FileBasedSnapshotMetadata{index=857218199, term=3, timestamp=2021-02-01 05:19:48,967, processedPosition=1061394335, exporterPosition=1061392986} in /usr/local/zeebe/data/raft-partition/partitions/1/pending  
D 2021-02-01T05:19:49.477040Z Created new snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/857218199-3-1612156788967-1061394335-1061392986, metadata=FileBasedSnapshotMetadata{index=857218199, term=3, timestamp=2021-02-01 05:19:48,967, processedPosition=1061394335, exporterPosition=1061392986}}  
D 2021-02-01T05:19:49.477080Z Broker-0-DeletionService-1 Compacting Atomix log up to index 857218199
I 2021-02-01T05:19:49.477106Z RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Delete existing log (lastIndex '857212321') and replace with received snapshot (index '857218199')  
I 2021-02-01T05:19:49.541609Z RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Rolling back snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/pending/857218199-3-1612156788967-1061394335-1061392986-9251, snapshotStore=io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore@66597266, expectedId=java.nio.HeapByteBuffer[pos=0 lim=10 cap=10], metadata=FileBasedSnapshotMetadata{index=857218199, term=3, timestamp=2021-02-01 05:19:48,967, processedPosition=1061394335, exporterPosition=1061392986}}  
D 2021-02-01T05:19:49.541905Z DELETE dir /usr/local/zeebe/data/raft-partition/partitions/1/pending/857218199-3-1612156788967-1061394335-1061392986-9251  
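This suggests the kind of fix that ended up in #6253: when the follower deletes its log to install a received snapshot, the in-memory journal index has to be reset as well. A minimal sketch, with hypothetical interfaces and names (the actual Atomix API differs):

```java
// Hypothetical sketch; the real fix (#6253) compacts the whole index when
// the segments are reset.
class FollowerSnapshotInstaller {
  interface Journal { long getLastIndex(); void reset(long nextIndex); }
  interface JournalIndex { void clear(); }

  private final Journal journal;
  private final JournalIndex journalIndex;

  FollowerSnapshotInstaller(Journal journal, JournalIndex journalIndex) {
    this.journal = journal;
    this.journalIndex = journalIndex;
  }

  void onSnapshotReceived(long snapshotIndex) {
    if (snapshotIndex > journal.getLastIndex()) {
      // Delete the existing log and replace it with the received snapshot.
      journal.reset(snapshotIndex + 1);

      // The missing step this issue describes: without clearing the index,
      // the TreeMap / ConcurrentSkipList entries for the deleted log are
      // never removed and the follower's heap keeps growing.
      journalIndex.clear();
    }
  }
}
```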

Another issue I see here: it seems that after the log is deleted, the snapshot is rolled back and also deleted 👀 Looks problematic. Any thoughts @deepthidevaki @npepinpe?

Update: OK, never mind, it seems the logs are a bit confusing. "Created new snapshot..." is printed after the pending snapshot is moved, so it is not deleting the real snapshot after compacting. Still, it says it is trying to delete/roll back the pending one, which is weird.

@npepinpe (Member) commented Feb 1, 2021

Please backport this to 0.25 and 0.26, and we will create patch releases for both afterwards.

@ghost closed this as completed in 13758b4 on Feb 10, 2021
ghost pushed a commit that referenced this issue Feb 10, 2021
6292: [Backport stable/0.26] fix(atomix): compact all index when reseting the segments r=deepthidevaki a=deepthidevaki

## Description

Backport #6253

closes #6237 

Co-authored-by: Deepthi Devaki Akkoorath <[email protected]>
ghost pushed a commit that referenced this issue Feb 10, 2021
6291: [Backport stable/0.25] fix(atomix): compact all index when reseting the segments r=deepthidevaki a=deepthidevaki

## Description

Backport #6253 

closes #6237 


Co-authored-by: Deepthi Devaki Akkoorath <[email protected]>