Deleting values and running GC doesn't reclaim space #767
@magik6k Thanks for reporting this. I ran your test script and it looks like the GC didn't work (even with a 0.01 discard ratio). Let me dig deeper and get back.
@magik6k Looks like we have a bug in the Write Batch API. Your test passes if it is modified to use the Transaction API instead of Write Batch:

```diff
diff --git a/main_test.go b/main_test.go
index 0d4099b..8759f9b 100644
--- a/main_test.go
+++ b/main_test.go
@@ -13,6 +13,7 @@ import (
"github.com/dustin/go-humanize"
ds "github.com/ipfs/go-datastore"
+ "github.com/stretchr/testify/require"
"github.com/dgraph-io/badger"
)
@@ -44,25 +45,29 @@ func TestGc(t *testing.T) {
r := rand.New(rand.NewSource(555))
- wb := db.NewWriteBatch()
+ txn := db.NewTransaction(true)
for i := 0; i < preC; i++ { // put non-deletable entries
b, err := ioutil.ReadAll(io.LimitReader(r, entryS))
if err != nil {
t.Fatal(err)
}
- if err := wb.Set(ds.RandomKey().Bytes(), b, 0); err != nil {
+ if err := txn.Set(ds.RandomKey().Bytes(), b); err != nil {
t.Fatal(err)
}
+ if int64(i)%1000 == 0 {
+ require.NoError(t, txn.Commit())
+ txn = db.NewTransaction(true)
+ }
}
- if err := wb.Flush(); err != nil {
+ if err := txn.Commit(); err != nil {
t.Fatal(err)
}
pds(t, "non-deletable put")
- wb = db.NewWriteBatch()
+ txn = db.NewTransaction(true)
es := make([][]byte, entryC)
for i := 0; i < entryC; i++ { // put deletable entries
b, err := ioutil.ReadAll(io.LimitReader(r, entryS))
@@ -70,12 +75,19 @@ func TestGc(t *testing.T) {
t.Fatal(err)
}
es[i] = ds.RandomKey().Bytes()
- if err := wb.Set(es[i], b, 0); err != nil {
+ if err := txn.Set(es[i], b); err != nil {
t.Fatal(err)
}
+
+ if int64(i)%1000 == 0 {
+ if err := txn.Commit(); err != nil {
+ t.Fatal(err)
+ }
+ txn = db.NewTransaction(true)
+ }
}
- if err := wb.Flush(); err != nil {
+ if err := txn.Commit(); err != nil {
t.Fatal(err)
}
@@ -94,13 +106,24 @@ func TestGc(t *testing.T) {
pds(t, "del-open")
- wb = db.NewWriteBatch()
- for _, e := range es {
- if err := wb.Delete(e); err != nil {
+ txn = db.NewTransaction(true)
+ for i, e := range es {
+ if err := txn.Delete(e); err != nil {
t.Fatal(err)
}
+ if int64(i)%1000 == 0 {
+ if err := txn.Commit(); err != nil {
+ t.Fatal(err)
+ }
+ txn = db.NewTransaction(true)
+ }
+ }
+ if err := txn.Commit(); err != nil {
+ t.Fatal(err)
}
- if err := wb.Flush(); err != nil {
+ db.Close()
+ db, err = badger.Open(opts)
+ if err != nil {
t.Fatal(err)
}
```
NOTE - It is important that the DB is closed and reopened. We perform compaction when the DB is closed. Without compaction, the GC wouldn't be able to free up the space. Compaction happens automatically, but in this case, since there isn't enough data for compaction to be triggered, we force compaction by closing the DB. This is what I get on running the script above:
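For clarity, here is a minimal sketch of how the modified test ends, assuming `opts`, `db`, and `t` come from the gist above: close and reopen to force compaction, then loop value log GC until it reports nothing left to rewrite.

```go
// Sketch only; `opts`, `db`, and `t` are assumed from the test above.
db.Close()
db, err = badger.Open(opts) // closing triggers compaction; reopen to continue
if err != nil {
	t.Fatal(err)
}
// RunValueLogGC returns badger.ErrNoRewrite once no vlog file meets
// the discard ratio, so loop until the first non-nil error.
for db.RunValueLogGC(0.5) == nil {
}
```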
Is there a way to trigger compaction without closing the database?
Try running the code referenced at lines 1185 to 1191 in 1fcc96e.
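Assuming the lines referenced above point at Badger's `Flatten` API (an assumption; the embedded snippet did not survive extraction), a minimal sketch of compacting without closing the DB could look like this:

```go
// Assumption: DB.Flatten force-compacts the LSM tree in place, so the DB
// does not need to be closed and reopened before value log GC.
if err := db.Flatten(2); err != nil { // 2 = number of compaction workers
	t.Fatal(err)
}
// With compaction done, value log GC has fresh discard statistics.
if err := db.RunValueLogGC(0.5); err != nil && err != badger.ErrNoRewrite {
	t.Fatal(err)
}
```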
Every Transaction stores the latest value of `readTs` it is aware of. When the transaction is discarded (which happens even when we commit), the global value of `readMark` is updated. See https://github.com/dgraph-io/badger/blob/1fcc96ecdb66d221df85cddec186b6ac7b6dab4b/txn.go#L501-L503

Previously, the `readTs` of the transaction inside the write batch struct was set to 0, so the global value of `readMark` would also be set to 0 (unless someone ran a transaction after using write batch). Due to the 0 value of the global `readMark`, the compaction algorithm would skip all the values which were inserted in the write batch call. See https://github.com/dgraph-io/badger/blob/1fcc96ecdb66d221df85cddec186b6ac7b6dab4b/levels.go#L480-L484 and https://github.com/dgraph-io/badger/blob/1fcc96ecdb66d221df85cddec186b6ac7b6dab4b/txn.go#L138-L145: the `o.readMark.DoneUntil()` call would always return `0`, and so the compaction wouldn't compact the newer values.

With this commit, the compaction algorithm works as expected with key-values inserted via the Transaction API or via the Write Batch API. See dgraph-io#767
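To make the failure mode concrete, here is a conceptual sketch (not Badger's actual code) of the watermark check the commit message describes; the function and parameter names are illustrative:

```go
// Conceptual sketch of the compaction check described above. An older
// version of a key may be discarded only if its commit timestamp is at or
// below the global read watermark (readMark.DoneUntil()). A write-batch
// transaction that reported readTs=0 pinned that watermark at 0, so no
// version was ever eligible for discard.
func canDiscard(versionTs, doneUntil uint64, hasNewerVersion bool) bool {
	return hasNewerVersion && versionTs <= doneUntil
}
```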
@magik6k With #778, the script in https://gist.github.com/magik6k/8c379cc02b443495e4809170fb8803a9 would still produce similar results, but the values would be removed eventually. The GC would reclaim the space after some time.
Hello @jarifibrahim, I have the same problem: I see a disk-space leak when trying to delete keys/values using the Write Batch API. The commit: d98dd68#diff-42ea5667b327bb207485077410d5f499 How about reopening this issue?
@linxGnu The value log GC isn't supposed to reclaim space immediately. The change in #778 was reverted because we had issues with it. The issue here isn't with the GC; it's with the Write Batch API. You need not worry about the GC: it will eventually clean up the space. There are multiple factors involved when it tries to find a vlog file to clean. Take a look at the following script, which works perfectly fine: https://gist.github.com/jarifibrahim/78621293e68dffbc30be860f3c9df549#file-main_test-go, and its output: https://gist.github.com/jarifibrahim/78621293e68dffbc30be860f3c9df549#file-output-txt. I am not sure this is an actual bug. I mean, the GC did reclaim space; it just didn't do it immediately.
@jarifibrahim Thank you very much for the details. I will take another look and report back if I still see disk space not being reclaimed when using the Write Batch API 👍
@linxGnu Just to help you understand how GC works:
Value log GC is supposed to clean up space eventually. There might be cases when the GC doesn't clean up the data right away, but it will be cleaned up eventually.
Reading the Badger issues list yielded the following: RunValueLogGC() does clean up online, but a small database (150MB) is not big enough for compaction to be triggered, and then the only way to update stats for GC is to close the DB. See the note in dgraph-io/badger#767 (comment). Based on this information, the logic is redone to call Close only if RunValueLogGC did not succeed.
Hello @jarifibrahim, I have a couple of questions regarding BadgerDB's garbage collection and file-deletion process:

1. Does BadgerDB delete files automatically, or do users need to call RunValueLogGC at intervals to delete discarded files after compaction?
2. Is there any way to obtain information about which files will be deleted by RunValueLogGC before they are actually deleted? This would be helpful if we want to store this data in a cheaper storage solution for backup or archiving purposes.

I would appreciate any insights or guidance you can provide on these topics. Thank you!
Hi @ashish314, if you want to take a backup of the data in Badger, you should use the Backup APIs. You can see more details here: https://dgraph.io/docs/badger/get-started/#database-backup/ I am not sure whether you need to call RunValueLogGC periodically; it seems to me that you should.
Hi @ashish314!
Badger will perform cleanup automatically, meaning it will delete old data and files on its own.
We don't expose this information, but you shouldn't need to worry about that data: GC removes only deleted/expired/duplicate/stale data, and all the useful data is kept as-is. You can take periodic backups of your data using the backup API if you'd like.
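For reference, a minimal sketch of the periodic value-log GC loop along the lines of what Badger's documentation suggests; the 5-minute interval and 0.7 discard ratio are illustrative choices, not requirements:

```go
// Illustrative background GC loop; assumes the "time" and badger imports
// and an open *badger.DB. Tune the interval and ratio per workload.
ticker := time.NewTicker(5 * time.Minute)
defer ticker.Stop()
for range ticker.C {
	// Repeat until a pass rewrites nothing (badger.ErrNoRewrite).
	for db.RunValueLogGC(0.7) == nil {
	}
}
```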
I was trying to get Badger GC in go-ipfs to reclaim space, but it didn't seem to work, so I wrote this rather basic test case to see if it works in the simple case of keys being added, then deleted, and GC run:
https://gist.github.com/magik6k/8c379cc02b443495e4809170fb8803a9
EDIT: gist / results updated, as I discovered that I was calling Delete on the wrong data...
These are my (reproducible) results:
- DiscardRatio=0.5

Cases below are wrong because of a bug in my code:
- DiscardRatio=0.01
- DiscardRatio=0.5
- DiscardRatio=0.9