Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WAL recovery issue #37

Closed
weeco opened this issue Oct 31, 2021 · 10 comments
Closed

WAL recovery issue #37

weeco opened this issue Oct 31, 2021 · 10 comments

Comments

@weeco
Copy link

weeco commented Oct 31, 2021

Hey @nakabonne ,
I appreciate your quick work on the WAL, that's awesome! I just tried it and apparently I ran into an issue with the recovery from the WAL file. I let it run for ~2minutes where it added a few metrics into the TSDB and I noticed that the WAL file is also written. During recovery the following error popped up:

failed to recover WAL: failed to read WAL: encounter an error while reading WAL segment file "0": failed to read the metric name: unexpected EOF

My settings are:

	tsdbOpts := []tstorage.Option{
		tstorage.WithTimestampPrecision(tstorage.Seconds),
		tstorage.WithPartitionDuration(cfg.CacheRetention / 2),
		tstorage.WithDataPath(cfg.Persistence.Disk.DataPath),
		tstorage.WithRetention(cfg.Persistence.Disk.Retention),
	}

        storage, err := tstorage.NewStorage(tsdbOpts...)

I sent you the WAL file via EMail.

@weeco
Copy link
Author

weeco commented Oct 31, 2021

Another WAL related question: Is it expected to grow really fast? I don't scrape a lot of metrics (I'd guess 1000 metrics / 15s), but after 2 minutes the WAL file has already reached 11MB in file size.

@nakabonne
Copy link
Owner

@weeco Thanks for giving it a quick try, and good catch! Let me look into what's going on.

Another WAL related question: Is it expected to grow really fast?

To be honest, the current WAL format prioritizes simplicity. I'm in the middle of looking for another format that reduces the size and doesn't slow down writing and recovery.

Anyway, your try and report boost me a lot. A big thank you goes out to you!

@nakabonne
Copy link
Owner

It turns out the root cause is the unit to be flushed to disk is now left to bufio.Writer, which breaks off in the middle of the WAL row.

I'll fix it ASAP, but until then, possibly you can workaround by setting tstorage.WithWALBufferedSize(0) which specifies not to buffer any bytes at all.

@nakabonne
Copy link
Owner

root cause is the unit to be flushed to disk is now left to bufio.Writer

I haven't solved this problem yet, but for now I've made the WAL allow invalid rows at v0.3.1. You should now be able to recover with the same settings as before (the last few lines may still be discarded though).

@weeco
Copy link
Author

weeco commented Nov 1, 2021

I tried this version with two of my previously failed cases (the WAL and Disk files are stored in separate disks, hence easy to test). The first one reports this issue upon starting:

failed to read metadata: open config\tsdb-data\xy\p-1635675472-1635686277\meta.json: The system cannot find the file specified.

The second one threw a panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x20179e9]

goroutine 1 [running]:
github.com/nakabonne/tstorage.(*partitionNode).value(0x0, 0x0, 0x0)
	C:/Users/h9pe/go/pkg/mod/github.com/nakabonne/[email protected]/partition_list.go:228 +0x49
github.com/nakabonne/tstorage.(*partitionListImpl).getHead(0xc000590210, 0x0, 0x0)
	C:/Users/h9pe/go/pkg/mod/github.com/nakabonne/[email protected]/partition_list.go:67 +0xf5
github.com/nakabonne/tstorage.(*storage).ensureActiveHead(0xc0005a4000, 0x0, 0x0)
	C:/Users/h9pe/go/pkg/mod/github.com/nakabonne/[email protected]/storage.go:341 +0x87
github.com/nakabonne/tstorage.(*storage).InsertRows.func1(0x0, 0x0)
	C:/Users/h9pe/go/pkg/mod/github.com/nakabonne/[email protected]/storage.go:291 +0xf2
github.com/nakabonne/tstorage.(*storage).InsertRows(0xc0005a4000, 0xc0020d6000, 0x19667, 0x1d76d, 0x0, 0x0)
	C:/Users/h9pe/go/pkg/mod/github.com/nakabonne/[email protected]/storage.go:320 +0x2c2
github.com/nakabonne/tstorage.(*storage).recoverWAL(0xc0005a4000, 0xc00070c000, 0x20, 0x0, 0x0)
	C:/Users/h9pe/go/pkg/mod/github.com/nakabonne/[email protected]/storage.go:602 +0x4d6
github.com/nakabonne/tstorage.NewStorage(0xc00058c140, 0x4, 0x4, 0x0, 0x0, 0x0, 0x0)
	C:/Users/h9pe/go/pkg/mod/github.com/nakabonne/[email protected]/storage.go:242 +0x1a05
github.com/cloudhut/kowl/backend/pkg/tsdb.NewService(0x1, 0x13a52453c000, 0x0, 0x1, 0x1, 0xc00003d440, 0x1e, 0x0, 0xc00015ad80, 0xc000004210, ...)
	C:/Users/h9pe/Documents/projects/cloudhut/kafka-owl/backend/pkg/tsdb/service.go:43 +0x572
github.com/cloudhut/kowl/backend/pkg/api.New(0xc0002ff800, 0x0)
	C:/Users/h9pe/Documents/projects/cloudhut/kafka-owl/backend/pkg/api/api.go:72 +0x1234
main.main()
	C:/Users/h9pe/Documents/projects/cloudhut/kafka-owl/backend/cmd/api/main.go:21 +0x585

A WAL file written with v0.3.1 also caused the panic I pasted above.

@nakabonne
Copy link
Owner

Nice catch! I addressed the second one: https://github.com/nakabonne/tstorage/releases/tag/v0.3.2

But I'm not sure why the first one happen as I'm not familiar with Windows file system. Is there open config\tsdb-data\xy\p-1635675472-1635686277\meta.json on your disk? And isn't it corrupted?

@weeco
Copy link
Author

weeco commented Nov 1, 2021

@nakabonne Now I don't get an error on startup anymore, but at query/select time. The returned error says (I think the first part is from within Kowl):

failed to get datapoints from TSDB: failed to select data points: this partition is expired

--

But I'm not sure why the first one happen as I'm not familiar with Windows file system. Is there open config\tsdb-data\xy\p-1635675472-1635686277\meta.json on your disk? And isn't it corrupted?

The file does not exist, but I'm unsure why it doesn't exist. I'll keep observing it and I will test more as soon as WAL is a bit more stable. Given that I just run it in debug mode currently the missing/malfunctioning WAL functionality doesn't make it too much fun to test the TSDB atm, because it takes some time until the DB is filled with meaningful data :).

@nakabonne
Copy link
Owner

Now I don't get an error on startup anymore, but at query/select time. The returned error says (I think the first part is from within Kowl):

Seems like that partition is expired. tstorage periodically removes those expired partitions, but it shouldn't return an error when selecting from those. I fixed it to not return any error: v0.3.3
Thank you!

@weeco
Copy link
Author

weeco commented Nov 2, 2021

@nakabonne I think we made it!

image

Stopping the application abruptly and restarting it works just fine at the moment. I'll keep playing around with it, but generally it seems to work now :). Thank you for your support!

@weeco weeco closed this as completed Nov 2, 2021
@nakabonne
Copy link
Owner

Happy to hear that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants