msg="failed to cut traces" err="snappy: Writer is closed" #1374
Thanks for the detailed report @jvilhuber. This is really strange behaviour. Errors aren't supposed to kill ingesters, since they could be transient problems with the filesystem/kubernetes PVCs etc., but it is happening nevertheless. Digging through the code, the error originates from the snappy package.
The only place that error is ever used is in tempo/vendor/github.com/golang/snappy/encode.go, lines 282 to 289 in 49fbcf1.
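To see what that closed state means in practice, here is a small standalone Go program (my own illustration, not code from this repo): once Close is called on a snappy writer, its internal error is set, and every later Write returns exactly this message until Reset is called.

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	var buf bytes.Buffer
	w := snappy.NewBufferedWriter(&buf)

	_ = w.Close() // flushes and marks the writer as closed

	// Any write after Close fails with the error from this issue.
	if _, err := w.Write([]byte("hello")); err != nil {
		fmt.Println(err) // snappy: Writer is closed
	}

	// Reset clears the closed state and makes the writer usable again.
	w.Reset(&buf)
	if _, err := w.Write([]byte("hello")); err == nil {
		fmt.Println("writes succeed again after Reset")
	}
}
```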
And this is where the pooling logic that manages snappy writers lives in Tempo: tempo/tempodb/encoding/v2/pool.go, lines 284 to 305 in 49fbcf1.
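The important property of that pooling pattern, sketched below with hypothetical names (this is not the actual pool.go code), is that the same *snappy.Writer is handed out again later, so a writer that goes back into the pool in a closed state will fail for the next user unless Reset is called first.

```go
package pool

import (
	"io"
	"sync"

	"github.com/golang/snappy"
)

// writerPool reuses snappy writers across page cuts to avoid allocations.
var writerPool = sync.Pool{
	New: func() interface{} {
		return snappy.NewBufferedWriter(io.Discard)
	},
}

// getWriter hands out a pooled writer pointed at dst. The Reset here is what
// clears any leftover state, including the closed/errClosed state.
func getWriter(dst io.Writer) *snappy.Writer {
	w := writerPool.Get().(*snappy.Writer)
	w.Reset(dst)
	return w
}

// putWriter returns a writer to the pool. If the caller closed it and nothing
// resets it before the next use, every subsequent Write fails with
// "snappy: Writer is closed".
func putWriter(w *snappy.Writer) {
	writerPool.Put(w)
}
```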
Notice how the writer is closed after every use there. Somehow we are either double-closing the writer, or the writer is never reset after being closed once. This could potentially happen here: tempo/tempodb/encoding/v2/data_writer.go, lines 59 to 60 in 49fbcf1.
The writer is closed and reset soon after, but there is a line of code between the close and the reset.
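To make that ordering concrete, here is a hypothetical sketch of the shape of the problem (illustrative type, field, and method names, not the real data_writer.go): if the step between the close and the reset fails, the function exits early and the closed writer is never reset.

```go
package v2

import (
	"bytes"
	"io"

	"github.com/golang/snappy"
)

// dataWriter is a hypothetical stand-in for the real type; only the ordering
// of close / write / reset matters here.
type dataWriter struct {
	outputWriter      io.Writer
	compressionBuffer *bytes.Buffer
	compressedWriter  *snappy.Writer
}

func (d *dataWriter) CutPage() (int, error) {
	// 1. Close the snappy writer to flush its final frames into the buffer.
	if err := d.compressedWriter.Close(); err != nil {
		return 0, err
	}

	// 2. The line between the close and the reset. If this write fails for
	//    any reason, we return here...
	n, err := d.outputWriter.Write(d.compressionBuffer.Bytes())
	if err != nil {
		return 0, err // ...and the Reset below never runs.
	}

	// 3. Reset is what makes the pooled writer usable again; skipping it
	//    leaves a closed writer behind, and every later cut fails with
	//    "snappy: Writer is closed".
	d.compressedWriter.Reset(d.compressionBuffer)
	d.compressionBuffer.Reset()

	return n, nil
}
```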
I too have started getting this error today after upgrading to Tempo 1.3.2 (using the tempo-distributed helm chart).
Took less than a day to start happening again (we do have a lot of traces), and the error isn't what I expected (I thought we had checked for disk-space issues):
Update: I checked on that pod, and I see
so perhaps the disk-space issue is transient, but the ingester never recovers. Update 2: It seems our PVC was hovering around 80% full, and we do have some large traces. So I expect that periodically a large trace causes a disk-space issue and is then deleted (never triggering any alarms), but it causes the ingester to go bye-bye.
Aha! Thanks for the additional information; it looks like that is the culprit. Interesting that a transient disk issue could result in this behaviour and go practically undetected by the ingester. I think we should keep this issue open as a task to rearrange these lines: tempo/tempodb/encoding/v2/data_writer.go, lines 59 to 74 in 49fbcf1.
The resulting logic should be:
That way the writers/buffers will be reset wherever the function returns.
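One way to express that (a sketch only, reusing the hypothetical dataWriter type from the earlier snippet rather than the real code) is to move the reset into a defer, so it runs on every return path, including the early error returns:

```go
func (d *dataWriter) CutPage() (int, error) {
	// Reset runs on every return path, so a failed write can no longer
	// leave a closed writer behind for the next cut.
	defer func() {
		d.compressedWriter.Reset(d.compressionBuffer)
		d.compressionBuffer.Reset()
	}()

	if err := d.compressedWriter.Close(); err != nil {
		return 0, err
	}

	return d.outputWriter.Write(d.compressionBuffer.Bytes())
}
```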
Describe the bug
After some time, the ingesters seem to lose connection to something, and I start seeing 100+/sec of these errors:
level=error ts=2022-04-12T03:33:13.326708706Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"
When I look at the "Tempo Operational" dashboard (can't remember where I found it), Tempo claims to still be functioning normally (at least nothing jumps out at me; see image in comment below): "Traces Created" shows a normal graph. And yet searching for traces by ID in Tempo returns "not found" for most traces. The only symptom I noticed that might indicate things are not working as intended is that the block-related panels, such as "Blocks Flushed" and "Blocks Cleared", remain at 0: no activity.
When I restart the ingesters, the errors disappear for a while, searches for traces are successful again, and the blocks-panels show activity.
I'm not sure what else to look for, so please ask and I'll see what I can gather.
To Reproduce
I wish I knew. Let Tempo run in our deployment for a while and this starts happening.
Expected behavior
If some connection problem occurs, ingesters should close the connection and reconnect. An error message pointing at the source of the connection problem would also be useful, so it can be fixed.
Environment:
tempoVersion='1.3.2'
Additional Context
Compactor pods: 10, 10 GB RAM
Distributor pods: 8, 2 GB RAM
Ingester pods (stateful set): 10, 7 GB RAM, 5 GB storage via PVC
Querier pods: 3, 1.5 GB RAM
Query Frontend pods: 2, 2 GB RAM