msg="failed to cut traces" err="snappy: Writer is closed" #1374

Closed · jvilhuber opened this issue Apr 12, 2022 · 5 comments
jvilhuber commented Apr 12, 2022

Describe the bug
After some time, the ingesters seem to lose connection to something, and I start seeing 100+/sec of these errors:
level=error ts=2022-04-12T03:33:13.326708706Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"

When I look at the "Tempo Operational" dashboard (I can't remember where I found it), Tempo appears to still be functioning normally (at least nothing jumps out at me; see the image in a comment below): the "Traces Created" panel shows a normal graph. And yet searching for traces by ID in Tempo returns "not found" for most of them. The only symptom suggesting that things are not working as intended is in the block-related panels such as "Blocks Flushed" and "Blocks Cleared", which remain at 0: no activity.

When I restart the ingesters, the errors disappear for a while, searches for traces are successful again, and the blocks-panels show activity.

I'm not sure what else to look for, so please ask and I'll see what I can gather.

To Reproduce
I wish I knew. Let Tempo run in our deployment for a while and this starts happening.

Expected behavior
If some connection problem occurs, ingesters should close the connection and reconnect. An error would be useful to point us at the source of the connection problem, so it can be fixed.

Environment:
tempoVersion='1.3.2'

  • Infrastructure: Kubernetes, microservices
  • Deployment tool: manual jsonnet, derived from the helm output.

Additional Context
Compactor pods: 10, 10 GB RAM
Distributor pods: 8, 2 GB RAM
Ingester pods (StatefulSet): 10, 7 GB RAM, 5 GB storage via PVC
Querier pods: 3, 1.5 GB RAM
Query frontends: 2, 2 GB RAM

data:
  config.yaml: |-
    "compactor":
      "compaction":
        "block_retention": "144h"
        "chunk_size_bytes": 10485760
      "ring":
        "kvstore":
          "store": "memberlist"
    "distributor":
      "receivers":
        "jaeger":
          "protocols":
            "grpc": null
            "thrift_binary": null
            "thrift_compact": null
            "thrift_http": null
        "opencensus": null
        "otlp":
          "protocols":
            "grpc": null
            "http": null
        "zipkin": null
    "http_api_prefix": ""
    "ingester":
      "lifecycler":
        "ring":
          "kvstore":
            "store": "memberlist"
          "replication_factor": 3
      "max_block_bytes": 1000000
      "max_block_duration": "5m"
      "trace_idle_period": "10s"
    "memberlist":
      "abort_if_cluster_join_fails": false
      "bind_port": 7946
      "join_members":
      - "gossip-ring.tempo.svc.cluster.local:7946"
    "overrides":
      "max_bytes_per_trace": 5000000
      "max_search_bytes_per_trace": 30000
      "max_traces_per_user": 0
    "querier":
      "frontend_worker":
        "frontend_address": "query-frontend-discovery.tempo.svc.cluster.local:9095"
    "search_enabled": true
    "server":
      "grpc_server_max_recv_msg_size": 5000001
      "grpc_server_max_send_msg_size": 5000001
      "http_listen_port": 3200
      "log_level": "debug"
    "storage":
      "trace":
        "backend": "s3"
        "blocklist_poll": "5m"
        "cache": "memcached"
        "memcached":
          "consistent_hash": true
          "host": "memcached"
          "service": "memcached-client"
          "timeout": "200ms"
        "pool":
          "queue_depth": 4000
        "s3":
          "bucket": "XXX"
          "endpoint": "XXX"
          "insecure": false
          "region": "XXX"
        "wal":
          "path": "/var/tempo/wal"
jvilhuber (Author) commented:

You can see where I restarted the ingesters:
[screenshot: Tempo Operational dashboard around the ingester restart, 2022-04-12 13:24]

annanay25 (Contributor) commented Apr 12, 2022

Thanks for the detailed report @jvilhuber.

This is really strange behaviour. Errors aren't supposed to kill ingesters, since they could be transient problems with the filesystem, Kubernetes PVCs, etc., but the snappy: Writer is closed error is nevertheless confounding.

Digging through the code, the error originates from the snappy package here:

var errClosed = errors.New("snappy: Writer is closed")

The only place it is ever used is in the Close() function (which is good, because then we only need to check whether we are calling Close correctly):

func (w *Writer) Close() error {
	w.Flush()
	ret := w.err
	if w.err == nil {
		w.err = errClosed
	}
	return ret
}

And this is the pooling logic in Tempo that handles snappy writers:

// GetWriter gets or creates a new CompressionWriter and resets it to write to dst
func (pool *SnappyPool) GetWriter(dst io.Writer) (io.WriteCloser, error) {
	if w := pool.writers.Get(); w != nil {
		writer := w.(*snappy.Writer)
		writer.Reset(dst)
		return writer, nil
	}
	return snappy.NewBufferedWriter(dst), nil
}

// PutWriter places a CompressionWriter back in the pool
func (pool *SnappyPool) PutWriter(writer io.WriteCloser) {
	_ = writer.(*snappy.Writer).Close()
	pool.writers.Put(writer)
}

// ResetWriter implements WriterPool
func (pool *SnappyPool) ResetWriter(dst io.Writer, resetWriter io.WriteCloser) (io.WriteCloser, error) {
	writer := resetWriter.(*snappy.Writer)
	writer.Reset(dst)
	return writer, nil
}

Notice how the writer is closed on every PutWriter call and reset on every GetWriter. So this logic is sound.
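
For reference, the intended lifecycle of a pooled writer looks roughly like this (a minimal sketch, assuming a SnappyPool value pool, a destination writer dst, and a payload byte slice, none of which come from the snippet above):

w, err := pool.GetWriter(dst) // a reused writer is Reset(dst) here, which clears any previous Close
if err != nil {
	return err
}
defer pool.PutWriter(w) // Close() flushes and leaves the writer "closed" until the next Reset

if _, err := w.Write(payload); err != nil {
	return err
}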

But somehow we are either double closing the writer or the writer is never reset after closing once. This could potentially happen here:

// force flush everything
p.compressionWriter.Close()

The writer is closed and reset soon after, but there is a line of code between the close and the reset: marshalPageToWriter. If that call returns an error, the writer is never reset. So can you check whether there were any other errors in your logs before this one started popping up? Something about failing to write to an io.Writer? That could be the root cause of this error, in which case we just need to move those lines around.
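
To illustrate why a skipped reset turns into this exact error, here is a minimal standalone sketch against github.com/golang/snappy (not Tempo code; it only uses the public snappy API):

package main

import (
	"bytes"
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	var buf bytes.Buffer
	w := snappy.NewBufferedWriter(&buf)

	// Simulate the forced flush in the Tempo snippet above: Close() leaves a sticky error in the writer.
	w.Close()

	// Every write after Close fails with the error from the report.
	if _, err := w.Write([]byte("hello")); err != nil {
		fmt.Println(err) // snappy: Writer is closed
	}

	// Reset clears the sticky error, which is why the writer only recovers if ResetWriter actually runs.
	w.Reset(&buf)
	if _, err := w.Write([]byte("hello")); err == nil {
		fmt.Println("writer is usable again after Reset")
	}
}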

warrenhodg commented:

I too started getting this error today, after upgrading to Tempo 1.3.2 (using the tempo-distributed Helm chart).

jvilhuber (Author) commented Apr 13, 2022

Took less than a day to start happening again (we do have a lot of traces), and the error isn't what I expected (I thought we had checked for disk-space issues):

level=error ts=2022-04-12T12:32:07.515177529Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="write /var/tempo/wal/bb7fe8d2-e153-4228-a4a5-ece267c44885:single-tenant:v2:snappy:v1: no space left on device"

level=error ts=2022-04-12T12:32:07.618466804Z caller=flush.go:220 org_id=single-tenant msg="error performing op in flushQueue" op=0 block=b025060b-5964-42b5-bf22-1e0401dc18d0 attempts=1 err="write /var/tempo/wal/blocks/single-tenant/b025060b-5964-42b5-bf22-1e0401dc18d0/search-header: no space left on device"
	
level=error ts=2022-04-12T12:32:07.633761285Z caller=flush.go:220 org_id=single-tenant msg="error performing op in flushQueue" op=0 block=ad3f3469-9d91-4a13-be39-d3c9cf2fbe41 attempts=1 err="error appending to backend block: write /var/tempo/wal/blocks/single-tenant/ad3f3469-9d91-4a13-be39-d3c9cf2fbe41/search: no space left on device"
	
level=error ts=2022-04-12T12:32:14.072651869Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"
	
level=error ts=2022-04-12T12:32:24.07146046Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"
	
level=error ts=2022-04-12T12:32:34.091628266Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"
	
level=error ts=2022-04-12T12:32:39.834831296Z caller=flush.go:220 org_id=single-tenant msg="error performing op in flushQueue" op=0 block=ad3f3469-9d91-4a13-be39-d3c9cf2fbe41 attempts=2 err="error completing wal block with local backend: error completing compactor block: unexpected error writing index write /var/tempo/wal/blocks/single-tenant/ad3f3469-9d91-4a13-be39-d3c9cf2fbe41/index: no space left on device"
	
level=error ts=2022-04-12T12:32:39.964128374Z caller=flush.go:220 org_id=single-tenant msg="error performing op in flushQueue" op=0 block=b025060b-5964-42b5-bf22-1e0401dc18d0 attempts=2 err="error completing wal block with local backend: error completing compactor block: unexpected error writing index write /var/tempo/wal/blocks/single-tenant/b025060b-5964-42b5-bf22-1e0401dc18d0/index: no space left on device"
	
level=error ts=2022-04-12T12:32:44.073948104Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"

Update: I checked on that pod, and I see

/dev/nvme4n1              4976      4171       788  84% /var/tempo

so perhaps the disk-space issue is transient, but the ingester never recovers.

Update 2: It seems our PVC was hovering around 80% usage, and we do have some large traces. So I expect a large trace would periodically cause a disk-space issue and then be deleted (never triggering any alarms), while leaving the ingester broken.

annanay25 (Contributor) commented:

Aha! Thanks for the additional information; it looks like that is the culprit. Interesting that a transient disk issue could result in this behaviour and go practically undetected by the ingester.

I think we should keep this issue open as a task to rearrange these lines:

// force flush everything
p.compressionWriter.Close()

// now marshal the buffer as a page to the output
bytesWritten, err := marshalPageToWriter(p.compressedBuffer.Bytes(), p.outputWriter, constDataHeader)
if err != nil {
	return 0, err
}

// reset buffers for the next write
p.objectBuffer.Reset()
p.compressedBuffer.Reset()
p.compressionWriter, err = p.pool.ResetWriter(p.compressedBuffer, p.compressionWriter)
if err != nil {
	return 0, err
}

The resulting logic should be:

	// reset buffers for the next write, no matter how the function returns
	// (err must be a named return value for the deferred assignment to reach the caller)
	defer func() {
		p.objectBuffer.Reset()
		p.compressedBuffer.Reset()
		var resetErr error
		p.compressionWriter, resetErr = p.pool.ResetWriter(p.compressedBuffer, p.compressionWriter)
		if err == nil {
			err = resetErr
		}
	}()

	// force flush everything
	p.compressionWriter.Close()

	// now marshal the buffer as a page to the output
	bytesWritten, err := marshalPageToWriter(p.compressedBuffer.Bytes(), p.outputWriter, constDataHeader)
	if err != nil {
		return 0, err
	}

That way the writer and buffers will be reset wherever the function returns.
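
As a side note, for the deferred reset to be able to surface its own error, err has to be a named return value. A tiny standalone illustration of that pattern (not Tempo code; work and cleanup are hypothetical stand-ins for marshalPageToWriter and ResetWriter):

package main

import (
	"errors"
	"fmt"
)

// cleanup stands in for pool.ResetWriter; it may fail itself.
func cleanup() error { return nil }

// work stands in for marshalPageToWriter.
func work() (int, error) { return 0, errors.New("write failed") }

// doWrite uses a named return err, so the deferred cleanup can report a
// failure without masking an earlier error from work().
func doWrite() (n int, err error) {
	defer func() {
		if cleanupErr := cleanup(); err == nil {
			err = cleanupErr
		}
	}()
	return work()
}

func main() {
	_, err := doWrite()
	fmt.Println(err) // "write failed"; a cleanup error would surface only if work succeeded
}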
