msg="failed to cut traces" err="snappy: Writer is closed" #1374

Closed · jvilhuber opened this issue Apr 12, 2022 · 5 comments
jvilhuber commented Apr 12, 2022

Describe the bug
After some time, the ingesters seem to lose connection to something, and I start seeing 100+/sec of these errors:
level=error ts=2022-04-12T03:33:13.326708706Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"

When I look at the "Tempo Operational" dashboard (I can't remember where I found it), Tempo appears to still be functioning normally (at least nothing jumps out at me; see the image in a comment below): the "Traces Created" panel shows a normal graph. And yet searching for traces by ID in Tempo returns "not found" for most of them. The only symptom suggesting that things are not working as intended is in the block-related panels such as "Blocks Flushed" and "Blocks Cleared", which remain at 0: no activity.

When I restart the ingesters, the errors disappear for a while, searches for traces are successful again, and the blocks-panels show activity.

I'm not sure what else to look for, so please ask and I'll see what I can gather.

To Reproduce
I wish I knew. Let Tempo run in our deployment for a while and this starts happening.

Expected behavior
If some connection problem occurs, ingesters should close the connection and reconnect. An error would be useful to point us at the source of the connection problem, so it can be fixed.

Environment:
tempoVersion='1.3.2'

  • Infrastructure: Kubernetes, microservices
  • Deployment tool: manual jsonnet, derived from the helm output.

Additional Context
Compactor pods: 10, 10 GB RAM
Distributor pods: 8, 2 GB RAM
Ingester pods (StatefulSet): 10, 7 GB RAM, 5 GB storage via PVC
Querier pods: 3, 1.5 GB RAM
Query frontends: 2, 2 GB RAM

data:
  config.yaml: |-
    "compactor":
      "compaction":
        "block_retention": "144h"
        "chunk_size_bytes": 10485760
      "ring":
        "kvstore":
          "store": "memberlist"
    "distributor":
      "receivers":
        "jaeger":
          "protocols":
            "grpc": null
            "thrift_binary": null
            "thrift_compact": null
            "thrift_http": null
        "opencensus": null
        "otlp":
          "protocols":
            "grpc": null
            "http": null
        "zipkin": null
    "http_api_prefix": ""
    "ingester":
      "lifecycler":
        "ring":
          "kvstore":
            "store": "memberlist"
          "replication_factor": 3
      "max_block_bytes": 1000000
      "max_block_duration": "5m"
      "trace_idle_period": "10s"
    "memberlist":
      "abort_if_cluster_join_fails": false
      "bind_port": 7946
      "join_members":
      - "gossip-ring.tempo.svc.cluster.local:7946"
    "overrides":
      "max_bytes_per_trace": 5000000
      "max_search_bytes_per_trace": 30000
      "max_traces_per_user": 0
    "querier":
      "frontend_worker":
        "frontend_address": "query-frontend-discovery.tempo.svc.cluster.local:9095"
    "search_enabled": true
    "server":
      "grpc_server_max_recv_msg_size": 5000001
      "grpc_server_max_send_msg_size": 5000001
      "http_listen_port": 3200
      "log_level": "debug"
    "storage":
      "trace":
        "backend": "s3"
        "blocklist_poll": "5m"
        "cache": "memcached"
        "memcached":
          "consistent_hash": true
          "host": "memcached"
          "service": "memcached-client"
          "timeout": "200ms"
        "pool":
          "queue_depth": 4000
        "s3":
          "bucket": "XXX"
          "endpoint": "XXX"
          "insecure": false
          "region": "XXX"
        "wal":
          "path": "/var/tempo/wal"
jvilhuber (Author) commented:

You can see where I restarted the ingesters:
[screenshot: Tempo Operational dashboard around the ingester restart, 2022-04-12 13:24]

annanay25 (Contributor) commented Apr 12, 2022

Thanks for the detailed report @jvilhuber.

This is really strange behaviour. Errors aren't supposed to kill ingesters, since they could be transient problems with the filesystem, Kubernetes PVCs, etc., but the snappy: Writer is closed error is nevertheless confounding.

Digging through the code, the error originates from the snappy package here:

var errClosed = errors.New("snappy: Writer is closed")

The only place it is ever used is in the Close() function (which is good, because then we only need to check whether we are calling Close correctly):

func (w *Writer) Close() error {
	w.Flush()
	ret := w.err
	if w.err == nil {
		w.err = errClosed
	}
	return ret
}

And this is the pooling logic in Tempo that handles snappy writers:

// GetWriter gets or creates a new CompressionWriter and resets it to write to dst
func (pool *SnappyPool) GetWriter(dst io.Writer) (io.WriteCloser, error) {
	if w := pool.writers.Get(); w != nil {
		writer := w.(*snappy.Writer)
		writer.Reset(dst)
		return writer, nil
	}
	return snappy.NewBufferedWriter(dst), nil
}

// PutWriter places a CompressionWriter back in the pool
func (pool *SnappyPool) PutWriter(writer io.WriteCloser) {
	_ = writer.(*snappy.Writer).Close()
	pool.writers.Put(writer)
}

// ResetWriter implements WriterPool
func (pool *SnappyPool) ResetWriter(dst io.Writer, resetWriter io.WriteCloser) (io.WriteCloser, error) {
	writer := resetWriter.(*snappy.Writer)
	writer.Reset(dst)
	return writer, nil
}

Notice how the writer is closed on every PutWriter call and reset on every GetWriter. So this logic is sound.
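
For reference, the intended lifecycle of a pooled writer looks roughly like this (a minimal sketch, assuming a SnappyPool value pool, a destination writer dst, and a payload byte slice, none of which come from the snippet above):

w, err := pool.GetWriter(dst) // a reused writer is Reset(dst) here, which clears any previous Close
if err != nil {
	return err
}
defer pool.PutWriter(w) // Close() flushes and leaves the writer "closed" until the next Reset

if _, err := w.Write(payload); err != nil {
	return err
}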

But somehow we are either double closing the writer or the writer is never reset after closing once. This could potentially happen here:

// force flush everything
p.compressionWriter.Close()

The writer is closed and reset soon after, but there is a line of code between the close and the reset: marshalPageToWriter. If that call returns an error, the writer is never reset. So can you check whether there were any other errors in your logs before this one started popping up? Something about failing to write to an io.Writer? That could be the root cause of this error, in which case we just need to move those lines around.
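
To illustrate why a skipped reset turns into this exact error, here is a minimal standalone sketch against github.com/golang/snappy (not Tempo code; it only uses the public snappy API):

package main

import (
	"bytes"
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	var buf bytes.Buffer
	w := snappy.NewBufferedWriter(&buf)

	// Simulate the forced flush in the Tempo snippet above: Close() leaves a sticky error in the writer.
	w.Close()

	// Every write after Close fails with the error from the report.
	if _, err := w.Write([]byte("hello")); err != nil {
		fmt.Println(err) // snappy: Writer is closed
	}

	// Reset clears the sticky error, which is why the writer only recovers if ResetWriter actually runs.
	w.Reset(&buf)
	if _, err := w.Write([]byte("hello")); err == nil {
		fmt.Println("writer is usable again after Reset")
	}
}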

warrenhodg commented:

I too started getting this error today, after upgrading to Tempo 1.3.2 (using the tempo-distributed Helm chart).

jvilhuber (Author) commented Apr 13, 2022

Took less than a day to start happening again (we do have a lot of traces), and the error isn't what I expected (I thought we had checked for disk-space issues):

level=error ts=2022-04-12T12:32:07.515177529Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="write /var/tempo/wal/bb7fe8d2-e153-4228-a4a5-ece267c44885:single-tenant:v2:snappy:v1: no space left on device"

level=error ts=2022-04-12T12:32:07.618466804Z caller=flush.go:220 org_id=single-tenant msg="error performing op in flushQueue" op=0 block=b025060b-5964-42b5-bf22-1e0401dc18d0 attempts=1 err="write /var/tempo/wal/blocks/single-tenant/b025060b-5964-42b5-bf22-1e0401dc18d0/search-header: no space left on device"
	
level=error ts=2022-04-12T12:32:07.633761285Z caller=flush.go:220 org_id=single-tenant msg="error performing op in flushQueue" op=0 block=ad3f3469-9d91-4a13-be39-d3c9cf2fbe41 attempts=1 err="error appending to backend block: write /var/tempo/wal/blocks/single-tenant/ad3f3469-9d91-4a13-be39-d3c9cf2fbe41/search: no space left on device"
	
level=error ts=2022-04-12T12:32:14.072651869Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"
	
level=error ts=2022-04-12T12:32:24.07146046Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"
	
level=error ts=2022-04-12T12:32:34.091628266Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"
	
level=error ts=2022-04-12T12:32:39.834831296Z caller=flush.go:220 org_id=single-tenant msg="error performing op in flushQueue" op=0 block=ad3f3469-9d91-4a13-be39-d3c9cf2fbe41 attempts=2 err="error completing wal block with local backend: error completing compactor block: unexpected error writing index write /var/tempo/wal/blocks/single-tenant/ad3f3469-9d91-4a13-be39-d3c9cf2fbe41/index: no space left on device"
	
level=error ts=2022-04-12T12:32:39.964128374Z caller=flush.go:220 org_id=single-tenant msg="error performing op in flushQueue" op=0 block=b025060b-5964-42b5-bf22-1e0401dc18d0 attempts=2 err="error completing wal block with local backend: error completing compactor block: unexpected error writing index write /var/tempo/wal/blocks/single-tenant/b025060b-5964-42b5-bf22-1e0401dc18d0/index: no space left on device"
	
level=error ts=2022-04-12T12:32:44.073948104Z caller=flush.go:155 org_id=single-tenant msg="failed to cut traces" err="snappy: Writer is closed"

Update: I checked on that pod, and I see

/dev/nvme4n1              4976      4171       788  84% /var/tempo

so perhaps the disk-space issue is transient, but the ingester never recovers.

Update 2: It seems our PVC was hovering around 80% usage, and we do have some large traces. So I expect a large trace would periodically cause a disk-space issue and then be deleted (never triggering any alarms), while leaving the ingester broken.

annanay25 (Contributor) commented:

Aha! Thanks for the additional information; it looks like that is the culprit. Interesting that a transient disk issue could result in this behaviour and go practically undetected by the ingester.

I think we should keep this issue open as a task to rearrange these lines:

// force flush everything
p.compressionWriter.Close()

// now marshal the buffer as a page to the output
bytesWritten, err := marshalPageToWriter(p.compressedBuffer.Bytes(), p.outputWriter, constDataHeader)
if err != nil {
	return 0, err
}

// reset buffers for the next write
p.objectBuffer.Reset()
p.compressedBuffer.Reset()
p.compressionWriter, err = p.pool.ResetWriter(p.compressedBuffer, p.compressionWriter)
if err != nil {
	return 0, err
}

The resulting logic should be:

	// reset buffers for the next write, no matter how the function returns
	// (err must be a named return value for the deferred assignment to reach the caller)
	defer func() {
		p.objectBuffer.Reset()
		p.compressedBuffer.Reset()
		var resetErr error
		p.compressionWriter, resetErr = p.pool.ResetWriter(p.compressedBuffer, p.compressionWriter)
		if err == nil {
			err = resetErr
		}
	}()

	// force flush everything
	p.compressionWriter.Close()

	// now marshal the buffer as a page to the output
	bytesWritten, err := marshalPageToWriter(p.compressedBuffer.Bytes(), p.outputWriter, constDataHeader)
	if err != nil {
		return 0, err
	}

That way the writer and buffers will be reset wherever the function returns.
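
As a side note, for the deferred reset to be able to surface its own error, err has to be a named return value. A tiny standalone illustration of that pattern (not Tempo code; work and cleanup are hypothetical stand-ins for marshalPageToWriter and ResetWriter):

package main

import (
	"errors"
	"fmt"
)

// cleanup stands in for pool.ResetWriter; it may fail itself.
func cleanup() error { return nil }

// work stands in for marshalPageToWriter.
func work() (int, error) { return 0, errors.New("write failed") }

// doWrite uses a named return err, so the deferred cleanup can report a
// failure without masking an earlier error from work().
func doWrite() (n int, err error) {
	defer func() {
		if cleanupErr := cleanup(); err == nil {
			err = cleanupErr
		}
	}()
	return work()
}

func main() {
	_, err := doWrite()
	fmt.Println(err) // "write failed"; a cleanup error would surface only if work succeeded
}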
