Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logs aggregator dies/is stopped #1832

Open
mieubrisse opened this issue Nov 19, 2023 · 3 comments
Open

Logs aggregator dies/is stopped #1832

mieubrisse opened this issue Nov 19, 2023 · 3 comments
Assignees
Labels
bug Something isn't working cli For bugs relating to the CLI painful Painful bug

Comments

@mieubrisse
Copy link
Collaborator

What's your CLI version?

0.85.29

Description & steps to reproduce

I tried a kurtosis run today, and got this:

INFO[2023-11-19T13:17:16-03:00] No Kurtosis engine was found; attempting to start one...
INFO[2023-11-19T13:17:16-03:00] Starting the centralized logs components...
INFO[2023-11-19T13:17:16-03:00] Centralized logs components started.
INFO[2023-11-19T13:17:17-03:00] Successfully started Kurtosis engine
INFO[2023-11-19T13:17:17-03:00] Creating a new enclave for Starlark to run inside...
Error:  An error occurred running command 'run'
  Caused by: An error occurred calling the run function for command 'run'
  Caused by: An error occurred getting the enclave context for enclave ''
  Caused by: Unable to create new enclave with name ''
  Caused by: An error occurred creating an enclave with name ''
  Caused by: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: INTERNAL_ERROR

I dumped the engine logs, and found:

2023-11-19 13:17:17 WARN[2023-11-19T16:17:17Z][docker_kurtosis_backend.go:CreateLogsCollectorForEnclave] Logs aggregator exists but is not running. Instead container status is 'STOPPED'. This is unexpected as docker should have restarted the container automatically. 
2023-11-19 13:17:17 WARN[2023-11-19T16:17:17Z][docker_kurtosis_backend.go:CreateLogsCollectorForEnclave] This can be fixed by restarting the engine using `kurtosis engine restart` and attempting to create the enclave again. 
2023-11-19 13:17:17 2023/11/19 16:17:17 http2: panic serving 172.17.0.1:36748: Propagate must be provided with a cause
2023-11-19 13:17:17 goroutine 24 [running]:
2023-11-19 13:17:17 golang.org/x/net/http2.(*serverConn).runHandler.func1()
2023-11-19 13:17:17     /home/circleci/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2361 +0x140
2023-11-19 13:17:17 panic({0x135bb60, 0x1a0e0c0})
2023-11-19 13:17:17     /usr/local/go/src/runtime/panic.go:884 +0x20c
2023-11-19 13:17:17 github.com/kurtosis-tech/stacktrace.Propagate(...)
2023-11-19 13:17:17     /home/circleci/go/pkg/mod/github.com/kurtosis-tech/[email protected]/stacktrace.go:85
2023-11-19 13:17:17 github.com/kurtosis-tech/kurtosis/container-engine-lib/lib/backend_impls/docker/docker_kurtosis_backend.(*DockerKurtosisBackend).CreateLogsCollectorForEnclave(0x4000098d80, {0x1a30d50, 0x4000556100}, {0x40004950e0, 0x20}, 0x2?, 0x0?)
2023-11-19 13:17:17     /home/circleci/project/container-engine-lib/lib/backend_impls/docker/docker_kurtosis_backend/docker_kurtosis_backend.go:424 +0x238
2023-11-19 13:17:17 github.com/kurtosis-tech/kurtosis/container-engine-lib/lib/backend_impls/docker/docker_kurtosis_backend.(*DockerKurtosisBackend).CreateEnclave(0x4000098d80, {0x1a30d50, 0x4000556100}, {0x40004950e0, 0x20}, {0x40001455b0, 0xe})
2023-11-19 13:17:17     /home/circleci/project/container-engine-lib/lib/backend_impls/docker/docker_kurtosis_backend/docker_kurtosis_backend_enclave_functions.go:152 +0xc4c
2023-11-19 13:17:17 github.com/kurtosis-tech/kurtosis/container-engine-lib/lib/backend_impls/metrics_reporting.(*MetricsReportingKurtosisBackend).CreateEnclave(0x0?, {0x1a30d50?, 0x4000556100?}, {0x40004950e0, 0x20}, {0x40001455b0?, 0x40004bac60?})
2023-11-19 13:17:17     /home/circleci/project/container-engine-lib/lib/backend_impls/metrics_reporting/metrics_reporting_kurtosis_backend.go:114 +0x3c
2023-11-19 13:17:17 github.com/kurtosis-tech/kurtosis/engine/server/engine/enclave_manager.(*EnclaveCreator).CreateEnclave(0x40001c2800, {0x1a30d50, 0x4000556100}, {0x0, 0x0}, 0x0?, {0x40001455b0, 0xe}, {0x4000331096, 0x2}, ...)
2023-11-19 13:17:17     /home/circleci/project/engine/server/engine/enclave_manager/enclave_creator.go:57 +0xc4
2023-11-19 13:17:17 github.com/kurtosis-tech/kurtosis/engine/server/engine/enclave_manager.(*EnclaveManager).CreateEnclave(0x4000149900, {0x1a30d50, 0x4000556100}, {0x4000331089, 0x7}, {0x0, 0x0}, 0x0?, {0x0, 0x0}, ...)
2023-11-19 13:17:17     /home/circleci/project/engine/server/engine/enclave_manager/enclave_manager.go:180 +0x380
2023-11-19 13:17:17 github.com/kurtosis-tech/kurtosis/engine/server/engine/server.(*EngineConnectServerService).CreateEnclave(0x4000132620, {0x1a30d50, 0x4000556100}, 0x22ef8?)
2023-11-19 13:17:17     /home/circleci/project/engine/server/engine/server/engine_connect_server_service.go:103 +0x1c0
2023-11-19 13:17:17 connectrpc.com/connect.NewUnaryHandler[...].func1({0x1a35830, 0x4000132e70})
2023-11-19 13:17:17     /home/circleci/go/pkg/mod/connectrpc.com/[email protected]/handler.go:52 +0x14c
2023-11-19 13:17:17 connectrpc.com/connect.NewUnaryHandler[...].func2({0xffff831585f8, 0x400054a040})
2023-11-19 13:17:17     /home/circleci/go/pkg/mod/connectrpc.com/[email protected]/handler.go:81 +0x294
2023-11-19 13:17:17 connectrpc.com/connect.(*Handler).ServeHTTP(0x40001c1080, {0x1a24040, 0x4000550018}, 0x400055a200)
2023-11-19 13:17:17     /home/circleci/go/pkg/mod/connectrpc.com/[email protected]/handler.go:239 +0x628
2023-11-19 13:17:17 github.com/kurtosis-tech/kurtosis/api/golang/engine/kurtosis_engine_rpc_api_bindings/kurtosis_engine_rpc_api_bindingsconnect.NewEngineServiceHandler.func1({0x1a24040, 0x4000550018}, 0x400055a200)
2023-11-19 13:17:17     /home/circleci/project/api/golang/engine/kurtosis_engine_rpc_api_bindings/kurtosis_engine_rpc_api_bindingsconnect/engine_service.connect.go:269 +0x220
2023-11-19 13:17:17 net/http.HandlerFunc.ServeHTTP(0x4000501860?, {0x1a24040?, 0x4000550018?}, 0x0?)
2023-11-19 13:17:17     /usr/local/go/src/net/http/server.go:2109 +0x38
2023-11-19 13:17:17 net/http.(*ServeMux).ServeHTTP(0x40005019a8?, {0x1a24040, 0x4000550018}, 0x400055a200)
2023-11-19 13:17:17     /usr/local/go/src/net/http/server.go:2487 +0x140
2023-11-19 13:17:17 golang.org/x/net/http2.(*serverConn).runHandler(0x40001f00a0?, 0x1a30df8?, 0x40004a4150?, 0x18694a0?)
2023-11-19 13:17:17     /home/circleci/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2368 +0xc4
2023-11-19 13:17:17 created by golang.org/x/net/http2.(*serverConn).scheduleHandler
2023-11-19 13:17:17     /home/circleci/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2303 +0x22c

I checked, and my logs aggregator was indeed stopped, though I'm not sure why:

Screen Shot 2023-11-19 at 13 19 54

I also have a logs-collector somehow hanging around, even though I don't have any enclaves running:

$ kurtosis enclave ls
UUID   Name   Status   Creation Time

This is the output of docker container inspect kurtosis-logs-aggregator:

[
    {
        "Id": "ebaa0ce2ae972c613cfe03403cae2040ef4f317968ca5b2faacd3c3ff23f961a",
        "Created": "2023-11-17T08:54:59.858197634Z",
        "Path": "/bin/sh",
        "Args": [
            "-c",
            "printf '\n[sources.\"fluent_bit\"]\ntype = \"fluent\"\naddress = \"0.0.0.0:9714\"\n\n[sinks.uuid_file]\ntype = \"file\"\ninputs = [\"fluent_bit\"]\npath = \"/var/log/kurtosis/%%Y/%%V/{{ enclave_uuid }}/{{ service_uuid }}.json\"\t\nencoding.codec = \"json\"\nbuffer.when_full = \"block\"\n' \u003e /etc/vector/vector.toml \u0026\u0026 /usr/bin/vector -c=/etc/vector/vector.toml"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 137,
            "Error": "",
            "StartedAt": "2023-11-17T08:55:00.116585384Z",
            "FinishedAt": "2023-11-18T15:36:40.796399588Z"
        },
        "Image": "sha256:ec83552fc89eacc0af3a4d1576523a74f8b31b5931aa560b90423dcd3c6e1355",
        "ResolvConfPath": "/var/lib/docker/containers/ebaa0ce2ae972c613cfe03403cae2040ef4f317968ca5b2faacd3c3ff23f961a/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/ebaa0ce2ae972c613cfe03403cae2040ef4f317968ca5b2faacd3c3ff23f961a/hostname",
        "HostsPath": "/var/lib/docker/containers/ebaa0ce2ae972c613cfe03403cae2040ef4f317968ca5b2faacd3c3ff23f961a/hosts",
        "LogPath": "/var/lib/docker/containers/ebaa0ce2ae972c613cfe03403cae2040ef4f317968ca5b2faacd3c3ff23f961a/ebaa0ce2ae972c613cfe03403cae2040ef4f317968ca5b2faacd3c3ff23f961a-json.log",
        "Name": "/kurtosis-logs-aggregator",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "kurtosis-logs-storage:/var/log/kurtosis/"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "default",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "on-failure",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "ConsoleSize": [
                0,
                0
            ],
            "CapAdd": [],
            "CapDrop": null,
            "CgroupnsMode": "private",
            "Dns": null,
            "DnsOptions": null,
            "DnsSearch": null,
            "ExtraHosts": [],
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": null,
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": null,
            "PidsLimit": null,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/asound",
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware"
            ],
            "ReadonlyPaths": [
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ],
            "Init": false
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/eeaa3853b171c6f86b7b1fa3c1893bb5d21e9ca5ee72fc9dd4d6bf7ab9a1f84a-init/diff:/var/lib/docker/overlay2/73a73c54f4733e3093b694a5477e7d1cc8ec08b38ec33ffc6cadaff24554b3f4/diff:/var/lib/docker/overlay2/0649dd28395f8fcd43911f7fe784e1424786f892c5d3ec21cd33d75b51b491e5/diff:/var/lib/docker/overlay2/c1f399053d8ca7a31e3b1156ab6b8115ded5d353a375c601927295de44fe64ae/diff:/var/lib/docker/overlay2/bcb7c67adf3f86995fddd2b45f36b34a2bc43595fe3bb7e48c34f96c4d4bd077/diff:/var/lib/docker/overlay2/fcd0f8d8397e8d2d3653911c289eda57bee2a2bd7222539e2da5a76a06a4a808/diff:/var/lib/docker/overlay2/9c880fe16116762d9d91885e8c6490bb21f350849b88079a22e2f587108762b5/diff:/var/lib/docker/overlay2/04b6c1d99cf7535684fc7333499d7adbc5e8d72401086835fc1c6715f0cc013d/diff",
                "MergedDir": "/var/lib/docker/overlay2/eeaa3853b171c6f86b7b1fa3c1893bb5d21e9ca5ee72fc9dd4d6bf7ab9a1f84a/merged",
                "UpperDir": "/var/lib/docker/overlay2/eeaa3853b171c6f86b7b1fa3c1893bb5d21e9ca5ee72fc9dd4d6bf7ab9a1f84a/diff",
                "WorkDir": "/var/lib/docker/overlay2/eeaa3853b171c6f86b7b1fa3c1893bb5d21e9ca5ee72fc9dd4d6bf7ab9a1f84a/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [
            {
                "Type": "volume",
                "Name": "kurtosis-logs-storage",
                "Source": "/var/lib/docker/volumes/kurtosis-logs-storage/_data",
                "Destination": "/var/log/kurtosis",
                "Driver": "local",
                "Mode": "z",
                "RW": true,
                "Propagation": ""
            }
        ],
        "Config": {
            "Hostname": "ebaa0ce2ae97",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": true,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ],
            "Cmd": [
                "-c",
                "printf '\n[sources.\"fluent_bit\"]\ntype = \"fluent\"\naddress = \"0.0.0.0:9714\"\n\n[sinks.uuid_file]\ntype = \"file\"\ninputs = [\"fluent_bit\"]\npath = \"/var/log/kurtosis/%%Y/%%V/{{ enclave_uuid }}/{{ service_uuid }}.json\"\t\nencoding.codec = \"json\"\nbuffer.when_full = \"block\"\n' \u003e /etc/vector/vector.toml \u0026\u0026 /usr/bin/vector -c=/etc/vector/vector.toml"
            ],
            "Image": "timberio/vector:0.31.0-debian",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": [
                "/bin/sh"
            ],
            "OnBuild": null,
            "Labels": {
                "com.kurtosistech.app-id": "kurtosis",
                "com.kurtosistech.container-type": "kurtosis-logs-aggregator"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "90b24aa1e00d5a26d6cf689794d149f8732b2ae474a7382f496eb65a8f79283d",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": "/var/run/docker/netns/90b24aa1e00d",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "8662e507c238504d517d3e5496325a3bdaea893a404df2b8d29602fdf219eea8",
                    "EndpointID": "",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "",
                    "DriverOpts": null
                }
            }
        }
    }
]

And the container logs:

2023-11-17 05:55:00 2023-11-17T08:55:00.121833Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-11-17 05:55:00 2023-11-17T08:55:00.123136Z  INFO vector::app: Loading configs. paths=["/etc/vector/vector.toml"]
2023-11-17 05:55:00 2023-11-17T08:55:00.125273Z  INFO vector::topology::running: Running healthchecks.
2023-11-17 05:55:00 2023-11-17T08:55:00.125364Z  INFO vector: Vector has started. debug="false" version="0.31.0" arch="aarch64" revision="0f13b22 2023-07-06 13:52:34.591204470"
2023-11-17 05:55:00 2023-11-17T08:55:00.125370Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2023-11-17 05:55:00 2023-11-17T08:55:00.125380Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-17 05:55:00 2023-11-17T08:55:00.125612Z  INFO source{component_kind="source" component_id=fluent_bit component_type=fluent component_name=fluent_bit}: vector::sources::util::net::tcp: Listening. addr=0.0.0.0:9714

Desired behavior

I never get this error; Kurtosis correctly manages the logs-aggregator

What is the severity of this bug?

Critical; I am blocked and Kurtosis is unusable for me because of this bug.

What area of the product does this pertain to?

CLI: the Command Line Interface

@mieubrisse mieubrisse added the bug Something isn't working label Nov 19, 2023
@github-actions github-actions bot added cli For bugs relating to the CLI critical Critical bug or feature labels Nov 19, 2023
@mieubrisse
Copy link
Collaborator Author

Confirmed that it's not a problem of switching contexts; I just switched to my cloud context and back again and the logs-aggregator was started along with the engine

Screen Shot 2023-11-19 at 13 25 18

(though these silly logs-collectors are still hanging around..)

@tedim52
Copy link
Contributor

tedim52 commented Nov 22, 2023

status codes of the exited containers are 137. According to chat gpt:
An exit status of 137 typically indicates that a process was terminated by a signal. In the context of Docker, specifically, this often happens when a container is killed due to exceeding its allocated memory limit.

I wonder if this is caused by the logs collector resource leak.

github-merge-queue bot pushed a commit that referenced this issue Nov 22, 2023
## Description:
This change strengthens the restart policy for the logs aggregator.
Prior to this, the restart only occurred on failure. Now, we make docker
attempt to always restart the logs aggregator. This should help address
#1832 where the logs
aggregator was stopped with a `137` status code but wasn't restarted.

This change also addresses a `Propagate must be provided with a cause`
panic occurred here:
#1832. This was caused
by nil err's being propagated in the create logs collector code. This
change fixes that issue.

## Is this change user facing?
NO

## References:
#1832
#1311
@tedim52
Copy link
Contributor

tedim52 commented Nov 30, 2023

Update - this likely wasn't being caused by the log collector memory leak. Going to look deeper into memory limits on the logs aggregator.

@tedim52 tedim52 reopened this Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cli For bugs relating to the CLI painful Painful bug
Projects
None yet
Development

No branches or pull requests

3 participants