This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

K8S windows - pods are crashing due to CNI panic #3389

Closed
AmitDaniel-wsc opened this issue Jul 1, 2018 · 35 comments

@AmitDaniel-wsc

Is this a request for help?:
yes

Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
Version: v0.19.0
GitCommit: 312770f
GitTreeState: clean

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes , v1.10.5

What happened:
Created a K8S cluster with 3 masters (Linux) and 10 agents (Windows).
agentWindowsSku = Datacenter-Core-1709-with-Containers-smalldisk
agentWindowsVersion = 1709.0.20180524
Tried installing both with and without Azure CNI.
Once I start adding pods, I get the following errors:

First error :

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "prod-wsc-cliprouploader-57878b4f9b-hjhvn_wsc" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON

Second error :

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "prod-wsc-cliprouploader-57878b4f9b-k45dg_wsc" network: runtime error: invalid memory address or nil pointer dereference; goroutine 1 [running]: github.com/Azure/azure-container-networking/cni.(*Plugin).Execute.func1(0xc042063ec0) /go/src/github.com/Azure/azure-container-networking/cni/plugin.go:94 +0xc3 panic(0x6ab420, 0x84f790) /usr/local/go/src/runtime/panic.go:489 +0x2dd github.com/Azure/azure-container-networking/cni/network.(*netPlugin).Add.func1(0xc042063910, 0xc0421432d0, 0xc0420638d0, 0xc0420638f0, 0xc0420d5c40) /go/src/github.com/Azure/azure-container-networking/cni/network/network.go:150 +0xa6 github.com/Azure/azure-container-networking/cni/network.(*netPlugin).Add(0xc0420d5c40, 0xc0421432d0, 0x829be0, 0xc0420c29f0) /go/src/github.com/Azure/azure-container-networking/cni/network/network.go:285 +0x138d github.com/Azure/azure-container-networking/cni.(PluginApi).Add-fm(0xc0421432d0, 0xc042145860, 0x5) /go/src/github.com/Azure/azure-container-networking/cni/plugin.go:112 +0x40 github.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc04212fd00, 0xc0421432d0, 0x82be20, 0xc04216e060, 0xc042063e80, 0x0, 0xc04200c180) /go/src/github.com/containernetworking/cni/pkg/skel/skel.go:168 +0x1a6 github.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc04212fd00, 0xc042063e80, 0xc042063e68, 0x82be20, 0xc04216e060, 0xc042063e38) /go/src/github.com/containernetworking/cni/pkg/skel/skel.go:199 +0x38b github.com/containernetworking/cni/pkg/skel.PluginMainWithError(0xc042063e80, 0xc042063e68, 0x82be20, 0xc04216e060, 0xc04216e060) /go/src/github.com/containernetworking/cni/pkg/skel/skel.go:236 +0xf4 github.com/Azure/azure-container-networking/cni.(*Plugin).Execute(0xc0420fc068, 0x82bde0, 0xc0420d5c40, 0x0, 0x0) /go/src/github.com/Azure/azure-container-networking/cni/plugin.go:112 +0x12e main.main() 
/go/src/github.com/Azure/azure-container-networking/cni/network/plugin/main.go:93 +0x4db

What you expected to happen:
Pods start with no errors.

How to reproduce it (as minimally and precisely as possible):
Create a cluster with the latest version of acs-engine and change the Windows image to 1709.
Create the cluster with and without Azure CNI and start adding pods to it.
After 10-15 pods you'll get the error.

Anything else we need to know:
I tried to start the cluster with an older version of acs-engine (0.18.5) and I'm still getting this error.

Thanks ,
Amit

@AmitDaniel-wsc
Author

AmitDaniel-wsc commented Jul 1, 2018

More details:

I can see that the azure-cni-networkmonitor DaemonSet pods are failing.
They run only on the 3 masters and fail on all of the agents with this error:

Unable to mount volumes for pod "azure-cni-networkmonitor-2h6wj_kube-system(21d1dda6-7cfa-11e8-9a65-000d3a3827e6)": timeout expired waiting for volumes to attach or mount for pod "kube-system"/"azure-cni-networkmonitor-2h6wj". list of unmounted volumes=[ebtables-rule-repo]. list of unattached volumes=[log ebtables-rule-repo default-token-rls2x]

Normal SuccessfulMountVolume 18m kubelet, 39662acs000007 MountVolume.SetUp succeeded for volume "log"
Normal SuccessfulMountVolume 18m kubelet, 39662acs000007 MountVolume.SetUp succeeded for volume "default-token-rls2x"
Warning FailedMount 53s (x8 over 16m) kubelet, 39662acs000007 Unable to mount volumes for pod "azure-cni-networkmonitor-t6m8f_kube-system(21d1fbe7-7cfa-11e8-9a65-000d3a3827e6)": timeout expired waiting for volumes to attach or mount for pod "kube-system"/"azure-cni-networkmonitor-t6m8f". list of unmounted volumes=[ebtables-rule-repo]. list of unattached volumes=[log ebtables-rule-repo default-token-rls2x]
Warning FailedMount 20s (x17 over 18m) kubelet, 39662acs000007 MountVolume.SetUp failed for volume "ebtables-rule-repo" : hostPath type check failed: /var/run/ is not a directory

And of course /var/run/ is not a directory, because my agents are Windows-based.

Thanks ,
Amit

@AmitDaniel-wsc
Author

A couple of new errors I'm getting while trying to start pods:

Error1:

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "POD-NAME" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

Error2:

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "POD-NAME" network: netplugin failed but error parsing its diagnostic message "{\n "cniVersion": "0.3.0",\n "interfaces": [\n {\n "name": "eth0"\n }\n ],\n "ips": [\n {\n "version": "4",\n "address": "10.240.0.68/12",\n "gateway": "10.240.0.1"\n }\n ],\n "routes": [\n {\n "dst": "0.0.0.0/0",\n "gw": "10.240.0.1"\n }\n ],\n "dns": {\n "nameservers": [\n "168.63.129.16"\n ]\n }\n}{\n "code": 100,\n "msg": "Failed to create endpoint: HNS failed with error : The object already exists. "\n}": invalid character '{' after top-level value
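The diagnostic message in Error2 appears to contain two JSON objects written back to back: a CNI result followed by an error object. A single-document parser stops after the first object, which is consistent with "invalid character '{' after top-level value". The sketch below is only an illustration in Python (the helper name `split_json_documents` and the abbreviated sample string are made up here, not taken from the plugin); it shows one way to pull such a message apart when reading logs.

```python
import json


def split_json_documents(text):
    """Return every top-level JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    docs, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        docs.append(value)
    return docs


# Abbreviated stand-in for the diagnostic message in Error2 (fields trimmed).
stdout = (
    '{"cniVersion": "0.3.0", "ips": [{"address": "10.240.0.68/12"}]}'
    '{"code": 100, "msg": "Failed to create endpoint: HNS failed with error : '
    'The object already exists. "}'
)

documents = split_json_documents(stdout)
for doc in documents:
    # Treat an object carrying both "code" and "msg" as the error part.
    kind = "error" if "code" in doc and "msg" in doc else "result"
    print(kind, doc)
```

Read this way, the part that matters in Error2 is the second object: HNS reported that the endpoint already exists.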

@idanshahar

idanshahar commented Jul 1, 2018

@AmitDaniel-wsc Could you please share your API model?

@eliko86

eliko86 commented Jul 1, 2018

@idanshahar @AmitDaniel-wsc
here is the API model:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.10",
      "kubernetesConfig": {
        "networkPlugin": "azure"
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "wsc-k8s-win-consoles",
      "vmSize": "Standard_D4_v3"
    },
    "agentPoolProfiles": [
      {
        "name": "windowspool1",
        "count": 10,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "VirtualMachineScaleSets",
        "storageProfile": "ManagedDisks",
        "osType": "Windows"
      }
    ],
    "windowsProfile": {
      "adminUsername": "wscuser",
      "adminPassword": "",
      "imageVersion": "1709.0.20180524"
    },
    "linuxProfile": {
      "adminUsername": "wscuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": ""
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "",
      "secret": ""
    }
  }
}

@feiskyer
Member

feiskyer commented Jul 2, 2018

The first JSON parsing error is also reported at #3155. @saiyan86 Do you know why this is happening?

The second error, the panic, is a known issue in the Azure CNI plugin (see Azure/azure-container-networking#176). It has been fixed in Azure/azure-container-networking#177, but it may take some time for the fix to be released.

@y325A

y325A commented Jul 2, 2018

Hi, I opened #3153 for what seems to be the same issue.

Check the logs on your Windows nodes; on mine I found IP exhaustion errors coming from the CNI plugin.

@AmitDaniel-wsc
Author

Hi.
I created a new cluster (latest acs-engine, K8S 1.10),
this time with kubenet instead of azure-cni,
and I get this error when trying to create an iis pod:

goroutine 1 [running]:
visualstudio.com/containernetworking/cni/cni.(*NetworkConfig).GetNetworkInfo(0xc04204c140, 0x29)
/home/madhanm/repo/gopath/src/visualstudio.com/containernetworking/cni/cni/cni.go:175 +0x70e
visualstudio.com/containernetworking/cni/cni/network.(*netPlugin).Add(0xc042044500, 0xc042094070, 0x1, 0x0)
/home/madhanm/repo/gopath/src/visualstudio.com/containernetworking/cni/cni/network/network.go:108 +0x35b
visualstudio.com/containernetworking/cni/cni.(PluginApi).Add-fm(0xc042094070, 0xc04204bc38, 0x5)
/home/madhanm/repo/gopath/src/visualstudio.com/containernetworking/cni/cni/plugin.go:49 +0x40
github.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc042046280, 0xc042094070, 0x6311a0, 0xc0420ca2a0, 0xc042065e78, 0x0, 0x41055f)
/home/madhanm/repo/gopath/src/github.com/containernetworking/cni/pkg/skel/skel.go:162 +0x1a6
github.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc042046280, 0xc042065e78, 0xc042065e90, 0x6311a0, 0xc0420ca2a0, 0xc0420ca2a0)
/home/madhanm/repo/gopath/src/github.com/containernetworking/cni/pkg/skel/skel.go:173 +0x2a9
github.com/containernetworking/cni/pkg/skel.PluginMainWithError(0xc042065e78, 0xc042065e90, 0x6311a0, 0xc0420ca2a0, 0xc0420ca2a0)
/home/madhanm/repo/gopath/src/github.com/containernetworking/cni/pkg/skel/skel.go:210 +0xf4
visualstudio.com/containernetworking/cni/cni.(*Plugin).Execute(0xc042068028, 0x631520, 0xc042044500, 0x0, 0xc04207fc20)
/home/madhanm/repo/gopath/src/visualstudio.com/containernetworking/cni/cni/plugin.go:49 +0xf9
main.main()
/home/madhanm/repo/gopath/src/visualstudio.com/containernetworking/cni/cni/network/plugin/main.go:43 +0x35c
E0702 13:27:22.154062 2728 cni.go:259] Error adding network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0702 13:27:22.154062 2728 cni.go:227] Error while adding to cni network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0702 13:27:23.741737 2728 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "iis_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0702 13:27:23.741737 2728 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "iis_default(dc00ed67-7df8-11e8-a1a2-000d3a58e297)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "iis_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0702 13:27:23.741737 2728 kuberuntime_manager.go:646] createPodSandbox for pod "iis_default(dc00ed67-7df8-11e8-a1a2-000d3a58e297)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "iis_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0702 13:27:23.741737 2728 pod_workers.go:186] Error syncing pod dc00ed67-7df8-11e8-a1a2-000d3a58e297 ("iis_default(dc00ed67-7df8-11e8-a1a2-000d3a58e297)"), skipping: failed to "CreatePodSandbox" for "iis_default(dc00ed67-7df8-11e8-a1a2-000d3a58e297)" with CreatePodSandboxError: "CreatePodSandbox for pod "iis_default(dc00ed67-7df8-11e8-a1a2-000d3a58e297)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "iis_default" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input"
I0702 13:27:24.112447 2728 kubelet.go:1906] SyncLoop (PLEG): "iis_default(dc00ed67-7df8-11e8-a1a2-000d3a58e297)", event: &pleg.PodLifecycleEvent{ID:"dc00ed67-7df8-11e8-a1a2-000d3a58e297", Type:"ContainerDied", Data:"c50b9690ae40a45d1e92c0a86d20841b0f51f9063ae43e0841934a2bf94a3607"}
W0702 13:27:24.112447 2728 pod_container_deletor.go:77] Container "c50b9690ae40a45d1e92c0a86d20841b0f51f9063ae43e0841934a2bf94a3607" not found in pod's containers
I0702 13:27:24.414232 2728 kuberuntime_manager.go:403] No ready sandbox for pod "iis_default(dc00ed67-7df8-11e8-a1a2-000d3a58e297)" can be found. Need to start a new one
E0702 13:27:26.354370 2728 docker_sandbox.go:657] ResolvConfPath is empty.
I0702 13:27:26.375379 2728 kubelet.go:1906] SyncLoop (PLEG): "iis_default(dc00ed67-7df8-11e8-a1a2-000d3a58e297)", event: &pleg.PodLifecycleEvent{ID:"dc00ed67-7df8-11e8-a1a2-000d3a58e297", Type:"ContainerStarted", Data:"42dd5865284d6507b24bcb7f93fc28217618d7d171e060471d0ddb6548871805"}
{"level":"debug","msg":"[cni-net] Plugin wcn-net version b9c3a1d-dirty.","time":"2018-07-02T13:27:26Z"}
{"level":"debug","msg":"[net] Network interface: {Index:13 MTU:1500 Name:vEthernet (Ethernet 3) HardwareAddr:00:0d:3a:58:e1:e4 Flags:up|broadcast|multicast} with IP addresses: [fe80::5151:1843:cb3:c26d/64 10.240.0.13/16]","time":"2018-07-02T13:27:26Z"}
{"level":"debug","msg":"[net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP addresses: [::1/128 127.0.0.1/8]","time":"2018-07-02T13:27:26Z"}
{"level":"debug","msg":"[net] Network interface: {Index:3 MTU:1280 Name:Teredo Tunneling Pseudo-Interface HardwareAddr:00:00:00:00:00:00:00:e0 Flags:pointtopoint|multicast} with IP addresses: [fe80::ffff:ffff:fffe/64]","time":"2018-07-02T13:27:26Z"}
{"level":"debug","msg":"[net] Network interface: {Index:9 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:08:2e:58 Flags:up|broadcast|multicast} with IP addresses: [fe80::1100:53c4:295:abee/64 172.18.80.1/20]","time":"2018-07-02T13:27:26Z"}
{"level":"debug","msg":"[cni-net] Plugin started.","time":"2018-07-02T13:27:26Z"}
{"level":"debug","msg":"[cni-net] Processing ADD command with args {ContainerID:42dd5865284d6507b24bcb7f93fc28217618d7d171e060471d0ddb6548871805 Netns:none IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=iis;K8S_POD_INFRA_CONTAINER_ID=42dd5865284d6507b24bcb7f93fc28217618d7d171e060471d0ddb6548871805 Path:/opt/wincni.exe/bin;c:\k\cni}.","time":"2018-07-02T13:27:26Z"}
{"level":"debug","msg":"[cni-net] Read network configuration \u0026{CniVersion:0.2.0 Name:l2bridge Type:wincni.exe Ipam:{Type: Environment:azure AddrSpace: Subnet:\u003cnone\u003e Address: QueryInterval: Routes:[{Dst:{IP:\u003cnil\u003e Mask:\u003cnil\u003e} GW:10.240.0.1}]} DNS:{Nameservers:[10.0.0.10] Domain: Search:[svc.cluster.local] Options:[]} RuntimeConfig:{PortMappings:[]} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 50 52 52 46 48 46 48 47 49 54 34 44 34 49 48 46 50 52 48 46 48 46 48 47 49 54 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 48 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]}]}.","time":"2018-07-02T13:27:26Z"}
panic: runtime error: index out of range
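The two `EndpointPolicy` values in the "Read network configuration" line are printed as Go byte slices, which makes them hard to read. Below is a small Python sketch for turning them back into JSON while reading the log; the helper name `decode_go_byte_slice` is invented for this illustration, and the byte lists are copied from the log line above.

```python
import json


def decode_go_byte_slice(text):
    """Turn Go's '[123 34 ...]' rendering of a []byte into a Python str."""
    return bytes(int(tok) for tok in text.strip("[]").split()).decode("utf-8")


# Byte lists copied from the two EndpointPolicy entries in the log line.
first_policy = (
    "[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 "
    "46 50 52 52 46 48 46 48 47 49 54 34 44 34 49 48 46 50 52 48 46 48 46 48 47 "
    "49 54 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 "
    "65 84 34 125]"
)
second_policy = (
    "[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 "
    "58 34 49 48 46 48 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 "
    "112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]"
)

policies = [json.loads(decode_go_byte_slice(p)) for p in (first_policy, second_policy)]
for policy in policies:
    print(policy)
```

Decoded, the first entry is an `OutBoundNAT` policy with an exception list of `10.244.0.0/16` and `10.240.0.0/16`, and the second is a `ROUTE` policy for `10.0.0.0/16` with `NeedEncap` set to true.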

@idanshahar

@AmitDaniel-wsc I tried to reproduce this.
Please try this template with acs-engine v0.19.0:

{
	"apiVersion": "vlabs",
	"properties": {
		"orchestratorProfile": {
			"orchestratorType": "Kubernetes",
			"orchestratorRelease": "1.10",
			"kubernetesConfig": {
				"networkPlugin": "azure"
			}
		},
		"masterProfile": {
			"count": 1,
			"dnsPrefix": "idantest",
			"vmSize": "Standard_D4_v3"
		},
		"agentPoolProfiles": [{
			"name": "windowspool1",
			"count": 2,
			"vmSize": "Standard_D2_v2",
			"availabilityProfile": "VirtualMachineScaleSets",
			"storageProfile": "ManagedDisks",
			"osType": "Windows"
		}],
		"windowsProfile": {
			"adminUsername": "azureuser",
			"adminPassword": ""
		},
		"linuxProfile": {
			"adminUsername": "azureuser",
			"ssh": {
				"publicKeys": [{
					"keyData": ""
				}]
			}
		},
		"servicePrincipalProfile": {
			"clientId": "",
			"secret": ""
		}
	}
}

Could you please share your Kubernetes deployment.yaml and the Dockerfile?

@AmitDaniel-wsc
Author

Hi @idanshahar

I must use the windows-server-core-1709 version.
The image version is "1709.0.20180524" (I'm fine with using latest or another version), but 1803 does not work for us.
So I'm not sure what the difference is between your template and mine.

What do you mean by Kubernetes deployment.yaml and Dockerfile? (Which Dockerfile?)

@idanshahar

idanshahar commented Jul 2, 2018

@AmitDaniel-wsc
Hi, take a look here in order to change the OS to 1709.

The difference between my template and yours is that I removed this: "imageVersion": "1709.0.20180524"

I mean the YAML that describes your pods, and the Dockerfile of your service.

@sharmasushant
Contributor

@jackfrancis Can you please see why cni network monitor is starting on windows VMs? It should not be started by default, and specifically not on windows agents (even if requested).

@AmitDaniel-wsc I suspect the error you are seeing is not due to Azure CNI or wincni. Can you please share the Azure CNI logs from C:\K\azure-vnet*? They will tell us the reason for the failure.

@madhanrm Can you please check whether HNS is failing to attach the endpoint? Also, can you confirm whether 1803 has some fixes that are not in 1709 (and whose absence could cause the above error)? What logs will you need to verify that?

@AmitDaniel-wsc
Author

AmitDaniel-wsc commented Jul 3, 2018

@sharmasushant
Please download the logs from here :
https://ufile.io/sir0p

Thanks !

@PatrickLang
Contributor

let's use this to track network monitor on windows: #3404

@PatrickLang
Contributor

#3405 is also reporting the json parsing error

@PatrickLang
Contributor

@sharmasushant this will prevent the cni network monitor from scheduling on Windows #3407
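As an illustration only: one common way to keep a Linux-only DaemonSet off Windows nodes is a `nodeSelector` on the OS node label, which on Kubernetes 1.10 is `beta.kubernetes.io/os`. Whether #3407 does exactly this is not stated in this thread, so treat the patch shape and the function name `linux_only_patch` below as assumptions.

```python
import json


def linux_only_patch():
    """Build a patch body that pins a pod template to Linux nodes.

    Assumption: nodes carry the "beta.kubernetes.io/os" label, as they do on
    Kubernetes 1.10. This is not taken from #3407.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "nodeSelector": {"beta.kubernetes.io/os": "linux"}
                }
            }
        }
    }


patch_body = json.dumps(linux_only_patch(), indent=2)
print(patch_body)
```

A body like this could be passed to `kubectl patch daemonset ... -p` in the kube-system namespace; the DaemonSet is presumably named after the `azure-cni-networkmonitor-*` pods seen earlier, but check the actual name first.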

@PatrickLang
Contributor

Also, bonus points for issue 3389 being a favorite Windows service port!

@digeler

digeler commented Jul 6, 2018

@patrick, Francis: there is going to be a release of CNI 1.0.7 today.
Can we make sure it is pushed to acs-engine as well, along with the fixes for the other issues?
Ipam duplicate ip:
Azure/azure-container-networking#188
Panic issue:

#3359

#3404

So we can go ahead with installing the latest version.
Thanks

@jackfrancis
Member

@digeler We're testing Azure CNI v1.0.7 robustly, stand by...

@jackfrancis
Member

Azure CNI v1.0.7 has landed in master

@AmitDaniel-wsc
Author

AmitDaniel-wsc commented Jul 8, 2018

@jackfrancis Please reopen this issue.

I created a new cluster with the latest version of acs-engine, which I built myself (I verified that this build contains all the master commits),

and I'm getting the same JSON parsing error:

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "BLABLA" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

Update 1 : Another error :

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "prod-wsc-cliprouploader-57878b4f9b-7986z_wsc" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
Error: failed to start container "cliprouploader": Error response from daemon: container cliprouploader encountered an error during CreateProcess: failure in a Windows system call: The RPC server is unavailable. (0x6ba) extra info: {"ApplicationName":"","CommandLine":"cmd /S /C %exename%","User":"","WorkingDirectory":"C:\\app","Environment":{"EnvName":"Prod","KUBERNETES_PORT":"tcp://10.0.0.1:443","KUBERNETES_PORT_443_TCP":"tcp://10.0.0.1:443","KUBERNETES_PORT_443_TCP_ADDR":"10.0.0.1","KUBERNETES_PORT_443_TCP_PORT":"443","KUBERNETES_PORT_443_TCP_PROTO":"tcp","KUBERNETES_SERVICE_HOST":"10.0.0.1","KUBERNETES_SERVICE_PORT":"443","KUBERNETES_SERVICE_PORT_HTTPS":"443","RoleInstanceName":"prod-wsc-cliprouploader","cliprouploaderVersion":"cliprouploader.20180625.1","exename":"CliproUploadsConsole"},"EmulateConsole":false,"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0]}
error determining status: rpc error: code = Unknown desc = Error: No such container: 4ec5a115b5b6b82f7017f6886bcc485c08d2da0678b3245513141a0161bfe874

What do we do now?

@sharmasushant
Contributor

@madhanrm
@AmitDaniel-wsc Can you please share the logs from c:\k\azure-vnet*?
Please attach the log files here on GitHub rather than uploading them to an external site.

@AmitDaniel-wsc
Author

Sure.
Attached.

azure_vnet_logs.txt

@AmitDaniel-wsc
Author

@sharmasushant This is all the logs combined into one file.

@sharmasushant
Contributor

sharmasushant commented Jul 8, 2018

@AmitDaniel-wsc Are you sure that these logs are from the same node where you created "BLABLA"? I do not see any error in the logs that you attached. There are multiple log files ending in .0, .1, and so on. Can you please attach all of them from the node where pod creation failed?

I also cannot find any log entries for the second pod you pasted, prod-wsc-cliprouploader-57878b4f9b-7986z_wsc.

@AmitDaniel-wsc
Author

@sharmasushant I created 30 pods, so probably one of them was created on this node. Also, I'm not sure whether the errors come from the nodes or from the master.

Let me know if you want more logs from the master.

@sharmasushant
Contributor

@AmitDaniel-wsc Not from master. To look into the issue, we will need the logs from the node where POD creation failed. I see that 38 containers successfully got allocated IPs from Azure CNI in the logs that you shared.

@AmitDaniel-wsc
Author

@sharmasushant It failed on this node. A couple of minutes later the deployment finished successfully, so I'm not sure what was written to the logs.
The fact is that I'm getting the same error as before.

@sharmasushant
Contributor

@AmitDaniel-wsc OK, can you please share the complete kubelet logs showing the failure, with timestamps? It's strange that I don't see any calls to Azure CNI for the containers you are reporting failures for.

@madhanrm Can you please take a look. Is it possible for k8s to not call CNI and assume failure in some code path? Or to think CNI failed even when it successfully finished ADD?

@AmitDaniel-wsc
Author

@sharmasushant Kubelet.log is empty.
Kubelet.err.log is attached.
Let me know if you need more info.

kubelet_error_log.txt

@sharmasushant
Contributor

sharmasushant commented Jul 8, 2018

OK, from the look of it, this seems to be an issue in Kubernetes. Take pod prod-wsc-cliprouploader-57878b4f9b-fntdp as an example.

At 07:37:34, Azure CNI clearly finished ADD successfully:

2018/07/08 07:37:34 [cni-net] Processing ADD command with args {ContainerID:3ce4302adeec197e23920906cea87f4625b51863e7ad6808d1dbc343b7e5612b Netns:none IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=wsc;K8S_POD_NAME=prod-wsc-cliprouploader-57878b4f9b-fntdp;
K8S_POD_INFRA_CONTAINER_ID=3ce4302adeec197e23920906cea87f4625b51863e7ad6808d1dbc343b7e5612b

2018/07/08 07:37:34 [net] Attaching ep 90e00b57-9e84-46c5-88e4-f0cc5c2678f6 to container 3ce4302adeec197e23920906cea87f4625b51863e7ad6808d1dbc343b7e5612b
2018/07/08 07:37:34 [cni-net] ADD command completed with result:Interfaces:[{Name:eth0 Mac: Sandbox:}], IP:[{Version:4 Interface: Address:{IP:10.240.0.49 Mask:fff00000} Gateway:10.240.0.1}], Routes:[{Dst:{IP:0.0.0.0 Mask:00000000} GW:10.240.0.1}], DNS:{Nameservers:[10.0.0.10 168.63.129.16] Domain: Search:[] Options:[]} err:.
2018/07/08 07:37:34 [cni-net] Plugin stopped.
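The ADD result above prints the netmask in hex (`Mask:fff00000`) rather than as a prefix length. As a reading aid, here is a short Python sketch that converts the hex mask and checks the logged address and gateway against the resulting network; the helper name `prefix_len_from_hex_mask` is made up for this example.

```python
import ipaddress


def prefix_len_from_hex_mask(hex_mask):
    """Convert a hex netmask such as 'fff00000' to a prefix length."""
    mask = int(hex_mask, 16)
    # Check that the 32-bit mask is a run of ones followed only by zeros.
    inverted = (~mask) & 0xFFFFFFFF
    if inverted & (inverted + 1):
        raise ValueError(f"not a contiguous netmask: {hex_mask}")
    return bin(mask).count("1")


prefix = prefix_len_from_hex_mask("fff00000")
pod = ipaddress.ip_interface(f"10.240.0.49/{prefix}")
gateway = ipaddress.ip_address("10.240.0.1")

print(prefix)                  # prefix length of the logged mask
print(pod.network)             # network the pod IP belongs to
print(gateway in pod.network)  # whether the logged gateway is on that network
```

The mask corresponds to a /12, which matches the `10.240.0.0/12` subnet reported under InterfaceDetails in the telemetry lines further down.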

However, the CNI library code in github.com/containernetworking/cni/pkg/invoke/raw_exec.go thinks that ADD has failed for some reason. The error then bubbles up to github.com/kubernetes/pkg/kubelet/dockershim/network/cni/cni.go and shows up in the kubelet log.

We will need someone from github.com/containernetworking to help look at what went wrong below (especially given that it does not fail every time).

I0708 07:37:35.773519 2708 fake_cpu_manager.go:40] [fake cpumanager] AddContainer (pod: prod-wsc-cliprouploader-57878b4f9b-fntdp, container: cliprouploader, container id: a04f8da8aa167716e496bd7a6dc4573dc3278bf3da10223e47f6633a694f0931)
2018/07/08 07:37:35 Going to send Telemetry report to hostnetagent http://169.254.169.254/machine/plugins?comp=netagent&type=cnireport
2018/07/08 07:37:35 "Start Flag false CniSucceeded true Name CNI Version v1.0.7 ErrorMessage vnet []
Context AzureCNI SubContext "
2018/07/08 07:37:35 OrchestratorDetails &{ kubectl command failed due to exit status 1}
2018/07/08 07:37:35 OSDetails &{windows }
2018/07/08 07:37:35 SystemDetails &{0 0 0 0 0 0 }
2018/07/08 07:37:35 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.34 00:0d:3a:36:6b:63 vEthernet (Ethernet 3) 30 0 }2018/07/08 07:37:35 BridgeDetails
2018/07/08 07:37:35 Send telemetry success 200
2018/07/08 07:37:35 SetReportState succeeded
E0708 07:37:35.849545 2708 cni.go:259] Error adding network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0708 07:37:35.849545 2708 cni_windows.go:49] error while adding to cni network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
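To make the shape of this error easier to reason about, here is a rough Python paraphrase of how a CNI caller can end up with it. This is a sketch under assumptions, not the library's code: the function name `plugin_error` and its return shape are invented here, and the real logic is Go code in raw_exec.go. The idea is that when the plugin process exits non-zero, the caller tries to parse the plugin's stdout as a JSON error object, and if stdout is empty the parse itself fails.

```python
import json


def plugin_error(exit_code, stdout):
    """Return None on success, else a dict describing the plugin failure."""
    if exit_code == 0:
        return None
    try:
        err = json.loads(stdout)
        if not isinstance(err, dict):
            raise ValueError("diagnostic message is not a JSON object")
        return err
    except ValueError as parse_err:  # json.JSONDecodeError is a ValueError
        return {
            "msg": "netplugin failed but error parsing its diagnostic "
                   f'message "{stdout}": {parse_err}'
        }


# A plugin that exits non-zero without writing anything to stdout.
empty = plugin_error(1, "")
print(empty["msg"])

# A plugin that exits non-zero and reports a well-formed error object.
reported = plugin_error(1, '{"code": 100, "msg": "Failed to create endpoint"}')
print(reported["msg"])
```

Python words the parse failure differently from Go's "unexpected end of JSON input", but the situation is the same: an empty diagnostic message means the plugin process failed without reporting why, even though its own log shows ADD completing.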

@AmitDaniel-wsc
Author

@sharmasushant OK.
So is there anything I can do?

@sharmasushant
Contributor

We can post an issue to github.com/containernetworking with the above kubelet logs and the POD I mentioned and ask for their help in understanding why cni.go thinks that ADD has failed.

@AmitDaniel-wsc
Author

@jackfrancis @sharmasushant Please reopen this ticket.

@PatrickLang
Contributor

Since we've made changes and errors are different, can we use a new issue? Is #3447 the same deployment you're using? It's better for us to track what changes were made per issue, rather than reusing old ones.

@PatrickLang
Contributor

Azure/azure-container-networking#195 is open for this error:

 network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON

PatrickLang added a commit to PatrickLang/acs-engine that referenced this issue Aug 10, 2018
Resolves Azure#3389 / Azure#3447

Includes two important Azure-CNI changes for Windows
  Fix for unparseable error returned by CNI (Azure#195)
  Fix for IP Address leak in HNS failure scenario in windows CNI (Azure#218)
Full notes at https://github.com/Azure/azure-container-networking/releases
PatrickLang added a commit to PatrickLang/acs-engine that referenced this issue Aug 10, 2018
Resolves Azure#3389 / Azure#3447 / Azure#3153

Includes two important Azure-CNI changes for Windows
  Fix for unparseable error returned by CNI (Azure#195)
  Fix for IP Address leak in HNS failure scenario in windows CNI (Azure#218)
Full notes at https://github.com/Azure/azure-container-networking/releases
@PatrickLang PatrickLang self-assigned this Aug 10, 2018
9 participants