Kubernetes 基于 Namespace 的物理队列实现

Author: ninehills
Labels: blog
Created: 2020-04-10T04:22:00Z
Link and comments: #77

Kubernetes 基于 Namespace 的物理队列实现

作者：swulling+pub@gmail.com
摘要：Kubernetes 实现基于 Namespace 的物理队列，即Namespace下的Pod和Node的强绑定

0x00 背景

Kuberntes 目前在实际业务部署时，有两个流派：一派推崇小集群，一个或数个业务共享小集群，全公司有数百上千个小集群组成；另一派推崇大集群，每个AZ（可用区）一个或数个大集群，各个业务通过Namespace的方式进行隔离。

两者各有优劣，但是从资源利用率提升和维护成本的角度，大集群的优势更加突出。但同时大集群也带来相当多的安全、可用性、性能的挑战和维护管理成本。

本文属于Kubernetes多租户大集群实践的一部分，用来解决多租户场景下，如何实现传统的物理队列隔离。

物理队列并不是一个通用的业界名词，它来源于一种集群资源管理模型，该模型简化下如下：

逻辑队列（Logical Queue）：逻辑队列是虚拟资源分配的最小单元，将虚拟资源配额（Quota）配置在逻辑队列上（如CPU 200 标准核、内存 800GB等）
- 逻辑队列对应Kubernetes的Namespace概念。参考Resource Quotas
- 不同的逻辑队列之间可以设置Qos优先级，实现优先级调度。参考Limit Priority Class consumption by default可以限制每个Namespace下Pod的优先级选择
- 配额分两种：Requests（提供保障的资源）和Limits（资源的最大限制），其中仅Requests才能算Quota，Limits 由管理员视情况选择
物理队列（Physical Queue）：物理队列对应底层物理机资源，同一台物理机仅能从属于同一个物理队列。物理队列的资源总额就是其下物理机可提供的资源的总和。
- 物理队列当前在Kubernetes下缺乏概念映射
- 逻辑队列和物理队列是多对多绑定的关系，即同一个逻辑队列可以跨多个物理队列。
- 逻辑队列的配额总和 / 物理队列的资源总和 = 全局超售比
租户：租户可以绑定多个逻辑队列，对应关系仅影响往对应的Namespace中部署Pod的权限。

资源结构如图所示：

0x01 原理

物理队列实现：

给节点配置Label和Taint，Label用于选择，Taint用于拒绝非该物理队列的Pod部署。

和Namespace的自动绑定的原理：

配置两个Admission Controller: PodNodeSelector和PodTolerationRestriction，参考Admission Controllers
给Namespace增加默认的NodeSelector和Tolerations策略，并自动应用到该 Namespace 下的全部新增 Pod 上，从而自动将Pod绑定到物理队列上。

0x02 配置

测试集群版本：1.16.4, 1.17.0，使用的测试集群为Kind创建

1.18.0 版本的Kind集群创建有问题，后续进行测试

# this config file contains all config fields with comments
# NOTE: this is not a particularly useful config file
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# patch the generated kubeadm config with some extra settings
kubeadmConfigPatches:
- |
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  evictionHard:
    nodefs.available: "0%"
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  apiServer:
    extraArgs:
      enable-admission-plugins: PodNodeSelector,PodTolerationRestriction
# 1 control plane node and 3 workers
nodes:
# the control plane node config
- role: control-plane
# the three workers
- role: worker
- role: worker
- role: worker

集群开启Admission Controller: PodNodeSelector,PodTolerationRestriction

可以使用 kubeadmin 或 api-server启动参数：

apiServer:
  extraArgs:
    enable-admission-plugins: PodNodeSelector,PodTolerationRestriction

创建 Namespace

apiVersion: v1
kind: Namespace
metadata:
  name: public
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: "node-restriction.kubernetes.io/physical_queue=public-phy"
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Equal", "effect": "NoSchedule", "key": "node-restriction.kubernetes.io/physical_queue", "value": "public-phy"}]'
    # scheduler.alpha.kubernetes.io/tolerationsWhitelist: '[{"operator": "Equal", "effect": "NoSchedule", "key": "node-restriction.kubernetes.io/physical_queue", "value": "public-phy"}]'

此处要点：

文档有问题，toleration 配置是一个list，配置错误在部署时会提示解析JSON错误
tolerationsWhitelist配置后，就算配置有defaultTolerations且相同，也需要在Pod中指定对应的toleration，所以不能配置
NoSchedule 已经足够限制，无需 NoExecute，Node配置的时候同样配置，此处可根据需求进行选择。
物理队列的前缀建议为 node-restriction.kubernetes.io/physical_queue，此处是根据文档的建议，后续可以配合NodeRestriction admission plugin限制kubelet自定配置
目前Namespace尚不能绑定多个物理队列:
- NodeSelector 无法支持in语法，见 0x04
- defaultTolerations 可以配置多个 Torleration

给Node绑定物理队列

$ kubectl label node kind-worker node-restriction.kubernetes.io/physical_queue=public-phy
$ kubectl taint nodes kind-worker node-restriction.kubernetes.io/physical_queue=public-phy:NoSchedule

0x03 测试

测试的Deployment如下

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

验证提交到指定物理队列中的Pod默认增加NodeSelector和Toleration

$ kubectl apply -f nginx_deployment.yaml --namespace public
$ kubectl describe pod nginx-deployment-574b87c764-kb9k7 --namespace public
Node-Selectors:  node-restriction.kubernetes.io/physical_queue=public-phy
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 node-restriction.kubernetes.io/physical_queue=public-phy:NoSchedule

验证Pod是否都配置到一起

kubectl describe node kind-worker

验证和物理队列中指定的NodeSelector冲突的Pod无法提交

$ kubectl delete deployment nginx-deployment --namespace public
# 修改nginx_deployment.yaml ，增加spec.template.spec.nodeSelector
      nodeSelector:
        node-restriction.kubernetes.io/physical_queue: second-phy
# 验证能否部署
$ kubectl apply -f nginx_deployment.yaml --namespace public
# 查看deployments
$ kubectl describe replicaset nginx-deployment-585fcd8d7d --namespace public
  Warning  FailedCreate  49s (x15 over 2m11s)  replicaset-controller  Error creating: pods is forbidden: pod node label selector conflicts with its namespace node label selector

0x04 相关问题

Q: NodeSelector无法使用Set-based语法，导致逻辑队列（NameSpace）无法绑定多个物理队列

后续考虑使用Node Affinity配置节点亲和性。但是目前并没有现成的Adminssion Controller去给Namespace绑定默认的节点亲和性，如有需求需要自己开发。

NodeSelector 和 Toleration 的功能，可以被 Node Affinity 进行替代，且后者提供更高级的调度功能，后续尝试是否基于此进行资源调度的整体设计。

此外Node Affinity还可以实现一个逻辑队列绑定多个物理队列的情况下，配置物理队列的调度权重的功能，即优先部署到某个物理队列。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

77.md

77.md

Kubernetes 基于 Namespace 的物理队列实现

Kubernetes 基于 Namespace 的物理队列实现

0x00 背景

0x01 原理

0x02 配置

0x03 测试

0x04 相关问题

Q: NodeSelector无法使用Set-based语法，导致逻辑队列（NameSpace）无法绑定多个物理队列

Files

77.md

Latest commit

History

77.md

File metadata and controls

Kubernetes 基于 Namespace 的物理队列实现

Kubernetes 基于 Namespace 的物理队列实现

0x00 背景

0x01 原理

0x02 配置

0x03 测试

0x04 相关问题

Q: NodeSelector无法使用Set-based语法，导致逻辑队列（NameSpace）无法绑定多个物理队列