Last updated: 2024-03-22 14:35:02
In scenarios where you combine a self-managed cluster with container instances and use cloud resources to absorb business peaks, the KCI scheduling plugin can preferentially schedule newly created Pods onto container instances once the allocation ratio of self-managed resources reaches a configurable threshold.
Kingsoft Cloud provides a scheduler-extender that works together with virtual-kubelet and supports the following scheduling policy:
While the resource allocation ratio of the self-managed cluster is below the configured threshold, Pods are created preferentially on self-managed resources.
Once the allocation ratio exceeds the threshold, Pods are preferentially scheduled onto virtual-kubelet nodes (that is, created with cloud resources), and different weights can be assigned to different vk nodes.
Two metrics are currently provided: CPU allocation ratio and memory allocation ratio. If either metric exceeds its threshold, Pods are preferentially scheduled onto vk nodes.
When computing the resource allocation ratio, only nodes whose role is node are counted (master nodes are excluded). To ensure the scheduling plugin works correctly, confirm before use that the worker nodes in the cluster carry the label kubernetes.io/role=node.
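For example, the label can be checked and, if missing, applied with commands like the following (the node name 10.0.0.179 is borrowed from the example cluster later in this document; substitute your own worker nodes):
# show nodes together with their kubernetes.io/role label
kubectl get node -L kubernetes.io/role
# add the label to a worker node that is missing it
kubectl label node 10.0.0.179 kubernetes.io/role=node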
Before use, remove the taint from the virtual-kubelet node or add the corresponding toleration to your Pods, so that the vk node can take part in normal Kubernetes scheduling.
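A minimal sketch of the two options (the taint key and value here match the toleration used in the test Deployment at the end of this document; adjust them to the taint your vk node actually carries):
# Option 1: remove the taint from the virtual-kubelet node
kubectl taint node rbkci-virtual-kubelet rbkci-virtual-kubelet.io/provider=kingsoftcloud:NoSchedule-
# Option 2: keep the taint and add the matching toleration to the Pod spec
# (see the tolerations section of the test Deployment below)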
The complete YAML for deploying the scheduler-extender is as follows:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: extender-conf
  namespace: kube-system
data:
  cpulimit: "0.5"   # total Pod requests / total node allocatable; set to -1 if you do not care about this metric
  memlimit: "0.7"   # set to -1 if you do not care about this metric
  # weight is optional and configures the scheduling weight of each vk node; nodeName is the name of the vk node
  weight: '[
      {
        "nodeName": "rbkci-virtual-kubelet",
        "weight": 5
      },
      {
        "nodeName": "virtual-kubelet-cn-beijing-i",
        "weight": 1
      }
    ]'
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: scheduler-extender
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: scheduler-extender-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  namespace: kube-system
  name: scheduler-extender
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-scheduler-policy
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [{
        "urlPrefix": "http://kci-scheduler-extender.kube-system/scheduler",
        "filterVerb": "predicates/always_true",
        "prioritizeVerb": "priorities/group_score",
        "preemptVerb": "preemption",
        "bindVerb": "",
        "weight": 1,
        "enableHttps": false,
        "nodeCacheCapable": false
      }]
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kci-scheduler-extender
  namespace: kube-system
  labels:
    app: kci-scheduler-extender
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kci-scheduler-extender
  template:
    metadata:
      labels:
        app: kci-scheduler-extender
    spec:
      serviceAccountName: scheduler-extender
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: type
                operator: NotIn
                values:
                - virtual-kubelet
      containers:
      - name: kci-scheduler-extender
        image: hub.kce.ksyun.com/ksyun/ksc-scheduler-extender:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kci-scheduler-extender
  name: kci-scheduler-extender
  namespace: kube-system
spec:
  ports:
  - name: server-port
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: kci-scheduler-extender
  type: ClusterIP
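Assuming the manifests above are saved to a local file (the file name below is only an example), they can be applied with:
kubectl apply -f kci-scheduler-extender.yaml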
Check whether kube-scheduler has been granted permission to access ConfigMaps:
# kubectl get clusterrole system:kube-scheduler -o yaml
...
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
If the rule above is missing, run kubectl edit clusterrole system:kube-scheduler and add those lines.
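As a non-interactive alternative sketch, the same rule can be appended with a JSON patch:
kubectl patch clusterrole system:kube-scheduler --type='json' \
  -p='[{"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["configmaps"],"verbs":["get","list","watch"]}}]'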
Next, modify the kube-scheduler configuration. The changes are:
Add --policy-configmap=custom-scheduler-policy to the command.
Set dnsPolicy to ClusterFirstWithHostNet.
Note:
kube-scheduler is usually deployed as a static Pod whose manifest is located in the /etc/kubernetes/manifests directory. See the official Kubernetes documentation for more information about static Pods. kube-scheduler normally runs on several nodes, and every replica must be modified.
Restart kube-scheduler for the change to take effect. An example of the modified manifest:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - /usr/local/bin/kube-scheduler
    - --logtostderr=true
    - --v=0
    - --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig
    - --leader-elect=true
    - --leader-elect-lease-duration=60s
    - --leader-elect-renew-deadline=30s
    - --leader-elect-retry-period=10s
    - --kube-api-qps=100
    - --policy-configmap=custom-scheduler-policy # newly added --policy-configmap flag
    image: hub.kce.ksyun.com/ksyun/kube-scheduler:v1.17.6-mp
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10251
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: k8s
      readOnly: true
    - mountPath: /etc/localtime
      name: time-zone
      readOnly: true
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet # dnsPolicy set to ClusterFirstWithHostNet
  tolerations:
  - operator: Exists
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoExecute
    operator: Exists
  volumes:
  - hostPath:
      path: /etc/kubernetes
    name: k8s
  - hostPath:
      path: /etc/localtime
    name: time-zone
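Because kube-scheduler runs as a static Pod, the kubelet restarts it automatically once the manifest under /etc/kubernetes/manifests is saved. A minimal sketch of how to confirm that each replica has restarted with the new flag (the label selector assumes the component: kube-scheduler label shown above):
kubectl -n kube-system get pod -l component=kube-scheduler
kubectl -n kube-system get pod -l component=kube-scheduler -o yaml | grep policy-configmap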
1. Check the extender-conf ConfigMap to see the current scheduling policy:
kubectl -n kube-system describe cm extender-conf
Expected output:
Name: extender-conf
Namespace: kube-system
Labels: <none>
Annotations: <none>
Data
====
cpulimit:
----
0.5
memlimit:
----
0.7
weight:
----
[ { "nodeName": "rbkci-virtual-kubelet", "weight": 3 }, { "nodeName": "cn-zhangjiakou.vnd-8vb0w6ot6evayqdha0a3", "weight": 1 } ]
Events: <none>
This scheduling policy means: when the CPU allocation ratio of the self-managed cluster exceeds 50%, or its memory allocation ratio exceeds 70%, newly created Pods are scheduled onto virtual-kubelet nodes; at that point, for every 4 Pods created, 3 are scheduled onto the vk node with weight 3 and 1 onto the vk node with weight 1.
2. Verify the effect of the scheduling policy
(1) Run the following command to list the cluster nodes:
kubectl get node -o wide
Expected output:
NAME STATUS ROLES AGE VERSION
10.0.0.179 Ready node 46d v1.21.3
10.0.0.214 Ready master 46d v1.21.3
10.0.0.8 Ready master 46d v1.21.3
10.0.0.83 Ready master 46d v1.21.3
10.0.0.96 Ready node 46d v1.21.3
cn-zhangjiakou.vnd-8vbahalwna5205drcwvr Ready agent 135m v1.21.3
rbkci-virtual-kubelet Ready agent 34d v1.19.3-vk-v1.1.0
The output above shows that the cluster currently has two worker nodes, 10.0.0.179 and 10.0.0.96, and two virtual-kubelet nodes, rbkci-virtual-kubelet and cn-zhangjiakou.vnd-8vbahalwna5205drcwvr.
(2) Deploy a test Deployment with the following YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
      tolerations: # if the taint on virtual-kubelet has not been removed, add this toleration so vk nodes can participate in scheduling
      - key: "rbkci-virtual-kubelet.io/provider"
        operator: "Equal"
        value: "kingsoftcloud"
        effect: "NoSchedule"
(3) Run the following command to check which node the Pod was scheduled to:
kubectl get pod -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE
nginx-68c8867f7b-rj9dg 1/1 Running 0 4s 10.244.1.74 10.0.0.179
The output shows that, because neither the CPU nor the memory allocation ratio has reached its threshold, the newly created Pod was scheduled onto a worker node of the self-managed cluster.
(4) Run the following command to scale the Deployment to 7 replicas:
kubectl scale deploy nginx --replicas=7
Check the Pod distribution again:
kubectl get pod -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE
nginx-68c8867f7b-94r4v 1/1 Running 0 3m19s 10.0.0.219 rbkci-virtual-kubelet
nginx-68c8867f7b-9wskx 1/1 Running 0 3m6s 10.0.0.219 cn-zhangjiakou.vnd-8vbahalwna5205drcwvr
nginx-68c8867f7b-pl4xl 1/1 Running 0 3m13s 10.0.0.219 rbkci-virtual-kubelet
nginx-68c8867f7b-pmxjf 1/1 Running 0 3m45s 10.0.0.219 rbkci-virtual-kubelet
nginx-68c8867f7b-rj9dg 1/1 Running 0 3m52s 10.0.0.219 10.0.0.179
nginx-68c8867f7b-sf7xk 1/1 Running 0 5m9s 10.0.0.219 10.0.0.179
nginx-68c8867f7b-twjl2 1/1 Running 0 4m9s 10.0.0.219 10.0.0.179
The output shows that the first 3 Pods were scheduled onto worker nodes of the self-managed cluster because neither the CPU nor the memory allocation ratio had reached its threshold. Checking the nodes' resource allocation, from the 4th Pod onward the ratio of total Pod CPU requests to total node allocatable CPU exceeds 0.5, so the remaining 4 Pods were scheduled onto vk nodes; in line with the configured weights, 3 of them went to the vk node with weight 3 and 1 to the vk node with weight 1.
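To see the allocation ratio that the extender reacts to, you can inspect the "Allocated resources" section of each worker node; for example (the node name is taken from the example above):
kubectl describe node 10.0.0.179 | grep -A 8 "Allocated resources"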