# Scheduler Deep Dive

## Scheduler Responsibilities
kube-scheduler assigns unbound Pods to suitable Nodes. It watches the API Server for Pods whose spec.nodeName is empty and picks the best node with a two-phase algorithm:

Unscheduled Pod → Scheduler → pick the best Node → write spec.nodeName → kubelet starts the Pod
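The "write spec.nodeName" step happens through the Pod's binding subresource. A minimal sketch of the object the scheduler effectively posts, with hypothetical Pod name web-0 and node node1:

```yaml
# Sketch: the Binding posted to pods/binding; names are hypothetical.
apiVersion: v1
kind: Binding
metadata:
  name: web-0    # the Pod being bound
target:
  apiVersion: v1
  kind: Node
  name: node1    # the node the scheduler chose
```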
## Two-Phase Scheduling Algorithm

### Phase 1: Filtering

Nodes that fail any predicate are eliminated; the rest form the candidate set:
```
All Nodes
  │
  ▼ Filter plugins (run in parallel)
  ├── NodeResourcesFit: enough resources (CPU/Memory/GPU)?
  ├── NodeAffinity: node affinity rules
  ├── PodAffinity/AntiAffinity: Pod affinity/anti-affinity
  ├── TaintToleration: taints tolerated?
  ├── NodePorts: host-port conflicts?
  ├── VolumeBinding: volumes bindable?
  └── NodeUnschedulable: node schedulable at all?
  │
  ▼
Candidate node set
```
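Which filter plugins run is configurable per scheduler profile. A minimal sketch, assuming the kubescheduler.config.k8s.io/v1 component config, that switches one of the plugins above off:

```yaml
# Sketch: disable the NodePorts filter in the default profile.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    filter:
      disabled:
      - name: NodePorts   # skip host-port conflict checking
```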
### Phase 2: Scoring

Each candidate node is scored (every plugin returns 0-100; weighted scores are summed) and the highest-scoring node wins:
```
Candidate nodes
  │
  ▼ Score plugins (run in parallel)
  ├── LeastAllocated: prefer nodes with low resource usage (spread)
  ├── MostAllocated: prefer nodes with high resource usage (pack)
  ├── NodeAffinity: bonus for node-affinity matches
  ├── InterPodAffinity: bonus for Pod-affinity matches
  ├── ImageLocality: bonus if the image is already on the node
  └── TaintToleration: bonus for tolerated taints
  │
  ▼
Highest-scoring node → Binding
```
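In current releases LeastAllocated and MostAllocated are not separate plugins but scoring strategies of NodeResourcesFit; a sketch of selecting one (weights are illustrative):

```yaml
# Sketch: spread Pods by preferring less-allocated nodes.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated    # MostAllocated bin-packs instead
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
```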
## Node Affinity

```yaml
spec:
  affinity:
    nodeAffinity:
      # Hard requirement (must be satisfied)
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values: [amd64]
          - key: node-role
            operator: In
            values: [gpu-node]
      # Soft preference (satisfied if possible)
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values: [us-east-1a]
```
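Two semantics worth noting: multiple nodeSelectorTerms are ORed, while the matchExpressions inside a single term are ANDed; a matched preference adds its weight (1-100) to the node's score.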
## Pod Affinity and Anti-Affinity

```yaml
spec:
  affinity:
    # Pod affinity: run in the same zone as Pods labeled app=cache
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        topologyKey: topology.kubernetes.io/zone
    # Pod anti-affinity: no two Pods with the same app label on one Node
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname
```
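The required form above refuses to schedule when it cannot be honored. A softer sketch with the same labels trades strictness for schedulability:

```yaml
# Sketch: prefer (weight 100) not to co-locate replicas on one node,
# but still schedule somewhere if spreading is impossible.
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchLabels:
          app: my-app
      topologyKey: kubernetes.io/hostname
```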
## Taints and Tolerations

```bash
# Taint a node
kubectl taint nodes node1 gpu=true:NoSchedule
kubectl taint nodes node1 maintenance=true:NoExecute

# Taint effects:
#   NoSchedule:       do not schedule new Pods (existing Pods unaffected)
#   PreferNoSchedule: avoid scheduling here if possible
#   NoExecute:        evict existing Pods and do not schedule new ones
```

```yaml
# A Pod tolerating taints
spec:
  tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300   # evicted after tolerating the taint for 300s
```
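A toleration with operator: Exists and no key tolerates every taint, a pattern (sketched below) common for node agents that must run on all nodes:

```yaml
# Sketch: blanket toleration; typical for DaemonSet-style node agents.
tolerations:
- operator: Exists
```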
## Topology Spread Constraints

Keep Pods evenly distributed across zones/nodes:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1                          # maximum allowed imbalance
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule    # refuse to schedule when violated
    labelSelector:
      matchLabels:
        app: my-app
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # best effort when violated
    labelSelector:
      matchLabels:
        app: my-app
```
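Read maxSkew as the largest allowed difference between topology domains: with maxSkew: 1 across three zones, a 2/1/1 spread of four matching replicas is fine (skew 1), but a placement that would yield 2/2/0 (skew 2) is rejected under DoNotSchedule.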
## Priority and Preemption

```yaml
# Define a priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority business Pods"
preemptionPolicy: PreemptLowerPriority   # may preempt lower-priority Pods
---
# A Pod using the class
spec:
  priorityClassName: high-priority
```
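Setting preemptionPolicy: Never produces a class that is placed ahead of lower-priority Pods in the scheduling queue but never evicts running ones; a sketch with a hypothetical name:

```yaml
# Sketch: high queue priority without preemption.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # hypothetical name
value: 1000000
preemptionPolicy: Never
```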
## Scheduling Framework Extension Points

```
PreEnqueue → QueueSort → PreFilter → Filter → PostFilter
→ PreScore → Score → NormalizeScore → Reserve
→ Permit → PreBind → Bind → PostBind
```
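Each extension point corresponds to a Go interface in the scheduling framework (framework.FilterPlugin, framework.ScorePlugin, framework.PermitPlugin, and so on); a plugin implements whichever interfaces it needs, as the Filter example in the next section does.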
## Custom Scheduler

```go
// Implementing a custom scheduler plugin (out-of-tree).
package main

import (
	"context"
	"os"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type MyPlugin struct{}

// Compile-time check that MyPlugin implements the Filter interface.
var _ framework.FilterPlugin = &MyPlugin{}

func (p *MyPlugin) Name() string { return "MyPlugin" }

// Filter implements the Filter extension point.
func (p *MyPlugin) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// Custom filtering logic: reject nodes missing the required label.
	if nodeInfo.Node().Labels["custom-label"] != "required" {
		return framework.NewStatus(framework.Unschedulable, "node is missing the required label")
	}
	return nil // nil status means Success: the node passes
}

// Register the plugin and run the scheduler.
func main() {
	command := app.NewSchedulerCommand(
		app.WithPlugin("MyPlugin", func(obj runtime.Object, h framework.Handle) (framework.Plugin, error) {
			return &MyPlugin{}, nil
		}),
	)
	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}
```
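Compiling the plugin in is not enough; it must also be enabled in a profile. A sketch, assuming the binary is run as a second scheduler named my-scheduler:

```yaml
# Sketch: enable MyPlugin at the Filter extension point.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-scheduler   # hypothetical profile name
  plugins:
    filter:
      enabled:
      - name: MyPlugin
```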
## Scheduling Performance Tuning

```bash
# Check scheduling latency via events
kubectl get events --field-selector reason=Scheduled

# Key scheduler flags
--percentage-of-nodes-to-score=50            # score only 50% of nodes (large-cluster optimization)
--pod-max-in-unschedulable-pods-duration=5m  # max time a Pod stays in the unschedulable queue
```
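As a flag, --percentage-of-nodes-to-score is deprecated in recent releases; the same knob lives in the component configuration, roughly:

```yaml
# Sketch: sample 50% of nodes for scoring via the config file.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50
```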
## Monitoring Metrics

```
scheduler_scheduling_attempt_duration_seconds   # scheduling latency
scheduler_pending_pods                          # number of pending Pods
scheduler_preemption_attempts_total             # preemption attempts
scheduler_schedule_attempts_total{result}       # outcome (scheduled/unschedulable/error)
```
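scheduler_scheduling_attempt_duration_seconds is a histogram, so tail latency can be derived from its buckets, e.g. `histogram_quantile(0.99, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le))` in PromQL.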