
A Deep Dive into the Kubernetes Scheduler

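Scheduler Responsibilities

kube-scheduler assigns unbound Pods to suitable Nodes. It watches the API Server for Pods whose spec.nodeName is empty and selects the best node for each one through a two-phase algorithm.

Unscheduled Pod → Scheduler → select the best Node → set spec.nodeName (via a Binding) → kubelet starts the Pod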

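Two-Phase Scheduling Algorithm

Phase 1: Filtering

Nodes that cannot run the Pod are eliminated; the nodes that remain form the candidate set:

All Nodes
    │
    ▼ Filter plugins (nodes are evaluated in parallel; a node is rejected as soon as one plugin fails)
    ├── NodeResourcesFit: enough allocatable resources (CPU, memory, extended resources such as GPUs)
    ├── NodeAffinity: node selector and node affinity rules
    ├── InterPodAffinity: Pod affinity and anti-affinity
    ├── TaintToleration: taints and tolerations
    ├── NodePorts: requested host ports are free
    ├── VolumeBinding: volumes can be bound
    └── NodeUnschedulable: node is not marked unschedulable
    │
    ▼
Candidate node set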
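
The filtering phase can be pictured as a chain of predicates that a node must pass in full. The sketch below is a simplified illustration only: the `Node`, `Pod`, and `FilterFunc` types and the two filter functions are hypothetical stand-ins, not the real scheduler framework API.

```go
package main

import "fmt"

// Node and Pod are hypothetical, simplified stand-ins for the real API objects.
type Node struct {
	Name          string
	FreeMilliCPU  int64
	FreeMemoryMiB int64
	Unschedulable bool
}

type Pod struct {
	RequestMilliCPU  int64
	RequestMemoryMiB int64
}

// FilterFunc reports whether the node is feasible for the pod.
type FilterFunc func(pod Pod, node Node) bool

// resourcesFit loosely mirrors the idea behind NodeResourcesFit.
func resourcesFit(pod Pod, node Node) bool {
	return pod.RequestMilliCPU <= node.FreeMilliCPU &&
		pod.RequestMemoryMiB <= node.FreeMemoryMiB
}

// nodeSchedulable loosely mirrors the idea behind NodeUnschedulable.
func nodeSchedulable(_ Pod, node Node) bool { return !node.Unschedulable }

// feasibleNodes keeps only the nodes that pass every filter.
func feasibleNodes(pod Pod, nodes []Node, filters []FilterFunc) []Node {
	var out []Node
	for _, n := range nodes {
		ok := true
		for _, f := range filters {
			if !f(pod, n) {
				ok = false
				break // the first failing filter rejects the node
			}
		}
		if ok {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeMilliCPU: 4000, FreeMemoryMiB: 8192},
		{Name: "node-b", FreeMilliCPU: 500, FreeMemoryMiB: 8192},
		{Name: "node-c", FreeMilliCPU: 4000, FreeMemoryMiB: 8192, Unschedulable: true},
	}
	pod := Pod{RequestMilliCPU: 1000, RequestMemoryMiB: 1024}
	filters := []FilterFunc{nodeSchedulable, resourcesFit}
	for _, n := range feasibleNodes(pod, nodes, filters) {
		fmt.Println(n.Name) // only node-a passes both filters
	}
}
```

In this sketch node-b fails on CPU and node-c fails because it is marked unschedulable, so only node-a reaches the scoring phase.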

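Phase 2: Scoring

Each scoring plugin rates every candidate node from 0 to 100; the scheduler sums the weighted plugin scores and picks the node with the highest total:

Candidate nodes
    │
    ▼ Scoring plugins (nodes are scored in parallel)
    ├── NodeResourcesFit with the LeastAllocated strategy: prefers nodes with lower resource utilization (spreading)
    ├── NodeResourcesFit with the MostAllocated strategy: prefers nodes with higher resource utilization (bin packing); only one strategy is active per profile
    ├── NodeAffinity: bonus for matching preferred node affinity terms
    ├── InterPodAffinity: bonus for matching preferred Pod affinity terms
    ├── ImageLocality: bonus when the node already holds the container images
    └── TaintToleration: prefers nodes with fewer untolerated PreferNoSchedule taints
    │
    ▼
Highest-scoring node → Binding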
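
To make the scoring arithmetic concrete, here is a minimal sketch of per-resource LeastAllocated and MostAllocated scores and of the weighted sum across plugins. It is a simplification under stated assumptions: the function names and the `pluginScore` type are illustrative, a single resource is scored, and the real plugins handle multiple weighted resources and more edge cases.

```go
package main

import "fmt"

const maxNodeScore = 100

// leastAllocated gives a higher score to nodes with more free capacity.
func leastAllocated(requested, allocatable int64) int64 {
	if allocatable == 0 || requested > allocatable {
		return 0
	}
	return (allocatable - requested) * maxNodeScore / allocatable
}

// mostAllocated gives a higher score to nodes that are already fuller.
func mostAllocated(requested, allocatable int64) int64 {
	if allocatable == 0 || requested > allocatable {
		return 0
	}
	return requested * maxNodeScore / allocatable
}

// pluginScore is a hypothetical pair of a normalized score and a plugin weight.
type pluginScore struct {
	score  int64
	weight int64
}

// totalScore sums the weighted scores of all plugins for one node.
func totalScore(scores []pluginScore) int64 {
	var sum int64
	for _, s := range scores {
		sum += s.score * s.weight
	}
	return sum
}

func main() {
	// A node with 4000m allocatable CPU, of which 1000m would be requested
	// once the Pod is placed.
	fmt.Println(leastAllocated(1000, 4000)) // 75
	fmt.Println(mostAllocated(1000, 4000))  // 25

	// Two plugins: score 75 with weight 1, score 50 with weight 2.
	fmt.Println(totalScore([]pluginScore{{75, 1}, {50, 2}})) // 175
}
```

The same node scores 75 under a spreading strategy and 25 under a bin-packing strategy, which is why the choice of strategy changes where Pods land.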

Node Affinity

yaml
spec:
  affinity:
    nodeAffinity:
      # Hard requirement (must be satisfied); expressions within one term are ANDed
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values: [amd64]
          - key: node-role
            operator: In
            values: [gpu-node]
      # Soft preference (satisfied when possible; a higher weight counts for more)
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values: [us-east-1a]
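
The matching rule for required node affinity is easy to get wrong: the entries in nodeSelectorTerms are ORed, while the expressions inside one term are ANDed. The sketch below illustrates that rule for the `In` operator only; the `Requirement` and `Term` types are hypothetical simplifications of the real API types.

```go
package main

import "fmt"

// Requirement is a simplified match expression that supports only the "In" operator.
type Requirement struct {
	Key    string
	Values []string
}

// Term is one nodeSelectorTerm: all of its requirements must match.
type Term []Requirement

func matchesRequirement(labels map[string]string, r Requirement) bool {
	v, ok := labels[r.Key]
	if !ok {
		return false
	}
	for _, want := range r.Values {
		if v == want {
			return true
		}
	}
	return false
}

// matchesTerms returns true if at least one term matches (OR across terms,
// AND within a term). An empty term matches nothing.
func matchesTerms(labels map[string]string, terms []Term) bool {
	for _, t := range terms {
		if len(t) == 0 {
			continue
		}
		all := true
		for _, r := range t {
			if !matchesRequirement(labels, r) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	terms := []Term{{
		{Key: "kubernetes.io/arch", Values: []string{"amd64"}},
		{Key: "node-role", Values: []string{"gpu-node"}},
	}}
	gpuNode := map[string]string{"kubernetes.io/arch": "amd64", "node-role": "gpu-node"}
	armNode := map[string]string{"kubernetes.io/arch": "arm64", "node-role": "gpu-node"}
	fmt.Println(matchesTerms(gpuNode, terms)) // true
	fmt.Println(matchesTerms(armNode, terms)) // false
}
```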

Pod Affinity and Anti-Affinity

yaml
spec:
  affinity:
    # Pod affinity: schedule into the same zone as Pods labeled app=cache
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        topologyKey: topology.kubernetes.io/zone

    # Pod anti-affinity: never share a node with another Pod labeled app=my-app
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname
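
The role of topologyKey can be sketched as follows: a candidate node belongs to a topology domain (the value of that label), and the rule asks whether any matching Pod already runs in the same domain. This is a rough illustration with hypothetical types; it ignores namespaces, label selector expressions, and the checks the real scheduler performs against the anti-affinity rules of existing Pods.

```go
package main

import "fmt"

// ExistingPod is a hypothetical, simplified view of a Pod that is already running.
type ExistingPod struct {
	Labels   map[string]string
	NodeName string
}

// matchLabels reports whether labels contain every key/value pair in selector.
func matchLabels(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// domainHasMatch reports whether a Pod matching selector already runs in the
// same topology domain as the candidate node.
func domainHasMatch(candidate, topologyKey string, selector map[string]string,
	nodeLabels map[string]map[string]string, pods []ExistingPod) bool {
	domain, ok := nodeLabels[candidate][topologyKey]
	if !ok {
		return false // the node is not part of any domain for this key
	}
	for _, p := range pods {
		d, ok := nodeLabels[p.NodeName][topologyKey]
		if ok && d == domain && matchLabels(p.Labels, selector) {
			return true
		}
	}
	return false
}

func main() {
	const zoneKey, hostKey = "topology.kubernetes.io/zone", "kubernetes.io/hostname"
	nodeLabels := map[string]map[string]string{
		"node-1": {zoneKey: "zone-a", hostKey: "node-1"},
		"node-2": {zoneKey: "zone-a", hostKey: "node-2"},
		"node-3": {zoneKey: "zone-b", hostKey: "node-3"},
	}
	pods := []ExistingPod{
		{Labels: map[string]string{"app": "cache"}, NodeName: "node-1"},
		{Labels: map[string]string{"app": "my-app"}, NodeName: "node-2"},
	}
	for _, n := range []string{"node-1", "node-2", "node-3"} {
		affinityOK := domainHasMatch(n, zoneKey, map[string]string{"app": "cache"}, nodeLabels, pods)
		antiViolated := domainHasMatch(n, hostKey, map[string]string{"app": "my-app"}, nodeLabels, pods)
		fmt.Println(n, affinityOK && !antiViolated)
	}
}
```

With the rules from the manifest above, node-1 is feasible, node-2 is rejected by anti-affinity (it already hosts an app=my-app Pod), and node-3 is rejected by affinity (no app=cache Pod in zone-b).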

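Taints and Tolerations

bash
# Taint a node
kubectl taint nodes node1 gpu=true:NoSchedule
kubectl taint nodes node1 maintenance=true:NoExecute

# Taint effects:
# NoSchedule: new Pods without a matching toleration are not scheduled (running Pods are unaffected)
# PreferNoSchedule: the scheduler tries to avoid the node but may still use it
# NoExecute: running Pods without a matching toleration are evicted, and new ones are not scheduled

yaml
# Pod tolerations
spec:
  tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300  # stay bound for up to 300s after the taint appears, then be evicted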
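
The matching rule between a toleration and a taint can be sketched in a few lines. The types below are hypothetical simplifications of the real API types, and the logic follows the documented semantics as I understand them: an empty effect or key on the toleration acts as a wildcard, `Exists` ignores the value, and `Equal` (the default) compares it.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the core/v1 types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	}
	return false
}

// canSchedule reports whether every NoSchedule or NoExecute taint is tolerated.
// PreferNoSchedule taints only affect scoring, so they are skipped here.
func canSchedule(taints []Taint, tolerations []Toleration) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" && t.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	tolerations := []Toleration{
		{Key: "gpu", Operator: "Equal", Value: "true", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"},
	}
	gpuOnly := []Taint{{"gpu", "true", "NoSchedule"}}
	gpuAndMaintenance := []Taint{{"gpu", "true", "NoSchedule"}, {"maintenance", "true", "NoExecute"}}
	fmt.Println(canSchedule(gpuOnly, tolerations))           // true
	fmt.Println(canSchedule(gpuAndMaintenance, tolerations)) // false
}
```

The second result shows why the Pod above would not land on node1 as tainted in the shell commands: nothing in its tolerations covers the maintenance taint.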

Topology Spread Constraints

Spread Pods evenly across zones and nodes:

yaml
spec:
  topologySpreadConstraints:
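  - maxSkew: 1                              # maximum allowed difference in matching Pod count between domains
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule        # hard constraint: leave the Pod pending if it would be violated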
    labelSelector:
      matchLabels:
        app: my-app
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway       # soft constraint: schedule anyway, preferring nodes that reduce skew
    labelSelector:
      matchLabels:
        app: my-app
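
How maxSkew is evaluated can be shown with a small calculation: the skew of a placement is the number of matching Pods in the target domain after placement minus the smallest count in any domain. The sketch below assumes every eligible domain already appears in the `counts` map; the function names are illustrative and the real plugin also accounts for node eligibility and optional fields such as minDomains.

```go
package main

import "fmt"

// skewAfterPlacement returns the skew that would result from adding one more
// matching Pod to the given domain. counts maps each domain to its current
// number of matching Pods and is assumed to list every eligible domain.
func skewAfterPlacement(counts map[string]int, domain string) int {
	lowest := counts[domain]
	for _, c := range counts {
		if c < lowest {
			lowest = c
		}
	}
	return counts[domain] + 1 - lowest
}

// allowed reports whether placing the Pod in domain keeps skew within maxSkew.
func allowed(counts map[string]int, domain string, maxSkew int) bool {
	return skewAfterPlacement(counts, domain) <= maxSkew
}

func main() {
	counts := map[string]int{"zone-a": 2, "zone-b": 1, "zone-c": 1}
	fmt.Println(skewAfterPlacement(counts, "zone-a"), allowed(counts, "zone-a", 1)) // 2 false
	fmt.Println(skewAfterPlacement(counts, "zone-b"), allowed(counts, "zone-b", 1)) // 1 true
}
```

With DoNotSchedule, a result of false removes the nodes in that zone during filtering; with ScheduleAnyway, the skew only lowers their score.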

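Priority and Preemption

yaml
# Define a priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority business Pods"
preemptionPolicy: PreemptLowerPriority  # may preempt lower-priority Pods (this is the default)
---
# Reference the priority class from a Pod (fragment of the Pod spec)
spec:
  priorityClassName: high-priority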
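
Priority influences scheduling in two places: the order in which pending Pods leave the queue, and which running Pods may be evicted to make room. The sketch below illustrates both ideas with hypothetical types; real preemption additionally weighs PodDisruptionBudgets, graceful termination, and which node needs the fewest victims.

```go
package main

import (
	"fmt"
	"sort"
)

// QueuedPod is a hypothetical, simplified Pod record.
type QueuedPod struct {
	Name     string
	Priority int32
	Enqueued int // arrival sequence number; lower means earlier
}

// sortQueue orders Pods by descending priority, then by arrival order.
func sortQueue(q []QueuedPod) {
	sort.SliceStable(q, func(i, j int) bool {
		if q[i].Priority != q[j].Priority {
			return q[i].Priority > q[j].Priority
		}
		return q[i].Enqueued < q[j].Enqueued
	})
}

// preemptionCandidates returns the running Pods whose priority is strictly
// lower than that of the incoming Pod.
func preemptionCandidates(incoming int32, running []QueuedPod) []QueuedPod {
	var out []QueuedPod
	for _, p := range running {
		if p.Priority < incoming {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	queue := []QueuedPod{
		{Name: "batch", Priority: 0, Enqueued: 1},
		{Name: "web", Priority: 1000000, Enqueued: 2},
		{Name: "api", Priority: 1000000, Enqueued: 3},
	}
	sortQueue(queue)
	for _, p := range queue {
		fmt.Println(p.Name) // web, api, batch
	}

	running := []QueuedPod{{Name: "batch-1", Priority: 0}, {Name: "core", Priority: 2000000}}
	for _, p := range preemptionCandidates(1000000, running) {
		fmt.Println("candidate victim:", p.Name) // batch-1
	}
}
```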

Scheduling Framework Extension Points

PreEnqueue → QueueSort → PreFilter → Filter → PostFilter
    → PreScore → Score → NormalizeScore → Reserve
    → Permit → PreBind → Bind → PostBind

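Custom Scheduler Plugins

go
// A custom scheduler built from the upstream kube-scheduler plus one out-of-tree plugin.
package main

import (
    "context"
    "os"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/kubernetes/cmd/kube-scheduler/app"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// MyPlugin is a custom scheduling plugin.
type MyPlugin struct{}

func (p *MyPlugin) Name() string { return "MyPlugin" }

// Filter implements the Filter extension point.
func (p *MyPlugin) Filter(ctx context.Context, state *framework.CycleState,
    pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    // Custom filtering logic: reject nodes that lack the required label.
    node := nodeInfo.Node()
    if node == nil || node.Labels["custom-label"] != "required" {
        return framework.NewStatus(framework.Unschedulable, "node is missing the required label")
    }
    return nil // a nil status means the node passed
}

// Register the plugin. Signatures vary across Kubernetes versions: recent
// releases pass a context.Context as the first argument of the factory.
func main() {
    command := app.NewSchedulerCommand(
        app.WithPlugin("MyPlugin", func(obj runtime.Object, h framework.Handle) (framework.Plugin, error) {
            return &MyPlugin{}, nil
        }),
    )
    if err := command.Execute(); err != nil {
        os.Exit(1)
    }
}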

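Scheduler Performance Tuning

bash
# List scheduling events; comparing their timestamps with Pod creation times gives a rough view of scheduling latency
kubectl get events --field-selector reason=Scheduled

# Key scheduler settings
# percentageOfNodesToScore is a field of the KubeSchedulerConfiguration file:
#   percentageOfNodesToScore: 50   # stop searching once 50% of nodes have been found feasible (large-cluster optimization)
# Maximum time a Pod may stay in the unschedulable queue before it is retried:
--pod-max-in-unschedulable-pods-duration=5m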
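
The effect of percentageOfNodesToScore is easier to see as a calculation. The sketch below is modeled on the upstream logic as I understand it: clusters under 100 nodes are always searched in full, a value of 0 selects an adaptive percentage that shrinks as the cluster grows, and the result never drops below 100 nodes. Treat the constants and the exact formula as assumptions that may differ between Kubernetes versions.

```go
package main

import "fmt"

const (
	minFeasibleNodesToFind           = 100 // assumed lower bound on nodes to find
	minFeasibleNodesPercentageToFind = 5   // assumed floor for the adaptive percentage
)

// numFeasibleNodesToFind returns how many feasible nodes the scheduler looks
// for before it stops filtering and moves on to scoring.
func numFeasibleNodesToFind(percentage, numAllNodes int32) int32 {
	if numAllNodes < minFeasibleNodesToFind {
		return numAllNodes // small clusters are always searched in full
	}
	if percentage == 0 {
		// Adaptive default: starts near 50% and decreases with cluster size.
		percentage = 50 - numAllNodes/125
		if percentage < minFeasibleNodesPercentageToFind {
			percentage = minFeasibleNodesPercentageToFind
		}
	}
	n := numAllNodes * percentage / 100
	if n < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return n
}

func main() {
	fmt.Println(numFeasibleNodesToFind(50, 1000)) // 500
	fmt.Println(numFeasibleNodesToFind(0, 5000))  // 500 (adaptive: 10%)
	fmt.Println(numFeasibleNodesToFind(50, 80))   // 80
	fmt.Println(numFeasibleNodesToFind(10, 500))  // 100
}
```

Lowering the percentage reduces scheduling latency in large clusters at the cost of possibly missing a better-scoring node that was never examined.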

Monitoring Metrics

scheduler_scheduling_attempt_duration_seconds  # latency of scheduling attempts
scheduler_pending_pods                         # number of pending Pods
scheduler_preemption_attempts_total            # number of preemption attempts
scheduler_schedule_attempts_total{result}      # scheduling attempts by result (scheduled/unschedulable/error)

Content on this site is compiled and written by 褚成志 and is provided for study and reference only.