Advanced k8s: Scheduler and Deployment Strategies

Scheduling

Scheduling flow

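etcd stores the cluster state, including node configuration, and the ApiServer is the hub through which every other component reads and writes that state. The Scheduler maintains a priority queue of pods waiting to be scheduled, ordered by pod priority. When a new pod needs scheduling, the Scheduler's informer, which watches the ApiServer, picks up the event and puts the pod into the queue.
The ApiServer also delivers the node information held in etcd to the Scheduler, which keeps it in a local cache so that it does not have to query the ApiServer on every scheduling decision.
The Scheduler takes pods from the priority queue and applies its internal policies (predicates, then priorities) to pick a concrete node for each pod, producing a pod-to-node binding.
- Predicates (filtering): rule out the nodes that cannot run the pod (based on node selector / taints / labels …)
- Priorities (scoring): compute a score for each remaining node; the node with the highest score is the best fit for the pod
The kubelet on the chosen node learns of the binding through the ApiServer and manages the pod's lifecycle from there.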
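
The flow above can be sketched in a few lines of Python. This is only a toy model, not the real kube-scheduler: the `Node` and `Pod` classes, the CPU-only resource check, and the "most free CPU wins" score are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict
    free_cpu: float  # cores not yet requested by other pods (simplified)


@dataclass
class Pod:
    name: str
    priority: int
    cpu_request: float
    node_selector: dict = field(default_factory=dict)


def filter_nodes(pod, nodes):
    """Predicate phase: keep only the nodes that are able to run the pod."""
    return [
        n for n in nodes
        if n.free_cpu >= pod.cpu_request
        and all(n.labels.get(k) == v for k, v in pod.node_selector.items())
    ]


def score_node(pod, node):
    """Priority phase: toy score that favours the node with the most CPU left."""
    return node.free_cpu - pod.cpu_request


def schedule(pods, nodes):
    """Pop pods by descending priority, filter, score, and record the binding."""
    # heapq is a min-heap, so the priority is negated; the index breaks ties.
    queue = [(-p.priority, i, p) for i, p in enumerate(pods)]
    heapq.heapify(queue)
    bindings = {}
    while queue:
        _, _, pod = heapq.heappop(queue)
        feasible = filter_nodes(pod, nodes)
        if not feasible:
            bindings[pod.name] = None  # stays unscheduled in this sketch
            continue
        best = max(feasible, key=lambda n: score_node(pod, n))
        best.free_cpu -= pod.cpu_request  # update the cached node state
        bindings[pod.name] = best.name
    return bindings
```

Calling `schedule(pods, nodes)` returns a dict that maps each pod name to the chosen node name, or to `None` when no node passes the filter.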

Affinity

affinity:
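  # Node affinity
  nodeAffinity:
    # required: a hard requirement; the node must match
    requiredDuringSchedulingIgnoredDuringExecution:
      # Node selection rules; multiple nodeSelectorTerms are ORed together
      nodeSelectorTerms:
      # Multiple expressions within one matchExpressions list are ANDed together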
      - matchExpressions:
        - key: beta.kubernetes.io/arch
          operator: In
          values:
          - amd64
    # preferred: a soft preference; not mandatory
    preferredDuringSchedulingIgnoredDuringExecution:
    # Weight added to a node's score when the preference matches
    - weight: 1
      preference:
        matchExpressions:
        - key: disktype
          operator: NotIn
          values:
          - ssd
  # Pod affinity
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web-demo
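      # topologyKey defines the topology domain used for the match.
      # kubernetes.io/hostname is a label every node carries by default, with a distinct value per node, so each node is its own domain.
      # This rule therefore requires the pod to run on the same node as a pod labeled app=web-demo.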
      topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web-demo-node
        topologyKey: kubernetes.io/hostname
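  # The Anti prefix inverts the rule
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        # If this is the pod's own label and the pod runs two replicas, the two replicas will not run on the same node
        # If this is another pod's label, this pod will not run on the same node as that pod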
        - key: app
          operator: In
          values:
          - web-demo
      topologyKey: kubernetes.io/hostname
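
To make the OR/AND rules of `nodeSelectorTerms` and `matchExpressions` concrete, here is a minimal Python sketch of how a required node affinity could be evaluated against a node's labels. The function names `match_expression` and `node_matches` are hypothetical, and only four operators are modelled; it illustrates the semantics and is not the scheduler's actual code.

```python
def match_expression(labels, expr):
    """Evaluate one matchExpressions entry against a node's label dict."""
    key = expr["key"]
    op = expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A node without the key also satisfies NotIn.
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not modelled in this sketch: {op}")


def node_matches(labels, node_selector_terms):
    """Terms are ORed; the expressions inside one term are ANDed."""
    return any(
        all(match_expression(labels, e) for e in term["matchExpressions"])
        for term in node_selector_terms
    )
```

With the required rule from the manifest above, a node labeled `beta.kubernetes.io/arch=amd64` matches, while a node with any other architecture value is filtered out.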

Taint

`kubectl taint nodes <node-hostname> gpu=true:NoSchedule #taint format: key=value:effect`

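Effects:

NoSchedule: hard rule; pods that do not tolerate the taint are never scheduled onto the node
PreferNoSchedule: soft rule; the scheduler tries to avoid the node but may still use it
NoExecute: the strictest; pods are not scheduled onto the node, and pods already running there that do not tolerate the taint are evicted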

Tolerations

spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    # Must match the effect that was used when the node was tainted
    effect: "NoSchedule"
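
How a toleration is matched against a taint can be sketched as follows. This is a simplified Python model under stated assumptions: taints and tolerations are plain dicts, and the helper names `tolerates` and `blocks_scheduling` are hypothetical rather than part of any Kubernetes client library.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # Exists ignores the value; an empty key matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal: both key and value have to be identical.
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def blocks_scheduling(taints, tolerations):
    """True if any hard taint on the node is left untolerated by the pod."""
    return any(
        taint["effect"] in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
    )
```

A PreferNoSchedule taint never blocks scheduling in this model, which mirrors its soft nature; a real scheduler would lower the node's score instead.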

deploy-policy

Recreate

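Recreate: the old pods are stopped first and only then are the new ones started, so the service is interrupted for a short time.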

spec:
  strategy:
    type: Recreate

Rolling-update

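Pods are replaced gradually (maxSurge and maxUnavailable both default to 25%). During the rollout some users reach the new version and some reach the old one, and the service is never interrupted.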

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
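      # Percentage of pods allowed above the desired total. With 4 replicas and 25%, only 1 new-version pod can be started at a time
      maxSurge: 25%
      # Maximum unavailable: with 4 replicas and 25%, 1 pod may be unavailable during the update
      maxUnavailable: 25%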
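
The arithmetic behind these two settings can be checked with a small Python sketch. It assumes the rounding that Kubernetes documents for percentages (maxSurge rounds up, maxUnavailable rounds down); the function names `resolve` and `rollout_bounds` are hypothetical.

```python
def resolve(value, replicas, round_up):
    """Turn an absolute number or a percentage string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            # Integer ceiling division avoids floating point surprises.
            return -(-replicas * percent // 100)
        return replicas * percent // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count limits that hold during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Both zero would stall the rollout, so one pod may go unavailable.
        unavailable = 1
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }
```

For the 4-replica example in the comments, `rollout_bounds(4)` gives at most 5 pods in total and at least 3 available at any moment of the rollout.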

# Pause the rolling update
kubectl rollout pause deploy <deployment-name> -n <namespace>
# Resume
kubectl rollout resume deploy <deployment-name> -n <namespace>
# Roll back to the previous revision
kubectl rollout undo deploy <deployment-name> -n <namespace>