Deploy Prometheus Operator


Note: Prometheus Operator is not configured by editing the native prometheus configuration file; everything goes through the CRD resources. For example, to configure an alert you run  kubectl edit prometheusrules.monitoring.coreos.com -n monitoring  rather than editing a ConfigMap resource.

kubectl get crd
NAME                                        CREATED AT
alertmanagerconfigs.monitoring.coreos.com   2021-01-07T09:43:11Z
alertmanagers.monitoring.coreos.com         2021-01-07T09:43:11Z
podmonitors.monitoring.coreos.com           2021-01-07T09:43:12Z
probes.monitoring.coreos.com                2021-01-07T09:43:12Z
prometheuses.monitoring.coreos.com          2021-01-07T09:43:12Z
prometheusrules.monitoring.coreos.com       2021-01-07T09:43:12Z
servicemonitors.monitoring.coreos.com       2021-01-07T09:43:13Z
thanosrulers.monitoring.coreos.com          2021-01-07T09:43:13Z


servicemonitors      configure metrics scraping (servicemonitors -> service -> endpoints), similar to prometheus.yml

prometheusrules    used to write alerting rules, similar to rule.yml

alertmanagers        control the alertmanager application, e.g. to change the number of alertmanager replicas

prometheuses        control the prometheus application, e.g. to change the number of prometheus replicas
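
To see the objects behind each of these CRDs (assuming the default monitoring namespace used throughout this page), they can be listed directly:

kubectl get servicemonitors -n monitoring
kubectl get prometheusrules -n monitoring
kubectl get prometheus,alertmanager -n monitoring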


Official documentation:

https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md

Resource files:

https://github.com/cnych/kubernetes-learning

Prometheus components

The components of the stack:

kube-state-metrics: derives aggregate metrics from the cluster state, e.g. the number of Pods and how many Pods are in each state

node_exporter: collects node-level data

Alertmanager: alerting component

Prometheus-Server: scrapes data from metrics endpoints, cAdvisor and the exporters

k8s-prometheus-adapter: custom metrics component


GitHub repository:

https://github.com/coreos/kube-prometheus

Clone the code:

git clone https://github.com/coreos/kube-prometheus.git

First deploy the resource files in this directory: they create the CRD resources and must be applied before anything else.

cd kube-prometheus/manifests/setup
kubectl apply -f .

Then deploy the resources in this directory:

cd kube-prometheus/manifests
kubectl apply -f .

Resource list:

$ kubectl get pods -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   4          24h
alertmanager-main-1                    2/2     Running   4          24h
alertmanager-main-2                    2/2     Running   4          24h
grafana-697c9fc764-tzbvz               1/1     Running   2          24h
kube-state-metrics-5b667f584-2xqd2     1/1     Running   2          24h
node-exporter-6nb4s                    2/2     Running   4          24h
prometheus-adapter-68698bc948-szvjv    1/1     Running   2          24h
prometheus-k8s-0                       3/3     Running   1          87m
prometheus-k8s-1                       3/3     Running   1          87m
prometheus-operator-7457c79db5-lw2qb   1/1     Running   2          24h

Accessing the services the simple way:

        After the deployment is finished you need to access prometheus and grafana. The simplest way is to change the prometheus-k8s and grafana Services to the NodePort type.

kubectl edit svc prometheus-k8s -n monitoring
kubectl edit svc grafana -n monitoring
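
If you prefer a non-interactive change, a one-line patch achieves the same result (a sketch; check the assigned node ports afterwards):

kubectl patch svc prometheus-k8s -n monitoring -p '{"spec": {"type": "NodePort"}}'
kubectl patch svc grafana -n monitoring -p '{"spec": {"type": "NodePort"}}'
kubectl get svc -n monitoring prometheus-k8s grafana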

Accessing the services through an Ingress controller: the Ingress rule is as follows.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-k8s
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: prometheus2.chexiangsit.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090

Adjusting the number of replicas:

        Editing the workload resources directly does not work, because prometheus-operator controls them through the CRD custom resources.

To adjust the number of alertmanager Pods, modify the following resource file (see the excerpt after the file paths):

kube-prometheus/manifests/alertmanager-alertmanager.yaml

To adjust the number of prometheus replicas, or other parameters, modify:

kube-prometheus/manifests/prometheus-prometheus.yaml
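
A minimal excerpt, as an illustration, of the Alertmanager custom resource showing the field to change; the Prometheus resource in prometheus-prometheus.yaml has the same replicas field. Re-apply the file after editing:

# kube-prometheus/manifests/alertmanager-alertmanager.yaml (excerpt)
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 3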


Creating resources for targets that are not yet monitored:

        On the prometheus Targets page you can see that kube-controller-manager and kube-scheduler are not up. The following changes are needed.

http://192.168.1.71:31674/targets

Create the resource files: an Endpoints object and the matching Service.

        Note that kube-scheduler and kube-controller-manager must listen on --address=0.0.0.0.

        Resource relationship: servicemonitor -> service -> endpoints. See the ServiceMonitor resource for the exact settings; the scheme (http/https) and the selected labels must match. A quick reachability check is shown below.
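
Before creating the resources, it is worth verifying that the metrics ports are reachable (a quick check; the IP is the master node used in the examples below, the ports are the legacy http metrics ports):

curl http://192.168.1.70:10251/metrics   # kube-scheduler
curl http://192.168.1.70:10252/metrics   # kube-controller-manager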

ServiceMonitor resource: it already exists by default and does not need to be created, but it does need to be modified.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: http-metrics
    scheme: http # request scheme; the default is https, change it to http for a binary-installed cluster
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-scheduler # label used to select the Service; the default differs between prometheus operator versions

kube-controller-manager Service and Endpoints resources:

apiVersion: v1
kind: Service
metadata:
  name: kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
    # app.kubernetes.io/name: kube-controller-manager
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP
    targetPort: 10252
---
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
    # app.kubernetes.io/name: kube-controller-manager
subsets:
- addresses:
  - ip: 192.168.1.70
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP

kube-scheduler Service and Endpoints resources:

apiVersion: v1
kind: Service
metadata: 
  name: kube-scheduler
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler
    # app.kubernetes.io/name: kube-scheduler # label used by newer versions; check with: kubectl edit servicemonitor -n monitoring kube-scheduler
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
    # app.kubernetes.io/name: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.1.70
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP

Reference:

https://www.servicemesher.com/blog/prometheus-operator-manual/



Basic configuration for monitoring a service


        Prometheus monitors a service through the chain ServiceMonitor -> Service -> Endpoints. This works the same way for applications inside the k8s cluster and for external applications.

ServiceMonitor resource:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: http-metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-scheduler

Key fields:

spec.endpoints.interval: 30s, scrape every 30 seconds

spec.endpoints.port: http-metrics, the name of the Service port, defined under spec.ports.name in the Service resource.

spec.namespaceSelector.matchNames: match Services in the listed namespaces; use any: true to match Services in all namespaces.

spec.selector.matchLabels: match Service labels; multiple labels are ANDed.

spec.selector.matchExpressions: expression-based label matching; an In expression matches a Service whose label value is any of the listed values (see the sketch below).
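
A minimal matchExpressions sketch (the label key and values are illustrative):

spec:
  selector:
    matchExpressions:
    - key: k8s-app
      operator: In
      values:
      - kube-scheduler
      - kube-controller-manager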

Service resource:

apiVersion: v1
kind: Service
metadata: 
  name: kube-scheduler
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    component: kube-scheduler # omit this block on a binary-installed k8s cluster
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP

metadata.labels: these labels must match the labels selected by spec.selector.matchLabels in the ServiceMonitor.

spec.selector: omit this block on a binary-installed k8s cluster; on a kubeadm cluster it selects the kube-scheduler Pods.

spec.clusterIP: use None.

Endpoints resource:

apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.1.70
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP

The content under metadata in the Endpoints must be identical to the content under metadata in the Service.

subsets.addresses.ip: for a highly available cluster, list the IPs of all kube-scheduler nodes here, as in the sketch below.
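
A sketch of such an Endpoints subset (the second IP is hypothetical and stands for an additional master node):

subsets:
- addresses:
  - ip: 192.168.1.70
  - ip: 192.168.1.71 # hypothetical second master
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP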


Adding custom labels: the Targets page shows many labels; new labels can be added through relabeling.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
  generation: 4
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 1.0.1
  name: node-exporter
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 15s
    port: https
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    - action: replace
      replacement: yuenan # value of the new label
      sourceLabels:
      - __meta_kubernetes_pod_uid # the source label being rewritten
      targetLabel: region # the new label
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: node-exporter
      app.kubernetes.io/part-of: kube-prometheus

Example: monitoring external nodes with prometheus-operator.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: external-nodes
  name: external-nodes
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: http-metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      k8s-app: external-nodes
---
apiVersion: v1
kind: Service
metadata:
  name: external-nodes
  namespace: monitoring
  labels:
    k8s-app: external-nodes
spec:
  selector:
    k8s-app: external-nodes
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 9100
    targetPort: 9100
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: external-nodes
  name: external-nodes
  namespace: monitoring
subsets:
- addresses:
  - ip: 172.31.33.194
    nodeName: vietnam-prd
  - ip: 172.31.34.184
    nodeName: vietnam-uat
  ports:
  - name: http-metrics
    port: 9100
    protocol: TCP




Modify monitoring and alerting configuration


Remove the Watchdog alert: open the rules configuration, find the relevant block and delete it.

kubectl edit prometheusrules.monitoring.coreos.com -n monitoring prometheus-k8s-rules



Monitor etcd


Create the Secret resource:

kubectl -n monitoring create secret generic etcd-certs --from-file=etcd_client.key --from-file=etcd_client.crt --from-file=ca.crt

Edit the prometheus resource and add the secret: this prometheus resource is the CRD-defined custom resource.

kubectl edit prometheus k8s -n monitoring

Add the following secrets section:

spec:
  secrets:
  - etcd-certs

Check that the certificates have been mounted:

kubectl exec -it -n monitoring prometheus-k8s-0 ls /etc/prometheus/secrets/etcd-certs

Create the Service and Endpoints resources: the metadata of these two resources must match, and clusterIP must be None.

apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 192.168.1.70
    nodeName: etc-master # any name
  ports:
  - name: port
    port: 2379
    protocol: TCP

Full example: for the available parameters see the official documentation linked at the top of this page.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata: 
  labels:
    k8s-etcd: etcd
  name: etcd
  namespace: kube-system
spec: 
  endpoints: 
  - interval: 30s
    port: etcd-metrics # this name must match service.spec.ports.name below
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: etcd
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  selector:
    component: etcd # selects the Pod labels; omit for an external application
  type: ClusterIP
  clusterIP: None
  ports:
  - name: etcd-metrics
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s # the Endpoints name must match the Service name
  namespace: kube-system
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 10.65.104.112 # the corresponding Pod IPs; the same applies to non-Pod applications
    nodeName: lf-k8s-112
  ports:
  - name: etcd-metrics # usually the same name as the Service port
    port: 2379
    protocol: TCP

Open the prometheus Targets page and the etcd targets should be visible. If there are errors, check whether the etcd listen address is reachable.

http://192.168.1.71:31674/targets

Then download an etcd dashboard from the Grafana website:

        Home -> grafana -> Dashboards -> in the left-hand "Filter by" choose Data Source: Prometheus -> type etcd into "Search within this list". In the list on the right pick "Etcd by Prometheus" (dashboard ID 3070), download the JSON file, and import it via the left-hand "+" -> Import -> Upload .json file.

https://grafana.com/grafana/dashboards



Add custom alerts


Create a PrometheusRule resource:

        Prometheus finds Alertmanager by matching the alertmanager Endpoints configured in its spec. To add custom alerts, create a PrometheusRule resource and give it the labels prometheus=k8s and role=alert-rules, because Prometheus selects rule objects by these two labels (see the excerpt below).
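
The selection can be seen in the Prometheus custom resource itself; the relevant excerpt (as shipped by kube-prometheus, also visible in prometheus-prometheus.yaml later on this page) is:

spec:
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules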

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-rules
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: etcd
    rules:
    - alert: EtcdClusterUnavailable
      annotations:
        summary: etcd cluster small
      expr: | 
        count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1 )
      for: 3m
      labels:
        severity: critical

Check that the PrometheusRule has been mounted into the prometheus rules directory:

kubectl exec -n monitoring -it prometheus-k8s-0 ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/

Check that it took effect: the Alerts page of the prometheus web UI should now show an EtcdClusterUnavailable entry.

http://192.168.1.71:31674/alerts



Configure Alertmanager notifications


        The other resources stay as shipped by the coreos project; the main change is the secret resource. Delete the original one and use the secret below.

Email notifications:

alertmanager-secret.yaml: once this is configured correctly, alerts arrive in the configured mailbox.

apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    global:
      smtp_smarthost: "smtp.163.com:465"
      smtp_from: "chuxiangyi_com@163.com"
      smtp_auth_username: "chuxiangyi_com@163.com"
      smtp_auth_password: "password"
      smtp_require_tls: false
    route:
      group_by: ["alertname","cluster"] # group alerts; alerts with the same values for these labels are merged into one group
      group_wait: 30s # alerts of the same group received within this window are sent as a single notification
      group_interval: 5m # interval between notifications for the same group
      repeat_interval: 10m # interval before a notification is repeated
      receiver: "default-receiver" # choose one of the receivers defined below
    receivers:
    - name: "default-receiver"
      email_configs: # email notifications
      - to: "myEmail@qq.com"
    - name: 'web.hook'
      webhook_configs: # send notifications via webhook
      - url: 'http://127.0.0.1:5001/'
type: Opaque

Check that the configuration was loaded: change the alertmanager Service to the NodePort type.

http://192.168.1.71:30808/#/status

Besides email, WeChat, DingTalk and other notification channels can be configured as well.


Using an external alertmanager:

        Most companies run more than one environment. Several environments can share a single alertmanager, in which case alerting has to be pointed at the external alertmanager.

Inspect the service resource:

$kubectl get svc -n monitoring alertmanager-main -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
  labels:
    alertmanager: main
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: v0.21.0
spec:
  clusterIP: 10.107.202.102
  clusterIPs:
  - 10.107.202.102
  ports:
  - name: web
    port: 9093
    protocol: TCP
    targetPort: web
  selector:
    alertmanager: main
    app: alertmanager
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: ClusterIP
status:
  loadBalancer: {}

Edit the endpoints resource:

$kubectl edit endpoints -n monitoring alertmanager-main
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
  labels:  # the labels must match those of the service resource
    alertmanager: main
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: v0.21.0
  name: alertmanager-main
  namespace: monitoring
subsets:
- addresses:
  - ip: 10.32.215.16 # the external alertmanager address
  ports:
  - name: web # the name must match the service resource
    port: 9093
    protocol: TCP



DingTalk notifications:

Deployment YAML for dingtalk:

https://github.com/zhuqiyang/dingtalk-yaml

The notification target for prometheus operator is configured in the resource file alertmanager-secret.yaml.


Alert routing:

        Child routes are matched first; if no child route matches, the default receiver is used. Alerts are matched by the labels defined on them.

apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    global:
      smtp_smarthost: "smtp.163.com:465"
      smtp_from: "chuxiangyi_com@163.com"
      smtp_auth_username: "chuxiangyi_com@163.com"
      smtp_auth_password: "123456"
      smtp_require_tls: false
    route:
      group_by: ["alertname"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: "web.hook"
      routes:
      - match_re: # regex match: when the alertname label matches the regex, use the web.hook receiver
          alertname: "^(KubeCPUOvercommit|KubeMemOvercommit)$"
        receiver: "web.hook"
      - match: # when alertname equals the value below, use the ops-email receiver
          alertname: "kube-scheduler-Unavailable"
        receiver: "ops-email"
    receivers:
    - name: "ops-email"
      email_configs:
      - to: "250994559@qq.com" # 多个地址用逗号隔开
    - name: "web.hook"
      webhook_configs:
      - url: "http://dingtalk.default.svc.cluster.local:5001"
type: Opaque


Configuring alert rules:

        The actual alerting rules are defined in the Prometheus rule configuration. With a Prometheus Operator deployment they can be modified in the prometheus-rules.yaml resource file, for example as sketched below.
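
As an illustration, a rule that would fire the kube-scheduler-Unavailable alert matched by the route above could look like the following; the expression and duration are assumptions, not taken from the shipped rules:

- alert: kube-scheduler-Unavailable
  expr: absent(up{job="kube-scheduler"} == 1) # no healthy kube-scheduler target
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: kube-scheduler target is down or missing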



Resource overview dashboards


Grafana dashboards giving an overview of the various k8s resources:

https://grafana.com/dashboards/6417 
https://grafana.com/grafana/dashboards/13105



Auto-discovery


        Defining a separate ServiceMonitor for every resource in the k8s cluster is tedious; auto-discovery can be used to monitor them instead.

Add the following annotation to the resources that should be monitored (a full Service example follows the snippet):

annotations:
  prometheus.io/scrape: "true"
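
For example, a Service that the auto-discovery job defined below would pick up (the service name, namespace and port are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: my-app # hypothetical application
  namespace: default
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080" # the port serving the metrics
    # prometheus.io/path: "/metrics" # optional, /metrics is the default
spec:
  selector:
    app: my-app
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080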

The steps below show how to configure auto-discovery.

Create the file prometheus-additional.yaml:

- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name

Create a secret from the configuration above:

kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring

Modify kube-prometheus/manifests/prometheus-prometheus.yaml as follows:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  image: quay.io/prometheus/prometheus:v2.15.2
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.15.2

In effect this just adds the following block:

  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml

Apply the configuration:

kubectl apply -f prometheus-prometheus.yaml

Then check prometheus -> Status -> Configuration to confirm the new scrape config is present.

The configuration is now there, but the required permissions are still missing; the errors show up in the logs:

kubectl logs -f prometheus-k8s-0 prometheus -n monitoring

Add the permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get

After a short while, check prometheus -> Status -> Targets for the following target:

kubernetes-service-endpoints (1/1 up)

For resources that expose metrics on a specific port, such as redis, also add the port annotation:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9121"

Other resources such as Pods and Ingresses can be handled the same way; a Pod-based scrape job is sketched below.
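
A sketch of an extra job for Pod annotations that could be appended to prometheus-additional.yaml; it uses the pod meta labels instead of the service meta labels:

- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)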

Further reading:

https://github.com/YaoZengzeng/KubernetesResearch/blob/master/Prometheus%E5%91%8A%E8%AD%A6%E6%A8%A1%E5%9E%8B%E5%88%86%E6%9E%90.md



Prometheus Operator data persistence



NFS


Using NFS as persistent storage:

        First set up an NFS service on any host, then deploy the following resources in k8s. This uses nfs-client-provisioner; the documentation is here:

https://github.com/kubernetes-retired/external-storage/tree/master/nfs-client

New location:

https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner

The files that need to be deployed:

nfs-storageclass.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-nfs-storage
provisioner: fuseim.pri/ifs # or choose another name, must match deployment's env PROVISIONER_NAME'
parameters:
  archiveOnDelete: "false"

rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfs-client-provisioner
  # replace with namespace where provisioner is deployed
  namespace: default
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: nfs-client-provisioner-runner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: run-nfs-client-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-client-provisioner
    # replace with namespace where provisioner is deployed
    namespace: default
roleRef:
  kind: ClusterRole
  name: nfs-client-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-client-provisioner
  # replace with namespace where provisioner is deployed
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-client-provisioner
  # replace with namespace where provisioner is deployed
  namespace: default
subjects:
  - kind: ServiceAccount
    name: nfs-client-provisioner
    # replace with namespace where provisioner is deployed
    namespace: default
roleRef:
  kind: Role
  name: leader-locking-nfs-client-provisioner
  apiGroup: rbac.authorization.k8s.io

deployment.yaml: remember to change the NFS server address and path

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-client-provisioner
  labels:
    app: nfs-client-provisioner
  # replace with namespace where provisioner is deployed
  namespace: default
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: nfs-client-provisioner
  template:
    metadata:
      labels:
        app: nfs-client-provisioner
    spec:
      serviceAccountName: nfs-client-provisioner
      containers:
        - name: nfs-client-provisioner
          image: quay.io/external_storage/nfs-client-provisioner:latest
          volumeMounts:
            - name: nfs-client-root
              mountPath: /persistentvolumes
          env:
            - name: PROVISIONER_NAME
              value: fuseim.pri/ifs # if there are several storageclasses and NFS servers, this value must differ for each
            - name: NFS_SERVER
              value: 192.168.0.71
            - name: NFS_PATH
              value: /opt/nfs
      volumes:
        - name: nfs-client-root
          nfs:
            server: 192.168.0.71
            path: /opt/nfs

prometheus-prometheus.yaml

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  retention: "30d" # 数据保存时间
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  storage: # add the storage section
    volumeClaimTemplate:
      spec:
        storageClassName: managed-nfs-storage # must match the name of the StorageClass created above
        resources:
          requests:
            storage: 20Gi
  image: quay.io/prometheus/prometheus:v2.15.2
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.15.2


        If you create Pods manually for testing, you also need to create the PVC resources below by hand. To delete a PV, first stop the application that is using the storage, then delete the PVC; the PV is removed automatically once the PVC is deleted (see the commands after the manifests).

nfs-pvc.yaml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: prometheus-k8s-db-prometheus-k8s-0
  namespace: monitoring
  annotations:
    volume.beta.kubernetes.io/storage-class: "managed-nfs-storage"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: prometheus-k8s-db-prometheus-k8s-1
  namespace: monitoring
  annotations:
    volume.beta.kubernetes.io/storage-class: "managed-nfs-storage"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
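
A quick way to verify the dynamically provisioned volumes, and to release them when they are no longer needed (delete the PVC only after the Pods that use it are gone; the PVC name is one of those created above):

kubectl get storageclass
kubectl get pvc -n monitoring
kubectl get pv
kubectl delete pvc prometheus-k8s-db-prometheus-k8s-0 -n monitoring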



Ceph RBD


Using Ceph RBD as persistent storage:

        After creating the StorageClass, put its name into the prometheus-prometheus.yaml resource file.

apiVersion: v1
kind: Secret
metadata:
  name: ceph-admin-secret
  namespace: monitoring
data:
  key: QVFBbjVuVmVSZDJrS3hBQUlRZE9xcDkrSlQrVStzQUhIbVMzWGc9PQ==
type: kubernetes.io/rbd
---
apiVersion: v1
kind: Secret
metadata:
  name: ceph-k8s-secret
  namespace: monitoring
data:
  key: QVFCWEhuWmVSazlkSnhBQVJoenZEeUpnR1hFVDY4dzc0WW9KVmc9PQ==
type: kubernetes.io/rbd
---
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: rbd-dynamic
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/rbd
reclaimPolicy: Retain
parameters:
  monitors: 192.168.0.34:6789
  adminId: admin
  adminSecretName: ceph-admin-secret
  adminSecretNamespace: monitoring
  pool: kube
  userId: k8s
  userSecretName: ceph-k8s-secret

Configure the retention period: how long to keep monitoring data; modify the resource file prometheus-prometheus.yaml

spec:
  retention: "30d" # [0-9]+(ms|s|m|h|d|w|y) (milliseconds seconds minutes hours days weeks years)



CephFS


Using CephFS as persistent storage:

Example manifests:

https://github.com/kubernetes-incubator/external-storage/tree/master/ceph/cephfs/deploy

storageclass.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: cephfs
provisioner: ceph.com/cephfs
parameters:
  monitors: 192.168.0.34:6789
  adminId: admin
  adminSecretName: ceph-admin-secret
  adminSecretNamespace: cephfs
  claimRoot: /pvc-volumes
---
apiVersion: v1
kind: Secret
metadata:
  name: ceph-admin-secret
  namespace: cephfs
data:
  key: QVFBbjVuVmVSZDJrS3hBQUlRZE9xcDkrSlQrVStzQUhIbVMzWGc9PQ==
type: kubernetes.io/rbd

deployment.yaml

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cephfs-provisioner
  namespace: cephfs
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
  - apiGroups: [""]
    resources: ["services"]
    resourceNames: ["kube-dns","coredns"]
    verbs: ["list", "get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cephfs-provisioner
subjects:
  - kind: ServiceAccount
    name: cephfs-provisioner
    namespace: cephfs
roleRef:
  kind: ClusterRole
  name: cephfs-provisioner
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cephfs-provisioner
  namespace: cephfs
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "get", "delete"]
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cephfs-provisioner
  namespace: cephfs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cephfs-provisioner
subjects:
- kind: ServiceAccount
  name: cephfs-provisioner
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cephfs-provisioner
  namespace: cephfs
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cephfs-provisioner
  namespace: cephfs
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: cephfs-provisioner
    spec:
      containers:
      - name: cephfs-provisioner
        image: "quay.io/external_storage/cephfs-provisioner:latest"
        env:
        - name: PROVISIONER_NAME
          value: ceph.com/cephfs
        - name: PROVISIONER_SECRET_NAMESPACE
          value: cephfs
        command:
        - "/usr/local/bin/cephfs-provisioner"
        args:
        - "-id=cephfs-provisioner-1"
      serviceAccount: cephfs-provisioner

cephfs-pvc.yaml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: prometheus-k8s-db-prometheus-k8s-0
  namespace: monitoring
spec:
  storageClassName: cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: prometheus-k8s-db-prometheus-k8s-1
  namespace: monitoring
spec:
  storageClassName: cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi

Note: when using CephFS as storage, if prometheus-k8s-x goes into CrashLoopBackOff it may be a permission problem that prevents prometheus from writing to the filesystem. Adjust runAsUser and the related options in prometheus-prometheus.yaml:

  securityContext:
    fsGroup: 0
    runAsNonRoot: false
    runAsUser: 0

Example storage section:

  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: cephfs
        accessModes: [ "ReadWriteMany" ]
        resources:
          requests:
            storage: 2Gi



Prometheus Operator: monitoring an external MySQL application


mysqld_exporter official documentation:

https://github.com/prometheus/mysqld_exporter

Download the exporter:

wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz

Create the exporter user and password:

mysql -uroot -p
CREATE USER 'exporter'@'localhost' IDENTIFIED BY '123456';
# grant permissions to view replication status, running threads, and all databases.
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
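
A quick sanity check of the grants, run in the same mysql session:

SHOW GRANTS FOR 'exporter'@'localhost';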

Install the exporter on the MySQL node:

tar -xf mysqld_exporter-0.12.1.linux-amd64.tar.gz -C /usr/local/
ln -sv mysqld_exporter-0.12.1.linux-amd64/ mysqld_exporter

Create the configuration file:

vim .my.cnf
[client]
user=exporter
password=123456

Some other options:

Common flags:
# collect innodb compression statistics
--collect.info_schema.innodb_cmp
# innodb storage engine status
--collect.engine_innodb_status
# specify the configuration file
--config.my-cnf=".my.cnf"

Trial run:

./mysqld_exporter --config.my-cnf=.my.cnf
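
While it is running, you can confirm that the exporter serves metrics on its default port 9104 (the same port used by the Service and Endpoints resources further below):

curl -s http://localhost:9104/metrics | head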

Create the systemd unit file:

[Unit]
Description=https://prometheus.io
After=network.target
After=mysqld.service

[Service]
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target
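
Assuming the unit file is saved as /etc/systemd/system/mysql_exporter.service (the path is an assumption; the file name just has to match the service name used below), reload systemd and enable the service on boot:

systemctl daemon-reload
systemctl enable mysql_exporter.service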

Start the service:

systemctl start mysql_exporter.service
systemctl status mysql_exporter.service

Check the metrics/targets:

http://10.65.104.112:31880/targets

Inside k8s:

Write the resource file:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: mysql
  name: mysql
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: http-metrics
  jobLabel: mysql
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: mysql
---
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: kube-system
  labels:
    k8s-app: mysql
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 9104
    targetPort: 9104
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: mysql
  name: mysql
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.3.149.85
  ports:
  - name: http-metrics
    port: 9104
    protocol: TCP

Apply the resources to the k8s cluster:

kubectl apply -f mysql-export.yaml

Check the following page in prometheus: mysql should now appear in the target list (it may take a while to show up).

http://10.65.104.112:31880/targets

Import the following dashboard into grafana: afterwards the MySQL Overview dashboard is available.

https://grafana.com/grafana/dashboards/7362


If the prometheus server is a standalone service that does not run inside k8s, just add the following to its configuration file:

scrape_configs:
  # add a job and name it
  - job_name: 'mysql'
    # statically configured targets
    static_configs:
    # the endpoint to scrape
    - targets: ['10.3.149.85:9104']