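Goals
Previous article: IoT edge cluster alert notifications based on Kubernetes Events
- Alert recovery notifications: evaluated and found not feasible.
  Reason: an alert and its recovery are separate, unrelated events. Alerts are Warning-level events and recoveries are Normal-level events, so enabling recovery notifications would mean sending every Normal event, and the volume would be overwhelming. Moreover, without a lot of experience and patience, it is hard to tell which event is the recovery for which alert.
- Continued alerting while an issue is unresolved: available by default, no extra configuration needed.
- Show resource names in the alert content, such as node and Pod names.
- Allow specific nodes and workloads to be silenced, with the ability to adjust this dynamically. For example, when a node in the cluster undergoes planned maintenance, monitoring is paused for the duration and resumed once the maintenance is complete.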
Configuration
Showing resource names in the alert content
A few typical events:

apiVersion: v1
count: 101557
eventTime: null
firstTimestamp: "2022-04-08T03:50:47Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{prometheus}
  kind: Pod
  name: prometheus-rancher-monitoring-prometheus-0
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:39:19Z"
message: 'Readiness probe failed: Get "http://10.42.0.87:9090/-/ready": context deadline
  exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-08T03:51:17Z"
  name: prometheus-rancher-monitoring-prometheus-0.16e3cf53f0793344
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning

apiVersion: v1
count: 116
eventTime: null
firstTimestamp: "2022-04-13T02:43:26Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{grafana}
  kind: Pod
  name: rancher-monitoring-grafana-57777cc795-2b2x5
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:18:56Z"
message: 'Readiness probe failed: Get "http://10.42.0.90:3000/api/health": context
  deadline exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-14T11:18:57Z"
  name: rancher-monitoring-grafana-57777cc795-2b2x5.16e5548dd2523a13
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning

apiVersion: v1
count: 20958
eventTime: null
firstTimestamp: "2022-04-11T10:34:51Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-1883}
  kind: Pod
  name: svclb-emqx-dt22t
  namespace: emqx
kind: Event
lastTimestamp: "2022-04-14T11:39:48Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:51Z"
  name: svclb-emqx-dt22t.16e4d11e2b9efd27
  namespace: emqx
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning

apiVersion: v1
count: 21069
eventTime: null
firstTimestamp: "2022-04-11T10:34:48Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-80}
  kind: Pod
  name: svclb-traefik-r5p8t
  namespace: kube-system
kind: Event
lastTimestamp: "2022-04-14T11:44:59Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:48Z"
  name: svclb-traefik-r5p8t.16e4d11daf0b79ce
  namespace: kube-system
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning

{
  "metadata": {
    "name": "event-exporter-79544df9f7-xj4t5.16e5c540dc32614f",
    "namespace": "monitoring",
    "uid": "baf2f642-2383-4e22-87e0-456b6c3eaf4e",
    "resourceVersion": "14043444",
    "creationTimestamp": "2022-04-14T13:08:40Z"
  },
  "reason": "Pulled",
  "message": "Container image \"ghcr.io/opsgenie/kubernetes-event-exporter:v0.11\" already present on machine",
  "source": {
    "component": "kubelet",
    "host": "worker-2"
  },
  "firstTimestamp": "2022-04-14T13:08:40Z",
  "lastTimestamp": "2022-04-14T13:08:40Z",
  "count": 1,
  "type": "Normal",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": "",
  "involvedObject": {
    "kind": "Pod",
    "namespace": "monitoring",
    "name": "event-exporter-79544df9f7-xj4t5",
    "uid": "b77d3e13-fa9e-484b-8a5a-d1afc9edec75",
    "apiVersion": "v1",
    "resourceVersion": "14043435",
    "fieldPath": "spec.containers{event-exporter}",
    "labels": {
      "app": "event-exporter",
      "pod-template-hash": "79544df9f7",
      "version": "v1"
    }
  }
}

We can add more fields to the alert message, including:
- Node: {{ .Source.Host }}
- Pod: {{ .InvolvedObject.Name }}
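
To make it concrete where these two values sit inside an event, here is a minimal Python sketch that reads a trimmed copy of the JSON event shown earlier and pulls out the node and the Pod name. The helper name resource_names is made up for illustration; in the exporter these values come from the template fields, not from Python code.

```python
import json

# Trimmed copy of the JSON event shown earlier (only the fields used here).
EVENT_JSON = """
{
  "reason": "Pulled",
  "type": "Normal",
  "source": {"component": "kubelet", "host": "worker-2"},
  "involvedObject": {
    "kind": "Pod",
    "namespace": "monitoring",
    "name": "event-exporter-79544df9f7-xj4t5"
  }
}
"""


def resource_names(event: dict) -> dict:
    """Return the node and the involved object's kind and name.

    Hypothetical helper: fields that are missing come back as empty strings.
    """
    source = event.get("source") or {}
    obj = event.get("involvedObject") or {}
    return {
        "node": source.get("host", ""),   # corresponds to .Source.Host
        "kind": obj.get("kind", ""),      # corresponds to .InvolvedObject.Kind
        "name": obj.get("name", ""),      # corresponds to .InvolvedObject.Name
    }


if __name__ == "__main__":
    print(resource_names(json.loads(EVENT_JSON)))
```

The name is only a Pod name when the involved object's kind is Pod, which is why the sketch returns the kind alongside it.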

Putting this together, the modified YAML is:

apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
  resourceVersion: '5779968'
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx测试K3S集群告警
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:** {{ .UID }}\n**EventNamespace:** {{ .InvolvedObject.Namespace }}\n**EventName:** {{ .InvolvedObject.Name }}\n**EventType:** {{ .Type }}\n**EventKind:** {{ .InvolvedObject.Kind }}\n**EventReason:** {{ .Reason }}\n**EventTime:** {{ .LastTimestamp }}\n**EventMessage:** {{ .Message }}\n**EventComponent:** {{ .Source.Component }}\n**EventHost:** {{ .Source.Host }}\n**EventLabels:** {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:** {{ toJson .InvolvedObject.Annotations}}"
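
To preview roughly what body the webhook receives for one event, the following Python sketch builds a dictionary with the same shape as the layout section above. It is an illustration only, not how the exporter renders its templates: the helper build_card, the placeholder title, and the reduced set of fields are assumptions of this sketch, and nothing is sent over the network.

```python
import json

# Sample event with values taken from the BackOff event shown earlier.
SAMPLE_EVENT = {
    "type": "Warning",
    "reason": "BackOff",
    "message": "Back-off restarting failed container",
    "lastTimestamp": "2022-04-14T11:39:48Z",
    "source": {"component": "kubelet", "host": "worker-1"},
    "involvedObject": {"kind": "Pod", "namespace": "emqx", "name": "svclb-emqx-dt22t"},
}


def build_card(event: dict, title: str = "K3S cluster alert") -> dict:
    """Build a card body shaped like the `layout` section (hypothetical helper)."""
    obj = event.get("involvedObject", {})
    src = event.get("source", {})
    lines = [
        f"**EventNamespace:** {obj.get('namespace', '')}",
        f"**EventName:** {obj.get('name', '')}",
        f"**EventType:** {event.get('type', '')}",
        f"**EventKind:** {obj.get('kind', '')}",
        f"**EventReason:** {event.get('reason', '')}",
        f"**EventTime:** {event.get('lastTimestamp', '')}",
        f"**EventMessage:** {event.get('message', '')}",
        f"**EventComponent:** {src.get('component', '')}",
        f"**EventHost:** {src.get('host', '')}",
    ]
    return {
        "msg_type": "interactive",
        "card": {
            "config": {"wide_screen_mode": True, "enable_forward": True},
            "header": {
                "title": {"tag": "plain_text", "content": title},
                "template": "red",
            },
            "elements": [
                {"tag": "div", "text": {"tag": "lark_md", "content": "\n".join(lines)}}
            ],
        },
    }


if __name__ == "__main__":
    # Print the JSON body; posting it to the webhook endpoint is left out on purpose.
    print(json.dumps(build_card(SAMPLE_EVENT), indent=2, ensure_ascii=False))
```

Printing the body before wiring it to a real endpoint makes it easy to check that the node (EventHost) and the resource name (EventName) show up where expected.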
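
Silencing specific nodes and workloads
For example, when a node in the cluster undergoes planned maintenance, monitoring is paused for the duration and resumed once the maintenance is complete.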
Continue modifying the YAML as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
            - source:
                host: "worker-1"
            - namespace: "cattle-monitoring-system"
            - name: "*emqx*"
            - kind: "Pod|Deployment|ReplicaSet"
            - labels:
                version: "dev"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx测试K3S集群告警
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:** {{ .UID }}\n**EventNamespace:** {{ .InvolvedObject.Namespace }}\n**EventName:** {{ .InvolvedObject.Name }}\n**EventType:** {{ .Type }}\n**EventKind:** {{ .InvolvedObject.Kind }}\n**EventReason:** {{ .Reason }}\n**EventTime:** {{ .LastTimestamp }}\n**EventMessage:** {{ .Message }}\n**EventComponent:** {{ .Source.Component }}\n**EventHost:** {{ .Source.Host }}\n**EventLabels:** {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:** {{ toJson .InvolvedObject.Annotations}}"

The default drop rule is - type: "Normal", meaning that Normal-level events do not trigger alerts.
Now add the following rules:

- source:
    host: "worker-1"
- namespace: "cattle-monitoring-system"
- name: "*emqx*"
- kind: "Pod|Deployment|ReplicaSet"
- labels:
    version: "dev"
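
- Rule source.host "worker-1": do not alert on the node worker-1.
- Rule namespace "cattle-monitoring-system": do not alert on the namespace cattle-monitoring-system.
- Rule name "*emqx*": do not alert on objects whose name (usually a Pod name) contains emqx.
- Rule kind "Pod|Deployment|ReplicaSet": do not alert on Pods, Deployments, or ReplicaSets (that is, ignore application- and component-level alerts).
- Rule labels version "dev": do not alert on objects carrying the label version: "dev" (this can be used to silence alerts for a specific application).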
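
The effect of such a drop list can be modelled in a few lines of Python: an event is dropped as soon as any one rule matches it. This is a simplified sketch, not the exporter's code. It assumes every rule value is a regular expression that must match the whole field, it flattens source.host to a plain host key, and it writes the name rule as .*emqx.* because Python's re module rejects a pattern that begins with "*". The function names are hypothetical.

```python
import re

# Drop rules modelled on the list above (assumption: values are regexes).
DROP_RULES = [
    {"type": "Normal"},
    {"host": "worker-1"},
    {"namespace": "cattle-monitoring-system"},
    {"name": ".*emqx.*"},
    {"kind": "Pod|Deployment|ReplicaSet"},
    {"labels": {"version": "dev"}},
]


def flatten(event: dict) -> dict:
    """Pick the fields the rules refer to out of a raw event dict."""
    obj = event.get("involvedObject", {})
    return {
        "type": event.get("type", ""),
        "host": event.get("source", {}).get("host", ""),
        "namespace": obj.get("namespace", ""),
        "name": obj.get("name", ""),
        "kind": obj.get("kind", ""),
        "labels": obj.get("labels", {}),
    }


def rule_matches(rule: dict, fields: dict) -> bool:
    """A rule matches only if every key it sets matches the event."""
    for key, pattern in rule.items():
        if key == "labels":
            labels = fields["labels"]
            for label, label_pattern in pattern.items():
                if label not in labels or not re.fullmatch(label_pattern, labels[label]):
                    return False
        elif not re.fullmatch(pattern, fields[key]):
            return False
    return True


def should_drop(event: dict, rules=DROP_RULES) -> bool:
    """Drop the event if any single rule matches it."""
    fields = flatten(event)
    return any(rule_matches(rule, fields) for rule in rules)


if __name__ == "__main__":
    node_event = {
        "type": "Warning",
        "source": {"host": "master-1"},
        "involvedObject": {"kind": "Node", "name": "master-1"},
    }
    pod_event = {
        "type": "Warning",
        "source": {"host": "master-1"},
        "involvedObject": {"kind": "Pod", "namespace": "emqx", "name": "svclb-emqx-dt22t"},
    }
    print(should_drop(node_event), should_drop(pod_event))
```

Because the rules are combined with "any", each added entry widens what is silenced; removing the host entry after maintenance restores alerts for that node without touching the other rules.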
Final result
As shown in the figure below:
Source: https://www.jb51.net/article/275636.htm