1. Monitoring host metrics

This article shows how to use Prometheus to monitor basic host data such as CPU, disk, memory, and load. The setup is currently running in production. It uses node_exporter 0.18.1 on CentOS 7.x. Before using the rules, change the value in job="gtdwznodeexporter" to the job name configured in your own Prometheus.

2. Prometheus configuration

Add the following scrape job to the prometheus.yml configuration file:

  # gtdwz
  - job_name: 'gtdwznodeexporter'
    static_configs:
      - targets: ['10.1.5.123:9100', '10.1.5.124:9100', '10.1.5.125:9100', '10.1.5.126:9100']
        labels:
          service: gtdwzmonitor

3. Rules file with PromQL conditions

Every rule below uses reReplaceAll ":(.*)" "" on the instance label to strip the port, so that alerthost and the messages show only the host IP.

[root@gtcqgtmonitorprometheus01 rules]# more gtdwzmonitor.rules
groups:
- name: dwzgtmonitor
  rules:
  - alert: nodeAgent alert
    expr: up{job="gtdwznodeexporter"} == 0
    for: 120s
    labels:
      severity: major
      team: dwzgtmonitor
      alerttype: Agent alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ $labels.instance }} has stopped collecting monitoring data for 120s!'
      description: '{{ $labels.instance }} of job {{ $labels.job }} has stopped exposing monitoring data.'
  - alert: CPU usage monitoring
    expr: ceil(100 - sum(increase(node_cpu_seconds_total{job="gtdwznodeexporter",mode="idle"}[5m])) by(instance) / sum(increase(node_cpu_seconds_total{job="gtdwznodeexporter"}[5m])) by(instance) * 100) > 80
    for: 2m
    labels:
      severity: major
      team: bdfb
      alerttype: CPU alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ reReplaceAll ":(.*)" "" $labels.instance }} CPU usage is too high'
      description: 'IP: {{ reReplaceAll ":(.*)" "" $labels.instance }} CPU usage is above 80% (current value: {{ $value }})'
  - alert: Disk usage monitoring
    expr: round((1 - (node_filesystem_avail_bytes{fstype=~"ext3|ext4|xfs|nfs",job="gtdwznodeexporter"} / node_filesystem_size_bytes{fstype=~"ext3|ext4|xfs|nfs",job="gtdwznodeexporter"})) * 100) > 80
    for: 2m
    labels:
      severity: major
      team: dwzgtmonitor
      alerttype: Disk alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ reReplaceAll ":(.*)" "" $labels.instance }}: partition {{ $labels.mountpoint }} usage is too high'
      description: '{{ reReplaceAll ":(.*)" "" $labels.instance }} partition {{ $labels.mountpoint }} usage is above 80% (current value: {{ $value }})'
  - alert: Memory usage monitoring
    expr: ceil((1 - (node_memory_MemAvailable_bytes{job="gtdwznodeexporter"} / (node_memory_MemTotal_bytes{job="gtdwznodeexporter"}))) * 100) > 80
    for: 2m
    labels:
      severity: major
      team: dwzgtmonitor
      alerttype: MEM alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ reReplaceAll ":(.*)" "" $labels.instance }} memory usage is too high'
      description: '{{ reReplaceAll ":(.*)" "" $labels.instance }} memory usage is above 80% (current value: {{ $value }})'
  - alert: Server CPU Load5
    expr: node_load5{job="gtdwznodeexporter"} > 100
    for: 2m
    labels:
      severity: major
      team: dwzgtmonitor
      alerttype: Load alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ reReplaceAll ":(.*)" "" $labels.instance }} CPU load is too high'
      description: '{{ reReplaceAll ":(.*)" "" $labels.instance }} CPU load (load5) is above 100 (current value: {{ $value }})'
  - alert: Server file handle monitoring
    expr: node_filefd_allocated{job="gtdwznodeexporter"} > 50000
    for: 2m
    labels:
      severity: major
      team: dwzgtmonitor
      alerttype: File handle alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ reReplaceAll ":(.*)" "" $labels.instance }} file handle usage is too high'
      description: '{{ reReplaceAll ":(.*)" "" $labels.instance }} file handle usage is above 50000 (current value: {{ $value }})'
  - alert: Server TCP connection monitoring
    expr: node_sockstat_TCP_tw{job="gtdwznodeexporter"} > 15000
    for: 2m
    labels:
      severity: major
      team: dwzgtmonitor
      alerttype: TCP connection alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ reReplaceAll ":(.*)" "" $labels.instance }} has too many TCP connections waiting to close'
      description: '{{ reReplaceAll ":(.*)" "" $labels.instance }} has more than 15000 TCP connections in TIME_WAIT (current value: {{ $value }})'
  - alert: Server inbound traffic monitoring
    expr: round((sum by(instance) (irate(node_network_receive_bytes_total{job="gtdwznodeexporter",device!~"tap.*|veth.*|br.*|docker.*|virbr*|lo*"}[5m]))) / 1024 / 1024) > 50
    for: 2m
    labels:
      severity: major
      team: dwzgtmonitor
      alerttype: Traffic alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ reReplaceAll ":(.*)" "" $labels.instance }} inbound traffic is too high'
      description: '{{ reReplaceAll ":(.*)" "" $labels.instance }} inbound traffic is above 50 MB/s (alert value: {{ $value }} MB/s)'
  - alert: Server outbound traffic monitoring
    expr: round((sum by(instance) (irate(node_network_transmit_bytes_total{job="gtdwznodeexporter",device!~"tap.*|veth.*|br.*|docker.*|virbr*|lo*"}[5m]))) / 1024 / 1024) > 50
    for: 2m
    labels:
      severity: major
      team: dwzgtmonitor
      alerttype: Traffic alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ reReplaceAll ":(.*)" "" $labels.instance }} outbound traffic is too high'
      description: '{{ reReplaceAll ":(.*)" "" $labels.instance }} outbound traffic is above 50 MB/s (alert value: {{ $value }} MB/s)'
[root@gtcqgtmonitorprometheus01 rules]#

4. Testing the alerts

To trigger an alert on purpose, lower the disk threshold from 80 to 10 as follows:

  - alert: Disk usage monitoring
    expr: round((1 - (node_filesystem_avail_bytes{fstype=~"ext3|ext4|xfs|nfs",job="gtdwznodeexporter"} / node_filesystem_size_bytes{fstype=~"ext3|ext4|xfs|nfs",job="gtdwznodeexporter"})) * 100) > 10
    for: 2m
    labels:
      severity: major
      team: dwzgtmonitor
      alerttype: Disk alert
      alerthost: '{{ reReplaceAll ":(.*)" "" $labels.instance }}'
    annotations:
      summary: '{{ reReplaceAll ":(.*)" "" $labels.instance }}: partition {{ $labels.mountpoint }} usage is too high'
      description: '{{ reReplaceAll ":(.*)" "" $labels.instance }} partition {{ $labels.mountpoint }} usage is above 10% (current value: {{ $value }})'
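
For the lowered threshold to take effect, Prometheus has to load the rules file and then be reloaded. The article does not show that part of prometheus.yml, so the fragment below is only a minimal sketch. The rules directory path is an assumption, not taken from the original setup.

```yaml
# prometheus.yml (fragment)
# Assumption: gtdwzmonitor.rules sits in a "rules" directory next to
# prometheus.yml; adjust the path to match your installation.
rule_files:
  - "rules/gtdwzmonitor.rules"
```

After editing, the file can be checked with `promtool check rules rules/gtdwzmonitor.rules`. Prometheus can then be reloaded, for example by sending SIGHUP to the process. The disk alert should fire once the condition has held for the 2m set in `for`.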