
10 - Service Discovery

10.1 Overview

Service discovery is one of Prometheus's core capabilities: it finds and tracks scrape targets dynamically, so the target list does not have to be maintained by hand.

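Service discovery types

| Type | Config key | Typical use |
|------|------------|-------------|
| Static | static_configs | Fixed servers, small environments |
| File | file_sd_configs | Targets generated by custom scripts |
| Kubernetes | kubernetes_sd_configs | Kubernetes clusters |
| Consul | consul_sd_configs | Microservice registries |
| DNS | dns_sd_configs | Targets resolved through DNS |
| EC2 | ec2_sd_configs | AWS |
| GCE | gce_sd_configs | Google Cloud |
| Azure | azure_sd_configs | Azure |
| Marathon | marathon_sd_configs | DC/OS |
| Eureka | eureka_sd_configs | Spring Cloud |
| Triton | triton_sd_configs | Joyent Triton |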

10.2 Static Configuration

The simplest approach: list the target addresses directly in the configuration.

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets:
          - 'host1:8080'
          - 'host2:8080'
          - 'host3:8080'
        labels:
          env: production
          team: backend

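When to use

  • Development and test environments
  • A small, fixed set of servers
  • No service registry in use

Drawbacks

  • The target list has to be maintained by hand
  • Adding or removing a target means editing the configuration and reloading Prometheus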

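10.3 File-Based Discovery (File SD)

Targets are defined in files. Prometheus watches those files and applies changes automatically, with no configuration reload; refresh_interval only controls a periodic re-read that acts as a fallback.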

scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m    # default: 5m

JSON format

[
  {
    "targets": ["web1:8080", "web2:8080"],
    "labels": {
      "env": "production",
      "service": "api",
      "team": "backend"
    }
  },
  {
    "targets": ["db1:9104"],
    "labels": {
      "env": "production",
      "service": "mysql"
    }
  }
]

YAML format

- targets:
    - web1:8080
    - web2:8080
  labels:
    env: production
    service: api

- targets:
    - db1:9104
  labels:
    env: production
    service: mysql

Generating targets with a script

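#!/bin/bash
# generate_targets.sh - build the target list from a CMDB
set -euo pipefail   # abort on errors so a failed fetch never replaces the live file

OUTPUT="/etc/prometheus/targets/generated.json"

# Write to a temporary file first so Prometheus never reads a half-written file.
curl -fsS "http://cmdb.internal/api/hosts?service=api" | \
  jq '[.[] | {targets: [.hostname + ":8080"], labels: {env: .env, team: .team}}]' \
  > "${OUTPUT}.tmp"

# Renaming within one filesystem is atomic.
mv "${OUTPUT}.tmp" "${OUTPUT}"

# Crontab entry (not part of the script): run every 5 minutes
*/5 * * * * /opt/scripts/generate_targets.sh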
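
The same write-then-rename pattern can be sketched in Python. This is an illustration under assumptions, not part of the original script: the record fields (`hostname`, `env`, `team`) mirror the jq filter above, and the function names `build_target_groups` and `write_atomically` are hypothetical.

```python
import json
import os
import tempfile


def build_target_groups(hosts, port=8080):
    """Turn CMDB host records into file_sd target groups (one group per host)."""
    return [
        {
            "targets": [f"{host['hostname']}:{port}"],
            "labels": {"env": host["env"], "team": host["team"]},
        }
        for host in hosts
    ]


def write_atomically(path, target_groups):
    """Write JSON to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(path) or "."
    # The temp file must be on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(target_groups, tmp_file, indent=2)
        # mkstemp creates the file with mode 0600; make it readable for the
        # account that runs Prometheus (assumed to differ from the writer).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Sample records standing in for the CMDB response (assumed shape).
    sample_hosts = [
        {"hostname": "web1", "env": "production", "team": "backend"},
        {"hostname": "web2", "env": "production", "team": "backend"},
    ]
    out_path = os.path.join(tempfile.mkdtemp(), "generated.json")
    write_atomically(out_path, build_target_groups(sample_hosts))
    with open(out_path) as result_file:
        print(result_file.read())
```

The `.tmp` suffix matters: it keeps the intermediate file outside the `*.json` and `*.yml` globs, so Prometheus only ever sees the finished file.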

10.4 Kubernetes Service Discovery

Kubernetes is one of the most commonly used service discovery mechanisms for Prometheus.

Roles

| Role | Description | Meta labels |
|------|-------------|-------------|
| node | Node | __meta_kubernetes_node_name, __meta_kubernetes_node_label_* |
| pod | Pod | __meta_kubernetes_pod_name, __meta_kubernetes_pod_label_*, __meta_kubernetes_pod_annotation_* |
| service | Service | __meta_kubernetes_service_name, __meta_kubernetes_service_label_* |
| endpoints | Endpoints | Includes the Pod IP and port |
| endpointslice | EndpointSlice | Recommended on Kubernetes 1.21+ |
| ingress | Ingress | __meta_kubernetes_ingress_name |

Pod discovery

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    
    relabel_configs:
      # Scrape only Pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      
      # Use a custom metrics path when one is annotated
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      
      # Use a custom port when one is annotated
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      
      # Copy namespace, Pod name and app label onto the target
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
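
To make the port-rewriting rule above concrete, the following Python sketch imitates a single `replace` relabel step. It is a simplified model, not Prometheus code: the function name `relabel_replace` is hypothetical, and Python's `re` stands in for the RE2 engine Prometheus uses, which behaves the same for this pattern.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Apply one `replace` relabel rule to a label dict and return a new dict."""
    # Source label values are joined with the separator (";" by default);
    # labels that are absent count as empty strings.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: the labels stay as they are
    # Translate $1 / ${1} references into Python's \g<1> syntax before expanding.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


ADDRESS_RULE = dict(
    source_labels=["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    regex=r"([^:]+)(?::\d+)?;(\d+)",
    replacement="$1:$2",
    target_label="__address__",
)

if __name__ == "__main__":
    pod = {
        "__address__": "10.0.0.5:9090",
        "__meta_kubernetes_pod_annotation_prometheus_io_port": "8080",
    }
    print(relabel_replace(pod, **ADDRESS_RULE)["__address__"])  # prints 10.0.0.5:8080
```

When the port annotation is missing, the joined value ends in `;` with no digits after it, the regex does not match, and `__address__` keeps the port discovered from the Pod spec.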

Pod annotation conventions

# Add the annotations to the Pod template of a Deployment (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"      # enable scraping
        prometheus.io/port: "8080"        # metrics port
        prometheus.io/path: "/metrics"    # metrics path
    spec:
      containers:
        - name: my-app
          ports:
            - containerPort: 8080

Service discovery (role: service)

scrape_configs:
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
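      # Replace the port in __address__ with the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__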

Node discovery

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    
    relabel_configs:
      # Scrape port 9100 (Node Exporter) on the node address
      - source_labels: [__address__]
        regex: '(.+):(\d+)'
        target_label: __address__
        replacement: '${1}:9100'
      
      - source_labels: [__meta_kubernetes_node_name]
        target_label: node

Ingress discovery

scrape_configs:
  - job_name: 'kubernetes-ingresses'
    kubernetes_sd_configs:
      - role: ingress
    
    relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Using EndpointSlice (recommended)

scrape_configs:
  - job_name: 'kubernetes-endpointslices'
    kubernetes_sd_configs:
      - role: endpointslice
    
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpointslice_annotation_prometheus_io_scrape]
        action: keep
        regex: true

10.5 Consul Service Discovery

Consul is HashiCorp's service mesh and service discovery tool.

Basic configuration

scrape_configs:
  - job_name: 'consul'
    consul_sd_configs:
      - server: 'consul.internal:8500'
        tags:
          - 'prometheus'      # discover only services that carry this tag
        services: []          # empty list = all services
    
    relabel_configs:
      # Use the Consul service name as the job label
      - source_labels: [__meta_consul_service]
        target_label: job
      
      # Use the Consul node name as the instance label
      - source_labels: [__meta_consul_node]
        target_label: instance
      
      # Add a datacenter label
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
      
      # Take the metrics path from a service tag of the form prometheus-path=<path>
      - source_labels: [__meta_consul_tags]
        regex: ',(?:[^,]+,)*prometheus-path=([^,]+),.*'
        target_label: __metrics_path__
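
The tag regex in the last rule is easier to read once the shape of `__meta_consul_tags` is clear: the tags are joined by the tag separator (a comma by default) and the result is wrapped in that separator on both sides. The sketch below reproduces that in Python; the helper names are hypothetical, and the `/metrics` fallback stands for the scrape config default that applies when the rule does not match.

```python
import re

# Same pattern as the relabel rule above.
TAG_PATH_REGEX = r",(?:[^,]+,)*prometheus-path=([^,]+),.*"


def consul_tags_label(tags, separator=","):
    """Build the __meta_consul_tags value: tags joined and wrapped by the separator."""
    return separator + separator.join(tags) + separator


def metrics_path_from_tags(tags, default="/metrics"):
    """Return the path carried by a prometheus-path=<path> tag, else the default."""
    match = re.fullmatch(TAG_PATH_REGEX, consul_tags_label(tags))
    return match.group(1) if match else default


if __name__ == "__main__":
    tags = ["prometheus", "prometheus-path=/admin/metrics"]
    print(consul_tags_label(tags))       # prints ,prometheus,prometheus-path=/admin/metrics,
    print(metrics_path_from_tags(tags))  # prints /admin/metrics
```

Because of the surrounding separators, the pattern can require a comma before and after the tag, which avoids matching a tag that merely ends with the same text.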

Consul service registration

{
  "service": {
    "name": "api-service",
    "port": 8080,
    "tags": ["prometheus"],
    "meta": {
      "prometheus_path": "/metrics",
      "prometheus_port": "8080"
    },
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s"
    }
  }
}

10.6 DNS Service Discovery

Targets are discovered through DNS SRV records or A records.

SRV records

scrape_configs:
  - job_name: 'dns-srv'
    dns_sd_configs:
      - names:
          - '_prometheus._tcp.example.com'
        type: SRV
        refresh_interval: 30s

# Example DNS SRV records
_prometheus._tcp.example.com. IN SRV 10 60 9100 node1.example.com.
_prometheus._tcp.example.com. IN SRV 10 60 9100 node2.example.com.
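
As a rough illustration of what these records turn into, the sketch below converts zone-file style SRV lines into `host:port` targets. It is a simplified text parser, not a DNS client: the function name is hypothetical, lines are assumed to have no TTL field (as in the example above), and priority and weight are ignored in this sketch.

```python
def srv_records_to_targets(zone_lines):
    """Convert zone-file style SRV lines into host:port scrape targets."""
    targets = []
    for line in zone_lines:
        fields = line.split()
        # Expected layout: <name> IN SRV <priority> <weight> <port> <target>
        if len(fields) != 7 or fields[1:3] != ["IN", "SRV"]:
            continue  # skip comments, blank lines and other record types
        port, host = fields[5], fields[6]
        # Drop the trailing dot of the fully qualified name.
        targets.append(f"{host.rstrip('.')}:{port}")
    return targets


if __name__ == "__main__":
    records = [
        "# Example DNS SRV records",
        "_prometheus._tcp.example.com. IN SRV 10 60 9100 node1.example.com.",
        "_prometheus._tcp.example.com. IN SRV 10 60 9100 node2.example.com.",
    ]
    for target in srv_records_to_targets(records):
        print(target)
```

With SRV records the port comes from the record itself, which is why the `dns_sd_configs` entry above needs no `port` setting, unlike the A record variant.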

A records

scrape_configs:
  - job_name: 'dns-a'
    dns_sd_configs:
      - names:
          - 'nodes.example.com'
        type: A
        port: 9100
        refresh_interval: 30s

10.7 EC2 Service Discovery

scrape_configs:
  - job_name: 'ec2'
    ec2_sd_configs:
      - region: 'us-east-1'
        access_key: '<access_key>'
        secret_key: '<secret_key>'
        port: 9100
        filters:
          - name: 'tag:Environment'
            values: ['production']
          - name: 'instance-state-name'
            values: ['running']
    
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: env
      - source_labels: [__meta_ec2_tag_Team]
        target_label: team

10.8 Advanced Relabeling

Filtering on meta labels

relabel_configs:
  # Keep only targets in the production namespace
  - source_labels: [__meta_kubernetes_namespace]
    action: keep
    regex: 'production'
  
  # Drop debug Pods
  - source_labels: [__meta_kubernetes_pod_name]
    action: drop
    regex: '.*-debug.*'
  
  # Keep only Pods that carry the scrape annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: 'true'
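
The three rules above run in order, and a target has to survive all of them. A small Python model of that evaluation is sketched below; `passes_filters` is a hypothetical helper, and Python's `re.fullmatch` stands in for Prometheus's anchored regex matching.

```python
import re

RULES = [
    {"source_labels": ["__meta_kubernetes_namespace"], "action": "keep", "regex": "production"},
    {"source_labels": ["__meta_kubernetes_pod_name"], "action": "drop", "regex": ".*-debug.*"},
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
]


def passes_filters(labels, rules):
    """Return True if the target survives every keep/drop rule, applied in order."""
    for rule in rules:
        # Missing labels count as empty strings; multiple sources are joined with ";".
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        matched = re.fullmatch(rule["regex"], value) is not None
        if rule["action"] == "keep" and not matched:
            return False  # keep: discard targets that do NOT match
        if rule["action"] == "drop" and matched:
            return False  # drop: discard targets that DO match
    return True


if __name__ == "__main__":
    pod = {
        "__meta_kubernetes_namespace": "production",
        "__meta_kubernetes_pod_name": "api-7d9f",
        "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    }
    print(passes_filters(pod, RULES))  # prints True
```

Because the regex is anchored, `production` does not match a namespace such as `production-eu`; a pattern like `production.*` would be needed for that.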

Label mapping

relabel_configs:
  # Map every Kubernetes Pod label to a target label
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  
  # Map only selected Service labels
  - action: labelmap
    regex: __meta_kubernetes_service_label_(app|version)
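
Unlike `replace`, `labelmap` matches the regex against label names rather than values, and copies each matching label to a new name. The sketch below models that behaviour; the function name `labelmap` is only an illustration, and the default replacement `$1` is assumed as in the rules above.

```python
import re


def labelmap(labels, regex, replacement="$1"):
    """Copy every label whose NAME fully matches regex to a name built from replacement."""
    # Translate $1 / ${1} references into Python's \g<1> syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    for name, value in labels.items():
        match = re.fullmatch(regex, name)
        if match:
            result[match.expand(template)] = value  # the value is copied unchanged
    return result


if __name__ == "__main__":
    discovered = {
        "__address__": "10.0.0.5:8080",
        "__meta_kubernetes_pod_label_app": "api",
        "__meta_kubernetes_pod_label_tier": "web",
    }
    mapped = labelmap(discovered, r"__meta_kubernetes_pod_label_(.+)")
    print(mapped["app"], mapped["tier"])  # prints api web
```

The original `__meta_*` labels are still present after the mapping; Prometheus removes labels starting with `__` once relabeling has finished.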

Hash-based sharding

# Shard targets across multiple Prometheus instances
relabel_configs:
  - source_labels: [__address__]
    modulus: 3          # split into 3 shards
    target_label: __tmp_shard
    action: hashmod
  - source_labels: [__tmp_shard]
    regex: 0            # this instance handles shard 0 only
    action: keep
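
The effect of `hashmod` followed by `keep` can be simulated offline to see how targets spread across instances. The sketch below assumes the hash is the MD5 digest of the joined source label values, with its last 8 bytes read as a big-endian integer; treat that detail as an assumption about Prometheus internals. The property the sharding relies on is only that the hash is deterministic, so every instance computes the same shard for a given address.

```python
import hashlib


def hashmod(value, modulus):
    """Shard number for a label value: MD5 digest, last 8 bytes as big-endian int, mod N."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(addresses, shard, modulus):
    """Targets that an instance configured with `regex: <shard>` would keep."""
    return [address for address in addresses if hashmod(address, modulus) == shard]


if __name__ == "__main__":
    addresses = [f"10.0.0.{i}:9100" for i in range(1, 13)]
    for shard in range(3):
        kept = targets_for_shard(addresses, shard, 3)
        print(f"shard {shard}: {len(kept)} targets")
```

Each Prometheus instance uses the same `modulus` and a different `regex` value (0, 1 or 2 here), so every target is scraped by exactly one instance. The split is only roughly even, and changing `modulus` reassigns most targets.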

10.9 Multi-Environment Configuration

scrape_configs:
  # Production
  - job_name: 'k8s-prod'
    kubernetes_sd_configs:
      - role: pod
        api_server: 'https://k8s-prod.internal:6443'
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - target_label: environment
        replacement: production

  # Staging
  - job_name: 'k8s-staging'
    kubernetes_sd_configs:
      - role: pod
        api_server: 'https://k8s-staging.internal:6443'
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - target_label: environment
        replacement: staging

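10.10 Chapter Summary

| Discovery method | Typical use | Dynamic updates |
|------------------|-------------|-----------------|
| static_configs | Small environments, testing | ❌ Manual |
| file_sd_configs | Custom scripts | ✅ On file change |
| kubernetes_sd_configs | Kubernetes clusters | ✅ API-driven |
| consul_sd_configs | Consul-based microservices | ✅ Service registry |
| dns_sd_configs | DNS-based environments | ✅ DNS records |
| ec2_sd_configs | AWS | ✅ API-driven |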
