Advanced Observability: Alerting & ELK
Moving from passive visualization to proactive action: managing on-call rotations and logs at scale.
AlertManager: The Brain of Alerting
Why Prometheus Isn't Enough
Prometheus = collection + alert evaluation. AlertManager = deduplication + grouping + routing + silencing.

┌──────────────────────────────────────────────────────────────
│ WITHOUT ALERTMANAGER (chaos)
├──────────────────────────────────────────────────────────────
│ Datacenter down → 50 servers down
│
│ Prometheus sends:
│ ├── 50 "Server down" emails
│ ├── 50 SMS
│ └── 50 PagerDuty pages
│
│ On-call SRE:
│ └── 150 notifications in 1 minute
├──────────────────────────────────────────────────────────────
│ WITH ALERTMANAGER (intelligence)
├──────────────────────────────────────────────────────────────
│ Datacenter down → 50 servers down
│
│ AlertManager:
│ ├── Groups the 50 similar alerts
│ ├── Applies inhibition (datacenter down > servers down)
│ └── Sends a single notification:
│     "Datacenter Paris DOWN - 50 servers affected"
│
│ On-call SRE:
│ └── 1 clear, actionable notification
└──────────────────────────────────────────────────────────────
Key AlertManager features:
| Feature | Description | Value |
|---|---|---|
| Deduplication | Merges identical alerts | Avoids duplicates |
| Grouping | Groups similar alerts | 1 notification for N alerts |
| Routing | Sends to the right team/channel | Critical → PagerDuty, Warning → Slack |
| Inhibition | Suppresses derived alerts | Datacenter down > servers down |
| Silences | Temporarily mutes alerts | Planned maintenance |
| Repeat interval | Re-notifies while unresolved | Nothing gets forgotten |
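To make deduplication and grouping concrete, here is a minimal Python sketch of what `group_by: ['cluster', 'alertname']` does conceptually. This is not AlertManager code, just a simplified model where alerts sharing the same group key collapse into one notification group:

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("cluster", "alertname")):
    """Group firing alerts by the configured label set, mimicking
    AlertManager's group_by behaviour (simplified sketch)."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(lbl, "") for lbl in group_by)
        groups[key].append(alert)
    return groups

# 50 "ServerDown" alerts from the same cluster...
alerts = [
    {"labels": {"cluster": "paris", "alertname": "ServerDown", "instance": f"srv-{i}"}}
    for i in range(50)
]
groups = group_alerts(alerts)
# ...collapse into a single notification group of 50 alerts
print(len(groups), len(groups[("paris", "ServerDown")]))
```

One group means one notification, no matter how many individual alerts are firing inside it.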
AlertManager Architecture
flowchart LR
A[Prometheus] -->|Firing alerts| B[AlertManager]
B -->|Route: Critical| C[PagerDuty]
B -->|Route: Warning| D[Slack]
B -->|Route: Info| E[Email]
B -->|Grouping| F[Group by: cluster, alertname]
B -->|Inhibition| G[Suppress child alerts]
B -->|Silences| H[Mute during maintenance]
C -->|Phone call| I[SRE On-call]
D -->|Message| J[Dev Team Channel]
E -->|Email| K[Ops Team]
AlertManager Configuration
Structure of the alertmanager.yml file:
# ============================================================
# ALERTMANAGER CONFIGURATION
# ============================================================
global:
  # Default resolve timeout
  resolve_timeout: 5m
  # Global Slack config
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'

# ============================================================
# TEMPLATES (message customization)
# ============================================================
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# ============================================================
# ROUTE TREE (the decision tree)
# ============================================================
route:
  # Root route (catch-all)
  receiver: 'default-receiver'
  # Alert grouping
  group_by: ['cluster', 'alertname']
  # Wait 30s before sending (to allow grouping)
  group_wait: 30s
  # Wait 5 min before sending updates for an existing group
  group_interval: 5m
  # Re-notify every 4h while unresolved
  repeat_interval: 4h

  # Child routes (specific matching)
  routes:
    # ──────────────────────────────────────────────────────
    # ROUTE 1: CRITICAL alerts → PagerDuty
    # ──────────────────────────────────────────────────────
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s        # Send quickly (10s)
      repeat_interval: 1h    # Re-notify every hour
      routes:
        # Sub-route: database critical → DB team
        - match:
            team: database
          receiver: 'pagerduty-db-team'

    # ──────────────────────────────────────────────────────
    # ROUTE 2: WARNING alerts → Slack
    # ──────────────────────────────────────────────────────
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 1m
      repeat_interval: 12h

    # ──────────────────────────────────────────────────────
    # ROUTE 3: INFO alerts → Email (low priority)
    # ──────────────────────────────────────────────────────
    - match:
        severity: info
      receiver: 'email-ops'
      group_wait: 5m
      repeat_interval: 24h

    # ──────────────────────────────────────────────────────
    # ROUTE 4: DEV environment → Discord (no PagerDuty)
    # ──────────────────────────────────────────────────────
    - match:
        environment: dev
      receiver: 'discord-dev'

# ============================================================
# RECEIVERS (notification channels)
# ============================================================
receivers:
  # ──────────────────────────────────────────────────────
  # DEFAULT: general Slack channel
  # ──────────────────────────────────────────────────────
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
          *Severity:* {{ .CommonLabels.severity }}
          *Cluster:* {{ .CommonLabels.cluster }}

  # ──────────────────────────────────────────────────────
  # PAGERDUTY: critical alerts
  # ──────────────────────────────────────────────────────
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        description: '{{ .CommonAnnotations.summary }}'
        severity: '{{ .CommonLabels.severity }}'

  # ──────────────────────────────────────────────────────
  # PAGERDUTY: database team
  # ──────────────────────────────────────────────────────
  - name: 'pagerduty-db-team'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_DB_TEAM_KEY>'
        description: '[DB] {{ .CommonAnnotations.summary }}'

  # ──────────────────────────────────────────────────────
  # SLACK: warnings
  # ──────────────────────────────────────────────────────
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        icon_emoji: ':warning:'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Cluster:* {{ .CommonLabels.cluster }}
          *Instances:* {{ .Alerts | len }} affected

  # ──────────────────────────────────────────────────────
  # DISCORD: dev environment
  # ──────────────────────────────────────────────────────
  - name: 'discord-dev'
    webhook_configs:
      - url: 'https://discord.com/api/webhooks/XXXXXXXXXX/YYYYYYYYYYYYYYYYYY'
        send_resolved: true

  # ──────────────────────────────────────────────────────
  # EMAIL: ops team
  # ──────────────────────────────────────────────────────
  - name: 'email-ops'
    email_configs:
      - to: 'ops-team@company.com'
        from: 'alertmanager@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alertmanager@company.com'
        auth_password: '<SMTP_PASSWORD>'
        headers:
          Subject: '[AlertManager] {{ .GroupLabels.alertname }}'

# ============================================================
# INHIBITION RULES (suppressing derived alerts)
# ============================================================
inhibit_rules:
  # ──────────────────────────────────────────────────────
  # RULE 1: datacenter down > servers down
  # ──────────────────────────────────────────────────────
  - source_match:
      alertname: 'DatacenterDown'
      severity: 'critical'
    target_match:
      alertname: 'ServerDown'
    equal: ['datacenter']    # Same datacenter

  # ──────────────────────────────────────────────────────
  # RULE 2: node down > services down
  # ──────────────────────────────────────────────────────
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: '(ServiceDown|HighMemory|HighCPU)'
    equal: ['instance']

  # ──────────────────────────────────────────────────────
  # RULE 3: critical suppresses warning on the same instance
  # ──────────────────────────────────────────────────────
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance', 'alertname']
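The route tree above works depth-first with first-match-wins: an alert walks down the tree, and the deepest matching route decides the receiver. A minimal Python sketch of this matching logic (simplified: exact label matching only, no `continue` flag or regex matchers):

```python
def match_route(labels, route):
    """Return the receiver for an alert by walking a route tree
    depth-first, first matching child wins (simplified model)."""
    for child in route.get("routes", []):
        # A child matches if every label in its 'match' block is equal
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return match_route(labels, child)
    # No child matched: this node's receiver applies
    return route["receiver"]

route_tree = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "pagerduty-critical",
         "match": {"severity": "critical"},
         "routes": [{"receiver": "pagerduty-db-team",
                     "match": {"team": "database"}}]},
        {"receiver": "slack-warnings", "match": {"severity": "warning"}},
    ],
}

print(match_route({"severity": "critical", "team": "database"}, route_tree))
print(match_route({"severity": "warning"}, route_tree))
print(match_route({"severity": "info"}, route_tree))
```

A critical database alert lands on the deepest sub-route (`pagerduty-db-team`); anything unmatched falls back to the root receiver.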
Receiver Examples
Slack (the most common):
- name: 'slack-production'
  slack_configs:
    - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXX'
      channel: '#alerts-prod'
      username: 'AlertManager'
      icon_emoji: ':fire:'
      title: |-
        [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
      text: |-
        {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }}
        *Description:* {{ .Annotations.description }}
        *Severity:* `{{ .Labels.severity }}`
        *Instance:* `{{ .Labels.instance }}`
        {{ end }}
      send_resolved: true
Discord:
- name: 'discord-dev'
  webhook_configs:
    - url: 'https://discord.com/api/webhooks/123456789/ABCDEFGHIJKLMNOPQRSTUVWXYZ'
      send_resolved: true
      http_config:
        follow_redirects: true
Microsoft Teams:
- name: 'teams-ops'
  webhook_configs:
    - url: 'https://outlook.office.com/webhook/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx@xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/IncomingWebhook/yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz'
      send_resolved: true
Testing AlertManager
# Create a test alert
curl -X POST http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical",
      "instance": "localhost:9090"
    },
    "annotations": {
      "summary": "Test alert from curl",
      "description": "This is a test alert"
    }
  }
]'
# List active alerts
curl http://localhost:9093/api/v1/alerts
# Create a silence (mute during maintenance)
curl -X POST http://localhost:9093/api/v1/silences -d '{
  "matchers": [
    {
      "name": "instance",
      "value": "localhost:9090",
      "isRegex": false
    }
  ],
  "startsAt": "2024-01-20T10:00:00Z",
  "endsAt": "2024-01-20T12:00:00Z",
  "createdBy": "admin",
  "comment": "Planned maintenance"
}'
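In practice you rarely hand-write that silence JSON; a small script builds it with the correct RFC 3339 timestamps. A sketch (the `build_silence` helper is hypothetical, not part of any AlertManager tooling):

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(matchers, hours, author, comment):
    """Build a silence payload for AlertManager's silences API,
    with RFC 3339 UTC timestamps ('Z' suffix)."""
    start = datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return json.dumps({
        "matchers": [{"name": n, "value": v, "isRegex": False}
                     for n, v in matchers],
        "startsAt": start.isoformat().replace("+00:00", "Z"),
        "endsAt": end.isoformat().replace("+00:00", "Z"),
        "createdBy": author,
        "comment": comment,
    })

# A 2-hour maintenance window on one instance
payload = build_silence([("instance", "localhost:9090")], 2,
                        "admin", "Planned maintenance")
print(payload)
```

The resulting string can be POSTed to the silences endpoint exactly like the curl example above.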
Blackbox Exporter: Synthetic Probes
The Need
Problem: your Nginx server is running (CPU/RAM OK), but is your site actually reachable?

┌──────────────────────────────────────────────────────────────
│ TRADITIONAL MONITORING (insufficient)
├──────────────────────────────────────────────────────────────
│ Node Exporter:
│   OK  CPU: 20%
│   OK  RAM: 40%
│   OK  Disk: 60%
│
│ But...
│   ?  Does the website return 200 OK?
│   ?  Is the SSL certificate valid?
│   ?  Does DNS resolve correctly?
│   ?  Is the latency acceptable?
├──────────────────────────────────────────────────────────────
│ BLACKBOX EXPORTER (synthetic monitoring)
├──────────────────────────────────────────────────────────────
│ Simulates a real client:
│   OK  HTTP GET https://myapp.com → 200 OK
│   OK  TLS cert expiry → valid for 89 days
│   OK  DNS lookup myapp.com → 1.2.3.4
│   OK  ICMP ping 1.2.3.4 → 12 ms
│
│ Alert if:
│   - HTTP code != 200
│   - Latency > 2s
│   - Cert expires in < 30 days
│   - DNS timeout
└──────────────────────────────────────────────────────────────
Blackbox Exporter Architecture

flowchart LR
A[Prometheus] -->|1. Scrape| B[Blackbox Exporter<br/>:9115]
B -->|2. Probe HTTP| C[https://myapp.com]
B -->|2. Probe DNS| D[DNS Server]
B -->|2. Probe ICMP| E[1.2.3.4]
C -->|3. Response<br/>200 OK| B
D -->|3. Response<br/>1.2.3.4| B
E -->|3. Pong| B
B -->|4. Metrics| A
A -->|5. Alert if<br/>probe_success=0| F[AlertManager]
Execution flow:
- Prometheus calls the Blackbox Exporter with a target as a URL parameter
- Blackbox performs the request (HTTP/DNS/ICMP) against that target
- Blackbox returns the metrics (success, duration, status code)
- Prometheus stores them and evaluates the alerting rules
- AlertManager sends notifications if something is wrong
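Step 3 returns the metrics in Prometheus's plain-text exposition format. As an illustration, here is a tiny Python sketch that parses such a response into a dict (labels are ignored for simplicity; the sample text is representative, not a real probe response):

```python
def parse_probe_metrics(text):
    """Parse plain-text exposition output from a /probe call
    into a {metric_name: value} dict (simplified: labels dropped)."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, value = line.rsplit(" ", 1)
        metrics[name.split("{")[0]] = float(value)
    return metrics

sample = """\
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_http_status_code 200
probe_http_duration_seconds 0.342
"""
m = parse_probe_metrics(sample)
print(m["probe_success"], m["probe_http_status_code"])
```

`probe_success` is the metric Prometheus alerts on in step 4.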
Blackbox Exporter Configuration
The blackbox.yml file:
modules:
  # ──────────────────────────────────────────────────────
  # MODULE 1: HTTP 2xx (check for a 200 response)
  # ──────────────────────────────────────────────────────
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]    # Accept only 200
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
      # Check that the response body contains a given text
      fail_if_body_not_matches_regexp:
        - "Welcome"
      # Verify the TLS certificate
      tls_config:
        insecure_skip_verify: false

  # ──────────────────────────────────────────────────────
  # MODULE 2: HTTP POST (API health check)
  # ──────────────────────────────────────────────────────
  http_post_2xx:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"status":"check"}'

  # ──────────────────────────────────────────────────────
  # MODULE 3: ICMP ping
  # ──────────────────────────────────────────────────────
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

  # ──────────────────────────────────────────────────────
  # MODULE 4: DNS lookup
  # ──────────────────────────────────────────────────────
  dns:
    prober: dns
    timeout: 5s
    dns:
      query_name: "myapp.com"
      query_type: "A"
      valid_rcodes:
        - NOERROR
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          - "1\\.2\\.3\\.4"    # Check that DNS resolves to this IP

  # ──────────────────────────────────────────────────────
  # MODULE 5: TCP port check
  # ──────────────────────────────────────────────────────
  tcp_connect:
    prober: tcp
    timeout: 5s
    tcp:
      preferred_ip_protocol: "ip4"
Prometheus Configuration (the Relabeling Trick)
Problem: the Blackbox Exporter does not scrape targets directly; the target must be passed to it as a URL parameter.
Solution: relabeling in Prometheus.
# ============================================================
# PROMETHEUS: Blackbox Exporter jobs
# ============================================================
scrape_configs:
  # ──────────────────────────────────────────────────────
  # JOB 1: HTTP probes
  # ──────────────────────────────────────────────────────
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]    # Use the http_2xx module
    static_configs:
      - targets:
          - https://myapp.com
          - https://api.company.com/health
          - https://admin.company.com
    relabel_configs:
      # Step 1: save the original target into __param_target
      - source_labels: [__address__]
        target_label: __param_target
      # Step 2: keep the human-readable target as the 'instance' label
      - source_labels: [__param_target]
        target_label: instance
      # Step 3: replace __address__ with the Blackbox Exporter's address
      - target_label: __address__
        replacement: blackbox-exporter:9115    # Blackbox Exporter address

  # ──────────────────────────────────────────────────────
  # JOB 2: ICMP probes (ping)
  # ──────────────────────────────────────────────────────
  - job_name: 'blackbox-icmp'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 8.8.8.8        # Google DNS
          - 1.1.1.1        # Cloudflare DNS
          - 192.168.1.1    # Internal gateway
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ──────────────────────────────────────────────────────
  # JOB 3: DNS probes
  # ──────────────────────────────────────────────────────
  - job_name: 'blackbox-dns'
    metrics_path: /probe
    params:
      module: [dns]
    static_configs:
      - targets:
          - 8.8.8.8    # Resolve via Google DNS
          - 1.1.1.1    # Resolve via Cloudflare DNS
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
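The three relabel steps can be simulated in a few lines of Python for one static target. This is a hypothetical simplified model (real Prometheus relabeling is far more general), just to show what ends up in the scrape URL:

```python
def apply_relabeling(target):
    """Simulate the three relabel steps for a single Blackbox target."""
    labels = {"__address__": target}
    # Step 1: copy the original target into the URL parameter
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the human-readable target as the 'instance' label
    labels["instance"] = labels["__param_target"]
    # Step 3: point the scrape at the Blackbox Exporter itself
    labels["__address__"] = "blackbox-exporter:9115"
    # The resulting scrape URL (module comes from the job's params):
    return (f"http://{labels['__address__']}/probe"
            f"?target={labels['__param_target']}&module=http_2xx")

print(apply_relabeling("https://myapp.com"))
```

The scrape now hits the exporter's `/probe` endpoint instead of the target itself.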
The Relabeling Is ESSENTIAL
Without relabeling, Prometheus would try to scrape https://myapp.com/metrics directly, which would fail.
The relabeling turns that direct (and wrong) scrape into a call to the exporter's /probe endpoint, with the original target passed as a URL parameter.
Essential Blackbox Metrics
# Probe succeeded (1) or failed (0)
probe_success{job="blackbox-http"}
# HTTP request duration
probe_http_duration_seconds{job="blackbox-http"}
# HTTP status code
probe_http_status_code{job="blackbox-http"}
# SSL certificate expiry (Unix timestamp, in seconds)
probe_ssl_earliest_cert_expiry{job="blackbox-http"}
# ICMP ping duration
probe_icmp_duration_seconds{job="blackbox-icmp"}
# DNS lookup time
probe_dns_lookup_time_seconds{job="blackbox-dns"}
Rรจgles d'Alerte Blackbox
# prometheus-rules.yml
groups:
- name: blackbox-alerts
interval: 30s
rules:
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# ALERTE 1 : Site web down
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
- alert: WebsiteDown
expr: probe_success{job="blackbox-http"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Site web {{ $labels.instance }} est DOWN"
description: "Le site {{ $labels.instance }} ne rรฉpond pas depuis 2 minutes."
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# ALERTE 2 : Latence HTTP รฉlevรฉe
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
- alert: HighHTTPLatency
expr: probe_http_duration_seconds{job="blackbox-http"} > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Latence HTTP รฉlevรฉe sur {{ $labels.instance }}"
description: "La latence est de {{ $value }}s (seuil: 2s)"
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# ALERTE 3 : Certificat SSL expire bientรดt
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
- alert: SSLCertExpiringSoon
expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
for: 1h
labels:
severity: warning
annotations:
summary: "Certificat SSL expire dans {{ $value }} jours"
description: "Le certificat de {{ $labels.instance }} expire bientรดt."
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# ALERTE 4 : Ping รฉlevรฉ (rรฉseau lent)
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
- alert: HighPingLatency
expr: probe_icmp_duration_seconds{job="blackbox-icmp"} > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Ping รฉlevรฉ vers {{ $labels.instance }}"
description: "Latence ICMP: {{ $value }}s (seuil: 100ms)"
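The SSL expiry expression is just timestamp arithmetic: `probe_ssl_earliest_cert_expiry` is a Unix timestamp, so subtracting `time()` and dividing by 86400 yields remaining days. A quick check of that arithmetic in Python:

```python
import time

def days_until_expiry(expiry_ts, now=None):
    """Convert a certificate-expiry Unix timestamp into remaining days,
    exactly as the alert expression does with / 86400."""
    now = time.time() if now is None else now
    return (expiry_ts - now) / 86400

# A certificate expiring 89 days from a fixed "now"
now = 1_700_000_000
assert round(days_until_expiry(now + 89 * 86400, now)) == 89
# Below the 30-day threshold, the SSLCertExpiringSoon alert fires
print(days_until_expiry(now + 10 * 86400, now) < 30)
```

The `for: 1h` clause then requires the condition to hold for an hour before notifying, which filters out transient scrape glitches.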
The ELK Stack: Elasticsearch, Logstash, Kibana

ELK vs Loki: When to Use Which?
| Aspect | ELK (Elasticsearch, Logstash, Kibana) | Loki (Grafana Loki) |
|---|---|---|
| Indexing | Full-text search (every field) | Labels only (like Prometheus) |
| Storage | Heavy (indexes everything) | Light (indexes only the labels) |
| Queries | Complex (regex, aggregations) | Simple (distributed grep) |
| Performance | Excellent for complex search | Excellent for logs correlated with metrics |
| Cost | High (CPU, RAM, disk) | Low |
| Use case | Forensic analysis, compliance, SIEM | DevOps debugging, metric correlation |
| Integration | Kibana (dedicated UI) | Grafana (unified metrics + logs UI) |
Recommendation:
- ELK: heavy application logs, full-text search, compliance (audit trail)
- Loki: system/container logs, DevOps debugging, correlation with Prometheus
Why not both?
Many organizations use Loki for day-to-day logs (debugging, monitoring) and ELK for long-term archiving and forensic analysis.
ELK Architecture
┌──────────────────────────────────────────────────────────────
│ ELK PIPELINE
├──────────────────────────────────────────────────────────────
│ Application
│ └── app.log
│      │
│      ▼
│ Filebeat (lightweight agent)
│ ├── Reads the logs
│ └── Ships them to Logstash (or directly to Elasticsearch)
│      │
│      ▼
│ Logstash (ETL)
│ ├── Parse (Grok)
│ ├── Enrich (GeoIP, User-Agent)
│ └── Filter
│      │
│      ▼
│ Elasticsearch (storage)
│ ├── Indexes the logs
│ └── Full-text search
│      │
│      ▼
│ Kibana (visualization)
│ └── Dashboards, search, alerting
└──────────────────────────────────────────────────────────────
The ELK Stack with Docker Compose
The docker-compose.yml file (minimal stack):
version: '3.8'
services:
  # ============================================================
  # ELASTICSEARCH (storage)
  # ============================================================
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.3
    container_name: elasticsearch
    environment:
      - discovery.type=single-node          # Single-node mode (dev/test)
      - xpack.security.enabled=false        # Security disabled (dev only)
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"    # Heap size
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data
    networks:
      - elk

  # ============================================================
  # KIBANA (web UI)
  # ============================================================
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.3
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
    networks:
      - elk

  # ============================================================
  # LOGSTASH (ETL)
  # ============================================================
  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.3
    container_name: logstash
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
    ports:
      - "5044:5044"    # Beats input
      - "9600:9600"    # Logstash API
    environment:
      - "LS_JAVA_OPTS=-Xms256m -Xmx256m"
    depends_on:
      - elasticsearch
    networks:
      - elk

  # ============================================================
  # FILEBEAT (agent)
  # ============================================================
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.3
    container_name: filebeat
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command: filebeat -e -strict.perms=false
    depends_on:
      - logstash
    networks:
      - elk

volumes:
  es_data:
    driver: local

networks:
  elk:
    driver: bridge
Logstash Configuration (Pipeline)
The logstash/pipeline/logstash.conf file:
# ============================================================
# LOGSTASH PIPELINE
# ============================================================

# ──────────────────────────────────────────────────────
# INPUT: receive from Filebeat
# ──────────────────────────────────────────────────────
input {
  beats {
    port => 5044
  }
}

# ──────────────────────────────────────────────────────
# FILTER: parse and enrich
# ──────────────────────────────────────────────────────
filter {
  # Parse JSON logs
  if [message] =~ /^\{/ {
    json {
      source => "message"
    }
  }

  # Parse Nginx logs (combined format)
  if [fields][log_type] == "nginx" {
    grok {
      match => { "message" => '%{IPORHOST:clientip} - %{USER:ident} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}' }
    }
    # Convert the date
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
      target => "@timestamp"
    }
    # GeoIP lookup on the client IP
    geoip {
      source => "clientip"
      target => "geoip"
    }
    # Parse the User-Agent
    useragent {
      source => "agent"
      target => "user_agent"
    }
  }

  # Parse application logs (standard format)
  if [fields][log_type] == "application" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:thread}\] %{DATA:logger} - %{GREEDYDATA:log_message}" }
    }
  }

  # Add tags
  mutate {
    add_field => { "environment" => "production" }
    remove_field => [ "message" ]    # Drop the raw message once parsed
  }
}

# ──────────────────────────────────────────────────────
# OUTPUT: send to Elasticsearch
# ──────────────────────────────────────────────────────
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{[fields][log_type]}-%{+YYYY.MM.dd}"
  }
  # Debug: also print to stdout
  stdout {
    codec => rubydebug
  }
}
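To see what the Grok filter actually does to an Nginx combined-format line, here is an approximate Python equivalent using a named-group regex (a sketch: Grok's built-in patterns like %{IPORHOST} are richer than these simplified groups):

```python
import re

# Rough Python analogue of the Grok pattern for Nginx combined logs
NGINX_RE = re.compile(
    r'(?P<clientip>\S+) - (?P<ident>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('203.0.113.7 - alice [20/Jan/2024:10:15:32 +0000] '
        '"GET /index.html HTTP/1.1" 200 1543 "-" "Mozilla/5.0"')
# One flat text line becomes a structured event with named fields
event = NGINX_RE.match(line).groupdict()
print(event["clientip"], event["method"], event["response"])
```

Each named group becomes a field in the Elasticsearch document, which is exactly what makes Kibana queries like `response:500` possible.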
Filebeat Configuration
The filebeat/filebeat.yml file:
filebeat.inputs:
  # ──────────────────────────────────────────────────────
  # INPUT 1: Docker container logs
  # ──────────────────────────────────────────────────────
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    fields:
      log_type: docker
    fields_under_root: true

  # ──────────────────────────────────────────────────────
  # INPUT 2: Nginx logs
  # ──────────────────────────────────────────────────────
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log
    fields:
      log_type: nginx
    fields_under_root: true

  # ──────────────────────────────────────────────────────
  # INPUT 3: application logs
  # ──────────────────────────────────────────────────────
  - type: log
    enabled: true
    paths:
      - /var/log/myapp/*.log
    fields:
      log_type: application
    fields_under_root: true
    # Join multi-line entries (e.g. stack traces) to the log line before them
    multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    multiline.negate: true
    multiline.match: after

# ──────────────────────────────────────────────────────
# OUTPUT: send to Logstash
# ──────────────────────────────────────────────────────
output.logstash:
  hosts: ["logstash:5044"]

# ──────────────────────────────────────────────────────
# LOGGING
# ──────────────────────────────────────────────────────
logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
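The multiline settings deserve a closer look: `negate: true` plus `match: after` means any line that does NOT start with a timestamp is appended to the previous event, so a Java stack trace ships as one log entry instead of dozens. A Python sketch of that grouping logic (simplified model, not Filebeat code):

```python
import re

def group_multiline(lines, pattern=r"^[0-9]{4}-[0-9]{2}-[0-9]{2}"):
    """Mimic Filebeat's multiline settings: lines not matching the
    timestamp pattern are appended to the preceding event."""
    starts = re.compile(pattern)
    events = []
    for line in lines:
        if starts.match(line) or not events:
            events.append(line)          # a new event starts here
        else:
            events[-1] += "\n" + line    # continuation of the previous one
    return events

lines = [
    "2024-01-20 10:15:32 ERROR NullPointerException",
    "    at com.example.Service.run(Service.java:42)",
    "    at java.base/java.lang.Thread.run(Thread.java:833)",
    "2024-01-20 10:15:33 INFO Request handled",
]
events = group_multiline(lines)
print(len(events))  # 2 events instead of 4 raw lines
```

Without this, each stack-trace line would land in Elasticsearch as a separate, context-free document.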
Launching the ELK Stack
# Start the stack
docker-compose up -d
# Wait for Elasticsearch to be ready (30s-1min)
curl -X GET "localhost:9200/_cluster/health?wait_for_status=yellow&timeout=50s&pretty"
# Access Kibana
# http://localhost:5601
# Check the indices that were created
curl -X GET "localhost:9200/_cat/indices?v"
# Search logs
curl -X GET "localhost:9200/logs-*/_search?pretty"
SRE Golden Signals: The 4 Metrics That Matter
Google SRE Theory
The four golden signals are the four essential metrics for monitoring any system.

┌──────────────────────────────────────────────────────────────
│ GOLDEN SIGNALS (Google)
├──────────────────────────────────────────────────────────────
│ 1. LATENCY
│    Time to serve a request
│    Tools: Blackbox Exporter, application metrics
│    Alert: P95 > 2s
│
│ 2. TRAFFIC
│    Load on the system (req/s, connections/s)
│    Tools: Nginx metrics, HAProxy metrics
│    Alert: sudden increase > 200%
│
│ 3. ERRORS
│    Error rate (5xx, failed requests)
│    Tools: Blackbox Exporter, logs (ELK)
│    Alert: error rate > 5%
│
│ 4. SATURATION
│    Resource utilization (CPU, RAM, disk, network)
│    Tool: Node Exporter
│    Alert: CPU > 80%, RAM > 90%, disk > 85%
└──────────────────────────────────────────────────────────────
Mapping to Our Tools
| Golden Signal | Prometheus Metric | Tool |
|---|---|---|
| Latency | probe_http_duration_seconds | Blackbox Exporter |
| Latency | http_request_duration_seconds | Application (instrumentation) |
| Traffic | nginx_http_requests_total | Nginx Exporter |
| Traffic | haproxy_frontend_connections_total | HAProxy Exporter |
| Errors | probe_success == 0 | Blackbox Exporter |
| Errors | http_requests_total{code=~"5.."} | Application (instrumentation) |
| Saturation | node_cpu_seconds_total | Node Exporter |
| Saturation | node_memory_MemAvailable_bytes | Node Exporter |
| Saturation | node_filesystem_avail_bytes | Node Exporter |
Example SRE Dashboard
PromQL queries for a Golden Signals dashboard:
# ──────────────────────────────────────────────────────
# 1. LATENCY (P95 over the last 5 minutes)
# ──────────────────────────────────────────────────────
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
# Or, for Blackbox
probe_http_duration_seconds{job="blackbox-http"}

# ──────────────────────────────────────────────────────
# 2. TRAFFIC (requests per second)
# ──────────────────────────────────────────────────────
rate(nginx_http_requests_total[1m])

# ──────────────────────────────────────────────────────
# 3. ERRORS (5xx error rate)
# ──────────────────────────────────────────────────────
sum(rate(http_requests_total{code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Or, for Blackbox (success rate in %; subtract from 100 for the failure rate)
avg_over_time(probe_success{job="blackbox-http"}[5m]) * 100

# ──────────────────────────────────────────────────────
# 4. SATURATION (CPU utilization)
# ──────────────────────────────────────────────────────
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# RAM saturation
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
# Disk saturation
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
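The ERRORS ratio is plain arithmetic once the counter rates are in hand: 5xx increase divided by total increase over the same window. A minimal Python sketch of that calculation, with a guard against empty windows:

```python
def error_rate_pct(err_increase, total_increase):
    """5xx error rate in percent, from counter increases over
    the same time window (0 when there was no traffic)."""
    return 100 * err_increase / total_increase if total_increase else 0.0

# 50 requests ended in 5xx out of 1000 over the last 5 minutes
print(error_rate_pct(50, 1000))  # 5.0 -> right at a 5% alert threshold
```

In PromQL the division already operates on matching rates, but the zero-traffic guard matters there too: the ratio is undefined when the denominator is zero, which is why dashboards often wrap it with an `or vector(0)` fallback.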
Quick Reference
Default Ports
| Service | Port | Description |
|---|---|---|
| AlertManager | 9093 | AlertManager API and UI |
| Blackbox Exporter | 9115 | Metrics endpoint |
| Elasticsearch | 9200 | HTTP API |
| Elasticsearch | 9300 | Inter-node communication |
| Kibana | 5601 | Web UI |
| Logstash | 5044 | Beats input |
| Logstash | 9600 | Monitoring API |
Test Commands
# ============================================================
# ALERTMANAGER
# ============================================================
# Health check
curl http://localhost:9093/-/healthy
# List active alerts
curl http://localhost:9093/api/v1/alerts
# Create a test alert
curl -X POST http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": {"alertname": "TestAlert", "severity": "critical"},
    "annotations": {"summary": "Test"}
  }
]'
# ============================================================
# BLACKBOX EXPORTER
# ============================================================
# Health check
curl http://localhost:9115/metrics
# Run an HTTP probe
curl "http://localhost:9115/probe?target=https://google.com&module=http_2xx"
# Run an ICMP probe
curl "http://localhost:9115/probe?target=8.8.8.8&module=icmp"
# ============================================================
# ELASTICSEARCH
# ============================================================
# Cluster health
curl http://localhost:9200/_cluster/health?pretty
# List indices
curl http://localhost:9200/_cat/indices?v
# Search logs
curl -X GET "http://localhost:9200/logs-*/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "level": "ERROR"
    }
  }
}'
# ============================================================
# KIBANA
# ============================================================
# Status check
curl http://localhost:5601/api/status
# Open the UI
# http://localhost:5601
Complete Quick Reference
# ============================================================
# ALERTMANAGER: routing
# ============================================================
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
# ============================================================
# BLACKBOX: HTTP probe
# ============================================================
# prometheus.yml
- job_name: 'blackbox-http'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets: ['https://myapp.com']
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - target_label: __address__
      replacement: blackbox:9115
# ============================================================
# ELK: Docker Compose
# ============================================================
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.3
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.3
    ports:
      - "5601:5601"