TP Final : Stack d'Observabilitรฉ Complรจte
Objectifs
- Dรฉployer une stack de monitoring production-ready
- Monitorer une application multi-tiers
- Crรฉer des SLOs et SLIs
- Simuler et rรฉsoudre des incidents
Durรฉe : 2 heures
Scรฉnario
Vous รชtes SRE chez TechShop, une plateforme e-commerce. Vous devez mettre en place une stack d'observabilitรฉ complรจte pour monitorer :
- Frontend (Nginx)
- API Backend (Python/Flask)
- Base de donnรฉes (PostgreSQL)
- Cache (Redis)
Architecture

ARCHITECTURE TECHSHOP
โโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ MONITORING โ
โ โ
โ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโ
โ โPrometheusโ โ Grafana โ โAlertmgrโโ
โ โ :9090 โ โ :3000 โ โ :9093 โโ
โ โโโโโโฌโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโ
โ โ โ
โโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ scrape
โโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ โ
โผ โผ โผ
โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ
โ Nginx โ โ API Flask โ โ Node Exporterโ
โ Frontend โ โ Backend โ โ (chaque) โ
โ :80 โ โ :5000 โ โ :9100 โ
โ + Exporter โ โ + /metrics โ โ โ
โ :9113 โ โ โ โ โ
โโโโโโโโโฌโโโโโโโโ โโโโโโโโโฌโโโโโโโโ โโโโโโโโโโโโโโโโโ
โ โ
โ โโโโโโโโโดโโโโโโโโ
โ โ โ
โ โผ โผ
โ โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ
โ โ PostgreSQL โ โ Redis โ
โ โ :5432 โ โ :6379 โ
โ โ + Exporter โ โ + Exporter โ
โ โ :9187 โ โ :9121 โ
โ โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ
โ
โผ
โโโโโโโโโโโโโโโโโ
โ Blackbox โ
โ Exporter โ
โ :9115 โ
โโโโโโโโโโโโโโโโโ
Partie 1 : Dรฉploiement (30 min)
1.1 Structure du Projet
1.2 Docker Compose
# docker-compose.yml
version: '3.8'
services:
# ============ APPLICATION ============
nginx:
image: nginx:alpine
container_name: nginx
ports:
- "80:80"
volumes:
- ./app/nginx.conf:/etc/nginx/nginx.conf:ro
depends_on:
- api
api:
build: ./app
container_name: api
ports:
- "5000:5000"
environment:
- DATABASE_URL=postgresql://techshop:password@postgres:5432/techshop
- REDIS_URL=redis://redis:6379
depends_on:
- postgres
- redis
postgres:
image: postgres:15-alpine
container_name: postgres
environment:
- POSTGRES_USER=techshop
- POSTGRES_PASSWORD=password
- POSTGRES_DB=techshop
volumes:
- postgres_data:/var/lib/postgresql/data
redis:
image: redis:7-alpine
container_name: redis
# ============ EXPORTERS ============
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
nginx-exporter:
image: nginx/nginx-prometheus-exporter:latest
container_name: nginx-exporter
ports:
- "9113:9113"
command:
- '-nginx.scrape-uri=http://nginx:80/stub_status'
depends_on:
- nginx
postgres-exporter:
image: prometheuscommunity/postgres-exporter:latest
container_name: postgres-exporter
ports:
- "9187:9187"
environment:
- DATA_SOURCE_NAME=postgresql://techshop:password@postgres:5432/techshop?sslmode=disable
redis-exporter:
image: oliver006/redis_exporter:latest
container_name: redis-exporter
ports:
- "9121:9121"
environment:
- REDIS_ADDR=redis://redis:6379
blackbox-exporter:
image: prom/blackbox-exporter:latest
container_name: blackbox-exporter
ports:
- "9115:9115"
volumes:
- ./prometheus/blackbox.yml:/etc/blackbox_exporter/config.yml
# ============ MONITORING STACK ============
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/rules:/etc/prometheus/rules
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=7d'
- '--web.enable-lifecycle'
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
- grafana_data:/var/lib/grafana
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
volumes:
postgres_data:
prometheus_data:
grafana_data:
1.3 Application Flask avec Mรฉtriques
# app/app.py
from flask import Flask, jsonify, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time
import random
import redis
import psycopg2
import os
app = Flask(__name__)
# Mรฉtriques Prometheus
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
'http_request_duration_seconds',
'HTTP request latency',
['method', 'endpoint'],
buckets=[0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0]
)
ACTIVE_REQUESTS = Gauge(
'http_requests_active',
'Active HTTP requests'
)
ORDERS_TOTAL = Counter(
'orders_total',
'Total orders processed',
['status']
)
CART_VALUE = Histogram(
'cart_value_euros',
'Shopping cart value',
buckets=[10, 25, 50, 100, 200, 500, 1000]
)
# Connexions
redis_client = redis.from_url(os.environ.get('REDIS_URL', 'redis://localhost:6379'))
def get_db():
return psycopg2.connect(os.environ.get('DATABASE_URL', 'postgresql://techshop:password@localhost:5432/techshop'))
# Middleware pour les mรฉtriques
@app.before_request
def before_request():
request.start_time = time.time()
ACTIVE_REQUESTS.inc()
@app.after_request
def after_request(response):
latency = time.time() - request.start_time
REQUEST_COUNT.labels(request.method, request.path, response.status_code).inc()
REQUEST_LATENCY.labels(request.method, request.path).observe(latency)
ACTIVE_REQUESTS.dec()
return response
# Endpoints
@app.route('/health')
def health():
return jsonify({'status': 'healthy'})
@app.route('/api/products')
def get_products():
# Simuler une latence variable
time.sleep(random.uniform(0.01, 0.1))
return jsonify({'products': [
{'id': 1, 'name': 'Laptop', 'price': 999},
{'id': 2, 'name': 'Phone', 'price': 699},
{'id': 3, 'name': 'Tablet', 'price': 499}
]})
@app.route('/api/cart', methods=['POST'])
def add_to_cart():
time.sleep(random.uniform(0.02, 0.15))
value = random.uniform(10, 500)
CART_VALUE.observe(value)
return jsonify({'status': 'added', 'cart_value': value})
@app.route('/api/order', methods=['POST'])
def create_order():
time.sleep(random.uniform(0.05, 0.3))
# Simuler des erreurs occasionnelles
if random.random() < 0.05:
ORDERS_TOTAL.labels('failed').inc()
return jsonify({'error': 'Payment failed'}), 500
ORDERS_TOTAL.labels('success').inc()
return jsonify({'order_id': random.randint(1000, 9999)})
@app.route('/api/slow')
def slow_endpoint():
# Endpoint intentionnellement lent pour les tests
time.sleep(random.uniform(1, 3))
return jsonify({'status': 'slow response'})
@app.route('/metrics')
def metrics():
return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
# app/Dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN pip install flask prometheus_client redis psycopg2-binary
COPY app.py .
CMD ["python", "app.py"]
1.4 Configuration Prometheus
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
environment: production
service: techshop
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- '/etc/prometheus/rules/*.yml'
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'api'
static_configs:
- targets: ['api:5000']
- job_name: 'nginx'
static_configs:
- targets: ['nginx-exporter:9113']
- job_name: 'postgres'
static_configs:
- targets: ['postgres-exporter:9187']
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- http://nginx:80/health
- http://api:5000/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
Partie 2 : Rรจgles d'Alerte (20 min)
2.1 Alertes Infrastructure
# prometheus/rules/infrastructure.yml
groups:
- name: infrastructure
rules:
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.job }} instance {{ $labels.instance }} has been down for more than 1 minute"
- alert: HighCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.1f\" }}%"
- alert: HighMemory
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 15
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% available"
2.2 Alertes Application
# prometheus/rules/application.yml
groups:
- name: application
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
/
sum(rate(http_requests_total[5m])) by (endpoint)
* 100 > 5
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate on {{ $labels.endpoint }}"
description: "Error rate is {{ $value | printf \"%.2f\" }}%"
- alert: HighLatencyP95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "High P95 latency on {{ $labels.endpoint }}"
description: "P95 latency is {{ $value | printf \"%.2f\" }}s"
- alert: OrderFailureRate
expr: |
sum(rate(orders_total{status="failed"}[5m]))
/
sum(rate(orders_total[5m]))
* 100 > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High order failure rate"
description: "{{ $value | printf \"%.1f\" }}% of orders are failing"
- alert: HighActiveRequests
expr: http_requests_active > 50
for: 2m
labels:
severity: warning
annotations:
summary: "Too many active requests"
2.3 Alertes SLO
# prometheus/rules/slo.yml
groups:
- name: slo
rules:
# Recording rules pour SLIs
- record: sli:availability:ratio_rate5m
expr: |
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
- record: sli:latency:ratio_rate5m
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
# Alertes SLO
- alert: SLOAvailabilityBreach
expr: sli:availability:ratio_rate5m < 0.995
for: 5m
labels:
severity: critical
slo: availability
annotations:
summary: "SLO Availability breach"
description: "Availability is {{ $value | printf \"%.3f\" }} (target: 99.5%)"
- alert: SLOLatencyBreach
expr: sli:latency:ratio_rate5m < 0.95
for: 5m
labels:
severity: warning
slo: latency
annotations:
summary: "SLO Latency breach"
description: "Only {{ $value | printf \"%.1f\" }}% of requests under 300ms (target: 95%)"
Partie 3 : Dashboard Grafana (30 min)
3.1 Provisioning Datasource
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
3.2 Dashboard TechShop
Crรฉez un dashboard avec les panels suivants :

LAYOUT DASHBOARD TECHSHOP
โโโโโโโโโโโโโโโโโโโโโโโโโ
Row: Overview
โโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโ
โ Uptime โRequests/sโError Rateโ P95 Lat โ
โ STAT โ STAT โ STAT โ STAT โ
โโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโโ
Row: Traffic
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Requests per Second (Time Series) โ
โ by endpoint โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Row: Performance
โโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโ
โ Latency Distributionโ Error Rate by Endpointโ
โ HEATMAP โ TIME SERIES โ
โโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโ
Row: Business Metrics
โโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโ
โ Orders/min โ Cart Value Histogram โ
โ TIME SERIES โ HISTOGRAM โ
โโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโ
Row: Infrastructure
โโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโ
โ CPU % โ Memory % โ Redis Opsโ PG Conn โ
โ GAUGE โ GAUGE โ STAT โ STAT โ
โโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโโ
3.3 Queries Importantes
# Requests per second
sum(rate(http_requests_total[5m])) by (endpoint)
# Error rate %
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# P95 Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Orders per minute
sum(rate(orders_total[1m])) * 60
# Redis operations
rate(redis_commands_total[5m])
# PostgreSQL connections
pg_stat_activity_count{state="active"}
Partie 4 : Simulation d'Incidents (30 min)
4.1 Test de Charge
# Installer hey (load testing tool)
# https://github.com/rakyll/hey
# Charge normale
hey -n 1000 -c 10 http://localhost/api/products
# Charge รฉlevรฉe
hey -n 10000 -c 100 http://localhost/api/products
# Test endpoint lent
hey -n 100 -c 20 http://localhost/api/slow
4.2 Simulation de Pannes
# Arrรชter l'API (dรฉclenche InstanceDown)
docker stop api
# Observer les alertes dans Alertmanager
curl http://localhost:9093/api/v2/alerts | jq .
# Redรฉmarrer
docker start api
# Simuler une surcharge PostgreSQL
docker exec -it postgres psql -U techshop -c "SELECT pg_sleep(10);" &
docker exec -it postgres psql -U techshop -c "SELECT pg_sleep(10);" &
docker exec -it postgres psql -U techshop -c "SELECT pg_sleep(10);" &
4.3 Checklist de Validation
- [ ] Toutes les targets sont UP dans Prometheus
- [ ] Les mรฉtriques de l'API apparaissent
- [ ] Les alertes sont configurรฉes (Status > Rules)
- [ ] Grafana affiche les dashboards
- [ ] Les alertes se dรฉclenchent lors des tests
- [ ] Les notifications arrivent (si configurรฉes)
Partie 5 : SLOs et SLIs
5.1 Dรฉfinition des SLOs
| SLI | SLO | Mesure |
|---|---|---|
| Availability | 99.5% | Requรชtes non-5xx / Total |
| Latency P95 | < 300ms | 95% des requรชtes sous 300ms |
| Error Rate | < 1% | Erreurs 5xx / Total |
5.2 Dashboard SLO
# Error Budget remaining (%)
100 * (
1 - (
(1 - sli:availability:ratio_rate5m) / (1 - 0.995)
)
)
# Burn rate (1 = normal, >1 = burning fast)
(1 - sli:availability:ratio_rate5m) / (1 - 0.995)
# Time until budget exhaustion (hours)
# Si burn rate constant
(1 - 0.995) * 30 * 24 / (1 - sli:availability:ratio_rate5m)
รvaluation
Critรจres
| Critรจre | Points |
|---|---|
| Stack dรฉployรฉe et fonctionnelle | 20 |
| Mรฉtriques collectรฉes correctement | 20 |
| Rรจgles d'alerte pertinentes | 20 |
| Dashboard complet et lisible | 20 |
| SLOs dรฉfinis et monitorรฉs | 10 |
| Documentation/Annotations | 10 |
Total : 100 points Seuil de rรฉussite : 70 points
Livrables
- docker-compose.yml fonctionnel
- prometheus.yml avec tous les jobs
- Rรจgles d'alerte (3 fichiers minimum)
- Dashboard JSON exportรฉ
- Document SLO avec justifications
Bonus
- Ajouter Loki pour les logs
- Intรฉgrer des traces avec Tempo
- Crรฉer un runbook automatisรฉ
- Configurer des webhooks personnalisรฉs
Prรฉcรฉdent : Module 5 - Alerting
Retour au programme : Index
Navigation
| โ Module 5 : Alerting et Notifications | Programme โ |