Module 10: Observability - Monitoring, Logging & Tracing
Estimated duration: 30 minutes
Module Objectives
By the end of this module, you will be able to:
- Create custom dashboards and metrics with Cloud Monitoring
- Analyze logs with Cloud Logging and Log Analytics
- Implement distributed tracing with Cloud Trace
- Configure alerting and incident management
- Define and track SLOs (Service Level Objectives)
1. Cloud Operations Suite
Overview
graph TB
subgraph "Data Collection"
APP[Applications]
GKE[GKE Workloads]
VMs[Compute Engine]
CF[Cloud Functions]
CR[Cloud Run]
end
subgraph "Cloud Operations Suite"
subgraph "Observability"
CM[Cloud Monitoring<br/>Metrics]
CL[Cloud Logging<br/>Logs]
CT[Cloud Trace<br/>Traces]
CP[Cloud Profiler<br/>CPU/Memory]
end
subgraph "Analysis"
LA[Log Analytics<br/>BigQuery]
ME[Metrics Explorer]
ER[Error Reporting]
end
subgraph "Alerting"
AP[Alerting Policies]
NC[Notification Channels]
IM[Incident Management]
end
end
APP --> CM
APP --> CL
APP --> CT
GKE --> CM
GKE --> CL
VMs --> CM
VMs --> CL
CM --> ME
CL --> LA
CM --> AP
CL --> AP
style CM fill:#4285F4,color:#fff
style CL fill:#34A853,color:#fff
style CT fill:#FBBC04,color:#000
Cloud Operations Services
| Service | Function | Equivalent |
|---|---|---|
| Cloud Monitoring | Metrics, dashboards, alerts | Datadog, Prometheus/Grafana |
| Cloud Logging | Log centralization | ELK Stack, Splunk |
| Cloud Trace | Distributed tracing | Jaeger, Zipkin |
| Cloud Profiler | CPU/memory profiling | Pyroscope |
| Error Reporting | Error aggregation | Sentry |
2. Cloud Monitoring
Metric types
graph LR
subgraph "Built-in Metrics"
GCP[GCP Services<br/>compute.googleapis.com/*]
K8S[Kubernetes<br/>kubernetes.io/*]
AG[Agent Metrics<br/>agent.googleapis.com/*]
end
subgraph "Custom Metrics"
APP[Application<br/>custom.googleapis.com/*]
OC[OpenCensus/OTel]
end
GCP --> CM[Cloud Monitoring]
K8S --> CM
AG --> CM
APP --> CM
OC --> CM
style CM fill:#4285F4,color:#fff
Metrics Explorer
# List the available metrics
gcloud monitoring metrics list --filter="metric.type:compute.googleapis.com"
# Common metric types
# - compute.googleapis.com/instance/cpu/utilization
# - compute.googleapis.com/instance/disk/read_bytes_count
# - loadbalancing.googleapis.com/https/request_count
# - cloudsql.googleapis.com/database/cpu/utilization
# - run.googleapis.com/request_count
# - kubernetes.io/container/cpu/core_usage_time
Custom metrics with Python
from google.cloud import monitoring_v3
import time
def write_custom_metric(project_id: str, metric_type: str, value: float):
"""Écrire une métrique custom."""
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"
series = monitoring_v3.TimeSeries()
series.metric.type = f"custom.googleapis.com/{metric_type}"
series.resource.type = "global"
now = time.time()
interval = monitoring_v3.TimeInterval(
{"end_time": {"seconds": int(now), "nanos": int((now % 1) * 10**9)}}
)
point = monitoring_v3.Point({
"interval": interval,
"value": {"double_value": value}
})
series.points = [point]
client.create_time_series(
request={"name": project_name, "time_series": [series]}
)
print(f"Wrote {metric_type}={value}")
# Usage
write_custom_metric("my-project", "myapp/orders_processed", 42.0)
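To verify that the points actually land in Cloud Monitoring, the same client can read the series back. A minimal sketch, assuming the metric above was written within the last 10 minutes (read_custom_metric is an illustrative helper, not part of the client library):
from google.cloud import monitoring_v3
import time

def read_custom_metric(project_id: str, metric_type: str):
    """List recent points of a custom metric."""
    client = monitoring_v3.MetricServiceClient()
    now = time.time()
    interval = monitoring_v3.TimeInterval({
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now) - 600},  # last 10 minutes
    })
    results = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": f'metric.type = "custom.googleapis.com/{metric_type}"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        for point in series.points:
            print(point.interval.end_time, point.value.double_value)

read_custom_metric("my-project", "myapp/orders_processed")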
Create a Dashboard
# Via gcloud (JSON)
cat > dashboard.json << 'EOF'
{
"displayName": "Application Dashboard",
"gridLayout": {
"columns": "2",
"widgets": [
{
"title": "CPU Utilization",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
}
}]
}
},
{
"title": "Request Count",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
}
}]
}
}
]
}
}
EOF
gcloud monitoring dashboards create --config-from-file=dashboard.json
Uptime Checks
# Create an HTTP uptime check (the positional argument is the display name; --period is in minutes)
gcloud monitoring uptime create my-app-check \
  --resource-type=uptime-url \
  --resource-labels=host=myapp.example.com \
  --protocol=https \
  --port=443 \
  --path="/health" \
  --period=1 \
  --timeout=10 \
  --matcher-content="healthy" \
  --matcher-type=contains-string
# List uptime checks
gcloud monitoring uptime list-configs
3. Cloud Logging
Logging architecture
graph LR
subgraph "Sources"
APP[Application Logs]
SYS[System Logs]
AUDIT[Audit Logs]
VPC[VPC Flow Logs]
end
subgraph "Cloud Logging"
ROUTER[Log Router]
BUCKET[Log Buckets]
end
subgraph "Destinations"
CS[Cloud Storage]
BQ[BigQuery]
PS[Pub/Sub]
SPLUNK[Splunk/SIEM]
end
APP --> ROUTER
SYS --> ROUTER
AUDIT --> ROUTER
VPC --> ROUTER
ROUTER --> BUCKET
ROUTER --> CS
ROUTER --> BQ
ROUTER --> PS
PS --> SPLUNK
style ROUTER fill:#34A853,color:#fff
Logging queries
# Log filter format
# resource.type="RESOURCE_TYPE"
# logName="projects/PROJECT_ID/logs/LOG_NAME"
# severity>=ERROR
# timestamp>="2024-01-01T00:00:00Z"
# jsonPayload.key="value"
# Example queries
# Error logs from the last 24 hours
gcloud logging read 'severity>=ERROR' \
--limit=100 \
--format="table(timestamp,resource.type,textPayload)"
# GKE logs from a given namespace
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="production"' \
--limit=50
# Cloud Run logs containing a specific message
gcloud logging read 'resource.type="cloud_run_revision" AND textPayload:"error"' \
--limit=20
# Audit logs for resource creation
gcloud logging read 'logName:"cloudaudit.googleapis.com" AND protoPayload.methodName:"create"'
# Structured (JSON) logs
gcloud logging read 'jsonPayload.level="error" AND jsonPayload.service="api"'
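The jsonPayload filters above only work if the application emits structured logs. A minimal sketch with the google-cloud-logging client library; the log name "api" and the field names are illustrative and simply mirror the filters used in this section:
from google.cloud import logging as cloud_logging  # avoid clashing with the stdlib logging module

client = cloud_logging.Client()
logger = client.logger("api")  # logName ends in /logs/api

# Each key of the dict becomes a jsonPayload field, so this entry matches:
# jsonPayload.level="error" AND jsonPayload.service="api"
logger.log_struct(
    {"level": "error", "service": "api", "latency_ms": 950, "message": "upstream timeout"},
    severity="ERROR",
)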
Log-based metrics
# Create a log-based metric
gcloud logging metrics create error-count \
--description="Count of error logs" \
--log-filter='severity>=ERROR'
# Metric with labels
gcloud logging metrics create api-latency \
--description="API latency from logs" \
--log-filter='resource.type="cloud_run_revision" AND jsonPayload.latency_ms:*' \
--value-extractor='EXTRACT(jsonPayload.latency_ms)' \
--label-extractors='service=EXTRACT(resource.labels.service_name)'
# List log-based metrics
gcloud logging metrics list
Log sinks (export)
# Sink to BigQuery
gcloud logging sinks create bq-all-logs \
bigquery.googleapis.com/projects/$PROJECT_ID/datasets/logs_dataset \
--log-filter='resource.type="cloud_run_revision"'
# Sink to Cloud Storage
gcloud logging sinks create gcs-audit-logs \
storage.googleapis.com/$PROJECT_ID-audit-logs \
--log-filter='logName:"cloudaudit.googleapis.com"'
# Sink to Pub/Sub (for an external SIEM)
gcloud logging sinks create pubsub-security-logs \
pubsub.googleapis.com/projects/$PROJECT_ID/topics/security-logs \
--log-filter='severity>=WARNING'
# Important: grant the sink's writer identity the required role on the destination
SINK_SA=$(gcloud logging sinks describe bq-all-logs --format="get(writerIdentity)")
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member=$SINK_SA \
--role=roles/bigquery.dataEditor
Log Analytics (SQL on logs)
-- In the Console: Logging > Log Analytics
-- Top 10 errors
SELECT
TIMESTAMP_TRUNC(timestamp, HOUR) as hour,
resource.type,
COUNT(*) as error_count
FROM `PROJECT_ID.global._Default._AllLogs`
WHERE severity = 'ERROR'
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY hour, resource.type
ORDER BY error_count DESC
LIMIT 10;
-- P50, P95, P99 latency per service
SELECT
  JSON_VALUE(json_payload, '$.service') as service,
  APPROX_QUANTILES(CAST(JSON_VALUE(json_payload, '$.latency_ms') AS FLOAT64), 100)[OFFSET(50)] as p50,
  APPROX_QUANTILES(CAST(JSON_VALUE(json_payload, '$.latency_ms') AS FLOAT64), 100)[OFFSET(95)] as p95,
  APPROX_QUANTILES(CAST(JSON_VALUE(json_payload, '$.latency_ms') AS FLOAT64), 100)[OFFSET(99)] as p99
FROM `PROJECT_ID.global._Default._AllLogs`
WHERE JSON_VALUE(json_payload, '$.latency_ms') IS NOT NULL
GROUP BY service;
4. Cloud Trace
Automatic instrumentation
Several managed runtimes (App Engine, Cloud Run) report basic request latency traces to Cloud Trace without any code change, and the X-Cloud-Trace-Context header propagates the trace context between services so that spans are stitched into a single trace.
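For application frameworks, OpenTelemetry also ships auto-instrumentation packages that create a span for every incoming request. A minimal sketch for Flask, assuming the opentelemetry-instrumentation-flask and opentelemetry-exporter-gcp-trace packages are installed and the tracer provider is configured with the Cloud Trace exporter exactly as in the manual example below:
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
# Every incoming HTTP request now produces a span automatically,
# without modifying the view functions.
FlaskInstrumentor().instrument_app(app)

@app.route("/orders/<order_id>")
def get_order(order_id):
    return {"order_id": order_id, "status": "ok"}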
Manual instrumentation (Python)
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Setup
tracer_provider = TracerProvider()
cloud_trace_exporter = CloudTraceSpanExporter()
tracer_provider.add_span_processor(BatchSpanProcessor(cloud_trace_exporter))
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer(__name__)
# Usage in application code
def process_order(order_id: str):
with tracer.start_as_current_span("process_order") as span:
span.set_attribute("order.id", order_id)
        # Child span for the database call
with tracer.start_as_current_span("db_query") as db_span:
db_span.set_attribute("db.operation", "SELECT")
# ... query database
        # Child span for the external API call
with tracer.start_as_current_span("external_api") as api_span:
api_span.set_attribute("http.url", "https://api.example.com")
# ... call API
return {"status": "processed"}
Analyze traces
# In the Console: Trace > Trace list
# Available filters:
# - Service name
# - Span name
# - Latency (min/max)
# - Status (OK, ERROR)
# - Time range
# Retrieve traces via the API
gcloud trace traces list --limit=10 --format=json
5. Alerting
Create an alerting policy
# Alert when CPU > 80%
cat > cpu-alert.yaml << 'EOF'
displayName: "High CPU Alert"
combiner: OR
conditions:
- displayName: "CPU > 80%"
conditionThreshold:
filter: 'metric.type="compute.googleapis.com/instance/cpu/utilization"'
comparison: COMPARISON_GT
thresholdValue: 0.8
duration: 300s
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_MEAN
notificationChannels:
- projects/PROJECT_ID/notificationChannels/CHANNEL_ID
documentation:
content: |
CPU utilization exceeded 80% for 5 minutes.
Check the instance and consider scaling.
mimeType: text/markdown
EOF
gcloud alpha monitoring policies create --policy-from-file=cpu-alert.yaml
Notification Channels
# Create an email channel
gcloud alpha monitoring channels create \
--display-name="Ops Team Email" \
--type=email \
--channel-labels=email_address=ops@company.com
# Create a Slack channel
gcloud alpha monitoring channels create \
--display-name="Slack Alerts" \
--type=slack \
--channel-labels=channel_name=#alerts
# Create a PagerDuty channel
gcloud alpha monitoring channels create \
--display-name="PagerDuty" \
--type=pagerduty \
--channel-labels=service_key=YOUR_SERVICE_KEY
# List notification channels
gcloud alpha monitoring channels list
Multi-condition alerts
# Composite alert: high CPU AND high memory
displayName: "Resource Pressure Alert"
combiner: AND
conditions:
- displayName: "CPU > 80%"
conditionThreshold:
filter: 'metric.type="compute.googleapis.com/instance/cpu/utilization"'
comparison: COMPARISON_GT
thresholdValue: 0.8
duration: 300s
- displayName: "Memory > 90%"
conditionThreshold:
filter: 'metric.type="agent.googleapis.com/memory/percent_used"'
comparison: COMPARISON_GT
thresholdValue: 90
duration: 300s
6. SLOs (Service Level Objectives)
SRE concepts
graph TB
subgraph "SLI (Indicator)"
A[Availability = requests_success / requests_total]
B[Latency = requests_below_threshold / requests_total]
end
subgraph "SLO (Objective)"
C[99.9% availability monthly]
D[95% requests < 200ms]
end
subgraph "Error Budget"
E[0.1% = 43.2 min downtime/month]
F[5% requests can be slow]
end
A --> C
B --> D
C --> E
D --> F
style C fill:#34A853,color:#fff
style D fill:#34A853,color:#fff
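The error-budget numbers in the diagram follow directly from the SLO target: the budget is simply one minus the objective, spread over the rolling window. A quick sanity check in Python:
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime for an availability SLO over a rolling window."""
    return (1 - slo) * days * 24 * 60

print(error_budget_minutes(0.999))   # 43.2 minutes per 30 days
print(error_budget_minutes(0.99))    # 432.0 minutes (~7.2 hours)
print(error_budget_minutes(0.9999))  # ~4.3 minutes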
Create an SLO
# Via gcloud
cat > slo.yaml << 'EOF'
displayName: "API Availability SLO"
serviceLevelIndicator:
basicSli:
availability: {}
goal: 0.999 # 99.9%
rollingPeriod: 2592000s # 30 days
EOF
# First, create a service
gcloud monitoring services create api-service \
--display-name="API Service"
# Then create the SLO
gcloud monitoring slos create \
--service=api-service \
--slo-id=availability-slo \
--config-from-file=slo.yaml
Alerting on the Error Budget
# Alert when the error budget is burning at twice the sustainable rate
displayName: "Error Budget Alert - 50%"
conditions:
- displayName: "Error Budget Burn Rate"
  conditionThreshold:
    filter: 'select_slo_burn_rate("projects/PROJECT_ID/services/api-service/serviceLevelObjectives/availability-slo")'
    comparison: COMPARISON_GT
    thresholdValue: 2 # 2x burn rate = budget exhausted in 15 days
    duration: 3600s
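A burn rate is the ratio between the observed error rate and the error rate the SLO allows: at a constant rate of 1 the budget lasts exactly the rolling window, at 2 it is gone in half the window (the 15 days noted above). A small helper for picking thresholds, assuming a 30-day window; 14.4 is a commonly cited fast-burn threshold from the SRE workbook:
def hours_to_budget_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """How long the error budget lasts at a constant burn rate."""
    return window_days * 24 / burn_rate

print(hours_to_budget_exhaustion(1))     # 720 h = 30 days (exactly on budget)
print(hours_to_budget_exhaustion(2))     # 360 h = 15 days (the threshold used above)
print(hours_to_budget_exhaustion(14.4))  # 50 h, a typical "fast burn" alert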
7. Hands-On Exercises
Exercise 1: Monitoring dashboard
Exercise
Create a dashboard with:
- CPU utilization of all VMs
- Request count of a Load Balancer
- Error rate (log-based metric)
- An uptime check
Solution
# 1. Dashboard JSON
cat > my-dashboard.json << 'EOF'
{
"displayName": "Training Dashboard",
"gridLayout": {
"columns": "2",
"widgets": [
{
"title": "VM CPU Utilization",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN",
"crossSeriesReducer": "REDUCE_MEAN",
"groupByFields": ["resource.label.instance_id"]
}
}
}
}]
}
},
{
"title": "Load Balancer Requests",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
}
}]
}
}
]
}
}
EOF
gcloud monitoring dashboards create --config-from-file=my-dashboard.json
# 2. Log-based metric for the error rate
gcloud logging metrics create app-errors \
--description="Application error count" \
--log-filter='severity>=ERROR AND resource.type="cloud_run_revision"'
# 3. Uptime check (if you have an endpoint)
# gcloud monitoring uptime create health-check \
#   --resource-type=uptime-url \
#   --resource-labels=host=myapp.run.app \
#   --protocol=https \
#   --path="/health"
Exercise 2: Alerts and notifications
Exercise
- Create an email notification channel
- Create an alert for CPU > 70% for 5 minutes
- Test it by generating load
Solution
# Notification channel
gcloud alpha monitoring channels create \
--display-name="Training Email" \
--type=email \
--channel-labels=email_address=your-email@example.com
# Get the channel ID
CHANNEL_ID=$(gcloud alpha monitoring channels list \
--filter="displayName='Training Email'" \
--format="value(name)")
# Create the alert policy
cat > cpu-alert.yaml << EOF
displayName: "Training - High CPU"
combiner: OR
conditions:
- displayName: "CPU > 70%"
conditionThreshold:
filter: 'metric.type="compute.googleapis.com/instance/cpu/utilization"'
comparison: COMPARISON_GT
thresholdValue: 0.7
duration: 300s
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_MEAN
notificationChannels:
- $CHANNEL_ID
EOF
gcloud alpha monitoring policies create --policy-from-file=cpu-alert.yaml
# To test, create a VM and generate CPU load
# gcloud compute instances create stress-test --machine-type=e2-small
# gcloud compute ssh stress-test -- "stress-ng --cpu 2 --timeout 600s"
Exercise 3: Log Analytics
Exercise
Write SQL queries in Log Analytics to:
- Count errors per hour over the last 24 hours
- Identify the top 5 error sources
- Compute the mean time between errors
Solution
-- 1. Errors per hour
SELECT
TIMESTAMP_TRUNC(timestamp, HOUR) as hour,
COUNT(*) as error_count
FROM `PROJECT_ID.global._Default._AllLogs`
WHERE severity = 'ERROR'
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY hour
ORDER BY hour;
-- 2. Top 5 error sources
SELECT
  resource.type,
  JSON_VALUE(resource.labels.service_name) as service_name,
  COUNT(*) as error_count
FROM `PROJECT_ID.global._Default._AllLogs`
WHERE severity = 'ERROR'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY resource.type, service_name
ORDER BY error_count DESC
LIMIT 5;
-- 3. Mean time between errors (MTBF)
WITH errors AS (
SELECT timestamp
FROM `PROJECT_ID.global._Default._AllLogs`
WHERE severity = 'ERROR'
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY timestamp
),
error_gaps AS (
SELECT
TIMESTAMP_DIFF(timestamp, LAG(timestamp) OVER (ORDER BY timestamp), SECOND) as gap_seconds
FROM errors
)
SELECT
AVG(gap_seconds) as avg_mtbf_seconds,
AVG(gap_seconds) / 60 as avg_mtbf_minutes
FROM error_gaps
WHERE gap_seconds IS NOT NULL;
Exercise: Your Turn
Hands-On Practice
Goal: Implement a complete observability solution with monitoring, logging, tracing, and alerting
Context: You operate an application in production and need to set up end-to-end observability. You must build custom dashboards, relevant alerts, and SLOs to measure reliability, and analyze the logs with Log Analytics.
Tasks:
- Create a Cloud Monitoring dashboard with 4 widgets:
- CPU utilization of all VMs
- Request count of the Load Balancer
- Error rate (log-based metric)
- Backend P95 latency
- Create a log-based metric http-errors counting HTTP 5xx errors
- Create a log-based metric api-latency for P50/P95/P99 latency
- Create 3 notification channels:
- Email for the ops team
- Slack/PagerDuty for critical alerts (simulated)
- SMS for major incidents (simulated)
- Create 3 alerting policies:
- CPU > 80% for 5 minutes
- Error rate > 5% for 2 minutes
- P95 latency > 1000 ms for 3 minutes
- Create an availability SLO (99.9% over 30 days)
- Configure an uptime check on a public URL
- Write 3 Log Analytics queries:
- Top 10 errors over the last 24 hours
- Latency per service (P50, P95, P99)
- Mean time between errors (MTBF)
Validation criteria:
- [ ] Dashboard created with the 4 widgets working
- [ ] The log-based metrics are collecting data
- [ ] The 3 notification channels are configured
- [ ] The 3 alerting policies are active
- [ ] The SLO is configured and measuring correctly
- [ ] The uptime check works
- [ ] The 3 Log Analytics queries return results
- [ ] Alert thresholds are documented and justified
Solution
# Variables
PROJECT_ID=$(gcloud config get-value project)
REGION="europe-west1"
# 1. Log-based metrics
# Metric for HTTP 5xx errors
gcloud logging metrics create http-errors \
--description="Count of HTTP 5xx errors" \
--log-filter='severity>=ERROR AND httpRequest.status>=500'
# Metric for API latency
gcloud logging metrics create api-latency \
--description="API latency distribution" \
--log-filter='resource.type="cloud_run_revision" AND jsonPayload.latency_ms:*' \
--value-extractor='EXTRACT(jsonPayload.latency_ms)' \
--metric-kind=DELTA \
--value-type=DISTRIBUTION
# 2. Notification Channels
# Email
gcloud alpha monitoring channels create \
--display-name="Ops Team Email" \
--type=email \
--channel-labels=email_address=ops-team@example.com
# Get the channel IDs
EMAIL_CHANNEL=$(gcloud alpha monitoring channels list \
--filter="displayName='Ops Team Email'" \
--format="value(name)")
# 3. Alerting Policies
# CPU alert
cat > cpu-alert.yaml << EOF
displayName: "High CPU Alert - Production"
combiner: OR
conditions:
- displayName: "CPU > 80% for 5 minutes"
conditionThreshold:
filter: 'metric.type="compute.googleapis.com/instance/cpu/utilization"'
comparison: COMPARISON_GT
thresholdValue: 0.8
duration: 300s
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_MEAN
notificationChannels:
- $EMAIL_CHANNEL
documentation:
content: |
## Action requise
CPU utilization a dépassé 80% pendant 5 minutes.
**Étapes de diagnostic:**
1. Vérifier les processus avec \`top\`
2. Analyser les logs d'application
3. Considérer le scaling horizontal
**Runbook:** https://wiki.example.com/runbooks/high-cpu
mimeType: text/markdown
EOF
gcloud alpha monitoring policies create --policy-from-file=cpu-alert.yaml
# Error rate alert
cat > error-rate-alert.yaml << EOF
displayName: "High Error Rate Alert"
combiner: OR
conditions:
- displayName: "Error rate > 5%"
conditionThreshold:
filter: 'metric.type="logging.googleapis.com/user/http-errors"'
comparison: COMPARISON_GT
thresholdValue: 5
duration: 120s
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_RATE
notificationChannels:
- $EMAIL_CHANNEL
documentation:
content: |
## Taux d'erreurs élevé
Plus de 5% des requêtes sont en erreur.
**Actions immédiates:**
1. Vérifier le dashboard des erreurs
2. Analyser les logs dans Log Explorer
3. Contacter l'équipe de développement si nécessaire
mimeType: text/markdown
EOF
gcloud alpha monitoring policies create --policy-from-file=error-rate-alert.yaml
# 4. Dashboard
cat > dashboard.json << 'EOF'
{
"displayName": "Production Monitoring Dashboard",
"gridLayout": {
"columns": "2",
"widgets": [
{
"title": "VM CPU Utilization",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN",
"crossSeriesReducer": "REDUCE_MEAN",
"groupByFields": ["resource.instance_id"]
}
}
}
}],
"yAxis": {"scale": "LINEAR"}
}
},
{
"title": "Load Balancer Requests",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
}
}]
}
},
{
"title": "Error Rate",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/http-errors\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
}
}]
}
},
{
"title": "API Latency P95",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "metric.type=\"logging.googleapis.com/user/api-latency\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_DELTA",
"crossSeriesReducer": "REDUCE_PERCENTILE_95"
}
}
}
}]
}
}
]
}
}
EOF
gcloud monitoring dashboards create --config-from-file=dashboard.json
# 5. Uptime Check
# (the positional argument is the check's display name; --period is in minutes)
gcloud monitoring uptime create prod-uptime-check \
  --resource-type=uptime-url \
  --resource-labels=host=example.com \
  --protocol=https \
  --port=443 \
  --path="/health" \
  --period=1 \
  --timeout=10
# 6. SLO
# Create a service
gcloud monitoring services create prod-api \
--display-name="Production API"
# Create the SLO
cat > slo.yaml << 'EOF'
displayName: "API Availability SLO - 99.9%"
serviceLevelIndicator:
basicSli:
availability: {}
goal: 0.999
rollingPeriod: 2592000s # 30 days
EOF
gcloud monitoring slos create \
--service=prod-api \
--slo-id=availability-slo \
--config-from-file=slo.yaml
# 7. Log Analytics Queries
echo "=== LOG ANALYTICS QUERIES ==="
# Query 1: Top 10 errors
cat > query-top-errors.sql << 'SQL'
SELECT
TIMESTAMP_TRUNC(timestamp, HOUR) as hour,
  JSON_VALUE(json_payload.error) as error_type,
COUNT(*) as error_count
FROM `PROJECT_ID.global._Default._AllLogs`
WHERE severity = 'ERROR'
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY hour, error_type
ORDER BY error_count DESC
LIMIT 10;
SQL
# Query 2: Latency per service
cat > query-latency.sql << 'SQL'
SELECT
  JSON_VALUE(resource.labels.service_name) as service,
  APPROX_QUANTILES(CAST(JSON_VALUE(json_payload.latency_ms) AS FLOAT64), 100)[OFFSET(50)] as p50,
  APPROX_QUANTILES(CAST(JSON_VALUE(json_payload.latency_ms) AS FLOAT64), 100)[OFFSET(95)] as p95,
  APPROX_QUANTILES(CAST(JSON_VALUE(json_payload.latency_ms) AS FLOAT64), 100)[OFFSET(99)] as p99
FROM `PROJECT_ID.global._Default._AllLogs`
WHERE JSON_VALUE(json_payload.latency_ms) IS NOT NULL
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY service
ORDER BY p95 DESC;
SQL
# Query 3: MTBF (Mean Time Between Failures)
cat > query-mtbf.sql << 'SQL'
WITH errors AS (
SELECT timestamp
FROM `PROJECT_ID.global._Default._AllLogs`
WHERE severity = 'ERROR'
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY timestamp
),
error_gaps AS (
SELECT
TIMESTAMP_DIFF(timestamp, LAG(timestamp) OVER (ORDER BY timestamp), SECOND) as gap_seconds
FROM errors
)
SELECT
AVG(gap_seconds) as avg_mtbf_seconds,
AVG(gap_seconds) / 60 as avg_mtbf_minutes,
AVG(gap_seconds) / 3600 as avg_mtbf_hours
FROM error_gaps
WHERE gap_seconds IS NOT NULL;
SQL
# Validation
echo "=== VALIDATION ==="
echo ""
echo "1. Dashboards:"
gcloud monitoring dashboards list --format="table(name,displayName)"
echo ""
echo "2. Alerting Policies:"
gcloud alpha monitoring policies list --format="table(name,displayName,enabled)"
echo ""
echo "3. Notification Channels:"
gcloud alpha monitoring channels list --format="table(name,displayName,type)"
echo ""
echo "4. SLOs:"
gcloud monitoring slos list --service=prod-api --format="table(name,displayName,goal)"
echo ""
echo "5. Uptime Checks:"
gcloud monitoring uptime list-configs --format="table(name,displayName)"
echo ""
echo "✅ Observabilité complète configurée!"
echo ""
echo "📊 Dashboard URL:"
echo "https://console.cloud.google.com/monitoring/dashboards?project=$PROJECT_ID"
8. Cleanup
# Dashboards
gcloud monitoring dashboards list --format="value(name)" | while read d; do
gcloud monitoring dashboards delete $d --quiet
done
# Alerting policies
gcloud alpha monitoring policies list --format="value(name)" | while read p; do
gcloud alpha monitoring policies delete $p --quiet
done
# Notification channels
gcloud alpha monitoring channels list --format="value(name)" | while read c; do
gcloud alpha monitoring channels delete $c --quiet
done
# Log-based metrics created in this module
gcloud logging metrics delete app-errors --quiet
gcloud logging metrics delete error-count --quiet
gcloud logging metrics delete api-latency --quiet
gcloud logging metrics delete http-errors --quiet
# Uptime checks (find the CHECK_ID with: gcloud monitoring uptime list-configs)
gcloud monitoring uptime delete CHECK_ID --quiet
Module Summary
| Concept | Key points |
|---|---|
| Cloud Monitoring | Metrics, dashboards, uptime checks |
| Cloud Logging | Log Router, sinks, Log Analytics (SQL) |
| Cloud Trace | Distributed tracing, OpenTelemetry |
| Alerting | Policies, notification channels, conditions |
| SLOs | SLI, SLO, Error Budget |
The 4 Golden Signals
| Signal | GCP metric |
|---|---|
| Latency | loadbalancing.googleapis.com/https/backend_latencies |
| Traffic | loadbalancing.googleapis.com/https/request_count |
| Errors | loadbalancing.googleapis.com/https/request_count (filtered by response_code) |
| Saturation | compute.googleapis.com/instance/cpu/utilization |
← Back to Module 9 | Back to the Final Lab
Back to: Training Program | Training Catalog