- Add 60 new agents across all 10 categories (75 -> 135) - Add 95 new plugins with command files (25 -> 120) - Update all agents to use model: opus - Update README with complete plugin/agent tables - Update marketplace.json with all 120 plugins
41 lines
2.6 KiB
Markdown
41 lines
2.6 KiB
Markdown
# Monitoring
|
|
|
|
## Logging Standards
|
|
- Use structured logging (JSON format) in all environments except local development.
|
|
- Include standard fields in every log entry: `timestamp`, `level`, `service`, `requestId`, `message`.
|
|
- Log levels: DEBUG (verbose development info), INFO (business events), WARN (recoverable issues), ERROR (failures requiring attention).
|
|
- Log at INFO level: request start/end, auth events, business transactions, job start/completion.
|
|
- Log at ERROR level: unhandled exceptions, failed external calls, data integrity issues.
|
|
- Never log: passwords, tokens, PII, credit card numbers, or full request bodies with sensitive data.
|
|
- Sanitize user input in log messages to prevent log injection attacks.
|
|
- Use a correlation ID (requestId) across all services for distributed tracing.
|
|
|
|
## Metrics
|
|
- Track the four golden signals: latency, traffic, errors, saturation.
|
|
- Use histograms for latency (not averages). Track p50, p95, p99 percentiles.
|
|
- Instrument: HTTP request duration, database query duration, queue depth, cache hit rate.
|
|
- Use counter metrics for: requests total, errors total, jobs processed, events emitted.
|
|
- Use gauge metrics for: active connections, queue size, memory usage, goroutine count.
|
|
- Name metrics with a namespace prefix: `myapp_http_requests_total`, `myapp_db_query_duration_seconds`.
|
|
- Export metrics in Prometheus format or push to Datadog/CloudWatch.
|
|
|
|
## Alerting Rules
|
|
- Alert on symptoms (error rate > 1%, latency p99 > 2s) not causes (CPU > 80%).
|
|
- Set severity levels: P1 (pages on-call, service down), P2 (Slack alert, degraded), P3 (ticket, non-urgent).
|
|
- P1 alerts must have a runbook linked in the alert description.
|
|
- Avoid alert fatigue: if an alert fires more than 3 times without action, tune or remove it.
|
|
- Use alerting windows: 5-minute sustained for P1, 15-minute sustained for P2.
|
|
- Test alerts quarterly by injecting controlled failures.
|
|
|
|
## Health Checks
|
|
- Expose `/health` for load balancer checks (returns 200 if the process is running).
|
|
- Expose `/ready` for dependency checks (database, cache, queue connectivity).
|
|
- Health check endpoints must respond within 1 second and not perform expensive operations.
|
|
- Return structured health: `{ status: "healthy", checks: { db: "ok", redis: "ok", queue: "ok" } }`.
|
|
|
|
## Dashboards
|
|
- Create a service dashboard with: request rate, error rate, latency percentiles, resource usage.
|
|
- Create a business dashboard with: signups, active users, transactions, revenue metrics.
|
|
- Use consistent time ranges and refresh intervals across dashboards.
|
|
- Add annotations for deployments, incidents, and configuration changes.
|