Files
Rohit Ghumare c3f43d8b61 Expand toolkit to 135 agents, 120 plugins, 796 total files
- Add 60 new agents across all 10 categories (75 -> 135)
- Add 95 new plugins with command files (25 -> 120)
- Update all agents to use model: opus
- Update README with complete plugin/agent tables
- Update marketplace.json with all 120 plugins
2026-02-04 21:08:28 +00:00

41 lines
2.6 KiB
Markdown

# Monitoring
## Logging Standards
- Use structured logging (JSON format) in all environments except local development.
- Include standard fields in every log entry: `timestamp`, `level`, `service`, `requestId`, `message`.
- Log levels: DEBUG (verbose development info), INFO (business events), WARN (recoverable issues), ERROR (failures requiring attention).
- Log at INFO level: request start/end, auth events, business transactions, job start/completion.
- Log at ERROR level: unhandled exceptions, failed external calls, data integrity issues.
- Never log: passwords, tokens, PII, credit card numbers, or full request bodies with sensitive data.
- Sanitize user input in log messages to prevent log injection attacks.
- Use a correlation ID (requestId) across all services for distributed tracing.
## Metrics
- Track the four golden signals: latency, traffic, errors, saturation.
- Use histograms for latency (not averages). Track p50, p95, p99 percentiles.
- Instrument: HTTP request duration, database query duration, queue depth, cache hit rate.
- Use counter metrics for: requests total, errors total, jobs processed, events emitted.
- Use gauge metrics for: active connections, queue size, memory usage, goroutine count.
- Name metrics with a namespace prefix: `myapp_http_requests_total`, `myapp_db_query_duration_seconds`.
- Export metrics in Prometheus format or push to Datadog/CloudWatch.
## Alerting Rules
- Alert on symptoms (error rate > 1%, latency p99 > 2s) not causes (CPU > 80%).
- Set severity levels: P1 (pages on-call, service down), P2 (Slack alert, degraded), P3 (ticket, non-urgent).
- P1 alerts must have a runbook linked in the alert description.
- Avoid alert fatigue: if an alert fires more than 3 times without action, tune or remove it.
- Use alerting windows: 5-minute sustained for P1, 15-minute sustained for P2.
- Test alerts quarterly by injecting controlled failures.
## Health Checks
- Expose `/health` for load balancer checks (returns 200 if the process is running).
- Expose `/ready` for dependency checks (database, cache, queue connectivity).
- Health check endpoints must respond within 1 second and not perform expensive operations.
- Return structured health: `{ status: "healthy", checks: { db: "ok", redis: "ok", queue: "ok" } }`.
## Dashboards
- Create a service dashboard with: request rate, error rate, latency percentiles, resource usage.
- Create a business dashboard with: signups, active users, transactions, revenue metrics.
- Use consistent time ranges and refresh intervals across dashboards.
- Add annotations for deployments, incidents, and configuration changes.