Set up monitoring, alerting, and observability for the application.
## Steps
1. Analyze the application to determine monitoring needs:
   - Web server: response times, error rates, request volume.
   - Database: query performance, connection pool, replication lag.
   - Queue: message throughput, consumer lag, dead letters.
   - Background jobs: execution time, failure rate, queue depth.
2. Generate monitoring configuration for the detected stack:
   - **Prometheus**: Scrape config, recording rules, alert rules.
   - **Grafana**: Dashboard JSON with key panels.
   - **Datadog**: `datadog.yaml` or agent configuration.
   - **Health endpoint**: `/health` or `/healthz` implementation.
3. Define alerts for critical metrics:
   - Error rate > 1% over 5 minutes.
   - P99 latency > 2 seconds.
   - Disk usage > 80%.
   - Memory usage > 90%.
   - Certificate expiry < 14 days.
4. Add structured logging configuration:
   - JSON log format with timestamp, level, message, trace ID.
   - Log levels: ERROR for failures, WARN for degradation, INFO for operations.
5. Set up distributed tracing if applicable:
   - OpenTelemetry SDK initialization.
   - Trace context propagation headers.
6. Write all configuration files to `monitoring/` or `deploy/monitoring/`.
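
For the Prometheus case in step 2, a scrape configuration might start from a minimal sketch like the one below. The file name `monitoring/prometheus.yml`, the rule file name `alerts.yml`, the 15-second intervals, and the `<host>:<port>` target are all assumptions for illustration. It also assumes the application exposes metrics at `/metrics`.

```yaml
# monitoring/prometheus.yml (hypothetical path and values)
global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often alert and recording rules are evaluated

rule_files:
  - alerts.yml               # assumed name for the alert rules file

scrape_configs:
  - job_name: <app-name>
    metrics_path: /metrics   # assumes the app serves Prometheus metrics here
    static_configs:
      - targets: ["<host>:<port>"]
```

Static targets keep the sketch simple. A stack running on Kubernetes or a cloud provider would more likely use one of Prometheus's service discovery mechanisms instead.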
## Format
```yaml
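groups:
  - name: <app-name>-alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so 0.01 means 1%.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          runbook_url: <runbook-url>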
```
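
The other thresholds listed in step 3 could be written in the same rule-file format. The sketch below makes several assumptions: the application exports a latency histogram named `http_request_duration_seconds`, host metrics come from node_exporter, and certificate expiry is probed by blackbox_exporter. The group name, `for` durations, and severities are illustrative and should be derived from the service's SLO targets.

```yaml
groups:
  - name: <app-name>-resource-alerts   # hypothetical group name
    rules:
      - alert: HighP99Latency
        # Assumes a histogram named http_request_duration_seconds.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2 seconds"
          runbook_url: <runbook-url>
      - alert: DiskUsageHigh
        # node_exporter metrics: used fraction = 1 - available / size.
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80%"
          runbook_url: <runbook-url>
      - alert: MemoryUsageHigh
        # node_exporter metrics: used fraction = 1 - available / total.
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90%"
          runbook_url: <runbook-url>
      - alert: CertificateExpiringSoon
        # blackbox_exporter metric: expiry timestamp of the earliest-expiring cert.
        expr: (probe_ssl_earliest_cert_expiry - time()) < (14 * 24 * 3600)
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expires in under 14 days"
          runbook_url: <runbook-url>
```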
## Rules
- Every production service must have health checks, error rate alerts, and latency monitoring.
- Use percentile-based latency metrics (P50, P95, P99), not averages.
- Set alert thresholds based on SLO targets, not arbitrary values.
- Include runbook links in alert annotations.
- Log at appropriate levels; never log sensitive data (passwords, tokens, PII).
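
One way to follow the percentile rule is to precompute P50, P95, and P99 as Prometheus recording rules, so dashboards and alerts read from the same series. This is a sketch only: the histogram name `http_request_duration_seconds` and the recorded series names are assumptions, not names the application is known to expose.

```yaml
groups:
  - name: <app-name>-latency-percentiles   # hypothetical group name
    rules:
      # Each rule estimates a quantile from histogram buckets over a 5-minute window.
      - record: job:http_request_duration_seconds:p50
        expr: histogram_quantile(0.50, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Quantiles estimated this way are only as precise as the histogram's bucket boundaries, so buckets should bracket the latency thresholds used in alerts.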