Files
html/wiki/doctrine-134-fpm-auto-recovery.md
Opus 4297271bd9
Some checks failed
WEVAL NonReg / nonreg (push) Has been cancelled
feat(v184-phase05-fpm-auto-recovery): watchdog + timeouts cap + doctrine 134
Root cause incident 22avr 22h: FPM workers stuck 30+min apres test Playwright Q5 120s
Fixes:
- /usr/local/bin/fpm-watchdog (cron minute, 2 fails => restart)
- cron horaire resolvectl flush-caches (DNS cache overflow)
- exec.conf request_terminate_timeout 120 -> 25
- chat-v2-direct.php: set_time_limit(25) + CURLOPT_TIMEOUT 45->20 + CONNECT 5

Verification:
- 5/5 intents OK (hi, nonreg 153/153, l99 340/340, ethica 166742, intents_pool 639)
- NR 153/153 invariant
- GOLD vault-gold/opus/phase05-20260423-010548/
- Doctrine 134 in vault + wiki (3 copies)
2026-04-23 01:07:58 +02:00

1.2 KiB

Doctrine 134 - FPM Auto-Recovery Obligatoire

Date: 2026-04-23 Context: Post-incident 22 avril 2026 - WEVIA Master down 30+ min suite à test Playwright Q5 saturant FPM workers (120s backend), watchdog absent, recovery manuelle requise.

Regles absolues

  1. Timeout cap PHP: set_time_limit(25) en tete de TOUT endpoint chat/LLM
  2. FPM request_terminate_timeout: max 25s sur tous les pools (www.conf=30, exec.conf=25)
  3. CURL externes LLM: CURLOPT_TIMEOUT max 20, CURLOPT_CONNECTTIMEOUT 5
  4. Watchdog obligatoire: /usr/local/bin/fpm-watchdog + crontab minute, 2 fails consecutifs => restart fpm + resolved + flush DNS
  5. DNS cache flush: crontab horaire resolvectl flush-caches
  6. Health endpoint: /api/health-up.php probe par watchdog (minimal, 1 ligne PHP)

Root cause incident 22avr 2026

  • exec.conf avait request_terminate_timeout=120 (trop laxiste)
  • chat-v2-direct.php CURLOPT_TIMEOUT=45 (pas de cap serre)
  • Aucun watchdog systeme pour relever FPM stuck
  • DNS resolver saturait (cache overflow) empechant les LLM externes

Verification

  • fpm-watchdog tick minute OK
  • 5 intents repondent en <100ms (hi, nonreg, l99, ethica, intents_pool)
  • NR 153/153 invariant apres fix