Root cause of the 22 Apr 22h incident: FPM workers stuck for 30+ min after a Playwright Q5 test (120s).
Fixes:
- /usr/local/bin/fpm-watchdog (cron every minute, 2 fails => restart)
- hourly cron resolvectl flush-caches (DNS cache overflow)
- exec.conf request_terminate_timeout 120 -> 25
- chat-v2-direct.php: set_time_limit(25) + CURLOPT_TIMEOUT 45->20 + CONNECT 5
Verification:
- 5/5 intents OK (hi, nonreg 153/153, l99 340/340, ethica 166742, intents_pool 639)
- NR 153/153 invariant
- GOLD vault-gold/opus/phase05-20260423-010548/
- Doctrine 134 in vault + wiki (3 copies)
Doctrine 134 - Mandatory FPM Auto-Recovery
Date: 2026-04-23
Context: Post-incident, 22 April 2026. WEVIA Master was down for 30+ min after a Playwright Q5 test saturated the FPM workers (120s backend); no watchdog was in place, so recovery had to be done manually.
Absolute rules
- PHP timeout cap: set_time_limit(25) at the top of EVERY chat/LLM endpoint
- FPM request_terminate_timeout: max 25s on all pools (www.conf=30, exec.conf=25)
- External LLM cURL calls: CURLOPT_TIMEOUT max 20, CURLOPT_CONNECTTIMEOUT 5
- Mandatory watchdog: /usr/local/bin/fpm-watchdog + per-minute crontab, 2 consecutive fails => restart fpm + resolved + flush DNS
- DNS cache flush: hourly crontab resolvectl flush-caches
- Health endpoint: /api/health-up.php probed by the watchdog (minimal, 1 line of PHP)
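The watchdog rule above (probe every minute, restart after 2 consecutive failures) could be sketched roughly as follows. This is not the actual /usr/local/bin/fpm-watchdog, whose contents are not shown here. The health URL host, the state file path, the systemd unit names and the 5s probe timeout are all assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a per-minute FPM watchdog. It is NOT the real
# /usr/local/bin/fpm-watchdog; URL host, state file and unit names are assumed.
set -u

HEALTH_URL="${HEALTH_URL:-http://127.0.0.1/api/health-up.php}"  # assumed host
STATE_FILE="${STATE_FILE:-/var/run/fpm-watchdog.fails}"        # assumed path
FPM_UNIT="${FPM_UNIT:-php-fpm}"                                 # assumed unit name
MAX_FAILS=2                                                     # "2 consecutive fails"

# Print the new consecutive-failure count: reset on success, increment on failure.
next_fail_count() {
  local previous="$1" probe_status="$2"
  if [ "$probe_status" -eq 0 ]; then
    echo 0
  else
    echo $((previous + 1))
  fi
}

# Succeed when the failure count has reached the restart threshold.
should_restart() {
  [ "$1" -ge "$MAX_FAILS" ]
}

# Probe the health endpoint; an HTTP error or a timeout counts as a failure.
probe() {
  curl --silent --fail --max-time 5 --output /dev/null "$HEALTH_URL"
}

# One watchdog tick, meant to be called from cron every minute.
tick() {
  local previous=0 status=0 fails
  [ -r "$STATE_FILE" ] && previous="$(cat "$STATE_FILE")"
  # Treat an empty or corrupt state file as "no previous failures".
  case "$previous" in ''|*[!0-9]*) previous=0 ;; esac
  probe || status=$?
  fails="$(next_fail_count "$previous" "$status")"
  if should_restart "$fails"; then
    systemctl restart "$FPM_UNIT" systemd-resolved
    resolvectl flush-caches
    fails=0
  fi
  echo "$fails" > "$STATE_FILE"
}

# Only act when explicitly asked, so the file can be sourced without side effects.
if [ "${1:-}" = "tick" ]; then
  tick
fi
```

Under these assumptions, a per-minute crontab entry would call the script with the tick argument. Keeping the failure counter in a small state file is one simple way to carry the "consecutive" count between cron runs.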
Root cause of the 22 Apr 2026 incident
- exec.conf had request_terminate_timeout=120 (too lax)
- chat-v2-direct.php had CURLOPT_TIMEOUT=45 (no tight cap)
- No system watchdog to recover a stuck FPM
- The DNS resolver was saturated (cache overflow), blocking calls to external LLMs
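Since the first root cause was a pool file that had drifted to a lax timeout, a small audit helper can catch that kind of drift early. The sketch below is illustrative only: the function names are hypothetical, and it assumes the value is written as a plain integer or with an "s" suffix (other units accepted by PHP-FPM are not handled).

```shell
#!/usr/bin/env bash
# Hypothetical audit helper: check request_terminate_timeout in an FPM pool file.
set -u

# Print the configured request_terminate_timeout in seconds, or nothing if unset.
# Assumption: plain integer or "s" suffix only; commented-out lines are ignored
# because they do not start with the directive name.
pool_timeout_seconds() {
  sed -n -E 's/^[[:space:]]*request_terminate_timeout[[:space:]]*=[[:space:]]*([0-9]+)s?[[:space:]]*$/\1/p' "$1" | tail -n 1
}

# Succeed when the pool's timeout is set, non-zero (0 disables it) and within the cap.
pool_within_cap() {
  local file="$1" cap="$2" value
  value="$(pool_timeout_seconds "$file")"
  [ -n "$value" ] && [ "$value" -gt 0 ] && [ "$value" -le "$cap" ]
}
```

Called as pool_within_cap exec.conf 25, it would have failed on the pre-incident value of 120. The cap is passed as an argument so that each pool can be checked against its own limit.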
Verification
- fpm-watchdog ticks every minute: OK
- 5 intents respond in <100ms (hi, nonreg, l99, ethica, intents_pool)
- NR 153/153 invariant after the fix