feat(v184-phase05-fpm-auto-recovery): watchdog + timeouts cap + doctrine 134

Root cause of the 22 Apr 22:00 incident: FPM workers stuck for 30+ min after the Playwright Q5 test (120s)
Fixes:
- /usr/local/bin/fpm-watchdog (cron, every minute; 2 failures => restart)
- hourly cron job: resolvectl flush-caches (DNS cache overflow)
- exec.conf request_terminate_timeout 120 -> 25
- chat-v2-direct.php: set_time_limit(25) + CURLOPT_TIMEOUT 45->20 + CURLOPT_CONNECTTIMEOUT 5

Verification:
- 5/5 intents OK (hi, nonreg 153/153, l99 340/340, ethica 166742, intents_pool 639)
- NR 153/153 invariant
- GOLD vault-gold/opus/phase05-20260423-010548/
- Doctrine 134 in vault + wiki (3 copies)

2026-04-23 01:07:58 +02:00


Doctrine 134 - Mandatory FPM Auto-Recovery

Date: 2026-04-23
Context: Post-incident, 22 April 2026. WEVIA Master was down for 30+ min after a Playwright Q5 test saturated the FPM workers (120s backend). No watchdog was in place, so recovery had to be done manually.

Absolute rules

PHP timeout cap: set_time_limit(25) at the top of EVERY chat/LLM endpoint
FPM request_terminate_timeout: max 25s on all pools (www.conf=30, exec.conf=25)
External LLM cURL calls: CURLOPT_TIMEOUT max 20, CURLOPT_CONNECTTIMEOUT 5
Mandatory watchdog: /usr/local/bin/fpm-watchdog run every minute from crontab; 2 consecutive failures => restart fpm + resolved + flush DNS
DNS cache flush: hourly crontab entry running resolvectl flush-caches
Health endpoint: /api/health-up.php, probed by the watchdog (minimal, 1 line of PHP)
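
The watchdog rule (probe every minute, restart after 2 consecutive failures) can be sketched as follows. This is a minimal illustration and not the actual /usr/local/bin/fpm-watchdog script, which is not shown in this file. The names STATE_FILE, MAX_FAILS, HEALTH_URL and FPM_UNIT, the probe host 127.0.0.1, the 10s probe timeout and the php-fpm unit name are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the real /usr/local/bin/fpm-watchdog.
# Assumed (hypothetical) names: STATE_FILE, MAX_FAILS, HEALTH_URL, FPM_UNIT.

STATE_FILE="${STATE_FILE:-/tmp/fpm-watchdog.fails}"   # consecutive-failure counter
MAX_FAILS="${MAX_FAILS:-2}"                           # 2 consecutive failures => restart
HEALTH_URL="${HEALTH_URL:-http://127.0.0.1/api/health-up.php}"  # host is an assumption
FPM_UNIT="${FPM_UNIT:-php-fpm}"                       # unit name varies by distro

# watchdog_tick PROBE_STATUS
# Updates the counter from one probe result (0 = healthy) and prints the
# decision for this tick: "ok", "fail N", or "restart".
watchdog_tick() {
  local status="$1" fails=0
  if [ -s "$STATE_FILE" ]; then
    fails="$(cat "$STATE_FILE")"
  fi
  if [ "$status" -eq 0 ]; then
    echo 0 > "$STATE_FILE"      # a healthy probe clears the streak
    echo "ok"
    return 0
  fi
  fails=$((fails + 1))
  if [ "$fails" -ge "$MAX_FAILS" ]; then
    echo 0 > "$STATE_FILE"      # reset so the next tick starts a fresh count
    echo "restart"
  else
    echo "$fails" > "$STATE_FILE"
    echo "fail $fails"
  fi
}

# watchdog_main: one cron tick. Defined here but not invoked.
watchdog_main() {
  local status=0 action
  # Probe the health endpoint; any curl error counts as a failed probe.
  curl -fsS --connect-timeout 5 --max-time 10 -o /dev/null "$HEALTH_URL" || status=$?
  action="$(watchdog_tick "$status")"
  if [ "$action" = "restart" ]; then
    systemctl restart "$FPM_UNIT" systemd-resolved
    resolvectl flush-caches
  fi
}
```

Keeping the counter in a file lets each cron run stay stateless. A crontab line such as `* * * * * /usr/local/bin/fpm-watchdog` would drive the per-minute tick, and `0 * * * * resolvectl flush-caches` would cover the hourly DNS flush, assuming both run as root.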

Root cause of the 22 April 2026 incident

exec.conf had request_terminate_timeout=120 (too lax)
chat-v2-direct.php had CURLOPT_TIMEOUT=45 (no tight cap)
No system watchdog to recover stuck FPM workers
The DNS resolver was saturating (cache overflow), blocking calls to the external LLMs
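
A lax pool timeout like the one above could be caught ahead of time by a small audit of the pool files. The function below is a hypothetical helper, not part of the repository. It assumes the value is written in seconds (a bare number or an `s` suffix) and treats a missing setting or `0` (which disables the timeout in PHP-FPM) as a failure.

```shell
# check_terminate_timeout FILE CAP   (hypothetical helper)
# Prints the configured request_terminate_timeout in seconds, or "unset".
# Returns 0 only if the value is set, non-zero, and at most CAP.
# Simplification: values with m/h/d suffixes are not parsed and read as unset.
check_terminate_timeout() {
  local file="$1" cap="$2" value
  # Take the last uncommented assignment in the file.
  value="$(sed -n 's/^[[:space:]]*request_terminate_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\)s\{0,1\}[[:space:]]*$/\1/p' "$file" | tail -n 1)"
  if [ -z "$value" ]; then
    echo "unset"
    return 1
  fi
  echo "$value"
  [ "$value" -gt 0 ] && [ "$value" -le "$cap" ]
}
```

It might be called as `check_terminate_timeout /etc/php/8.2/fpm/pool.d/exec.conf 25 || echo "exec.conf exceeds cap"`, where the pool path is an assumption that depends on the PHP version and distribution.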

Verification

fpm-watchdog ticks every minute: OK
5 intents respond in <100ms (hi, nonreg, l99, ethica, intents_pool)
NR 153/153 unchanged after the fix
