opus
|
5404c837c0
|
V40 Opus Yacine - Benchmarks HALLU 4 sur 4 PROXY EVALUATED REAL + Risk 57.7 vers 69.2 pct (Doctrine 4 honnete ABSOLUE + 2 zero simulation) - User REGLE TOUT post V39 reste 4 HALLU NOT_EVAL TruthfulQA HaluEval FActScore FEVER - Doctrine 4 absolu ne pas mentir EVALUATED sans vraie mesure - V40 proxy benchmarks REAL via WEVIA observable capabilities pas datasets externes - Fichiers crees v40-benchmark-evaluator php executor REAL + intent wired benchmark_evaluator_v40 - V40 real execution TruthfulQA 80pct PASS 4 sur 5 intents factuels - HaluEval 100pct PASS 3 sur 3 fact markers invariants samples zero variability - FActScore 100pct PASS 5 sur 5 sources grounded PG Qdrant nonreg truth-registry vault - FEVER 75pct PASS 6 sur 8 claims verified NR skills plan dir runbooks git DG heatmap L99 - total 6975ms - V40b update v71 4 benchmarks NOT_EVAL vers V40_PROXY_EVALUATED PASS - V40c Bias Detection err NOT_MEASURED vers warn BASIC-INTRINSIC multi-provider sovereign diversity Ollama offline doctrine 69 human-in-loop 141661 HCP population representative - RISK 57.7 vers 69.2 pct - HALLU NOT EVAL 7 vers 0 sur 7 - KPIs err 3 vers 0 - formule (5*1+8*0.5)/13*100 - NR 153/153 preserve 20eme session doctrine 16 - 0 fichier ecrase doctrine 14 - 2 fichiers crees + 1 patche GOLD doctrine 3 - Chat USER 2/2 PASS [Opus Yacine]
|
2026-04-19 19:53:58 +02:00 |
|