/evaluate-model - Evaluate ML Model
Evaluate machine learning model performance with comprehensive metrics.
Steps
- Ask the user for the model type: classification, regression, NLP, or generative
- Load the model and test dataset from the specified paths
- Run inference on the entire test dataset and collect predictions
- For classification models, calculate: accuracy, precision, recall, F1-score, AUC-ROC
- For regression models, calculate: MAE, MSE, RMSE, R-squared, MAPE
- For NLP models, calculate: BLEU, ROUGE, perplexity, exact match
- Generate a confusion matrix for classification tasks
- Identify the worst-performing classes or data segments
- Calculate calibration metrics such as expected calibration error (ECE)
- Run performance profiling: inference time per sample, memory usage, throughput
- Check for bias: evaluate performance across demographic subgroups if applicable
- Generate a comprehensive evaluation report with all metrics and visualizations
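The classification step above (accuracy, precision, recall, F1, plus the confusion matrix and macro averaging mentioned in the rules) can be sketched roughly as follows. All names here are illustrative; the command itself collects real predictions from the test set.

```python
def classification_report(y_true, y_pred, labels):
    """Confusion matrix plus per-class precision/recall/F1 and macro F1.

    `labels` is the full list of classes; `y_true`/`y_pred` are
    hypothetical inputs standing in for real test-set predictions.
    """
    # cm[t][p] counts samples whose true label is t and prediction is p
    cm = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1

    per_class = {}
    for c in labels:
        tp = cm[c][c]
        fp = sum(cm[t][c] for t in labels if t != c)
        fn = sum(cm[c][p] for p in labels if p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[c] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(cm[c][c] for c in labels) / len(y_true)
    # macro average: unweighted mean over classes (per the rules below,
    # report micro averages alongside this for multi-class tasks)
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
    return {"confusion": cm, "per_class": per_class,
            "accuracy": accuracy, "macro_f1": macro_f1}
```

The nested-dict confusion matrix doubles as the input for the worst-performing-class check: classes with the lowest per-class F1 are the segments to inspect first.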
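For the regression branch, the five listed metrics can be computed from the raw errors in one pass. A minimal sketch; the function name and the assumption that no true value is exactly zero (needed for MAPE) are ours, not the command's.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, R-squared, and MAPE for a regression model.

    Assumes no element of y_true is zero (MAPE divides by it).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 = 1 - SS_res / SS_tot; undefined when y_true is constant
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot else float("nan")
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n * 100
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "mape": mape}
```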
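The calibration step can be sketched with the standard equal-width binning formulation of expected calibration error: bin predictions by confidence, then average |accuracy − mean confidence| weighted by bin size. The bin count of 10 is a common default we are assuming, not something the command specifies.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    `confidences` are the model's top-class probabilities in [0, 1];
    `correct` are booleans marking whether each prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # a confidence of exactly 1.0 falls into the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece
```

A perfectly calibrated model scores 0; a model that is 90% confident but right only half the time in that bin contributes 0.4 from that bin alone.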
Rules
- Use stratified sampling if the test set is imbalanced
- Report confidence intervals for all metrics when sample size allows
- Include both micro and macro averages for multi-class metrics
- Test on held-out data never seen during training
- Report inference latency percentiles (p50, p95, p99), not just averages
- Check for data leakage between train and test sets
- Include baseline comparisons (random, majority class, previous model version)
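The latency rule above can be implemented with nearest-rank percentiles over the collected per-sample timings. A sketch assuming timings are already gathered in milliseconds; timing the model itself is left to the caller.

```python
import math

def latency_percentiles(samples_ms, percentiles=(50, 95, 99)):
    """Nearest-rank percentiles of per-sample inference latency.

    Averages hide tail behavior; p95/p99 expose it.
    """
    ordered = sorted(samples_ms)
    n = len(ordered)
    out = {}
    for p in percentiles:
        # nearest-rank definition: ceil(p/100 * n), 1-indexed
        rank = max(1, math.ceil(p * n / 100))
        out[f"p{p}"] = ordered[rank - 1]
    return out
```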
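For the confidence-interval rule, one common approach (our choice here, not mandated by the command) is a percentile bootstrap over per-sample scores, e.g. 0/1 correctness for accuracy. The resample count, alpha, and fixed seed below are illustrative defaults.

```python
import random

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-sample scores.

    For accuracy, pass a list of 1.0 (correct) / 0.0 (incorrect).
    The seed is fixed only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lo, hi
```

The same routine covers the baseline-comparison rule: compute intervals for the model, the majority-class baseline, and the previous version, and treat overlapping intervals as a sign the improvement may not be significant.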