Files

Elias Posen 8631caa890 Use Port of Context (pctx) for code mode (#6765)

Signed-off-by: Elias Posen <elias@posen.ch>
Signed-off-by: Adrian Cole <adrian@tetrate.io>
Co-authored-by: Adrian Cole <adrian@tetrate.io>

2026-02-03 12:15:49 -05:00

bench-postprocess-scripts

[feat] goosebenchv2 additions for eval post-processing (#2619)

2025-05-21 15:00:13 -04:00

provider-error-proxy

chore(deps): bump aiohttp from 3.13.0 to 3.13.3 in /scripts/provider-error-proxy (#6539)

2026-01-16 12:47:45 -05:00

test-subrecipes-examples

Change Recipes Test Script (#5457)

2025-10-30 16:00:25 -07:00

check-no-native-tls.sh

chore: avoid accidentally using native tls again (#6086)

2025-12-12 11:35:52 +11:00

check-openapi-schema.sh

bump openapi version directly (#5674)

2025-11-11 10:15:42 -05:00

clean-gh-pages.sh

Clean PR preview sites from gh-pages branch history (#6161)

2025-12-18 16:22:57 -05:00

clippy-baseline.sh

feat: codex subscription support (#6600)

2026-01-23 17:11:58 +11:00

clippy-lint.sh

chore: avoid accidentally using native tls again (#6086)

2025-12-12 11:35:52 +11:00

diagnostics-viewer.py

Add diagnostics viewer (#6770)

2026-01-28 10:37:25 -05:00

goose-db-helper.sh

(re)Standardize Session Name Attribute (#5279)

2025-10-24 13:34:08 -04:00

parse-benchmark-results.sh

feat: goose bench framework for functional and regression testing

2025-03-05 21:23:00 -05:00

README.md

Remove deprecated Claude 3.5 models (#4590)

2025-09-10 14:41:02 -05:00

run-benchmarks.sh

Remove deprecated Claude 3.5 models (#4590)

2025-09-10 14:41:02 -05:00

test_compaction.sh

Fix: exclude platform_schedule_tool in CLI (#6442)

2026-01-13 11:35:58 +11:00

test_lead_worker.sh

fix: optimise reading large file content (#3767)

2025-08-06 09:38:52 +10:00

test_mcp.sh

Improve mcp test (#6671)

2026-01-28 13:04:54 -05:00

test_providers.sh

Use Port of Context (pctx) for code mode (#6765)

2026-02-03 12:15:49 -05:00

test_subrecipes.sh

refactor: unify subagent and subrecipe tools into single tool (#5893)

2025-12-13 13:50:20 -05:00

test_web.sh

fix: optimise reading large file content (#3767)

2025-08-06 09:38:52 +10:00

README.md

Goose Benchmark Scripts

This directory contains scripts for running and analyzing Goose benchmarks.

run-benchmarks.sh

This script runs Goose benchmarks across multiple provider:model pairs and analyzes the results.

Prerequisites

Goose CLI must be built or installed
jq command-line tool for JSON processing (optional, but recommended for result analysis)
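
A quick way to check both prerequisites before a run is sketched below. The helper name `have_tool` is hypothetical, and the sketch assumes the Goose CLI binary is called `goose` and is on your PATH. A binary that was built locally but not installed may live elsewhere, so treat a "not found" message as a hint rather than a hard failure.

```shell
# have_tool is a hypothetical helper, not part of the Goose scripts.
# It succeeds when the named command can be found on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Assumption: the installed Goose CLI binary is named "goose".
have_tool goose || echo "goose not found on PATH; build or install it first" >&2

# jq is optional, so only warn when it is missing.
have_tool jq || echo "jq not found; result analysis may be limited" >&2
```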

Usage

./scripts/run-benchmarks.sh [options]

Options

-p, --provider-models: Comma-separated list of provider:model pairs (e.g., 'openai:gpt-4o,anthropic:claude-sonnet-4')
-s, --suites: Comma-separated list of benchmark suites to run (e.g., 'core,small_models')
-o, --output-dir: Directory to store benchmark results (default: './benchmark-results')
-d, --debug: Use debug build instead of release build
-h, --help: Show help message

Examples

# Run with release build (default)
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o,anthropic:claude-sonnet-4' --suites 'core,small_models'

# Run with debug build
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core' --debug

How It Works

The script:

1. Parses the provider:model pairs and benchmark suites
2. Determines whether to use the debug or release binary
3. For each provider:model pair:
   - Sets the GOOSE_PROVIDER and GOOSE_MODEL environment variables
   - Runs the benchmark with the specified suites
   - Analyzes the results for failures
4. Generates a summary of all benchmark runs
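
The per-pair loop described above can be sketched as follows. This is an illustration and not the script's actual code: `for_each_pair` and `show_pair` are hypothetical names, and splitting at the first colon only is an assumption made so that a model name containing a colon stays intact. Only GOOSE_PROVIDER and GOOSE_MODEL come from the description above.

```shell
# Hypothetical sketch; for_each_pair and show_pair are illustrative names,
# not functions taken from run-benchmarks.sh.
# The function body runs in a subshell, so the exports do not leak to the caller.
for_each_pair() (
  rest="$1"; shift
  while [ -n "$rest" ]; do
    pair="${rest%%,*}"          # text before the first comma
    if [ "$pair" = "$rest" ]; then
      rest=""                   # that was the last pair
    else
      rest="${rest#*,}"         # drop the pair and its comma
    fi
    # Assumption: split at the first colon only, so the model may contain ':'.
    GOOSE_PROVIDER="${pair%%:*}"
    GOOSE_MODEL="${pair#*:}"
    export GOOSE_PROVIDER GOOSE_MODEL
    "$@"                        # run the callback with both variables set
  done
)

# Stand-in callback that prints what a benchmark run would see.
show_pair() { echo "provider=$GOOSE_PROVIDER model=$GOOSE_MODEL"; }

for_each_pair 'openai:gpt-4o,anthropic:claude-sonnet-4' show_pair
```

In a real run the callback would be the benchmark invocation itself; the stand-in only makes the variable handling visible.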

Output

The script creates the following files in the output directory:

summary.md: A summary of all benchmark results
{provider}-{model}.json: Raw JSON output from each benchmark run
{provider}-{model}-analysis.txt: Analysis of each benchmark run
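
To make the naming pattern concrete, the sketch below expands the three file names for one provider:model pair. `result_files` is a hypothetical helper, and it assumes the placeholders are substituted literally. How the script treats characters such as `/` in a model name is not stated here.

```shell
# result_files is a hypothetical helper that prints the expected output paths
# for one provider and model, assuming literal placeholder substitution.
result_files() (
  out_dir="$1"
  provider="$2"
  model="$3"
  printf '%s\n' \
    "${out_dir}/summary.md" \
    "${out_dir}/${provider}-${model}.json" \
    "${out_dir}/${provider}-${model}-analysis.txt"
)

result_files ./benchmark-results openai gpt-4o
```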

Exit Codes

0: All benchmarks completed successfully
1: One or more benchmarks failed
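
Because the exit code distinguishes success from failure, the script can gate a CI step. The wrapper below is a minimal sketch. `report_benchmarks` is a hypothetical name, and the wrapper runs whatever command it is given and passes its exit code through.

```shell
# report_benchmarks is a hypothetical wrapper: it runs the given command,
# prints a one-line verdict, and returns the command's exit code.
report_benchmarks() {
  if "$@"; then
    echo "all benchmarks completed successfully"
  else
    status=$?
    echo "one or more benchmarks failed (exit code $status)" >&2
    return "$status"
  fi
}

# Example invocation, left commented so the sketch has no side effects:
# report_benchmarks ./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core'
```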

parse-benchmark-results.sh

This script analyzes a single benchmark JSON result file and identifies any failures.

Usage

./scripts/parse-benchmark-results.sh path/to/benchmark-results.json

Output

The script outputs an analysis of the benchmark results to stdout, including:

Basic information about the benchmark run
Results for each evaluation in each suite
Summary of passed and failed metrics

Exit Codes

0: All metrics passed successfully
1: One or more metrics failed
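
To analyze every result file in a directory, the script can be called in a loop, using its exit code to track failures. The sketch below is an assumption-laden illustration: `analyze_all` and the `PARSER` override are hypothetical, and the default parser path is the one shown under Usage.

```shell
# analyze_all is a hypothetical helper. PARSER is an illustrative override;
# by default it points at the script path shown under Usage.
# The function body runs in a subshell, so "exit" only ends the helper.
analyze_all() (
  dir="$1"
  parser="${PARSER:-./scripts/parse-benchmark-results.sh}"
  failed=0
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue     # no JSON files: the glob stayed literal
    if ! "$parser" "$f"; then
      echo "failed metrics in: $f" >&2
      failed=1
    fi
  done
  exit "$failed"                # 0 if every file passed, 1 otherwise
)

# Example invocation, left commented so the sketch has no side effects:
# analyze_all ./benchmark-results
```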