Building DeepAgent Harnesses for Terminal Bench 2.0 with Harbor

Overview

This repository demonstrates how to evaluate and improve your DeepAgent harness using Harbor and LangSmith.

What is Harbor?

Harbor is an evaluation framework that simplifies running agents on challenging benchmarks. It provides:

  • Sandbox environments (Docker, Modal, Daytona, E2B, etc.)
  • Automatic test execution and verification
  • Reward scoring (0.0 - 1.0 based on test pass rate)
  • Trajectory logging in ATIF format (Agent Trajectory Interchange Format)
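
The reward score is simply the fraction of verification tests that pass. A minimal sketch of that scoring rule (a hypothetical illustration, not Harbor's actual implementation):

```python
def reward(passed: int, total: int) -> float:
    """Hypothetical sketch of Harbor-style reward scoring:
    the fraction of verification tests that pass, clamped to [0.0, 1.0]."""
    if total == 0:
        return 0.0
    return passed / total

print(reward(7, 10))  # 0.7
```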

What is Terminal Bench 2.0?

Terminal Bench 2.0 is an evaluation benchmark that measures how well an agent can operate a computer environment, primarily through the terminal. It includes 90+ tasks spanning domains such as software engineering, biology, security, and gaming.

Example tasks:

  • path-tracing: Reverse-engineer C program from rendered image
  • chess-best-move: Find optimal move using chess engine
  • git-multibranch: Complex git operations with merge conflicts
  • sqlite-with-gcov: Build SQLite with code coverage, analyze reports

The DeepAgent Architecture

The DeepAgent harness ships with design patterns that have proven to be good defaults across agentic tasks:

  1. Detailed System Prompt: Expansive, instructional prompts with tool guidance and examples
  2. Planning Middleware: The write_todos tool helps the agent structure thinking and track progress
  3. Filesystem: Provides ls, read_file, write_file, edit_file, glob, grep for context management
  4. SubAgents: The task tool spawns specialized subagents for isolated work
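
To make the planning pattern concrete, here is a tool in the spirit of write_todos sketched in a few lines of Python (a hypothetical simplification, not the actual middleware):

```python
from dataclasses import dataclass, field

@dataclass
class TodoList:
    """Hypothetical sketch of write_todos-style planning state:
    the agent records its plan up front, then marks steps done as it works."""
    items: dict[str, bool] = field(default_factory=dict)

    def write_todos(self, todos: list[str]) -> None:
        """Register planned steps, preserving order; re-adding is a no-op."""
        for todo in todos:
            self.items.setdefault(todo, False)

    def complete(self, todo: str) -> None:
        self.items[todo] = True

    def remaining(self) -> list[str]:
        return [t for t, done in self.items.items() if not done]

plan = TodoList()
plan.write_todos(["read requirements", "implement fix", "run tests"])
plan.complete("read requirements")
print(plan.remaining())  # ['implement fix', 'run tests']
```

The value of the real middleware is similar: the todo list keeps the plan in the agent's context, so progress survives long tool-use sequences.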

Quick Start

# Install dependencies
uv sync

# Configure API keys - Choose one approach:

# Option 1: Use .env file (recommended for local development)
cp .env.example .env
# Edit .env and add your keys - they'll be automatically loaded

# Option 2: Export directly (useful for CI/CD or quick testing)
export ANTHROPIC_API_KEY="sk-ant-..."  # Required: For Claude model
export LANGSMITH_API_KEY="lsv2_..."    # Required: For tracing
export LANGSMITH_TRACING_V2=true       # Required: Enable LangSmith tracing
export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"  # Optional: Default shown
# export DAYTONA_API_KEY="..."  # Optional: Only if using --env daytona

# Run via Docker (1 task)
uv run harbor run --agent-import-path deepagents_harbor:DeepAgentsWrapper \
  --dataset terminal-bench@2.0 -n 1 --jobs-dir jobs/terminal-bench --env docker

# Run via Daytona (10 tasks)
uv run harbor run --agent-import-path deepagents_harbor:DeepAgentsWrapper \
  --dataset terminal-bench@2.0 -n 10 --jobs-dir jobs/terminal-bench --env daytona

LangSmith Integration

LangSmith provides tracing and observability for agent runs. The workflow:

DeepAgents → Harbor (evaluate) → LangSmith (analyze) → Improve → Repeat

Prerequisites

Ensure your LangSmith credentials are configured (see Quick Start for .env or export options):

# Required environment variables:
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_TRACING_V2=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com  # Optional: defaults to this
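
A quick preflight check that the required variables are set can be sketched as follows (a hypothetical helper, not a script shipped in this repo):

```python
import os

# Variables the Quick Start marks as required (LANGSMITH_ENDPOINT is optional).
REQUIRED = ["ANTHROPIC_API_KEY", "LANGSMITH_API_KEY", "LANGSMITH_TRACING_V2"]

def missing_keys(env) -> list[str]:
    """Return the required variables that are unset or empty."""
    return [k for k in REQUIRED if not env.get(k)]

print(missing_keys(os.environ))  # [] when everything is configured
```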

Step 1: Create Dataset and Experiment

# Create dataset from Harbor tasks
python scripts/harbor_langsmith.py create-dataset terminal-bench --version 2.0

# Create experiment session (outputs session ID and URL)
python scripts/harbor_langsmith.py create-experiment terminal-bench --name deepagents-baseline-v1

Step 2: Run Benchmark with Tracing

# Option 1: For experiments (enables side-by-side comparison in LangSmith)
export LANGSMITH_EXPERIMENT="deepagents-baseline-v1"
make run-terminal-bench-daytona  # Runs 10 tasks on Daytona

# Option 2: For development (simpler project view in LangSmith)
export LANGSMITH_PROJECT="deepagents-development"
make run-terminal-bench-daytona

# Option 3: Run harbor directly (customize -n for number of tasks)
export LANGSMITH_EXPERIMENT="deepagents-baseline-v1"
uv run harbor run \
  --agent-import-path deepagents_harbor:DeepAgentsWrapper \
  --dataset terminal-bench@2.0 -n 10 --jobs-dir jobs/terminal-bench --env daytona

Step 3: Add Feedback Scores

After the benchmark completes, push reward scores to LangSmith for filtering and analysis:

python scripts/harbor_langsmith.py add-feedback jobs/terminal-bench/2025-12-02__16-25-40 \
  --project-name deepagents-baseline-v1

This matches trials to traces and adds harbor_reward feedback (0.0-1.0) from Harbor's test results.
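
Conceptually, the matching step boils down to reading each trial's reward and attaching it to the corresponding trace. A simplified sketch (the result-file layout and field names here are assumptions for illustration, and the LangSmith call is left commented out):

```python
import json
from pathlib import Path

def collect_rewards(jobs_dir: str) -> dict[str, float]:
    """Collect per-trial rewards from a Harbor jobs directory.
    Assumes each trial folder contains a result.json with a 'reward'
    field (hypothetical layout, for illustration only)."""
    rewards = {}
    for result_file in Path(jobs_dir).glob("*/result.json"):
        data = json.loads(result_file.read_text())
        rewards[result_file.parent.name] = float(data["reward"])
    return rewards

# For each trial matched to a trace, the score would then be attached
# as feedback via the LangSmith client, e.g.:
# client.create_feedback(run_id, key="harbor_reward", score=reward)
```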

Analyzing Results

LangSmith captures every LLM call, tool invocation, and performance metric. Combined with Harbor reward scores (added via Step 3), you can filter runs by performance and identify patterns in successful vs. failed runs.

Common Patterns & Fixes

After running evaluations, analyze failed runs in LangSmith to identify improvement opportunities:

| Pattern | Symptom | Potential Fix |
| --- | --- | --- |
| Poor Planning | Agent jumps into coding without reading requirements | Add upfront planning requirement to prompt |
| Incorrect Tool Usage | Uses bash cat instead of read_file | Improve tool descriptions with examples |
| No Incremental Testing | Writes 200 lines, then tests once | Prompt to test after each logical unit |
| Hallucinated Paths | Reads files before checking existence | Add "always ls before read" rule |
| Wrong Model | Model fails on complex reasoning | Use more capable model for hard tasks |

Agent-Assisted Analysis

Use LangSmith's Insights Agent or your own agent to analyze trajectory data across runs. Task it with identifying common failure patterns, grouping errors by category, and suggesting prompt or tool improvements.

Available Environments

Harbor supports multiple sandbox environments. Use the --env flag to select:

  • docker - Local Docker containers (good for testing)
  • daytona - Daytona cloud sandboxes (requires DAYTONA_API_KEY)
  • modal - Modal cloud compute
  • runloop - Runloop sandboxes

Makefile shortcuts are available for common workflows:

  • make run-terminal-bench-docker - Run 1 task locally with Docker
  • make run-terminal-bench-daytona - Run 10 tasks on Daytona
  • make run-terminal-bench-modal - Run 4 tasks on Modal
  • make run-terminal-bench-runloop - Run 10 tasks on Runloop

Resources