MLX Server

A high-performance inference server for MLX models, providing an OpenAI-compatible API for running large language models on Apple Silicon.

Features

  • OpenAI-Compatible API - Drop-in replacement for OpenAI API calls
  • Streaming Support - Server-Sent Events for real-time token streaming
  • Tool Calling - Support for function calls (OpenAI-compatible format)
  • Multi-Model Support - Runs both LLMs and VLMs

Requirements

  • macOS 14.0+ (Sonoma or later)
  • Apple Silicon (M1, M2, M3, M4)
  • Xcode 15+ for building
  • At least 8GB of unified memory

Installation

Building from Source

# Clone the repository
git clone <repository-url>
cd mlx-server

# Build in release mode with Xcode
xcodebuild -scheme mlx-server -configuration Release
# The binary and metallib will be in the Xcode derived data build products

# Or build with Swift Package Manager; products land in
# .build/arm64-apple-macosx/release (the path used in Quick Start below)
swift build -c release

Quick Start

# Run with a local MLX model
./.build/arm64-apple-macosx/release/mlx-server \
  --model "/path/to/your/model" \
  --port 8080

Command-Line Options

Option         Default    Description
-m, --model    Required   Path to model directory or HuggingFace model ID
--port         8080       HTTP server port
--ctx-size     4096       Context window size
--api-key      ""         API key for authentication (optional)

API Endpoints

Chat Completions

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
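
Since the API is OpenAI-compatible, the openai Python client can be pointed at the server directly. Below is a minimal sketch; the base_url, the placeholder api_key, and the get_weather tool are illustrative assumptions, not part of this server:

from openai import OpenAI

# base_url points at this server; api_key is a placeholder unless --api-key is set
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Plain chat completion, mirroring the curl call above
resp = client.chat.completions.create(
    model="model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
    ],
    temperature=0.7,
    max_tokens=100,
)
print(resp.choices[0].message.content)

# Tool calling in the OpenAI-compatible format; get_weather is a hypothetical tool
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)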

Streaming Response

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Tell me a story."}],
    "stream": true
  }'
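
The same stream can be consumed from Python. A minimal sketch with the openai client (base_url and api_key are placeholders as above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "Tell me a story."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no content delta (e.g. role or finish markers)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()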

Cancel Active Generation

curl -X POST http://localhost:8080/v1/cancel
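
A sketch of exercising cancellation from Python with the requests library; running the generation on a background thread and the one-second delay are illustrative choices, not part of the server's API:

import threading
import time

import requests

BASE = "http://localhost:8080"

def generate():
    # A long streaming generation that we intend to cancel mid-flight
    body = {
        "model": "model",
        "messages": [{"role": "user", "content": "Write a very long story."}],
        "stream": True,
    }
    with requests.post(f"{BASE}/v1/chat/completions", json=body, stream=True) as r:
        for line in r.iter_lines():
            if line:
                print(line.decode())

t = threading.Thread(target=generate)
t.start()
time.sleep(1.0)                     # let a few tokens stream in
requests.post(f"{BASE}/v1/cancel")  # stop the active generation
t.join()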

List Models

curl http://localhost:8080/v1/models
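
The equivalent call through the openai client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
for m in client.models.list():
    print(m.id)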

Health Check

curl http://localhost:8080/health
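
For scripting it can help to wait until the server is ready before sending requests. A small sketch, assuming /health returns HTTP 200 once the model is loaded:

import time

import requests

def wait_until_healthy(url="http://localhost:8080/health", timeout=120.0):
    """Poll the health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(0.5)
    return False

if wait_until_healthy():
    print("server is ready")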

Project Structure

mlx-server/
├── Sources/
│   └── MLXServer/
│       ├── MLXServerCommand.swift     # CLI entry point
│       ├── ModelRunner.swift          # Core inference engine
│       ├── Server.swift               # HTTP server & API handlers
│       ├── OpenAITypes.swift          # API type definitions
│       └── Logger.swift               # Logging utilities
├── Package.swift                      # Swift package manifest
└── README.md                          # This file

Architecture

Core Components

  1. ModelRunner - Manages model loading and inference (streaming and non-streaming)
  2. MLXHTTPServer - HTTP server with OpenAI-compatible endpoints
  3. ActiveGenerations - Tracks and manages cancellable generation tasks

Notes

  • The server binds to 127.0.0.1 (localhost only) for security
  • Client disconnects automatically cancel the active generation
  • VLMs (vision-language models) are supported via image/video URL inputs; see the sketch below
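
A sketch of a vision request, assuming the server accepts OpenAI-style image_url content parts; the image URL is a placeholder:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            # Placeholder URL; image/video URL inputs per the note above
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)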

Troubleshooting

Model Loading Fails

Ensure the model directory contains:

  • config.json - Model configuration
  • tokenizer.json - Tokenizer vocabulary
  • model.safetensors or model.safetensors.index.json - Model weights
  • Optional: generation_config.json, chat_template.jinja
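
A quick way to sanity-check a model directory before starting the server; this sketch just mirrors the list above, and the path is a placeholder:

from pathlib import Path

def check_model_dir(path):
    """Report required files missing from an MLX model directory."""
    d = Path(path)
    missing = [f for f in ("config.json", "tokenizer.json") if not (d / f).exists()]
    # Weights are either a single file or a sharded index
    if not ((d / "model.safetensors").exists()
            or (d / "model.safetensors.index.json").exists()):
        missing.append("model.safetensors or model.safetensors.index.json")
    return missing

print(check_model_dir("/path/to/your/model") or "all required files present")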

Port Already in Use

Change the port:

--port 8081

Benchmarking

Use the mlx-lm server benchmark script to measure throughput and latency:

python server_benchmark.py --url http://localhost:8080/v1/chat/completions --model model

License

This project is part of Jan, an open-source desktop AI application.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

Resources