Architecture and Agent-Guardrail Systems
CodeLeash is an opinionated full-stack development scaffold that demonstrates how to build web applications with AI coding agents using strong guardrails, Test-Driven Development, and architectural enforcement. The tagline says it all: your coding agent, on a leash.
AI coding agents are powerful but undisciplined. Left unchecked, they skip tests, write sprawling changes, introduce subtle regressions, and produce code that works but nobody can maintain. CodeLeash addresses this with a system of hooks, state machines, and lint rules that constrain the agent’s behavior without limiting its productivity.
The scaffold includes a minimal “hello world” implementation that exercises every architectural pattern — repository, service, container DI, React root mounting with initial data — so you can see how the pieces fit together before building on top of them.
| Layer | Technology |
|---|---|
| Backend | Python, FastAPI, Uvicorn |
| Frontend | React 19, TypeScript, Vite, Tailwind CSS |
| Database | Supabase (PostgreSQL) with RLS |
| Auth | Supabase Auth with JWT tokens |
| Observability | Prometheus metrics, OpenTelemetry, Sentry |
| Testing | pytest, Vitest, Playwright |
| CI/Quality | pre-commit hooks, custom Python lint scripts |
The sections below walk through the render_page() pattern and the initial data bridge from server to React, as well as the job queue's FOR UPDATE SKIP LOCKED claiming, the QueueWorker polling loop, and handler registration. Key files:

| Area | Files |
|---|---|
| Backend entry | main.py, worker.py |
| App core | app/core/container.py, app/core/templates.py, app/core/vite_loader.py |
| Frontend roots | src/roots/util.tsx, src/roots/index.tsx |
| TDD guard | scripts/tdd_common.py, scripts/tdd_pre_edit.py |
| Agent config | .claude/settings.json, CLAUDE.md |
| Setup | init.sh, package.json |
```bash
git clone https://github.com/cadamsdotcom/CodeLeash.git
cd CodeLeash
./init.sh     # Install deps, start Supabase, configure .env
npm run dev   # Vite + FastAPI + worker with hot reload
```

The application runs at http://localhost:8000.
CodeLeash runs Vite and FastAPI as a single application. In development, two servers run concurrently with hot module replacement. In production, Vite builds static assets and FastAPI serves everything.
The npm run dev command starts three processes via concurrently:

```bash
concurrently -n vite,uvicorn,worker \
  vite \
  "uv run python main.py" \
  "uv run python worker.py"
```

In production (npm run build then uv run uvicorn main:app), Vite compiles assets into dist/ and FastAPI serves them directly, using the Vite manifest for cache-busted URLs.
The render_page() Pattern

Every page follows the same flow: a FastAPI route gathers data and passes it to render_page(), which renders a Jinja2 template that mounts a React component.
```python
@router.get("/", response_class=HTMLResponse)
async def index(
    request: Request,
    greeting_service: GreetingService = Depends(get_greeting_service),
) -> HTMLResponse:
    greetings = await greeting_service.get_all()
    initial_data = {
        "greetings": [g.model_dump(mode="json") for g in greetings],
    }
    return render_page(
        request, "src/roots/index.tsx",
        title="CodeLeash", initial_data=initial_data,
    )
```

The route calls a service (injected via Depends()), serializes the result to a dict, and passes it as initial_data.
render_page() JSON-serializes the initial data into the template context:

```python
def render_page(request, component_path, title, initial_data=None, ...):
    initial_data_json = json.dumps(initial_data or {})
    return templates.TemplateResponse(request, "page.html", {
        "component_path": component_path,
        "title": title,
        "initial_data_json": initial_data_json,
    })
```

The page.html template contains the critical bridge:
```html
<div
  id="root"
  data-initial="{{ initial_data_json | escape }}"
  class="{{ root_css_class }}"
></div>
{{ vite_hmr_client(request) }} {{ vite_asset(component_path, request) }}
```

The initial data is embedded as a data-initial attribute on the root div — HTML-escaped JSON that React reads on mount.
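A quick way to convince yourself the bridge is lossless: JSON-encode, HTML-escape (as Jinja2's escape filter does), then reverse both steps. A minimal sketch:

```python
import html
import json

# Round-trip of the data-initial bridge: the server JSON-encodes and
# HTML-escapes the payload; the browser's attribute parsing unescapes it
# before JSON.parse. Tricky payloads survive intact.
data = {"greetings": ['He said "hi" & left', "<script>alert(1)</script>"]}

attr = html.escape(json.dumps(data))      # what lands in data-initial="..."
assert "<script>" not in attr             # markup is neutralized
assert json.loads(html.unescape(attr)) == data  # lossless round-trip
```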
createReactRoot() parses the data-initial attribute and wraps the component in providers:

```tsx
export const createReactRoot = (ComponentClass: React.ComponentType) => {
  const initializeRoot = () => {
    const rootElement = document.getElementById('root');
    if (!rootElement) return; // nothing to mount into
    const initialData = rootElement.dataset.initial;
    const data = initialData ? JSON.parse(initialData) : {};
    createRoot(rootElement).render(
      <React.StrictMode>
        <ErrorBoundary>
          <InitialDataProvider data={data}>
            {React.createElement(ComponentClass)}
          </InitialDataProvider>
        </ErrorBoundary>
      </React.StrictMode>
    );
  };
  // ...
};
```
Each page’s root file is minimal:

```tsx
import Index from '../pages/Index';
import { createReactRoot } from './util';

createReactRoot(Index);
```
Components access the data via a useInitialData() hook
provided by InitialDataProvider.
```
Route handler
  → service.get_all()
  → initial_data dict
  → render_page()
  → json.dumps(initial_data)
  → page.html template
  → data-initial="..." attribute
  → createReactRoot()
  → JSON.parse(dataset.initial)
  → InitialDataProvider
  → useInitialData() hook
  → Component renders
```
The vite_loader.py module handles both development and production modes.

Development (ENVIRONMENT != "production"): vite_hmr_client() builds the Vite dev server URL from the request hostname, so HMR works regardless of how the browser reaches the server:

```python
def get_vite_server_url(request: Request | None = None) -> str:
    # scheme is derived elsewhere in the module; the default guards
    # against requests without a Host header
    host = request.headers.get("host", "localhost") if request else "localhost"
    hostname = host.split(":")[0]
    return f"{scheme}://{hostname}:{VITE_SERVER_PORT}/"
```

Production: vite_asset() reads dist/.vite/manifest.json to resolve cache-busted file paths, CSS dependencies, and module preload hints:
```python
manifest = parse_manifest()
manifest_entry = manifest[path]
# Add CSS, vendor imports, the script itself, and modulepreload tags
tags.append(generate_stylesheet_tag(urljoin(STATIC_PATH, css_path)))
tags.append(generate_script_tag(
    urljoin(STATIC_PATH, manifest_entry["file"]), attrs=scripts_attrs,
))
```

The resulting script tags differ by environment:

```html
<!-- Development: script points at Vite server -->
<script type="module" src="http://localhost:5173/src/roots/index.tsx"></script>

<!-- Production: script points at built asset -->
<script type="module" async defer src="/dist/assets/index-a1b2c3d4.js"></script>
<link rel="stylesheet" href="/dist/assets/index-e5f6g7h8.css" />
```

The npm run types command runs scripts/generate_types.py, which converts Pydantic models to TypeScript interfaces. A pre-commit hook (check-initial-data) verifies these types stay in sync, so the data-initial JSON and TypeScript types never drift apart.
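The conversion idea can be sketched in a few lines. This is a hypothetical simplification: the real scripts/generate_types.py reads Pydantic models and handles far more cases, and a dataclass stands in for the model here.

```python
from dataclasses import dataclass, fields

# Hypothetical sketch of model-to-TypeScript conversion; the real
# generate_types.py is more thorough (nesting, optionals, lists, ...).
TS_TYPES = {int: "number", float: "number", str: "string", bool: "boolean"}

@dataclass
class Greeting:
    id: int
    message: str

def to_ts_interface(cls) -> str:
    body = "\n".join(
        f"  {f.name}: {TS_TYPES.get(f.type, 'unknown')};" for f in fields(cls)
    )
    return f"interface {cls.__name__} {{\n{body}\n}}"

assert "id: number;" in to_ts_interface(Greeting)
assert "message: string;" in to_ts_interface(Greeting)
```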
Vite is configured with three entry points in vite.config.js:

```js
rollupOptions: {
  input: {
    main: './src/main.ts',           // Global CSS and shared code
    app: './src/app.ts',             // Application-wide scripts
    index: './src/roots/index.tsx',  // Page-specific root
  },
},
```

Adding a new page means adding a new root file in src/roots/ and a corresponding entry in the Vite config.
The TDD Guard is a state machine enforced through Claude Code hooks. It ensures agents follow the Red-Green-Refactor cycle by blocking file edits and tracking test outcomes. The guard is implemented entirely in Python scripts that run as hook handlers.
The guard maintains four states:
```
initial ──log Red──→ red_intent ──write test,──→ red ──log Green──→ green_intent ──edit prod files,──→ initial
                                   tests fail                                       tests pass
```
| State | Meaning | Allowed Actions |
|---|---|---|
| initial | No active TDD cycle | Log Red intent only |
| red_intent | Agent declared what test should fail | Edit test files only |
| red | Test ran and failed (as expected) | Log Green intent only |
| green_intent | Agent declared what to change and which files | Edit declared prod files only |

When tests pass after a Green phase, the state returns to initial.
State is derived by scanning the TDD log file bottom-up. The last significant line determines the current state:
```python
def read_state(log_path: Path) -> str:
    """Scan log bottom-up for the last significant line to derive state."""
    lines = log_path.read_text().strip().splitlines()
    for i, line in enumerate(reversed(lines)):
        stripped = line.rstrip()
        if stripped.startswith("[test]") and stripped.endswith("— SUCCEEDED"):
            return "initial"
        if stripped.startswith("[test]") and "— FAILED" in stripped:
            preceding = _find_preceding_intent(lines, len(lines) - 1 - i)
            if preceding == "green":
                return "green_intent"
            return "red"
        if stripped.startswith("## Red"):
            return "red_intent"
        if stripped.startswith("## Green"):
            return "green_intent"
    return "initial"
```

Summary of state derivation rules:

- `[test] ... — SUCCEEDED` → initial
- `[test] ... — FAILED` after a `## Green` header → green_intent (test failed during Green)
- `[test] ... — FAILED` after a `## Red` header → red (test failed as expected)
- `## Red ...` → red_intent
- `## Green ...` → green_intent

tdd_log

Agents interact with the TDD guard through scripts/tdd_log.py,
invoked as:
```bash
# Declare Red intent
uv run python -m scripts.tdd_log --log "tdd-abc123.log" red \
  --test "path/to/test_file" \
  --expects "test_name fails because ..."

# Declare Green intent
uv run python -m scripts.tdd_log --log "tdd-abc123.log" green \
  --change "what you plan to do" \
  --file "path/to/file1.py" --file "path/to/file2.py"

# Skip Red cycle (for refactoring, lint, or coverage)
uv run python -m scripts.tdd_log --log "tdd-abc123.log" green --skip-red \
  --reason=refactoring --change "what you plan to do" \
  --file "path/to/file.py"
```

The green subcommand enforces prerequisites:

- Without --skip-red: state must be red (test must have failed) or green_intent (re-logging)
- With --skip-red: a --reason from {refactoring, lint-only, adding-coverage} is required

Logging a Red or Green intent at any time overrides the current state. This is useful when the agent gets stuck in the wrong state. Overrides are recorded in the log for later review.
The scripts/tdd_pre_edit.py
script runs as a PreToolUse hook on every Edit
or Write tool call. It reads the current state from the TDD
log and decides whether to allow or block the edit.
Every file is classified into one of four categories based on pattern matching:
```python
PROD_PATTERNS = [
    r"^src/",
    r"^app/",
    r"^scripts/.*\.py$",
    r"^main\.py$",
    r"^worker\.py$",
]
```

| Category | Patterns | TDD Enforced |
|---|---|---|
| e2e_test | tests/e2e/ | No (bypass) |
| test | *.test.{ts,tsx,js,jsx}, test_*.py, tests/, conftest.py | Yes |
| prod | src/, app/, scripts/*.py, main.py, worker.py | Yes |
| other | Everything else | No (bypass) |
| State | Test Files | Prod Files |
|---|---|---|
| initial | Blocked | Blocked |
| red_intent | Allowed | Blocked |
| red | Blocked | Blocked |
| green_intent | Blocked* | Allowed (if in allowlist) |

\* Test files are allowed during green_intent only if the Green was logged with --skip-red.
During the Green phase, only files explicitly declared in the
--file arguments are allowed. The hook scans the log
backwards from the last ## Green header, collecting
File: lines to build the allowlist. If the agent tries to
edit a file not in the allowlist, the edit is blocked with a message
showing the declared files.
A warning is emitted if the allowlist exceeds 5 files, encouraging smaller increments.
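The allowlist scan described above can be sketched as follows (an assumed simplification of tdd_pre_edit.py's log parsing):

```python
# Assumed simplification of the Green-phase allowlist scan: walk the log
# backwards to the most recent "## Green" header, collecting File: lines.
def green_allowlist(lines: list[str]) -> list[str]:
    files: list[str] = []
    for line in reversed(lines):
        if line.startswith("File: "):
            files.append(line[len("File: "):].strip())
        if line.startswith("## Green"):
            break
    return list(reversed(files))

log = [
    "## Green — 2026-02-24 10:32:00",
    "Change: Add create() method",
    "File: app/services/greeting.py",
    "File: app/models/greeting.py",
]
assert green_allowlist(log) == [
    "app/services/greeting.py",
    "app/models/greeting.py",
]
```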
The scripts/tdd_post_bash.py
script runs as a PostToolUse (and
PostToolUseFailure) hook on every Bash tool
call. It classifies commands and records outcomes:
| Command Pattern | Tag | Effect on State |
|---|---|---|
| npm run test:e2e* | ignored e2e test | No state change |
| npm test* or npm run test* | test | Drives state transitions |
| Everything else | bash | Logged, no state change |
Test commands tagged as test with SUCCEEDED
status reset the state to initial. Test commands that
FAILED during a Red phase confirm the state as
red.
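The classification in the table can be approximated with a couple of regexes. These patterns are illustrative assumptions, not the real tdd_post_bash.py logic:

```python
import re

# Illustrative command classifier; the real patterns live in
# scripts/tdd_post_bash.py.
def classify(cmd: str) -> str:
    if re.match(r"npm run test:e2e", cmd):
        return "e2e test"   # logged but ignored by the state machine
    if re.match(r"npm (run )?test", cmd):
        return "test"       # drives state transitions
    return "bash"           # logged, no state change

assert classify("npm run test:e2e -- -k smoke") == "e2e test"
assert classify("npm test") == "test"
assert classify("npm run test:python -- -v") == "test"
assert classify("git status") == "bash"
```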
A full Red-Green cycle produces log entries like this:
```
## Red — 2026-02-24 10:30:00
Test: tests/unit/services/test_greeting_service.py
Expects: test_create_greeting fails because create() method doesn't exist yet

[test] npm run test:python -- tests/unit/services/test_greeting_service.py -v — FAILED

## Green — 2026-02-24 10:32:00
Change: Add create() method to GreetingService
File: app/services/greeting.py

[test] npm run test:python -- tests/unit/services/test_greeting_service.py -v — SUCCEEDED
```
The scripts/plan_exit_hook.py script runs as a PreToolUse hook on ExitPlanMode. On the first invocation per session, it shells out to a nested claude -p subprocess with a prompt:

```python
result = subprocess.run(
    ["claude", "-p", prompt],
    capture_output=True, text=True, timeout=60,
)
```

On the second invocation, the hook allows the call through. State is tracked per session ID in a temp file.
The scripts/tdd_session_start.py script runs at SessionStart and outputs the session's --log value. This ensures agents know their log file from the very beginning of a session.
Each Claude Code session gets a unique TDD log file based on an MD5 hash of the transcript path:
```python
def get_log_path(input_data: dict) -> Path:
    transcript = input_data.get("transcript_path", "")
    if transcript:
        key = hashlib.md5(transcript.encode()).hexdigest()[:8]
        return Path(f"tdd-{key}.log")
    return Path("tdd.log")
```

This means multiple agents working in the same repo (e.g., in different worktrees or parallel sessions) each maintain their own TDD state without interference. All tdd-*.log files are gitignored.
CodeLeash has three test levels — unit, integration, and end-to-end —
plus frontend component tests via Vitest. The full suite runs
automatically on every git commit via a pre-commit hook installed by init.sh.
| Level | Directory | Framework | Timeout | What It Tests |
|---|---|---|---|---|
| Unit | tests/unit/ | pytest | 10ms | Pure business logic |
| Integration | tests/integration/ | pytest | None | Service + repository interactions |
| Component | src/**/*.test.tsx | Vitest + Testing Library | None | React component rendering |
| E2E | tests/e2e/ | pytest + Playwright | None | Full application flows |
```bash
# All tests (pre-commit + vitest + pytest + e2e, in parallel)
npm run test:all

# Individual suites
npm run test:python      # Unit + integration (excludes e2e)
npm test                 # Vitest (React components)
npm run test:e2e         # E2E with parallel workers
npm run test:e2e:serial  # E2E in sequential mode

# Specific files
npm run test:python -- tests/unit/services/test_greeting_service.py -k "test_name" -v
npm test -- src/components/GreetingList.test.tsx
npm run test:e2e -- tests/e2e/test_hello_world.py -k "test_name" -v
```

Tests must be run through npm run wrappers — direct uv run pytest and npx vitest are blocked by deny rules in .claude/settings.json.
npm run test:all runs all four suites in parallel:

```json
"test:all": "concurrently --kill-others-on-fail 'npm run pre-commit' 'npm test' 'npm run test:python' 'npm run test:e2e'"
```

Unit tests in tests/unit/ enforce a strict 10ms timeout on test logic execution. This forces tests to be true unit tests focused on business logic, with all I/O mocked.
The timeout is implemented as a pytest hook in tests/conftest.py.
The core timing check profiles each test and raises on timeout:
```python
@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_call(item):
    if "tests/unit/" not in item.fspath.strpath:
        yield
        return
    profiler = cProfile.Profile()
    profiler.enable()
    start_time = time.perf_counter_ns()
    try:
        yield
    finally:
        end_time = time.perf_counter_ns()
        duration_ms = (end_time - start_time) / 1_000_000
        profiler.disable()
        if duration_ms > 10.0:
            # Auto-retry once, then generate flamegraph and raise
            ...
```

Tests that exceed 10ms get one automatic retry. This handles transient performance issues like first-time module imports or JIT compilation. Only after the retry also exceeds 10ms does the test fail.
When a test times out after retry, the profiler data is saved as an SVG flamegraph via flameprof:

```
test_profiles/tests_unit_services_test_greeting_service_TestGetAll_test_returns_greetings_12.3ms.svg
```

Opening this SVG in a browser reveals exactly where the time was spent — typically in @patch decorator import chains or accidental I/O.
A common culprit: @patch decorators trigger imports. @patch("app.module.dependency") loads the entire module chain; use dependency injection instead.

The conftest.py imports commonly-used models at module load time (not inside test functions), so the import cost is paid once and excluded from individual test timing:

```python
from app.models.greeting import Greeting
from app.models.user import User
```

The e2e test runner (scripts/run_e2e_tests.py) is fully automated: it runs setup steps concurrently in a ThreadPoolExecutor and executes the tests with parallel workers (-n auto by default).

Each e2e test run gets its own Supabase instance with unique ports and project ID:
```python
unique_project_id = f"e2e-{timestamp}-{random_id}"
config_replacements = [
    (r"^project_id = .*$", f'project_id = "{unique_project_id}"'),
    (r"^port = 54321$", f'port = {port_mapping["api"]}'),
    (r"^port = 54322$", f'port = {port_mapping["db"]}'),
    (r"^shadow_port = 54320$", f'shadow_port = {port_mapping["db_shadow"]}'),
    ...
]
```

The harness copies supabase/ into a working directory with a patched config.toml, using the unique project_id to ensure fresh Docker volumes.
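Applying the replacement list is a straightforward multiline re.sub pass; a sketch under the assumption that the harness works roughly like this:

```python
import re

# Sketch: apply (pattern, replacement) pairs line-anchored with MULTILINE,
# as the config_replacements list above implies.
def patch_config(text: str, replacements: list[tuple[str, str]]) -> str:
    for pattern, repl in replacements:
        text = re.sub(pattern, repl, text, flags=re.MULTILINE)
    return text

cfg = 'project_id = "codeleash"\nport = 54321\nport = 54322\n'
out = patch_config(cfg, [
    (r"^project_id = .*$", 'project_id = "e2e-1708770000-ab12"'),
    (r"^port = 54321$", "port = 54331"),
    (r"^port = 54322$", "port = 54332"),
])
assert 'project_id = "e2e-1708770000-ab12"' in out
assert "port = 54331" in out and "port = 54332" in out
```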
Docker volumesAfter tests complete, the harness analyzes server logs for unexpected errors:
```python
http_error_pattern = re.compile(r'"\w+\s+[^"]+"\s+(4\d{2}|5\d{2})')
error_log_pattern = re.compile(r"\bERROR\b|\bException\b|\bTraceback\b")

for prefix, line in log_lines:
    if http_error_pattern.search(line):
        # Check against expected-errors list
        ...
```

If unexpected errors are found, the test suite fails even if all pytest assertions passed. This catches server-side issues that client tests might miss.
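A runnable instance of the scan, using the two patterns above (the expected-errors filtering is elided):

```python
import re

http_error_pattern = re.compile(r'"\w+\s+[^"]+"\s+(4\d{2}|5\d{2})')
error_log_pattern = re.compile(r"\bERROR\b|\bException\b|\bTraceback\b")

log_lines = [
    '127.0.0.1 - "GET /api/greetings HTTP/1.1" 200',
    '127.0.0.1 - "POST /api/greetings HTTP/1.1" 500',
    "2026-02-24 ERROR worker crashed",
    "2026-02-24 INFO all good",
]
flagged = [
    line for line in log_lines
    if http_error_pattern.search(line) or error_log_pattern.search(line)
]
assert len(flagged) == 2  # the 500 response and the ERROR line
```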
Setup output (Supabase startup, frontend build, server startup) is
captured in a QuietSetup buffer. If setup succeeds, none of
it is shown. If setup fails, the full captured output is printed for
debugging.
The pytest_report_teststatus hook in conftest.py suppresses the default progress dots for passing tests:

```python
def pytest_report_teststatus(report, config):
    if report.passed and report.when == "call":
        return report.outcome, "", report.outcome.upper()
```

This keeps test output minimal — agents only need exit codes, not visual progress indicators.
| Command | What It Runs | Parallel |
|---|---|---|
| npm run test:all | pre-commit + vitest + pytest + e2e | Yes (concurrently) |
| npm run test:python | pytest (unit + integration) | No |
| npm test | Vitest (component tests) | No |
| npm run test:e2e | E2E with auto workers | Yes (pytest-xdist) |
| npm run test:e2e:serial | E2E sequentially | No |
| npm run pre-commit | Linting, formatting, type checks | No |
CodeLeash configures Claude Code to prevent common agent misbehaviors
through deny rules, hooks, and environment settings. These are defined
in .claude/settings.json
and enforced automatically.
The permissions.deny list blocks commands that agents
should never run directly:
```json
{
  "permissions": {
    "deny": [
      "Bash(pre-commit *)",
      "Bash(uv run pre-commit*)",
      "Bash(npx vitest*)",
      "Bash(uv run pytest*)"
    ]
  }
}
```

| Blocked Command | Why | Correct Alternative |
|---|---|---|
| uv run pytest | Bypasses npm wrapper, may fail with permissions | npm run test:python |
| npx vitest | Bypasses npm wrapper | npm test |
| pre-commit / uv run pre-commit | Bypasses npm wrapper | npm run pre-commit |

The npm run wrappers ensure consistent environment setup and output formatting.
Five PreToolUse hooks on Bash commands
block common mistakes:
The hook uses a regex to detect any test command followed by |, ;, or >:

```bash
if [[ "$cmd" =~ ^(npm run test|npm test).*(\||;|>) ]]; then
  echo "BLOCKED: Test commands must not be piped, chained, or redirected." >&2
  exit 2
fi
```

This forces agents to see complete test output — no filtering, no redirection. Agents that can’t see full output make worse debugging decisions.

```bash
if [[ "$cmd" =~ ^python ]]; then
  echo "BLOCKED: python must be run via uv." >&2
  exit 2
fi
```

All Python execution must go through uv run to ensure the correct virtual environment and dependencies.
Agents sometimes try to syntax-check files before running tests. This is unnecessary since syntax errors surface immediately in test runs.
Wrapping commands in timeout changes the command string,
preventing it from matching against permission allowlist entries and
forcing unnecessary permission prompts.
Commands that modify production Supabase resources
(db push --linked, functions deploy,
secrets set) are blocked. Deployment is the user’s
responsibility.
The permissions.allow list grants pre-approval for specific commands:

```json
{
  "permissions": {
    "allow": ["Bash(uv run python -m scripts.tdd_log:*)"]
  }
}
```

This allows the TDD log commands to run without prompting the user for approval each time.
The init.sh
script installs a git pre-commit hook that runs
npm run test:all on every commit:
```bash
#!/bin/bash
# Pre-commit hook installed by init.sh
set -e
npm run test:all
```

This means every commit runs:

- pre-commit checks (linting, formatting, custom lint scripts)
- Vitest component tests
- pytest unit and integration tests
- E2E tests

If any of these fail, the commit is rejected.
```json
{
  "env": {
    "CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY": "1",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}
```

These disable feedback surveys and non-essential network requests, keeping the agent focused on the task.
Both PostToolUse and PostToolUseFailure
hooks on Bash run tdd_post_bash.py,
which logs every command execution to the TDD log with its outcome. This
provides a complete audit trail and drives state transitions in the TDD
guard.
A Stop hook prompts the agent to record noteworthy learnings in .claude/learnings/ and review its TDD log for inappropriate overrides. The Stop hook prompt:

```
SESSION ENDING -- If you learned anything noteworthy,
create .claude/learnings/{date}-{slug}.md. Include surprises,
key learnings, hook/workflow recommendations. Also review your
TDD log for inappropriate overrides or skip-red usage.
```
Both hooks encourage the agent to reflect on its session, producing structured notes that benefit future sessions.
Test progress dots (.....F..) are suppressed in pytest output via the pytest_report_teststatus hook in tests/conftest.py:

```python
def pytest_report_teststatus(report, config):
    if report.passed and report.when == "call":
        return report.outcome, "", report.outcome.upper()
```

Agents don’t need visual progress — they need structured pass/fail results. This reduces output noise and context window usage.
CodeLeash enforces code quality through custom Python scripts that run as pre-commit hooks. Each script is a focused lint rule implemented with AST walking, regex scanning, or both. This “Python script as lint rule” pattern makes rules easy to write, test, and understand.
```
.pre-commit-config.yaml
  → npm run pre-commit (runs all hooks)
  → npm run test:all (includes pre-commit)
  → git pre-commit hook (runs test:all)
```
Every commit triggers the full chain. A failing check blocks the commit.
Each custom check is registered as a local hook in .pre-commit-config.yaml. Here’s a representative entry:

```yaml
- id: check-brand-colors
  name: Check for non-permitted Tailwind color classes
  entry: uv run python scripts/check_brand_colors.py
  language: system
  files: \.(ts|tsx)$
  pass_filenames: true
```

The pattern: a Python script that reads files, checks a rule, and exits nonzero on violations. No plugin API to learn — just stdin/stdout and exit codes.
Standard tools run first:
| Hook | Purpose |
|---|---|
| black | Python code formatting |
| isort | Python import sorting (black profile) |
| ruff | Python linting with auto-fix |
| prettier | JS/TS/JSON/CSS/MD formatting |
| djlint | HTML template formatting |
| trailing-whitespace | Remove trailing whitespace |
| vulture | Dead Python code detection (min-confidence 80) |
Brand colors (check_brand_colors.py)

Scans TypeScript/TSX files for Tailwind color classes that aren’t from the approved brand palette. The script maintains a set of disallowed standard Tailwind colors and uses fast string matching:
```python
DISALLOWED_COLORS = {
    "amber", "blue", "cyan", "emerald", "fuchsia", "gray",
    "green", "indigo", "lime", "neutral", "orange", "pink",
    "purple", "red", "rose", "sky", "slate", "stone",
    "teal", "violet", "yellow", "zinc",
}
```

This prevents agents from using arbitrary colors like bg-blue-500 when they should use bg-brand-blue.
Unused routes (check_unused_routes.py)

Scans backend route definitions and frontend TypeScript for API calls. Flags backend JSON API routes that have no frontend callers.
The TypeScript scanner uses regex patterns to find all frontend API references:
```python
patterns = [
    r"fetch\s*\(\s*['\"`]([^'\"`]*\/[^'\"`]*)['\"`]",
    r"fetch\s*\(\s*`([^`]*\/[^`]*)`",
    r"href\s*=\s*['\"`]([^'\"`]*\/[^'\"`]*)['\"`]",
    r"action\s*=\s*['\"`]([^'\"`]*\/[^'\"`]*)['\"`]",
    ...
]
```

Routes used by external callers can be whitelisted in find_unused_routes().
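For instance, the first pattern pulls the path out of a fetch() call:

```python
import re

# The first scanner pattern from the list above, applied to a line of
# frontend source.
fetch_pattern = re.compile(r"fetch\s*\(\s*['\"`]([^'\"`]*\/[^'\"`]*)['\"`]")

src = "const res = await fetch('/api/greetings', { method: 'GET' });"
assert fetch_pattern.findall(src) == ["/api/greetings"]
```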
Unused code (check_unused_code.py)

Detects unused functions and methods in Python files. Uses AST walking to find function definitions, then searches for call sites across the codebase. Escape hatch:

```python
# check_unused_code: ignore
```

Add this comment on the function definition to suppress the warning.
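The AST-walking idea, reduced to its core. This is a sketch: the real check also scans other files for call sites and honors the ignore comment.

```python
import ast

# Collect function definitions and call sites in one module, then diff.
source = """
def used():
    pass

def unused():
    pass

used()
"""
tree = ast.parse(source)
defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
called = {
    n.func.id
    for n in ast.walk(tree)
    if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
}
assert defined - called == {"unused"}
```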
Dynamic imports (check_dynamic_imports.py)

Flags Python imports that aren’t at the top of the file. Dynamic imports make dependency graphs unpredictable and slow down test startup. TYPE_CHECKING blocks are allowed.
Soft deletes (check_soft_deletes.py)

Ensures repository code uses soft deletes (setting deleted_at) instead of hard deletes on tables that support soft deletion.
Code quality (check_code_quality.py)

Catches common code quality issues: fixed waits in e2e tests, conditional logic issues, and direct repository client access outside of repository classes.
Obsolete terms (check_obsolete_terms.py)

Scans filenames and file content for terms that have been renamed or deprecated. Prevents stale references from accumulating after renames.
Dashboard metrics (check_dashboard_metrics.py)

Verifies that the Grafana dashboard JSON includes panels for all metrics defined in app/core/metrics.py. Prevents metrics from being added to code without corresponding dashboard visibility.
Two type checkers run as pre-commit hooks:
| Checker | Language | Hook |
|---|---|---|
| TypeScript (tsc --noEmit) | TypeScript | type-check |
| Pyrefly | Python | pyrefly |
The check-initial-data hook runs
scripts/generate_types.py --check to verify that TypeScript
type definitions for initial data match the current Pydantic models. If
they’ve drifted, the hook fails.
Two complementary tools detect dead code:
| Tool | Language | What It Finds |
|---|---|---|
| vulture | Python | Unused variables, functions, imports, classes |
| knip | TypeScript | Unused exports, imports, dependencies, files |
Both are configured to minimize false positives — vulture uses a
whitelist file (.vulture_whitelist.py) and an 80%
confidence threshold.
The import-linter hook (uv run lint-imports) enforces architectural boundaries via contracts in pyproject.toml:

```toml
[[tool.importlinter.contracts]]
name = "Routes should not directly import Supabase"
type = "forbidden"
source_modules = ["app.routes"]
forbidden_modules = ["app.core.supabase", "supabase"]

[[tool.importlinter.contracts]]
name = "Routes should not directly import Repositories"
type = "forbidden"
source_modules = ["app.routes"]
forbidden_modules = ["app.repositories"]

[[tool.importlinter.contracts]]
name = "Services should not directly import Repositories"
type = "forbidden"
source_modules = ["app.services"]
forbidden_modules = ["app.repositories"]
```

This ensures:

- Routes talk to services, never to repositories or Supabase directly
- Services receive repositories via injection rather than importing them
- The DI container (app/core/container.py) is the only place that wires dependencies

CodeLeash includes a background job queue built on PostgreSQL.
Instead of using a separate message broker, jobs are stored in a regular
table and claimed atomically using
FOR UPDATE SKIP LOCKED.
The jobs table is created by a Supabase migration:
```sql
create table if not exists public.jobs (
  id bigserial primary key,
  queue text not null,            -- e.g. 'greeting-jobs'
  payload jsonb not null,
  status text not null default 'pending',

  -- Scheduling
  scheduled_for timestamptz not null default now(),

  -- Retry tracking
  attempts int not null default 0,
  max_attempts int not null default 3,
  last_error text,

  -- Timestamps
  created_at timestamptz not null default now(),
  started_at timestamptz,
  completed_at timestamptz
);
```

Two indexes support efficient polling:

- idx_jobs_pending on scheduled_for where status = 'pending'
- idx_jobs_status_queue on (status, queue)

RLS is enabled with a policy restricting access to the service_role.
The claim_jobs SQL function uses
FOR UPDATE SKIP LOCKED to atomically claim jobs without
conflicts between concurrent workers:
```sql
create or replace function public.claim_jobs(
  p_queues text[] default null,
  p_limit int default 1
) returns table(id bigint, queue text, payload jsonb, attempts int, max_attempts int) as $$
  with claimed as (
    select j.id from public.jobs j
    where j.status = 'pending'
      and j.scheduled_for <= now()
      and (p_queues is null or j.queue = any(p_queues))
    order by j.id
    for update skip locked
    limit p_limit
  )
  update public.jobs set
    status = 'processing',
    started_at = now(),
    attempts = public.jobs.attempts + 1
  from claimed
  where public.jobs.id = claimed.id
  returning public.jobs.id, public.jobs.queue, public.jobs.payload,
            public.jobs.attempts, public.jobs.max_attempts;
$$ language sql;
```

FOR UPDATE SKIP LOCKED means:

- Rows already locked by another transaction are skipped instead of waited on
- Each pending job is claimed by exactly one worker, with no double-processing
- No separate message broker or advisory locking is needed
The JobRepository
wraps the Supabase client with typed methods:
| Method | What It Does |
|---|---|
| enqueue(queue, payload, delay_seconds, max_attempts) | Insert a new job |
| claim(queues, limit) | Call claim_jobs RPC, return Job dataclass list |
| complete(job_id) | Set status to completed, record timestamp |
| fail(job_id, error) | Retry with backoff or mark as permanently failed |
| get_queue_depth(queue) | Count pending jobs (for metrics) |
```python
async def enqueue(self, queue: str, payload: dict, delay_seconds: int = 0,
                  max_attempts: int = 3) -> int:
    scheduled_for = datetime.now(UTC) + timedelta(seconds=delay_seconds)
    response = self.client.table(self.table_name).insert({
        "queue": queue,
        "payload": payload,
        "scheduled_for": scheduled_for.isoformat(),
        "max_attempts": max_attempts,
    }).execute()
```

When a job fails and has remaining attempts, the fail() method schedules a retry:
```python
# Backoff: 30 seconds × attempt number
backoff = timedelta(seconds=30 * attempts)
scheduled_for = datetime.now(UTC) + backoff

update_data = {
    "status": "pending",
    "last_error": error,
    "scheduled_for": scheduled_for.isoformat(),
}
```

When all attempts are exhausted, the job is marked failed with completed_at set.
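The schedule works out to 30s, 60s, 90s, and so on for successive attempts. A worked example:

```python
from datetime import datetime, timedelta, timezone

# Backoff from the snippet above: 30 seconds × attempt number.
def next_attempt(attempts: int, now: datetime) -> datetime:
    return now + timedelta(seconds=30 * attempts)

now = datetime(2026, 2, 24, 10, 0, 0, tzinfo=timezone.utc)
assert next_attempt(1, now) == datetime(2026, 2, 24, 10, 0, 30, tzinfo=timezone.utc)
assert next_attempt(2, now) == datetime(2026, 2, 24, 10, 1, 0, tzinfo=timezone.utc)
```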
Every enqueue, fail, and
complete operation updates a Prometheus gauge for queue
depth. Connection errors are detected and recorded as a separate
metric.
The QueueWorker class runs a polling loop:

```python
class QueueWorker:
    def __init__(self, job_repo, handlers):
        self.job_repo = job_repo
        self.handlers = handlers  # {"queue-name": handler_instance}
        self._running = False
        self._active_tasks: set[asyncio.Task] = set()

    async def run(self, poll_interval=5):
        self._running = True
        queues = list(self.handlers.keys())
        while self._running:
            jobs = await self.job_repo.claim(queues=queues, limit=1)
            for job in jobs:
                task = asyncio.create_task(self._execute_job(job))
                self._active_tasks.add(task)
            await asyncio.sleep(poll_interval)
```

Each job is dispatched to its handler’s handle() method. The worker tracks active tasks and supports graceful shutdown with a configurable timeout.
```python
async def _execute_job(self, job):
    handler = self.handlers.get(job.queue)
    if handler is None:
        await self.job_repo.fail(job.id, f"No handler for queue {job.queue}")
        return

    start_time = time.time()
    try:
        await handler.handle(job)
        await self.job_repo.complete(job.id)
        record_queue_job_processed(queue=job.queue, status="completed")
    except Exception as e:
        await self.job_repo.fail(job.id, str(e))
        record_queue_job_processed(queue=job.queue, status="failed")
    finally:
        duration = time.time() - start_time
        record_queue_job_duration(queue=job.queue, duration=duration)
```

Handlers are wired up in app/core/worker_dependencies.py:
```python
def create_queue_worker() -> QueueWorker:
    container = _get_container()
    greeting_handler = GreetingHandler(
        greeting_repository=container.get_greeting_repository()
    )
    return QueueWorker(
        job_repo=container.get_job_repository(),
        handlers={
            "greeting-jobs": greeting_handler,
        },
    )
```

This follows the same container DI pattern as the web application.
Handlers implement an async handle(job) method. Here’s the GreetingHandler:

```python
class GreetingHandler:
    def __init__(self, greeting_repository: GreetingRepository) -> None:
        self.greeting_repository = greeting_repository

    async def handle(self, job: Job) -> dict[str, Any]:
        greeting_id = job.payload.get("greeting_id", "")
        greeting = await self.greeting_repository.get_by_id(greeting_id)
        return {"status": "processed", "greeting_id": greeting_id}
```

The worker.py entry point uses watchdog to monitor file changes in development:
```python
class WorkerReloadHandler(FileSystemEventHandler):
    def on_modified(self, event):
        filepath = event.src_path  # watchdog supplies the changed path
        if self._should_reload_for_file(filepath):
            self.restart_event.set()
```

The reload handler watches:

- `app/` (recursive)
- `worker.py`

In production (`ENVIRONMENT != "development"`), hot reload is disabled and the worker runs until interrupted.
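The filtering step can be sketched like this. It is an assumption of how `_should_reload_for_file` might work, restated as a standalone function; only the two watched paths come from the description above:

```python
from pathlib import Path

WATCHED_ROOT = Path("app")  # watched recursively, per the reload rules above


def should_reload_for_file(filepath: str) -> bool:
    path = Path(filepath)
    # worker.py itself always triggers a restart.
    if path.name == "worker.py":
        return True
    # Otherwise only Python files under app/ (recursively) count.
    return path.suffix == ".py" and WATCHED_ROOT in path.parents
```

Restricting to `.py` keeps editor temp files and static assets from bouncing the worker; frontend changes are already covered by Vite's own hot reload.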
To add a new job type:

1. Create a handler in `app/workers/handlers/`:

   ```python
   class MyHandler:
       def __init__(self, my_service):
           self.my_service = my_service

       async def handle(self, job):
           await self.my_service.do_work(job.payload)
   ```

2. Register it in `app/core/worker_dependencies.py`:

   ```python
   my_handler = MyHandler(my_service=container.get_my_service())
   return QueueWorker(
       job_repo=container.get_job_repository(),
       handlers={
           "greeting-jobs": greeting_handler,
           "my-jobs": my_handler,  # Add here
       },
   )
   ```

3. Enqueue jobs for the new queue:

   ```python
   await job_repo.enqueue("my-jobs", {"key": "value"})
   ```

Git worktrees let you check out multiple branches of the same repo
simultaneously, each in its own directory. CodeLeash’s init.sh
script automatically configures isolated ports and Supabase instances
for each worktree, so multiple branches can run side by side without
conflicts.
The `init.sh` script compares the current directory to the main repo and calculates a slot number:

```bash
WORKTREE_NAME=$(basename "$PWD")
MAIN_REPO=$(git worktree list | head -1 | awk '{print $1}')

if [ "$PWD" = "$MAIN_REPO" ]; then
  SLOT=0
  PROJECT_ID="codeleash"
else
  # Calculate slot from worktree name
  if [[ "$WORKTREE_NAME" =~ ^[0-9]+$ ]] && [ "$WORKTREE_NAME" -ge 1 ] && [ "$WORKTREE_NAME" -le 99 ]; then
    SLOT=$WORKTREE_NAME
  else
    SLOT=$(echo -n "$WORKTREE_NAME" | cksum | awk '{print ($1 % 99) + 1}')
  fi
fi
```

The main repo is always slot 0. A worktree whose directory name is a number from 1 to 99 uses that number directly; any other name is hashed with `cksum` to a slot in 1-99.

Each slot gets a deterministic set of ports, calculated with simple arithmetic:
```bash
PORT=$((8000 + SLOT))
VITE_PORT=$((5173 + SLOT))
API_PORT=$((54321 + SLOT * 10))
DB_PORT=$((54322 + SLOT * 10))
SHADOW_PORT=$((54320 + SLOT * 10))
POOLER_PORT=$((54329 + SLOT * 10))
STUDIO_PORT=$((54323 + SLOT * 10))
INBUCKET_PORT=$((54324 + SLOT * 10))
```

| Service | Formula | Slot 0 (main) | Slot 1 | Slot 5 |
|---|---|---|---|---|
| FastAPI | 8000 + slot | 8000 | 8001 | 8005 |
| Vite | 5173 + slot | 5173 | 5174 | 5178 |
| Supabase API | 54321 + slot×10 | 54321 | 54331 | 54371 |
| Supabase DB | 54322 + slot×10 | 54322 | 54332 | 54372 |
| DB Shadow | 54320 + slot×10 | 54320 | 54330 | 54370 |
| DB Pooler | 54329 + slot×10 | 54329 | 54339 | 54379 |
| Studio | 54323 + slot×10 | 54323 | 54333 | 54373 |
| Inbucket | 54324 + slot×10 | 54324 | 54334 | 54374 |
| Analytics | 54327 + slot×10 | 54327 | 54337 | 54377 |
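The same arithmetic can be mirrored in Python to predict which ports a given worktree slot will claim. This is a convenience sketch; `ports_for_slot` is not part of the scaffold:

```python
def ports_for_slot(slot: int) -> dict[str, int]:
    """Compute the port set for a worktree slot, mirroring init.sh."""
    return {
        "fastapi": 8000 + slot,        # PORT
        "vite": 5173 + slot,           # VITE_PORT
        "supabase_api": 54321 + slot * 10,
        "supabase_db": 54322 + slot * 10,
        "db_shadow": 54320 + slot * 10,
        "db_pooler": 54329 + slot * 10,
        "studio": 54323 + slot * 10,
        "inbucket": 54324 + slot * 10,
    }
```

The ×10 stride for Supabase keeps each worktree's block of Supabase ports (54320-54329 for slot 0) from colliding with any neighbor's, while FastAPI and Vite only need a stride of 1.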
For worktrees (slot > 0), `init.sh` generates a fresh config and patches the ports with `sed`:

```bash
# Generate fresh config.toml
TEMP_DIR=$(mktemp -d)
(cd "$TEMP_DIR" && supabase init --force) > /dev/null 2>&1
cp "$TEMP_DIR/supabase/config.toml" "$TEMP_CONFIG"

# Patch port numbers
sed -i '' "s/^project_id = .*/project_id = \"$PROJECT_ID\"/" "$TEMP_CONFIG"
sed -i '' "s/^port = 54321$/port = $API_PORT/" "$TEMP_CONFIG"
sed -i '' "s/^port = 54322$/port = $DB_PORT/" "$TEMP_CONFIG"
sed -i '' "s/^shadow_port = 54320$/shadow_port = $SHADOW_PORT/" "$TEMP_CONFIG"
sed -i '' "s/^port = 54329$/port = $POOLER_PORT/" "$TEMP_CONFIG"
sed -i '' "s/^port = 54323$/port = $STUDIO_PORT/" "$TEMP_CONFIG"
```

This ensures each worktree's Supabase instance has its own Docker containers and PostgreSQL data.
Worktrees get their own `.env` with port overrides:

```bash
# Worktree 'feature-xyz' (slot 42) port configuration
PORT=8042
VITE_SERVER_PORT=5215
SUPABASE_URL=http://127.0.0.1:54741
DATABASE_URL=postgresql://postgres:postgres@127.0.0.1:54742/postgres
```

The `.env` file starts as a copy from the main repo, with port-related variables replaced.
```bash
# Create a worktree for a feature branch
git worktree add ../my-feature feature-branch

# Initialize the worktree (installs deps, configures ports, starts Supabase)
cd ../my-feature
./init.sh

# Develop normally --- runs on its own ports
npm run dev   # FastAPI on 8042, Vite on 5215

# Meanwhile, main repo keeps running on default ports
cd ../CodeLeash
npm run dev   # FastAPI on 8000, Vite on 5173
```

Both instances run simultaneously with no port conflicts.
Two caveats:

- Running `init.sh` in a worktree may need to pull Docker images, which can be slow.
- Port patching relies on `sed -i ''` (BSD sed syntax). On Linux, this would need `sed -i` without the empty string argument.

A comprehensive migration testing framework is planned in `tests/migration/FUTURE.md`. The key insight of the design is that migration tests should run against an isolated Supabase instance (like e2e tests), resetting to just before the target migration, inserting test data, applying the migration, and verifying data transformations and schema changes.
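That flow could be orchestrated roughly as below. Every name here (`run_migration_test` and all four injected callables) is hypothetical; the real framework described in `tests/migration/FUTURE.md` would drive an isolated Supabase instance rather than take callables:

```python
from typing import Callable


def run_migration_test(
    reset_to_before: Callable[[str], None],  # reset DB to just before the target migration
    insert_fixtures: Callable[[], None],     # seed pre-migration test data
    apply_migration: Callable[[str], None],  # apply the target migration
    verify: Callable[[], bool],              # check data transformations and schema
    target: str,
) -> bool:
    """Sketch of the planned migration-test flow: reset, seed, migrate, verify."""
    reset_to_before(target)
    insert_fixtures()
    apply_migration(target)
    return verify()
```

Injecting the steps keeps the sequencing testable without a database; a real harness would bind them to Supabase CLI or SQL operations against a throwaway instance.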
CodeLeash is built on a few core beliefs:
AI agents need constraints, not freedom. An unconstrained agent will skip tests, make sweeping changes, and produce code that works in isolation but breaks in context. The TDD guard, file edit restrictions, and test pipe blocking exist because freedom doesn’t scale.
Tests are the specification. The 10ms timeout forces unit tests to be pure business logic. The e2e harness ensures full integration. The pre-commit hook runs everything on every commit. If it isn’t tested, it doesn’t exist.
Lint rules should be code. Instead of configuring complex tool options, CodeLeash writes Python scripts that walk ASTs and scan with regex. A script is easier to write, easier to debug, and easier to explain than a YAML configuration.
The monorepo is the product. Backend, frontend, database migrations, lint rules, test infrastructure, and CI/CD all live together. Changes that cross boundaries are normal, not exceptional.
You don’t have to use CodeLeash as a whole. Individual systems are designed to be understood and adapted:
TDD Guard: The state machine in scripts/tdd_common.py
is about 80 lines. The pre-edit hook is about 250. You could adapt this
for any Claude Code project by adjusting the file classification
patterns.
10ms Timeout: The
pytest_runtest_call hook in tests/conftest.py
is self-contained. Drop it into any pytest project and adjust the
threshold.
Custom Lint Scripts: Each scripts/check_*.py
is independent. Copy the pattern — parse files, check a rule, exit
nonzero on violations — and add it to your .pre-commit-config.yaml.
Worker System: The jobs
table migration, JobRepository,
and QueueWorker
are a complete job queue in about 400 lines total. No external broker
required.
Worktree Port Hashing: The port calculation
logic in init.sh
is about 20 lines. Apply it to any project that needs parallel
development environments.
Your coding agent, on a leash.
Not because agents are bad, but because good constraints produce good code. A TDD guard that forces Red-Green-Refactor is more reliable than a prompt that asks nicely. A 10ms timeout that rejects slow tests is more effective than a style guide that recommends mocking. A pre-commit hook that runs everything is more trustworthy than a CI pipeline that runs later.
The guardrails aren’t overhead — they’re the product.