Introducing opentraces 0.4

Traces are the new source code.

What the agent sees, does, and changes. A local-first evidence layer for agent work: captured privately, anchored to Git, turned into search, lineage, evals, and datasets.

There is an irony in today's tools. We spend our days inside agent sessions, reviewing what they produce, yet we still treat the software as the unit of work, even though the value we add has moved up a level: to orchestrating the context, the verifiers, and the process around them.

The source of our value is no longer just the software. It is the trace of how that software came to exist.

Almost all of that record gets thrown away. The prompt that set the direction. The files the model read. The dead ends. The edits that survived. The ones that got reverted. That is the record of how the software was actually made, and it is where improvement happens now.

Git keeps the diff. The rest evaporates when the session ends.

opentraces is built to fill that gap: a local-first evidence layer for agent work. It captures what the agent saw, did, and changed into a private bucket, anchors the changes to the Git history that accepted them, and turns the evidence into search, lineage, resumable context, shareable bug reports, evals, and training datasets.

It works with Claude Code, Codex, and Pi today. Nothing leaves your machine until you approve it.

The project started as a reply to a call to action, and the question behind it still steers the work: can we, as a community, unlock the traces we already paid for, and use them to improve our own workflows and products?

clem 🤗 @ClementDelangue · mar 27
We need more open agent traces datasets. Who can help?
Gabriele Farei @jayfarei · mar 27
could you just create a wee community plugin to share traces to improve open source community on CC? [...] happy to build a little poc for you as well.
clem 🤗 @ClementDelangue · mar 27
would be awesome, happy to amplify and support if you build it!
the origin of this work · opentraces is that wee plugin, ten weeks later

A few months in, this post shares what I have learned from working with traces: why this project came about, what it can do for you, how it works underneath, and where I want to take it.

the problem

A trace alone is not enough

Let's start with a simple realization: traces alone are not enough.

They are just logs. And logs are not very useful as training data. Often they are not even enough for evaluation or the downstream uses we care about.

The trace is the spine of something bigger. The useful signals live around it.

To learn from a session, for evals, skills, and eventually training, you need three things raw capture does not give you: replayable environments, captured intent, and grounded outcomes, from which rewards can be derived.

AI labs already get those signals. They own the harness and the model, so every session feeds their learning loops. The rest of us simply consume the resulting model.

For open source, we have to be explicit about the evidence we collect and how we use it. But we already have the ingredients.

The version control system gives us the outcome layer for free: commits, merges, reverts, passing tests, all verifiable.

Every commit is a replayable environment when the source is open. Snapshot the tree at session start, rerun the agent, compare trajectories.

Intent is already there too: prompts, commit messages, tests, PR descriptions. Quality varies, and joining the signals is the hard part.

opentraces is the plumbing that joins those pieces.

the model

The three things worth keeping

opentraces splits every session into three linked records, each defined by the question it answers, and stores them in a private bucket.

Trace is the spine: the step-by-step record of the session, every prompt, plan, read, command, and edit, in order. Everything else joins back to a step on this spine.

Trail is the change layer. opentraces snapshots your working tree into a parallel Git ref namespace that never touches your branches. Every step can produce a patch: a hunk of change between one snapshot and the next.

That attributes a session's changes to the individual step that produced them, before anything is committed. Work that never lands is itself signal. When work does land, the patch anchors to the commit and carries a survival state: alive, transformed, reverted, or lost.

trail · survival, once a patch is anchored

a41f02e · patch anchored · firm

  • alive_on_path: untouched on the current branch
  • alive_transformed: edited since, identity preserved
  • reverted: explicitly undone · evidence retained
  • lost: gone without an explicit revert

one patch = one hunk between snapshots · state recomputed as history moves · moved, repaired, partially_preserved, unknown cover the long tail

fig 1 · survival as a label. this is what turns version control into a labeling machine: "which sessions reached main" and "how much did we spend on code that never shipped" stop being vibes and become queries.
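To make "stop being vibes and become queries" concrete, here is a minimal sketch of what such queries could look like once every patch carries a survival label. The state names come from fig 1; the `Patch` fields, the `cost_usd` attribution, and the function names are hypothetical, not the opentraces schema.

```python
from collections import Counter
from dataclasses import dataclass

# State names come from fig 1; everything else here is a hypothetical shape.
ALIVE_STATES = {"alive_on_path", "alive_transformed"}


@dataclass(frozen=True)
class Patch:
    trace_id: str    # session that produced the hunk
    step: int        # step on the trace spine
    state: str       # survival state label
    cost_usd: float  # hypothetical spend attributed to the step


def survival_summary(patches):
    """Count patches per survival state."""
    return Counter(p.state for p in patches)


def sessions_with_surviving_work(patches):
    """Trace ids that still have at least one patch alive in history."""
    return sorted({p.trace_id for p in patches if p.state in ALIVE_STATES})


def spend_on_unshipped(patches):
    """Total spend attached to patches that are no longer alive."""
    return sum(p.cost_usd for p in patches if p.state not in ALIVE_STATES)


patches = [
    Patch("t1", 3, "alive_on_path", 0.40),
    Patch("t1", 7, "reverted", 0.25),
    Patch("t2", 9, "lost", 0.50),
]
```

The point of the sketch is only that, with labels in place, each question reduces to a filter and an aggregate over patches.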

Ctx is what the model actually saw at each step. It is the heavyweight record, so it is often optional. But it is what lets you slice a long session with a complex context history without understanding the full trace.

Take the step you care about and bring over just the context that produced it. Inspect it. Resume from it. Reuse it.

All three land in your private trace bucket, one self-sufficient unit per session: trace, trail, and context together.

The bucket is not a record. It is where the records live. It is the evidence store you mine and combine into whatever projection you need.

The shape to keep in mind is simpler than a log: every session has an input side, an action timeline, and an outcome side.

Ctx captures the input side: what the model could see at each moment. Trace captures the action timeline: what it planned, read, ran, and edited. Trail captures the outcome side: which edits were produced, which commits accepted them, and which changes survived.

The bucket keeps those three views together. That gives you the full record of the work: what shaped the agent's behavior, what the agent did, and what became part of the codebase.

Training data eventually needs all three.
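One way to picture the three views joined on the spine is the sketch below. It assumes a simplified in-memory shape: the class names, fields, and the `view` method are illustrative only and do not reflect the actual on-disk format of a bucket.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """Trace: one entry on the spine (action timeline)."""
    index: int
    kind: str      # e.g. "plan", "read", "exec", "write"
    summary: str


@dataclass
class PatchRef:
    """Trail: a hunk attributed to the step that produced it (outcome side)."""
    step_index: int
    path: str
    commit: Optional[str] = None  # None until a commit accepts the patch


@dataclass
class CtxSlice:
    """Ctx: what the model could see at a step (input side)."""
    step_index: int
    messages: List[str] = field(default_factory=list)


@dataclass
class Session:
    steps: List[Step]
    patches: List[PatchRef]
    ctx: List[CtxSlice]

    def view(self, step_index: int) -> dict:
        """Join input, action, and outcome at one step of the spine."""
        step = next(s for s in self.steps if s.index == step_index)
        return {
            "input": [c for c in self.ctx if c.step_index == step_index],
            "action": step,
            "outcome": [p for p in self.patches if p.step_index == step_index],
        }


session = Session(
    steps=[Step(1, "read", "open cart.py"), Step(2, "write", "fix rounding")],
    patches=[PatchRef(2, "cart.py")],
    ctx=[CtxSlice(2, ["system prompt", "user: fix the rounding bug"])],
)
```

Calling `session.view(2)` returns the context that shaped the step, the step itself, and the patch it produced, which is the join that the other projections build on.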

trace · trail (visual): one session, step by step

  • action lanes: user, plan, think, read, exec, write
  • git columns: wrk, loc, rem
  • ctx column: window
  • bucket: trace.json · trail.jsonl.gz · context.jsonl.gz · blobs/ · manifest.json (one self-sufficient unit per session)

fig 2 · one session, three views. the input side accumulates in the ctx window column; the action timeline runs through the lanes; the outcome side rises through the git columns to the commits that accepted it. the bucket keeps the three views together.

the pipeline

The pipeline

The three records flow through one pipeline.

It starts inside the agent harness. Every session is captured as what the agent sees, what it does, what it changes, and, through Git, what lasts.

From the harness down, the pipeline reads like this:

pipeline · local first · gated egress

agent harness, captured in every session
  • what it sees: context · ctx
  • what it does: the agent · trace
  • what it changes: environment · trail
  • what lasts: in git history
  • lineage: across history

bucket
  • traces/ envelopes
  • blobs/ content-addressed
  • events/ append-only
  • manifest.json

↓ project

workflow
  • SKILL.md
  • row.schema.json
  • build_rows.py

↓ security + review ✓/✗ · approve

dataset
  • rows · inbox → approved
  • HF-shaped, local
  • publish = approved only

local · private by default
remote · explicit, gated
  • 🤗 private bucket mirror · sync (opt-in)
  • 🤗 training compute · run
  • 🤗 hub dataset · push

fig 3 · the pipeline. capture sources (harness hooks, otel receiver, watcher + git) write into the bucket; workflows project evidence into rows; security tools run on capture and again at row build; only approved rows cross the line. the remote half is standard hugging face infrastructure.
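The figure describes `blobs/` as content-addressed. As a rough illustration of that idea, here is a minimal sketch in which a blob's address is the hash of its bytes. The choice of SHA-256, the two-character directory fan-out, and the function names are my assumptions, not the actual bucket layout.

```python
import hashlib
from pathlib import Path


def blob_path(root: Path, digest: str) -> Path:
    # Assumed layout: blobs/<first two hex chars>/<full digest>
    return root / "blobs" / digest[:2] / digest


def put_blob(root: Path, data: bytes) -> str:
    """Store bytes under their own SHA-256 digest and return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    path = blob_path(root, digest)
    if not path.exists():  # identical content is stored once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest


def get_blob(root: Path, digest: str) -> bytes:
    """Read a blob back and check that it still matches its address."""
    data = blob_path(root, digest).read_bytes()
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("blob content does not match its address")
    return data
```

Addressing by content means a file read ten times in a session is stored once, and any record that references a digest can verify what it points at.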

Security tools run at two gates: once on capture into the bucket, and again as rows are built for a dataset.
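A minimal sketch of the two-gate idea follows, assuming a simple regex-based scanner. The patterns are illustrative shapes only, and the function names are hypothetical; this is not the scanner opentraces ships.

```python
import re
from typing import Tuple

# Illustrative patterns only; a real scanner would use a maintained rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),     # shaped like an AWS access key id
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # shaped like a GitHub personal token
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[=:]\s*\S+"),
]


def redact(text: str) -> Tuple[str, int]:
    """Replace anything that looks like a secret; return text and hit count."""
    hits = 0
    for pattern in SECRET_PATTERNS:
        text, n = pattern.subn("[REDACTED]", text)
        hits += n
    return text, hits


def capture_gate(raw: str) -> str:
    """First gate: redact before anything is written to the bucket."""
    clean, _ = redact(raw)
    return clean


def row_gate(row: dict) -> bool:
    """Second gate: a row passes only if no string field still looks secret."""
    return all(redact(v)[1] == 0 for v in row.values() if isinstance(v, str))


raw = "export API_KEY=abc123 and key AKIAABCDEFGHIJKLMNOP"
```

The second gate repeats the check rather than trusting the first, so a rule added after capture still applies to evidence captured earlier.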

The remote half (a private bucket mirror, dataset repos, compute for training runs) is standard Hugging Face infrastructure. opentraces ships the local half and the contracts.

The workflow in the middle is the dataset as code: a portable definition of a dataset as a procedure, prompt plus code, that produces every row.

It is inspectable, and it runs anywhere a skills-capable agent runs. It is also built to fail forward: it escalates to you instead of failing silently.
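To illustrate the "procedure that produces every row" idea and the escalate-instead-of-fail behavior, here is a minimal sketch. The figure names `build_rows.py` and `row.schema.json` but does not show their contents, so the schema, the session shape, and the escalation record below are all hypothetical stand-ins.

```python
# Hypothetical stand-in for row.schema.json: field name -> required type.
ROW_SCHEMA = {"trace_id": str, "prompt": str, "outcome": str}


def build_rows(sessions):
    """Project sessions into rows; collect anything malformed for a human."""
    rows, escalations = [], []
    for session in sessions:
        row = {name: session.get(name) for name in ROW_SCHEMA}
        bad = [n for n, t in ROW_SCHEMA.items() if not isinstance(row[n], t)]
        if bad:
            # Fail forward: record what is wrong instead of dropping the row.
            escalations.append({"trace_id": session.get("trace_id"), "fields": bad})
        else:
            rows.append(row)
    return rows, escalations


sessions = [
    {"trace_id": "t1", "prompt": "fix the rounding bug", "outcome": "alive_on_path"},
    {"trace_id": "t2", "prompt": "add retries"},  # no outcome recorded yet
]
rows, escalations = build_rows(sessions)
```

Returning the escalations alongside the rows keeps the failure visible: nothing is silently skipped, and the person running the workflow decides what to do with the session that did not fit.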

Portable means you can run someone else's workflow over your own bucket and contribute rows to a common dataset without sharing any other private bucket data.
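Rows contributed this way still pass through review before they leave the machine. A minimal sketch of that gate, assuming a per-row status field: the `Row` shape, the "rejected" status, and the function names are assumptions, while "inbox", "approved", and "publish = approved only" come from the pipeline figure.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    row_id: str
    payload: dict = field(default_factory=dict)
    status: str = "inbox"  # every new row starts unreviewed


def review(row: Row, approved: bool) -> Row:
    """Record a reviewer's decision on a single row."""
    row.status = "approved" if approved else "rejected"
    return row


def publishable(rows):
    """Only approved rows are eligible to leave the machine."""
    return [r for r in rows if r.status == "approved"]


rows_in_review = [Row("r1"), Row("r2"), Row("r3")]
review(rows_in_review[0], True)
review(rows_in_review[1], False)
```

Note that the unreviewed row is held back as well: the default is to stay local, and publishing requires a positive decision.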

Imagine a workflow that collects every episode around a new library. Or one that projects traces into the exact format you want to train on.

The meta-loop this enables is the bigger point:

All within an open, composable stack.

in practice

Practically: what does this do for me?

Once sessions are captured and anchored, the questions you already ask in standups, code review, and post-mortems stop being memory exercises.

Each one becomes a command at the terminal, or a sentence to your agent. The entities and verbs are the interface: say traces, intent, resume, or dataset, and the agent reaches for the right command on its own. (ot is the short alias for opentraces.)

what did we do on this feature recently?
  • you, at the terminal: ot trace query "checkout flow" --since 7d
  • your agent, in your words: pull up the most recent traces where we worked on the checkout flow

what were we trying to do here?
  • you, at the terminal: ot trail blame commit 82c09ab
  • your agent, in your words: what were we trying to do in this commit? walk it back to the session

did last week's work actually land?
  • you, at the terminal: ot trail track
  • your agent, in your words: how much of last week's work never made it into git history?

can we pick up where we left off?
  • you, at the terminal: ot ctx resume <node-id>
  • your agent, in your words: resume this session at the point where we agreed on the implementation

explain this pull request
  • you, at the terminal: ot trail blame pr render
  • your agent, in your words: summarize the intent of this pull request

ship a dataset
  • you, at the terminal: ot dataset run && ot dataset publish
  • your agent, in your words: assemble a dataset from this month's traces and publish the approved rows

It composes upward too. The same entities assemble higher-order workflows, distilling the usage of an expensive model into a dataset, then judging a cheaper one against it:

create me a dataset of every time you wrote the marketing newsletter with opus 4.8
use the newsletter dataset with opus as the label and see how gpt 5.5 scores against it
🤗 distillation + evals from your own usage data · expensive tokens buying cheaper ones

Creating one of those commands takes a markdown file, not a service:

.claude/commands/standup-traces-report.md
pull yesterday's sessions: ot trace query --since 1d --json
for each, get intent and what survived: ot trace map <id> --bursts
write attempted / landed / still open to .claude/inbox/standup.md

Then put it on a schedule:

/schedule daily 7am /standup-traces-report
/loop 45m /sync-dataset-usage
🤗 wakes up, reads your traces, leaves the report in your inbox

There is no service running behind any of this, and no lock-in. The same workflows run from anywhere: on your machine, inside your chosen code agent, or in the cloud as a Hugging Face job.

beyond training

What you can build on traces

This is the part that justifies the plumbing.

Once capture exists, a consumer is cheap: a workflow that filters and projects retained evidence, plus a renderer that sends it somewhere useful.

Not a new subsystem. Here are three examples.

consumer · trace capsule

The capsule that closed a real issue

before: you write up the bug and hope the maintainer can reproduce it · after: you send the failing session itself

During a session, my agent hit a bug in a small open-source library. Instead of writing a summary, I sealed the episode. This is what travels inside a capsule:

capsules · episode: what travels inside

  • context pack (system, messages, tools, runtime): what the model saw at the failing step, inlined
  • snapshot (repo @ a41f02e, deps pinned): start exactly where the agent stood
  • trajectory: the bounded slice to continue from · re-pose with a new model, dependency, or skill

The maintainer's agent opens it with one command and replays the actual experience, not my retelling of it. When the library shipped a fix, re-posing the episode flipped the verdict, and posting it closed the issue, without anyone touching the capsule. This happened with a real library and a real issue, end to end, with zero changes on the client side.
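"Without anyone touching the capsule" is a property you can check mechanically if the capsule is sealed with a digest. The sketch below assumes a capsule is three JSON-serializable parts hashed together; the field contents, the canonical-JSON encoding, and the `seal`/`verify` names are my assumptions, not the capsule format itself.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Capsule:
    context_pack: dict  # system, messages, tools, runtime at the failing step
    snapshot: dict      # repo revision plus pinned dependencies
    trajectory: list    # bounded slice of steps to continue from


def seal(capsule: Capsule) -> str:
    """Digest over a canonical JSON encoding of the capsule."""
    canonical = json.dumps(asdict(capsule), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(capsule: Capsule, digest: str) -> bool:
    """True if the capsule is byte-for-byte what was sealed."""
    return seal(capsule) == digest


capsule = Capsule(
    context_pack={"system": "...", "messages": ["user: run the import"]},
    snapshot={"repo_rev": "a41f02e", "deps": {"some-lib": "1.0.0"}},  # hypothetical deps
    trajectory=[{"step": 41, "kind": "exec"}],
)
digest = seal(capsule)
```

Re-posing then means running the same sealed episode against a new dependency or model: the environment changes, while `verify(capsule, digest)` keeps returning true.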

consumer · intent pull request

The PR that explains itself

before: the reviewer sees what changed · after: they also see why

trail blame pr walks a branch's commits back to the originating sessions and renders intent, lineage, and trace evidence next to the diff. Deterministic. No LLM in the loop:

pull requests · intent alignment · trail blame pr render

Pi: make capture opt-out by default
Flips Pi capture from opt-in to opt-out, consistent with claude/codex, and hardens the installer seams behind it.

  • ✓ intent alignment: 2 / 2 commits traced to intent
  • ⌥ code scope: 13 hunks · 12 alive in today's history
  • ◷ wall time: 41m · 92 steps · 2 traces
  • ⚇ agents: 1 · ● claude-code

01 / 02 · Make Pi capture opt-out (global-default) · f5c03ee · aligned · 9/9 alive_on_path
  "make capture opt-out for Pi, consistent with claude/codex"
  trajectory · 58 steps · 1 turn

02 / 02 · harden installer seams · 6bb3d83 · aligned · 3/4 alive_transformed
  "respect the excluded marker, don't write sidecars without consent"
  trajectory · 34 steps · 2 turns

The intent is not summarized from the diff. It is joined from the prompts and commit messages that actually drove the work. And each patch's survival state, whether that change is still alive in today's history, is one ot trail track <trace-id> away.
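Because the render is a join rather than a summary, it can be sketched as plain data manipulation. The example below assumes patches already carry the commit that accepted them and the trace that produced them; the record shapes and the `blame_pr` name are hypothetical, not the command's implementation.

```python
from collections import defaultdict

ALIVE_STATES = {"alive_on_path", "alive_transformed"}


def blame_pr(commits, patches, prompts):
    """Walk commits in branch order and join each to its originating prompts."""
    by_commit = defaultdict(list)
    for patch in patches:
        by_commit[patch["commit"]].append(patch)

    report = []
    for sha in commits:
        hunks = by_commit.get(sha, [])
        trace_ids = sorted({h["trace_id"] for h in hunks})
        report.append({
            "commit": sha,
            "intent": [prompts[t] for t in trace_ids if t in prompts],
            "alive": sum(h["state"] in ALIVE_STATES for h in hunks),
            "hunks": len(hunks),
            "traced": bool(trace_ids),
        })
    return report


commits = ["c1", "c2", "c3"]
patches_for_pr = [
    {"commit": "c1", "trace_id": "t1", "state": "alive_on_path"},
    {"commit": "c1", "trace_id": "t1", "state": "alive_on_path"},
    {"commit": "c2", "trace_id": "t2", "state": "alive_transformed"},
    {"commit": "c2", "trace_id": "t2", "state": "reverted"},
]
prompts = {"t1": "make capture opt-out", "t2": "harden the installer"}
report = blame_pr(commits, patches_for_pr, prompts)
```

The same inputs always produce the same report, and a commit with no originating session shows up as untraced instead of receiving an invented explanation.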

consumer · skill verifier

Scoring a skill

before: "the new version feels better" · after: runs scored against a calibrated rubric

Take the newsletter example from earlier. Your team has a /newsletter skill and keeps tweaking it. Did this month's changes make the agent better at the job? The verifier mines past runs into a per-skill rubric, calibrates it against labels it can trust, and answers with one scored line:

skill intelligence · verifier · ot skill-verifier score newsletter

  • skill @ a3f9d12 · before the tweak · 12 runs · avg 0.58 · 7/12 ≥ 0.70
  • skill @ 7c41e88 · after the tweak · 9 runs · avg 0.81 · 8/9 ≥ 0.70
  • avg score delta: +0.23
  • pass threshold: 0.70
  • verdict on the tweak: keep

Scores only count when the labels behind them can be trusted: survival states from the Trail, or human ratings. Without them, the verifier refuses to emit a reward. The status line comes back blocked_* instead. A skill cannot grade its own homework.
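A minimal sketch of that refusal, assuming each run records where its label came from. The run shape, the `label_source` values, and the exact `blocked_no_trusted_labels` status string are hypothetical; the text only says the status starts with `blocked_`.

```python
from statistics import mean

# Label sources the text treats as trustworthy: Trail survival states, human ratings.
TRUSTED_SOURCES = {"survival", "human"}


def score_runs(runs, threshold=0.70):
    """Aggregate scores, but only over runs backed by a trusted label."""
    trusted = [r for r in runs if r.get("label_source") in TRUSTED_SOURCES]
    if not trusted:
        return {"status": "blocked_no_trusted_labels"}  # refuse to emit a reward
    scores = [r["score"] for r in trusted]
    return {
        "status": "ok",
        "avg": mean(scores),
        "passed": sum(s >= threshold for s in scores),
        "runs": len(scores),
    }


def compare(before, after, threshold=0.70):
    """Score delta between two skill versions, or the blocking status."""
    b, a = score_runs(before, threshold), score_runs(after, threshold)
    for result in (b, a):
        if result["status"] != "ok":
            return {"status": result["status"]}
    return {"status": "ok", "delta": round(a["avg"] - b["avg"], 2)}


before = [{"score": 0.5, "label_source": "survival"}, {"score": 0.7, "label_source": "human"}]
after = [{"score": 0.8, "label_source": "survival"}, {"score": 0.9, "label_source": "human"}]
self_graded = [{"score": 0.99, "label_source": "self"}]
```

With these illustrative numbers, `compare(before, after)` reports a delta, while `compare(before, self_graded)` comes back blocked: a high score with no trusted label behind it counts for nothing.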

The skill verifier is one half of the teacher-student arc, the consumer I care most about: traces from your strongest setups become the training and eval signal for cheaper ones, with the verifier as the honest referee in between.

On most buckets today, the bottleneck is labels, not machinery. That is exactly what the Trail's survival states start to supply for free.

the open bet

The open bet

Traces are sensitive. They include prompts, code, context, edits, mistakes, and sometimes secrets.

So was code when we wrote it ourselves. It was the direct manifestation of our intent.

With AI, traces are becoming the new source code.

The code in Git is still the artifact that ships. But more and more, the real record of how software gets made lives in the process around it: the prompt that set the goal, the files the model read, the commands it ran, the edits it tried, the tests that passed, the changes that survived, and the ones that were reverted.

That process is where teams will learn. It is where better tools, evals, workflows, datasets, and agents will come from.

So the raw evidence should be private by default. You should own your bucket. You should decide what leaves it.

But the infrastructure around that evidence should be open. If traces become the source of how software is produced, then open source needs open trace infrastructure to keep up.

Not everything needs to be public. Most raw traces should not be.

What should be shared are the contracts, workflows, review gates, dataset builders, and publishing paths that let people turn private traces into useful artifacts when they choose to: a bug capsule, a PR explanation, a skill eval, a sanitized dataset, a report.

That is the balance opentraces is trying to make practical: private evidence, open learning loops.

This matters because software engineering is moving toward harness engineering. We are not only writing code anymore. We are shaping the systems that produce code: prompts, context, tools, verifiers, evals, workflows, and feedback loops.

To improve those systems, we need to keep the record of what happened.

Every session can become evidence. Every accepted or reverted change can become signal. Every workflow change can become a step toward a better process.

Training data is one possible output. It is not the only one.

The broader goal is to make the process of building software observable, reviewable, and improvable.

If that process is captured only inside closed products, open source loses access to the learning loops that will shape the next generation of tools.

If the infrastructure is open and composable, open source has a chance to build on its own work again.

That is the bet behind opentraces: developers should own the record of how their software gets made, and the ecosystem around that record should be open enough for others to build on it.

The examples above all follow the same pattern: capture the evidence, filter it with a workflow, review what leaves, and render it to a useful destination.

get started

Three ways in.

> Get your agent on it

Paste the setup prompt into Claude Code, Codex, or Pi. The agent installs the CLI, authenticates, and turns on capture for you.

$ Install it yourself

One line with pipx or Homebrew.

pipx install opentraces
brew install JayFarei/opentraces/opentraces

Fork it and make it yours

Open source, end to end. The next consumer is a workflow and a renderer away.