What the agent sees, does, and changes. A local-first evidence layer for agent work: captured privately, anchored to Git, turned into search, lineage, evals, and datasets.
There is an irony in today's tools. We spend our days inside agent sessions, reviewing what they produce, yet we still treat the software as the unit of work. Our real contribution has moved up a level, to orchestrating the context, the verifiers, and the process inside those sessions.
The source of our value is no longer just the software. It is the trace of how that software came to exist.
Almost all of that record gets thrown away. The prompt that set the direction. The files the model read. The dead ends. The edits that survived. The ones that got reverted. That is the record of how the software was actually made, and it is where improvement happens now.
Git keeps the diff. The rest evaporates when the session ends.
opentraces is built to fill that gap: a local-first evidence layer for agent work. It captures what the agent saw, did, and changed into a private bucket, anchors the changes to the Git history that accepted them, and turns the evidence into search, lineage, resumable context, shareable bug reports, evals, and training datasets.
It works with Claude Code, Codex, and Pi today. Nothing leaves your machine until you approve it.
The project started as a reply to a call to action, and the question behind it still steers the work: can we, as a community, unlock the traces we already paid for, and use them to improve our own workflows and products?
A few months in, this post shares what I have learned from working with traces: why this project came about, what it can do for you, how it works underneath, and where I want to take it.
Let's start with a simple realization: traces alone are not enough.
They are just logs. And logs are not very useful as training data. Often they are not even enough for evaluation or the downstream uses we care about.
The trace is the spine of something bigger. The useful signals live around it.
To learn from a session, for evals, skills, and eventually training, you need three things raw capture does not give you: replayable environments, captured intent, and grounded outcomes, from which rewards can be derived.
AI labs already get those signals. They own the harness and the model, so every session feeds their learning loops. The rest of us simply consume the resulting model.
For open source, we have to be explicit about the evidence we collect and how we use it. But we already have the ingredients.
The version control system gives us the outcome layer for free: commits, merges, reverts, passing tests, all verifiable.
Every commit is a replayable environment when the source is open. Snapshot the tree at session start, rerun the agent, compare trajectories.
Intent is already there too: prompts, commit messages, tests, PR descriptions. Quality varies, and joining the signals is the hard part.
opentraces is the plumbing that joins those pieces.
opentraces splits every session into three linked records, each defined by the question it answers, and stores them in a private bucket.
Trace is the spine: the step-by-step record of the session, every prompt, plan, read, command, and edit, in order. Everything else joins back to a step on this spine.
Trail is the change layer. opentraces snapshots your working tree into a parallel Git ref namespace that never touches your branches. Every step can produce a patch: a hunk of change between one snapshot and the next.
That attributes a session's changes to the individual step that produced them, before anything is committed. Work that never lands is itself signal. When work does land, the patch anchors to the commit and carries a survival state: alive, transformed, reverted, or lost.
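To make the Trail idea concrete, here is a minimal sketch of per-step patch attribution. It is not the opentraces implementation: it assumes snapshots are plain dicts of path to file text rather than Git refs, and the names Patch, Survival, and patches_for_step are hypothetical. Only the four survival states come from the text above.

```python
import difflib
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Survival(Enum):
    # The four states named in the text.
    ALIVE = "alive"
    TRANSFORMED = "transformed"
    REVERTED = "reverted"
    LOST = "lost"


@dataclass
class Patch:
    step_id: int                         # the trace step that produced the change
    path: str
    diff: str                            # unified diff between two snapshots
    commit: Optional[str] = None         # set only once the work lands
    survival: Optional[Survival] = None  # unknown until anchored to a commit


def patches_for_step(step_id, before, after):
    """Diff two snapshots (path -> text) and return one Patch per changed file."""
    patches = []
    for path in sorted(set(before) | set(after)):
        old = before.get(path, "")
        new = after.get(path, "")
        if old == new:
            continue
        diff = "".join(difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        ))
        patches.append(Patch(step_id, path, diff))
    return patches
```

Leaving commit and survival empty by default reflects the point that work which never lands is still recorded: the patch exists before any commit does.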
Ctx is what the model actually saw at each step. It is the heaviest of the three records, so capturing it is often optional. But it is what lets you slice a long session with a complex context history without having to understand the full trace.
Take the step you care about and bring over just the context that produced it. Inspect it. Resume from it. Reuse it.
All three land in your private trace bucket, one self-sufficient unit per session: trace, trail, and context together.
The bucket is not a record. It is where the records live. It is the evidence store you mine and combine into whatever projection you need.
The shape to keep in mind is simpler than a log: every session has an input side, an action timeline, and an outcome side.
Ctx captures the input side: what the model could see at each moment. Trace captures the action timeline: what it planned, read, ran, and edited. Trail captures the outcome side: which edits were produced, which commits accepted them, and which changes survived.
The bucket keeps those three views together. That gives you the full record of the work: what shaped the agent's behavior, what the agent did, and what became part of the codebase.
Training data eventually needs all three.
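One way to picture the input, action, and outcome sides is as a join keyed on the trace step. The sketch below is illustrative only: the record shapes (plain dicts with a "step" key) and the function name join_records are assumptions, not the opentraces schema.

```python
from collections import defaultdict


def join_records(trace_steps, ctx_views, trail_patches):
    """Group Ctx and Trail records under the Trace step they belong to."""
    ctx_by_step = {c["step"]: c for c in ctx_views}
    patches_by_step = defaultdict(list)
    for patch in trail_patches:
        patches_by_step[patch["step"]].append(patch)

    joined = []
    for step in sorted(trace_steps, key=lambda s: s["step"]):
        n = step["step"]
        joined.append({
            "step": n,
            "input": ctx_by_step.get(n),            # None when Ctx was not captured
            "action": step,                         # what the agent did
            "outcome": patches_by_step.get(n, []),  # what the step changed
        })
    return joined
```

Because everything hangs off the step number, a consumer can pull one step and get its context and patches without reading the rest of the session.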
The three records flow through one pipeline.
It starts inside the agent harness. Every session is captured as what the agent sees, what it does, what it changes, and, through Git, what lasts.
From the harness down, the pipeline reads like this:
Security tools run at two gates: once on capture into the bucket, and again as rows are built for a dataset.
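As a rough illustration of the two-gate idea, the sketch below runs the same redaction pass once at capture and again when a dataset row is built. The text does not say which security tools opentraces uses, so the patterns and the names redact, capture, and build_row are placeholders for illustration only.

```python
import re

# Illustrative patterns only; a real scanner would use a far broader rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"(?i)\b(?:api[_-]?key|token|secret)\b\s*[:=]\s*\S+"),
]


def redact(text):
    """Replace anything matching a secret pattern; return (clean_text, hit_count)."""
    hits = 0
    for pattern in SECRET_PATTERNS:
        text, n = pattern.subn("[REDACTED]", text)
        hits += n
    return text, hits


def capture(step_text, bucket):
    """Gate 1: scrub on the way into the bucket."""
    clean, _ = redact(step_text)
    bucket.append(clean)


def build_row(bucket_entry):
    """Gate 2: scrub again as the dataset row is built."""
    clean, hits = redact(bucket_entry)
    return {"text": clean, "late_redactions": hits}
```

The second gate matters because rules improve over time: an entry captured under older rules still gets checked before it can leave the bucket.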
The remote half (a private bucket mirror, dataset repos, compute for training runs) is standard Hugging Face infrastructure. opentraces ships the local half and the contracts.
The workflow in the middle is the dataset as code: a portable definition of a dataset as a procedure, prompt plus code, that produces every row.
It is inspectable, and it runs anywhere a skills-capable agent runs. It is also built to fail forward: it escalates to you instead of failing silently.
Portable means you can run someone else's workflow over your own bucket and contribute rows to a common dataset without sharing any other private bucket data.
Imagine a workflow that collects every episode around a new library. Or one that projects traces into the exact format you want to train on.
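Stripped to its core, a dataset-as-code workflow is a selection rule plus a projection, run locally over your own sessions. The sketch below is a simplified stand-in, assuming sessions are dicts; Workflow, RunResult, and run are hypothetical names, and the real workflows described here are prompt plus code rather than two Python callables.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Workflow:
    name: str
    select: Callable[[dict], bool]   # which sessions qualify
    project: Callable[[dict], dict]  # how a session becomes a row


@dataclass
class RunResult:
    rows: list = field(default_factory=list)
    needs_review: list = field(default_factory=list)


def run(workflow, sessions):
    """Apply a workflow to local sessions; only projected rows are returned."""
    result = RunResult()
    for session in sessions:
        try:
            if workflow.select(session):
                result.rows.append(workflow.project(session))
        except Exception as exc:
            # Fail forward: surface the problem for a human, never drop it silently.
            result.needs_review.append((session.get("id"), repr(exc)))
    return result
```

Only the projected rows come out the other side, which is the sense in which someone else's workflow can run over your bucket without the rest of the bucket going anywhere.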
The meta-loop this enables is the bigger point:
All within an open, composable stack.
Once sessions are captured and anchored, the questions you already ask in standups, code review, and post-mortems stop being memory exercises.
Each one becomes a command at the terminal, or a sentence to your agent. The entities and verbs are the interface: say traces, intent, resume, or dataset, and the agent reaches for the right command on its own. (ot is the short alias for opentraces.)
It composes upward too. The same entities assemble higher-order workflows, distilling the usage of an expensive model into a dataset, then judging a cheaper one against it:
Creating one of those commands takes a markdown file, not a service:
Then put it on a schedule:
There is no service running behind any of this, and no lock-in. The same workflows run from anywhere: on your machine, inside your chosen coding agent, or in the cloud as a Hugging Face job.
This is the part that justifies the plumbing.
Once capture exists, a consumer is cheap: a workflow that filters and projects retained evidence, plus a renderer that sends it somewhere useful.
Not a new subsystem. Here are three examples.
before: you write up the bug and hope the maintainer can reproduce it · after: you send the failing session itself
During a session, my agent hit a bug in a small open-source library. Instead of writing a summary, I sealed the episode. This is what travels inside a capsule:
The maintainer's agent opens it with one command and replays the actual experience, not my retelling of it. When the library shipped a fix, re-posing the episode flipped the verdict, and posting it closed the issue, without anyone touching the capsule. This happened with a real library and a real issue, end to end, with zero changes on the client side.
before: the reviewer sees what changed · after: they also see why
trail blame pr walks a branch's commits back to the originating sessions and renders intent, lineage, and trace evidence next to the diff. Deterministic. No LLM in the loop:
The intent is not summarized from the diff. It is joined from the prompts and commit messages that actually drove the work. And each patch's survival state, whether that change is still alive in today's history, is one ot trail track <trace-id> away.
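The "joined, not summarized" point can be shown with a plain lookup. This is a sketch under assumptions, not the actual command's logic: the anchor records (patch to commit to session) and the function name blame_pr are hypothetical, and the real output format is not reproduced here.

```python
def blame_pr(branch_commits, anchors, sessions):
    """For each commit on a branch, look up the sessions and prompts behind it.

    anchors: list of {"commit": sha, "session": session_id} (assumed shape)
    sessions: dict of session_id -> {"prompt": str} (assumed shape)
    """
    report = []
    for sha in branch_commits:
        session_ids = sorted({a["session"] for a in anchors if a["commit"] == sha})
        report.append({
            "commit": sha,
            "sessions": session_ids,
            # Intent is read straight from the recorded prompts, not generated.
            "intent": [sessions[sid]["prompt"] for sid in session_ids],
        })
    return report
```

The same inputs always produce the same report, which is what makes the result deterministic.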
before: "the new version feels better" · after: runs scored against a calibrated rubric
Take the newsletter example from earlier. Your team has a /newsletter skill and keeps tweaking it. Did this month's changes make the agent better at the job? The verifier mines past runs into a per-skill rubric, calibrates it against labels it can trust, and answers with one scored line:
Scores only count when the labels behind them can be trusted: survival states from the Trail, or human ratings. Without them, the verifier refuses to emit a reward. The status line comes back blocked_* instead. A skill cannot grade its own homework.
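The refusal rule is simple to sketch. The version below assumes each run carries a numeric label and a label_source field; the function name, the field names, and the status suffix no_trusted_labels are all hypothetical, since the text only says the status starts with blocked_. Rubric mining and calibration are left out entirely.

```python
# Only these label sources are treated as trustworthy, per the text:
# survival states from the Trail, or human ratings.
TRUSTED_SOURCES = {"survival", "human"}


def score_skill(runs, min_labels=1):
    """Average trusted labels into a reward, or refuse when there are too few."""
    trusted = [r for r in runs if r.get("label_source") in TRUSTED_SOURCES]
    if len(trusted) < min_labels:
        # Hypothetical status suffix; the real blocked_* values are not given.
        return {"status": "blocked_no_trusted_labels", "reward": None}
    reward = sum(r["label"] for r in trusted) / len(trusted)
    return {"status": "ok", "reward": reward}
```

Self-reported labels are simply ignored, so a skill that only ever rates itself never gets a reward out of this function.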
The skill verifier is one half of the teacher-student arc, the consumer I care most about: traces from your strongest setups become the training and eval signal for cheaper ones, with the verifier as the honest referee in between.
On most buckets today, the bottleneck is labels, not machinery. That is exactly what the Trail's survival states start to supply for free.
Traces are sensitive. They include prompts, code, context, edits, mistakes, and sometimes secrets.
So was code when we wrote it ourselves. It was the direct manifestation of our intent.
With AI, traces are becoming the new source code.
The code in Git is still the artifact that ships. But more and more, the real record of how software gets made lives in the process around it: the prompt that set the goal, the files the model read, the commands it ran, the edits it tried, the tests that passed, the changes that survived, and the ones that were reverted.
That process is where teams will learn. It is where better tools, evals, workflows, datasets, and agents will come from.
So the raw evidence should be private by default. You should own your bucket. You should decide what leaves it.
But the infrastructure around that evidence should be open. If traces become the source of how software is produced, then open source needs open trace infrastructure to keep up.
Not everything needs to be public. Most raw traces should not be.
What should be shared are the contracts, workflows, review gates, dataset builders, and publishing paths that let people turn private traces into useful artifacts when they choose to: a bug capsule, a PR explanation, a skill eval, a sanitized dataset, a report.
That is the balance opentraces is trying to make practical: private evidence, open learning loops.
This matters because software engineering is moving toward harness engineering. We are not only writing code anymore. We are shaping the systems that produce code: prompts, context, tools, verifiers, evals, workflows, and feedback loops.
To improve those systems, we need to keep the record of what happened.
Every session can become evidence. Every accepted or reverted change can become signal. Every workflow change can become a step toward a better process.
Training data is one possible output. It is not the only one.
The broader goal is to make the process of building software observable, reviewable, and improvable.
If that process is captured only inside closed products, open source loses access to the learning loops that will shape the next generation of tools.
If the infrastructure is open and composable, open source has a chance to build on its own work again.
That is the bet behind opentraces: developers should own the record of how their software gets made, and the ecosystem around that record should be open enough for others to build on it.
The examples above all follow the same pattern: capture the evidence, filter it with a workflow, review what leaves, and render it to a useful destination.
Three ways in.
Paste the setup prompt into Claude Code, Codex, or Pi. The agent installs the CLI, authenticates, and turns on capture for you.
One line with pipx or Homebrew.
Open source, end to end. The next consumer is a workflow and a renderer away.