Essay · emploke

What we believe about agentic systems

The harness should shrink as the model grows. The one part that should grow is the AI's own ability to extend it.

We've spent the last while building emploke, a small system for running AI agents on real work. Along the way, our taste in what an agentic system should be has slowly drifted away from the mainstream. This is a short post about where it has drifted to, and why.

It's not a manifesto. It's the way we've come to think after enough cycles of writing code, throwing it out, and writing simpler code in its place.

What we've been noticing

Most agent frameworks are getting bigger. More node types, more graph primitives, more orchestration languages, more handlers for cases the model didn't quite get right. There's a quiet assumption running through all of this: the framework is what makes the agent capable. If the model is the engine, the framework is the chassis, the suspension, the steering, the seatbelts, and most of the dashboard.

We don't think that's how it's going to age.

The models keep getting better. Not in a smooth line, but in lurches that quietly retire whole categories of scaffolding. The clever planner you wrote last year is what the model now does in one sampling. The state machine you built around tool use is what the model now handles in its head. Each lurch makes a particular shape of harness embarrassing.

If you squint at the trend long enough, a question becomes hard to avoid: what part of the harness is actually durable?

A bet on the model curve

Our bet is straightforward. The model is going to absorb more and more of what frameworks currently do — planning, decomposition, recovery from mistakes, choosing when to use a tool, choosing what shape of answer to give. The boundary between "what the framework does" and "what the model does" will keep moving toward the model.

If you believe that, the design problem flips. You stop asking "what should the framework do for the model?" and start asking "what's the smallest scaffold the model needs to do real work?"

A good harness in 2026 is not the one that does the most. It's the one that does the least, while still letting the model reach further every quarter.

Three things we keep coming back to

Underneath the day-to-day code we write, three convictions keep showing up. They're the lens behind most decisions in emploke.

One

Agents are data, not code.

An agent is a markdown file with a name and instructions. Not a class, not a node in a graph, not a subclass of BaseAgent. The model knows how to read prose; it doesn't need a typed wrapper around it. Treating agents as data means the AI itself can author them — read them, edit them, version them, recombine them — without round-tripping through a programmer.
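To make "agents are data" concrete, here is a minimal sketch of reading and writing an agent as a markdown file. The file layout (a `name:` line inside a `---` front-matter block, with the instructions below it) and the `parseAgent` / `renderAgent` helpers are assumptions made for this example, not emploke's actual format.

```typescript
// Hypothetical shape: an agent is nothing more than a name plus prose instructions.
interface AgentDefinition {
  name: string;
  instructions: string;
}

// Parse a markdown file whose front matter (between two `---` lines) carries a
// `name:` field; everything after the front matter is the instructions.
// This layout is an assumption for illustration only.
function parseAgent(markdown: string): AgentDefinition {
  const lines = markdown.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") {
    throw new Error("agent file must start with a front-matter block");
  }
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) {
    throw new Error("front-matter block is not closed");
  }

  let agentName = "";
  for (const line of lines.slice(1, end)) {
    const match = /^name:\s*(.+)$/.exec(line.trim());
    if (match && match[1] !== undefined) {
      agentName = match[1].trim();
    }
  }
  if (agentName === "") {
    throw new Error("agent file is missing a name");
  }

  const instructions = lines.slice(end + 1).join("\n").trim();
  return { name: agentName, instructions };
}

// Rendering is the inverse of parsing, which is what lets a model author or
// edit an agent with nothing more than a file write.
function renderAgent(agent: AgentDefinition): string {
  return `---\nname: ${agent.name}\n---\n\n${agent.instructions}\n`;
}

const reviewer = parseAgent(
  "---\nname: reviewer\n---\n\nRead the diff and list concrete problems.\n"
);
```

Because the whole definition round-trips through plain text, versioning and recombining agents reduces to ordinary file operations.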

Two

The harness should shrink as the model grows.

Every line of orchestration code we write is a bet that the model won't get good enough to do this itself. Most of those bets lose, on a long enough timeline. So the rule we try to hold ourselves to is: don't add structure unless the model genuinely cannot do it; and when the model gets good enough, take that structure out.

Three

What the harness does grow is the AI's ability to extend itself.

We don't shrink the harness toward zero. We shrink it toward one shape: a clean surface where the model can add new capabilities — skills, tools, sub-agents — as data, on its own, without us. The harness's job is to make that surface safe, observable, and composable. Everything else is negotiable.

The parts that stay small

Three things stay in the harness, on purpose, because they're structurally outside what a language model can solve from inside its own context window:

A sandbox. The model needs somewhere to run that isn't your laptop. Filesystems, processes, network — bounded, observable, recoverable. This is engineering, not intelligence. No model upgrade removes the need for it.

Scheduling. Multiple agents, multiple tasks, shared state, human interrupts. Deciding what runs, when, with which inputs. This is also engineering. We try to make it as boring and reliable as possible.

Observability. If the AI is going to extend itself, somebody — eventually the AI itself — needs to be able to see what happened and why. Logs, traces, replay, diffs. Without this, every other belief on this list collapses into wishful thinking.

That's the whole list. Sandbox, scheduling, observability. Everything else we've been tempted to put in the harness, we've eventually pulled back out.
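As one small illustration of the "bounded" part of the sandbox, the sketch below resolves a requested path against a workspace root and rejects anything that would climb out of it. The `resolveInSandbox` helper is hypothetical, and it only shows the boundary-check idea: it works on POSIX-style strings, ignores symlinks, and is no substitute for process-level or OS-level isolation.

```typescript
// Resolve `requested` against a sandbox root and reject anything that escapes it.
// Paths are treated as POSIX-style strings; no filesystem access happens here.
function resolveInSandbox(root: string, requested: string): string {
  if (requested.startsWith("/")) {
    throw new Error(`absolute paths are not allowed: ${requested}`);
  }

  const resolved: string[] = [];
  for (const segment of requested.split("/")) {
    if (segment === "" || segment === ".") {
      continue; // empty and current-directory segments change nothing
    }
    if (segment === "..") {
      // Stepping above the root would leave the sandbox.
      if (resolved.length === 0) {
        throw new Error(`path escapes the sandbox: ${requested}`);
      }
      resolved.pop();
      continue;
    }
    resolved.push(segment);
  }

  // Drop a trailing slash on the root so the join never produces "//".
  const base = root.endsWith("/") ? root.slice(0, -1) : root;
  return [base, ...resolved].join("/") || "/";
}

const reportPath = resolveInSandbox("/work", "notes/../out/report.md");
// reportPath is "/work/out/report.md"
```

A check like this belongs in the harness rather than in the prompt: it has to hold even when the model is wrong, which is why no model upgrade makes it unnecessary.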

The one part that grows

The shape we keep wide open is capability: the surface where new powers enter the system. In emploke today, that surface has three flavors and one rule.

The flavors are skills (markdown files that teach the model how to do something), MCPs (Model Context Protocol servers that expose external tools the model can call), and agents (markdown files that describe roles the model can play). All three are data. All three live in a folder. All three can be authored by a human or by the AI.

The rule is: anything the AI can do, the AI can also extend. If the AI can write code, it can write a skill. If it can call a tool, it can install one. If it can act as an agent, it can describe a new agent. The harness doesn't gate that; it just makes it observable and undoable.
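One way to picture "observable and undoable" is a catalog that journals every change and can roll the latest one back. The sketch below is an in-memory illustration under that assumption; the `CapabilityCatalog` class, its method names, and the `kind/name` key scheme are hypothetical and not taken from emploke's code.

```typescript
type CapabilityKind = "skill" | "mcp" | "agent";
type Author = "human" | "ai";

interface Capability {
  kind: CapabilityKind;
  name: string;
  body: string; // markdown text or tool configuration, stored as plain data
}

interface JournalEntry {
  author: Author;
  key: string;
  previous: Capability | undefined; // what this change replaced, if anything
  next: Capability;
}

class CapabilityCatalog {
  private readonly items = new Map<string, Capability>();
  private readonly journal: JournalEntry[] = [];

  // Humans and the AI go through the same door; nothing is gated, everything is journaled.
  install(author: Author, capability: Capability): void {
    const key = `${capability.kind}/${capability.name}`;
    this.journal.push({ author, key, previous: this.items.get(key), next: capability });
    this.items.set(key, capability);
  }

  // Undo the most recent change by restoring whatever was there before it.
  undo(): JournalEntry | undefined {
    const entry = this.journal.pop();
    if (entry === undefined) {
      return undefined;
    }
    if (entry.previous === undefined) {
      this.items.delete(entry.key);
    } else {
      this.items.set(entry.key, entry.previous);
    }
    return entry;
  }

  get(kind: CapabilityKind, name: string): Capability | undefined {
    return this.items.get(`${kind}/${name}`);
  }

  // Read-only view of the changes that are currently applied.
  history(): readonly JournalEntry[] {
    return this.journal;
  }
}

const catalog = new CapabilityCatalog();
catalog.install("human", { kind: "skill", name: "changelog", body: "Summarize merged PRs." });
catalog.install("ai", { kind: "skill", name: "changelog", body: "Summarize merged PRs by area." });
catalog.undo(); // rolls back the AI's edit and restores the human-authored version
```

A fuller version would probably record the undo itself as a journal entry so the trace stays complete; this sketch simply drops the reverted change to stay short.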

The harness's job is not to be smart. It's to make the AI's own intelligence accumulate.

What this lets us do

If you accept the three beliefs, a lot of design questions stop being interesting.

You don't argue about graph DSLs vs. chains vs. event buses, because there isn't supposed to be a fixed shape at all; the model picks the shape of its own work. You don't argue about whether agents should be classes or functions, because they're files. You don't build a planner, because the model plans. You don't build a router, because the model routes. You don't build a memory subsystem, because memory is just files in the workspace.

What you do spend your energy on is the boring stuff: the sandbox not leaking, the scheduler not deadlocking, the trace being actually readable, the capability surface being actually safe to extend at runtime. Unglamorous work. But it's the work that compounds, because none of it has to be redone when the model gets stronger. It just gets more leverage.

About emploke

emploke is our reference implementation of all this. Today it's a pnpm monorepo of small TypeScript packages that runs locally and exposes a single HTTP server (Hono) with a bundled React dashboard. Inside are a workspace abstraction, a runtime registry, a capability catalog (skills, MCPs, agents), and a thin task/session model. The code is deliberately small. If we did our jobs right, it should keep shrinking, not growing, as the models get better.

If you want to see how the beliefs map onto code, the architecture guide is the right next read. It's the engineering view of the same picture: where the layers are, what the runtime contract looks like, how the repository pattern keeps the storage backend swappable.

The engineering will change. The runtime might consolidate to a single backend; the storage layer might collapse into something simpler; the catalog might absorb new flavors of capability we haven't thought of yet. None of that changes what's in this post. The beliefs are the slow-moving part. The code is just where they currently land.

If you're nodding along

This post is mostly here to find the people who already think this way and don't quite have words for it yet. If you've been quietly suspicious that your framework is doing too much, or that the next model release is going to make a third of your codebase pointless — you're not alone, and you're probably right.

We'd love to compare notes. Open an issue on the repo, or just read the code and tell us where we're wrong. The whole point of keeping the harness small is that being wrong is cheap; we'd rather find out fast.


Published 2026 by the emploke team. Source on GitHub. The companion architecture guide is here.
