Draft note: numbers in this document are the 2026-04-22 audit snapshot. Re-verified 2026-04-24: fresh enumeration returns 787 files, 252,648 LOC, 88,491 audit rows — roughly 1.5% growth in code volume and 1.3% in audit rows over 48 hours of active development. Bucket ratios, wrapper-class proportions, consciousness-tier counts, and the four claims are unchanged. Body text retains the 2026-04-22 numbers for internal consistency with the evidence dossier; a release pass would swap in the verification-day totals.
Abstract
There is a dismissal of local-first agent systems — "thin wrappers around CLIs" — that has real weight behind it and deserves a direct answer rather than a rhetorical one. This paper answers it empirically. I audited JARVIS, a single-operator local-first AI cybersecurity operations console, at 248,809 lines of Python across 766 files, and decomposed the codebase along four independent axes. Table 1 decomposes by functional bucket and shows that the "core intelligence" of the system (orchestration, behavioral, tools, runtime daemons) is about 34% of the codebase, smaller than the GUI (17%) and tests/utilities (26%) combined. Table 2 decomposes the 258-tool surface by wrapper class: of 183 dispatch functions, only 12 (6.6% of functions, 11.2% of tool code) shell out to external binaries; the rest are original Python. Table 3 decomposes the 35 consciousness modules into three operational tiers — 17 that reach the operator on a live turn, 11 that accumulate ambient state between turns, and 7 that are currently inert or conditionally gated. Table 4 is the 87,315-row audit log histogram over 24 days: three event types account for 78.7% of rows, and the actual autonomy-decision events sum to 5.5%. The four axes tell the same story in different dialects. A local-first agent system is scaffolding surrounded by small moments of decision; the wrapper framing misreads what the code is doing. The five explicit limitations are N=1, heuristic methodology, no version-control history at snapshot, snapshot-date drift, and self-reporting bias on operator-side observations. I propose what replication at N>1 would need to look like.
1. The dismissal the paper is addressing
This paper examines a specific dismissal. "Local-first agent systems are wrappers around CLIs" is a position with weight behind it. The reference to shell-outs is real — security tooling especially does invoke subfinder, httpx, nuclei, and the rest. The prior that a single operator cannot produce serious engineering in thirty-seven days is reasonable given the base rate. And the AI industry has, in fact, shipped many genuinely thin wrappers and called them agents. The critique deserves a direct answer, not a dismissal of the dismissal.
The answer I want to give is structural rather than rhetorical. What I will argue is that when you decompose a real single-operator agent system at scale — files, lines of code, tool dispatch functions, behavioral modules, audit events — the wrapper framing fails to fit any of the four axes, and the reason it fails is instructive. It fails because agent systems do not concentrate in the place the critique points at. They accumulate in the scaffolding around the small decision moments, and the scaffolding is the work.
The specific system is JARVIS: 766 authored Python files, 248,809 lines of code, 27 subsystems, 258 tools, 87,315 hash-chained audit records at the time of snapshot. It is single-operator, single-workstation, local-first, and has been under continuous development for 37 days. The preceding three papers in this sequence have described its safety architecture, its ambient-intelligence layer, and the personal-alignment load on its operator. This paper describes its shape — what the code looks like when you break it apart and count.
I want to be honest about what this paper is not. It is not a benchmark; there is no performance claim measured against other systems. It is not a how-to; the numbers are evidence, not instructions. It is not a defense of the code's quality; quality is a separate axis the audit does not speak to. What the audit gives us is a four-table empirical answer to the question "what is this thing, actually, when you weigh it?", and that answer turns out to contradict the "wrapper" framing at every table. If someone wants to argue the system is badly designed, or overbuilt, or under-tested, the tables do not prevent that argument. They only prevent the argument that the code is a thin shell.
The claim of this paper, then, is narrow and load-bearing: agent systems at scale decompose into scaffolding surrounded by small moments of decision, not into model calls surrounded by tools. Builders who budget for the decisions will systematically under-budget the scaffolding by approximately an order of magnitude, and the consequence of that mis-budgeting is that new agent systems ship half-finished and get dismissed as wrappers — because, under-budgeted, that is in fact what they are.
2. How the audit was done
The audit is an extraction, not a benchmark. Four tables, read-only against the running system, no code mutations, no inference beyond arithmetic.
File and line counts. Python files enumerated with Get-ChildItem -Recurse -Filter *.py; per-file physical-line counts via (Get-Content | Measure-Object -Line).Lines. This counts blank lines, comments, and docstrings alongside executable code — the same convention cloc uses for its "code + comment + blank" mode. A statement-only count would drop totals by roughly 15–20% uniformly across the tree; bucket ratios are approximately preserved under that alternative, though heavily-docstringed directories (notably consciousness/) would shift by one to three percentage points. The choice to use physical lines was made for reproducibility; anyone with PowerShell and Get-Content can replicate the totals without installing additional tools.
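For readers who would rather replicate the count without PowerShell, a minimal Python equivalent follows. The root path and the exclusion set are stand-ins for the list given two paragraphs down; they are assumptions about the tree layout, not part of the audit procedure itself.

```python
# A minimal Python equivalent of the PowerShell enumeration described above.
from pathlib import Path

EXCLUDED_DIRS = {"__pycache__", ".venv", "venv", "node_modules",
                 "jarvis_backups", "jarvis_patches"}

def count_physical_lines(root: str = ".") -> tuple[int, int]:
    files = 0
    lines = 0
    for path in Path(root).rglob("*.py"):
        if EXCLUDED_DIRS & set(path.parts):
            continue  # skip excluded directories wherever they appear in the path
        files += 1
        # Physical lines: blanks, comments, and docstrings all count, matching
        # the (Get-Content | Measure-Object -Line) convention in the text.
        with path.open(encoding="utf-8", errors="replace") as handle:
            lines += sum(1 for _ in handle)
    return files, lines

if __name__ == "__main__":
    n_files, n_loc = count_physical_lines()
    print(f"{n_files} files, {n_loc} physical lines")
```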
Authored vs vendored. Everything under cb-mpc/ — a C++ multi-party computation library with Python test harnesses, vendored whole — is bucketed as generated/vendor (42 files, 9,085 LOC). Nothing else in the tree is auto-generated: no protobuf output, no OpenAPI stubs, no ORM migrations. The "authored" designation therefore applies to 724 of 766 files.
Directories excluded from enumeration. __pycache__/, .venv/, venv/, node_modules/, jarvis_backups/, jarvis_patches/, backup and preview directories, and non-code content (markdown, JSON, HTML, SVG, audio samples, model weights). This paper is about Python code volume specifically.
Tool enumeration. Tool schema count comes from a regex of "name"\s*:\s*"([a-zA-Z_0-9]+)" against tools/registry.py's TOOL_SCHEMAS list: 258 unique schemas. Dispatch function count comes from ^(async\s+)?def\s+(tool_[a-zA-Z_0-9]+)\s*\( across all files in tools/: 183 functions. The gap — 258 schemas, 183 dispatch functions — is closed by inline handlers in registry.py's _dispatch_inner() that forward into autonomy and intelligence modules without a standalone wrapper function; these inline handlers are counted in the schema total but not in the dispatch-function total. The 12-function shell-wrapper subset was identified by searching each dispatch body for subprocess.*, os.system, Popen(, asyncio.create_subprocess, shutil.which(, or the _enqueue_to_parrot call that routes aggressive tools to a separate Parrot OS virtual machine. Each of the 12 flagged functions was manually verified.
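A sketch of the same three passes in Python. The paths (tools/registry.py, the tools/ tree) follow the text; note that the sketch flags shell markers per file rather than per dispatch body, so it over-approximates the set that the audit then narrowed by manual verification.

```python
# Schema enumeration, dispatch-function enumeration, and coarse shell-marker flagging.
import re
from pathlib import Path

SCHEMA_RE = re.compile(r'"name"\s*:\s*"([a-zA-Z_0-9]+)"')
DISPATCH_RE = re.compile(r'^(?:async\s+)?def\s+(tool_[a-zA-Z_0-9]+)\s*\(', re.M)
SHELL_MARKERS = ("subprocess", "os.system", "Popen(",
                 "asyncio.create_subprocess", "shutil.which(", "_enqueue_to_parrot")

registry_src = Path("tools/registry.py").read_text(encoding="utf-8")
schemas = set(SCHEMA_RE.findall(registry_src))

dispatch, shell_flagged = set(), set()
for path in Path("tools").rglob("*.py"):
    src = path.read_text(encoding="utf-8", errors="replace")
    names = DISPATCH_RE.findall(src)
    dispatch.update(names)
    if any(marker in src for marker in SHELL_MARKERS):
        # Coarse: flags every dispatch function in a file that contains a marker;
        # the audit's count rests on per-body inspection plus manual review.
        shell_flagged.update(names)

print(len(schemas), "schemas;", len(dispatch), "dispatch functions;",
      len(shell_flagged), "shell-flagged before manual review")
```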
Consciousness wiring. For each of the 35 substantive modules in consciousness/, I traced consumers via Select-String for consciousness.<module> across the tree, then manually verified every hit at the call site in agents/worker.py. "Reaches operator this turn" is true only if the module's output is read during the assembly of a single reply — either in the system-prompt build phase or in the post-LLM prepend phase — and not if it writes state that will be consumed later by another daemon loop. The per-turn definition is strict: a module that writes a row to a database every five minutes, consumed by a different subsystem that emits text on some other schedule, is counted as ambient, not live.
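The mechanical half of that trace, sketched in Python. The judgment call described above — deciding live versus ambient by reading each hit at its call site in agents/worker.py — is deliberately not automated here.

```python
# Which files reference each consciousness module as consciousness.<module>.
import re
from pathlib import Path

modules = sorted(p.stem for p in Path("consciousness").glob("*.py") if p.stem != "__init__")
consumers: dict[str, list[str]] = {m: [] for m in modules}

for path in Path(".").rglob("*.py"):
    text = path.read_text(encoding="utf-8", errors="replace")
    for module in modules:
        if re.search(rf"consciousness\.{re.escape(module)}\b", text):
            consumers[module].append(str(path))

for module in modules:
    print(f"{module:28s} {len(consumers[module])} consuming file(s)")
```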
Audit histogram. Straight SQL: SELECT event_type, COUNT(*), MIN(ts), MAX(ts) FROM audit_log GROUP BY event_type ORDER BY COUNT(*) DESC against audit_log.db. Result: 87,315 rows across 42 distinct event types, spanning 2026-03-29 03:46:52Z to 2026-04-22 16:10:28Z (approximately 24 days). Emission sites for each event type were identified by grepping the source tree for the event literal and reading the surrounding context.
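The same query, wrapped in a read-only sqlite3 connection so a replicator cannot mutate the log by accident. The database filename follows the text and should be treated as an assumption about the deployment layout.

```python
# Event-type histogram over the audit log, read-only.
import sqlite3

con = sqlite3.connect("file:audit_log.db?mode=ro", uri=True)
rows = con.execute(
    "SELECT event_type, COUNT(*), MIN(ts), MAX(ts) "
    "FROM audit_log GROUP BY event_type ORDER BY COUNT(*) DESC"
).fetchall()
total = sum(count for _, count, _, _ in rows)

for event_type, count, first_seen, last_seen in rows:
    print(f"{event_type:32s} {count:>8,} {100 * count / total:5.1f}%  {first_seen} .. {last_seen}")
print(f"{'total':32s} {total:>8,}")
```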
N=1. This is a single repository, a single operator, a single local-first architecture, a single 24-day audit window. I state this here so the reader is oriented for the whole paper: absolute numbers are to be read as an existence proof that one local-first agent can decompose the way this one does, not as population statistics. Replication across other local-first agent systems would be needed to generalize, and I sketch what that replication would look like in §5.
No git history. The project was not under version control during the snapshot period. .git/ does not exist in this tree. This is a real limitation — commit cadence, regression history, and authorship timeline cannot be reconstructed from repository artifacts — and it is a cost paid for the speed of the build. Going forward, the project is under git version control; this manuscript and the evidence dossier serve as the t=0 reference point for that history.
3. Four tables
Each of the four tables corresponds to one load-bearing claim. I present each table with a short setup, the table itself, and two to three paragraphs of reading.
3.1 Table 1 — Codebase by functional bucket
Thirteen buckets, mutually exclusive, covering 100% of the tree. The mapping of directories to buckets is described below the table; a small number of non-obvious assignments are explained explicitly.
| Bucket | Files | LOC | % of total |
|---|---|---|---|
| tools/adapters | 35 | 17,430 | 7.0% |
| orchestration/intelligence | 81 | 35,314 | 14.2% |
| policy/safety | 13 | 2,621 | 1.1% |
| persistence/state | 19 | 7,448 | 3.0% |
| runtime/daemons | 34 | 15,713 | 6.3% |
| behavioral/personality | 37 | 17,185 | 6.9% |
| voice/audio | 27 | 9,032 | 3.6% |
| gui | 75 | 42,205 | 17.0% |
| worldview/spatial | 25 | 5,806 | 2.3% |
| research/knowledge | 27 | 6,934 | 2.8% |
| tests/utilities | 295 | 64,899 | 26.1% |
| generated/vendor | 42 | 9,085 | 3.7% |
| other | 56 | 15,137 | 6.1% |
| Total | 766 | 248,809 | 100.0% |
The non-obvious assignments: storage/audit_log.py (950 LOC) is pulled out of persistence and placed in policy/safety, because it is the tamper-evident safety surface rather than a generic persistence helper. runtime/kill_switch.py (133 LOC) is in policy/safety, not runtime/daemons, because it is a gate rather than a daemon. bridge/scope.py (412 LOC) is in policy/safety. voice/profiles.py (246 LOC) is in behavioral/personality because it encodes per-persona voice parameters — persona data, not audio plumbing. Root-level scripts (check_*, fix_*, seed_*, main.py boot glue) are bucketed with tests/ and scripts/ as tests/utilities. These decisions each shift the containing buckets by less than half a percentage point; reclassifying any one of them does not change the shape of the answer.
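As an illustration of the rubric, here is a fragment of the directory-to-bucket mapping with the overrides named above pulled out explicitly. Only the assignments discussed in the text appear; the directory names beyond those are assumptions about the tree layout, and the full rubric is longer.

```python
# Illustrative fragment of the bucket rubric: directory prefix decides the
# bucket, with a handful of per-file overrides applied first.
BUCKET_BY_DIR = {
    "tools": "tools/adapters",
    "gui": "gui",
    "tests": "tests/utilities",
    "scripts": "tests/utilities",
    "cb-mpc": "generated/vendor",
    # ... remaining directory prefixes omitted
}

FILE_OVERRIDES = {
    "storage/audit_log.py": "policy/safety",        # tamper-evident surface, not generic persistence
    "runtime/kill_switch.py": "policy/safety",      # a gate, not a daemon
    "bridge/scope.py": "policy/safety",
    "voice/profiles.py": "behavioral/personality",  # persona data, not audio plumbing
}

def bucket_for(relative_path: str) -> str:
    relative_path = relative_path.replace("\\", "/")  # normalize Windows separators
    if relative_path in FILE_OVERRIDES:
        return FILE_OVERRIDES[relative_path]
    top = relative_path.split("/", 1)[0]
    return BUCKET_BY_DIR.get(top, "other")
```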
The first reading of the table is the location of mass. The single largest bucket is tests/utilities at 26.1%, followed by GUI at 17.0% and orchestration/intelligence at 14.2%. Those three together account for 57.3% of the project. The "agent" — the model-adjacent decision-making code someone would point at if you asked which files are the AI — is in orchestration/intelligence and policy/safety and some of runtime/daemons, summing to roughly 22% of the tree. Everything else is what lets those files run: the database schema, the voice pipeline, the daemons that keep state coherent between turns, the windows that display results, the tests and scripts that verify and seed the rest.
The second reading is the surprise at the small end. The policy/safety bucket — the entire tamper-evident audit log, kill switch, scope gate, security stack, and autonomy policy that makes autonomous operation legally defensible — is 2,621 lines across 13 files, about 1.1% of the project. The behavioral/personality layer is 17,185 lines across 37 files, roughly 6.9%. The safety perimeter is small because the work safety does is narrow and load-bearing; the behavioral layer is larger because the work of making a system feel coherent between turns is broad and open-ended. These two buckets together — the ones most often caricatured as "most of the code" in dismissive summaries — are under 8% combined. The real mass sink is the scaffolding around the agent, not the agent itself.
The third reading is the consequence. A builder who budgets for an agent as "model plus tools" is budgeting roughly 22% of what actually ships. The other 78% is not discretionary. Without persistence, the agent forgets. Without daemons, it cannot react between turns. Without a GUI, it cannot be operated. Without tests and utilities, the rest regresses. The implication is not that the 22% is the real system and the 78% is overhead; it is that the 22% cannot exist operationally without the 78%.
3.2 Table 2 — Tool surface by wrapper class
The 258 tool schemas are implemented by 183 dispatch functions plus roughly 75 inline handlers in registry.py itself. The 183 dispatch functions are where the wrapper question can be answered directly. I classified each by body inspection into one of four wrapper classes: shell (calls out to an external binary), api (hits an HTTP endpoint), db (reads or writes local storage), or native (pure in-process Python).
| Class | Dispatch functions | % of functions | LOC | % of tool LOC |
|---|---|---|---|---|
| shell | 12 | 6.6% | 1,072 | 11.2% |
| api | ~46 | 25.1% | ~2,710 | 28.3% |
| db | ~38 | 20.8% | ~2,150 | 22.5% |
| native | ~87 | 47.5% | ~3,644 | 38.0% |
| Total | 183 | 100.0% | 9,576 | 100.0% |
The shell-wrapper set is short enough to name in full: tool_scan_git_exposure, tool_run_nuclei, tool_get_wayback_urls, tool_url_analyze, tool_run_subfinder, tool_run_httpx, tool_list_capabilities, tool_whois_lookup, tool_get_clipboard, tool_run_command, tool_open_app, tool_set_voice. These shell out to the well-known Go and Python security CLIs (subfinder, httpx, nuclei, git ls-remote), plus pyperclip/xclip for clipboard, powershell Start-Process for application launch, and Windows SAPI for voice. Every other dispatch function — all 171 of them — is original Python. The api class hits NVD for CVE data, GitHub for code search, Gmail for mail, HackerOne for bounty platform operations, Wayback for historical URLs, OpenSky for aircraft tracking, USGS for earthquake feeds, Stripe for credential validation, and a handful of others. The db class reads and writes the 55-table operational SQLite database and the 42-event-type audit log. The native class is pure in-process Python: state readers, dispatchers, formatters, heuristic scorers.
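The classification itself was done by reading each dispatch body, not by script, but the decision order is mechanical enough to sketch. The shell markers below are the ones listed in §2; the api and db marker sets are illustrative assumptions, not the audit's actual rubric.

```python
# One plausible mechanical approximation of the four-class body inspection.
SHELL_MARKERS = ("subprocess", "os.system", "Popen(", "asyncio.create_subprocess",
                 "shutil.which(", "_enqueue_to_parrot")
API_MARKERS = ("requests.", "aiohttp", "httpx.", "urllib.request")   # assumed marker set
DB_MARKERS = ("sqlite3", "execute(", "cursor(", "audit_log")          # assumed marker set

def wrapper_class(body: str) -> str:
    if any(m in body for m in SHELL_MARKERS):
        return "shell"   # delegates to an external binary or the Parrot VM queue
    if any(m in body for m in API_MARKERS):
        return "api"     # reaches an HTTP endpoint
    if any(m in body for m in DB_MARKERS):
        return "db"      # reads or writes local storage
    return "native"      # pure in-process Python
```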
The first reading is direct. Of 183 tool dispatch functions, 12 — 6.6% — shell out to external binaries, and those 12 account for 11.2% of tool code. The common dismissal that local-first agent systems are "thin wrappers around existing tools" does not fit this distribution. The dominant wrapper class by both function count and line count is native: original orchestration logic operating on in-process state. By line count, the api class is second. The shell wrappers are the smallest of the four classes on both axes.
The second reading is where the glue actually lives. Outside tools/, the shell-wrapper pattern effectively disappears. The agents/, intelligence/, autonomy/, consciousness/, and memory/ directories — 35,314 LOC of orchestration plus 17,185 LOC of behavioral plus 7,448 LOC of persistence — are pure Python. The glue that dominates the codebase is not between JARVIS and external CLIs; it is between JARVIS's own subsystems. Orchestrator talks to policy talks to audit-log talks to finding-engine talks to brain-graph. That internal glue is where code volume accumulates, and none of it fits the wrapper critique because none of it is wrapping anything external.
The third reading is what the tool classification implies about the system's surface. If the system were in fact a thin wrapper, the shell class would dominate and the native class would be small — because a thin wrapper does little besides format arguments, shell out, and parse output. Here the opposite is true. The shell class is the smallest, the native class is the largest, and the api and db classes are comparably sized to each other. The shape of the tool surface is a system that does substantial work in-process and reaches out selectively, not a system whose primary mode is delegation to external binaries.
3.3 Table 3 — Consciousness modules by operational tier
The consciousness/ directory contains 35 substantive modules (excluding __init__.py). They are the system's personality-adjacent subsystems — inner voice, soul engine, emotional intelligence, relationship ledger, persona drift, proactive briefing, mirror, sibling channel, and others. A dismissive summary would say "35 consciousness modules" without qualifying what that means operationally. The purpose of this table is to qualify it.
I split the 35 into three tiers by reply-path effect (a decision-rule sketch follows the list):
- Live (17 modules): the module's output is read during the assembly of a single operator reply, either in the system-prompt build phase or in a post-LLM prepend phase. Confirmed by reading the call site in agents/worker.py for every member of the tier.
- Ambient (11 modules): the module runs on a schedule or subscribes to an event stream and mutates state — memory rows, soul-engine events, brain-graph edges, relationship ledger — that influences future turns, but its output is not read on any given turn.
- Inert or conditional (7 modules): the module is gated off by a configuration flag (VISION_ENABLED=False is the primary case), is campfire-only and therefore outside the operator reply path, or is pure infrastructure (the scheduler itself, the shared cooldown gate) that enables other modules without producing reply-path output.
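The decision rule, written out as a predicate. The field names are invented for illustration; the actual classification was a manual reading of each module's call sites in agents/worker.py, not a struct lookup.

```python
# The strict per-turn criterion from the list above, as a predicate.
from dataclasses import dataclass

@dataclass
class ModuleWiring:
    read_in_prompt_build: bool         # output consumed while the system prompt is assembled
    read_in_post_llm_prepend: bool     # output consumed in the post-LLM prepend phase
    mutates_cross_turn_state: bool     # writes rows/edges/events consumed on later turns
    gated_off_or_infrastructure: bool  # feature-flagged off, campfire-only, or pure infrastructure

def tier(w: ModuleWiring) -> str:
    if w.gated_off_or_infrastructure:
        return "inert/conditional"
    if w.read_in_prompt_build or w.read_in_post_llm_prepend:
        return "live"     # read during assembly of a single reply
    if w.mutates_cross_turn_state:
        return "ambient"  # influences future turns only
    return "inert/conditional"

print(tier(ModuleWiring(False, False, True, False)))  # -> ambient
```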
The live tier includes emotional_intelligence, human_layer, persona_drift, character_mode, session_context, relationship_ledger, soul_engine, social_context, personality_bridge, interruption_handler, inner_voice (via surface_pending_thoughts_to_prepend), proactive_voice, synthesis_engine's Socratic tail, and cross-sibling reads through jarvis_reachy_channel and reachy_brain. The ambient tier includes mirror, continuous_watcher, operator_awareness, parenting_layer, joy_ledger, joy_prompt_daemon, self_evolution, proactive_briefing (session-start only), sibling_channel, time_capsule, and emotional_memory's write-only paths. The inert tier includes screen_awareness (vision currently disabled), pattern_observer (depends on vision events that do not fire), campfire_context and campfire_watcher (campfire-only), conversation_mode in its auto-triggered branch, scheduler (infrastructure), and _shared_cooldown (infrastructure).
The first reading is that "35 modules" overstates what is operationally live at any moment. Seventeen modules — 48.6% — participate in any given live utterance. Eleven are doing ambient state accumulation that influences later utterances. Seven are either turned off or are infrastructure that supports the other two tiers without producing output themselves. Any future writing about this system should use the three-tier framing rather than the undifferentiated module count.
The second reading is about the shape of "behavioral breadth." The live tier is not a shrunken version of the full layer; it is the subset that got wired into the hot reply path by design. The ambient tier is doing real work between turns — memory curation, growth tracking, cross-session pattern detection, sibling-channel exchange — and that work does reach the operator, but asynchronously, through side channels like proactive voice announcements or next-session system-prompt injection. The inert tier is the smallest and is either feature-flagged off for GPU budget reasons (vision) or is infrastructure that would need to be reclassified if the paper's axis were changed from "reaches operator per turn" to "participates in the system's operation."
The third reading is honest about what this implies for the earlier papers in this sequence. Paper 2 (The Ambient Intelligence Problem) described the consciousness layer as a unified architectural tier. Empirically, the layer is three tiers. A more precise claim that Paper 2 v1.1 should adopt is that the layer has 17 active-per-turn modules, 11 ambient-state accumulators, and 7 currently-inert-or-conditional subsystems, and that the system's "behavioral scaffolding" is the union of the first two tiers rather than the full module count.
3.4 Table 4 — Audit log by event type
The audit log is a separate SQLite database, hash-chained row-by-row, covering every action the system takes that is deemed auditable by its emission site. At snapshot time it holds 87,315 rows spanning approximately 24 days. A full breakdown of all 42 event types is in the evidence dossier; the table here is the head of the distribution plus the tail of the autonomy-decision cluster that matters for the paper's claim.
| event_type | count | % | what triggers it |
|---|---|---|---|
| api_access | 45,912 | 52.6% | Non-localhost FastAPI request to the bridge server (one LAN client polled /api/jobs/pending). |
| tool_dispatch | 12,131 | 13.9% | Every tool dispatch call, logged unconditionally before the tool runs. |
| finding_cross_host_dup | 10,623 | 12.2% | SmartScope/FindingEngine dropped a finding as a cross-host duplicate. |
| orchestrator_action | 2,470 | 2.8% | Hunt state-machine action (recon/deep-dive/evidence/drafting). |
| finding_noise_filtered | 1,869 | 2.1% | Finding dropped by false-positive sweeps. |
| finding_processed | 1,769 | 2.0% | Finding passed all gates and was recorded. |
| self_eval:learn_outcome | 1,588 | 1.8% | Hit/miss outcome for a completed hunt recorded. |
| self_eval:ali_feedback | 1,588 | 1.8% | Operator approve/reject/dismiss feedback routed into learner. |
| scope_exclusion | 1,453 | 1.7% | SmartScope rejected a candidate host at the pre-filter. |
| finding_duplicate | 1,326 | 1.5% | Finding was an exact duplicate of a prior one. |
| hunt_auto_approved | 994 | 1.1% | Hunt director auto-approved next action under policy gate. |
| (31 further types, each under 1%) | 5,592 | 6.4% | — |
| Total | 87,315 | 100.0% | 42 distinct event types |
Three event types account for 78.7% of rows: api_access (52.6%), tool_dispatch (13.9%), and finding_cross_host_dup (12.2%). If you sum the events that are actual autonomy decisions — orchestrator_action, hunt_auto_approved, policy_decision, orchestrator_transition, hypothesis_formed — they come to approximately 4,841 rows, or 5.5% of the log. Day-to-day, the system is writing roughly 500 tool dispatches, 440 cross-host-duplicate drops, 100 orchestrator actions, 40 policy decisions, and 11 hypothesis formations per day. For every one "thought" recorded, there are roughly fourteen action-surface events recorded.
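The per-day figures are plain division over the roughly 24-day window; a three-line check, shown only to make the daily scale of the head rows explicit.

```python
# Daily rates implied by the head of Table 4.
WINDOW_DAYS = 24
head_counts = {
    "tool_dispatch": 12_131,
    "finding_cross_host_dup": 10_623,
    "orchestrator_action": 2_470,
}
for event_type, count in head_counts.items():
    print(f"{event_type:24s} ~{count / WINDOW_DAYS:.0f}/day")
# tool_dispatch            ~505/day
# finding_cross_host_dup   ~443/day
# orchestrator_action      ~103/day
```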
The first reading is that the audit log records the action surface at a per-dispatch granularity rather than selectively logging interesting decisions. Every tool call emits a tool_dispatch row before the tool runs, whether the tool is a passive DNS lookup or an orchestrator-initiated nuclei scan. Every cross-host duplicate drop is logged, even though no action reaches a target. The audit mechanism is proportional to the system's total activity, not to the subset of activity a designer might label as "decisions." This is the correct design for a tamper-evident record: the log is either universal or it is selectable, and a selectable log is one an adversary can game by reclassifying.
The second reading is the ratio. The 14:1 ratio of scaffolding-event rows to decision-event rows in the audit log mirrors the approximate 9:1 ratio of scaffolding LOC to decision-plus-safety LOC in Table 1. The two ratios are not identical, but they tell the same story at different scales. The system spends most of its time dispatching and deduplicating, not deliberating, and the audit log reflects that activity distribution faithfully.
The third reading is what the scope-gate numbers show about where gating actually happens. The log contains 1,453 scope_exclusion events (SmartScope dropping a candidate host before any tool runs against it) and 2 scope_violation events (a network tool's own scope check catching a target that somehow slipped through). The ratio is 726:1. The scope gate does almost all of its work at the SmartScope pre-filter stage; by the time a scope check reaches the network-tool layer, the target has already been vetted hundreds of times. This is a minor point that is easy to miss: serial gates with early, cheap pre-filters concentrate the work at the top, not at the bottom, and the audit log's event histogram shows which layers are doing the work.
4. Synthesis
The four tables are four decompositions of the same system along four axes. Taken alone, none of them is decisive. The wrapper critique could be salvaged against any single table with a sufficiently determined reading — "your 6.6% is a big 6.6%," "17 modules is still a lot," "audit logs lie about what matters." Taken together, they describe a shape the critique cannot fit.
Start with what the "wrapper around CLIs" framing predicts. If a local-first agent system is a thin shell over external binaries, the predictions are: most of the tool code is shell invocation; most of the non-tool code is either UI around those shells or glue to pipe their output; the behavioral surface is shallow because it is decoration around a pass-through pipeline; the audit trail, if it exists, logs decisions rather than actions, because there are fewer internal actions to log.
Now read the four tables against those predictions. Table 2 contradicts the first prediction directly: 12 of 183 dispatch functions shell out, 11.2% of tool code, and the balance is original Python. Table 1 contradicts the second: the non-tool mass is distributed across persistence, orchestration, behavioral, runtime, and GUI — scaffolding around the agent, not glue around shells. Table 3 contradicts the third: the behavioral surface is 35 modules, 17 of them wired into the live reply path by design, and the ambient and inert tiers add state accumulation and conditional capability that a wrapper architecture has nowhere to put. Table 4 contradicts the fourth: the audit log records roughly 500 tool_dispatch events per day — every single tool call, shell or api or db or native — and 5.5% of rows are autonomy-decision events. The log is universal and action-level, not selective and decision-level.
The shape the tables describe, taken together, is something like this: a hot inner loop of orchestration, policy evaluation, hypothesis formation, and tool dispatch — small, expensive per unit, and representing the decisions of the system — surrounded on every side by scaffolding: the tools themselves (17,430 LOC), the persistence and memory layers (7,448 LOC), the runtime daemons that keep state coherent (15,713 LOC), the behavioral layer that shapes how the system presents (17,185 LOC), the voice pipeline that gives it a mouth (9,032 LOC), and the GUI that lets an operator see it (42,205 LOC). The hot loop and its direct safety perimeter together are under 25% of the codebase. The scaffolding is everything else.
The generalizable observation from this shape is a ratio, not a specific number. In the LOC axis the ratio of scaffolding to decision-plus-safety sits around 9:1; in the audit-event axis it sits around 14:1; in the tool-wrapper-class axis the non-shell fraction is roughly 9:1 against the shell fraction. These are three independent decompositions, and they all land in the same order of magnitude. The hypothesis I take from the coincidence is that local-first agent systems at scale invest roughly an order of magnitude more effort into scaffolding than into the hot decision loop, and that this ratio is a property of the local-first setting rather than of a particular builder's choices. Cloud-native agent systems can rent scaffolding — hosted databases, managed UI frameworks, SaaS audit platforms, externally-maintained tool catalogs — and they will show a different ratio because much of the scaffolding does not appear in their repository. A local-first system, by definition, owns its own scaffolding, and the scaffolding mass is what you see when you count.
That reframing has a consequence for how agent systems are dismissed. The "wrapper around CLIs" framing locates the work in the wrong place. It assumes the agent is the hot loop plus the shell-outs, and that the rest of the code is decoration. The tables show the opposite: the hot loop plus the shell-outs are the smallest visible slice of the system, and the rest is the infrastructure that makes the hot loop operable for a single operator over continuous sessions. A builder who does not invest in the scaffolding does not produce a leaner agent; they produce a brittle one. A critic who does not count the scaffolding does not see a simpler system; they see a system whose complexity is hidden from them by the axis of their measurement.
None of this makes the scaffolding automatically good. A system with 248,809 lines of Python can be elegant or sprawling, well-tested or under-tested, clear or impenetrable. The tables do not speak to those questions. They speak to a narrower question: whether the system is a thin wrapper. It is not.
5. Discussion
The paper's claim is narrow; its generalization is narrower still. I want to be explicit about what the tables do and do not support beyond the specific case.
What they support. They support the existence claim: a single-operator, local-first agent system at the 248K-LOC scale can decompose along these four axes and produce these four distributions. That is not a population statement; it is a proof by instance. They also support the weaker generalization: the "wrapper around CLIs" framing does not fit this instance. A critic holding that framing has to show which axis of the critique the tables have missed, not repeat the critique against the summary.
What they do not support. They do not support claims about other builders' systems. A 30K-LOC local-first agent would have different ratios and a different shape. They do not support claims about cloud-native agents, where scaffolding is rented and therefore off-ledger. They do not support quality claims: the LOC distribution is silent on whether the code is well-structured, tested, or maintainable. They do not support productivity claims: the 37-day build is not offered as a benchmark, and the "one operator" framing is not a prescription.
What replication would need. A more general version of this paper would audit N>1 local-first agent systems along the same four axes and report the ratios across them. The candidates I would want to see audited include Tabby (local coding assistant), Continue (local coding agent), Superagent (local autonomous agent framework), and the cluster of local-llm-function-calling stacks that have emerged in the past year. For each, the method is: physical-LOC count by bucket; tool dispatch enumeration with shell/api/db/native classification; behavioral-layer tier analysis if a behavioral layer exists; audit or event-log histogram over a comparable window. The prediction I would carry into that study is that the scaffolding-to-decision ratio sits in the 5:1 to 20:1 range for systems over ~50K LOC. A ratio significantly outside that range — either a system that is overwhelmingly shell wrappers, or one that is overwhelmingly pure hot-loop code with no scaffolding — would be evidence against the generalization.
What would disprove the thesis. The thesis is falsifiable in at least three ways. First, an audit of several local-first agent systems that shows shell-wrapper code dominating the tool surface in systems at comparable scale would directly contradict Table 2's generalization. Second, an audit that shows an equivalent system performing comparably without a scaffolding layer — no persistence, no runtime daemons, no behavioral tier — would contradict Tables 1 and 3's generalization. Third, an audit log from another system that shows the decision/action-event ratio approaching 1:1 rather than clustering near 1:14 would contradict Table 4's generalization. Any of these would weaken the paper's framing and deserve a revision. I invite them.
What is open. The tier distinction in Table 3 applies to a behavioral layer that is itself a design choice rather than a necessity of the architecture. Agent systems that do not maintain ambient state between turns — simpler, stateless chat-agent architectures — would not have a Table 3 at all, because they would have no ambient tier to measure. Whether the ambient tier is load-bearing for operator experience, or whether it is overbuilt in this specific instance, is a question the paper does not answer. The third paper in this sequence treats the operator-side consequences of running such a system; the load-bearing question for this paper is only whether the layer exists and how much of it operates per turn.
What this means for builders. The practical advice this paper supports is thin but specific. If you are budgeting an agent system as "model plus tools," you are budgeting approximately 22% of what will ship. The other 78% is persistence, orchestration, runtime coordination, behavioral scaffolding, voice or UI, and tests. The 78% is not discretionary and will not wait. A common failure mode of new agent projects is shipping the 22% and discovering the 78% as emergency work mid-operation. The correction is to allocate attention to the scaffolding at the architectural-decision stage, not at the demo-next-week stage.
6. Related work
The question of what a system's size means is older than AI. Thomas McCabe's cyclomatic complexity (1976) and Maurice Halstead's software science (1977) are the canonical answers to what counts when you count code. Both were proposed as proxies for cognitive complexity and tested against defect-density and maintenance-cost outcomes. Both have known limits: complexity proxies do not capture architectural modularity, and lines of code as a crude measure was famously pilloried as "measuring the weight of a building to assess its structural quality." I use physical LOC here not as a complexity proxy but as a mass measurement — the question is where code lives, not how hard it is to understand — and I acknowledge upstream that any complexity conclusion drawn from this paper's numbers would be overreach.
Barry Boehm's COCOMO model (1981) and its descendants ask the inverse question: given size, what does it cost to produce? COCOMO's constants are tuned on commercial software projects, not on single-operator research builds, and the effort estimate for 248K LOC under COCOMO's basic model runs to hundreds of person-months, a multi-person team working for years; it is an estimate the 37-day build plainly contradicts. The interpretation I offer is not that the model is wrong; it is that single-operator AI-assisted builds sit outside COCOMO's training distribution. Paper 4's numbers are a data point from outside that distribution, not a refutation inside it.
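For concreteness, the basic-model arithmetic with the published organic-mode constants (size in KLOC, effort E in person-months, schedule T in calendar months); the figures are illustrative only, since the calibration data behind those constants is decades older than AI-assisted single-operator builds.

```latex
\begin{aligned}
E &= 2.4\,(\mathrm{KLOC})^{1.05} = 2.4 \times 248.8^{1.05} \approx 7.9 \times 10^{2}\ \text{person-months},\\
T_{\mathrm{dev}} &= 2.5\,E^{0.38} \approx 2.5 \times 790^{0.38} \approx 31\ \text{months}.
\end{aligned}
```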
Benchmark work on agent capability — WebArena (Zhou et al., 2024), SWE-bench (Jimenez et al., 2024), OSWorld (Xie et al., 2024) — treats agent systems from the outside, measuring task success rates against held-out suites. The inside view this paper offers is orthogonal: benchmarks measure what an agent can do; the decomposition measures what the agent is made of. A system can score well on a benchmark and still be a thin wrapper; a system can score poorly on a benchmark and still have a rich internal structure. The two axes are independent, and the wrapper critique lives on the internal axis. I cite the external-benchmark literature here because it is the closest existing body of work on agent systems, and because the internal axis has been under-attended relative to the external one.
The specific "wrapper" critique has been made publicly in several forms. The most common version is the dismissal of agent systems built on top of existing security CLIs as repackaging rather than engineering. I searched for the critique in published form and did not find a version prominent enough to cite as representative. The framing circulates in security-practitioner conversation without a canonical written source, which is itself a note about the state of the discourse: the dismissal is common enough to be an intellectual headwind for builders in the space, but diffuse enough that no single author carries it. This paper responds to the framing as it exists rather than to any particular statement of it.
Published work on single-operator software development at scale is sparse. Fabien Sanglard's deep-dive reconstruction work (on Doom, Wolfenstein 3D, Quake) is an adjacent genre but reverses the arrow — he audits finished systems; this paper audits a system under development. The closest analog I have found is the literature on solo-founder software-as-a-service operations (for example, the Software Engineering at Google book's chapter on team size versus code quality, or DHH's writing on the Basecamp stack), but those are operational rather than architectural and do not decompose by the four axes used here.
7. Limitations
I list these explicitly rather than softening them, because a decomposition paper whose limitations are hedged is a paper whose numbers should not be trusted. The five items below are reproduced from the evidence-dossier review and are meant to be read as features of the method, not concessions.
- Sample size: N = 1. One operator (Ali Hassan Assi), one workstation, one local-first architecture, 37 days of continuous development. Ratios and absolute numbers should be read as an existence proof that a local-first agent can decompose this way, not as population statistics. Generalization beyond this setting is open, and the disproof conditions are spelled out in §5.
- Methodology is heuristic. Category assignment to buckets in Table 1 is manual, guided by directory name and per-file responsibility, and could shift by one to three percentage points under a different rubric. Shell-wrapper detection in Table 2 is surface-level regex against dispatch-function bodies and could miss a wrapper that routes through a helper import; manual review of the 12 shell-flagged tools confirmed zero false positives, but false negatives cannot be ruled out without deeper call-graph analysis. LOC counts include blank lines, comments, and docstrings; statement-only counts would drop totals by roughly 15–20% uniformly. Tool-schema counts (258) diverge from dispatch-function counts (183) because some schemas are served by inline handlers in registry.py that do not have a standalone tool_* function; the 75-function gap is documented but not individually enumerated in this paper.
- No git history. The project was not under version control during the snapshot period. Commit cadence, regression history, and authorship timeline cannot be reconstructed from repository artifacts. This is a real limitation for establishing the 37-day build timeline empirically; the timeline given in this paper derives from file modification timestamps, audit-log first-seen dates, and the author's session records. Future development is tracked in a git repository; this manuscript and the evidence dossier serve as the t=0 reference point for that history.
- Snapshot date. 2026-04-22 for the physical-LOC and tool-enumeration numbers; 2026-04-22 16:10:28Z for the audit-log tail. The codebase is actively developed, and totals can shift by fractions of a percent within days. A re-snapshot on publication morning is part of the release procedure; bucket ratios and the four claims are stable across the 12-hour window between the evidence-dossier audit and this manuscript's draft.
- Self-reporting bias on operator-side observations. The consciousness-layer tiering in Table 3 rests on manual call-site verification in agents/worker.py, which is subject to the author's reading of the code. The classification has been cross-checked against the audit log's read-path events where they exist, but some modules (notably the inert tier) have no runtime signal available and are classified on the basis of their configuration gates and intended role. An external auditor might shift two or three modules across the live/ambient boundary without changing the paper's headline claim.
8. Conclusion
The wrapper framing predicts a shape that the four tables do not produce. The actual shape is a small hot loop of orchestration, policy, hypothesis, and tool dispatch, embedded in a much larger mass of scaffolding — persistence, runtime daemons, behavioral layer, voice pipeline, GUI, and tests. Across three independent axes (LOC distribution, tool dispatch classification, audit-event histogram), the scaffolding-to-decision ratio sits in the same order of magnitude, roughly 9:1 to 14:1. A fourth axis (consciousness-layer tiering) adds a distinction the wrapper framing has nowhere to put: behavioral modules come in three operational grades, and only about half reach the operator on any given turn.
What the decomposition method offers is a cheap empirical check against dismissals. Before accepting or rejecting a claim about what an agent system is, count. The count is not an argument about quality. It is a shape, and the shape is what the dismissal either fits or fails to fit.
What it does not offer is a quality verdict, a performance claim, or a prescription. The next paper in this sequence, if one is written, would take one of these four axes — most likely the ambient-state tier of the behavioral layer — and investigate whether it is load-bearing for operator experience or over-engineered for a single-operator setting. That is a question the tables cannot answer. But they can say, honestly and reproducibly, that the layer exists and that roughly eleven modules of it are doing work between turns.
Write the evidence before writing the defense. The defense, once the evidence is in, can be short.