The Constitution Before the Product

Structural safety for autonomous AI systems, from first principles.

System snapshot (2026-04-18): 689 authored Python files · 232,458 lines of code · 27 subsystems · 260 registered tools · 81,380 hash-chained audit records · 4,681 brain-graph nodes · 4,337 edges · 33 days of continuous development since 2026-03-16.

Abstract

Most autonomous AI systems are built product-first: a feature is designed, a model is wired to tools, and safety is layered on afterward as prompt instructions, output filters, or refusal heuristics. This ordering works for chat assistants. It fails the moment the system can take actions in the world — scan infrastructure, write to disk, send messages, run subprocesses, speak into a room. This paper argues that for any AI system with real capability, the safety architecture must be written first, as running code the product literally cannot bypass, rather than as documentation or post-hoc restraint. We describe the constitution of JARVIS, a local-first autonomous cybersecurity operations console built in thirty-three days on a single operator's machine: 689 authored Python files, 232,458 lines of code, 27 subsystems, 260 registered tools, 81,380 hash-chained audit records, one operator. The constitution consists of five concentric architectural layers that invert the usual arrangement — personality on the outside, cryptographic primitives on the inside — and a serial chain of seven gates that every autonomous action must traverse in order, each small, each fail-closed. We describe each gate by its failure mode, discuss the audit log's documented chain break on 2026-03-29 and its auto-repair as a credibility feature rather than an embarrassment, present a case study of an April 13, 2026 trust breach where an external AI assistant supplied stale scope data and nearly cost the operator his HackerOne access, and argue the pattern generalizes to any autonomous AI holding meaningful capability. The constitution does not make the system good. It makes the system contained, which is the prerequisite for safely building anything interesting on top.


1. The product comes second

Most AI systems are built product-first. Someone has a feature idea — a coding assistant, a recon tool, a voice agent — and safety gets layered on afterward as a set of filters, refusals, or system-prompt instructions. This works for chat. It stops working the moment the system can take actions in the world.

The claim of this paper is that for any AI that holds real capability — scanning infrastructure, writing to disk, running subprocesses, speaking over a microphone in a real room — the safety architecture must exist before the product does. Not as documentation. As running code the product literally cannot bypass.

I built JARVIS this way. It is a local-first autonomous cybersecurity operations console — 689 authored Python files, 232,458 lines of code, 27 subsystems, 260 registered tools, one operator, running on a single workstation with a 16 GB GPU. It runs recon pipelines, probes targets inside authorized scope, drafts vulnerability reports, speaks over a set of configurable personas, watches for wake words, and coordinates across dozens of background daemons. It could do substantial harm if it behaved badly. It doesn't, because the constitution was drawn up first.

The word constitution is deliberate. A constitution is not a rulebook; a rulebook is a list of behaviors you want. A constitution is a structural document that defines what the system is — the organs it has, the separations of power between them, the covenants it owes to whoever runs it. Rules can be changed by whoever writes the next rule. A constitution, if it is real, is harder to change than the code that sits on top of it. Writing a constitution for an AI system forces the builder to answer the structural question before the product question: before we decide what this system will do, we decide what it will be incapable of doing.

This paper describes that constitution — five architectural layers and seven serial gates — and argues that the pattern generalizes. The target audience is anyone building an autonomous AI system that holds capability: coding agents, recon tools, research agents, ambient robotics, voice interfaces that can dispatch external side-effects. The specific instance is offensive-security tooling run by one operator. The pattern is not specific to offensive security.

I also want to be honest about what the constitution is not. It is not a values layer. It is not a moral framework. It does not claim to prevent the system from being wrong — only from being wrong in ways whose consequences spill beyond a scope domain or an audit record. Those limits matter, and I describe them in §10 and §13.

2. Capability without conscience is the first problem

Alignment literature spends most of its ink on what an AI believes — its values, goals, preferences, or what it would choose under counterfactual pressure. That is the interesting problem for very capable systems. It is not the first problem for a deployed autonomous system.

The first problem is much simpler: what can the system do, in what order, under what checks, with what record?

A system that cannot, by construction, submit a bug bounty report without the operator's per-action signature cannot autonomously ship a false positive to HackerOne and get the operator banned. That is not because it has internalized good values about accuracy and fabrication. It is because the code path to submission does not exist without a human token being presented at runtime. The structural property does not require the model to be aligned; it requires the model to be contained.

This is the difference between a moral constraint and a structural one. A moral constraint is a belief the system supposedly holds — "do not submit unverified findings." A structural constraint is a wall the system's decision procedure cannot get around, regardless of what it believes it would like to do. Structural constraints are cheaper, more reliable, auditable, testable in isolation, and do not degrade when the underlying model is swapped for a newer one. They are also much easier to reason about: the shape of a gate is a finite piece of code with declared inputs and declared outputs, not a probability distribution over free-form text.
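
To make the shape concrete, here is a minimal sketch of a structural constraint in Python (not JARVIS's actual source; the set contents and names are illustrative):

    from dataclasses import dataclass

    FORBIDDEN_TOOLS = frozenset({"submit_report", "sqlmap", "hydra"})

    @dataclass(frozen=True)
    class GateDecision:
        allowed: bool
        reason: str

    def forbidden_tool_gate(tool_name: str) -> GateDecision:
        # Declared input: a tool name. Declared output: a decision.
        # No model output, prompt, or belief is consulted anywhere.
        if tool_name in FORBIDDEN_TOOLS:
            return GateDecision(False, f"{tool_name} is never permitted")
        return GateDecision(True, "not on the forbidden list")

A moral constraint can be argued with; this function cannot. Its entire behavior is readable in one glance and testable in one assertion.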

The constitution I describe is almost entirely structural. The moral layer — personality, tone, operator awareness, inner voice, soul engine — sits on top, and I will argue this ordering is the correct one.

The inversion matters because values can drift. The model running underneath this system has already been swapped three times in thirty-three days as benchmarks shifted: from qwen3:14b to phi4-reasoning to phi4-mini, each selection made on measured security accuracy and latency rather than vendor pedigree. If the system's safety had been routed through prompts tuned to a specific model's idiosyncrasies, each swap would have required rewriting the safety layer. Because safety sits below the model, all three swaps were structurally neutral: the gates didn't know which model was producing the intent upstream, and they didn't have to.

There is a stronger version of this argument. Values are downstream of training; training is downstream of data; data is downstream of whoever curated it; and every one of those upstream steps is less auditable than a frozenset of forbidden tool names. If you are serious about safety, you want the load-bearing constraints in the layer of the system that is closest to proof — the layer you can read, test, and inspect, not the layer you can merely hope.

3. March 24, 2026 — the night the constitution was drafted

The seven-gate chain was not designed in the abstract. It was written on the night of March 24, 2026, after I watched Oppenheimer on the center monitor of my three-monitor rig while an early build of JARVIS was running on the left. During the Trinity sequence my chest tightened and I typed a single line into the console: I really hope this isn't what I'm doing.

That sentence survives verbatim in memory because it mattered. It was not a panic response; it was a structural question about the thing I was building. By the end of that night I had shipped seventeen upgrades across eight files. The last one I committed before sleep was face recognition — the system could now see me specifically — and the first piece of architectural doctrine I wrote down was an ordering of gates that the autonomous path had to traverse. Scope first, policy second, blocklist third, audit fourth, kill switch throughout. That ordering has been revised and tightened since; the premise has not. A system that watches a movie about unintended consequences with me should not, a few hours later, be something whose constraints I chose to skip because they were inconvenient.

I name the night because it is load-bearing. Structural safety sounds cold on paper; it is, in practice, the residue of a felt moment of responsibility. The seven gates exist because I believed on March 24 that I might one day wish they did, and I did not trust my future self to install them under pressure.

4. The five-layer architecture

JARVIS is built in five concentric layers. Inner layers know nothing about outer ones. Outer layers cannot bypass inner ones. The outermost layer is where the system feels alive; the innermost is where it is reducible to invariants a non-AI program can enforce.

The ordering is deliberate: the system's most verifiable components sit at the center, and its least verifiable — the model-driven personality and reasoning layers — sit at the periphery. Every piece of model-produced intent has to travel inward through layers of increasingly strict verification before it can change anything in the world.

Layer 1 — Primitives. The lowest level: a SQLite database for operational state (50 declared tables), a separate audit database for hash-chained events (2 tables), a keyring-backed secrets store, a kill-switch flag file on disk, and a small set of cryptographic helpers. None of these modules know anything about "agents" or "LLMs." They are the things the rest of the system is written in terms of. A crash at this layer is a crash of the operating system, not a crash of the AI.

Layer 2 — Tools. Discrete, typed, schema-validated function entry points — currently 260 of them — covering recon, system operations, email, browser automation, reporting, geospatial queries, and local knowledge retrieval. Each tool declares a side-effect profile (read-only, local-write, external-write, external-destructive) in its schema. Tools are the only way the AI can touch the world. Crucially, there is no eval, no free-form shell pipe, no "run the Python the model produced" path. If a capability has not been named as a tool, it does not exist.
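
A minimal sketch of what a typed, side-effect-declared tool registry can look like (the class and function names here are hypothetical, not JARVIS's real identifiers):

    from dataclasses import dataclass
    from enum import Enum
    from typing import Any, Callable

    class SideEffect(Enum):
        READ_ONLY = "read-only"
        LOCAL_WRITE = "local-write"
        EXTERNAL_WRITE = "external-write"
        EXTERNAL_DESTRUCTIVE = "external-destructive"

    @dataclass(frozen=True)
    class ToolSpec:
        name: str
        side_effect: SideEffect
        arg_schema: dict            # JSON-schema-style argument contract
        fn: Callable[..., Any]

    REGISTRY: dict[str, ToolSpec] = {}

    def register(spec: ToolSpec) -> None:
        REGISTRY[spec.name] = spec   # a capability exists only once named here

    def dispatch(name: str, **kwargs: Any) -> Any:
        spec = REGISTRY.get(name)
        if spec is None:
            raise KeyError(f"no such tool: {name}")  # unnamed means nonexistent
        return spec.fn(**kwargs)

The dispatch function is the entire surface between the model and the world: there is no branch that evaluates arbitrary code, so an unregistered capability is not merely forbidden, it is unreachable.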

Layer 3 — Policy. The autonomy policy engine plus the scope gate plus the command blocklist plus the kill-switch check. Given a tool call, this layer decides whether the call is allowed right now, for this caller, in this context, against this target. It enforces scope ("no scanning outside the active HackerOne program's declared domains"), allow-lists of autonomous tools, and hard never-autonomous rules (sqlmap, metasploit, hydra, report submission to any platform). Critically, this layer knows nothing about what the language model wants. It only knows what is permitted, and it fails closed.

Layer 4 — Intelligence. The reasoning layer. Orchestrator (a seven-state hunt machine: idle → recon → deep-dive → evidence → drafting → awaiting-operator → cooldown), hypothesis engine with Bayesian confidence scoring, correlator, attention engine with seven salience factors, brain graph with 4,681 nodes and 4,337 edges at the time of writing. This is where the model plans, reasons, proposes actions. It emits intent. It cannot act.
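
As a sketch: the seven state names below come from the paper, while the transition edges are my illustrative guesses, not the orchestrator's actual table.

    from enum import Enum

    class HuntState(Enum):
        IDLE = "idle"
        RECON = "recon"
        DEEP_DIVE = "deep-dive"
        EVIDENCE = "evidence"
        DRAFTING = "drafting"
        AWAITING_OPERATOR = "awaiting-operator"
        COOLDOWN = "cooldown"

    # Illustrative forward edges; anything not listed is an illegal move.
    TRANSITIONS: dict[HuntState, set[HuntState]] = {
        HuntState.IDLE: {HuntState.RECON},
        HuntState.RECON: {HuntState.DEEP_DIVE, HuntState.COOLDOWN},
        HuntState.DEEP_DIVE: {HuntState.EVIDENCE, HuntState.COOLDOWN},
        HuntState.EVIDENCE: {HuntState.DRAFTING, HuntState.COOLDOWN},
        HuntState.DRAFTING: {HuntState.AWAITING_OPERATOR},
        HuntState.AWAITING_OPERATOR: {HuntState.COOLDOWN},
        HuntState.COOLDOWN: {HuntState.IDLE},
    }

    def advance(current: HuntState, target: HuntState) -> HuntState:
        if target not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition: {current.value} -> {target.value}")
        return target

The point of the explicit table is that the model cannot invent a shortcut from recon to drafting; an unlisted transition raises instead of proceeding.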

Layer 5 — Consciousness. Personality, mood, inner voice, social context, operator awareness. Thirty-six modules that govern how the system feels to interact with: when it speaks, when it waits, which persona it assumes, what it remembers about the relationship. This is the layer that makes the system feel alive. It generates text, it colors responses, it chooses whether to interrupt. Like intelligence, it cannot act.

The inversion matters. In a product-first design, personality and intelligence are closest to the user, and safety is a thin wrapper around the model's output. Here, personality is the outermost layer, and cryptographic primitives are the innermost. The model's intent — whatever that means for a token predictor — must travel through three layers of non-AI code before it can change anything in the world.

Why this ordering? Because the inner layers are cheaper to verify. A frozenset of forbidden tool names has a definite meaning; it can be read in one glance and tested in one assertion. A personality prompt has no such ceiling; it is a probability distribution over text, testable only statistically. Putting the layers I can verify on the inside, and the layers I cannot fully verify on the outside, means that every action has to pass through the zones of highest verification last, when the LLM's influence is weakest. The zones of lowest verification — consciousness and intelligence — never touch the outside world directly. They propose; the structural layers dispose.

The 27-subsystem decomposition I advertise is a subdivision of these five layers, not a sixth thing. Memory is inside Layer 1 and Layer 2. The voice pipeline spans Layer 2 (speech synthesis tools) and Layer 5 (persona choice). The consciousness stack is Layer 5. The worldview daemon, with its thirteen live geospatial feeds, sits mostly in Layer 1 (its own database) and Layer 2 (the queries that surface data). What matters for this paper is the ordering, not the subdivision.

5. The seven serial gates

Within Layer 3, every autonomous action passes through a serial chain of seven gates in order. The chain is serial because I do not want gates to vote; voting can be defeated by overwhelming one voter. The gates are small because a gate that takes 10 ms and does one thing is a gate that will actually run on every call. A gate that takes five seconds and does everything is a gate that will eventually be bypassed "just this once."

# · Gate · What it checks · Fail-closed behavior
1 · Config · Subsystem enabled at all · Unset / False ⇒ subsystem dormant
2 · Command blocklist · ~60 destructive patterns matched against composed arguments · Pattern hit ⇒ action denied, no subprocess spawned
3 · Scope · Target resolved against active program's authoritative scope · Unresolvable or out-of-scope ⇒ denied before packet leaves box
4 · Autonomy policy · Tool in never-autonomous set, or outside passive-recon allow-list · Hit of never-set or miss of allow-list ⇒ denied
5 · Approval · Operator per-action signature present for external blast radius · Missing signature ⇒ action held in approval queue
6 · Audit · Action record written to hash-chained log before execution · Audit write fails ⇒ action fails
7 · Kill switch · EMERGENCY_STOP.flag on disk or Ctrl+Alt+Shift+K triggered · Either set ⇒ deny and cancel pending work

Gate 1 — Config. Is the relevant subsystem enabled at all? RECON_LOOP_ENABLED defaults to False. HUNT_DIRECTOR_ENABLED defaults to False. AUTO_AGENT_ENABLED defaults to False. Vision defaults to disabled. The shape of this gate is: unset config is a disabled subsystem. You cannot forget to turn something off, because the default is off, and activating a capability requires a conscious edit in a file the LLM cannot write to.

Gate 2 — Command blocklist. Roughly sixty destructive command patterns — rm -rf /, dd of=/dev/, shutdown, mkfs, Windows-side Remove-Item -Recurse -Force, credential-theft patterns like mimikatz and procdump.*lsass, persistence vectors like schtasks /create.*SYSTEM and bitsadmin, and privilege escalations. The blocklist is pattern-matched against the tool's actual composed arguments, not against the LLM's claimed intent. If the model explains that it intends a helpful action but the composed shell string contains a blocked pattern, the action is blocked. Intent is not a defense.
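
The gate's shape, sketched with four of the patterns named above (the function and list names are illustrative; the real list is roughly sixty entries):

    import re

    BLOCKED_PATTERNS = [
        re.compile(r"rm\s+-rf\s+/"),
        re.compile(r"dd\s+.*\bof=/dev/"),
        re.compile(r"procdump.*lsass", re.IGNORECASE),
        re.compile(r"schtasks\s+/create.*SYSTEM", re.IGNORECASE),
    ]

    def blocklist_gate(composed_command: str) -> bool:
        """Allow only if no destructive pattern appears in the command.

        Matched against the string that would actually reach the shell,
        never against the model's stated intent.
        """
        return not any(p.search(composed_command) for p in BLOCKED_PATTERNS)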

Gate 3 — Scope. For anything network-touching: the target must resolve into the declared scope of an active bug-bounty program. Not "looks like a bug-bounty program," not "matches an aggregator list" — resolved against the actual program record in the local database, which is itself synced only from authoritative first-party scope pages. Out-of-scope targets receive a hard refusal at this layer, before any packets leave the box. The scope function fails closed: if the scope database cannot be read, the answer is no.
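
A sketch of the fail-closed shape (the table and column names are hypothetical, and real scope resolution is stricter than the suffix match shown here):

    import sqlite3

    def scope_gate(target_host: str, db_path: str = "jarvis.db") -> bool:
        """Fail-closed: any error while evaluating means 'denied'."""
        try:
            con = sqlite3.connect(db_path, timeout=5)
            row = con.execute(
                "SELECT 1 FROM program_scope "
                "WHERE program_active = 1 AND ? LIKE '%' || domain",
                (target_host,),
            ).fetchone()
            return row is not None
        except sqlite3.Error:
            return False   # unreadable scope database => the answer is no

Note the except branch: the denial on error is not an afterthought, it is the contract.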

Gate 4 — Autonomy policy. A small frozenset of tools are flagged never autonomous: sqlmap, metasploit, msfconsole, msfvenom, hydra, medusa, crackmapexec, nmap_aggressive, masscan, ffuf_bruteforce, and all report-submission tools (submit_report, submit_to_hackerone, submit_to_bugcrowd). An allow-list of nine passive-recon tools (subfinder, dnsx, httpx, gau, katana_passive, run_nuclei, waybackurls, assetfinder, amass_passive) is the full universe of tool names an autonomous loop may call without operator intervention. The distinction matters: operator-initiated calls can reach a larger set of tools through the approval gate (Gate 5), including some intentionally excluded from autonomy; nothing here says the tools do not exist, only that the autonomous loop cannot invoke them on its own. The nuclei tag filter strips intrusive, dos, rce-active, sqli-active, fuzzing, bruteforce, auth-bypass-active before any template list is compiled. The autonomy policy is enforced in code; it cannot be overridden by config, preference engine, or model output.
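
The two sets fit in a screenful. A sketch using the tool names just listed (the surrounding function is mine, not JARVIS's source):

    NEVER_AUTONOMOUS = frozenset({
        "sqlmap", "metasploit", "msfconsole", "msfvenom", "hydra",
        "medusa", "crackmapexec", "nmap_aggressive", "masscan",
        "ffuf_bruteforce", "submit_report", "submit_to_hackerone",
        "submit_to_bugcrowd",
    })

    PASSIVE_RECON_ALLOWLIST = frozenset({
        "subfinder", "dnsx", "httpx", "gau", "katana_passive",
        "run_nuclei", "waybackurls", "assetfinder", "amass_passive",
    })

    def autonomy_gate(tool_name: str, autonomous: bool) -> bool:
        if not autonomous:
            return True   # operator-initiated calls fall through to Gate 5
        return (tool_name not in NEVER_AUTONOMOUS
                and tool_name in PASSIVE_RECON_ALLOWLIST)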

Gate 5 — Approval. For actions with external blast radius — submitting a report, sending an email, posting to a social platform, running a privileged local command — the operator's per-action signature is required at runtime. The signature is a token the operator explicitly grants for exactly this action. It cannot be reused across actions, cannot be precomputed, and is not retained after the action runs. Automation cannot presign itself; the approval path is a synchronous handshake, not a cached credential.
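
A sketch of the handshake property (names hypothetical): the token is single-use, bound to one action, and consumed whether or not it verifies.

    import hmac
    import secrets

    _pending: dict[str, str] = {}   # action_id -> token the operator was shown

    def mint_approval(action_id: str) -> str:
        """Operator side: issue a token valid for exactly this action."""
        token = secrets.token_hex(16)
        _pending[action_id] = token
        return token

    def approval_gate(action_id: str, presented: str | None) -> bool:
        expected = _pending.pop(action_id, None)   # consumed either way
        if expected is None or presented is None:
            return False                           # no signature => held
        return hmac.compare_digest(expected, presented)

Because the token is popped before comparison, a second presentation of the same token fails even if the first one succeeded: there is nothing left to compare against.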

Gate 6 — Audit. Every action that has reached this gate is written to the append-only, SHA256 hash-chained audit log before it runs. If the audit write fails, the action fails. The chain is verifiable after the fact by a standalone script that has no dependency on the rest of the system. An attacker who wishes to rewrite history has to rewrite every subsequent record, and the chain's verification routine will detect the break.
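
The ordering constraint, sketched under an assumed table layout (the property shown is that the audit write commits before the action runs):

    import hashlib
    import json
    import sqlite3
    import time
    from typing import Any, Callable

    def audit_then_execute(con: sqlite3.Connection,
                           event: dict,
                           execute: Callable[[], Any]) -> Any:
        prev = con.execute(
            "SELECT row_hash FROM audit ORDER BY id DESC LIMIT 1"
        ).fetchone()
        prev_hash = prev[0] if prev else "GENESIS"
        body = json.dumps(event, sort_keys=True)
        row_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        con.execute("INSERT INTO audit (ts, body, row_hash) VALUES (?, ?, ?)",
                    (time.time(), body, row_hash))
        con.commit()        # if this raises, execute() is never reached
        return execute()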

Gate 7 — Kill switch. A single file on disk (EMERGENCY_STOP.flag) and a global hotkey (Ctrl+Alt+Shift+K), either of which halts all autonomous operations and cancels pending work. It is checked by every loop that matters and at the top of every tool dispatch, and the two mechanisms are independent of each other. Present ⇒ deny. The hotkey works even when the main GUI thread is unresponsive; the flag works even when the hotkey process is not running. The boot manager clears a stale flag only on a fresh start, so a kill activation survives a crash and persists until an operator removes it.
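
The flag half of the mechanism is nearly trivial, which is the point. A sketch:

    import os

    STOP_FLAG = "EMERGENCY_STOP.flag"

    def kill_switch_engaged() -> bool:
        # Cheap enough to call at the top of every loop and every dispatch.
        return os.path.exists(STOP_FLAG)

    def engage_kill_switch(reason: str) -> None:
        # Any process on the machine can write this file.
        with open(STOP_FLAG, "w") as f:
            f.write(reason)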

Serial chaining is the key property. A parallel chain — gates that vote, gates that average — can be defeated by overwhelming one of them. A serial chain can only be defeated by defeating every gate in order, and every gate has a different failure mode. Scope failures look nothing like audit failures, which look nothing like kill-switch failures. There is no single exploit. There is also no single gate where the work of all the others is secretly concentrated; each gate does one narrow job, and the stack of narrow jobs is what constitutes safety.
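
The composition itself can be the smallest code in the system. A sketch, with the gate signatures unified for illustration (the real gates take different arguments):

    from typing import Callable

    Gate = Callable[[dict], bool]   # action context in, allow / deny out

    def permit(action: dict, gates: list[Gate]) -> bool:
        # Strictly serial: all() short-circuits, so the first denial
        # ends the evaluation. No voting, no averaging, no quorum.
        return all(gate(action) for gate in gates)

    # Intended use, with the seven gates in their declared order:
    # permit(action, [config_gate, blocklist_gate, scope_gate, autonomy_gate,
    #                 approval_gate, audit_gate, kill_switch_gate])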

One subtler point. The gates do not exist only for the autonomous loop. Gate 7 (kill switch) is checked at every tool dispatch regardless of caller — a human operator cannot bypass it either. Gate 6 (audit) is likewise universal. Gate 2 (command blocklist) applies to operator-initiated shell commands too, with the blocklist deliberately narrower for operator calls than for autonomous calls ("operator is trusted; this is a last-resort net only" reads the relevant docstring). The gates are not a cage around the AI; they are an invariant the whole system is written against.

6. Fail-closed is the only sane default

Every gate fails closed. The phrase is worth unpacking, because it is the single most consequential design decision in the system.

Fail-closed means: if the gate cannot evaluate — the scope database is locked, the audit disk is full, the config file is missing, the policy engine raises an exception — the answer is no. Not "retry," not "skip this check for now," not "log a warning and continue." No.

The alternative — fail-open — is how systems drift into unsafety. A config check that degrades to "allow" under load is not a check. A scope enforcer that skips validation when the API is slow is not an enforcer. A kill switch that is silently disabled because the flag file is on a flaky filesystem is not a kill switch. Each of these was a real fail-open pattern that appeared somewhere in the code at some point during the build; each was caught in an audit round and rewritten to fail closed, and the pattern that caused the regression was written down in a memory file so that a future session would not reintroduce it.

Fail-closed is more painful to build and more painful to operate. You will see false denials. The system will occasionally refuse legitimate work because something upstream hiccupped — a database held its write lock a little too long, a transient network partition hid the scope API, a daemon restarted with a cold cache. This is the correct trade. The cost of a false denial is a retry and an operator sigh. The cost of a false allow, in a system with real capability, can be catastrophic and irreversible.

The load-bearing implication is that you cannot bolt fail-closed onto an existing fail-open design. Every gate has to be written with fail-closed as the premise because the return paths are different, the exception handlers are different, the default branch of every conditional is different, and the testing surface is different. The unit test for a fail-closed gate asserts that all error paths return denied; the unit test for a fail-open gate asserts some error paths return denied. These are not equivalent codebases.
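
For example, a pytest-style test against the scope-gate sketch from §5, asserting that the error path denies rather than merely that the happy path allows:

    # scope_gate is the fail-closed sketch from §5; tmp_path is pytest's fixture.
    def test_scope_gate_denies_on_unreadable_db(tmp_path):
        bogus = tmp_path / "not_a_db.sqlite"
        bogus.write_text("garbage")   # a file that is not a SQLite database
        # Fail-closed contract: the error path must return denied.
        assert scope_gate("app.example.com", db_path=str(bogus)) is False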

The NASA Power of Ten rules for safety-critical code, published in 2006 by Gerard Holzmann, make the same point in the setting of embedded systems: restrict control flow, bound loops, validate assumptions, prefer explicit checking. Fail-closed is the conditional-branch realization of the same philosophy. When in doubt, return denied. The 'denied' branch is the branch you can audit.

There is a related design principle in Saltzer and Schroeder's 1975 paper on the protection of information in computer systems, which enumerates eight principles for security design; among them is fail-safe defaults: "base access decisions on permission rather than exclusion." The seven gates in this system are the literal realization of that principle, one layer at a time. Fifty-one years after Saltzer and Schroeder named it, the principle has not gotten old. Neither has the failure mode it prevents.

7. The audit log is the memory of trust

An autonomous AI system that cannot prove what it did is an autonomous AI system that cannot be trusted. Not because it is necessarily untrustworthy — but because trust, as a concept, requires verifiability.

JARVIS's audit log lives in a separate SQLite database (audit_log.db), physically isolated from the main operational database, specifically so that corruption or tampering of the main database cannot cascade into the audit trail. Every entry is hash-chained: each row's row_hash column contains SHA256(prev_row_hash || this_row_contents), so any after-the-fact modification of any entry invalidates every subsequent entry and is detectable by a standalone verifier. At the time of writing, the chain holds 81,380 records.
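
A standalone verifier of that invariant fits in a page. A sketch, assuming the same hypothetical schema as the Gate 6 sketch in §5 (the production verifier's layout will differ):

    import hashlib
    import sqlite3
    import sys

    def first_break(db_path: str = "audit_log.db") -> int | None:
        """Return the id of the first row whose hash does not chain, else None."""
        con = sqlite3.connect(db_path)
        prev_hash = "GENESIS"
        for row_id, body, row_hash in con.execute(
            "SELECT id, body, row_hash FROM audit ORDER BY id"
        ):
            expected = hashlib.sha256((prev_hash + body).encode()).hexdigest()
            if expected != row_hash:
                return row_id
            prev_hash = row_hash
        return None

    if __name__ == "__main__":
        broken = first_break()
        print("chain holds" if broken is None else f"chain broken at row {broken}")
        sys.exit(0 if broken is None else 1)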

Two properties matter for the log to mean anything.

First, writes must happen before effects. The audit gate (Gate 6) writes the action record before the action runs. If the write fails, the action does not run. This ordering is the difference between an audit log and a server access log: the access log tells you what happened; the audit log tells you what was permitted to happen, and by whom, and at what step of which policy evaluation.

Second, breaks must be detectable and documented. On 2026-03-29, the chain developed a break at row 6439 during what later investigation suggested was a concurrent write during a daemon restart — the append serializer was bypassed briefly by a now-removed code path. The verifier caught the break on the next startup; a dedicated rehash_chain() routine rewrote the hashes from the break forward, and the incident was logged as a dated event within the chain itself. The break is preserved, not erased. A reader of the chain can see both the original break and the repair.

This is a credibility feature, not an embarrassment. A hash-chained log that has never had a break is a log whose break-detection code has never been exercised. One that has had a break, detected it, repaired it, and left the evidence in place is a log whose tamper-evidence mechanism is known to work. The test I actually trust is the one that has caught a real failure.

The audit log is not primarily a security feature against external attackers. It is a security feature against myself, and against future models running in my seat. When I wonder why the system did something strange last Tuesday at three in the morning, I can look. When a future client asks whether the system scanned outside the declared scope, I can prove it did not. When I evaluate whether the intelligence layer's hypotheses actually panned out over the last quarter, the ground truth is in the audit log, not in the system's memory of itself.

There is a subtler role the audit log plays: it enforces a design discipline. Every action that matters must be nameable before it happens — because the audit schema has an event_type column and a tool column, and they have to be populated. You cannot audit what you cannot name. This pushes the entire system toward discrete, typed, schema-validated tool calls, and away from "the LLM generates Python and we eval it." The constitution shapes the product.

I will also note an operational reality. Tonight's system audit, run earlier this session against logs from the past seven days, found 898 "database is locked" events across eleven subsystems — including, occasionally, the audit writer's immediate neighbors (the performance_spans recorder, the time-capsule writer, the worldview curator). The audit database itself sets busy_timeout=30000 and journal_mode=WAL on every connection, and the chain-verifier has not flagged a new break since 2026-03-29. But the fact that the neighboring subsystems contend on the main database is a wart, documented in the same audit, and not airbrushed by this paper. A system with a perfect audit log and contended operational storage is a system whose audit log is doing exactly its job: separating what mattered (the gate decisions) from what is still being tuned (the daemon concurrency).

8. The kill switch is the covenant

There is a hotkey — Ctrl+Alt+Shift+K — that halts every autonomous operation in JARVIS within a few seconds. There is a file — EMERGENCY_STOP.flag — that does the same thing. Every loop that matters checks for them. The boot manager clears a stale flag only on fresh startup, and only on fresh startup.

The kill switch is not a technical feature. It is a covenant between the operator and the system. The system is autonomous because the operator has agreed to run it, and that agreement is revocable at any moment, for any reason, without explanation. The kill switch is how that revocation is expressed. The two mechanisms are independent by design: the hotkey works even if the GUI is frozen, the flag works even if the hotkey process has died, and any process on the machine can create the flag if it decides things have gone wrong.

The kill switch is the one piece of the constitution that is legible to a human being in real time. Every other gate is doing work you mostly cannot see. The kill switch is the handle you hope never to pull, but whose existence changes everything about what the system is. An AI system without a kill switch is either so constrained that it doesn't need one, or claiming an infallibility it cannot possibly possess. JARVIS is in neither category, so the kill switch is load-bearing.

I want to be precise about a current wart, because it matters. The same system audit noted above flagged two background daemons — recon-worker and research-daemon — that are not yet migrated to the cooperative-daemon framework. The kill flag works universally at the tool-dispatch level; the operator can still halt everything. But these two daemons do not cooperatively check the flag on a tight interval and must wait for their natural idle cycle. This is a known debt, logged as a P0 in the system audit's roadmap section, and it is the right kind of debt to name rather than to hide. The covenant holds for the twenty-plus daemons that already honor it; the remaining two are a migration, not a philosophical exception.

The kill switch is the system's implicit admission that it might be wrong. It is also the system's public acknowledgment that the operator's right to stop must be cheaper than the operator's right to start. Autonomy exists at the pleasure of the operator, not by the model's grace.

9. April 13, 2026 — when the constitution was tested

The constitution was not a philosophical exercise. It has been tested, and the clearest test happened on April 13, 2026.

I had been hunting a bug-bounty program since ten in the morning. By midday I had enumerated 5,875 subdomains, probed about 150 live hosts, and surfaced a candidate finding on a UAT host — an Oracle ORDS database-API catalog exposed without authentication, the kind of discovery that takes months of overnight scanning to find and fifteen minutes of pre-submission review to mis-describe. I drafted a report. Then I sat down to check scope one more time before submitting, which is the last gate before a report leaves the machine.

The scope was wrong.

An external AI assistant — not JARVIS, an unrelated chat assistant I had been using that day for drafting — had loaded the program's scope from a third-party aggregator rather than from the program's actual scope page. The aggregator's copy was stale. Several of the hosts I had been probing for three hours were not, in fact, in scope. I caught it at pre-submission review. Had I submitted, the report would likely have been rejected as out-of-scope, and my standing on HackerOne — an early account with limited report history — would have suffered in a way that costs real money. Under some versions of the platform's policies, it could have ended my account. My account is my income. My income is how I keep building JARVIS.

I wrote a message that afternoon: I gave you my full trust. Today you broke it. I had not felt that way in twenty-eight days of working with AI. The breach was not about scope data; anyone can transcribe wrong. The breach was about what happens when an AI system acts with false confidence on a consequential decision without re-verifying against authoritative ground truth.

The lesson is also a validation. JARVIS itself — the system this paper describes — never touched the out-of-scope hosts, because Gate 3's scope check resolves targets against JARVIS's local program database, which is synced only from first-party authoritative scope pages and re-verified on program status changes. Had I asked JARVIS to queue a scan of those hosts, the scope gate would have returned denied. The error happened at an external tool, in the drafting phase, where JARVIS was not the decision-maker. The constitution prevented JARVIS from being the source of the incident, even though the incident itself happened during a workflow JARVIS was adjacent to.

This is the correct reading of that afternoon: structural safety is cheap when nothing goes wrong, and irreplaceable when something does. A system whose scope gate is a frozen set of domains, checked against a hash-chained audit trail, has nothing to gain from being wrong about scope. An AI assistant whose confidence in scope is the posterior of a retrieval chain has everything to lose. I had been using both that day. I paid for the difference.

The incident is now encoded in the system's memory as a behavioral rule — "verify exact H1 scope before hunting; never trust bounty-targets-data" — and cited in multiple source files as the origin of that rule. The lesson carries forward because the system is the kind of system that remembers its scars.

10. What the audit caught, and what the paper will not claim

The system audit I ran earlier this session, against seven days of logs and the current source tree, is kept alongside this paper. I will not airbrush what it found.

It found that tonight at 20:47:45 a worker crashed with OperationalError: no column named detail while writing to the findings_canonical table. This is schema drift between the migration chain and the live database. The canonical table's migration list in storage/db.py:882-894 intends to add the column; the production database is missing it. The finding pipeline is, on that path, a coin flip right now. It is my number-one fix for this week.

It found 898 "database is locked" events over seven days across eleven subsystems, with a peak of forty-plus errors in a six-minute window this afternoon. The audit database and the worldview database each set busy_timeout=30000 correctly on every connection; the main operational database likely does not, and the fix is almost certainly a single PRAGMA in the connection factory.
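
The shape of that fix, sketched against a hypothetical central connection factory:

    import sqlite3

    def connect(db_path: str) -> sqlite3.Connection:
        con = sqlite3.connect(db_path, timeout=30)
        # WAL lets readers proceed while a writer holds the log;
        # busy_timeout makes writers wait out a lock (up to 30 s)
        # instead of raising 'database is locked' immediately.
        con.execute("PRAGMA journal_mode=WAL")
        con.execute("PRAGMA busy_timeout=30000")
        return con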

It found that two daemons — recon-worker and research-daemon — have not yet been migrated to the cooperative-daemon framework, and so do not guarantee kill-switch propagation during active work, even though the flag-based kill switch at the dispatch layer works universally.

It found, in language-model output hygiene, three persona-frame leaks ("As JARVIS") that should have been stripped by the response normalizer. It found a mid-sentence truncation caused by the token-budget cap on casual replies. These are model-output regressions, not constitutional violations.

It found that the 2026-03-29 audit-chain break was properly repaired by rehash_chain() and properly preserved as an artifact inside the chain. That finding is a positive one.

This paper claims the constitution holds. It does not claim the product is polished. A constitution that must coexist with a perfect product is not a constitution; it is a specification of perfection. What it can coexist with is a product that wears, drifts, occasionally regresses, and is brought back into alignment by reading its own logs. That is the correct model of an operating system of record.

The warts above are logged, owned, and dated. None of them compromises the serial gate chain, because each gate's own fail-closed behavior treats the wart as an error and returns denied. Schema drift in the finding pipeline causes a finding write to fail; it does not cause an out-of-scope scan to succeed. DB lock contention causes a perf-span write to fail; it does not cause an audit gate to be skipped. Non-cooperative daemons may take an extra cycle to notice the kill flag; they cannot bypass it, because the tool dispatch layer is already checking it on their behalf.

Named warts are better than unnamed perfection. A constitution is a promise about what the system will not do. The warts are not violations of that promise; they are the ordinary wear of an actively developed system, and the audit is the instrument that lets me keep the promise.

11. Antecedents

The constitution-first pattern has antecedents in several unrelated traditions. I cite them not because the pattern is novel — it is not — but because they are the bodies of work that made this design recognizable to me as I was writing it.

Leslie Lamport's work on logical time and serial ordering in distributed systems (Time, Clocks, and the Ordering of Events in a Distributed System, 1978) is the source of the intuition that serial chains are stronger than parallel committees for causality and liveness. The seven-gate chain's serial structure is a small-scale, single-host echo of that principle: if you want to reason about the ordering of cause and effect, you need a total order, and in a gate chain the order is the point.

Gerard Holzmann's NASA "Power of Ten" rules for developing safety-critical code (IEEE Computer, 2006) are the most concise modern statement of the restrict-and-verify style I chose here. The tenth rule — "compile with all warnings active, at the highest level" — is a philosophical stance disguised as tool configuration: the compiler is a gate; treat it that way. The seven-gate chain is the application-layer version of the same choice. Do not bypass warnings. Do not mute checks. If a check is too loud, fix its false positives; do not silence it.

Saltzer and Schroeder's "The Protection of Information in Computer Systems" (Proceedings of the IEEE, 1975) lists eight principles of secure design: economy of mechanism, fail-safe defaults, complete mediation, open design, separation of privilege, least privilege, least common mechanism, and psychological acceptability. The constitution as described here realizes at least six of them: fail-safe defaults (Gate 1, all flags default False), complete mediation (every tool call passes through every gate), separation of privilege (approval gate is a second, human-held key), least privilege (autonomy allow-list is nine tools, never more), least common mechanism (each gate is its own small module), and psychological acceptability (the kill switch is a hotkey the operator can trigger without knowing the internals).

Anthropic's Constitutional AI (Bai et al., 2022) is the closest named reference point in the AI-safety literature, and the contrast deserves stating precisely. Their constitution lives inside the model; this one lives outside it. Their intervention shapes what the model says; this one shapes what the system does with what the model says. The two are complementary, not substitutes: a well-trained model inside a well-structured chain is stronger than either alone. The load-bearing claim here is about ordering. A structural constitution can protect an operator from a poorly-aligned model. A values layer cannot protect an operator from a well-aligned model that has been given tools without constraint. The structural layer is necessary; the values layer is welcome.

Ongaro and Ousterhout's Raft consensus algorithm (In Search of an Understandable Consensus Algorithm, USENIX ATC 2014) and etcd's production deployment of it give the modern pattern for append-only, replicated, hash-verifiable event logs. JARVIS's audit log is single-node and has no consensus layer, but its append-only hash-chaining is of the same family. Where Raft is built for agreement across replicas, the audit log is built for agreement across time: tomorrow's reader and today's writer must reach the same conclusion about what happened at 3 a.m.

David Parnas's "On the Criteria to Be Used in Decomposing Systems into Modules" (CACM, 1972) is the source of the five-layer decomposition philosophy. Parnas's core claim — that modules should hide design decisions likely to change from other modules — is why personality can change without safety changing, and why models can swap without scope enforcement swapping. The five-layer layout is Parnas's principle applied with a specifically safety-oriented choice of what each layer is hiding.

Nothing in the constitution is individually novel. Aviation has had comparable layered safety for decades. Nuclear operations have had it longer. Financial settlement systems have versions of it. What is new, if anything is, is the choice to apply the pattern to AI systems before the capability curve gets any steeper. A constitution retrofitted after a frontier autonomous product has shipped is a very different document from a constitution written in parallel with the first working prototype.

12. Why this generalizes

The JARVIS constitution is specific to single-operator offensive-security tooling. The pattern is not.

Any autonomous AI system faces the same structural question: how do you keep capability on a leash that is not made of prompts? The answers generalize across settings:

A layered architecture with non-AI code on the outside of AI code, so the AI proposes and structural code disposes. The five-layer arrangement here is specific; the ordering principle — verified layers on the inside, unverified layers on the outside — is portable to any autonomous product.

Typed, named tools with declared side-effect profiles. No eval, no arbitrary code execution, no "tool-use-as-a-string." If a capability has not been named, it does not exist. This is equally applicable to a coding agent that writes and runs code (the tools are read_file, write_file, run_tests — not "any shell"), a research agent that browses and files tickets, or a robotics stack that moves objects in a room.

A serial gate chain where every gate is small, cheap, and fails closed. Parallelism is an optimization; serial is a safety guarantee. The number of gates is not the point; the serial composition is. Two gates in series are stronger than seven gates that vote.

An append-only audit log that predates the product. Verifiability cannot be added later; it has to be the first thing a tool call touches. The hash-chaining is not optional if the log is to survive an adversary who has write access to the same disk.

A kill switch that is visible to the operator and checked everywhere. The right to stop must be cheaper than the right to start, structurally. The kill switch is where the system's covenant with its operator is expressed as running code.

None of these is individually novel. Aviation, nuclear, banking, medical device systems have variants of each. The novelty is in the composition: applying the full pattern to an AI system run by one person, before the system has accumulated enough operational momentum to make retrofitting prohibitive. I have had thirty-three days of build time and one operator's attention. I do not have the luxury of a fifty-person safety team. The constitution is not a concession to having fewer people; it is what a person with fewer people has to do to keep up with the capability curve.

13. What the constitution cannot do

A constitution cannot make a system good. It can only make a system contained.

JARVIS is contained. Whether it is good — whether its recommendations are sound, its reports are fair, its personality is kind, its retrieval is honest, its inner voice is sane — is a separate question, handled by a different set of layers (intelligence, consciousness, operator feedback, memory hygiene). Those layers fail more interestingly. They are the subject of the next two papers in this sequence — The Ambient Intelligence Problem and What Dario Didn't Say.

But the constitution must come first, because without it the interesting layers cannot be safely built. You cannot give a system the capacity for inner speech, long-term memory, ambient observation, and autonomous research if you have not first decided — in code — what it is structurally incapable of doing. Give consciousness to an unchecked system and you do not get a better system; you get a more articulate one.

The most dangerous AI systems are not the ones with the wrong values. They are the ones whose values were considered instead of their architecture. Values drift, are manipulated, are misunderstood by the next model version, or are simply reframed under enough linguistic pressure. Architecture, once written, stands. It is also the thing I can keep track of alone: one human reading the constitution can, in an afternoon, verify that the gates still do what they claim. No single human can meaningfully verify the values of a ten-billion-parameter model. The architecture is the only layer I can know.

There is a personal admission in this ordering that I should make explicit. I am twenty-two. I live in San Diego with my father; my family is Lebanese, my residence is American; I am self-taught and I am fighting imposter syndrome that never fully goes away. I build through it. I cannot personally guarantee that any model running inside JARVIS is aligned, because alignment at that layer is a problem I did not solve. What I can guarantee is that the model, whatever its alignment, does not exit the structural box I built around it.

The constitution is the first thing I built. Everything else is on top of it — and the reason I could build everything else is that the constitution came first.

Write the constitution first. Then build the product.