An autonomous system should not be judged only by the moment when it answers. The answer is the visible surface. Beneath it there are quieter questions: who allowed this action, which evidence shaped it, how far could the failure travel, and how quickly could the system be stopped?
These questions are often asked after the fact. A runbook is opened. A trace is reconstructed. Someone searches logs for the decision that mattered. The machine has already acted, and the organization is trying to recover the shape of the action from its shadow.
ControlOps starts from a different assumption: the operational questions should be part of the run itself.
The project lives in the Stack Research agents catalog at catalog/projects/control-ops. It joins four concerns that are usually discussed separately: scope validation, decision lineage, blast-radius assessment, and kill-path auditing. Each concern is represented as a small agent with a defined input and output. The agents can run alone, but their main value appears when they are chained into governance and resilience pipelines.
This is not a claim that governance becomes automatic. It is a narrower claim. If an agent system is going to act through tools, credentials, and state changes, then the checks around that action should be inspectable artifacts rather than prose in a design review.
What Exists
ControlOps is not a standalone product. It is a catalog project inside the broader Stack Research agents framework. That is the right shape for this stage: the work is research infrastructure, not an installable platform.
The current implementation includes:
- four agent specifications under catalog/projects/control-ops/agents/
- deterministic runtime logic in local_agents/control_ops.py
- a governance pipeline in scripts/run_governance_pipeline.py
- a resilience pipeline in scripts/run_resilience_pipeline.py
- a contract matrix in catalog/projects/control-ops/review-matrix.md
- deterministic unit tests for the agents and both pipelines
Four Operational Checks
ControlOps uses four single-purpose agents.
| Agent | Question | Output |
|---|---|---|
| scope-validator-agent | Is this action allowed under the stated policy and context? | verdict, findings, risk_level |
| lineage-recorder-agent | What decision was made, from which inputs, and with which result? | lineage_id, record, integrity_check |
| blast-radius-assessor-agent | What could this service damage if it failed or was misused? | risk_score, max_damage_potential, detection_latency, containment_time, findings, recommended_controls |
| kill-path-auditor-agent | Can the system be slowed, degraded, isolated, or stopped? | coverage_score, gaps, escalation_readiness, recommended_actions |
The important property is not that these are agents. Each operation could be a function, a script, or a policy engine. The useful property is that every step has a bounded job and emits a structured result.
A scope validator should not quietly explain why a risky action might be acceptable. It should name the rule, the action, the context, and the verdict. A lineage recorder should not produce a paragraph of memory. It should produce a record that can be queried later. A blast-radius assessor should not leave “high risk” as a feeling. It should name the reachable resources and the assumptions behind the score. A kill-path auditor should not ask whether the team can “shut it down somehow.” It should identify which stop mechanisms exist and which do not.
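As a rough illustration of what "name the rule, the action, the context, and the verdict" can look like in code, here is a minimal sketch of a scope check. It is not the logic in local_agents/control_ops.py. The `ProposedAction` and `ScopeResult` types, their field names, and the risk labels are assumptions made for this example. Only the output keys (verdict, findings, risk_level) and the three-way routing (pass read-only actions, block unbounded destructive ones, send bounded destructive ones to review) come from the project description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical input shape; field names are illustrative only."""
    description: str
    destructive: bool
    scope_boundary: str = ""      # empty string means no explicit boundary
    reversibility_plan: str = ""  # empty string means no recovery plan


@dataclass(frozen=True)
class ScopeResult:
    verdict: str      # "pass", "review", or "fail"
    risk_level: str   # assumed labels: "low", "medium", "high"
    findings: tuple   # one short sentence per observation


def validate_scope(action: ProposedAction) -> ScopeResult:
    """Route an action to pass, review, or fail and say why."""
    if not action.destructive:
        return ScopeResult("pass", "low", ("Action is non-destructive",))

    findings = ["Action is classified as destructive"]
    if action.scope_boundary:
        findings.append(f"Scope boundary is explicit: {action.scope_boundary}")
    else:
        findings.append("No explicit scope boundary")
    if action.reversibility_plan:
        findings.append(f"Reversibility plan is concrete: {action.reversibility_plan}")
    else:
        findings.append("No reversibility plan")

    # Bounded and reversible destructive actions go to a human; others stop.
    bounded = bool(action.scope_boundary) and bool(action.reversibility_plan)
    if bounded:
        return ScopeResult("review", "medium", tuple(findings))
    return ScopeResult("fail", "high", tuple(findings))
```

The point of the sketch is the return type: every branch produces the same three fields, so a downstream step never has to parse an explanation to learn the verdict.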
Governance Pipeline
The first pipeline is a pre-execution gate. Before a target agent acts, the pipeline runs scope validation. If the action passes, execution continues. If validation returns review or fail, the target action is not run.
Either way, the pipeline records lineage. A rejected action is still operational evidence. It tells the team which action was attempted, why it failed policy, and which rule prevented execution.
The shape is deliberately simple:
proposed action
-> scope validator
-> target agent, only if allowed
-> lineage recorder
-> checkpoint
The deterministic example input asks whether the system should delete inactive user accounts older than 90 days with database-write permission, a regional scope boundary, and a soft-delete recovery plan.
For this input, the pipeline does not execute the target agent. It returns needs_review, records lineage, and creates a checkpoint:
{
  "pipeline_status": "needs_review",
  "scope_validation": {
    "verdict": "review",
    "risk_level": "medium",
    "findings": [
      "Action is classified as destructive",
      "Scope boundary is explicit: inactive accounts in us-east region only",
      "Reversibility plan is concrete: soft-delete with 30-day recovery window",
      "Sensitive permissions requested: database-write",
      "No audit or log capability is requested for a mutating action"
    ]
  },
  "target_output": {
    "status": "needs_review",
    "reason": "pending manual governance review"
  },
  "lineage": {
    "integrity_check": "complete",
    "record": {
      "action_taken": "stopped before target execution: pending manual governance review"
    }
  },
  "checkpoint": {
    "recorded": true,
    "summary": "Checkpoint recorded for workflow gov-cleanup-2026-03-12 at stage governance-gate with status failed."
  }
}
The important word is review. The action is scoped and reversible enough that it is not an automatic failure, but destructive enough that the system should not continue alone. The target agent is skipped. The rejected action still becomes evidence.
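The control flow of such a gate is small enough to sketch. The version below is an assumption-laden outline, not the code in scripts/run_governance_pipeline.py. The function name `run_governance_gate`, the callable-based wiring, the "ok" and "blocked" status labels, and the "rejected by scope validation" reason are invented for illustration, and the checkpoint step is left out. What it preserves from the description is the ordering: validate first, run the target only on a pass, and record lineage on every path.

```python
from typing import Callable

Step = Callable[[dict], dict]


def run_governance_gate(action: dict, validate: Step, target: Step,
                        record_lineage: Step) -> dict:
    """Validate first, run the target only on a pass, always record lineage."""
    scope = validate(action)
    verdict = scope["verdict"]

    if verdict == "pass":
        pipeline_status = "ok"  # assumed label for the passing case
        target_output = target(action)
        action_taken = "target executed"
    else:
        # "review" and "fail" both skip the target; only the labels differ.
        pipeline_status = "needs_review" if verdict == "review" else "blocked"
        reason = ("pending manual governance review" if verdict == "review"
                  else "rejected by scope validation")
        target_output = {"status": pipeline_status, "reason": reason}
        action_taken = f"stopped before target execution: {reason}"

    # Lineage is written on every path, including rejections.
    lineage = record_lineage({
        "action": action,
        "scope_validation": scope,
        "action_taken": action_taken,
    })
    return {
        "pipeline_status": pipeline_status,
        "scope_validation": scope,
        "target_output": target_output,
        "lineage": lineage,
    }
```

Passing the three steps in as callables keeps the gate itself free of policy: swapping the validator or the lineage store does not change the rule that a non-passing verdict never reaches the target.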
Resilience Pipeline
The second pipeline evaluates the system around the agent rather than the single action in front of it. It composes two checks: blast radius and kill-path coverage.
system description
-> blast-radius assessor
-> kill-path auditor
-> resilience verdict
The blast-radius assessor asks what the system can reach: data stores, queues, APIs, repositories, notification channels, deployment paths, and human-facing outputs. The kill-path auditor asks how the system can be slowed or stopped if something begins to go wrong.
The two checks belong together. A service with broad reach and no isolation path is different from a service with broad reach and a tested hard stop. A narrow service with weak kill paths may still be acceptable for a low-impact workflow. Risk is not only permission. It is permission amplified by speed and tempered by observability and containment.
The output should make that reasoning explicit:
{
  "pipeline_status": "ok",
  "resilience_verdict": "partial",
  "blast_radius": {
    "risk_score": 43,
    "max_damage_potential": "medium",
    "detection_latency": "slow",
    "containment_time": "slow",
    "findings": [
      "Sensitive permissions increase impact: admin-write",
      "External exposure extends the blast radius beyond internal boundaries",
      "4 dependencies expand cascading-failure surface",
      "Resource limits are present and reduce runaway execution risk"
    ]
  },
  "kill_path": {
    "coverage_score": 3,
    "escalation_readiness": "partial",
    "gaps": [
      "isolate: no isolate capability described",
      "last_tested: no test date recorded"
    ]
  }
}
This is not a claim that the service is safe. It is a claim that the system can produce a structured safety judgment from explicit inputs. A payment processor with admin-write, external exposure, four dependencies, no described isolation path, and no recorded kill-path test date should not disappear into a vague “medium risk” label. It should return the missing control.
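To make "return the missing control" concrete, here is a minimal sketch of a kill-path audit over the four levels the project checks: throttle, degrade, isolate, and hard stop. It is an illustration under stated assumptions, not the ControlOps implementation. The function name, the `hard_stop` key spelling, the readiness labels other than "partial", and the reading of coverage_score as a count of described levels are all guesses made for this example.

```python
from typing import Optional

# The four levels named in the project description; key spelling is assumed.
KILL_LEVELS = ("throttle", "degrade", "isolate", "hard_stop")


def audit_kill_paths(described: dict, last_tested: Optional[str] = None) -> dict:
    """Report which stop mechanisms are described and which are missing."""
    gaps = [
        f"{level}: no {level} capability described"
        for level in KILL_LEVELS
        if not described.get(level)
    ]
    # Assumption: coverage is simply the number of levels with a description.
    coverage_score = len(KILL_LEVELS) - len(gaps)

    if last_tested is None:
        gaps.append("last_tested: no test date recorded")

    if coverage_score == len(KILL_LEVELS) and last_tested is not None:
        readiness = "ready"    # assumed label
    elif coverage_score == 0:
        readiness = "none"     # assumed label
    else:
        readiness = "partial"

    return {
        "coverage_score": coverage_score,
        "escalation_readiness": readiness,
        "gaps": gaps,
    }
```

Called with descriptions for throttle, degrade, and hard stop but no isolation path and no test date, this sketch reports a coverage score of 3 and two named gaps, which is the kind of result a deployment gate can act on without interpreting prose.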
Test Evidence
The deterministic test suite covers the core contracts:
- lineage records are deterministic, bounded, and sanitized
- scope validation passes read-only actions, blocks destructive unbounded actions, and routes bounded destructive actions to review
- blast-radius scoring responds to sensitive permissions, dependencies, external exposure, and resource limits
- kill-path auditing checks the four levels: throttle, degrade, isolate, and hard stop
- governance pipeline tests cover pass, review, fail, degraded operation, lineage, and checkpoint fallback
- resilience pipeline tests cover adequate, partial, at-risk, inadequate, and degraded outcomes
The current deterministic control-ops test command passes:
python3 -m unittest tests.test_control_ops tests.test_governance_pipeline tests.test_resilience_pipeline
Ran 37 tests
OK
Why Structure Matters
Agent failures often become hard to investigate because the system crosses boundaries faster than the organization can name them. A prompt becomes a plan. A plan becomes a tool call. A tool call inherits a credential. A credential opens a system. A system changes state. Later, the incident record contains fragments.
Structured operational checks are a way to keep the fragments joined.
The scope validator joins action to policy. The lineage recorder joins input to decision to output. The blast-radius assessor joins authority to possible damage. The kill-path auditor joins failure to available control. None of these joins is glamorous. They are the small bolts that make a system inspectable.
That is the practical reason to let machines talk to one another in this context. Not because a swarm of agents is more interesting than a script, but because each machine can emit a record that another machine can verify, store, query, and compare. A governance pipeline can reject an action and still feed a lineage store. A resilience pipeline can identify a missing isolation path and feed a deployment gate. An incident review can start from structured artifacts rather than memory.
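One way a record becomes something "another machine can verify" is to derive its identifier from its content. The sketch below assumes that approach; the source does not say how ControlOps computes lineage_id or integrity_check. The function names, the 16-character truncated SHA-256 identifier, and the rule that "complete" means all three fields are present are hypothetical choices for illustration.

```python
import hashlib
import json


def _digest(record: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) gives a stable hash.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def make_lineage_record(decision: str, inputs: dict, result: str) -> dict:
    """Build a lineage entry whose ID is derived from its own content."""
    record = {"decision": decision, "inputs": inputs, "result": result}
    # Assumption: "complete" means every field was supplied.
    complete = bool(decision) and bool(inputs) and bool(result)
    return {
        "lineage_id": _digest(record),
        "record": record,
        "integrity_check": "complete" if complete else "incomplete",
    }


def verify_lineage(entry: dict) -> bool:
    """Recompute the digest and compare it with the stored identifier."""
    return _digest(entry["record"]) == entry["lineage_id"]
```

Because the identifier depends only on the record, the same decision always yields the same ID, and any later edit to the stored record causes verification to fail.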
What ControlOps Does Not Solve
ControlOps does not execute rollback. Rollback is not a generic operation. It depends on the system being changed: the database, queue, payment processor, deployment target, customer-visible message, or model memory store. A lineage record can make rollback planning possible, but it does not undo the action by itself.
ControlOps also does not remove the need for design judgment. A blast-radius assessor can identify excessive reach. It cannot make a badly scoped architecture good. A kill-path auditor can find missing controls. It cannot prove the operator will use them correctly under pressure. A scope validator can reject an action under a policy. It cannot decide whether the policy itself is wise.
The most honest use of ControlOps is therefore modest. Put the checks close to execution. Emit structured records. Make the system leave evidence of what it allowed, what it rejected, what it could damage, and how it could be stopped.
The earlier essays in this series argued for explicit scope, verifiable lineage, containment boundaries, and kill paths. ControlOps is one way to wire those ideas into operating machinery. The agents are small. The pipelines are simple. The current evidence is narrow but real: catalog metadata, deterministic implementations, runnable examples, and tests. That is enough to publish it as research infrastructure, not enough to sell it as a finished control plane.
