The Human-in-the-Loop Problem: When Should an AI Agent Ask for Permission?

Every enterprise deploying AI agents hits the same wall. It usually arrives around week three, after the initial excitement fades and the first real-world edge case appears.

An agent drafts a customer email — and sends it without review. An agent queries a database — and the query joins on a table containing salary data. An agent calls an external API — and includes an API key in the request body because the tool's documentation said to.

The immediate reaction is to lock everything down: every tool invocation needs human approval. And that works — for about a day. Then the humans approving 200 tool calls per hour start clicking "Approve" without reading. Or they stop using the agent entirely because the interruptions make it slower than doing the work manually.

This is the human-in-the-loop (HITL) calibration problem: how do you determine which actions need human oversight and which can be automated?

MPP's answer is a quantitative system that replaces gut-feel policy with a scoring model.

The Spectrum of Risk

Not all tool invocations carry the same risk. Consider four examples:

A string formatting tool — no network, no filesystem, no credentials, pure computation. Risk: near zero.
A tool that reads a local config file — filesystem access to one specific path, read-only. Risk: low.
A tool that queries a production database — network access to an internal hostname, reads potentially sensitive data. Risk: medium to high.
A tool that calls an external payment API — outbound network, handles financial data, uses credentials. Risk: high to critical.

Under a binary HITL model (either "always ask" or "never ask"), you either interrupt the user for the string formatter or let the payment API call sail through. Both are wrong. The right answer is a system that distinguishes between these cases automatically.

Sensitivity Scoring

MPP assigns every tool invocation a sensitivity score — a number from 0 to 100 that quantifies how much risk the invocation carries. The score is computed from the tool's manifest at the time of invocation, using a weighted formula that considers multiple dimensions:

The Scoring Dimensions

Capabilities breadth. A tool that declares network, filesystem, and environment access is inherently more sensitive than one that declares no capabilities. Each capability category contributes to the score.

Capability specificity. A tool that declares network access to ["api.example.com"] is less sensitive than one that declares ["*.example.com"] — and far less sensitive than one that declares ["*"] (all domains). Wildcards increase the score; specific domains reduce it.
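
One way to picture the specificity dimension is as a three-step classification of each declared domain pattern. The sketch below is illustrative only: the `Specificity` enum and the `classify` and `broadest` functions are hypothetical names, not part of MPP's manifest schema, and the rule that any `*` other than a bare `"*"` counts as a subdomain wildcard is an assumption.

```rust
/// How broadly a declared network pattern matches.
/// Hypothetical type: ordered from narrowest to broadest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Specificity {
    Exact,             // e.g. "api.example.com"
    SubdomainWildcard, // e.g. "*.example.com"
    GlobalWildcard,    // "*" (all domains)
}

/// Classify a single declared domain pattern.
pub fn classify(pattern: &str) -> Specificity {
    if pattern == "*" {
        Specificity::GlobalWildcard
    } else if pattern.contains('*') {
        Specificity::SubdomainWildcard
    } else {
        Specificity::Exact
    }
}

/// A declaration is only as specific as its broadest entry.
/// Returns `None` when no domains are declared at all.
pub fn broadest(patterns: &[&str]) -> Option<Specificity> {
    patterns.iter().map(|p| classify(p)).max()
}
```

Under these assumptions, a tool declaring both "api.example.com" and "*.example.com" would be treated as a subdomain-wildcard tool, since the broadest entry is what an attacker could exploit.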

Resource sensitivity. Some resources are inherently more sensitive than others. Environment variable access (which may expose credentials) scores higher than read-only filesystem access to a designated directory. Filesystem write access scores higher than read access.

Security level declaration. The tool author declares a security level in the manifest: low, medium, high, or critical. This self-declared level is a starting point — it can be overridden by policy — but it anchors the score.

Historical behaviour. Over time, a tool that consistently produces clean audit logs, never triggers privacy filters for unexpected patterns, and never exceeds its declared capabilities builds a lower effective sensitivity score. A tool that frequently triggers privacy filters or produces errors trends higher. (This is a future capability under active development.)

Score Computation

The sensitivity score is not a single formula published as an API spec — it is configurable by the host operator. But the default computation follows this general structure:

score = security_level_weight              // low=10, medium=30, high=60, critical=90
      + capability_count × 5               // each capability category adds 5
      + wildcard_penalty                   // wildcards in network/filesystem add 10-20
      + credential_exposure_penalty        // env var access adds 15
      + write_access_penalty               // filesystem write adds 10
      - specificity_discount               // narrow resource lists subtract 5-15

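A string formatting tool with no capabilities and low security level scores roughly 10. A database query tool with network access to one hostname and medium security level scores roughly 30 to 35: 30 for the level, plus 5 for one capability category, minus a small specificity discount for the single hostname. A payment API tool with network access, environment variable access and high security level scores roughly 75 (60 + 10 + 15, minus a discount of about 10).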
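
The default computation can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not MPP's implementation: `ManifestSummary` is a hypothetical, flattened view of the manifest, the wildcard penalty and specificity discount are passed in as already-chosen values within the ranges quoted above, and clamping the result to 0–100 is an assumption made so the output stays on the documented scale.

```rust
/// Self-declared security level from the tool manifest.
#[derive(Debug, Clone, Copy)]
pub enum SecurityLevel {
    Low,
    Medium,
    High,
    Critical,
}

/// Hypothetical summary of the manifest fields the formula reads.
#[derive(Debug, Clone)]
pub struct ManifestSummary {
    pub security_level: SecurityLevel,
    pub capability_count: u32,     // capability categories declared
    pub wildcard_penalty: u32,     // 0, or 10-20 when wildcards are present
    pub env_access: bool,          // environment variable access declared
    pub fs_write: bool,            // filesystem write access declared
    pub specificity_discount: u32, // 0, or 5-15 for narrow resource lists
}

/// Apply the default weights from the formula above.
pub fn sensitivity_score(m: &ManifestSummary) -> u32 {
    let level_weight: u32 = match m.security_level {
        SecurityLevel::Low => 10,
        SecurityLevel::Medium => 30,
        SecurityLevel::High => 60,
        SecurityLevel::Critical => 90,
    };
    let credential_penalty = if m.env_access { 15 } else { 0 };
    let write_penalty = if m.fs_write { 10 } else { 0 };

    let raw = level_weight
        + m.capability_count * 5
        + m.wildcard_penalty
        + credential_penalty
        + write_penalty;

    // Assumption: never go below 0 and cap at 100.
    raw.saturating_sub(m.specificity_discount).min(100)
}
```

With a discount of 5 for the database tool and 10 for the payment tool (both assumed values), the three example tools come out at 10, 30, and 75. Because the operator can reconfigure the weights, treat these numbers as a reading of the defaults rather than fixed outputs.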

The Four Tiers

The sensitivity score maps to a confirmation level — one of four tiers that determines what happens before the tool runs:

Tier 1: None (Score 0–25)

No human involvement. The tool runs immediately. The user may not even see a notification.

This tier is for tools that are computationally isolated and can't cause harm: formatters, parsers, calculators, schema validators. They have no capabilities, low security level, and minimal blast radius.

The HITL overhead for these tools should be exactly zero. An agent that asks permission to capitalise a string is an agent that won't be used.

Tier 2: Notify (Score 26–50)

The user is notified that the tool ran, but not asked for permission. The notification is informational — it appears in an activity log, a sidebar, or a toast notification.

This tier is for tools with limited capabilities and manageable risk: a tool that reads a local config file, a tool that queries a read-only API endpoint, a tool that accesses a specific directory.

The user maintains awareness without being interrupted. If they see a notification that concerns them, they can investigate. But the default path is that the tool runs and the agent continues.

Tier 3: Confirm (Score 51–75)

The user must explicitly approve the invocation before it runs. The agent's workflow pauses. A dialog appears showing:

Which tool is requesting execution
Who published it (publisher identity and key ID)
What capabilities it needs (specific network domains, filesystem paths, env vars)
The computed sensitivity score and the reason for the confirmation requirement

The user reviews and clicks Approve or Deny. If they don't respond within a configurable timeout, the invocation is denied by default (fail-closed).
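
The fail-closed behaviour is the part most worth getting right, so here is a minimal sketch of it using only the Rust standard library. The `Decision` enum and `await_decision` function are hypothetical names; the sketch assumes the dialog delivers the user's answer over a channel, which may not match how a real host wires up its UI.

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// The user's answer to a confirmation dialog (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Approve,
    Deny,
}

/// Block until the user answers or the timeout elapses.
/// Both a timeout and a closed channel resolve to `Deny`,
/// so silence can never be read as consent.
pub fn await_decision(responses: &Receiver<Decision>, timeout: Duration) -> Decision {
    match responses.recv_timeout(timeout) {
        Ok(decision) => decision,
        Err(_) => Decision::Deny,
    }
}
```

The design choice to note is that every error path maps to `Deny`. If the dialog crashes, the channel disconnects and the invocation is refused, which is the same outcome as an unanswered prompt.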

This tier is for tools with meaningful risk: database access, external API calls, tools that handle business-critical data. The interruption is justified because the potential impact is significant.

Tier 4: Multi-Factor (Score 76–100)

The user must provide additional authentication — a second factor, a manager approval, or a confirmation on a separate device — before the tool runs.

This tier is for the highest-risk operations: tools that access payment systems, tools that modify production infrastructure, tools that handle PII at scale. The overhead is significant, and that's the point. These operations should be rare, deliberate, and attributable to a specific human.

Multi-factor confirmation is the ceiling. If a tool consistently requires multi-factor approval and the user finds it too onerous, the correct response is to evaluate whether the tool's capabilities can be narrowed — not to lower the threshold.
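
The default score-to-tier mapping described above is small enough to sketch directly. The `ConfirmationLevel::Notify` variant appears in MPP's policy examples; the other variant names and the `level_for_score` function are assumptions made for illustration, as is the choice to treat any out-of-range score as Multi-Factor.

```rust
/// The four confirmation tiers, ordered from least to most oversight.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ConfirmationLevel {
    None,
    Notify,
    Confirm,
    MultiFactor,
}

/// Map a sensitivity score to its default tier.
pub fn level_for_score(score: u32) -> ConfirmationLevel {
    match score {
        0..=25 => ConfirmationLevel::None,
        26..=50 => ConfirmationLevel::Notify,
        51..=75 => ConfirmationLevel::Confirm,
        // 76-100, plus anything unexpectedly above 100 (assumed fail-closed).
        _ => ConfirmationLevel::MultiFactor,
    }
}
```

Under this mapping a score of 25 runs silently while a score of 26 produces a notification, so small changes to the scoring weights near a boundary can change what the user sees.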

Policy Configuration

The sensitivity scoring and tier mapping are defaults. Enterprise teams can — and should — customise them.

Per-Package Overrides

let policy = PermissionPolicy::new()
    .trust_package("org.mycompany.internal-db-tool")
    .set_confirmation_level("org.mycompany.internal-db-tool", ConfirmationLevel::Notify);

This says: "We trust our internal database tool. It was developed by our team, signed with our key, and we've reviewed its source code. Drop the confirmation level from Confirm to Notify."

The override doesn't disable security. The Gatekeeper still verifies the signature. The sandbox still enforces capabilities. The privacy filter still runs. What changes is the HITL requirement — because the organisation has determined, based on its own risk assessment, that this specific tool doesn't need interactive approval.

Per-Publisher Trust

let policy = PermissionPolicy::new()
    .trust_publisher(acme_public_key)
    .auto_approve_from_trusted(ConfirmationLevel::Notify);

This says: "We trust everything published by Acme. Their packages are verified by their Ed25519 key. Set the maximum confirmation level to Notify for all their tools."

Publisher-level trust is useful when an organisation has vetted a publisher through a procurement or security review process. Rather than reviewing each tool individually, the trust extends to the publisher's entire catalogue.
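
To make the interaction between the two kinds of override concrete, here is a sketch of how a host might resolve the effective confirmation level. Everything in it is an assumption for illustration: `PolicySketch` is not MPP's `PermissionPolicy`, publishers are identified by a plain key-ID string, and the precedence (package override first, then publisher cap, then the score-derived level) is one plausible ordering rather than documented behaviour.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ConfirmationLevel {
    None,
    Notify,
    Confirm,
    MultiFactor,
}

/// Hypothetical stand-in for the host's policy state.
#[derive(Default)]
pub struct PolicySketch {
    /// Explicit per-package levels.
    pub package_levels: HashMap<String, ConfirmationLevel>,
    /// Key IDs of trusted publishers.
    pub trusted_publishers: HashSet<String>,
    /// Maximum level applied to trusted publishers' tools, if set.
    pub trusted_cap: Option<ConfirmationLevel>,
}

impl PolicySketch {
    /// Resolve the level for one invocation. `scored` is the level
    /// derived from the sensitivity score before any overrides.
    pub fn resolve(
        &self,
        package: &str,
        publisher_key: &str,
        scored: ConfirmationLevel,
    ) -> ConfirmationLevel {
        // Assumed precedence: an explicit package override wins outright.
        if let Some(level) = self.package_levels.get(package) {
            return *level;
        }
        // A trusted publisher caps the level but never raises it.
        match self.trusted_cap {
            Some(cap) if self.trusted_publishers.contains(publisher_key) => scored.min(cap),
            _ => scored,
        }
    }
}
```

Note that the publisher cap uses `min`, so a trusted publisher's low-risk formatter still runs at None rather than being raised to Notify. Signature verification, sandboxing, and privacy filtering sit outside this function and are unaffected by its result.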

Global Defaults

let policy = PermissionPolicy::new()
    .auto_approve_low(true)          // Score 0-25: no confirmation
    .default_confirm_medium(true)    // Score 26-50: notify → confirm
    .require_multifactor_critical(true); // Score 76+: always MFA

Global defaults let security teams set the baseline posture for the entire deployment. A conservative posture bumps every tier up by one. A permissive posture (for development environments) drops every tier down by one.
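
Shifting every tier by one step can be sketched as a small helper. The `shift` function and the saturating behaviour at both ends (None cannot go lower, Multi-Factor cannot go higher) are assumptions for illustration and are not part of the builder API shown above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfirmationLevel {
    None,
    Notify,
    Confirm,
    MultiFactor,
}

/// The tiers in ascending order of oversight.
const LADDER: [ConfirmationLevel; 4] = [
    ConfirmationLevel::None,
    ConfirmationLevel::Notify,
    ConfirmationLevel::Confirm,
    ConfirmationLevel::MultiFactor,
];

/// Move a level up (positive steps) or down (negative steps) the ladder,
/// stopping at the first and last tiers.
pub fn shift(level: ConfirmationLevel, steps: i32) -> ConfirmationLevel {
    // Fieldless enum variants cast to 0, 1, 2, 3 in declaration order.
    let index = level as i32;
    let shifted = (index + steps).clamp(0, LADDER.len() as i32 - 1);
    LADDER[shifted as usize]
}
```

A conservative posture would then apply `shift(level, 1)` to every score-derived level, and a development posture `shift(level, -1)`.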

Building Trust Over Time

The HITL system is not static. As teams use MPP, a pattern emerges:

Week 1: Conservative policy. Most tools require Confirm-level approval. The team is learning what tools do and how they behave.

Month 1: Per-package overrides. Frequently used tools that have been reviewed and produce clean audit logs are moved to Notify. The team approves less often but still monitors.

Month 3: Publisher-level trust. Trusted internal publishers and vetted third-party publishers get blanket Notify treatment. Confirm is reserved for new tools and untrusted publishers.

Month 6: The audit log contains thousands of clean invocations. The team has evidence — not opinion — that specific tools, publishers, and capability profiles are safe. Policy reflects this evidence.

This is the intended trajectory: from high oversight to calibrated oversight, driven by data rather than guesswork. The sensitivity scoring system provides the gradations. The audit log provides the evidence. The policy configuration provides the mechanism.

The Anti-Patterns

Always Approve

The most common failure mode: configuring every tool to auto-approve. This eliminates HITL entirely and provides no opportunity to catch unexpected behaviour before it impacts users or systems.

This is appropriate only for sandboxed development environments with no access to production resources. In any other context, it means accepting unknown risk.

Always Confirm

The second most common failure mode: require explicit confirmation for every tool invocation, regardless of risk. This trains users to approve without reading — rubber-stamping — and eliminates the value of having approvals at all. Human attention is a finite resource, and if you spend it on string formatters, it won't be there for payment processors.

Static Policy

Setting a policy once and never updating it. Six months later, the team is still manually approving tools that have been invoked 10,000 times without incident. The audit log has the evidence to justify lowering the tier, but nobody has reviewed it.

Scheduled policy reviews (monthly or quarterly) should include audit log analysis to identify candidates for tier changes — both upward and downward.
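
A periodic review of this kind could start from a simple pass over the audit log. The sketch below is hypothetical throughout: `AuditRecord` is an invented, minimal record shape (MPP's actual audit schema is not described here), "clean" is assumed to mean no privacy-filter hits, errors, or capability violations, and the rule that any incident flags a tool for a possible raise is deliberately crude.

```rust
use std::collections::BTreeMap;

/// Invented, minimal audit record for illustration.
#[derive(Debug, Clone)]
pub struct AuditRecord {
    pub tool: String,
    pub clean: bool, // assumed: no filter hits, errors, or violations
}

/// Suggested direction for a tool's tier at the next policy review.
#[derive(Debug, PartialEq, Eq)]
pub enum Review {
    Lower, // many clean runs, no incidents
    Raise, // at least one incident
}

/// Return tools worth discussing at the review. Tools with too few
/// runs and no incidents are omitted: there is not enough evidence yet.
pub fn review_candidates(log: &[AuditRecord], min_clean_runs: usize) -> BTreeMap<String, Review> {
    // Per tool: (total invocations, invocations with an incident).
    let mut counts: BTreeMap<&str, (usize, usize)> = BTreeMap::new();
    for record in log {
        let entry = counts.entry(record.tool.as_str()).or_insert((0, 0));
        entry.0 += 1;
        if !record.clean {
            entry.1 += 1;
        }
    }

    counts
        .into_iter()
        .filter_map(|(tool, (total, incidents))| {
            if incidents > 0 {
                Some((tool.to_string(), Review::Raise))
            } else if total >= min_clean_runs {
                Some((tool.to_string(), Review::Lower))
            } else {
                None
            }
        })
        .collect()
}
```

The output is a list of candidates for a human to review, not an automatic policy change. A real threshold would likely be far higher than anything convenient for a test, in line with the 10,000-invocation example above.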

What This Looks Like in Practice

Consider an enterprise coding assistant integrated with MPP. The developer asks the agent to refactor a module:

Agent calls code-formatter tool. Sensitivity score: 12. Tier: None. Runs instantly. Developer doesn't see a prompt.
Agent calls dependency-checker tool. Sensitivity score: 35. Tier: Notify. Developer sees a subtle notification: "dependency-checker accessed registry.npmjs.org". No interruption.
Agent calls database-migration tool. Sensitivity score: 62. Tier: Confirm. Dialog appears: "database-migration (v2.1.0, publisher: org.mycompany) requests network access to db.prod.internal and filesystem write to /migrations/. Approve?" Developer reviews, clicks Approve.
Agent calls deploy-to-staging tool. Sensitivity score: 81. Tier: Multi-Factor. Dialog appears with full details. Developer must confirm via their authenticator app. Manager notification is sent.

Four tool calls. Four different levels of oversight. Each calibrated to the actual risk of the operation. The developer was interrupted twice — once for a meaningful decision (database migration) and once for a critical operation (deployment). The other two ran silently.

This is what calibrated HITL looks like. Not a checkbox. Not a binary switch. A graduated system that allocates human attention where it matters most.

For the technical implementation of sensitivity scoring, see the Permission System documentation. For how capability-based permissions feed into the sensitivity score, read Capability-Based Permissions for AI Tools.