MCP for FHIR: An Architecture for AI Agents on Clinical Data
An architecture for exposing a FHIR backend to AI agents through a Model Context Protocol server, with the access boundary, audit, and operational patterns that keep clinical data safe under autonomous use.
Executive Summary
AI agents have moved from research demos to production tools faster than the access patterns around clinical data have caught up. The Model Context Protocol (MCP) has emerged as a common interface between agents and the systems they consume, including healthcare systems. This paper describes an architecture for exposing a FHIR backend to AI agents through an MCP server, with the access boundary, audit, and operational patterns that keep clinical data safe under autonomous use.
The architecture is informed by deployments that have run in production for months rather than days, and by the failure modes that show up at scale. The architecture is consistent with HIPAA, GDPR, and DiGA expectations, but it is not the only way to achieve them. It is the pattern that has held up under the loads we have seen.
The audience is engineering and security leads building AI features into a healthcare product, regulatory affairs leads evaluating the access patterns AI agents introduce, and architects designing the next generation of clinical decision support and patient-facing assistants.
1. Why MCP
The Model Context Protocol is a protocol for connecting AI agents to external systems. The protocol defines a small number of primitives (tools the agent can call, resources the agent can read, prompts the system can offer) and a transport for moving them between the agent and the system.
For a FHIR backend, MCP is attractive because it lets the backend’s existing capabilities (resource reads, searches, writes, bulk operations) be exposed to an agent through a single interface that the agent can introspect and call. The agent does not need a custom integration; the MCP server is the integration.
The alternative is a per-agent integration that the agent’s developer writes against a custom API or against the FHIR REST surface directly. The per-agent integration model works but does not scale across agents, across product lines, or across organisations.
2. The Architecture in Brief
The architecture has four layers.
Agent. The AI agent itself, running in whatever environment its developer hosts it in. The agent connects to the MCP server through MCP’s transport.
MCP server. The server that translates between MCP primitives and FHIR operations. The MCP server runs alongside the FHIR backend and depends on the backend for the data layer.
FHIR backend. The data layer that holds the clinical data. The backend’s authorization rules, audit, and operational tooling are the same regardless of whether the consumer is an MCP server or another integration.
Identity provider. The OAuth 2.0 / OIDC identity provider that authenticates the on-behalf-of identity the agent acts for. The MCP server validates the identity and carries it in the request context.
The architecture’s key claim is that the access boundary lives at the FHIR backend, not at the MCP server. The MCP server translates intent; the FHIR backend enforces the boundary.
3. Authentication and the On-Behalf-Of Identity
AI agents act on behalf of users or organisations. The agent itself is a service; the on-behalf-of identity is what gives the action meaning under the access boundary.
Authentication runs in two layers.
The first layer authenticates the agent as a service. The preferred credential is an OAuth 2.0 / OIDC access token issued to the agent’s service identity (a SMART backend services JWT, an Entra ID workload identity, or any OIDC-compliant equivalent), which the MCP server validates and binds to the agent’s role. Where the agent’s runtime cannot do OIDC, an API token tied to the same identity is an acceptable fallback, rotated frequently. The agent’s identity is recorded either way.
The second layer authenticates the on-behalf-of identity. The agent presents a token issued for the user (typically through OAuth 2.0 token exchange or through a SMART backend services delegation) and the MCP server validates it. The on-behalf-of identity is recorded and carried in the request context.
The FHIR backend’s authorization rules evaluate against the on-behalf-of identity, not against the agent. The audit log records both identities so the action can be traced to both the agent that performed it and the user it was performed for.
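The two-layer scheme can be sketched as a request context that binds both identities and refuses to exist without the on-behalf-of layer. This is a minimal illustration, not an implementation: `build_context`, the field names, and the claim shapes are hypothetical, and real token validation would run through an OIDC library against the identity provider's keys before the claims reach this point.

```python
# Sketch of a two-layer request context. `agent_claims` and `obo_claims`
# stand in for the already-validated claims of the agent's service token
# and the on-behalf-of token; names here are illustrative, not a spec.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestContext:
    agent_id: str       # layer 1: the agent's service identity
    on_behalf_of: str   # layer 2: the user the agent acts for
    tenant: str


class MissingOnBehalfOfError(Exception):
    """Raised when a request carries no on-behalf-of identity."""


def build_context(agent_claims: dict, obo_claims: Optional[dict]) -> RequestContext:
    """Bind both identity layers into one context, or reject the request."""
    if not obo_claims or "sub" not in obo_claims:
        # Enforced at the MCP server boundary: the backend's rules
        # evaluate against this identity, so it cannot be absent.
        raise MissingOnBehalfOfError("request carries no on-behalf-of identity")
    return RequestContext(
        agent_id=agent_claims["sub"],
        on_behalf_of=obo_claims["sub"],
        tenant=agent_claims.get("tenant", "default"),
    )
```

Both identities travel together from here on, so the audit record in Section 5 can attribute the action to the agent and to the user it was performed for.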
3.1 Failure modes to avoid
The most common failure mode is to authenticate the agent only and to lose the on-behalf-of identity. The audit log then records the agent as the actor; the access boundary cannot evaluate per-user rules; the patient-side access-log query cannot attribute the action to anyone.
The fix is to require the on-behalf-of identity in every request and to reject requests that do not carry it. The MCP server can enforce this at the boundary.
4. Authorization and the Tool Surface
MCP exposes capabilities to the agent as tools. The tool surface determines what the agent can attempt; the FHIR backend’s authorization rules determine what the agent can actually do.
4.1 Tool design
Design tools to map to clinical actions, not to FHIR primitives. A tool called lookup_recent_observations(patient_id) is more useful and more bounded than a tool called fhir_search. The tool’s contract is narrower; the agent’s reasoning is simpler; the access boundary’s evaluation is more focused.
The tool implementation translates the call into FHIR operations. The translation is bounded by the authorization rules at the backend.
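As an illustration of the bounded translation, a clinical-action tool might build its FHIR search like this. The tool name comes from the example above; the window, page cap, and parameter choices are assumptions for the sketch, and the backend's authorization rules still evaluate the resulting search.

```python
# Sketch: a clinical-action tool translating into a bounded FHIR search.
# The contract is narrow by construction: one patient, one resource type,
# a bounded time window, a hard page cap the agent cannot override.
from datetime import date, timedelta
from urllib.parse import urlencode


def lookup_recent_observations(patient_id: str, days: int = 30) -> str:
    """Build the only FHIR search this tool is permitted to issue."""
    since = (date.today() - timedelta(days=days)).isoformat()
    params = urlencode({
        "patient": patient_id,
        "date": f"ge{since}",   # FHIR date prefix: greater-or-equal
        "_sort": "-date",
        "_count": 50,           # hard cap, not agent-controlled
    })
    return f"/Observation?{params}"
```

Contrast this with a generic fhir_search tool, where the agent would construct arbitrary queries and every parameter becomes part of the attack surface.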
4.2 Tool granularity
Tools should be granular enough to be useful but not so granular that the agent has to compose dozens of calls for a single task. A tool called get_patient_summary(patient_id) that returns a structured summary of recent observations, conditions, and medications is more useful than a tool that returns one field at a time.
The trade-off is fidelity versus performance. A summary tool returns less than the full FHIR detail; an agent that needs the full detail has to call additional tools. Design the summary tool around the most common use case and expose detail tools for the cases that need them.
4.3 Write tool design
Write tools deserve special care. Expose write tools only for the specific actions the product wants the agent to perform: record_blood_pressure(patient_id, systolic, diastolic) rather than fhir_create(observation). The narrower contract is easier to reason about, easier to test, and easier to gate at the access boundary.
Write tools should also be idempotent. An agent that retries a tool call should not produce duplicate writes. Require an idempotency key on every write tool and dedupe on the backend side.
4.4 Authorization evaluation
Authorization is evaluated at the FHIR backend’s data-access layer, against the on-behalf-of identity. The MCP server does not evaluate authorization itself; it relies on the backend’s rule chain.
The same default-deny rule chain that gates all other access to the backend gates AI agent access. AI agents are one more access context; the rule chain treats them as such.
5. Audit and the Action Trail
AI agents generate more activity than human users by orders of magnitude. The audit log has to absorb the volume and remain queryable.
5.1 Identity-attributable audit
Every action the agent performs emits an audit record with the agent identity, the on-behalf-of identity, the tool that was called, the FHIR operations that resulted, and the rule that authorised them. The audit record is the same shape as the audit record for a human user, with the agent identity added.
5.2 Action chains
A single tool call often results in multiple FHIR operations. Record the tool call as a parent action and the FHIR operations as child actions, linked by a correlation identifier. The audit log can then show the agent’s high-level intent (the tool call) alongside the low-level operations (the FHIR reads and writes).
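The parent/child shape can be sketched as follows. The record fields are illustrative, not a schema; the invariant that matters is that every record in the chain shares one correlation identifier.

```python
# Sketch of an action-chain audit shape: one parent record for the tool
# call, one child record per resulting FHIR operation, all linked by a
# shared correlation identifier. Field names are illustrative.
import uuid


def audit_tool_call(agent_id: str, on_behalf_of: str, tool: str,
                    fhir_ops: list) -> list:
    correlation_id = str(uuid.uuid4())
    parent = {
        "correlation_id": correlation_id,
        "kind": "tool_call",          # the agent's high-level intent
        "agent": agent_id,            # who performed the action
        "on_behalf_of": on_behalf_of,  # who it was performed for
        "tool": tool,
    }
    children = [
        {
            "correlation_id": correlation_id,
            "kind": "fhir_operation",  # the low-level read or write
            "operation": op,
        }
        for op in fhir_ops
    ]
    return [parent, *children]
```

Querying by the correlation identifier returns the whole chain, which is what makes the intent-to-operation trail navigable later.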
5.3 Patient-side access logs
Patients have rights to access logs of who accessed their data. AI agent actions appear in the access log alongside human actions. Present the agent’s identity in a way the patient can understand: the product name, the type of action, the on-behalf-of identity (typically the patient themselves or their clinician).
5.4 Volume management
That volume lands here: the audit storage has to absorb the writes; the audit query path has to remain responsive; the retention policy has to handle the storage cost.
Write audit records to a separate store from the operational database, use a storage technology that handles high write volume cheaply (column-oriented storage, log-structured storage), and apply retention based on the regulatory requirements.
6. The Read-Only Bias
Production deployments tend to bias the tool surface toward read-only operations. Agents that read clinical data and propose actions to a human are easier to deploy safely than agents that take clinical actions autonomously.
The bias is not absolute. Some products genuinely need agents to take actions. The bias is a default that the team has to consciously override for specific actions, with the documentation and the access boundary that goes with the override.
The reasons are practical. Read-only agents fail by producing wrong outputs that a human can review; write-capable agents fail by producing wrong actions that may be hard to reverse. The first failure mode is recoverable; the second often is not.
7. Prompt Injection and the Trust Boundary
Prompt injection is the dominant security risk in production AI agent deployments. An attacker who can place text in the agent’s input can attempt to override the agent’s instructions and to direct the agent to take unintended actions.
For a healthcare agent, the attack surface includes any text the agent reads: patient-supplied messages, clinician notes, scanned documents, third-party content the agent retrieves. Each of these is an untrusted input from the agent’s perspective.
7.1 The trust boundary at the access layer
The mitigation that holds is to put the trust boundary at the access layer, not at the prompt. An agent that has been instructed to extract the patient’s medications cannot retrieve another patient’s medications regardless of what the prompt injection asks it to do, because the access boundary at the FHIR backend will not permit it.
The boundary at the access layer survives prompt injection. The boundary at the prompt does not.
7.2 Tool call review
For high-risk tool calls (writes, sensitive reads, anything outside the agent’s typical pattern), introduce a human review step. The agent proposes the tool call; a human approves it; the MCP server executes it. The review step is a layer above the access boundary that catches semantically-suspicious-but-technically-permitted actions.
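A review gate of this kind might look like the sketch below. The risk classification, tool names, and callback shapes are assumptions for illustration; a real deployment would queue the proposal and resume asynchronously rather than block on an approval callback.

```python
# Sketch of a review gate above the access boundary: high-risk tool calls
# route through human approval before execution. The risk set and the
# callback-based flow are illustrative assumptions.
from typing import Callable

HIGH_RISK = {"record_blood_pressure", "update_medication"}   # assumed policy


def execute_with_review(tool: str, args: dict,
                        approve: Callable[[str, dict], bool],
                        run: Callable[[str, dict], dict]) -> dict:
    """Run low-risk tools directly; gate high-risk tools on approval."""
    if tool in HIGH_RISK and not approve(tool, args):
        return {"status": "rejected", "tool": tool}
    # The access boundary at the FHIR backend still evaluates the call;
    # review is an additional layer, not a replacement for authorization.
    return run(tool, args)
```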
7.3 Output filtering
The agent’s output to the user is also a place where injected content can flow through. Constrain the output format to structured fields rather than free text where the use case allows, and apply output filters where the format must be free.
8. Operational Patterns
Three operational patterns make a production MCP deployment tractable.
8.1 Per-tenant agent deployments
When the FHIR backend is multi-tenant, the MCP server should be per-tenant or should carry the tenant context in every request. An MCP server that mixes tenants creates a cross-tenant attack surface that is hard to defend.
8.2 Rate limiting per agent and per identity
Rate limits should be applied per agent and per on-behalf-of identity. An agent that goes haywire (a model failure, a runaway loop, a misconfigured prompt) can otherwise overwhelm the backend.
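A minimal sketch of the limit, keyed on the (agent, identity) pair so each pairing gets its own budget. The fixed-window counter and injected clock are simplifications; production deployments typically use sliding windows or token buckets, and may keep separate per-agent and per-identity counters as well.

```python
# Sketch of a rate limit keyed per (agent, on-behalf-of) pair, using a
# fixed window. The limits and the injected clock are illustrative.
from collections import defaultdict


class RateLimiter:
    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)
        self.window_start = {}

    def allow(self, agent_id: str, on_behalf_of: str, now: float) -> bool:
        key = (agent_id, on_behalf_of)   # each pairing gets its own budget
        start = self.window_start.get(key)
        if start is None or now - start >= self.window:
            # New window: reset the count for this pairing.
            self.window_start[key] = now
            self.counts[key] = 0
        self.counts[key] += 1
        return self.counts[key] <= self.limit
```

A runaway agent exhausts only its own budget; other agents, and the same agent acting for other identities, are unaffected.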
8.3 Observability
Production MCP deployments benefit from observability that ties the agent’s tool calls to the FHIR operations to the access decisions to the audit records. A correlation identifier that flows through the entire request path and is searchable in each system’s logs is what makes the trail navigable.
9. Where MCP Deployments Most Often Get Stuck
Across the deployments we have seen, four issues account for most of the time-sinks.
9.1 On-behalf-of identity loss
The agent is authenticated but the on-behalf-of identity is lost between the MCP server and the backend. The audit log attributes actions to the agent only; the access boundary evaluates the wrong identity.
Fix: pass the on-behalf-of identity through every request and reject requests that do not carry it.
9.2 Tool surface that mirrors FHIR
A tool surface that mirrors FHIR primitives is hard for the agent to reason about and easy for the agent to misuse. Task-oriented tools shrink the surface to the actions the product wants the agent to take.
Fix: design tools around clinical actions, not FHIR primitives.
9.3 Audit volume
Audit records from agent activity overwhelm the audit storage and degrade the query path.
Fix: separate audit storage with high-volume-write characteristics; per-tenant retention; structured records that support efficient queries.
9.4 Write-without-idempotency
Write tools that are not idempotent produce duplicate records when the agent retries. Duplicate records corrupt the clinical record.
Fix: require an idempotency key on every write tool and dedupe on the backend.
10. The Architecture as a Default
The architecture above is the default that has worked in production across the deployments we have seen. The defaults are:
- Two-layer authentication: the agent and the on-behalf-of identity.
- Authorization at the FHIR backend, evaluated against the on-behalf-of identity.
- Tool surface designed around clinical actions, biased toward read-only.
- Audit recorded with both identities and the action chain.
- Write tools with idempotency keys.
- Trust boundary at the access layer, not at the prompt.
- Per-tenant deployments and per-tenant audit.
- Correlation identifiers across the request path.
A deployment that follows these defaults survives prompt injection, scales to production loads, and produces the audit shape regulators expect.
11. Closing
AI agents on clinical data are not a special case. They are a new access context that fits the existing access boundary if the boundary lives at the data layer. The MCP server is a translation layer; the FHIR backend is the boundary. Agents that operate within this architecture get the same audit shape, the same access guarantees, and the same operational predictability as any other authorised consumer.
The mistakes the architecture is meant to prevent are the mistakes that have already happened in production deployments without this architecture: prompt injection that succeeds because the boundary is at the prompt, audit logs that cannot attribute agent actions to a clinical context, write tools that produce duplicate records on retry. The architecture catches each of these by treating the agent as one more authorised consumer of a backend that already has a working access boundary.
References
- Anthropic Model Context Protocol (MCP) specification.
- HL7 FHIR R4 specification.
- IETF RFC 8693, OAuth 2.0 Token Exchange.
- SMART App Launch Implementation Guide, backend services profile.
- OWASP Top 10 for LLM Applications (2024).
- Fire Arrow documentation: Authorization concepts, Property filters, Audit log.
- Related Fire Arrow whitepapers: Agentic LLM Access, Identity Filtering, FHIR Authentication and Authorization Patterns.
- Related landing pages on this site: MCP server for FHIR, LLM access control for healthcare, AI agents on FHIR.