Hybrid Agent – Foundry Local + Microsoft Foundry

A production-grade, Python-only hybrid AI application. Every request is classified first: deterministic rules handle the clear-cut cases, and a lightweight local router model scores the rest. Simple, private, or latency-sensitive requests stay on-device using Foundry Local. Complex, reasoning-heavy, or frontier-capability requests escalate to a Microsoft Foundry cloud model via Azure AI Projects.

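Every request returns the same AgentResponse schema, so callers handle local, cloud, and fallback answers identically. The path actually taken is reported in the response's path field rather than through a different response shape.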

Screenshots

Web UI – conversation with live routing diagnostics

Left panel: chat with colour-coded path indicators. Right panel: live routing decision — path, model, confidence, latency, privacy class, complexity, and full response JSON.

Privacy hard gate – RESTRICTED content forced local

Sensitive data detected deterministically. The router LLM is never called. Cloud is blocked at both the policy layer and the service layer.

Cloud fallback – local model fails, cloud recovers

Local model raises RuntimeError. Service automatically falls back to cloud. Response path is cloud_fallback; fallback_reason is populated.

CLI demo – all five scenarios without real models

Quick start – demo mode (no models required)

Run all five routing scenarios with simulated responses in under a second:

python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate

pip install -r requirements.txt
python -m app.main --demo

No Azure account, no Foundry Local installation, and no model downloads are needed for demo mode.

Architecture

User prompt
     │
     ▼
 HybridAgentService.ask()
     │
     ├─► RoutingPolicy.decide()
     │       ├─► HeuristicClassifier   (deterministic – zero model calls)
     │       └─► RouterLLM             (local router model – classification only)
     │
     ├─── RouteTarget.LOCAL  ──► LocalModelProvider.complete()   (foundry-local-sdk)
     │                               └─► cloud fallback if local fails
     │
     └─── RouteTarget.CLOUD  ──► CloudModelProvider.complete()   (azure-ai-projects)
                                     └─► local fallback if cloud fails + policy allows

Routing decision flow

Stage	Component	Trigger
1a	`HeuristicClassifier`	Sensitive keyword → `RESTRICTED` hard gate (no LLM call)
1b	`HeuristicClassifier`	Trivial greeting → `LOCAL` (no LLM call)
1c	`HeuristicClassifier`	`very_high` complexity → `CLOUD` (no LLM call)
2	`RouterLLM`	All other cases – local model scores complexity, privacy, confidence
3	`RoutingPolicy`	LLM confidence < threshold → safe default `LOCAL`

Key design decisions

Decision	Rationale
Two separate local models (router + task)	Router stays fast and cheap; task model can be larger
Deterministic gates before LLM routing	Privacy and compliance must not depend solely on model inference
Defence-in-depth RESTRICTED guard	Service layer independently blocks cloud for RESTRICTED content
Same `AgentResponse` schema from both paths	Callers are path-agnostic; routing is transparent
`DefaultAzureCredential` for cloud auth	Keyless; no secrets in code or environment files
Conservative fallback (default to local)	Low-confidence routing prefers privacy over capability

The five routing scenarios

#	Prompt type	Path	Triggered by
1	Trivial greeting	`local`	Heuristic (no LLM call)
2	Simple factual question	`local`	RouterLLM → high confidence local
3	Sensitive data (password, PII)	`local`	Privacy hard gate (no LLM call)
4	Complex multi-step reasoning	`cloud`	RouterLLM → high confidence cloud
5	Local model fails	`cloud_fallback`	Fallback rule (`HYBRID_ENABLE_CLOUD_FALLBACK=true`)

Project structure

fl-mixedmodel/
├── app/
│   ├── main.py               CLI + --ui + --demo entry point
│   ├── demo.py               Demo runner (all 5 scenarios, no models needed)
│   ├── config.py             Env-var-driven typed configuration
│   ├── models/
│   │   ├── response.py       AgentResponse – unified output schema
│   │   └── routing.py        RoutingDecision, RouterLLMOutput, enums
│   ├── routing/
│   │   ├── heuristic.py      Deterministic keyword/rule pre-classifier
│   │   ├── router_llm.py     Local model classifier + JSON parser
│   │   └── policy.py         RoutingPolicy – two-stage decision logic
│   ├── providers/
│   │   ├── local_provider.py foundry-local-sdk wrapper (router + task models)
│   │   └── cloud_provider.py azure-ai-projects wrapper (Responses API)
│   ├── services/
│   │   └── hybrid_agent.py   HybridAgentService – orchestration + fallback
│   ├── telemetry/
│   │   └── logger.py         Structured JSON logger, correlation IDs, timers
│   ├── ui/
│   │   └── gradio_app.py     Gradio web UI (chat + live routing diagnostics)
│   └── tests/
│       ├── test_routing.py      Unit tests – heuristic, policy, router LLM
│       ├── test_hybrid_agent.py Unit tests – service paths and fallback
│       └── test_e2e.py          E2E tests – full pipeline, privacy gates, UI
├── docs/
│   └── screenshots/          UI and scenario screenshot SVGs
├── requirements.txt
├── .env.example
├── specification.md          Full technical specification for future agents
└── README.md

Setup (full mode with real models)

Prerequisites

Python 3.11 or later
Foundry Local installed, with its service started (`foundry service start`)
An Azure AI Foundry project with at least one model deployment
Azure CLI logged in (`az login`)

1 – Create virtual environment

python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

2 – Install dependencies

pip install -r requirements.txt

3 – Configure environment

# Windows
copy .env.example .env

# macOS / Linux
cp .env.example .env

Edit .env and fill in at minimum:

Variable	Required	Description
`FOUNDRY_PROJECT_ENDPOINT`	✅	Azure AI Foundry project endpoint URL
`FOUNDRY_CLOUD_MODEL_DEPLOYMENT`	✅	Cloud deployment name, e.g. `gpt-4o`
`FOUNDRY_LOCAL_ROUTER_MODEL_ALIAS`	recommended	Local router model alias
`FOUNDRY_LOCAL_TASK_MODEL_ALIAS`	recommended	Local task model alias

To find available local model aliases:

from foundry_local import FoundryLocalManager
from foundry_local.models import Configuration

m = FoundryLocalManager.initialize(Configuration(app_name="hybrid-agent"))
print([model.alias for model in m.catalog.list_models()])

4 – Run

CLI (interactive REPL)

python -m app.main

Web UI (Gradio on port 7860)

python -m app.main --ui
python -m app.main --ui --port 8080
python -m app.main --ui --share        # creates a public share URL

Demo mode (no models required)

python -m app.main --demo

Running tests

Tests mock all network and model calls — no external dependencies:

# All 43 tests
pytest app/tests/ -v

# With coverage report
pytest app/tests/ -v --cov=app --cov-report=term-missing

# Single suite
pytest app/tests/test_e2e.py -v

Environment variable reference

Variable	Default	Description
`FOUNDRY_PROJECT_ENDPOINT`	—	Azure AI Foundry project endpoint (required)
`FOUNDRY_CLOUD_MODEL_DEPLOYMENT`	—	Cloud deployment name (required)
`FOUNDRY_CLOUD_ROUTER_DEPLOYMENT`	`model-router`	Cloud auto-router or named deployment
`FOUNDRY_LOCAL_ROUTER_MODEL_ALIAS`	SDK default	Local router model alias
`FOUNDRY_LOCAL_TASK_MODEL_ALIAS`	SDK default	Local task model alias
`HYBRID_PRIVACY_MODE`	`strict`	`strict` – confidential content stays local
`HYBRID_MAX_LOCAL_TOKENS`	`4096`	Max token budget for local inference
`HYBRID_ENABLE_CLOUD_FALLBACK`	`true`	Fall back to cloud if local model fails
`HYBRID_ENABLE_LOCAL_FALLBACK`	`true`	Fall back to local if cloud is unavailable
`HYBRID_MIN_ROUTER_CONFIDENCE`	`0.6`	Min router LLM confidence to trust its decision
`APP_LOG_LEVEL`	`INFO`	Log level (`DEBUG` / `INFO` / `WARNING` / `ERROR`)

See .env.example for fully annotated values.
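
The table above can be read as a small typed loader. The sketch below is an illustration only, not the contents of app/config.py: the `HybridConfig` class, the `load_config` function, and the accepted boolean spellings are assumptions made for the example. Only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("FOUNDRY_PROJECT_ENDPOINT", "FOUNDRY_CLOUD_MODEL_DEPLOYMENT")


def _as_bool(value: str) -> bool:
    # Assumed spellings for "true"; the real parser may accept fewer or more.
    return value.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class HybridConfig:  # hypothetical name
    project_endpoint: str
    cloud_model_deployment: str
    privacy_mode: str
    max_local_tokens: int
    enable_cloud_fallback: bool
    enable_local_fallback: bool
    min_router_confidence: float
    log_level: str


def load_config(env: Mapping[str, str] = os.environ) -> HybridConfig:
    # Fail fast on the two settings that have no default.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    confidence = float(env.get("HYBRID_MIN_ROUTER_CONFIDENCE", "0.6"))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("HYBRID_MIN_ROUTER_CONFIDENCE must be within [0, 1]")
    return HybridConfig(
        project_endpoint=env["FOUNDRY_PROJECT_ENDPOINT"],
        cloud_model_deployment=env["FOUNDRY_CLOUD_MODEL_DEPLOYMENT"],
        privacy_mode=env.get("HYBRID_PRIVACY_MODE", "strict"),
        max_local_tokens=int(env.get("HYBRID_MAX_LOCAL_TOKENS", "4096")),
        enable_cloud_fallback=_as_bool(env.get("HYBRID_ENABLE_CLOUD_FALLBACK", "true")),
        enable_local_fallback=_as_bool(env.get("HYBRID_ENABLE_LOCAL_FALLBACK", "true")),
        min_router_confidence=confidence,
        log_level=env.get("APP_LOG_LEVEL", "INFO"),
    )
```

Passing a plain dict instead of `os.environ` keeps a loader like this easy to unit test without touching the process environment.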

Routing policy

Stage 1 – Deterministic (always runs first, zero model calls)

Condition	Action
Sensitive keyword detected (password, PII, credentials, NHS, SSN…)	Force local, `RESTRICTED`
Prompt exceeds local context threshold (3,000 chars)	Hint cloud, `HIGH` complexity
Cloud-indicative keywords (generate code, tool use, research paper…)	Hint cloud, `HIGH`
Trivial greeting or single-word acknowledgement	Force local, `TRIVIAL`
Complexity `very_high`	Force cloud
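
As a rough illustration of the Stage 1 rules, the sketch below implements them as a pure function with no model calls. It is not the code in app/routing/heuristic.py: the keyword lists, the `HeuristicResult` type, and the `classify` function are assumptions made for the example, and the `very_high` rule is left out because this README does not say how that complexity level is detected. Only the 3,000-character threshold and the rule outcomes come from the table above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists; the real ones are defined by the project.
SENSITIVE = ("password", "credentials", "ssn", "nhs number")
CLOUD_HINTS = ("generate code", "tool use", "research paper")
GREETINGS = {"hi", "hello", "hey", "thanks", "ok"}
LOCAL_CONTEXT_CHARS = 3000  # threshold from the Stage 1 table


@dataclass(frozen=True)
class HeuristicResult:  # hypothetical type
    target: Optional[str]  # "local", "cloud", or None when undecided
    forced: bool           # True = final decision, False = hint for Stage 2
    label: str             # privacy or complexity label


def classify(prompt: str) -> HeuristicResult:
    text = prompt.lower().strip()
    # Privacy is checked first so no later rule can hint the prompt to cloud.
    if any(re.search(rf"\b{re.escape(word)}\b", text) for word in SENSITIVE):
        return HeuristicResult("local", True, "restricted")
    # A bare greeting or acknowledgement never needs a model to route it.
    if text.rstrip("!.?") in GREETINGS:
        return HeuristicResult("local", True, "trivial")
    # Long prompts and cloud-indicative phrases only hint; Stage 2 may still decide.
    if len(prompt) > LOCAL_CONTEXT_CHARS or any(hint in text for hint in CLOUD_HINTS):
        return HeuristicResult("cloud", False, "high")
    return HeuristicResult(None, False, "unknown")
```

Checking the sensitive keywords before anything else means a prompt such as "generate code to reset my password" is pinned to local even though it also matches a cloud hint.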

Stage 2 – Local router LLM (ambiguous cases only)

The router model returns structured JSON with these fields:

`target`: `local` or `cloud`
`confidence`: a score in [0, 1]; a value below `HYBRID_MIN_ROUTER_CONFIDENCE` makes the policy default to local
`privacy_class`: `public` | `internal` | `confidential` | `restricted`
`complexity`: `trivial` | `low` | `medium` | `high` | `very_high`
`requires_long_context`, `requires_tool_use`, `requires_frontier_capability`: boolean flags
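
One way to consume the router's JSON is sketched below, assuming the field names listed in this section. The `RouterOutput` type, both function names, and the fallback values used for malformed or missing fields are assumptions for illustration, not the project's RouterLLMOutput model. The sketch treats unparseable output as a zero-confidence vote, so the confidence threshold sends it to local.

```python
import json
from dataclasses import dataclass

MIN_ROUTER_CONFIDENCE = 0.6  # documented default of HYBRID_MIN_ROUTER_CONFIDENCE


@dataclass(frozen=True)
class RouterOutput:  # hypothetical type
    target: str
    confidence: float
    privacy_class: str
    complexity: str


def parse_router_output(raw: str) -> RouterOutput:
    """Parse router JSON; malformed output becomes a zero-confidence local vote."""
    try:
        data = json.loads(raw)
        target = str(data["target"]).lower()
        confidence = float(data["confidence"])
        if target not in ("local", "cloud") or not 0.0 <= confidence <= 1.0:
            raise ValueError("router output out of range")
        return RouterOutput(
            target,
            confidence,
            # Assumed fallbacks when the model omits a field.
            str(data.get("privacy_class", "internal")).lower(),
            str(data.get("complexity", "medium")).lower(),
        )
    except (ValueError, KeyError, TypeError, AttributeError):
        return RouterOutput("local", 0.0, "internal", "medium")


def final_target(out: RouterOutput, threshold: float = MIN_ROUTER_CONFIDENCE) -> str:
    # Restricted content stays on-device whatever the router voted.
    if out.privacy_class == "restricted":
        return "local"
    # Below the threshold the vote is ignored and local is the safe default.
    return out.target if out.confidence >= threshold else "local"
```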

Fallback rules

Scenario	Behaviour
Local fails + `HYBRID_ENABLE_CLOUD_FALLBACK=true`	Escalate → `cloud_fallback`
Cloud fails + `HYBRID_ENABLE_LOCAL_FALLBACK=true` + content not `RESTRICTED`	Recover → `local_fallback`
Both fail	Graceful error `AgentResponse` — never raises to caller
Cloud unavailable + content `RESTRICTED`	Error response — content never sent elsewhere
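
The fallback rules can be sketched as a single function. This is a simplified illustration under stated assumptions, not HybridAgentService itself: providers are modelled as plain callables, the response is a dict rather than an AgentResponse, and the `"error"` path value is invented for the example. The sketch reads the last row of the table as "restricted content whose local attempt fails gets an error response, because cloud is never an option for it".

```python
from typing import Callable

Provider = Callable[[str], str]  # hypothetical: prompt in, answer out


def answer_with_fallback(
    prompt: str,
    target: str,  # "local" or "cloud", as chosen by the routing policy
    restricted: bool,
    local: Provider,
    cloud: Provider,
    enable_cloud_fallback: bool = True,
    enable_local_fallback: bool = True,
) -> dict:
    # Defence in depth: restricted content is pinned to local regardless of target.
    if restricted:
        target = "local"
    primary, secondary = (local, cloud) if target == "local" else (cloud, local)
    other = "cloud" if target == "local" else "local"
    # Cloud is a permitted fallback only for non-restricted content.
    if other == "cloud":
        allowed = enable_cloud_fallback and not restricted
    else:
        allowed = enable_local_fallback
    try:
        return {"answer": primary(prompt), "path": target,
                "fallback": False, "fallback_reason": None}
    except Exception as exc:
        reason = f"{target} failed: {exc}"
    if allowed:
        try:
            return {"answer": secondary(prompt), "path": f"{other}_fallback",
                    "fallback": True, "fallback_reason": reason}
        except Exception as exc:
            reason = f"{reason}; {other} failed: {exc}"
    # Never raise to the caller: return an error-shaped response instead.
    return {"answer": "", "path": "error",
            "fallback": allowed, "fallback_reason": reason}
```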

Response schema

Every response — local, cloud, or fallback — returns the same shape:

{
  "answer": "The Berlin Wall fell on 9 November 1989.",
  "path": "local",
  "model": "phi-3.5-mini",
  "reason": "Simple historical fact, answerable by a small local model",
  "confidence": 0.87,
  "latency_ms": 52.1,
  "correlation_id": "ce35c6e5-1835-4b4a-b858-4c71d6b582f4",
  "prompt_tokens": 9,
  "completion_tokens": 12,
  "fallback": false,
  "fallback_reason": null,
  "metadata": {
    "privacy_class": "public",
    "complexity": "low",
    "deterministic_override": false,
    "requires_long_context": false,
    "requires_tool_use": false
  }
}
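
For orientation, the JSON shape above maps naturally onto a dataclass. The sketch below is an approximation built from the example payload. The actual AgentResponse in app/models/response.py may use a different base class, different defaults, or stricter types, and the `to_json` helper is a name chosen for this example.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class AgentResponse:
    answer: str
    path: str  # e.g. "local", "cloud", "cloud_fallback", "local_fallback"
    model: str
    reason: str
    confidence: float
    latency_ms: float
    # Assumed default: a fresh UUID per response when none is supplied.
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    fallback: bool = False
    fallback_reason: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() keeps field order, so the output mirrors the schema above.
        return json.dumps(asdict(self), indent=2)


response = AgentResponse(
    answer="The Berlin Wall fell on 9 November 1989.",
    path="local",
    model="phi-3.5-mini",
    reason="Simple historical fact, answerable by a small local model",
    confidence=0.87,
    latency_ms=52.1,
    metadata={"privacy_class": "public", "complexity": "low"},
)
```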

Troubleshooting

Gradio UI: "data incompatible with message" (resolved)

Earlier Gradio 4.x samples used show_copy_button=True on gr.Chatbot and the tuple message format. Both were removed in Gradio 6.x, where they now raise a TypeError or a "data incompatible with message" error.

Resolved in this repo by:

Initialising gr.Chatbot with type="messages" and buttons=['copy', 'copy_all', 'share']
Returning {"role": "assistant", "content": ...} dicts from the chat handler

If you fork the UI and hit this error again, check app/ui/gradio_app.py for the Chatbot component initialisation.
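
If a fork still holds chat history in the old tuple format, a small converter along these lines produces the role/content dicts described above. The helper name `tuples_to_messages` is hypothetical and does not exist in this repo. It assumes the old history is a sequence of (user, assistant) pairs in which either side may be None.

```python
from typing import Iterable, Optional, Sequence


def tuples_to_messages(history: Iterable[Sequence[Optional[str]]]) -> list[dict]:
    """Convert [(user, assistant), ...] pairs into role/content dicts."""
    messages: list[dict] = []
    for user_text, assistant_text in history:
        # Either side of a pair may be missing, e.g. a turn still awaiting a reply.
        if user_text is not None:
            messages.append({"role": "user", "content": user_text})
        if assistant_text is not None:
            messages.append({"role": "assistant", "content": assistant_text})
    return messages
```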

Cloud returns 400 "Unsupported parameter: max_tokens"

gpt-5 and o-series deployments require max_completion_tokens instead of max_tokens and reject custom temperature values. app/providers/cloud_provider.py tries max_completion_tokens first and falls back to max_tokens only when the API returns the specific unsupported parameter error, so the same code works against both old and new deployments.
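
The retry pattern described above can be sketched independently of any SDK. In this illustration the `send` callable and the `UnsupportedParameterError` class are stand-ins invented for the example. The real provider would inspect the HTTP 400 error returned by the service, and its exact checks may differ.

```python
from typing import Any, Callable


class UnsupportedParameterError(Exception):
    """Hypothetical stand-in for the service's 400 'Unsupported parameter' error."""

    def __init__(self, param: str) -> None:
        super().__init__(f"Unsupported parameter: {param}")
        self.param = param


def complete_with_token_limit(send: Callable[..., Any], prompt: str, limit: int) -> Any:
    # Try the newer parameter name first.
    try:
        return send(prompt=prompt, max_completion_tokens=limit)
    except UnsupportedParameterError as exc:
        # Retry only for this specific rejection; anything else is a real error.
        if exc.param != "max_completion_tokens":
            raise
        return send(prompt=prompt, max_tokens=limit)
```

Narrowing the retry to one named parameter avoids masking unrelated 400 responses, such as a bad deployment name, behind a second request.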

Local model first-load is slow (30–60 seconds)

First run downloads model weights through Foundry Local. Subsequent runs reuse the cache. Use python -m app.main --demo to validate routing logic without waiting for a download.

Diagnostics

List available local model aliases:

python -c "from foundry_local import FoundryLocalManager; from foundry_local.models import Configuration; m=FoundryLocalManager.initialize(Configuration(app_name='test')); print([x.alias for x in m.catalog.list_models()][:10])"

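Check cloud client configuration (run az login first). Constructing the client confirms that the endpoint variable is set and the packages import cleanly; it does not by itself send a request to the service: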

python -c "from azure.identity import DefaultAzureCredential; from azure.ai.projects import AIProjectClient; import os; AIProjectClient(endpoint=os.environ['FOUNDRY_PROJECT_ENDPOINT'], credential=DefaultAzureCredential()); print('Connected')"

Logging

All events are emitted as structured JSON, one object per line (the example below is wrapped for readability):

{"ts":"2026-05-26T12:00:00","level":"INFO","logger":"hybrid_agent",
 "message":"Routing decision","correlation_id":"ce35c6e5",
 "target":"local","confidence":0.87,"reason":"Simple historical fact",
 "privacy_class":"public","complexity":"low","deterministic_override":false}

Ingest directly into Azure Monitor, Splunk, ELK, or any log aggregator.
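
A formatter producing this kind of line can be built on the standard logging module. The sketch below is a minimal approximation, not the code in app/telemetry/logger.py. The `JsonFormatter` class name is an assumption, timestamps are rendered in UTC, and any field passed through `extra=` is copied into the JSON object.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):  # hypothetical name
    # Attributes every LogRecord carries; anything else arrived via `extra=`.
    _STANDARD = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
        "message",
        "asctime",
    }

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%S"
            ),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy caller-supplied fields such as correlation_id or target.
        payload.update(
            {k: v for k, v in record.__dict__.items() if k not in self._STANDARD}
        )
        # default=str keeps non-JSON types (UUIDs, enums) from raising.
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("hybrid_agent")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Routing decision", extra={"target": "local", "confidence": 0.87})
```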

Specification

specification.md contains the complete technical specification — component interfaces, routing decision tree, data models, testing strategy, security model, extension guide, and known limitations. Written for future developers and coding agents.

Contributing and licence

CONTRIBUTING.md — development setup, test workflow, PR checklist.
CODE_OF_CONDUCT.md — Contributor Covenant v2.1.
SECURITY.md — vulnerability reporting and secretless-by-design notes.
CHANGELOG.md — version history.
Licensed under the MIT License.
