A production-grade Python-only hybrid AI application. A lightweight local model classifies every request first. Simple, private, or latency-sensitive requests stay on-device using Foundry Local. Complex, reasoning-heavy, or frontier-capability requests escalate to a Microsoft Foundry cloud model via Azure AI Projects.
Every request returns the same AgentResponse schema — the caller never knows
which path was taken.
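
As a rough illustration of what a single schema for every path can look like, the sketch below models the fields from this README's sample response as a dataclass. The class layout, the defaults, and the `to_json` helper are assumptions made for illustration; the real definition lives in app/models/response.py and may differ.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class AgentResponse:
    """Illustrative stand-in for the unified response schema (not the repo's class)."""

    answer: str
    path: str  # e.g. "local", "cloud", "cloud_fallback", "local_fallback"
    model: str
    reason: str
    confidence: float
    latency_ms: float
    correlation_id: str
    prompt_tokens: int | None = None
    completion_tokens: int | None = None
    fallback: bool = False
    fallback_reason: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialise every field, so each path emits the same set of keys."""
        return json.dumps(asdict(self))


local = AgentResponse("Hi!", "local", "phi-3.5-mini", "Trivial greeting", 1.0, 12.3, "abc")
cloud = AgentResponse("...", "cloud", "gpt-4o", "Complex reasoning", 0.9, 850.0, "def")
# The caller sees identical keys whichever path produced the answer.
print(json.loads(local.to_json()).keys() == json.loads(cloud.to_json()).keys())
```

Because optional fields carry defaults rather than being omitted, a caller can read `fallback` or `metadata` without first checking which path ran.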
Left panel: chat with colour-coded path indicators. Right panel: live routing decision — path, model, confidence, latency, privacy class, complexity, and full response JSON.
Sensitive data detected deterministically. The router LLM is never called. Cloud is blocked at both the policy layer and the service layer.
Local model raises RuntimeError. Service automatically falls back to cloud.
Response path is cloud_fallback; fallback_reason is populated.
Run all five routing scenarios with simulated responses in under a second:
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate
pip install -r requirements.txt
python -m app.main --demo

No Azure account, no Foundry Local installation, and no model downloads are needed for demo mode.
User prompt
│
▼
HybridAgentService.ask()
│
├─► RoutingPolicy.decide()
│ ├─► HeuristicClassifier (deterministic – zero model calls)
│ └─► RouterLLM (local router model – classification only)
│
├─── RouteTarget.LOCAL ──► LocalModelProvider.complete() (foundry-local-sdk)
│ └─► cloud fallback if local fails
│
└─── RouteTarget.CLOUD ──► CloudModelProvider.complete() (azure-ai-projects)
└─► local fallback if cloud fails + policy allows
| Stage | Component | Trigger |
|---|---|---|
| 1a | HeuristicClassifier | Sensitive keyword → RESTRICTED hard gate (no LLM call) |
| 1b | HeuristicClassifier | Trivial greeting → LOCAL (no LLM call) |
| 1c | HeuristicClassifier | very_high complexity → CLOUD (no LLM call) |
| 2 | RouterLLM | All other cases – local model scores complexity, privacy, confidence |
| 3 | RoutingPolicy | LLM confidence < threshold → safe default LOCAL |
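
The staged decision can be sketched as a small function: deterministic gates first, then the router model, then a confidence check. This is a minimal sketch, not the repo's RoutingPolicy. The `Decision` class, the `decide` signature, and the callables passed in are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    target: str          # "local" or "cloud"
    confidence: float
    reason: str
    deterministic: bool  # True when no router model call was made


def decide(
    prompt: str,
    heuristic: Callable[[str], Optional[Decision]],
    router_llm: Callable[[str], Decision],
    min_confidence: float = 0.6,
) -> Decision:
    # Stage 1: deterministic gates; a hit short-circuits the router model.
    gated = heuristic(prompt)
    if gated is not None:
        return gated
    # Stage 2: ask the local router model to classify the prompt.
    scored = router_llm(prompt)
    # Stage 3: distrust low-confidence output and default to local.
    if scored.confidence < min_confidence:
        return Decision("local", scored.confidence, "low router confidence", False)
    return scored


# Stub classifiers standing in for the real components.
result = decide(
    "Compare three database designs",
    heuristic=lambda p: None,
    router_llm=lambda p: Decision("cloud", 0.4, "possibly complex", False),
)
print(result.target, result.reason)
```

Passing the classifiers in as callables keeps the policy testable without loading any model.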
| Decision | Rationale |
|---|---|
| Two separate local models (router + task) | Router stays fast and cheap; task model can be larger |
| Deterministic gates before LLM routing | Privacy and compliance must not depend solely on model inference |
| Defence-in-depth RESTRICTED guard | Service layer independently blocks cloud for RESTRICTED content |
| Same AgentResponse schema from both paths | Callers are path-agnostic; routing is transparent |
| DefaultAzureCredential for cloud auth | Keyless; no secrets in code or environment files |
| Conservative fallback (default to local) | Low-confidence routing prefers privacy over capability |
| # | Prompt type | Path | Triggered by |
|---|---|---|---|
| 1 | Trivial greeting | local | Heuristic (no LLM call) |
| 2 | Simple factual question | local | RouterLLM → high confidence local |
| 3 | Sensitive data (password, PII) | local | Privacy hard gate (no LLM call) |
| 4 | Complex multi-step reasoning | cloud | RouterLLM → high confidence cloud |
| 5 | Local model fails | cloud_fallback | enable_cloud_fallback=true |
fl-mixedmodel/
├── app/
│ ├── main.py CLI + --ui + --demo entry point
│ ├── demo.py Demo runner (all 5 scenarios, no models needed)
│ ├── config.py Env-var-driven typed configuration
│ ├── models/
│ │ ├── response.py AgentResponse – unified output schema
│ │ └── routing.py RoutingDecision, RouterLLMOutput, enums
│ ├── routing/
│ │ ├── heuristic.py Deterministic keyword/rule pre-classifier
│ │ ├── router_llm.py Local model classifier + JSON parser
│ │ └── policy.py RoutingPolicy – two-stage decision logic
│ ├── providers/
│ │ ├── local_provider.py foundry-local-sdk wrapper (router + task models)
│ │ └── cloud_provider.py azure-ai-projects wrapper (Responses API)
│ ├── services/
│ │ └── hybrid_agent.py HybridAgentService – orchestration + fallback
│ ├── telemetry/
│ │ └── logger.py Structured JSON logger, correlation IDs, timers
│ ├── ui/
│ │ └── gradio_app.py Gradio web UI (chat + live routing diagnostics)
│ └── tests/
│ ├── test_routing.py Unit tests – heuristic, policy, router LLM
│ ├── test_hybrid_agent.py Unit tests – service paths and fallback
│ └── test_e2e.py E2E tests – full pipeline, privacy gates, UI
├── docs/
│ └── screenshots/ UI and scenario screenshot SVGs
├── requirements.txt
├── .env.example
├── specification.md Full technical specification for future agents
└── README.md
- Python 3.11 or later
- Foundry Local installed (foundry service start)
- An Azure AI Foundry project with at least one model deployment
- Azure CLI logged in: az login
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate
pip install -r requirements.txt
# Windows
copy .env.example .env
# macOS / Linux
cp .env.example .env

Edit .env and fill in at minimum:
| Variable | Required | Description |
|---|---|---|
| FOUNDRY_PROJECT_ENDPOINT | ✅ | Azure AI Foundry project endpoint URL |
| FOUNDRY_CLOUD_MODEL_DEPLOYMENT | ✅ | Cloud deployment name, e.g. gpt-4o |
| FOUNDRY_LOCAL_ROUTER_MODEL_ALIAS | recommended | Local router model alias |
| FOUNDRY_LOCAL_TASK_MODEL_ALIAS | recommended | Local task model alias |
To find available local model aliases:
from foundry_local import FoundryLocalManager
from foundry_local.models import Configuration
m = FoundryLocalManager.initialize(Configuration(app_name="hybrid-agent"))
print([model.alias for model in m.catalog.list_models()])

CLI (interactive REPL)
python -m app.main

Web UI (Gradio on port 7860)
python -m app.main --ui
python -m app.main --ui --port 8080
python -m app.main --ui --share # creates a public share URL

Demo mode (no models required)
python -m app.main --demo

Tests mock all network and model calls, so they need no external dependencies:
# All 43 tests
pytest app/tests/ -v
# With coverage report
pytest app/tests/ -v --cov=app --cov-report=term-missing
# Single suite
pytest app/tests/test_e2e.py -v

| Variable | Default | Description |
|---|---|---|
| FOUNDRY_PROJECT_ENDPOINT | (none) | Azure AI Foundry project endpoint (required) |
| FOUNDRY_CLOUD_MODEL_DEPLOYMENT | (none) | Cloud deployment name (required) |
| FOUNDRY_CLOUD_ROUTER_DEPLOYMENT | model-router | Cloud auto-router or named deployment |
| FOUNDRY_LOCAL_ROUTER_MODEL_ALIAS | SDK default | Local router model alias |
| FOUNDRY_LOCAL_TASK_MODEL_ALIAS | SDK default | Local task model alias |
| HYBRID_PRIVACY_MODE | strict | strict – confidential content stays local |
| HYBRID_MAX_LOCAL_TOKENS | 4096 | Max token budget for local inference |
| HYBRID_ENABLE_CLOUD_FALLBACK | true | Fall back to cloud if local model fails |
| HYBRID_ENABLE_LOCAL_FALLBACK | true | Fall back to local if cloud is unavailable |
| HYBRID_MIN_ROUTER_CONFIDENCE | 0.6 | Min router LLM confidence to trust its decision |
| APP_LOG_LEVEL | INFO | Log level (DEBUG / INFO / WARNING / ERROR) |
See .env.example for fully annotated values.
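
For orientation, environment-driven settings like these could be loaded into a typed object roughly as follows. This is a sketch under assumptions: `HybridConfig`, `load_config`, and the set of strings accepted as true are hypothetical and are not taken from app/config.py. Only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("FOUNDRY_PROJECT_ENDPOINT", "FOUNDRY_CLOUD_MODEL_DEPLOYMENT")


def _as_bool(value: str) -> bool:
    # Assumed convention: these strings count as true, anything else as false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class HybridConfig:
    project_endpoint: str
    cloud_model_deployment: str
    privacy_mode: str
    max_local_tokens: int
    enable_cloud_fallback: bool
    enable_local_fallback: bool
    min_router_confidence: float
    log_level: str


def load_config(env: Mapping[str, str] = os.environ) -> HybridConfig:
    # Fail fast on the two required settings before parsing anything else.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return HybridConfig(
        project_endpoint=env["FOUNDRY_PROJECT_ENDPOINT"],
        cloud_model_deployment=env["FOUNDRY_CLOUD_MODEL_DEPLOYMENT"],
        privacy_mode=env.get("HYBRID_PRIVACY_MODE", "strict"),
        max_local_tokens=int(env.get("HYBRID_MAX_LOCAL_TOKENS", "4096")),
        enable_cloud_fallback=_as_bool(env.get("HYBRID_ENABLE_CLOUD_FALLBACK", "true")),
        enable_local_fallback=_as_bool(env.get("HYBRID_ENABLE_LOCAL_FALLBACK", "true")),
        min_router_confidence=float(env.get("HYBRID_MIN_ROUTER_CONFIDENCE", "0.6")),
        log_level=env.get("APP_LOG_LEVEL", "INFO"),
    )
```

Accepting any mapping instead of reading `os.environ` directly makes it easy to pass a plain dict in tests.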
| Condition | Action |
|---|---|
| Sensitive keyword detected (password, PII, credentials, NHS, SSN…) | Force local, RESTRICTED |
| Prompt exceeds local context threshold (3,000 chars) | Hint cloud, HIGH complexity |
| Cloud-indicative keywords (generate code, tool use, research paper…) | Hint cloud, HIGH |
| Trivial greeting or single-word acknowledgement | Force local, TRIVIAL |
| Complexity very_high | Force cloud |
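
The deterministic rules above could be expressed as a short rule chain. The sketch below is an assumption-laden illustration, not the code in app/routing/heuristic.py: the regular expression, the greeting list, the rule order, and the `classify` return shape are all hypothetical. The keywords and the 3,000-character threshold are taken from the table.

```python
import re
from typing import Optional, Tuple

# Keywords named in the table; the real list is likely longer.
SENSITIVE = re.compile(r"\b(password|pii|credentials?|nhs|ssn)\b", re.IGNORECASE)
CLOUD_HINTS = ("generate code", "tool use", "research paper")
GREETINGS = {"hi", "hello", "hey", "thanks", "ok"}  # assumed examples
LOCAL_CONTEXT_CHARS = 3000


def classify(prompt: str) -> Optional[Tuple[str, str]]:
    """Return (action, label) for the first matching rule, or None to defer to the router."""
    text = prompt.strip().lower()
    # Privacy is checked first so that no later rule can send sensitive text to the cloud.
    if SENSITIVE.search(text):
        return ("force_local", "RESTRICTED")
    if text.rstrip("!.?") in GREETINGS:
        return ("force_local", "TRIVIAL")
    if len(prompt) > LOCAL_CONTEXT_CHARS:
        return ("hint_cloud", "HIGH")
    if any(hint in text for hint in CLOUD_HINTS):
        return ("hint_cloud", "HIGH")
    return None


for sample in ("Hello!", "What is my password?", "What is the capital of France?"):
    print(repr(sample), "->", classify(sample))
```

Returning `None` for unmatched prompts leaves the decision to the router model, which matches the split between forced outcomes and hints in the table.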
The router model returns structured JSON scored on:
- target: local or cloud
- confidence: [0, 1]; a value below HYBRID_MIN_ROUTER_CONFIDENCE defaults to local
- privacy_class: public | internal | confidential | restricted
- complexity: trivial | low | medium | high | very_high
- requires_long_context, requires_tool_use, requires_frontier_capability
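
Small models do not always return clean JSON, so the router output generally needs validating before it is trusted. The following is a minimal sketch of such a parser; `parse_router_output` and its error handling are hypothetical and do not reflect the parser in app/routing/router_llm.py. The field names and allowed values come from the list above.

```python
import json
from typing import Any, Dict

TARGETS = {"local", "cloud"}
PRIVACY = {"public", "internal", "confidential", "restricted"}
COMPLEXITY = {"trivial", "low", "medium", "high", "very_high"}
FLAGS = ("requires_long_context", "requires_tool_use", "requires_frontier_capability")


def parse_router_output(raw: str) -> Dict[str, Any]:
    """Extract and validate the router's JSON; raise ValueError on any problem."""
    # Models sometimes wrap JSON in prose, so take the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in router output")
    data = json.loads(raw[start : end + 1])  # JSONDecodeError is a ValueError

    if data.get("target") not in TARGETS:
        raise ValueError("target must be 'local' or 'cloud'")
    if data.get("privacy_class") not in PRIVACY:
        raise ValueError("unknown privacy_class")
    if data.get("complexity") not in COMPLEXITY:
        raise ValueError("unknown complexity")
    try:
        confidence = float(data["confidence"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError("confidence must be a number") from exc
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")

    parsed = {
        "target": data["target"],
        "confidence": confidence,
        "privacy_class": data["privacy_class"],
        "complexity": data["complexity"],
    }
    # Missing capability flags are treated as False (an assumption).
    for flag in FLAGS:
        parsed[flag] = bool(data.get(flag, False))
    return parsed
```

A caller could treat any `ValueError` as a zero-confidence result, which would then fall through to the safe local default.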
| Scenario | Behaviour |
|---|---|
| Local fails + HYBRID_ENABLE_CLOUD_FALLBACK=true | Escalate → cloud_fallback |
| Cloud fails + HYBRID_ENABLE_LOCAL_FALLBACK=true + content not RESTRICTED | Recover → local_fallback |
| Both fail | Graceful error AgentResponse; never raises to caller |
| Cloud unavailable + content RESTRICTED | Error response; content never sent elsewhere |
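
One way to picture these fallback rules is as an ordered list of providers to try, built from the routing target and the privacy class. The sketch below is illustrative only: the `ask` function, its parameters, the plain-dict result, and the "error" path value are assumptions, and it also assumes that RESTRICTED content which fails locally returns an error instead of escalating.

```python
from typing import Callable, Dict, List, Tuple

Provider = Callable[[str], str]


def ask(
    prompt: str,
    target: str,
    restricted: bool,
    local: Provider,
    cloud: Provider,
    cloud_fallback: bool = True,
    local_fallback: bool = True,
) -> Dict[str, object]:
    # Service-layer guard: restricted content may only ever reach the local provider.
    order: List[Tuple[str, Provider]]
    if restricted:
        order = [("local", local)]
    elif target == "local":
        order = [("local", local)] + ([("cloud_fallback", cloud)] if cloud_fallback else [])
    else:
        order = [("cloud", cloud)] + ([("local_fallback", local)] if local_fallback else [])

    errors: List[str] = []
    for path, provider in order:
        try:
            answer = provider(prompt)
        except Exception as exc:  # any provider failure moves on to the next option
            errors.append(f"{path}: {exc}")
            continue
        return {
            "path": path,
            "answer": answer,
            "fallback": bool(errors),
            "fallback_reason": errors[0] if errors else None,
        }
    # Every permitted provider failed: report the error instead of raising.
    return {"path": "error", "answer": None, "fallback": False,
            "fallback_reason": "; ".join(errors)}
```

Building the provider order up front keeps the privacy guard in one place, so no later branch can add the cloud provider back for restricted content.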
Every response — local, cloud, or fallback — returns the same shape:
{
  "answer": "The Berlin Wall fell on 9 November 1989.",
  "path": "local",
  "model": "phi-3.5-mini",
  "reason": "Simple historical fact, answerable by a small local model",
  "confidence": 0.87,
  "latency_ms": 52.1,
  "correlation_id": "ce35c6e5-1835-4b4a-b858-4c71d6b582f4",
  "prompt_tokens": 9,
  "completion_tokens": 12,
  "fallback": false,
  "fallback_reason": null,
  "metadata": {
    "privacy_class": "public",
    "complexity": "low",
    "deterministic_override": false,
    "requires_long_context": false,
    "requires_tool_use": false
  }
}

Earlier Gradio 4.x samples used show_copy_button=True on gr.Chatbot and the tuple message format. Both were removed in Gradio 6.x, where they produce a TypeError or a "data incompatible with message" error.
Resolved in this repo by:
- Initialising gr.Chatbot with type="messages" and buttons=['copy', 'copy_all', 'share']
- Returning {"role": "assistant", "content": ...} dicts from the chat handler
If you fork the UI and hit this error again, check app/ui/gradio_app.py for the Chatbot component initialisation.
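
If a fork still holds chat history in the older tuple format, a small converter can bridge it to role/content dicts. This is a sketch that deliberately avoids importing Gradio; `tuples_to_messages` is a hypothetical helper and is not part of this repo.

```python
from typing import Dict, List, Optional, Sequence, Tuple

LegacyTurn = Tuple[Optional[str], Optional[str]]  # (user text, assistant text)


def tuples_to_messages(history: Sequence[LegacyTurn]) -> List[Dict[str, str]]:
    """Convert (user, assistant) pairs into role/content message dicts."""
    messages: List[Dict[str, str]] = []
    for user_text, assistant_text in history:
        # Either side of a legacy pair may be None, e.g. while a reply is pending.
        if user_text is not None:
            messages.append({"role": "user", "content": user_text})
        if assistant_text is not None:
            messages.append({"role": "assistant", "content": assistant_text})
    return messages


print(tuples_to_messages([("Hi", "Hello!"), ("What is 2 + 2?", None)]))
```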
gpt-5 and o-series deployments require max_completion_tokens instead of max_tokens and reject custom temperature values. app/providers/cloud_provider.py tries max_completion_tokens first and falls back to max_tokens only when the API returns the specific unsupported parameter error, so the same code works against both old and new deployments.
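
The try-then-fall-back pattern described here can be sketched without any SDK. In the example below, `send` stands in for whatever function issues the request, and `UnsupportedParameterError` is a hypothetical exception standing in for the API's unsupported-parameter error; neither is a real Azure SDK name, and the code in app/providers/cloud_provider.py may detect the error differently.

```python
from typing import Any, Callable


class UnsupportedParameterError(Exception):
    """Hypothetical stand-in for the API's 'unsupported parameter' error."""

    def __init__(self, param: str) -> None:
        super().__init__(f"Unsupported parameter: {param}")
        self.param = param


def complete_with_token_limit(send: Callable[..., Any], prompt: str, limit: int) -> Any:
    try:
        # Newer deployments expect max_completion_tokens, so try it first.
        return send(prompt=prompt, max_completion_tokens=limit)
    except UnsupportedParameterError as exc:
        # Retry only for this specific rejection; any other error propagates.
        if exc.param != "max_completion_tokens":
            raise
        return send(prompt=prompt, max_tokens=limit)


def legacy_send(**kwargs: Any) -> str:
    """Fake older deployment that only understands max_tokens."""
    if "max_completion_tokens" in kwargs:
        raise UnsupportedParameterError("max_completion_tokens")
    return f"ok (max_tokens={kwargs['max_tokens']})"


print(complete_with_token_limit(legacy_send, "hello", 64))
```

Matching on the specific parameter name, not on any error at all, avoids masking unrelated failures such as authentication or quota errors behind a silent retry.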
First run downloads model weights through Foundry Local. Subsequent runs reuse the cache. Use python -m app.main --demo to validate routing logic without waiting for a download.
List available local model aliases:
python -c "from foundry_local import FoundryLocalManager; from foundry_local.models import Configuration; m=FoundryLocalManager.initialize(Configuration(app_name='test')); print([x.alias for x in m.catalog.list_models()][:10])"

Check cloud connectivity (run az login first):
python -c "from azure.identity import DefaultAzureCredential; from azure.ai.projects import AIProjectClient; import os; AIProjectClient(endpoint=os.environ['FOUNDRY_PROJECT_ENDPOINT'], credential=DefaultAzureCredential()); print('Connected')"

All events are emitted as structured JSON (one object per line):
{"ts":"2026-05-26T12:00:00","level":"INFO","logger":"hybrid_agent",
"message":"Routing decision","correlation_id":"ce35c6e5",
"target":"local","confidence":0.87,"reason":"Simple historical fact",
"privacy_class":"public","complexity":"low","deterministic_override":false}

Ingest directly into Azure Monitor, Splunk, ELK, or any log aggregator.
specification.md contains the complete technical specification —
component interfaces, routing decision tree, data models, testing strategy, security
model, extension guide, and known limitations. Written for future developers and
coding agents.
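
For readers extending the telemetry, the one-JSON-object-per-line log format shown earlier in this README could be produced with the standard library alone. The sketch below is an assumption about how such a logger might be built, not the code in app/telemetry/logger.py. `JsonFormatter` and `build_logger` are hypothetical names, and the timestamp here carries a UTC offset, unlike the sample.

```python
import json
import logging
import sys
from datetime import datetime, timezone
from typing import TextIO

# Attribute names every LogRecord has; anything else was supplied via `extra=`.
_STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(timespec="seconds"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy caller-supplied fields such as correlation_id or target.
        payload.update({k: v for k, v in vars(record).items() if k not in _STANDARD})
        return json.dumps(payload, default=str)


def build_logger(name: str, stream: TextIO = sys.stdout) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("hybrid_agent_sketch")
log.info("Routing decision", extra={"correlation_id": "ce35c6e5", "target": "local", "confidence": 0.87})
```

Passing routing fields through `extra=` keeps call sites short while still producing flat, queryable keys in the aggregator.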
- Foundry Local homepage
- Get started with Foundry Local
- Foundry Local Python SDK (GitHub)
- foundry-local-sdk on PyPI
- Azure AI Projects client library for Python
- azure-ai-projects on PyPI
- Responses model routing
- Generate responses with Foundry Models
- CONTRIBUTING.md: development setup, test workflow, PR checklist.
- CODE_OF_CONDUCT.md: Contributor Covenant v2.1.
- SECURITY.md: vulnerability reporting and secretless-by-design notes.
- CHANGELOG.md: version history.
- Licensed under the MIT License.



