leestott/fl-mixedmodel

Hybrid Agent – Foundry Local + Microsoft Foundry

A production-grade, Python-only hybrid AI application. Every request is classified on-device first: deterministic rules handle the clear-cut cases, and a lightweight local router model classifies the rest. Simple, private, or latency-sensitive requests stay on-device using Foundry Local. Complex, reasoning-heavy, or frontier-capability requests escalate to a Microsoft Foundry cloud model via Azure AI Projects.

Every request returns the same AgentResponse schema, so calling code does not need to know which path was taken. The path is still reported in the response for diagnostics.


Screenshots

Web UI – conversation with live routing diagnostics

[Screenshot: Hybrid Agent Web UI]

Left panel: chat with colour-coded path indicators. Right panel: live routing decision — path, model, confidence, latency, privacy class, complexity, and full response JSON.

Privacy hard gate – RESTRICTED content forced local

[Screenshot: Privacy Gate]

Sensitive data detected deterministically. The router LLM is never called. Cloud is blocked at both the policy layer and the service layer.

Cloud fallback – local model fails, cloud recovers

[Screenshot: Cloud Fallback]

Local model raises RuntimeError. Service automatically falls back to cloud. Response path is cloud_fallback; fallback_reason is populated.

CLI demo – all five scenarios without real models

[Screenshot: CLI Demo]


Quick start – demo mode (no models required)

Run all five routing scenarios with simulated responses in under a second:

python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate

pip install -r requirements.txt
python -m app.main --demo

No Azure account, no Foundry Local installation, and no model downloads are needed for demo mode.


Architecture

User prompt
     │
     ▼
 HybridAgentService.ask()
     │
     ├─► RoutingPolicy.decide()
     │       ├─► HeuristicClassifier   (deterministic – zero model calls)
     │       └─► RouterLLM             (local router model – classification only)
     │
     ├─── RouteTarget.LOCAL  ──► LocalModelProvider.complete()   (foundry-local-sdk)
     │                               └─► cloud fallback if local fails
     │
     └─── RouteTarget.CLOUD  ──► CloudModelProvider.complete()   (azure-ai-projects)
                                     └─► local fallback if cloud fails + policy allows

Routing decision flow

| Stage | Component | Trigger |
| --- | --- | --- |
| 1a | HeuristicClassifier | Sensitive keyword → RESTRICTED hard gate (no LLM call) |
| 1b | HeuristicClassifier | Trivial greeting → LOCAL (no LLM call) |
| 1c | HeuristicClassifier | very_high complexity → CLOUD (no LLM call) |
| 2 | RouterLLM | All other cases – local model scores complexity, privacy, confidence |
| 3 | RoutingPolicy | LLM confidence < threshold → safe default LOCAL |

Key design decisions

| Decision | Rationale |
| --- | --- |
| Two separate local models (router + task) | Router stays fast and cheap; task model can be larger |
| Deterministic gates before LLM routing | Privacy and compliance must not depend solely on model inference |
| Defence-in-depth RESTRICTED guard | Service layer independently blocks cloud for RESTRICTED content |
| Same AgentResponse schema from both paths | Callers are path-agnostic; routing is transparent |
| DefaultAzureCredential for cloud auth | Keyless; no secrets in code or environment files |
| Conservative fallback (default to local) | Low-confidence routing prefers privacy over capability |

The five routing scenarios

| # | Prompt type | Path | Triggered by |
| --- | --- | --- | --- |
| 1 | Trivial greeting | local | Heuristic (no LLM call) |
| 2 | Simple factual question | local | RouterLLM → high confidence local |
| 3 | Sensitive data (password, PII) | local | Privacy hard gate (no LLM call) |
| 4 | Complex multi-step reasoning | cloud | RouterLLM → high confidence cloud |
| 5 | Local model fails | cloud_fallback | enable_cloud_fallback=true |

Project structure

fl-mixedmodel/
├── app/
│   ├── main.py               CLI + --ui + --demo entry point
│   ├── demo.py               Demo runner (all 5 scenarios, no models needed)
│   ├── config.py             Env-var-driven typed configuration
│   ├── models/
│   │   ├── response.py       AgentResponse – unified output schema
│   │   └── routing.py        RoutingDecision, RouterLLMOutput, enums
│   ├── routing/
│   │   ├── heuristic.py      Deterministic keyword/rule pre-classifier
│   │   ├── router_llm.py     Local model classifier + JSON parser
│   │   └── policy.py         RoutingPolicy – two-stage decision logic
│   ├── providers/
│   │   ├── local_provider.py foundry-local-sdk wrapper (router + task models)
│   │   └── cloud_provider.py azure-ai-projects wrapper (Responses API)
│   ├── services/
│   │   └── hybrid_agent.py   HybridAgentService – orchestration + fallback
│   ├── telemetry/
│   │   └── logger.py         Structured JSON logger, correlation IDs, timers
│   ├── ui/
│   │   └── gradio_app.py     Gradio web UI (chat + live routing diagnostics)
│   └── tests/
│       ├── test_routing.py      Unit tests – heuristic, policy, router LLM
│       ├── test_hybrid_agent.py Unit tests – service paths and fallback
│       └── test_e2e.py          E2E tests – full pipeline, privacy gates, UI
├── docs/
│   └── screenshots/          UI and scenario screenshot SVGs
├── requirements.txt
├── .env.example
├── specification.md          Full technical specification for future agents
└── README.md

Setup (full mode with real models)

Prerequisites

  • Python 3.11 or later
  • Foundry Local installed, with its service running (foundry service start)
  • An Azure AI Foundry project with at least one model deployment
  • Azure CLI logged in: az login

1 – Create virtual environment

python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

2 – Install dependencies

pip install -r requirements.txt

3 – Configure environment

# Windows
copy .env.example .env

# macOS / Linux
cp .env.example .env

Edit .env and fill in at minimum:

| Variable | Required | Description |
| --- | --- | --- |
| FOUNDRY_PROJECT_ENDPOINT | yes | Azure AI Foundry project endpoint URL |
| FOUNDRY_CLOUD_MODEL_DEPLOYMENT | yes | Cloud deployment name, e.g. gpt-4o |
| FOUNDRY_LOCAL_ROUTER_MODEL_ALIAS | recommended | Local router model alias |
| FOUNDRY_LOCAL_TASK_MODEL_ALIAS | recommended | Local task model alias |

To find available local model aliases:

from foundry_local import FoundryLocalManager
from foundry_local.models import Configuration

m = FoundryLocalManager.initialize(Configuration(app_name="hybrid-agent"))
print([model.alias for model in m.catalog.list_models()])

4 – Run

CLI (interactive REPL)

python -m app.main

Web UI (Gradio on port 7860)

python -m app.main --ui
python -m app.main --ui --port 8080
python -m app.main --ui --share        # creates a public share URL

Demo mode (no models required)

python -m app.main --demo

Running tests

Tests mock all network and model calls — no external dependencies:

# All 43 tests
pytest app/tests/ -v

# With coverage report
pytest app/tests/ -v --cov=app --cov-report=term-missing

# Single suite
pytest app/tests/test_e2e.py -v

Environment variable reference

| Variable | Default | Description |
| --- | --- | --- |
| FOUNDRY_PROJECT_ENDPOINT | (none) | Azure AI Foundry project endpoint (required) |
| FOUNDRY_CLOUD_MODEL_DEPLOYMENT | (none) | Cloud deployment name (required) |
| FOUNDRY_CLOUD_ROUTER_DEPLOYMENT | model-router | Cloud auto-router or named deployment |
| FOUNDRY_LOCAL_ROUTER_MODEL_ALIAS | SDK default | Local router model alias |
| FOUNDRY_LOCAL_TASK_MODEL_ALIAS | SDK default | Local task model alias |
| HYBRID_PRIVACY_MODE | strict | strict – confidential content stays local |
| HYBRID_MAX_LOCAL_TOKENS | 4096 | Max token budget for local inference |
| HYBRID_ENABLE_CLOUD_FALLBACK | true | Fall back to cloud if local model fails |
| HYBRID_ENABLE_LOCAL_FALLBACK | true | Fall back to local if cloud is unavailable |
| HYBRID_MIN_ROUTER_CONFIDENCE | 0.6 | Min router LLM confidence to trust its decision |
| APP_LOG_LEVEL | INFO | Log level (DEBUG / INFO / WARNING / ERROR) |

See .env.example for fully annotated values.
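As an illustration of env-var-driven typed configuration, here is a minimal sketch that reads a subset of the variables above into a frozen dataclass. The names HybridConfig and load_config are hypothetical and are not taken from app/config.py; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("FOUNDRY_PROJECT_ENDPOINT", "FOUNDRY_CLOUD_MODEL_DEPLOYMENT")


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class HybridConfig:  # hypothetical name, for illustration only
    project_endpoint: str
    cloud_model_deployment: str
    privacy_mode: str
    max_local_tokens: int
    enable_cloud_fallback: bool
    enable_local_fallback: bool
    min_router_confidence: float


def load_config(env: Optional[Mapping[str, str]] = None) -> HybridConfig:
    """Build a typed config from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return HybridConfig(
        project_endpoint=env["FOUNDRY_PROJECT_ENDPOINT"],
        cloud_model_deployment=env["FOUNDRY_CLOUD_MODEL_DEPLOYMENT"],
        privacy_mode=env.get("HYBRID_PRIVACY_MODE", "strict"),
        max_local_tokens=int(env.get("HYBRID_MAX_LOCAL_TOKENS", "4096")),
        enable_cloud_fallback=_as_bool(env.get("HYBRID_ENABLE_CLOUD_FALLBACK", "true")),
        enable_local_fallback=_as_bool(env.get("HYBRID_ENABLE_LOCAL_FALLBACK", "true")),
        min_router_confidence=float(env.get("HYBRID_MIN_ROUTER_CONFIDENCE", "0.6")),
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests without touching the real environment.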


Routing policy

Stage 1 – Deterministic (always runs first, zero model calls)

| Condition | Action |
| --- | --- |
| Sensitive keyword detected (password, PII, credentials, NHS, SSN…) | Force local, RESTRICTED |
| Prompt exceeds local context threshold (3,000 chars) | Hint cloud, HIGH complexity |
| Cloud-indicative keywords (generate code, tool use, research paper…) | Hint cloud, HIGH |
| Trivial greeting or single-word acknowledgement | Force local, TRIVIAL |
| Complexity very_high | Force cloud |
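The ordering of these deterministic checks can be sketched as follows. This is a minimal illustration, not the code in app/routing/heuristic.py: the keyword lists, the HeuristicResult type, and the classify function are all hypothetical, matching is naive substring matching, and the very_high rule is omitted because this README does not say how that complexity level is derived.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative keyword lists only; the real lists are not shown in this README.
SENSITIVE = ("password", "credentials", "ssn", "nhs")
CLOUD_HINTS = ("generate code", "tool use", "research paper")
GREETINGS = {"hi", "hello", "hey", "thanks", "ok"}
LOCAL_CONTEXT_CHARS = 3000  # threshold quoted in the table above


@dataclass
class HeuristicResult:
    target: Optional[str]           # "local", "cloud", or None when undecided
    forced: bool                    # True = hard gate, False = hint for the router
    privacy_class: str = "public"
    complexity: Optional[str] = None


def classify(prompt: str) -> HeuristicResult:
    text = prompt.lower().strip()
    # 1. The privacy gate is checked first so no later rule can override it.
    if any(word in text for word in SENSITIVE):
        return HeuristicResult("local", True, privacy_class="restricted")
    # 2. Trivial greetings never need a model-based routing call.
    if re.sub(r"[^a-z]", "", text) in GREETINGS:
        return HeuristicResult("local", True, complexity="trivial")
    # 3. Long prompts and cloud-indicative keywords only hint at cloud.
    if len(prompt) > LOCAL_CONTEXT_CHARS or any(k in text for k in CLOUD_HINTS):
        return HeuristicResult("cloud", False, complexity="high")
    # 4. Everything else is left to the router LLM.
    return HeuristicResult(None, False)
```

Checking the privacy rule before anything else means a prompt that is both long and sensitive is still forced local.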

Stage 2 – Local router LLM (ambiguous cases only)

The router model returns structured JSON scored on:

  • target: local or cloud
  • confidence: a value in [0, 1]; below HYBRID_MIN_ROUTER_CONFIDENCE the policy defaults to local
  • privacy_class: public | internal | confidential | restricted
  • complexity: trivial | low | medium | high | very_high
  • requires_long_context, requires_tool_use, requires_frontier_capability: boolean flags
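A minimal sketch of how the router's reply could be parsed and the confidence threshold applied. RouterOutput and parse_router_output are hypothetical names, and the defaults used for missing fields are assumptions; the parser in app/routing/router_llm.py may behave differently. The sketch also keeps content local when the router itself labels it restricted, which is an assumption consistent with the privacy gate described above.

```python
import json
from dataclasses import dataclass

MIN_ROUTER_CONFIDENCE = 0.6  # default of HYBRID_MIN_ROUTER_CONFIDENCE


@dataclass
class RouterOutput:  # hypothetical name, for illustration only
    target: str
    confidence: float
    privacy_class: str
    complexity: str
    reason: str = ""


def parse_router_output(raw: str, threshold: float = MIN_ROUTER_CONFIDENCE) -> RouterOutput:
    """Parse the router's JSON reply and fall back to local on any doubt."""
    safe_default = RouterOutput("local", 0.0, "internal", "medium", "unparseable router output")
    # Small models often wrap JSON in extra text, so take the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return safe_default
    try:
        data = json.loads(raw[start:end + 1])
        out = RouterOutput(
            target=str(data["target"]).lower(),
            confidence=min(1.0, max(0.0, float(data["confidence"]))),
            privacy_class=str(data.get("privacy_class", "internal")).lower(),
            complexity=str(data.get("complexity", "medium")).lower(),
            reason=str(data.get("reason", "")),
        )
    except (ValueError, KeyError, TypeError):
        return safe_default
    # Unknown targets, low confidence, and restricted content all stay local.
    if (
        out.target not in ("local", "cloud")
        or out.confidence < threshold
        or out.privacy_class == "restricted"
    ):
        out.target = "local"
    return out
```

Treating every parse failure as a local decision mirrors the conservative default in the design table: when the router cannot be trusted, privacy wins over capability.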

Fallback rules

| Scenario | Behaviour |
| --- | --- |
| Local fails + HYBRID_ENABLE_CLOUD_FALLBACK=true | Escalate → cloud_fallback |
| Cloud fails + HYBRID_ENABLE_LOCAL_FALLBACK=true + content not RESTRICTED | Recover → local_fallback |
| Both fail | Graceful error AgentResponse; never raises to caller |
| Cloud unavailable + content RESTRICTED | Error response; content never sent elsewhere |
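The fallback rules can be sketched as a single function. This is a simplified illustration under stated assumptions, not the code in app/services/hybrid_agent.py: providers are modelled as plain callables, the result is a small dict rather than a full AgentResponse, the "error" path value is hypothetical, and RESTRICTED content is simply pinned to the local provider so that it can never reach the cloud callable.

```python
from typing import Callable, Optional

Provider = Callable[[str], str]  # hypothetical signature: prompt in, answer text out


def _response(answer: str, path: str, fallback_reason: Optional[str] = None) -> dict:
    return {
        "answer": answer,
        "path": path,
        "fallback": path.endswith("_fallback"),
        "fallback_reason": fallback_reason,
    }


def ask_with_fallback(
    prompt: str,
    target: str,                 # "local" or "cloud", as chosen by the routing policy
    restricted: bool,            # True when the privacy class is RESTRICTED
    local: Provider,
    cloud: Provider,
    enable_cloud_fallback: bool = True,
    enable_local_fallback: bool = True,
) -> dict:
    # Defence in depth: RESTRICTED content is pinned to local even if the
    # routing decision said otherwise, and it may never fall back to cloud.
    if restricted:
        target = "local"
    if target == "local":
        primary, secondary, fallback_path = local, cloud, "cloud_fallback"
        fallback_allowed = enable_cloud_fallback and not restricted
    else:
        primary, secondary, fallback_path = cloud, local, "local_fallback"
        fallback_allowed = enable_local_fallback

    try:
        return _response(primary(prompt), target)
    except Exception as err:  # provider errors must not reach the caller
        reason = f"{target} provider failed: {err}"

    if fallback_allowed:
        try:
            return _response(secondary(prompt), fallback_path, reason)
        except Exception as err:
            reason = f"{reason}; fallback failed: {err}"
    # Both paths failed, or fallback was not permitted: return an error payload.
    return _response("", "error", reason)
```

Because the restricted check happens inside this function as well as in the routing policy, a bug in either layer alone cannot send restricted content to the cloud.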

Response schema

Every response — local, cloud, or fallback — returns the same shape:

{
  "answer": "The Berlin Wall fell on 9 November 1989.",
  "path": "local",
  "model": "phi-3.5-mini",
  "reason": "Simple historical fact, answerable by a small local model",
  "confidence": 0.87,
  "latency_ms": 52.1,
  "correlation_id": "ce35c6e5-1835-4b4a-b858-4c71d6b582f4",
  "prompt_tokens": 9,
  "completion_tokens": 12,
  "fallback": false,
  "fallback_reason": null,
  "metadata": {
    "privacy_class": "public",
    "complexity": "low",
    "deterministic_override": false,
    "requires_long_context": false,
    "requires_tool_use": false
  }
}
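For orientation, the shape above maps naturally onto a dataclass. The sketch below is an assumption about how such a schema could be declared; the real AgentResponse in app/models/response.py may use different types, defaults, or a validation library.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class AgentResponse:
    """Unified output shape; field names follow the JSON example above."""
    answer: str
    path: str                      # e.g. "local", "cloud", "cloud_fallback"
    model: str
    reason: str
    confidence: float
    latency_ms: float
    # A fresh correlation ID is generated when the caller does not supply one.
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    fallback: bool = False
    fallback_reason: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() preserves declaration order, so keys match the example layout.
        return json.dumps(asdict(self), indent=2)
```

A caller can branch on fallback or inspect path for diagnostics, but the answer field is read the same way regardless of which provider produced it.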

Troubleshooting

Gradio UI: "data incompatible with message" (resolved)

Earlier Gradio 4.x samples used show_copy_button=True on gr.Chatbot and the tuple message format. Both were removed in Gradio 6.x, so code that still uses them raises a TypeError or a "data incompatible with message" error.

Resolved in this repo by:

  • Initialising gr.Chatbot with type="messages" and buttons=['copy', 'copy_all', 'share']
  • Returning {"role": "assistant", "content": ...} dicts from the chat handler

If you fork the UI and hit this error again, check app/ui/gradio_app.py for the Chatbot component initialisation.

Cloud returns 400 "Unsupported parameter: max_tokens"

gpt-5 and o-series deployments require max_completion_tokens instead of max_tokens and reject custom temperature values. app/providers/cloud_provider.py tries max_completion_tokens first and falls back to max_tokens only when the API returns the specific unsupported parameter error, so the same code works against both old and new deployments.
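The try-then-fall-back pattern described above can be sketched without any cloud SDK. Everything here is a stand-in: BadRequestError represents whatever HTTP 400 exception the real client raises, send represents the call that submits the request, and legacy_deployment is a fake used only to exercise the retry. The actual logic in app/providers/cloud_provider.py may match errors differently.

```python
from typing import Any, Callable, Dict


class BadRequestError(Exception):
    """Stand-in for the HTTP 400 error a real client library would raise."""


def complete_with_token_fallback(
    send: Callable[..., Dict[str, Any]], prompt: str, token_limit: int
) -> Dict[str, Any]:
    try:
        # Newer deployments expect max_completion_tokens, so try it first.
        return send(prompt=prompt, max_completion_tokens=token_limit)
    except BadRequestError as err:
        message = str(err).lower()
        # Retry only for this specific rejection; re-raise every other 400.
        if "unsupported parameter" in message and "max_completion_tokens" in message:
            return send(prompt=prompt, max_tokens=token_limit)
        raise


def legacy_deployment(prompt: str, **params: Any) -> Dict[str, Any]:
    """Fake deployment that only understands the older max_tokens parameter."""
    if "max_completion_tokens" in params:
        raise BadRequestError("Unsupported parameter: 'max_completion_tokens'")
    return {"text": f"echo: {prompt}", "params": params}
```

Matching on the specific error text keeps unrelated 400 responses, such as a malformed prompt, from being silently retried.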

Local model first-load is slow (30–60 seconds)

First run downloads model weights through Foundry Local. Subsequent runs reuse the cache. Use python -m app.main --demo to validate routing logic without waiting for a download.

Diagnostics

List available local model aliases:

python -c "from foundry_local import FoundryLocalManager; from foundry_local.models import Configuration; m=FoundryLocalManager.initialize(Configuration(app_name='test')); print([x.alias for x in m.catalog.list_models()][:10])"

Check cloud connectivity (run az login first):

python -c "from azure.identity import DefaultAzureCredential; from azure.ai.projects import AIProjectClient; import os; AIProjectClient(endpoint=os.environ['FOUNDRY_PROJECT_ENDPOINT'], credential=DefaultAzureCredential()); print('Connected')"

Logging

All events are emitted as structured JSON (one object per line):

{"ts":"2026-05-26T12:00:00","level":"INFO","logger":"hybrid_agent",
 "message":"Routing decision","correlation_id":"ce35c6e5",
 "target":"local","confidence":0.87,"reason":"Simple historical fact",
 "privacy_class":"public","complexity":"low","deterministic_override":false}

Ingest directly into Azure Monitor, Splunk, ELK, or any log aggregator.
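One way to produce log lines of this shape with the standard library is a custom logging.Formatter. This is a sketch, not the code in app/telemetry/logger.py: the JsonFormatter name is hypothetical, and the timestamp here carries a UTC offset, which the sample above does not.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object on a single line."""

    # Attributes every LogRecord has; anything else was passed via `extra=`.
    _STANDARD = set(vars(logging.makeLogRecord({}))) | {"message", "asctime"}

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(timespec="seconds"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy structured fields such as correlation_id, target, confidence.
        payload.update({k: v for k, v in vars(record).items() if k not in self._STANDARD})
        # default=str keeps logging from failing on non-serialisable values.
        return json.dumps(payload, default=str)
```

With this formatter attached to a handler, a call such as logger.info("Routing decision", extra={"target": "local", "confidence": 0.87}) adds the extra keys as top-level JSON fields.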


Specification

specification.md contains the complete technical specification — component interfaces, routing decision tree, data models, testing strategy, security model, extension guide, and known limitations. Written for future developers and coding agents.

