# LFO

Hybrid inference router for local-plus-cloud LLMs via a single OpenAI-compatible endpoint.
LFO is a self-hosted hybrid inference router that exposes one OpenAI-compatible `/v1/chat/completions` endpoint and routes each request to either a local Android-hosted FunctionGemma model or a cloud Gemini model. It is designed for agentic systems that need fast, private on-device inference when possible and powerful cloud models when necessary. By implementing a two-phase hybrid routing strategy, combining token-count pre-filtering with confidence-aware escalation, LFO keeps agents responsive and capable while prioritising local execution to reduce latency and API costs.
Target users: Developers building LLM agents and tools that require edge-plus-cloud hybrid inference behind a single, reusable service.
## Table of Contents

- Description
- Features
- Tech Stack
- Architecture Overview
- Installation
- Usage
- Configuration
- Screenshots / Demo
- API / CLI Reference
- Tests
- Roadmap
- Contributing
- License
- Contact / Support
## Features

- OpenAI-Compatible: Drop-in replacement for any client using the standard chat completions endpoint.
- Hybrid Routing: Per-request decision making via `metadata.mode` (`local` | `cloud` | `auto`).
- Confidence Escalation: Automatically re-routes to cloud if the local model's confidence score falls below a threshold.
- Local-First: Fast, private inference on Android via Cactus Engine + FunctionGemma-270M.
- Cloud Power: Seamless access to Google Gemini 2.x for long-context and complex reasoning.
- Tool Calling: Native support for function calling on both local and cloud backends.
- Resiliency: Built-in circuit breaker for the local path to prevent offline device blocking.
- Observability: Real-time request logging and provider health metrics via an internal dashboard.
## Tech Stack

- Router: Node.js, TypeScript, Express
- Local Inference: Cactus Engine, FunctionGemma-270M (GGUF)
- Cloud Inference: Google Gemini 2.x (`@google/generative-ai`)
- Mobile Host: React Native (Android), `react-native-tcp-socket`
- Testing: `node:test` with mock providers
## Architecture Overview

```mermaid
flowchart LR
    Client[Agent / Client] --> Router[LFO Core Router]
    Router -->|mode: cloud| Gemini[Google Gemini API]
    Router -->|mode: local| Bridge[Android Bridge]
    Bridge --> Mobile[LFO Mobile App]
    Mobile --> Engine[Cactus + FunctionGemma]
```
The system uses a Node.js-based core router to manage request flow between local and cloud providers. Requests are intelligently routed based on token heuristics or explicit metadata, with local traffic forwarded via a dedicated bridge to the mobile inference host.
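As a rough sketch, the two-phase decision (explicit mode first, then a token pre-filter in auto mode) might look like the function below. The function name and the crude ~4-characters-per-token estimate are assumptions for illustration, not LFO's actual code.

```typescript
// Illustrative routing decision: an explicit `mode` wins; in "auto" mode
// a cheap token estimate pre-filters long prompts to the cloud path.
type Mode = "local" | "cloud" | "auto";
type Route = "local" | "cloud";

interface ChatMessage {
  role: string;
  content: string;
}

function estimateTokens(messages: ChatMessage[]): number {
  // Crude heuristic (~4 chars per token); a real router might use tiktoken.
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4);
}

function chooseRoute(
  messages: ChatMessage[],
  mode: Mode = "auto",
  maxLocalTokens = 1500,
): Route {
  if (mode !== "auto") return mode; // caller forced a provider
  return estimateTokens(messages) <= maxLocalTokens ? "local" : "cloud";
}
```

Confidence-based escalation then happens after a local attempt: if the local model's reported confidence falls below the threshold, the same request is replayed against the cloud provider.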
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/MasteraSnackin/LFO.git
   cd LFO
   ```

2. Install core dependencies:

   ```bash
   cd lfo-core && npm install
   ```

3. Configure the environment:

   ```bash
   cp .env.example .env
   # Set at minimum:
   # GEMINI_API_KEY=<your_key>
   # ANDROID_HOST=<device_ip>
   ```

4. Prepare the Android device: ensure the LFO Mobile App is installed and running on a device on the same LAN.
## Usage

Start the orchestrator:

```bash
npm start
```

Execute a hybrid request:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "metadata": {"mode": "auto"}
  }'
```

## Configuration

LFO is primarily configured via environment variables in the `lfo-core` directory:
- `GEMINI_API_KEY`: Required for cloud escalation.
- `PORT`: HTTP port for the router (default: `8080`).
- `ANDROID_HOST`: LAN IP address of the Android inference host.
- `MAX_LOCAL_TOKENS`: Maximum prompt length for local routing (default: `1500`).
- `LFO_AUTH_TOKEN`: Optional bearer token for endpoint security.
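Reading these variables with their defaults applied might look like the sketch below; `loadConfig` and the `LfoConfig` shape are illustrative names, not LFO's actual loader.

```typescript
// Sketch of reading the documented environment variables with defaults.
interface LfoConfig {
  geminiApiKey?: string; // required for cloud escalation
  port: number;          // default 8080
  androidHost?: string;  // LAN IP of the Android inference host
  maxLocalTokens: number; // default 1500
  authToken?: string;    // optional bearer token
}

function loadConfig(env: Record<string, string | undefined>): LfoConfig {
  return {
    geminiApiKey: env.GEMINI_API_KEY,
    port: Number(env.PORT ?? 8080),
    androidHost: env.ANDROID_HOST,
    maxLocalTokens: Number(env.MAX_LOCAL_TOKENS ?? 1500),
    authToken: env.LFO_AUTH_TOKEN,
  };
}
```

In the router itself this would be called as `loadConfig(process.env)` at startup.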
## Screenshots / Demo

The internal dashboard provides visibility into routing decisions, latency, and model confidence scores.
## API / CLI Reference

`POST /v1/chat/completions` accepts standard OpenAI payloads. Custom routing is controlled via the `metadata` object:
```json
{
  "messages": [...],
  "metadata": {
    "mode": "local",
    "confidence_threshold": 0.7
  }
}
```

Responses include `lfo_metadata` containing the actual confidence score and the `routing_reason`.
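On the client side, the extra `lfo_metadata` field can be read off the response as sketched below. Only `confidence` and `routing_reason` are documented here; the rest of the shape is assumed to follow the standard OpenAI chat-completion format, and `wasEscalated` is a hypothetical helper.

```typescript
// Illustrative response shape: standard OpenAI chat-completion fields
// plus the LFO-specific `lfo_metadata` extension described above.
interface LfoMetadata {
  confidence: number;
  routing_reason: string;
}

interface LfoChatResponse {
  choices: { message: { role: string; content: string } }[];
  lfo_metadata?: LfoMetadata;
}

function wasEscalated(res: LfoChatResponse, threshold = 0.7): boolean {
  // Assumption: a missing confidence score (e.g. a cloud-only response)
  // is treated as having been escalated past the local model.
  const conf = res.lfo_metadata?.confidence;
  return conf === undefined || conf < threshold;
}
```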
## Tests

The project uses the built-in Node.js test runner for unit and integration testing:

```bash
cd lfo-core && npm test
```

The suite includes 45+ tests covering routing logic, circuit breaker states, and error mapping across providers.
## Roadmap

- Support for Server-Sent Events (SSE) streaming.
- Multi-device load balancing for local inference clusters.
- Integration with `tiktoken` for precise tokenisation.
- Persistent storage for usage statistics and historical performance.
## Contributing

Contributions are welcome via GitHub issues and pull requests. Please ensure all tests pass before submitting changes.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Contact / Support

- Maintainer: MasteraSnackin
- GitHub: MasteraSnackin
- Issues: Project Issues