Architecture Review #21

Evaluate: A Rust-Based LLM Evaluation Framework

Let me walk you through this codebase—it's a pretty solid example of a production-ready Rust application that combines async runtime patterns, multi-provider API integration, WebSocket-based real-time updates, and SQLite persistence. Think of it as an LLM testing harness with a web UI, similar to what you'd build if you wanted to compare model outputs across different providers systematically.

Architecture Overview

This is fundamentally a multi-provider LLM evaluation framework with an HTTP API and WebSocket support. The architecture follows a clean separation of concerns:

  • Providers layer: Abstractions over different LLM APIs (Anthropic, OpenAI, Gemini, Ollama)
  • Runner layer: Orchestrates evaluations and implements "LLM-as-a-judge" pattern
  • API layer: Actix-web server with REST endpoints and WebSocket broadcasting
  • Persistence layer: SQLx with SQLite for storing evaluation history

The really interesting bit is how it handles the "LLM-as-a-judge" pattern—you run a prompt against one model, then use another model to evaluate whether the output meets your criteria. It's evaluations all the way down.

The Provider Abstraction

Let's start with the core trait that makes multi-provider support possible:

pub trait LlmProvider: Send + Sync {
    fn generate(&self, model: &str, prompt: &str) 
        -> impl std::future::Future<Output = Result<(String, u64, TokenUsage)>> + Send;
}

Notice they're not using async_trait here. Instead, they're leveraging Rust's newer impl Trait in trait methods (RPITIT), stabilized in Rust 1.75. This is more efficient than async_trait because it avoids boxing the returned future, and a comment in the code calls the choice out explicitly, which suggests the team was intentional about it.

RPITIT stands for "Return Position Impl Trait In Traits".

Each provider returns a tuple: (String, u64, TokenUsage) representing the response text, latency in milliseconds, and token usage. This unified interface means the rest of the system doesn't need to know which provider it's talking to.
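
To make the trait concrete, here's a minimal sketch of what an implementation looks like with RPITIT. EchoProvider is a made-up stand-in (not from the repo), and the sketch assumes TokenUsage implements Default:

struct EchoProvider;

impl LlmProvider for EchoProvider {
    fn generate(&self, model: &str, prompt: &str)
        -> impl std::future::Future<Output = Result<(String, u64, TokenUsage)>> + Send
    {
        // Build the "response" up front so no borrows are carried into the async block.
        let text = format!("[{model}] {prompt}");
        async move {
            // No network call here; latency and token usage are placeholders.
            Ok((text, 0, TokenUsage::default()))
        }
    }
}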

Anthropic Provider Deep Dive

The Anthropic implementation is particularly clean:

let body = AnthropicRequest {
    model,
    messages: vec![Message {
        role: "user",
        content: prompt,
    }],
    max_tokens: 4096,
    temperature: Some(0.7),
};

let start = Instant::now();

let resp = self
    .client
    .post(&url)
    //.header("x-api-key", &self.config.api_key)
    //.header("anthropic-version", "2023-06-01")
    .header("Authorization", &format!("Bearer {}", &self.config.api_key))
    .header("Content-Type", "application/json")
    .json(&body)
    .send()
    .await?;

let latency_ms = start.elapsed().as_millis() as u64;

They're timing the entire HTTP round-trip with Instant, which is smart for evaluation purposes—you want wall-clock time, not just processing time. The ? operator propagates errors up through the Result chain, and they've got custom error types (via thiserror) to handle different failure modes.

What's particularly nice is the error handling strategy:

if !status.is_success() {
    let error_body = resp
        .text()
        .await
        .unwrap_or_else(|_| "Could not read error body".to_string());
    return Err(EvalError::ApiError {
        status: status.as_u16(),
        body: error_body,
    });
}

They're capturing the error body for debugging, which is invaluable when dealing with API errors. The unwrap_or_else ensures you always get something back, even if reading the error response fails.

Configuration and Environment Management

The configuration system is interesting because it's entirely environment-variable-driven, which is very cloud-native:

pub struct AppConfig {
    pub anthropic: Option<AnthropicConfig>,
    pub gemini: Option<GeminiConfig>,
    pub ollama: Option<OllamaConfig>,
    pub openai: Option<OpenAIConfig>,
    pub models: Vec<String>,
}

All providers are optional, and the from_env() method builds up the config dynamically:

let anthropic_config = if let Ok(api_key) = std::env::var("ANTHROPIC_API_KEY") {
    let api_base = std::env::var("ANTHROPIC_API_BASE")
        .unwrap_or_else(|_| "https://api.anthropic.com".to_string());
    // ... build config
    Some(AnthropicConfig { api_base, api_key, models })
} else {
    None
};

This pattern repeats for each provider, and they aggregate all available models into a flat list with provider prefixes like "anthropic:claude-sonnet-4". The validation at the end ensures you can't start the server without any providers configured:

if anthropic_config.is_none() && gemini_config.is_none() 
   && ollama_config.is_none() && openai_config.is_none() {
    return Err(EvalError::Config(
        "No LLM providers configured. Please set at least one of: ...".to_string()
    ));
}
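
Back to that prefixed model list: the aggregation in from_env presumably looks something like the sketch below. The models field comes from the AnthropicConfig snippet above; the rest is an assumption about the real code.

let mut models = Vec::new();
if let Some(cfg) = &anthropic_config {
    models.extend(cfg.models.iter().map(|m| format!("anthropic:{m}")));
}
if let Some(cfg) = &ollama_config {
    models.extend(cfg.models.iter().map(|m| format!("ollama:{m}")));
}
// ... and so on for the remaining providers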

Template Rendering for Parameterized Tests

One clever feature is the template system for parameterized evaluations:

fn render_template(template: &str, data: &serde_json::Value) -> String {
    let re = Regex::new(r"\{\{\s*(\w+)\s*\}\}").unwrap();
    re.replace_all(template, |caps: &regex::Captures| {
        let key = &caps[1];
        data.get(key)
            .and_then(|v| v.as_str())
            .map(|s| s.to_string())
            .unwrap_or_else(|| caps[0].to_string())
    }).to_string()
}

This allows you to define test cases with variables like "What is the capital of {{country}}?" and pass in metadata to substitute values. It's not a full templating engine, but it's sufficient for the use case and keeps dependencies minimal.
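
A quick usage sketch of the function above, using the example prompt from the text:

// Substitutes {{country}} from a JSON metadata object.
let data = serde_json::json!({ "country": "France" });
let prompt = render_template("What is the capital of {{country}}?", &data);
assert_eq!(prompt, "What is the capital of France?");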

The Runner: Orchestrating Evaluations

The runner module is where the interesting coordination happens. Let's look at the core evaluation flow:

pub async fn run_eval(
    config: &AppConfig,
    eval: &EvalConfig,
    client: &reqwest::Client,
) -> Result<EvalResult> {
    let rendered_eval = eval.render()?;
    let eval_start = Instant::now();
    
    // Step 1: Call the target model
    let (provider_name, model_name) = parse_model_string(&rendered_eval.model);
    let (model_output_str, latency_ms, token_usage) = match call_provider(
        config, client, &provider_name, &model_name, &rendered_eval.prompt,
    ).await {
        Ok(result) => result,
        Err(e @ EvalError::ProviderNotFound(_)) => {
            return Err(e); // Propagate configuration errors
        }
        Err(e) => {
            return Err(EvalError::ModelFailure {
                model: rendered_eval.model.clone(),
            });
        }
    };
    
    // Step 2: Run judge evaluation if configured
    let judge_result = if let (Some(expected), Some(judge_model)) =
        (&rendered_eval.expected, &rendered_eval.judge_model) {
        // ... judge evaluation logic
    };
    
    Ok(EvalResult { /* ... */ })
}

Notice the careful error handling: ProviderNotFound errors (config issues) are propagated directly, while other errors get wrapped as ModelFailure. This distinction matters for the API layer—configuration errors return 400, runtime errors return 500.
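
One natural way to encode that mapping is actix-web's ResponseError trait; here's a sketch of what it might look like (the repo's actual error-to-response code may differ):

use actix_web::{http::StatusCode, HttpResponse, ResponseError};

impl ResponseError for EvalError {
    fn status_code(&self) -> StatusCode {
        match self {
            // Configuration problems are the caller's fault.
            EvalError::ProviderNotFound(_) | EvalError::Config(_) => StatusCode::BAD_REQUEST,
            // Everything else is a runtime failure on the server side.
            _ => StatusCode::INTERNAL_SERVER_ERROR,
        }
    }

    fn error_response(&self) -> HttpResponse {
        HttpResponse::build(self.status_code()).body(self.to_string())
    }
}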

LLM-as-a-Judge Pattern

The judge evaluation is particularly sophisticated:

fn create_judge_prompt(expected: &str, actual: &str, criteria: Option<&str>) -> String {
    let base_criteria = criteria.unwrap_or(
        "The outputs should convey the same core meaning, even if phrased differently."
    );

    format!(
        r#"You are an expert evaluator comparing two text outputs.

EVALUATION CRITERIA:
{}

EXPECTED OUTPUT:
{}

ACTUAL OUTPUT:
{}

INSTRUCTIONS:
1. Carefully compare both outputs
2. Consider semantic equivalence, not just exact wording
3. Provide your verdict as the first line: "Verdict: PASS" or "Verdict: FAIL"
4. Then explain your reasoning in 2-3 sentences"#,
        base_criteria, expected, actual
    )
}

This structured prompt encourages consistent responses from the judge model. The parsing logic handles variations:

fn parse_judge_response(response: &str) -> JudgeResult {
    let response_lower = response.to_lowercase();
    
    let verdict = if response_lower.contains("verdict: pass") || 
                     (response_lower.starts_with("yes") || response_lower.contains("yes, they")) {
        JudgeVerdict::Pass
    } else if response_lower.contains("verdict: fail") || 
              (response_lower.starts_with("no") || response_lower.contains("no, they")) {
        JudgeVerdict::Fail
    } else {
        JudgeVerdict::Uncertain
    };
    // ...
}

This fuzzy matching makes the system more robust to variations in how different models format their responses.
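
A quick illustration of the matching, assuming JudgeResult exposes the verdict (its other fields are elided above):

let res = parse_judge_response("Verdict: PASS\nBoth answers name Paris as the capital.");
assert!(matches!(res.verdict, JudgeVerdict::Pass));

let res = parse_judge_response("No, they disagree about the year of the treaty.");
assert!(matches!(res.verdict, JudgeVerdict::Fail));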

Concurrent Batch Evaluation

The batch runner uses future::join_all for concurrent execution:

pub async fn run_batch_evals(
    config: &AppConfig,
    evals: Vec<EvalConfig>,
    client: &reqwest::Client,
) -> Vec<Result<EvalResult>> {
    let futures: Vec<_> = evals
        .iter()
        .map(|eval| run_eval(config, eval, client))
        .collect();

    let results = future::join_all(futures).await;
    // ...
}

This is important: join_all drives all the futures concurrently within a single task, so the I/O-bound HTTP calls overlap instead of running one after another, which can significantly speed up batch processing when you're testing multiple models or prompts. Note that it puts no cap on how many requests are in flight at once.
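
If you ever need such a cap (to respect provider rate limits, say), a bounded variant using the futures stream combinators would look like this sketch; it's an alternative, not what the repo does:

use futures::stream::{self, StreamExt};

// At most 8 evaluations in flight at once; results arrive in completion order.
let results: Vec<Result<EvalResult>> = stream::iter(&evals)
    .map(|eval| run_eval(config, eval, client))
    .buffer_unordered(8)
    .collect()
    .await;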

Database Persistence with SQLx

The database layer uses SQLx, though interestingly via the dynamic query interface (sqlx::query) rather than the compile-time checked query! macro:

pub async fn save_evaluation(pool: &SqlitePool, response: &ApiResponse) -> Result<(), sqlx::Error> {
    let (model, prompt, model_output, expected, /* ... */) = match &response.result {
        EvalResult::Success(res) => (
            Some(res.model.clone()),
            Some(res.prompt.clone()),
            Some(res.model_output.clone()),
            res.expected.clone(),
            // ... 15+ fields
        ),
        EvalResult::Error(err) => (
            None, None, None, None, /* ... all None except error_message */
        ),
    };

    sqlx::query(
        r#"INSERT INTO evaluations (id, status, model, prompt, ...) VALUES (?, ?, ?, ?, ...)"#
    )
    .bind(id)
    .bind(&status)
    .bind(&model)
    // ... many more binds
    .execute(pool)
    .await?;

    Ok(())
}

That tuple destructuring is gnarly with 15+ fields, but it's explicit about mapping the domain model to the database schema. The migration system uses SQLx's migration runner:

sqlx::migrate!("./migrations")
    .run(&pool)
    .await?;

This loads SQL files from ./migrations at compile time and runs them on startup.
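
Wiring that up at startup typically looks something like this sketch (the database URL and pool options are assumptions, not taken from the repo):

use sqlx::sqlite::SqlitePoolOptions;

// Open (or create) the SQLite database, then apply pending migrations.
let pool = SqlitePoolOptions::new()
    .max_connections(5)
    .connect("sqlite://evaluations.db?mode=rwc")
    .await?;

sqlx::migrate!("./migrations")
    .run(&pool)
    .await?;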

The API Layer: Actix-Web + WebSockets

The HTTP API is built on Actix-web, and the state management is clean:

#[derive(Clone)]
pub struct AppState {
    pub config: Arc<AppConfig>,
    pub client: Client,
    pub db_pool: Arc<Option<SqlitePool>>,
}

Wrapping the DB pool in Arc<Option<>> is interesting—it allows the app to start even if database initialization fails, though in practice you'd probably want to fail-fast if the DB isn't available.
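
In practice that means handlers have to guard on the optional pool before touching the database. A sketch, where load_evaluations is a hypothetical query helper rather than a function from the repo:

use actix_web::{web, HttpResponse, Responder};

async fn history(state: web::Data<AppState>) -> impl Responder {
    match &*state.db_pool {
        // Pool available: run the query and return the rows.
        Some(pool) => match load_evaluations(pool).await {
            Ok(rows) => HttpResponse::Ok().json(rows),
            Err(_) => HttpResponse::InternalServerError().finish(),
        },
        // No pool: the app started without a database.
        None => HttpResponse::ServiceUnavailable().body("database not configured"),
    }
}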

WebSocket Broadcasting

The WebSocket implementation uses the Actor model via actix:

pub struct WsBroker {
    clients: Arc<RwLock<Vec<Addr<WsConnection>>>>,
}

impl WsBroker {
    pub async fn broadcast(&self, msg: EvalUpdate) {
        let clients = self.clients.read().await;
        for client in clients.iter() {
            client.do_send(msg.clone());
        }
    }
}

The Addr<WsConnection> is an actor address—Actix uses a message-passing model where you send messages to addresses rather than calling methods directly. The RwLock allows multiple readers (broadcasts can happen concurrently) but exclusive writes (when adding/removing clients).
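
The register/unregister side takes the write half of that lock; the method bodies aren't shown in the review, but they presumably look close to this sketch:

impl WsBroker {
    pub async fn register(&self, addr: Addr<WsConnection>) {
        // Exclusive access while mutating the client list.
        self.clients.write().await.push(addr);
    }

    pub async fn unregister(&self, addr: &Addr<WsConnection>) {
        // Drop the disconnected client's address.
        self.clients.write().await.retain(|c| c != addr);
    }
}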

The connection lifecycle is managed through actor hooks:

impl Actor for WsConnection {
    type Context = ws::WebsocketContext<Self>;

    fn started(&mut self, ctx: &mut Self::Context) {
        let addr = ctx.address();
        let broker = self.broker.clone();
        actix::spawn(async move {
            broker.register(addr).await;
        });
    }

    fn stopped(&mut self, ctx: &mut Self::Context) {
        // Unregister when connection closes
    }
}

This ensures connections are automatically cleaned up when clients disconnect.

Error Handling Strategy

The error types are defined using thiserror, which generates nice Display implementations:

#[derive(Error, Debug)]
pub enum EvalError {
    #[error("Failed to read file: {0}")]
    FileRead(#[from] std::io::Error),

    #[error("API request failed with status {status}: {body}")]
    ApiError { status: u16, body: String },

    #[error("Provider '{0}' not found")]
    ProviderNotFound(String),

    #[error("Judge model '{model}' failed: {source}")]
    JudgeFailure {
        model: String,
        #[source]
        source: Box<EvalError>,
    },
    // ...
}

The #[from] attribute automatically implements From<std::io::Error> for EvalError, enabling the ? operator. The JudgeFailure variant uses Box<EvalError> for recursive error types, and #[source] integrates with the error chaining system.
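
As a small illustration of what #[from] buys you (a sketch; read_eval_file is not a function from the repo):

fn read_eval_file(path: &str) -> Result<String, EvalError> {
    // `?` converts std::io::Error into EvalError::FileRead via the generated From impl.
    let contents = std::fs::read_to_string(path)?;
    Ok(contents)
}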

Static Asset Serving

Finally, there's embedded static assets via rust-embed:

#[derive(RustEmbed)]
#[folder = "static/"]
struct StaticAssets;

async fn static_file_handler(req: HttpRequest) -> impl Responder {
    let path = if req.path() == "/" {
        "index.html"
    } else {
        &req.path()[1..]
    };

    match StaticAssets::get(path) {
        Some(content) => {
            let mime = mime_guess::from_path(path).first_or_octet_stream();
            HttpResponse::Ok().content_type(mime.as_ref()).body(Cow::into_owned(content.data))
        }
        None => HttpResponse::NotFound().body("404 Not Found"),
    }
}

This compiles the entire frontend into the binary, making deployment a single executable. The Cow::into_owned converts from borrowed to owned data for the response body.
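
Hooking that handler up as the catch-all route would look roughly like this (a sketch of the server wiring; the bind address and route layout are assumptions):

use actix_web::{web, App, HttpServer};

HttpServer::new(|| {
    App::new()
        // REST and WebSocket routes would be registered here...
        .default_service(web::route().to(static_file_handler))
})
.bind(("0.0.0.0", 8080))?
.run()
.await?;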

Conclusion

This codebase demonstrates several Rust best practices: trait-based abstraction for providers, comprehensive error handling with thiserror, async concurrency with tokio, type-safe database queries with SQLx, and actor-based WebSocket management. The "LLM-as-a-judge" pattern is particularly clever for automated evaluation workflows. Overall, it's a solid example of how to structure a production Rust web service with multiple external integrations.
