ai-evaluation-framework

Here are 8 public repositories matching this topic...

firstlinesoftware / eval-ai-library

Comprehensive AI Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.

ai-evaluation llm-evaluation ai-evaluation-tools ai-evaluation-metrics aieval ai-evaluation-framework

Updated Dec 10, 2025
Python
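
The "Generalized Power Mean" named in the description above is a standard way to combine several scores into one, with an exponent that slides the result between the minimum, the average, and the maximum. The following is a minimal sketch of that general idea only, not the library's implementation: the function name `power_mean`, the sample `verdicts`, and the choice to pass the exponent `p` directly are all assumptions, since the description does not say how a temperature is mapped to the exponent.

```python
import math
from typing import Sequence


def power_mean(scores: Sequence[float], p: float) -> float:
    """Generalized power mean of strictly positive scores with exponent p.

    Hypothetical helper for illustration; not taken from any listed repository.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s <= 0 for s in scores):
        # This sketch avoids zero and negative inputs to keep every branch defined.
        raise ValueError("scores must be strictly positive")
    if p == 0:
        # The limit p -> 0 is the geometric mean.
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    if math.isinf(p):
        # The limits p -> +inf and p -> -inf are the maximum and minimum.
        return max(scores) if p > 0 else min(scores)
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)


# Hypothetical per-verdict scores in (0, 1].
verdicts = [0.9, 0.8, 0.2]
for p in (-math.inf, -1.0, 0.0, 1.0, math.inf):
    print(f"p={p:>5}: {power_mean(verdicts, p):.3f}")
```

Lower exponents let a single weak verdict dominate the aggregate (a strict judge), while higher exponents reward the strongest verdict (a lenient judge); `p = 1` is the plain arithmetic mean.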

SS47816 / AGI-Elo

[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?

benchmark leaderboard agi imagenet coco artificial-general-intelligence datasets evaluation-metrics elo-rating rating-system evaluation-framework sota ai-benchmarks waymo-open-dataset mmlu vision-language-action ai-evaluation-framework livecodebench navsim

Updated Oct 28, 2025
Python

meshkovQA / Eval-ai-library

Comprehensive AI Model Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.

ai-evaluation llm-evaluation ai-evaluation-tools ai-evaluation-metrics aieval ai-evaluation-framework

Updated Dec 10, 2025
Python

lalitkpal / VerifyAI

VerifyAI is a simple UI application for testing GenAI outputs.

ai-evaluation llm generative-ai genai llm-test llm-evaluation llm-evaluation-framework llm-evaluation-metrics llm-testing ai-metrics ai-evaluation-framework generative-ai-evaluation

Updated Sep 5, 2025
Python

mbayers6370 / ALIGN-framework

Multi-dimensional evaluation of AI responses using semantic alignment, conversational flow, and engagement metrics.

human-in-the-loop emotional-analysis contextual-ai llm-evaluation emotional-alignment ai-evaluation-framework

Updated Oct 29, 2025
Python

alyssadata / Driftmap-Public-Harness_llm-eval-harness-lite

Public Driftmap harness: public-safe CSV suites + rubrics + run logs for drift detection, refusal integrity, injection resistance, and uncertainty tracking.

benchmark-framework ai-framework ai-safety drift-detection ai-agent ai-evaluation red-teaming-tools ai-agents-framework llm-evaluation refusal llm-evaluation-framework ai-agent-tools ai-evaluation-framework

Updated Jan 1, 2026
Python

PabloCabaleiro / pondera

Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.

python ai agents model-agnostic ai-evaluation llms llm-evaluation llm-evaluation-framework llm-judge agent-evaluation ai-evaluation-framework rubric-based-evaluation yaml-first

Updated Oct 23, 2025
Python

ZhaoJackson / PsyChat

Clinical trial application for mental health benchmark evaluation of AI responses in multi-turn conversations. Guides users to understand AI interaction patterns and resolve personal mental health issues through therapeutic AI assistance.

sentiment-analysis meteor clinical-trials rouge mental-health bleu-score ethical streamlit bert-fine-tuning azure-openai ai-evaluation-framework benchmark-evaluation-llms multi-turn-conversations

Updated Oct 23, 2025
Python
