# behavioral-testing

Here are 26 public repositories matching this topic...

Basaltlabs-app / Gauntlet

Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.

benchmark mcp community-driven model-evaluation ai-evaluation llm ollama sycophancy hallucination-detection llm-testing hardware-benchmark ai-trust trust-scoring behavioral-testing llm-benchmark deterministic-scoring

Updated May 4, 2026
Python

qualixar / agentassert-abc

Formal behavioral specification and runtime enforcement for autonomous AI agents. Agent Behavioral Contracts (ABC).

formal-verification ai-agents drift-detection behavioral-testing agent-reliability qualixar agent-contracts

Updated May 24, 2026
Python

stef41 / modeldiff

Behavioral regression testing for LLMs — diff, drift, fingerprint. Zero deps.

python nlp machine-learning evaluation regression-testing fingerprinting model-comparison drift-detection llm behavioral-testing

Updated Apr 10, 2026
Python

senaayy / Computational-Cognitive-Lab

python machine-learning neuroscience computational-neuroscience cognitive-science mne-python biomedical-engineering eeg-analysis stroop-test neurotechnology behavioral-testing erp-analysis

Updated Dec 12, 2025
Python

abdul-hamid-achik / cairntrace

Behavioral browser-spec layer for agent-in-session use. Specs declare intent+outcomes; agents execute + heal via agent-browser or Playwright. CLI + MCP server, agent-neutral.

typescript mcp browser-testing ai-agents bun e2e-testing playwright behavioral-testing agent-browser

Updated Jul 1, 2026
TypeScript

stef41 / modeldiffx

Model behavioral diffing - compare LLM outputs across versions, detect regressions.

python testing regression-testing model-evaluation llm behavioral-testing

Updated Apr 11, 2026
Python

GenesisClawbot / llm-drift

LLM drift detector — know within 5 min when GPT-4o, Claude, or Gemini silently changes behaviour. Open source, self-hostable.

saas gemini openai regression-testing gpt claude mlops drift-detection production-ml model-testing ai-monitoring llm llmops prompt-testing llm-monitoring llm-observability behavioral-testing

Updated Jul 1, 2026
Python

Ufosxm34gt / Conversational-Red-Teaming-Casebook

Bots I broke and how I broke them to be a future conversational Red Teamer

nlp machine-learning natural-language-processing ai chatbot transformers artificial-intelligence openai language-models ai-safety conversational-ai red-teaming ethical-ai llm prompt-engineering behavioral-testing

Updated Jul 1, 2025

tpertner / Leak

Leak™ — behavioral constraint testing for AI agents. Find your drips before the puddle forms.

evaluation ai-safety trust-and-safety ai-agent llm prompt-engineering behavioral-testing

Updated Jun 8, 2026

RLASAF12 / agent-canary

🐦 Behavioral smoke tests for deployed AI agents — probes every 15 min, alerts on drift

smoke-tests observability ai-agents deno supabase llm agent-monitoring behavioral-testing

Updated Jun 11, 2026
HTML

harman-04 / mockito-spies-and-verification-demo

Advanced Mockito usage featuring Spies, Mocks, and behavioral verification to test a shopping cart checkout flow.

mockito junit5 java-testing behavioral-testing spy-vs-mock

Updated Feb 15, 2026
Java

Tubifix77 / llm-profiler

How does a model behave when nobody told it what to do? This protocol observes LLM defaults before asking about preferences, then packages the findings into a reusable profile. Works on local Ollama models and cloud APIs alike.

python benchmarking profiling model-evaluation claude llm prompt-engineering ollama behavioral-testing

Updated Apr 29, 2026
Python

JSLEEKR / agentspec

Agent behavioral testing -- YAML specs for tool calls, sequences, constraints

cli golang yaml mcp specification developer-tools testing-framework ai-agents active-project agent-testing behavioral-testing

Updated Mar 29, 2026
Go

yanuoma / b2t

Artifacts for arXiv:2606.28430. Task spec, prompts, 18-run agent corpus, and a deterministic audit tool from a study showing two production LLM coding agents (Copilot CLI · claude-opus-4.7, gpt-5.5) score near-perfect on a hidden 222-test oracle while leaving the requested library dead or absent.

software-engineering code-generation arxiv gpt claude fluent-ui copilot-cli llm-evaluation llm-agents coding-agents behavioral-testing agent-benchmark

Updated Jun 30, 2026
TypeScript

StanislavBG / stepproof

Regression testing CLI for AI agents — define expected behaviors in YAML, run in CI, fail deploys on behavioral drift

nodejs testing cli open-source devops typescript ci-cd developer-tools regression-testing ai-agents llm ai-testing behavioral-testing

Updated Apr 6, 2026
TypeScript

SadhanaSai / behaviorprobe

Behavioral regression testing across LLMs by task type

python model-versioning prompt-testing llm-evaluation llm-benchmarking behavioral-testing model-regression

Updated Jun 2, 2026
Python

YusufMalu001 / VeritasBench

Production-grade LLM evaluation framework measuring model behavior across 5 dimensions with human-vs-LLM judge agreement validation and Cohen's Kappa scoring

python natural-language-processing cohens-kappa huggingface streamlit human-evaluation instruction-following large-language-models rlhf llm-evaluation llm-benchmarking llm-as-judge behavioral-testing refusal-calibration

Updated Jun 8, 2026
Python

RLASAF12 / model-guard

ModelGuard — Behavioral Contract Monitor for LLMs. Paste your contracts, see which break when your model silently updates.

gemini openai ai-safety claude model-monitoring llm prompt-testing behavioral-testing

Updated Jun 28, 2026
HTML

sandeep-alluru / agentdelta

Diff and regression-detect LLM agent execution traces

python diff ai mcp devtools audit tracing regression-testing agents observability ai-agents llm langchain llmops agent-observability behavioral-testing agent-debugging trace-diff

Updated Jun 25, 2026
Python

ollieb89 / ai-workflow-evals

Catch AI behavioral regressions before merge. Run eval suites for prompts, agents, and workflows in GitHub Actions.

ci-cd developer-tools regression-testing eval github-actions ai-testing prompt-testing ai-quality llm-testing behavioral-testing

Updated Mar 22, 2026
TypeScript
