Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.
How does a model behave when nobody told it what to do? This protocol observes LLM defaults before asking about preferences, then packages the findings into a reusable profile. Works on local Ollama models and cloud APIs alike.
Artifacts for arXiv:2606.28430. Task spec, prompts, 18-run agent corpus, and a deterministic audit tool from a study showing two production LLM coding agents (Copilot CLI · claude-opus-4.7, gpt-5.5) score near-perfect on a hidden 222-test oracle while leaving the requested library dead or absent.
Production-grade LLM evaluation framework that measures model behavior across 5 dimensions and validates human-vs-LLM judge agreement using Cohen's kappa.
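For readers unfamiliar with the agreement statistic named above, here is a minimal sketch of Cohen's kappa for two raters, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohens_kappa` and the sample labels are hypothetical illustrations and are not taken from the framework itself, whose actual implementation is not shown here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    Hypothetical helper for illustration; not the framework's own API.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of each rater's label frequency.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # If both raters always use one identical label, kappa is undefined; treat as full agreement.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Made-up verdicts from a human judge and an LLM judge on four responses.
human_verdicts = ["pass", "pass", "fail", "fail"]
llm_verdicts = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human_verdicts, llm_verdicts)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the judges agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5. A value near 1 indicates agreement well beyond chance, and a value near 0 indicates agreement no better than chance.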