Community benchmark database for running LLMs on Apple Silicon Macs
Updated Apr 22, 2026 - Shell
[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
An Open Benchmark for AI in Cybersecurity Operations
ICRTL Benchmark: Industrial-level RTL design challenges for evaluating PPA optimization, code generation, and LLM applications in EDA.
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
The AI-native wire format for structured data. 100% comprehension on every frontier model. 50-92% fewer tokens than JSON. 43B+ lossless round-trips across 17 formats. Spec v3.2 Stable.
Open LLM benchmark: which neural network writes 1C:Enterprise (BSL) code best. Objective evaluation of LLMs using the SMOP methodology with real execution in 1C: Claude, GPT, Gemini, DeepSeek, YandexGPT, GigaChat.
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
An agent evaluation framework with native multi-turn feedback iteration.
[ICML 2026] CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming
Open Spanish-language benchmark of 141 LLMs (89 with 13K+ real runs and an independent Phi-4 judge). Quality, cost, speed, long-context, and credential leakage as separate dimensions. Alternatives to Claude, GPT, and Gemini for n8n/OpenClaw agents. Interactive calculator with your own weights.
The open-source benchmark for LLM memory decay. Measure how Naive, RAG, Chunked RAG, Cascading, and SummaryMemory degrade over 100 conversation turns. Ebbinghaus forgetting curves, 5-provider LLM eval, multi-seed CI. No API key needed.
Daily LLM value rankings - compare 300+ models by intelligence, speed and price. OpenRouter + Artificial Analysis. LLM price-performance leaderboard.
Testing how well LLMs can solve jigsaw puzzles
Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.
Self-hosted LLM API benchmark, monitoring & playground. Compare latency, TTFT, throughput across OpenAI, Anthropic, Gemini & any OpenAI-compatible endpoint. Deploy with one command via Docker. | Self-hosted LLM API performance testing, monitoring, and debugging platform with one-command Docker deployment and multi-provider comparison.
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Open-source multi-agent AI debate arena: pit Claude, GPT, Gemini, Ollama & HuggingFace models against each other with frozen-context fairness, evidence-first judging, 20+ personas, code review, and PDF/Markdown reports. CLI + Web UI.
RetardBench is an open, no-censorship benchmark that ranks large language models purely on how retarded they are.
Local LLM benchmarking