BEAM measures long-term conversational memory at scale (100K–10M tokens). LongMemEval measures realistic agent-memory workloads. GraphMemBench v2 measures what κ-aware topology actually contributes. PRISM ties them together — a self-improving evaluation loop that uses the other three as ground truth, scores across 9 CL dimensions, and evolves its own scenarios.
ICLR 2026 "Beyond a Million Tokens" benchmark: 2,000 questions across 10 memory abilities, 4 token-scale tiers (100K–10M), and 100 multi-session conversations. Graphonomous scores 95.0% on the 100K tier, a +21.6pp lead over the Hindsight SOTA.
ICLR 2025 long-term memory benchmark: 500 questions across knowledge-update, temporal-reasoning, abstention, single-session, and multi-session tracks. Graphonomous runs it fully locally with a 500M-parameter embedder.
Purpose-built synthetic benchmark for measuring κ-topology's contribution to retrieval. Its eight tiers range from κ=0 controls and simple cycles through dense multi-SCCs, contradiction resolution, mixed-κ discrimination, and evidence-path tracing, up to causal-DAG ordering.
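Since the benchmark is deterministic and synthetic, tier construction and the κ on/off signal can be sketched in a few lines. This is a hypothetical illustration, not GraphMemBench's actual generator: `make_cycle`, `make_dag`, and `has_cycle` are names invented here. A κ=0 control tier is a DAG; a cycle tier contains at least one strongly connected component larger than a single node, which Kahn's algorithm detects as nodes that never reach in-degree zero.

```python
# Hypothetical sketch of GraphMemBench-style tier construction; the real
# generator and its APIs are not shown in this document.

def make_cycle(n):
    """Simple-cycle tier: node i points to (i + 1) mod n."""
    return {i: [(i + 1) % n] for i in range(n)}

def make_dag(n):
    """κ=0 control tier: edges only go forward, so no cycles exist."""
    return {i: ([i + 1] if i < n - 1 else []) for i in range(n)}

def has_cycle(graph):
    """Kahn's algorithm: if a topological order covers every node,
    the graph is acyclic; leftover nodes imply a cycle."""
    indeg = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            indeg[w] += 1
    queue = [v for v, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for w in graph[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return seen < len(graph)
```

The same check scales to the dense multi-SCC tiers: any graph where `has_cycle` returns `True` has nonzero cyclic structure for the κ machinery to exploit.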
Protocol for Rating Iterative System Memory. The benchmark that benchmarks itself: 9 CL dimensions, 3-layer judging, item-response-theory (IRT) difficulty calibration, and scenario evolution. We ran 6 cycles: scores climbed 0.10 → 0.76 → 0.95 → 0.99, followed by an adversarial crash, a fix, and a generalization test. Uses the other three benchmarks as ground-truth inputs.
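The PRISM cycle score described above reduces to two small pieces: a weighted mean over the 9 CL dimensions and a 2PL IRT curve for difficulty calibration. A minimal sketch follows; the dimension names and equal weights here are assumptions for illustration, as the document does not list them.

```python
import math

# Assumed placeholder names and equal weights for the 9 CL dimensions;
# the real PRISM dimension set is not given in this document.
DIMENSIONS = {
    "retention": 1.0, "retrieval": 1.0, "update": 1.0,
    "temporal": 1.0, "abstention": 1.0, "consistency": 1.0,
    "transfer": 1.0, "robustness": 1.0, "efficiency": 1.0,
}

def weighted_total(scores, weights=DIMENSIONS):
    """Weighted mean of per-dimension scores in [0, 1]."""
    total_w = sum(weights.values())
    return sum(weights[d] * scores[d] for d in weights) / total_w

def p_correct_2pl(ability, difficulty, discrimination=1.0):
    """Standard two-parameter logistic (2PL) IRT model:
    P(correct) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
```

Under 2PL calibration, a scenario whose empirical `difficulty` sits near the system's current `ability` is the most informative to keep when evolving the scenario pool, since P(correct) is near 0.5 there.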
BEAM tells us whether Graphonomous can handle long-context conversational memory at scale (100K–10M tokens) and beat published systems. LongMemEval validates realistic agent-memory workloads. GraphMemBench tiers T1–T6 tell us whether the κ-topology machinery does real work. PRISM ties them together: it scores across 9 CL dimensions, evolves its own scenarios, and feeds results back into Graphonomous, so each cycle sharpens both the benchmark and the memory.
| Property | BEAM | LongMemEval | GraphMemBench v2 | PRISM |
|---|---|---|---|---|
| Source | ICLR 2026 public dataset | ICLR 2025 public dataset | Deterministic synthetic | Self-evolving (OS-009) |
| Scale | 100K–10M tokens, 100 conversations | Fixed corpus, 500 questions | 50–500 per tier (configurable) | 15+ scenarios, grows each cycle |
| Abilities tested | 10 (IE, MR, KU, TR, ABS, CR, EO, IF, PF, SUM) | 5 tracks | 8 tiers + 5 difficulty knobs | 9 CL dimensions + IRT calibration |
| Measures κ-topology | No | No | Yes (topology on/off A/B) | Yes (via inner loop) |
| Domain content | Multi-session dialogue (20 topics) | Realistic dialogue | Synthetic cycles + DAGs | Code, business, research, ops |
| Validation gate | 95.0% nugget score (100K) | 92.6% overall QA | T1–T6: ≥3pp κ-delta; T7–T8: algorithm baselines | Weighted total + loop closure rate |
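The GraphMemBench validation gate in the last row can be stated as a one-line check: a tier in T1–T6 passes only if turning κ-topology on beats the topology-off run by at least 3 percentage points. A minimal sketch, with the function name invented here:

```python
# Hypothetical helper for the T1-T6 gate in the table above: both inputs
# are accuracies in percent from the topology on/off A/B runs.
def kappa_delta_gate(acc_topology_on, acc_topology_off, min_delta_pp=3.0):
    """Return (delta in percentage points, whether the gate passes)."""
    delta = acc_topology_on - acc_topology_off
    return delta, delta >= min_delta_pp
```

T7–T8 instead compare against classical algorithm baselines (e.g. plain DAG ordering), so the ≥3pp κ-delta threshold does not apply there.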