Benchmark Suite

Four benchmarks.
One self-improving loop.

BEAM measures long-term conversational memory at scale (100K–10M tokens). LongMemEval measures realistic agent-memory workloads. GraphMemBench v2 measures what κ-aware topology actually contributes. PRISM ties them together — a self-improving evaluation loop that uses the other three as ground truth, scores across 9 CL dimensions, and evolves its own scenarios.

Public benchmark

BEAM

ICLR 2026 "Beyond a Million Tokens" benchmark: 2,000 questions across 10 memory abilities, 4 token-scale tiers (100K–10M), and 100 multi-session conversations. Graphonomous scores 95.0% on the 100K tier, a +21.6pp lead over the Hindsight SOTA.

95.0% Nugget Score · +21.6pp vs SOTA · 10 Abilities · View results →
Public benchmark

LongMemEval

ICLR 2025 long-term memory benchmark: 500 questions across knowledge-update, temporal-reasoning, abstention, single-session, and multi-session tracks. Graphonomous runs it fully locally with a 500M-parameter embedder.

92.6% QA Accuracy · 500 Questions · 5 Tracks · View results →
Synthetic · κ-sensitive

GraphMemBench v2

Purpose-built synthetic benchmark for measuring κ-topology's contribution to retrieval. Eight tiers span κ=0 controls, simple cycles, dense multi-SCCs, contradiction resolution, mixed-κ discrimination, evidence path tracing, and causal DAG ordering.

8 Tiers · 5 Difficulty Knobs · +100pp κ-delta gate · View tiers →
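The κ-delta gate above (and the topology on/off A/B in the comparison table) can be sketched as a simple paired comparison: run each tier with graph topology enabled and disabled, and require the enabled run to lead by at least the gate threshold. This is a minimal sketch under stated assumptions; the function and field names are illustrative, not GraphMemBench's actual API, and the scores below are made-up placeholders.

```python
def kappa_delta_gate(scores_on, scores_off, threshold_pp=3.0):
    """Compare per-tier scores with topology on vs. off.

    Returns (deltas, passed): the per-tier delta in percentage points,
    and whether every tier clears the threshold (3pp for T1-T6 per the
    validation gate described in the text).
    """
    deltas = {tier: scores_on[tier] - scores_off[tier] for tier in scores_on}
    passed = all(d >= threshold_pp for d in deltas.values())
    return deltas, passed

# Placeholder scores for three tiers (illustrative only).
scores_on = {"T1": 98.0, "T2": 91.5, "T3": 87.0}
scores_off = {"T1": 90.0, "T2": 84.0, "T3": 61.0}

deltas, passed = kappa_delta_gate(scores_on, scores_off)
# deltas: T1 +8.0pp, T2 +7.5pp, T3 +26.0pp -> gate passes
```

The point of the A/B design is attribution: a tier only counts as κ-sensitive if removing topology measurably hurts it, which rules out wins that come from the embedder alone.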
Self-improving · OS-009

PRISM

Protocol for Rating Iterative System Memory. The benchmark that benchmarks itself. 9 CL dimensions, 3-layer judging, IRT difficulty calibration, scenario evolution. We ran 6 cycles: 0.10 → 0.76 → 0.95 → 0.99 → adversarial crash → fix → generalization test. Uses the other three benchmarks as ground truth inputs.

9 CL Dimensions · 6 Cycles Run · 3-Layer Judging · Explore PRISM →
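PRISM's IRT difficulty calibration can be illustrated with the standard two-parameter logistic (2PL) item response model. The source names IRT but not the exact model, so the 2PL form below is an assumption for illustration, not PRISM's confirmed method; `theta`, `difficulty`, and `discrimination` are the conventional IRT parameter names.

```python
import math

def p_correct(theta, difficulty, discrimination=1.0):
    """2PL item response model: probability that a system with ability
    `theta` answers an item of the given `difficulty` correctly.
    Higher `discrimination` makes the item separate abilities more sharply.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# A strong system (theta=2.0) against a hard item (difficulty=1.0):
p = p_correct(theta=2.0, difficulty=1.0)  # ~0.73
```

Under a model like this, calibration means fitting each scenario's difficulty from observed pass/fail data across cycles, so evolved scenarios can be placed on the same difficulty scale as the originals.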

Scale + realism + causality + self-improvement

BEAM tells us whether Graphonomous can handle long-context conversational memory at scale (100K–10M tokens) and beat published systems. LongMemEval validates realistic agent-memory workloads. GraphMemBench T1–T6 tell us whether the κ-topology machinery is doing real work. PRISM ties them together: it scores across 9 CL dimensions, evolves its own scenarios, and feeds results back into Graphonomous — so each cycle makes both the benchmark and the memory sharper.

| Property | BEAM | LongMemEval | GraphMemBench v2 | PRISM |
| --- | --- | --- | --- | --- |
| Source | ICLR 2026 public dataset | ICLR 2025 public dataset | Deterministic synthetic | Self-evolving (OS-009) |
| Scale | 100K–10M tokens, 100 conversations | Fixed corpus, 500 questions | 50–500 per tier (configurable) | 15+ scenarios, grows each cycle |
| Abilities tested | 10 (IE, MR, KU, TR, ABS, CR, EO, IF, PF, SUM) | 5 tracks | 8 tiers + 5 difficulty knobs | 9 CL dimensions + IRT calibration |
| Measures κ-topology | No | No | Yes (topology on/off A/B) | Yes (via inner loop) |
| Domain content | Multi-session dialogue (20 topics) | Realistic dialogue | Synthetic cycles + DAGs | Code, business, research, ops |
| Validation gate | 95.0% nugget score (100K) | 92.6% overall QA | T1–T6: ≥3pp κ-delta; T7–T8: algorithm baselines | Weighted total + loop closure rate |
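The per-benchmark validation gates can be expressed as a small checklist. This is a hedged sketch: the thresholds (95.0%, 92.6%, 3pp) come from the text, but the dictionary keys, result fields, and `check_gates` helper are hypothetical names invented for illustration.

```python
# Gate thresholds from the suite; each gate maps a benchmark name to a
# predicate over that benchmark's (hypothetical) result record.
GATES = {
    "BEAM": lambda r: r["nugget_score_100k"] >= 95.0,
    "LongMemEval": lambda r: r["qa_accuracy"] >= 92.6,
    "GraphMemBench": lambda r: r["min_kappa_delta_pp_t1_t6"] >= 3.0,
}

def check_gates(results):
    """Return {benchmark: True/False} for each validation gate."""
    return {name: gate(results[name]) for name, gate in GATES.items()}

# Placeholder results at exactly the published gate values.
results = {
    "BEAM": {"nugget_score_100k": 95.0},
    "LongMemEval": {"qa_accuracy": 92.6},
    "GraphMemBench": {"min_kappa_delta_pp_t1_t6": 3.2},
}
status = check_gates(results)  # all three gates pass
```

Treating the gates as code makes the suite's contract explicit: a release candidate must clear every benchmark's threshold, not just improve an aggregate.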