BEAM measures long-term conversational memory at scale (100K–10M tokens). LongMemEval measures realistic agent-memory workloads. GraphMemBench v2 measures what κ-aware topology actually contributes. PRISM ties them together — a self-improving evaluation loop that uses the other three as ground truth, scores across 9 CL dimensions, and evolves its own scenarios.
ICLR 2026 "Beyond a Million Tokens" benchmark: 2,000 questions across 10 memory abilities, 4 token-scale tiers (100K–10M), and 100 multi-session conversations. Graphonomous scores 95.0% on the 100K tier, a +21.6pp lead over the Hindsight SOTA.
ICLR 2025 long-term memory benchmark: 500 questions across knowledge-update, temporal-reasoning, abstention, single-session, and multi-session tracks. Graphonomous runs it fully locally with a 500M-parameter embedder.
Purpose-built synthetic benchmark for measuring κ-topology's contribution to retrieval. Its eight tiers range from κ=0 controls and simple cycles through dense multi-SCCs, contradiction resolution, mixed-κ discrimination, and evidence-path tracing, up to causal-DAG ordering.
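Since the benchmark is deterministic and synthetic, tier construction and the κ on/off signal can be sketched in a few lines. This is a hypothetical illustration, not GraphMemBench's actual generator: `make_cycle`, `make_dag`, and `has_cycle` are names invented here. A κ=0 control tier is a DAG; a cycle tier contains at least one strongly connected component larger than a single node, which Kahn's algorithm detects as nodes that never reach in-degree zero.

```python
# Hypothetical sketch of GraphMemBench-style tier construction; the real
# generator and its APIs are not shown in this document.

def make_cycle(n):
    """Simple-cycle tier: node i points to (i + 1) mod n."""
    return {i: [(i + 1) % n] for i in range(n)}

def make_dag(n):
    """κ=0 control tier: edges only go forward, so no cycles exist."""
    return {i: ([i + 1] if i < n - 1 else []) for i in range(n)}

def has_cycle(graph):
    """Kahn's algorithm: if a topological order covers every node,
    the graph is acyclic; leftover nodes imply a cycle."""
    indeg = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            indeg[w] += 1
    queue = [v for v, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for w in graph[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return seen < len(graph)
```

The same check scales to the dense multi-SCC tiers: any graph where `has_cycle` returns `True` has nonzero cyclic structure for the κ machinery to exploit.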
Protocol for Rating Iterative System Memory. The benchmark that benchmarks itself: 9 CL dimensions, 3-layer judging, item-response-theory (IRT) difficulty calibration, and scenario evolution. We ran 6 cycles: scores climbed 0.10 → 0.76 → 0.95 → 0.99, followed by an adversarial crash, a fix, and a generalization test. Uses the other three benchmarks as ground-truth inputs.
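The PRISM cycle score described above reduces to two small pieces: a weighted mean over the 9 CL dimensions and a 2PL IRT curve for difficulty calibration. A minimal sketch follows; the dimension names and equal weights here are assumptions for illustration, as the document does not list them.

```python
import math

# Assumed placeholder names and equal weights for the 9 CL dimensions;
# the real PRISM dimension set is not given in this document.
DIMENSIONS = {
    "retention": 1.0, "retrieval": 1.0, "update": 1.0,
    "temporal": 1.0, "abstention": 1.0, "consistency": 1.0,
    "transfer": 1.0, "robustness": 1.0, "efficiency": 1.0,
}

def weighted_total(scores, weights=DIMENSIONS):
    """Weighted mean of per-dimension scores in [0, 1]."""
    total_w = sum(weights.values())
    return sum(weights[d] * scores[d] for d in weights) / total_w

def p_correct_2pl(ability, difficulty, discrimination=1.0):
    """Standard two-parameter logistic (2PL) IRT model:
    P(correct) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
```

Under 2PL calibration, a scenario whose empirical `difficulty` sits near the system's current `ability` is the most informative to keep when evolving the scenario pool, since P(correct) is near 0.5 there.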
BEAM tells us whether Graphonomous can handle long-context conversational memory at scale (100K–10M tokens) and beat published systems. LongMemEval validates realistic agent-memory workloads. GraphMemBench tiers T1–T6 tell us whether the κ-topology machinery does real work. PRISM ties them together: it scores across 9 CL dimensions, evolves its own scenarios, and feeds results back into Graphonomous, so each cycle sharpens both the benchmark and the memory.
| Property | BEAM | LongMemEval | GraphMemBench v2 | PRISM |
|---|---|---|---|---|
| Source | ICLR 2026 public dataset | ICLR 2025 public dataset | Deterministic synthetic | Self-evolving (OS-009) |
| Scale | 100K–10M tokens, 100 conversations | Fixed corpus, 500 questions | 50–500 per tier (configurable) | 15+ scenarios, grows each cycle |
| Abilities tested | 10 (IE, MR, KU, TR, ABS, CR, EO, IF, PF, SUM) | 5 tracks | 8 tiers + 5 difficulty knobs | 9 CL dimensions + IRT calibration |
| Measures κ-topology | No | No | Yes (topology on/off A/B) | Yes (via inner loop) |
| Domain content | Multi-session dialogue (20 topics) | Realistic dialogue | Synthetic cycles + DAGs | Code, business, research, ops |
| Validation gate | 95.0% nugget score (100K) | 92.6% overall QA | T1–T6: ≥3pp κ-delta; T7–T8: algorithm baselines | Weighted total + loop closure rate |
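The GraphMemBench validation gate in the last row can be stated as a one-line check: a tier in T1–T6 passes only if turning κ-topology on beats the topology-off run by at least 3 percentage points. A minimal sketch, with the function name invented here:

```python
# Hypothetical helper for the T1-T6 gate in the table above: both inputs
# are accuracies in percent from the topology on/off A/B runs.
def kappa_delta_gate(acc_topology_on, acc_topology_off, min_delta_pp=3.0):
    """Return (delta in percentage points, whether the gate passes)."""
    delta = acc_topology_on - acc_topology_off
    return delta, delta >= min_delta_pp
```

T7–T8 instead compare against classical algorithm baselines (e.g. plain DAG ordering), so the ≥3pp κ-delta threshold does not apply there.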