BEAM Benchmark · ICLR 2026

Beyond a Million Tokens

BEAM is the ICLR 2026 benchmark for evaluating long-term conversational memory systems across 10 distinct abilities, 4 token-scale tiers, and 100 multi-session conversations. Graphonomous scores 95.0% on the 100K tier — a +21.6pp lead over Hindsight, the previous state-of-the-art — using only local retrieval with a 500M-param embedder and no LLM judge.

95.0%
Nugget Score (100K)
+21.6pp
vs Hindsight SOTA
400
Questions
10
Abilities Tested
1.7s
Mean Latency

Graphonomous vs. published baselines

All baselines are from the BEAM paper (arXiv:2510.27246). Graphonomous uses local ONNX inference (nomic-embed-text-v2-moe), BM25 + HNSW retrieval, and cross-encoder reranking. No cloud API calls, no LLM-as-judge. A minimal sketch of this hybrid stack follows the chart.

Graphonomous
95.0%
Hindsight
73.4%
Honcho
63.0%
LIGHT
35.8%
RAG baseline
32.3%
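
As a rough illustration of the stack described above, here is a minimal sketch of BM25 + HNSW retrieval with cross-encoder reranking. The libraries (rank_bm25, hnswlib, sentence-transformers), the model names, and the equal fusion weights are stand-ins chosen to make a runnable example; Graphonomous's actual pipeline runs nomic-embed-text-v2-moe through ONNX with its own tuned configuration.

```python
import hnswlib
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

# Toy memory store standing in for ingested conversation chunks.
chunks = [
    "Alice said she moved to Lisbon in March.",
    "Bob prefers decaf coffee after 4pm.",
    "Alice mentioned her new job is at a biotech startup.",
]

# Lexical signal: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([c.lower().split() for c in chunks])

# Dense signal: HNSW index over normalized embeddings (stand-in model).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = encoder.encode(chunks, normalize_embeddings=True)
index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(chunks))
index.add_items(vectors, np.arange(len(chunks)))

# Reranking signal: a cross-encoder scores (query, chunk) pairs jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, k: int = 2) -> list[str]:
    # 1) Score every chunk with both first-stage signals.
    lexical = bm25.get_scores(query.lower().split())
    labels, dists = index.knn_query(
        encoder.encode([query], normalize_embeddings=True), k=len(chunks)
    )
    dense = np.zeros(len(chunks))
    dense[labels[0]] = 1.0 - dists[0]  # cosine distance -> similarity

    # 2) Fuse the signals (equal weights, purely illustrative) into a pool.
    fused = 0.5 * lexical / (lexical.max() + 1e-9) + 0.5 * dense
    pool = [chunks[i] for i in np.argsort(fused)[::-1][: 2 * k]]

    # 3) Cross-encoder reranks the pooled candidates.
    order = np.argsort(reranker.predict([(query, c) for c in pool]))[::-1]
    return [pool[i] for i in order[:k]]

print(retrieve("Where does Alice live now?"))
```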

10 memory abilities, tested independently

BEAM measures 10 distinct capabilities. Each conversation contributes 20 probing questions balanced across abilities. Scores use nugget-based rubric matching (0 / 0.5 / 1.0); a toy sketch of the matching rule follows the list.

Abstention
100%
40 questions · ABS
Temporal Reasoning
100%
40 questions · TR
Contradiction Resolution
98.8%
40 questions · CR
Knowledge Update
97.5%
40 questions · KU
Information Extraction
95.0%
40 questions · IE
Multi-Session Reasoning
95.0%
40 questions · MR
Instruction Following
92.5%
40 questions · IF
Preference Following
91.2%
40 questions · PF
Summarization
91.2%
40 questions · SUM
Event Ordering
88.8%
40 questions · EO
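
The 0 / 0.5 / 1.0 scale is BEAM's; the matching rule below (full, partial, or no nugget coverage) is an assumed simplification for illustration, not the benchmark's official grader.

```python
def nugget_score(answer: str, nuggets: list[str]) -> float:
    """Score one answer against its question's rubric nuggets.

    Assumed rule: 1.0 if every nugget is covered, 0.5 if only some
    are, 0.0 if none are.
    """
    text = answer.lower()
    hits = sum(1 for nugget in nuggets if nugget.lower() in text)
    if hits == len(nuggets):
        return 1.0
    return 0.5 if hits > 0 else 0.0

# A Knowledge Update (KU) style question with two required nuggets.
print(nugget_score(
    "Alice moved to Lisbon and now works at a biotech startup.",
    ["Lisbon", "biotech"],
))  # -> 1.0
```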

Performance across token scales

BEAM defines 4 token tiers (100K–10M) with increasing conversation counts and complexity. This section will be updated as each tier completes.

System             100K    500K    1M      10M
Graphonomous       95.0%   96.9%   –       –
Hindsight          73.4%   71.1%   73.9%   64.1%
Honcho             63.0%   64.9%   63.1%   40.6%
LIGHT (Llama-4)    35.8%   35.9%   33.6%   26.6%
RAG baseline       32.3%   33.0%   30.7%   24.9%

Dashes indicate tiers not yet evaluated. Baseline numbers from the BEAM paper (ICLR 2026).

How the benchmark runs

BEAM Dataset

Graphonomous Pipeline

Reproduce locally

Citation

Why these scores are so high

Graphonomous beats published baselines by 20+pp. This section explains why — with data — and is transparent about what the comparison does and does not prove.

1. We measure retrieval, not generation

2. Hybrid retrieval catches what single-signal systems miss

3. Adaptive retrieval budgets per ability

4. Graph structure enables cross-session retrieval
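
To make this concrete, here is a toy sketch of entity nodes bridging chunks across sessions, using networkx. The graph schema, the sample memories, and cross_session_evidence are hypothetical illustrations, not Graphonomous's published design.

```python
import networkx as nx

g = nx.Graph()
# (chunk_id, session, text, entities) tuples; entity links are extracted
# at ingest time in this sketch.
memories = [
    ("c1", "session-1", "Alice adopted a dog named Miso.", {"alice", "miso"}),
    ("c2", "session-7", "Miso needs grain-free food.", {"miso"}),
    ("c3", "session-9", "Bob borrowed Alice's bike.", {"alice", "bob"}),
]
for cid, session, text, entities in memories:
    g.add_node(cid, session=session, text=text)
    for ent in entities:
        g.add_edge(cid, ent)  # entity nodes bridge sessions

def cross_session_evidence(query_entities: set[str]) -> list[str]:
    """Collect every chunk reachable through a query entity, in any session."""
    chunk_ids = {
        nbr
        for ent in query_entities if ent in g
        for nbr in g.neighbors(ent)
    }
    return [g.nodes[c]["text"] for c in sorted(chunk_ids)]

# "What does Alice's dog eat?" resolves to the entity "miso", which bridges
# session-1 and session-7 even though no single session holds both facts.
print(cross_session_evidence({"miso"}))
```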

5. Retrieval selectivity improves at scale

6. Rubric-first scoring eliminates generation noise

What this proves and what it doesn't

Why Graphonomous scores 95–97% on BEAM

An honest breakdown of what drives these results, what the scores actually measure, and the key methodological caveat in comparing against the published baselines.

Methodological caveat: retrieval recall vs end-to-end generation

1. Triple-signal hybrid retrieval

2. Entity and fact extraction feeds BM25
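
A sketch of the idea, assuming extraction runs at ingest time and its outputs are simply appended to each chunk's BM25 token stream. extract_facts is a hard-coded hypothetical placeholder for a real extractor.

```python
from rank_bm25 import BM25Okapi

def extract_facts(chunk: str) -> list[str]:
    # Hypothetical placeholder: a real extractor resolves pronouns and emits
    # normalized entities/facts that never appear verbatim in the text.
    lookup = {
        "She finally made the move last spring.": ["alice", "lisbon", "relocation"],
        "Bob fixed the squeaky brakes.": ["bob", "bike", "repair"],
    }
    return lookup[chunk]

chunks = ["She finally made the move last spring.", "Bob fixed the squeaky brakes."]
docs = [c.lower().split() + extract_facts(c) for c in chunks]
bm25 = BM25Okapi(docs)

# "alice lisbon" now matches the first chunk even though neither token
# appears in its surface text -- the extracted entities carry the signal.
print(bm25.get_scores(["alice", "lisbon"]))
```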

3. Per-ability adaptive retrieval budgets
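
A minimal sketch of what per-ability budgets could look like. The ability codes are BEAM's; the k values and the default are invented for illustration and are not Graphonomous's tuned settings.

```python
# Hypothetical per-ability retrieval budgets (top-k chunks per question).
RETRIEVAL_BUDGET = {
    "ABS": 4,   # Abstention: small pool, so missing evidence stays missing
    "TR":  12,  # Temporal Reasoning: wider pool to order scattered events
    "CR":  10,  # Contradiction Resolution
    "KU":  8,   # Knowledge Update
    "IE":  6,   # Information Extraction
    "MR":  12,  # Multi-Session Reasoning: evidence spans sessions
    "IF":  6,   # Instruction Following
    "PF":  8,   # Preference Following
    "SUM": 16,  # Summarization: needs broad coverage
    "EO":  12,  # Event Ordering
}

def top_k_for(ability: str) -> int:
    return RETRIEVAL_BUDGET.get(ability, 8)  # fallback budget

assert top_k_for("SUM") > top_k_for("ABS")
```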

4. Rubric-based scoring matches BEAM's design intent

5. Perfect abstention from rubric design
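
One plausible mechanism, sketched under assumptions: if no retrieved chunk clears a relevance threshold, the system abstains rather than guessing. The threshold value and the decision rule here are illustrative, not the actual implementation.

```python
def answer_or_abstain(query: str, scored: list[tuple[str, float]],
                      threshold: float = 0.35) -> str:
    """Return the best-scoring chunk, or abstain when nothing clears the bar."""
    best_chunk, best_score = max(scored, key=lambda pair: pair[1],
                                 default=("", 0.0))
    if best_score < threshold:
        # ABS questions probe facts never stated in the conversation; a weak
        # top score is treated as "nothing to retrieve".
        return "I don't have that information."
    return best_chunk

print(answer_or_abstain("What is Alice's blood type?",
                        [("Alice moved to Lisbon.", 0.12)]))
```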

6. Performance improves with scale (100K → 128K → 500K)

Bottom line