Learning to Discover at Test Time

Keep learning during testing to beat the best known solution.

A high-school friendly walkthrough of what the model does, why it works, and where it helped.

Learning to Discover at Test Time

Test-time RL optimized for the single best solution, not the average reward.

Graduate-level summary of the formal setup, objective, reuse strategy, and reported results.

Test-Time Training · Reinforcement Learning · Search + Reuse

Paper facts

  • Title: Learning to Discover at Test Time
  • Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
  • Submitted: 22 Jan 2026 (v1)
  • Institutions: Stanford, NVIDIA, Astera Institute, UC San Diego, Together AI

Problem setup (plain language)

  • Problem: a text description of a task (math, code, biology).
  • State: a candidate solution (code, kernel, or mathematical object).
  • Reward: a score you can compute (speed, accuracy, bound quality).
  • Discovery: find a solution that beats the current best score.

Problem setup: discovery as an MDP

  • Problem description d: task text fed to the policy.
  • State s: candidate solution (e.g., kernel code or step function).
  • Reward R(s): continuous score (e.g., 1/runtime, 1/MSE).
  • Best-known: s_sota with r_sota = R(s_sota).
  • Discovery: any s with R(s) > r_sota.

Rewards are continuous; invalid solutions get 0.
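
A minimal Python sketch of these objects, under assumptions: `DiscoveryTask`, `evaluate`, and `r_sota` are hypothetical names for illustration, not the paper's code. The task bundles the description d, an evaluator R, and the best-known score; the discovery check is simply R(s) > r_sota.

```python
# Illustrative sketch only; names are hypothetical, not the paper's interfaces.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DiscoveryTask:
    description: str                  # problem text d given to the policy
    evaluate: Callable[[str], float]  # R(s): continuous score, 0.0 for invalid solutions
    r_sota: float                     # reward of the best-known solution s_sota

    def is_discovery(self, s: str) -> bool:
        # A discovery is any state s with R(s) > r_sota.
        return self.evaluate(s) > self.r_sota
```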

Search + reuse (intuition)

  • Try many solutions and keep the best ones.
  • Start from an empty solution to explore new ideas.
  • Reuse good past attempts to make deeper progress.
  • Also reuse notes/ideas from earlier attempts as hints.

Search + reuse baselines

  • Best-of-N: i.i.d. rollouts from πθ with s = s_sota or <empty>.
  • State reuse: sample s_i from a buffer H_i via reuse(H_i).
  • State-action reuse: reuse s_i plus context c_i derived from prior actions.
  • Effect: reuse adds extra timesteps and extends the effective horizon (see the sketch below).
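
A rough Python sketch contrasting Best-of-N with state reuse, under simplifying assumptions: `policy` stands in for sampling an action from πθ and executing it, `task` is the hypothetical container from the earlier sketch, and the reuse rule here is a simple pick-the-best-of-a-random-sample rather than the paper's PUCT rule.

```python
# Illustrative contrast between the two baseline families; not the paper's code.
import random

def best_of_n(policy, task, n, start=""):
    # Best-of-N: n i.i.d. rollouts, all from the same start (s_sota or <empty>).
    candidates = [policy(task.description, start, context="") for _ in range(n)]
    return max(candidates, key=task.evaluate)

def state_reuse(policy, task, n, start=""):
    # State reuse: each rollout starts from a state drawn out of the buffer H,
    # so attempt i can continue from attempt j < i instead of restarting.
    buffer = [(start, 0.0)]                                # H_0; empty start scores 0
    for _ in range(n):
        pool = random.sample(buffer, min(4, len(buffer)))  # simplistic stand-in for reuse(H_i)
        s_i = max(pool, key=lambda sr: sr[1])[0]
        s_new = policy(task.description, s_i, context="")
        buffer.append((s_new, task.evaluate(s_new)))
    return max(buffer, key=lambda sr: sr[1])[0]
```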

Method (big idea)

Instead of optimizing the average score, focus on the single best solution and search around it.

TTT-Discover: objective + reuse

Entropic objective emphasizes max reward; PUCT prioritizes high-potential states.

How it runs (simple loop)

  1. Pick a past attempt (or empty start).
  2. Generate a new solution.
  3. Score it with the task evaluator.
  4. Store it and train a bit using this feedback.
  5. Repeat and keep the best solution.

Algorithm loop (compressed)

  1. Sample (s_i, c_i) via reuse(H_i).
  2. Draw action a_i ~ π_{θ_i}(· | d, s_i, c_i).
  3. Parse/execute to get s'_i and reward r_i = R(s'_i).
  4. Add (s_i, a_i, s'_i, r_i) to H_{i+1}.
  5. Update θ with test-time RL; return best s by reward.
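
A minimal sketch of the five steps above in Python. Everything here is a placeholder: `reuse`, `policy`, `parse_and_execute`, and `update_policy` stand for the paper's components but are not its actual interfaces.

```python
# Compressed loop, with all callables supplied by the caller (hypothetical API).
def ttt_discover_loop(task, policy, reuse, parse_and_execute, update_policy, steps):
    history = []                                   # H: (s, a, s', r) transitions
    best_s, best_r = None, float("-inf")
    for _ in range(steps):
        s_i, c_i = reuse(history)                  # 1. sample (s_i, c_i) via reuse(H_i)
        a_i = policy(task.description, s_i, c_i)   # 2. draw an action from the current policy
        s_next, r_i = parse_and_execute(a_i)       # 3. parse/execute and score with R
        history.append((s_i, a_i, s_next, r_i))    # 4. append the transition to H_{i+1}
        update_policy(history)                     # 5. test-time RL update of theta
        if r_i > best_r:
            best_s, best_r = s_next, r_i
    return best_s, best_r                          # best solution found, by reward
```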

Two key choices

  • Objective: reward the best attempt, not the average.
  • Reuse: pick starting points that are good but not all the same.

Objective + PUCT (from the paper)

Entropic objective:

J_β(θ) = E_{s ~ reuse(H)} [ log E_{a ~ π_θ(·|s)} exp(β(s) · R(s, a)) ]

Advantage with KL penalty:

A(a; s) = w_{β(s)}(a) − 1 − λ log(π_θ(a|s) / π_{θ_0}(a|s))

PUCT reuse score:

score(s) = Q(s) + c · P(s) · sqrt(1 + T) / (1 + n(s))

Q(s) uses the max child reward (not the mean); β is set adaptively via a KL constraint.
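
A small numeric sketch of how these quantities could be computed for a group of sampled actions. The normalized-exponential form of w_{β(s)}(a) is my reading of the entropic objective's gradient weight, and the inputs (per-action rewards and log-probabilities) are assumptions, not the paper's interfaces.

```python
# Illustrative only: entropic advantage weights and the PUCT reuse score.
import math

def entropic_advantages(rewards, logp, logp_ref, beta, lam):
    # Assumed form of w_{beta(s)}(a): exp(beta * R) normalized to mean 1 over the
    # group, so weight mass concentrates on the highest-reward actions.
    z = sum(math.exp(beta * r) for r in rewards) / len(rewards)
    w = [math.exp(beta * r) / z for r in rewards]
    # A(a; s) = w - 1 - lam * log(pi_theta(a|s) / pi_theta0(a|s))
    return [wi - 1.0 - lam * (lp - lpr) for wi, lp, lpr in zip(w, logp, logp_ref)]

def puct_score(q, prior, total_visits, visits, c=1.0):
    # score(s) = Q(s) + c * P(s) * sqrt(1 + T) / (1 + n(s)),
    # with Q(s) the max child reward rather than the mean.
    return q + c * prior * math.sqrt(1 + total_visits) / (1 + visits)
```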

Results (high level)

The method improves best-known results in math, GPU kernels, algorithms, and biology.

Results (with numbers)

Scores and runtimes are from the paper’s summary table and task sections.

Mathematics

Finds better bounds on classic problems by searching over constructive certificates.

Example: improves Erdős’ minimum overlap bound and one autocorrelation inequality.

Mathematics

  • Erdős min overlap: 0.380876 (best human 0.380927).
  • AC1 upper bound: 1.50286 (best human 1.50973).
  • AC2: best 0.959, not a discovery.
  • Construction: 600-piece step function (vs 51-piece human).

GPU kernels (TriMul)

Generates faster GPU kernels by fusing operations and reducing memory traffic.

Up to ~2× faster than prior art on GPUMode TriMul.

GPU kernels (TriMul)

  • A100: 2198 μs vs best human 4531 μs.
  • H100: 1161 μs vs best human 1371 μs.
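
To make the fusion argument concrete, a toy PyTorch example (this is NOT the paper's TriMul kernel): written eagerly, the gate below launches separate kernels and materializes intermediates in GPU memory; compiling the chain lets the elementwise ops be fused so the data makes fewer round trips.

```python
# Toy illustration of kernel fusion and reduced memory traffic; unrelated to the
# actual TriMul kernels discovered in the paper.
import torch

def gated_update(x, gate):
    return x * torch.sigmoid(gate)       # eager: sigmoid and mul run as separate kernels

gated_update_fused = torch.compile(gated_update)  # elementwise chain fused into fewer kernels
```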

Algorithm competitions

Writes stronger heuristic algorithms for AtCoder contests.

Exceeds the best known score on AHC039.

Algorithm competitions (AtCoder AHC)

  • Contests: ahc039 (geometry) and ahc058 (production scheduling).
  • AHC039: 567,062 vs best human 566,997.
  • Environment: C++ code; reward = score on local tests.

Biology (single-cell denoising)

Improves denoising quality on standard single-cell benchmarks.

Score improves from 0.64 to 0.71 on PBMC.

Biology (single-cell denoising)

  • Baseline: MAGIC.
  • Improvements: gene-adaptive ensembling, low-rank SVD, log-space polishing.
  • Score: 0.71 (PBMC) vs MAGIC 0.64.
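
A hedged sketch of two of the generic ingredients named above, low-rank SVD and log-space processing, on a cells × genes count matrix. It is an illustration only, not the paper's pipeline, and it omits the gene-adaptive ensembling entirely.

```python
# Illustrative low-rank, log-space denoiser for a cells x genes count matrix.
# Not the paper's method; the rank and clipping choices are arbitrary.
import numpy as np

def lowrank_log_denoise(counts: np.ndarray, rank: int = 20) -> np.ndarray:
    x = np.log1p(counts)                              # move to log space
    u, s, vt = np.linalg.svd(x, full_matrices=False)  # thin SVD
    x_hat = (u[:, :rank] * s[:rank]) @ vt[:rank]      # keep the top-`rank` components
    return np.expm1(np.clip(x_hat, 0.0, None))        # back to counts, kept nonnegative
```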

Resources

Primary links from the paper and project page.

Suggested citation

Yuksekgonul et al., Learning to Discover at Test Time, arXiv:2601.16175 (2026).
