Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention

Xiao Li, Chengruidong Zhang, Hao Luo, Xi Lin, Zekun Wang, Zihan Qiu, Yunfei Mao, Langshi Chen, Man Yuan, Minmin Sun, Huiqiang Jiang, Siqi Zhang, Rui Men, Wei Hu, Gong Cheng, Bo Zheng, Dayiheng Liu, Jingren Zhou

arxiv: 2606.26560 · v1 · pith:PIGKJW6Jnew · submitted 2026-06-25 · 💻 cs.CL

Pith reviewed 2026-06-26 05:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords Erase-then-Delta Attention · delta-rule linear attention · recurrent memory · language model pretraining · long-context modeling · memory management · linear attention


The pith

Decoupling erase and write addresses lets delta-rule linear attention actively remove stale memory before writing new content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Delta-rule linear attention corrects stored values at the current write address but cannot actively clear outdated information stored elsewhere. Erase-then-Delta Attention (EDA) adds a separate learned erase step that targets a different address before the standard corrective write occurs. This preserves the delta-rule correction while expanding the model's ability to manage recurrent memory. Pretraining runs on dense 2.5B and MoE 25B models show EDA outperforming prior delta-rule variants, with the advantage holding after 80B tokens of long-context midtraining and across 4k-to-128k evaluations.

Core claim

The central claim is that recurrent memory models benefit when the erase address is chosen independently of the write address: EDA first applies a targeted erase along a learned direction, then performs the usual delta-style correction at the write address, thereby keeping the corrective behavior intact while adding an explicit cleanup path that is most active when passive decay is weak.

What carries the argument

Erase-then-Delta Attention (EDA) update rule that performs a learned-direction erase step before the standard delta-rule corrective write.
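
The source page gives no equations, so the following is only a sketch of one form such an update could take, written in the style of the gated delta rule. Every symbol here is an assumption introduced for illustration and not the paper's notation: S_t is the memory state, k_t and v_t the write key and value, e_t the erase direction, α_t the passive decay, β_t the write strength, and γ_t the erase strength, with e_t and k_t taken to be unit-norm.

```latex
\begin{aligned}
\tilde{S}_t &= \alpha_t\, S_{t-1}\,\bigl(I - \gamma_t\, e_t e_t^{\top}\bigr)
  && \text{(decay, then erase along } e_t\text{)} \\
S_t &= \tilde{S}_t\,\bigl(I - \beta_t\, k_t k_t^{\top}\bigr) + \beta_t\, v_t k_t^{\top}
  && \text{(delta-rule corrective write along } k_t\text{)}
\end{aligned}
```

Under this assumed form, setting γ_t = 0 recovers the gated delta-rule update, which is one way to read the claim that the corrective write is kept intact.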

If this is right

EDA achieves the lowest perplexity in both dense 2.5B and MoE 25B-A2.8B pretraining.
The advantage remains after 80B-token long-context midtraining of the MoE models.
EDA records the best results on long-context benchmarks spanning 4k to 128k contexts.
Memory-state probes indicate the extra erase path is used most when passive decay is weak while the delta correction stays unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling idea could be tested in other recurrent state-space or linear-attention variants that currently rely only on decay or overwriting.
Explicit erase directions might reduce reliance on learned decay rates or attention sinks in long-context settings.
If the erase direction can be made input-dependent at inference time, the method could support on-the-fly memory editing without retraining.

Load-bearing premise

A learned erase direction can be trained to target stale memory at an address independent of the write address without interfering with the delta-rule corrective update or introducing training instability.

What would settle it

If side-by-side pretraining runs of identical 2.5B or 25B models show that EDA does not reduce validation perplexity relative to standard delta-rule attention after the same token budget, the performance claim would be falsified.
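
The comparison above rests on validation perplexity, which is the exponential of the mean per-token negative log-likelihood. A minimal sketch of that bookkeeping follows; the function names are hypothetical, and no numbers from the paper are used.

```python
import math

def perplexity(total_nll, num_tokens):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(total_nll / num_tokens)

def relative_ppl_change(ppl_baseline, ppl_candidate):
    """Relative change in perplexity; negative means the candidate is better."""
    return (ppl_candidate - ppl_baseline) / ppl_baseline
```

A non-negative `relative_ppl_change` for EDA against the delta-rule baseline, at the same model size and token budget, would be the falsifying outcome described above.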

Figures

Figures reproduced from arXiv: 2606.26560 by Bo Zheng, Chengruidong Zhang, Dayiheng Liu, Gong Cheng, Hao Luo, Huiqiang Jiang, Jingren Zhou, Langshi Chen, Man Yuan, Minmin Sun, Rui Men, Siqi Zhang, Wei Hu, Xiao Li, Xi Lin, Yunfei Mao, Zekun Wang, Zihan Qiu.

**Figure 2.** EDA uses an independent cleanup path. (a) Gate-strength allocation by mean-retention bin. Independent erase becomes dominant when decay is weak (α¯ close to one); red percentages above bars denote the erase share. (b) Under the same erase gates, counterfactual erase directions cause larger local readout perturbations than the learned direction; bars show layerwise fold change relative to Actual, and µ deno…
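
The caption for panel (a) does not say how the erase share is computed. One plausible reading, offered here only as an assumption, is the fraction of total gate strength in a retention bin that belongs to the erase gate. The sketch below uses that reading; `erase_share_by_bin` and its arguments are hypothetical names.

```python
from bisect import bisect_right

def erase_share_by_bin(alpha_bar, erase_gate, write_gate, edges):
    """Assumed definition: sum(erase) / (sum(erase) + sum(write)) per bin.

    Bins are [edges[i], edges[i+1]); the last bin also includes its right edge.
    Returns None for bins with no gate strength.
    """
    n_bins = len(edges) - 1
    erase_sum = [0.0] * n_bins
    write_sum = [0.0] * n_bins
    for a, g_erase, g_write in zip(alpha_bar, erase_gate, write_gate):
        b = bisect_right(edges, a) - 1      # index of the bin containing a
        if a == edges[-1]:
            b = n_bins - 1                  # put the right edge in the last bin
        if 0 <= b < n_bins:                 # ignore values outside the edges
            erase_sum[b] += g_erase
            write_sum[b] += g_write
    shares = []
    for e, w in zip(erase_sum, write_sum):
        shares.append(e / (e + w) if (e + w) > 0 else None)
    return shares
```

Under this reading, the caption's observation would show up as the share rising toward the bins where the mean retention is close to one.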

Original abstract

Delta-rule linear attention improves recurrent memory updates by correcting what is already stored at the current write address before writing new content. However, the active correction is still anchored to that same write address. As a result, stale information stored at a different address cannot be actively removed before new content is written elsewhere. We propose Erase-then-Delta Attention (EDA), a memory update rule that decouples where to erase from where to write. The key insight is that recurrent memory models should not only correct the current write, but also selectively suppress outdated memory at an independently chosen address. Concretely, our method first applies a targeted erase step along a learned erase direction, and then performs the standard delta-style corrective write along the current write direction. This preserves the corrective behavior of delta-rule updates while expanding their memory-management capacity. Language-model pretraining experiments across dense 2.5B and MoE 25B-A2.8B model families show that EDA performs best in both settings. The gain persists after 80B-token long-context midtraining of the MoE models, where EDA also performs best in long-context evaluations from 4k to 128k contexts. A compact update analysis and memory-state probes suggest why: EDA keeps the delta-rule corrective write intact while allocating an additional cleanup path most strongly when passive decay is weak. These results suggest that recurrent memory models should decide not only what to write, but also what stale information to erase and where.
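
The two-step update described in the abstract can be illustrated with a toy recurrent step. This is a sketch under stated assumptions, not the authors' implementation: the rank-one form of the erase, the scalar gates `alpha`, `beta`, `gamma`, and all function names are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
def matvec(S, x):
    """Multiply matrix S (list of rows) by vector x."""
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]

def rank_one_update(S, u, w, scale):
    """Return S + scale * outer(u, w) as a new matrix."""
    return [[s + scale * ui * wj for s, wj in zip(row, w)]
            for row, ui in zip(S, u)]

def erase_then_delta_step(S, k, v, e, alpha, beta, gamma):
    """One step of an ASSUMED erase-then-delta update.

    S: d_v x d_k memory; k, e: unit-norm write and erase directions;
    v: value to store; alpha, beta, gamma: gates in [0, 1].
    """
    S = [[alpha * s for s in row] for row in S]       # passive decay
    S = rank_one_update(S, matvec(S, e), e, -gamma)   # erase along e
    S = rank_one_update(S, matvec(S, k), k, -beta)    # remove old content at k
    S = rank_one_update(S, v, k, beta)                # write v at k
    return S

# Toy memory holding a stale value at k_old; the new write goes to k_new.
k_old, k_new = [1.0, 0.0], [0.0, 1.0]
v_old, v_new = [3.0, -1.0], [2.0, 5.0]
S0 = rank_one_update([[0.0, 0.0], [0.0, 0.0]], v_old, k_old, 1.0)

with_erase = erase_then_delta_step(S0, k_new, v_new, k_old, 1.0, 1.0, 1.0)
without_erase = erase_then_delta_step(S0, k_new, v_new, k_old, 1.0, 1.0, 0.0)
```

In this toy case the stale value at `k_old` survives the write when `gamma` is zero, because the delta correction only touches `k_new`, and it is cleared when the erase gate is open. The new value is readable at `k_new` in both cases.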

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EDA adds an independent erase step before delta-rule writes in linear attention and reports gains on 2.5B and 25B-scale models, but the decoupling may not survive joint training without further checks.


The main takeaway is that Erase-then-Delta Attention adds an independent erase operation before the delta write in linear attention, allowing cleanup at a different memory address. This is presented as fixing a limitation where stale info at non-write addresses couldn't be actively removed.

The paper introduces the EDA rule with a learned erase direction followed by the standard delta correction. It reports better performance than baselines in pretraining both a 2.5B dense model and a 25B-A2.8B MoE model. Those gains hold after an additional 80B tokens of long-context midtraining on the MoE version, with the best results on contexts from 4k to 128k. A brief analysis and memory probes are included to explain when the erase path is most active.

This approach is a direct extension of delta-rule ideas and the experiments cover relevant scales and settings for the subfield.

The main concern is whether the learned erase direction truly stays decoupled from the write direction without causing interference or instability. The provided analysis is high-level and doesn't fully address training dynamics or potential correlation between the two vectors. The lack of detailed ablations or quantitative breakdowns in the reported results makes it difficult to isolate the contribution of the erase step specifically. If the erase and write end up aligned, the claimed benefit reduces.
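
The alignment concern can be made concrete with a short calculation. It assumes, purely for illustration and not as the paper's stated form, that the erase and the delta correction are both rank-one factors with unit-norm directions e and k and strengths γ and β. If the erase direction coincides with the write direction (e = k), the two factors merge:

```latex
\bigl(I - \gamma\, k k^{\top}\bigr)\bigl(I - \beta\, k k^{\top}\bigr)
  = I - (\gamma + \beta)\, k k^{\top} + \gamma\beta\, k\,(k^{\top}k)\,k^{\top}
  = I - (\gamma + \beta - \gamma\beta)\, k k^{\top},
  \qquad k^{\top}k = 1 .
```

In that case the erase acts only at the write address, with a larger effective strength γ + β − γβ, and no independent address is cleaned up.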

This paper is for people focused on improving recurrent memory in transformers for long sequences. It has a clear technical proposal and consistent empirical findings, so it should go through peer review rather than a desk reject. The idea is worth testing and refining even if the current evidence leaves some questions open.

Referee Report

2 major / 1 minor

Summary. The paper proposes Erase-then-Delta Attention (EDA), a memory update rule for delta-rule linear attention that first applies a targeted erase along a learned erase direction before performing the standard delta-style corrective write. This is intended to decouple erase and write addresses so that stale information at an independent address can be suppressed. Pretraining experiments on dense 2.5B and MoE 25B-A2.8B models report that EDA performs best in both settings; the advantage persists after 80B-token long-context midtraining, where EDA also leads on evaluations from 4k to 128k contexts. A compact update analysis and memory-state probes are offered to explain the behavior.

Significance. If the decoupling holds without destabilizing the delta-rule correction, the method would expand the memory-management capacity of recurrent linear attention beyond address-tied correction. The cross-scale experiments (2.5B dense and 25B MoE) and persistence through long-context midtraining constitute concrete empirical support; the memory-state probes add mechanistic insight. These elements would be strengths if the independence of the learned erase direction is rigorously verified.

major comments (2)

[Abstract] Abstract and update-analysis paragraph: the claim that the delta-rule corrective write 'remains intact' rests on a high-level analysis that does not quantify correlation between the jointly learned erase vector and the write vector, nor bound potential interference with the corrective update; this is load-bearing for the decoupling assertion.
[Experiments] Experiments section (performance claims): no quantitative effect sizes, ablation isolating the erase-direction contribution, or controls for training instability are reported in the abstract, and the full text must supply these to substantiate that observed gains arise from address decoupling rather than other unablated factors.
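
The correlation check requested in the first major comment could be sketched as a per-token cosine similarity between the erase and write directions. This is a hypothetical diagnostic rather than anything reported in the source; the function names are assumptions, and the absolute value is taken because a rank-one erase of the form I − γ e eᵀ does not depend on the sign of e.

```python
import math

def abs_cosine(e, k):
    """|cos| of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(e, k))
    norm = math.sqrt(sum(a * a for a in e)) * math.sqrt(sum(b * b for b in k))
    return abs(dot) / norm if norm > 0 else 0.0

def mean_abs_cosine(erase_dirs, write_dirs):
    """Average |cos| over paired erase and write directions (one per token)."""
    sims = [abs_cosine(e, k) for e, k in zip(erase_dirs, write_dirs)]
    return sum(sims) / len(sims)
```

Values near one would indicate that the erase address has collapsed onto the write address, while values near zero would indicate that the two stay close to orthogonal.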

minor comments (1)

[Method] Notation for the learned erase direction should be introduced with an explicit equation and distinguished from the write direction to avoid ambiguity in the update rule.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the two major comments point by point below, offering clarifications from the manuscript and committing to targeted revisions that strengthen the evidence for decoupling without overstating current results.


Referee: [Abstract] Abstract and update-analysis paragraph: the claim that the delta-rule corrective write 'remains intact' rests on a high-level analysis that does not quantify correlation between the jointly learned erase vector and the write vector, nor bound potential interference with the corrective update; this is load-bearing for the decoupling assertion.

Authors: We agree that the current update analysis is high-level and that explicit quantification would make the independence claim more rigorous. The manuscript's compact analysis shows the erase step acts along a learned direction prior to the write, and the memory-state probes indicate differential cleanup behavior when passive decay is weak. To directly address the concern, we will add measurements of the correlation between the jointly learned erase and write vectors plus a bound on interference in the revised version.

revision: yes

Referee: [Experiments] Experiments section (performance claims): no quantitative effect sizes, ablation isolating the erase-direction contribution, or controls for training instability are reported in the abstract, and the full text must supply these to substantiate that observed gains arise from address decoupling rather than other unablated factors.

Authors: The full manuscript already reports consistent gains on 2.5B dense and 25B MoE models that persist after long-context midtraining. We acknowledge, however, that explicit effect sizes, an ablation isolating the erase direction, and training-stability controls are needed to rule out confounding factors. In revision we will insert quantitative effect sizes (e.g., perplexity deltas), a dedicated ablation on the erase-direction term, and stability metrics across runs.

revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method tested independently of inputs


The paper introduces Erase-then-Delta Attention as a novel decoupling of erase and write addresses in delta-rule linear attention, then validates it via pretraining experiments on 2.5B dense and 25B-A2.8B MoE models plus long-context midtraining. No equations, fitted parameters, or self-citations are shown that reduce the performance claims or the update analysis to a definition or prior result by construction. The compact update analysis is presented as supporting evidence rather than a load-bearing derivation that collapses into the method definition itself. The chain remains self-contained through external empirical benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Abstract-only review limits identification; the main addition is the learned erase direction as a new component whose benefit is assumed to be compatible with delta-rule updates.

free parameters (1)

learned erase direction
Direction used for the targeted erase step; learned during training to select the address for suppression.

axioms (1)

domain assumption: Delta-rule linear attention remains stable and corrective when an independent erase operation is inserted before the write step.
Required for the claim that EDA preserves corrective behavior while adding cleanup.

invented entities (1)

erase direction (no independent evidence)
purpose: Provides an address for active suppression of stale memory independent of the current write address.
New postulated component introduced to enable the decoupling described in the method.

pith-pipeline@v0.9.1-grok · 5855 in / 1428 out tokens · 30828 ms · 2026-06-26T05:27:45.021577+00:00 · methodology


