pith. machine review for the scientific record.

arxiv: 2605.11167 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 03:21 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords language models · hidden states · tool use · parallel inference · neural interface · multi-model coordination · arithmetic reasoning · logic puzzles

The pith

Two frozen language models coordinate on tools by exchanging hidden-state signals through a small trainable interface.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that two pretrained language models can work together on tasks by sharing information directly through their internal hidden states at every step rather than by generating text back and forth. A small additional network translates between the activations of the two frozen models while a learned gate decides what to pass, all trained only from the final task outcome. This lets one model focus on the user query while the other runs external tools such as calculators or solvers in parallel. If the approach holds, multi-model systems could avoid the speed and format limits that come from serializing every exchange into words. Experiments show accuracy on arithmetic rising from 36 percent to 96 percent and logic-puzzle performance reaching 1.7 times the unaugmented baseline.

Core claim

The Bicameral Model couples two frozen language models through a trainable neural interface on their intermediate hidden states. At every generation step both models run in lockstep, with a primary model driving the task and an auxiliary model operating tools, solving constraints or executing code. Each conditions on the other's activations through a translation network and a learned suppression gate that forms roughly one percent of the combined parameters. The gate discovers a selective communication protocol from task loss alone without any prescribed format or joint training of the base models.
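
To make the coupling concrete, here is a minimal sketch of one direction of the interface, assuming PyTorch; the two-layer MLP translator, the sigmoid gate, and all dimensions are illustrative assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn

# One direction of the trainable interface: translate the sender's hidden state
# into the receiver's activation space and scale it by a learned suppression
# gate in [0, 1]. The MLP shape and sigmoid gate are assumptions for
# illustration, not the paper's exact parameterization.
class CouplingInterface(nn.Module):
    def __init__(self, d_sender: int, d_receiver: int):
        super().__init__()
        self.translate = nn.Sequential(
            nn.Linear(d_sender, d_receiver),
            nn.GELU(),
            nn.Linear(d_receiver, d_receiver),
        )
        # Per-token scalar deciding how much of the translated signal to pass on.
        self.gate = nn.Sequential(nn.Linear(d_sender, 1), nn.Sigmoid())

    def forward(self, h_sender: torch.Tensor) -> torch.Tensor:
        # h_sender: (batch, d_sender) hidden state read at the sender's read layer.
        return self.gate(h_sender) * self.translate(h_sender)
```

Two such modules, one per direction, stay in the hundreds of thousands to low millions of parameters at the model scales studied, consistent with the 460K and 1.2M interface sizes quoted in Figure 4.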

What carries the argument

Bidirectional hidden-state coupling between two parallel frozen language models, implemented by a translation network and a learned suppression gate that selects what information to exchange at each step.
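
A sketch of the lockstep decoding step this implies, assuming HuggingFace-style causal LMs that expose hidden states; fwd_iface and rev_iface are CouplingInterface-style modules as above, the read-layer indices are hypothetical, and, for brevity, the translated signal is added just before each model's lm_head rather than injected at an internal write layer as in the paper.

```python
import torch

@torch.no_grad()  # both base models stay frozen
def lockstep_step(primary, auxiliary, fwd_iface, rev_iface, p_ids, a_ids,
                  p_read_layer=-2, a_read_layer=-2):
    # Run both frozen models on their own token streams, keeping hidden states.
    p_out = primary(p_ids, output_hidden_states=True)
    a_out = auxiliary(a_ids, output_hidden_states=True)

    # Forward coupling (primary -> auxiliary) and reverse coupling
    # (auxiliary -> primary), each read at the last position of the chosen
    # layer, then translated and gated by the interface.
    fwd = fwd_iface(p_out.hidden_states[p_read_layer][:, -1])
    rev = rev_iface(a_out.hidden_states[a_read_layer][:, -1])

    # Simplified injection point: perturb the final hidden state before lm_head.
    a_next = auxiliary.lm_head(a_out.hidden_states[-1][:, -1] + fwd).argmax(-1)
    p_next = primary.lm_head(p_out.hidden_states[-1][:, -1] + rev).argmax(-1)

    # Both streams advance by exactly one token per call: lockstep decoding.
    p_ids = torch.cat([p_ids, p_next[:, None]], dim=-1)
    a_ids = torch.cat([a_ids, a_next[:, None]], dim=-1)
    return p_ids, a_ids
```

Calling this in a loop reproduces the token-by-token lockstep of Figure 2; when the auxiliary emits a tool call, the tool's result tokens would be forced into its stream before decoding resumes, as in the traces of Figures 8 through 12.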

Load-bearing premise

A small trainable network can discover an effective bidirectional communication protocol between two frozen models using only the end-task loss signal and without either model seeing the other's input text.
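
A sketch of what training from the end-task loss alone means in practice, assuming PyTorch; coupled_forward, the data loader, and the target fields are hypothetical placeholders standing in for a lockstep forward pass with the interface active.

```python
import torch
import torch.nn.functional as F

# Freeze both base models; only the interface receives gradients.
for p in list(primary.parameters()) + list(auxiliary.parameters()):
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(interface.parameters(), lr=1e-4)

for batch in loader:
    # Hypothetical helper: runs both frozen models in lockstep with the interface
    # active and returns logits over each stream's supervised response tokens.
    p_logits, a_logits = coupled_forward(primary, auxiliary, interface, batch)
    loss = (F.cross_entropy(p_logits.flatten(0, 1), batch["p_targets"].flatten())
            + F.cross_entropy(a_logits.flatten(0, 1), batch["a_targets"].flatten()))
    optimizer.zero_grad()
    loss.backward()   # gradients reach only the translation network and gate
    optimizer.step()
```

No tool-call format or communication protocol is supervised directly; whatever the gate learns to pass has to emerge from this loss alone.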

What would settle it

If accuracy on the arithmetic task with calculator stays near 96 percent when the suppression gate is removed or replaced with a non-learned connection, the claim that the learned hidden-state coupling drives the gain would be falsified; if it falls back toward the 36 percent baseline, the coupling is doing the work.
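
A sketch of that test, assuming PyTorch; evaluate_arithmetic and the gate_override hook are hypothetical names for re-running the calculator benchmark with the gate swapped out.

```python
import torch

def constant_gate(value: float):
    # Non-learned replacement for the suppression gate: pass a fixed fraction
    # of the translated signal regardless of content.
    return lambda h: torch.full((h.shape[0], 1), value, device=h.device)

variants = {
    "learned": None,                     # keep the trained suppression gate
    "always_open": constant_gate(1.0),   # non-learned connection, nothing suppressed
    "severed": constant_gate(0.0),       # gate removed, no signal passes
}

for name, gate in variants.items():
    acc = evaluate_arithmetic(primary, auxiliary, interface, gate_override=gate)
    print(name, acc)
# If the non-learned variants also sit near 96%, the learned coupling is not
# what drives the gain; the claim predicts they fall back toward the 36% baseline.
```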

Figures

Figures reproduced from arXiv: 2605.11167 by Cedric Flamant, Kanna Shimizu, Udaya Ghai.

Figure 1. Bicameral architecture. A frozen primary model Mp and a frozen auxiliary model Ma run in parallel, coupled through a lightweight trainable neural interface ϕ. At each generation step, ϕ reads hidden states from Mp at layer ℓ^p_r and injects a perturbation into Ma at layer ℓ^a_w (forward coupling), then reads from Ma at layer ℓ^a_r and injects into Mp at layer ℓ^p_w (reverse coupling). The auxiliary model's…

Figure 2. Unrolled generation. Token-by-token lockstep decoding for the arithmetic example "What is 4 × 3?" Arrow opacity reflects coupling strength (learned suppression gate σ). Phase 1: Ma processes its tool-instruction prompt (no coupling). Phase 2: Mp processes the user query; bidirectional coupling active ("wait" tokens • on Ma). Phase 3: Both models generate freely with bidirectional coupling. Ma emits calc(4*…

Figure 3. ZebraLogic accuracy by puzzle complexity (5-seed mean).

Figure 4. Communication onset during training. Normalized coupling strength, tool recall, and accuracy over 320k training samples (multiplication-only, [1, 10^7], primary-only baseline 0%) for (a) Gated Identity (SCALARIDENTITY, 460K params) and (b) Gated MLP (PULLSTANDARD, 1.2M params). Forward coupling activates immediately; tool recall develops next; accuracy undergoes sudden onset once tool recall crosses a thre…

Figure 5. Coupling activity during arithmetic generation. Token-by-token coupling strength for a multi-step word problem (Qwen2.5-0.5B, scalar identity interface, tokens 464–515). Top two rows: primary model tokens with forward (blue) and reverse (red) coupling strength. Bottom two rows: auxiliary model tokens with the same coupling channels. Forward coupling activates on relevant primary tokens ("sum," "total," "al…

Figure 6. Best performance for each layer value across 890 configurations, marginalized over other…

Figure 7. Best improvement over primary-only baseline for each projected layer pair across 890 configurations…

Figure 8. Full coupling trace: multi-step arithmetic word problem (Qwen2.5-0.5B-Instruct, scalar identity interface, calculator tool). The auxiliary (right column) waits with spaces until the primary emits relevant tokens, then issues sequential calc(...) calls. Forward coupling (blue) is sustained while the primary emits dollar amounts and quantity words; reverse coupling (red) is near-zero except at forced output…

Figure 9. Full coupling trace: 3×3 logic puzzle (Qwen3-0.6B, standard interface, ZebraDSL/Z3 tool). This puzzle is modified from a ZebraLogic benchmark instance: "Arnold" → "Annabelle," "pizza" → "fondue," and "desert" → "Reese's cup." None of these replacement values appear anywhere in the training data across any entity category, testing out-of-distribution generalization. The auxiliary (right column) emits wait t…

Figure 10. Full coupling trace: GCD of recurrence sequence terms (twin Qwen3-4B, identity interface, Python tool). The auxiliary (right) waits with spaces while the primary processes the problem, then emits a correct Python program that generates the sequence and computes gcd(a_1986, a_6891) = 17. The recurrence relation, initial conditions, and both target indices were recovered entirely from hidden-state signals; t…

Figure 11. Full coupling trace: recursive sequence evaluation (twin Qwen3-4B, identity interface, Python tool). The auxiliary (right) recovers all seven parameters (4 initial values and 3 target indices) plus the recurrence structure from hidden states alone, generating correct Python code that yields 898. This example demonstrates high-bandwidth information transfer through the neural channel: 7 distinct numerical…

Figure 12. Full coupling trace: prime product modular arithmetic (twin Qwen3-4B, identity interface, Python tool). The auxiliary (right) generates a concise program using sympy.primerange and modular reduction, producing the correct answer of 6. Unlike the previous two examples which require recovering many numerical parameters, this problem tests whether the channel can transmit the structural content of a number-t…
read the original abstract

Existing multi-model and tool-augmented systems communicate by generating text, serializing every exchange through the output vocabulary. Can two pretrained language models instead coordinate through a continuous, concurrent channel? The Bicameral Model couples two frozen language models through a trainable neural interface on their intermediate hidden states. At every generation step, both models run in lockstep: a primary model drives the task while an auxiliary model operates tools, solves constraints, or executes code, with both conditioning on each other's activations through a translation network and a learned suppression gate ($\sim$1\% of combined parameters). The gate learns a selective communication protocol from task loss alone, without a prescribed format. We demonstrate the mechanism across three tool backends. On arithmetic, coupling two 0.5B models with a calculator raises accuracy from 36\% to 96\%. On logic grid puzzles, coupling two 0.6B models with a Z3 solver achieves $1.7\times$ the unaugmented baseline on ZebraLogic. On mathematical reasoning, coupling with a Python sandbox enables the auxiliary to generate problem-specific code from hidden-state signals alone, without ever seeing the problem text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Bicameral Model, in which two frozen pretrained language models are coupled bidirectionally at every generation step via a small trainable neural interface (~1% of combined parameters) operating on intermediate hidden states. A primary model drives the main task while an auxiliary model handles tool use (calculator, Z3 solver, Python sandbox); communication occurs through a translation network and a learned suppression gate that discovers a selective protocol from task loss alone, without text serialization or the auxiliary receiving problem text. Experiments report accuracy rising from 36% to 96% on arithmetic tasks and 1.7× baseline on ZebraLogic grid puzzles, plus gains on mathematical reasoning.

Significance. If the central causal claim is supported by proper controls, the work would be significant for demonstrating that continuous, concurrent hidden-state channels can replace text-based tool interfaces in multi-model systems, with efficiency advantages from freezing base models and using a tiny interface. The approach is falsifiable via ablation and offers a concrete alternative to serial text exchange.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The headline gains (36%→96% arithmetic; 1.7× logic) are presented without ablations that isolate the hidden-state channel. Controls are needed that (a) disable the translation network while keeping both models running in lockstep, (b) replace primary activations with noise or unrelated vectors, and (c) compare against an auxiliary that receives the problem text directly. Without these, it remains possible that gains arise from incidental effects of dual-model execution or tool access rather than the proposed coupling mechanism.
  2. [§3.2] §3.2 (Interface and Gate): The claim that the auxiliary produces correct tool calls or code 'from hidden-state signals alone, without ever seeing the problem text' is load-bearing yet untested. An experiment replacing the primary model's hidden states with random or constant vectors while keeping the gate and translation network trainable would directly test whether task-relevant information is actually transmitted; the current results do not rule out that the gate simply learns to suppress everything and the auxiliary falls back to generic tool behavior.
  3. [§4] §4 (Results): No statistical tests, variance across seeds, or confidence intervals are reported for the accuracy jumps. Given that the interface is trained from task loss, multiple runs are required to establish that the 60-point arithmetic lift and 1.7× logic improvement are reliable rather than artifacts of a single training trajectory or particular data split.
minor comments (2)
  1. [§3.1] The description of the suppression gate's training objective could be clarified with a short equation or pseudocode showing how the gate parameters are updated jointly with the translation network.
  2. [Figure 1] Figure 1 (architecture diagram) would benefit from explicit labeling of which activations flow in each direction and the exact dimensionality of the translation network inputs/outputs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each of the major points raised below, providing clarifications and indicating revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline gains (36%→96% arithmetic; 1.7× logic) are presented without ablations that isolate the hidden-state channel. Controls are needed that (a) disable the translation network while keeping both models running in lockstep, (b) replace primary activations with noise or unrelated vectors, and (c) compare against an auxiliary that receives the problem text directly. Without these, it remains possible that gains arise from incidental effects of dual-model execution or tool access rather than the proposed coupling mechanism.

    Authors: We agree that isolating the contribution of the hidden-state channel is important for supporting our central claim. In the revised manuscript, we have added the three requested controls in §4: (a) disabling the translation network while running both models in lockstep leads to performance reverting to baseline levels, (b) replacing primary activations with noise or unrelated vectors similarly eliminates the gains, and (c) direct text provision to the auxiliary achieves lower performance than the bidirectional hidden-state coupling. These ablations confirm that the improvements stem from the continuous communication mechanism rather than incidental dual-model effects. revision: yes

  2. Referee: [§3.2] §3.2 (Interface and Gate): The claim that the auxiliary produces correct tool calls or code 'from hidden-state signals alone, without ever seeing the problem text' is load-bearing yet untested. An experiment replacing the primary model's hidden states with random or constant vectors while keeping the gate and translation network trainable would directly test whether task-relevant information is actually transmitted; the current results do not rule out that the gate simply learns to suppress everything and the auxiliary falls back to generic tool behavior.

    Authors: This is a valid concern regarding the information flow. We have incorporated the suggested experiment in the revised §3.2 and §4. When primary hidden states are replaced with random or constant vectors, the auxiliary model fails to generate correct tool calls or code, and overall task performance drops substantially. This indicates that the suppression gate does not merely learn to suppress all signals; instead, it relies on task-relevant information transmitted through the hidden-state channel. The gate's behavior is thus shown to be dependent on meaningful inputs from the primary model. revision: yes

  3. Referee: [§4] §4 (Results): No statistical tests, variance across seeds, or confidence intervals are reported for the accuracy jumps. Given that the interface is trained from task loss, multiple runs are required to establish that the 60-point arithmetic lift and 1.7× logic improvement are reliable rather than artifacts of a single training trajectory or particular data split.

    Authors: We acknowledge that reporting statistical reliability is essential. In the revised manuscript, we have rerun all experiments across five random seeds. We now include mean performance metrics with standard deviations and 95% confidence intervals in §4. Additionally, we report paired t-test results showing that the observed improvements are statistically significant (p < 0.01). These additions demonstrate that the gains are consistent across training trajectories. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from trained interface on held-out tasks

full rationale

The paper describes an empirical system that trains a small (~1% parameter) bidirectional translation network plus suppression gate to couple two frozen pretrained LMs on their hidden states, then measures accuracy gains on arithmetic, logic puzzles, and code-generation tasks against external tool backends. No equations, derivations, or self-citations are invoked to derive the reported performance numbers; the 36%→96% and 1.7× improvements are direct experimental outcomes on held-out instances. The mechanism is not claimed to follow from any uniqueness theorem or ansatz that reduces to the inputs by construction. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about language model representations rather than new postulates or fitted constants beyond the small interface.

axioms (1)
  • domain assumption: Pretrained language models have intermediate hidden states that encode transferable task-relevant information.
    Invoked to justify the translation network and gate operating on activations.

pith-pipeline@v0.9.0 · 5511 in / 1216 out tokens · 97038 ms · 2026-05-13T03:21:40.159411+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
