pith. machine review for the scientific record

arxiv: 2404.14219 · v4 · submitted 2024-04-22 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matthew Dixon, Ronen Eldan, Victor Fragoso, Jianfeng Gao, Mei Gao, Min Gao, Amit Garg, Allie Del Giorno, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Wenxiang Hu, Jamie Huynh, Dan Iter, Sam Ade Jacobs, Mojan Javaheripi, Xin Jin, Nikos Karampatziakis, Piero Kauffmann, Mahoud Khademi, Dongwoo Kim, Young Jin Kim, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Xihui Lin, Zeqi Lin, Ce Liu, Liyuan Liu, Mengchen Liu, Weishung Liu, XiaoDong Liu, Chong Luo, Piyush Madan, Ali Mahmoudzadeh, David Majercak, Matt Mazzola, Caio César Teodoro Mendes, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Liliang Ren, Gustavo de Rosa, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Yelong Shen, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Praneetha Vaddamanu, Chunyu Wang, Guanhua Wang, Lijuan Wang, Shuohang Wang, Xin Wang, Yu Wang, Rachel Ward, Wen Wen, Philipp Witte, Haiping Wu, Xiaoxia Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Jilong Xue, Sonali Yadav, Fan Yang, Jianwei Yang, Yifan Yang, Ziyi Yang, Donghan Yu, Lu Yuan, Chenruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:15 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords small language models · on-device inference · filtered web data · synthetic data · mixture of experts · multimodal reasoning · MMLU benchmark · parameter scaling

The pith

A 3.8 billion parameter model matches the performance of much larger models like Mixtral 8x7B while running on a phone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents phi-3-mini, a 3.8B parameter language model trained on 3.3 trillion tokens of filtered web and synthetic data that reaches 69 percent on MMLU and 8.38 on MT-bench. These scores place it on par with Mixtral 8x7B and GPT-3.5 even though the model is compact enough for direct phone deployment. Scaling to 7B and 14B versions yields further gains, while the phi-3.5 series adds a mixture-of-experts model and a vision variant that handle reasoning, math, code, and image prompts at competitive levels. The work shows that careful data selection can produce strong capability in models small enough to run without cloud support.

Core claim

Phi-3-mini is a 3.8 billion parameter language model trained on 3.3 trillion tokens whose performance rivals Mixtral 8x7B and GPT-3.5, achieving 69 percent on MMLU and 8.38 on MT-bench while fitting on a phone. Its training set is a scaled-up version of the phi-2 dataset built from heavily filtered public web data and synthetic data, followed by alignment for safety and chat use. The 7B and 14B phi-3 variants reach 75 percent and 78 percent on MMLU respectively, and the phi-3.5-MoE model with 6.6 billion active parameters exceeds similar-scale open models on language, math, and code tasks.
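
The 6.6 billion active-parameter figure is ordinary top-k routing arithmetic: each token touches the shared weights plus only the experts its router selects. A minimal sketch follows; the 16-expert shape and the roughly 6.6B active figure come from the paper, while top-2 routing and the shared/expert split are illustrative assumptions.

```python
# Back-of-envelope active-parameter count for a top-k MoE.
# Only the 16-expert shape and the ~6.6B active-parameter figure
# come from the report; top-2 routing and the shared/expert split
# below are hypothetical.

def active_params(expert_params: float, shared_params: float,
                  n_experts: int, top_k: int) -> float:
    """Weights touched per token: shared layers plus top-k experts."""
    return shared_params + expert_params * top_k / n_experts

# e.g. ~41.9B of expert weights and ~1.4B shared, routed top-2 of 16:
print(active_params(41.9e9, 1.4e9, n_experts=16, top_k=2) / 1e9)  # ≈ 6.64
```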

What carries the argument

The phi-3-mini model and its training dataset of heavily filtered web data plus synthetic data, which together enable high benchmark scores at a small parameter count.
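
The report does not publish the filter itself, so as orientation only, here is a minimal sketch of the generic approach: a lightweight classifier trained to score web documents against a curated high-quality seed set, then used to prune a crawl. The features, model choice, and threshold are assumptions, not the Phi-3 recipe.

```python
# Sketch of quality-classifier web filtering; NOT the actual Phi-3
# pipeline, which is unpublished. Assumes scikit-learn and two small
# seed corpora: curated "good" text and random web text.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_filter(seed_docs, random_web_docs):
    """Train a scorer separating curated text from generic web text."""
    vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
    X = vec.transform(seed_docs + random_web_docs)
    y = [1] * len(seed_docs) + [0] * len(random_web_docs)
    return vec, LogisticRegression(max_iter=1000).fit(X, y)

def keep(doc, vec, clf, threshold=0.5):
    # The threshold trades corpus size against average quality.
    return clf.predict_proba(vec.transform([doc]))[0, 1] >= threshold
```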

If this is right

  • The 7B and 14B phi-3 models deliver higher scores than phi-3-mini on the same benchmarks, reaching 75 percent and 78 percent on MMLU.
  • Phi-3.5-MoE with 6.6 billion active parameters outperforms other open models of similar scale on reasoning, math, and code.
  • Phi-3.5-Vision handles both single-image and multi-image prompts in reasoning tasks at competitive levels.
  • All models support local deployment, removing the need for constant cloud connectivity for inference; the memory arithmetic behind this is sketched below.
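
The deployment bullet reduces to weight-memory arithmetic. The 4-bit line below is consistent with the report's description of running a quantized phi-3-mini on a phone; activation and KV-cache overheads are ignored, so treat these as lower bounds.

```python
# Weight-memory footprint of a 3.8B-parameter model at several
# quantization widths; overheads (activations, KV cache) excluded.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"3.8B weights @ {bits}-bit: {weight_memory_gb(3.8e9, bits):.2f} GB")
# 16-bit: 7.60 GB  (beyond a typical phone's memory budget)
#  8-bit: 3.80 GB
#  4-bit: 1.90 GB  (small enough to hold in memory on a modern phone)
```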

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Emphasis on data quality over raw scale may shift research priorities toward curation techniques that work across different model sizes.
  • On-device models of this class could support privacy-sensitive applications where user data never leaves the phone.
  • The same data-filtering recipe might be tested on other base architectures to check whether the performance-to-size gains transfer.

Load-bearing premise

The filtered web and synthetic data produce real capability gains instead of benchmark-specific optimization or undetected contamination.

What would settle it

Performance of phi-3-mini falling well below the reported levels on a new set of benchmarks created after the training data cutoff and free of any possible overlap with the training corpus.

read the original abstract

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with a 7B, 14B models trained for 4.8T tokens, called phi-3-small, phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Phi-3 family of language models. Phi-3-mini (3.8B parameters, trained on 3.3T tokens of filtered web + synthetic data) is reported to reach 69% MMLU and 8.38 MT-bench, rivaling Mixtral 8x7B and GPT-3.5 while fitting on a phone. Larger Phi-3-small/medium variants (7B/14B) and the Phi-3.5 series (mini, 16x3.8B MoE with 6.6B active params, and Vision) are also presented, with claims of superior or competitive performance on reasoning, math, code, multilingual, and multimodal tasks relative to open models of similar scale.

Significance. If the headline performance numbers reflect genuine generalization rather than contamination or benchmark-specific optimization, the work provides concrete evidence that carefully filtered and synthetic data can close much of the capability gap between small and large models, with direct implications for on-device deployment. The inclusion of parameter-scaling curves (mini to medium) and the MoE/vision extensions supplies useful empirical data points on efficiency trade-offs.

major comments (2)
  1. [Abstract / §2] Abstract and §2 (Training Data): the central claim that Phi-3-mini rivals much larger models rests on the training corpus being verifiably disjoint from MMLU, MT-bench, and related test sets. The text states only that the data consists of “heavily filtered publicly available web data and synthetic data” with no reported n-gram overlap statistics, embedding-based decontamination procedure, or per-benchmark contamination rates. Without these, the generalization interpretation of the 69% MMLU / 8.38 MT-bench numbers cannot be assessed.
  2. [§4 / Table 1] §4 (Evaluation) and Table 1: benchmark scores are given as point estimates (e.g., 69% MMLU, 8.38 MT-bench) with no error bars, standard deviations across runs, or explicit statement of the evaluation protocol and data-exclusion criteria. This makes it impossible to judge whether the reported rivalry with Mixtral 8x7B and GPT-3.5 is statistically robust.
minor comments (2)
  1. [Figure 1] Figure 1 and scaling plots: axis labels and legend entries for the 3.8B/7B/14B curves could be made more explicit to avoid ambiguity when comparing token counts and model sizes.
  2. [§3.2] §3.2 (Phi-3.5-MoE): the active-parameter count (6.6B) is stated but the routing details and expert utilization statistics during inference are not provided, which would help readers interpret the efficiency claims.
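
To make major comment 1 concrete, the sketch below shows a generic long-n-gram decontamination check of the kind the referee requests. It reflects common practice, not the report's (unreported) procedure; the 13-token window is a conventional choice.

```python
# Generic n-gram overlap decontamination check; a sketch of common
# practice, not the Phi-3 report's procedure (which is not reported).

def ngrams(text: str, n: int = 13) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs, test_examples, n: int = 13) -> float:
    """Fraction of test examples sharing any n-gram with training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / max(len(test_examples), 1)
```

At 3.3T training tokens an exact in-memory set like this is infeasible, which is presumably the cost the rebuttal cites; production pipelines approximate it with Bloom filters or MinHash signatures.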

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on data transparency and evaluation robustness. We address each major comment below and will revise the manuscript accordingly to improve clarity without altering the core technical claims.

read point-by-point responses
  1. Referee: [Abstract / §2] Abstract and §2 (Training Data): the central claim that Phi-3-mini rivals much larger models rests on the training corpus being verifiably disjoint from MMLU, MT-bench, and related test sets. The text states only that the data consists of “heavily filtered publicly available web data and synthetic data” with no reported n-gram overlap statistics, embedding-based decontamination procedure, or per-benchmark contamination rates. Without these, the generalization interpretation of the 69% MMLU / 8.38 MT-bench numbers cannot be assessed.

    Authors: We agree that greater transparency on potential contamination would strengthen the presentation. In the revised version we will expand §2 with additional description of the filtering pipeline, including the specific heuristics and quality classifiers applied to the web data and the generation process for the synthetic portion. However, exhaustive n-gram overlap statistics or per-benchmark contamination rates were not computed at the full 3.3T-token scale, as doing so is computationally prohibitive; we therefore cannot supply those exact figures. The synthetic data is produced by teacher models whose training cutoffs predate the relevant benchmarks, which substantially reduces the risk of direct leakage. We view this as a partial but honest response to the concern. revision: partial

  2. Referee: [§4 / Table 1] §4 (Evaluation) and Table 1: benchmark scores are given as point estimates (e.g., 69% MMLU, 8.38 MT-bench) with no error bars, standard deviations across runs, or explicit statement of the evaluation protocol and data-exclusion criteria. This makes it impossible to judge whether the reported rivalry with Mixtral 8x7B and GPT-3.5 is statistically robust.

    Authors: We accept that the current presentation lacks sufficient detail on evaluation methodology. In the revision we will add an explicit subsection in §4 describing the evaluation protocol (standard benchmark implementations, prompt formats, and any data-exclusion rules applied). Because each model variant was trained only once, standard deviations across independent training runs are unavailable; this is standard practice for models of this scale given the prohibitive compute cost. Where internal multi-prompt or few-shot variance estimates exist, we will report them as approximate indicators of stability. These additions should allow readers to assess the robustness of the reported comparisons. revision: yes
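
On major comment 2, even a single training run supports per-benchmark uncertainty estimates: resampling test items gives a bootstrap confidence interval around a point estimate such as 69% MMLU. A sketch with illustrative counts follows.

```python
# Percentile-bootstrap confidence interval over benchmark items; an
# illustrative sketch of the uncertainty estimate the referee asks
# for, not the authors' evaluation code.
import random

def bootstrap_ci(per_item_correct, n_boot=2_000, alpha=0.05, seed=0):
    """95% CI (by default) for mean accuracy over test items."""
    rng = random.Random(seed)
    n = len(per_item_correct)
    means = sorted(
        sum(rng.choices(per_item_correct, k=n)) / n for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical per-item results at 69% accuracy on an MMLU-sized set:
items = [1] * 9689 + [0] * 4353
print(bootstrap_ci(items))  # roughly (0.68, 0.70)
```

An interval this tight suggests sampling error on a single benchmark is small relative to prompt-format and protocol variance, which is why the requested protocol details matter as much as error bars.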

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivations or self-referential fits

full rationale

The paper is a technical report on training and evaluating Phi-3 language models. It reports empirical performance numbers (e.g., 69% MMLU, 8.38 MT-bench for phi-3-mini) obtained from external benchmarks after training on filtered web + synthetic data. No mathematical derivations, equations, predictions from fitted parameters, or first-principles results are present. There are no self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via citation. All central claims rest on independent external evaluations rather than quantities defined internally by the paper itself. The data composition is described at a high level without any internal fitting loops that could create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical technical report on model training and evaluation. No mathematical derivations, fitted constants in equations, or postulated entities appear.

pith-pipeline@v0.9.0 · 6234 in / 1168 out tokens · 55992 ms · 2026-05-10T20:15:08.916274+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data."

  • Foundation.PhiForcing phi_equation · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench, rivaling Mixtral 8x7B and GPT-3.5"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

    cs.CR 2026-05 unverdicted novelty 8.0

    Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...

  2. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  3. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  4. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  5. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  6. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  7. Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...

  8. DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-depend...

  9. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

  10. FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

    cs.AI 2026-05 conditional novelty 7.0

    FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.

  11. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  12. RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...

  13. MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

    cs.CR 2026-04 unverdicted novelty 7.0

    MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates su...

  14. Evaluating Temporal Consistency in Multi-Turn Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.

  15. Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

    eess.AS 2026-04 unverdicted novelty 7.0

    LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.

  16. Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

    cs.CL 2026-04 conditional novelty 7.0

    Clinical narrative format beats raw JSON for LLMs up to 8B parameters on medication reconciliation but raw JSON wins at 70B scale, with omissions as the main error type.

  17. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  18. DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion

    cs.IR 2026-04 conditional novelty 7.0

    Adaptive trie-guided decoding with document context and tunable penalties improves in-document query auto-completion, outperforming baselines and larger models like LLaMA-3 on seen queries.

  19. Towards Unconstrained Human-Object Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

  20. Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

    cs.CV 2026-04 conditional novelty 7.0

    Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

  21. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  22. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    cs.CL 2024-10 unverdicted novelty 7.0

    LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

  23. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    cs.LG 2024-10 accept novelty 7.0

    LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.

  24. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  25. Language-Conditioned Visual Grounding with CLIP Multilingual

    cs.CL 2026-05 unverdicted novelty 6.0

    Fixing the visual encoder in multilingual CLIP isolates text-branch deficits as the cause of lower visual grounding performance for low-resource languages, with model scaling widening some gaps but not others.

  26. MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

    cs.AR 2026-05 unverdicted novelty 6.0

    MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.

  27. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  28. Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

  29. Edge-Efficient Image Restoration: Transformer Distillation into State-Space Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Hybrid transformer-SSM networks found by multi-objective search run 1.17x to 3.4x faster on edge CPUs for image restoration tasks with competitive quality.

  30. DocSync: Agentic Documentation Maintenance via Critic-Guided Reflexion

    cs.SE 2026-05 unverdicted novelty 6.0

    DocSync fuses AST-aware retrieval with an iterative critic loop to update documentation, outperforming CodeT5-base on semantic alignment and automated judge scores in a proxy code-to-text task.

  31. Test-Time Safety Alignment

    cs.CL 2026-04 unverdicted novelty 6.0

    Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.

  32. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 6.0

    Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.

  33. Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

    cs.CL 2026-04 unverdicted novelty 6.0

    SHADE adaptively combines coverage and spectral signals to estimate semantic alphabet size from few LLM samples, yielding better performance than baselines in low-sample regimes for alphabet estimation and QA error detection.

  34. Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

  35. Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving

    cs.DC 2026-04 unverdicted novelty 6.0

    In long-context LLM serving, accuracy becomes speed via retry dynamics, and accuracy-aware routing reduces time-to-correct-answer.

  36. SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding

    cs.CV 2026-04 unverdicted novelty 6.0

    SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.

  37. Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

    cs.LG 2026-04 unverdicted novelty 6.0

    DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.

  38. From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    MAGE builds a memory graph from a user anchor to generate its own supervision signals for corpus-free unlearning, matching the effectiveness of methods that use external reference data on TOFU and RWKU benchmarks.

  39. MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

    cs.CL 2026-04 unverdicted novelty 6.0

    MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.

  40. Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.

  41. MLLM-as-a-Judge Exhibits Model Preference Bias

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.

  42. EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

    cs.AR 2026-04 unverdicted novelty 6.0

    A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...

  43. Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visua...

  44. Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    A feedforward graph of heterogeneous frozen LLMs linked by linear projections in a shared latent space outperforms single models on ARC-Challenge, OpenBookQA, and MMLU using just 17.6M trainable parameters.

  45. ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

    cs.IR 2026-04 unverdicted novelty 6.0

    ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.

  46. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  47. Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

    cs.CL 2026-03 unverdicted novelty 6.0

    Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.

  48. SAM 3D: 3Dfy Anything in Images

    cs.CV 2025-11 unverdicted novelty 6.0

    SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

  49. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  50. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  51. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    cs.CL 2024-06 unverdicted novelty 6.0

    FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

  52. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  53. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  54. A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

    cs.CV 2026-05 conditional novelty 5.0

    Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...

  55. Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

    cs.SE 2026-05 unverdicted novelty 5.0

    Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.

  56. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

    cs.CV 2026-05 unverdicted novelty 5.0

    LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

  57. A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction

    cs.CL 2026-05 unverdicted novelty 5.0

    Domain-trained small language model Olava Extract outperforms frontier LLMs on structured contract extraction with macro F1 0.812, micro F1 0.842, highest precision, and 78-97% lower inference cost.

  58. Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection

    cs.AI 2026-05 unverdicted novelty 5.0

    Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and...

  59. RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

    cs.CL 2026-05 unverdicted novelty 5.0

    LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.

  60. Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

    cs.LG 2026-04 unverdicted novelty 5.0

    Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 73 Pith papers · 15 internal anchors

  1. [1]

    Program Synthesis with Large Language Models

    [AON+21] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 ,

  2. [2]

    Piqa: Reasoning about physical commonsense in natural language

    [BZGC19] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641 ,

  3. [3]

    Training Verifiers to Solve Math Word Problems

    [CKB+21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  4. [4]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    [CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...

  5. [5]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    [CWT+24] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821 ,

  6. [6]

    InternLM- XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    [DZZ+24b] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512,

  7. [7]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    [FDL+24] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075,

  8. [8]

    Blink: Multimodal large language models can see but not perceive

    [FHL+24] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390 ,

  9. [9]

    Textbooks Are All You Need

    [GZA+23] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Gustavo de Rosa, Piero Kauffmann, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. arXiv pre...

  10. [10]

    Training Compute-Optimal Large Language Models

    [HBM+22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Eliza Rutherford, Trevor Cai, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and ...

  11. [11]

    Scaling Laws for Neural Language Models

    [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 ,

  12. [12]

    Textbooks Are All You Need II: phi-1.5 technical report

    [LBE+23] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463,

  13. [13]

    Bridging discrete and backpropagation: Straight-through and beyond

    [LDL+23] Liyuan Liu, Chengyu Dong, Xiaodong Liu, Bin Yu, and Jianfeng Gao. Bridging discrete and backpropagation: Straight-through and beyond. arXiv:2304.08612,

  14. [14]

    Sparse backpropagation for moe training

    [LGC23] Liyuan Liu, Jianfeng Gao, and Weizhu Chen. Sparse backpropagation for moe training. arXiv:2310.00811,

  15. [15]

    Improved Baselines with Visual Instruction Tuning

    [LLLL23] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 ,

  16. [16]

    Red teaming visual language models

    [LLY+24] Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, and Qi Liu. Red teaming visual language models. arXiv preprint arXiv:2401.12915 ,

  17. [17]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    [LZZ+24] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895,

  18. [18]

    Mm1: Methods, analysis & insights from multimodal llm pre-training

    [MGF+24] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark...

  19. [19]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    [MLT+22] Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May

  20. [20]

    Scaling Data-Constrained Language Models

    [MRB+23] Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264,

  21. [21]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    [SLBBC19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641 ,

  22. [22]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    [SRR+22] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,

  23. [23]

    Gemini: A Family of Highly Capable Multimodal Models

    [TAB+23] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 ,

  24. [24]

    LLaMA: Open and Efficient Foundation Language Models

    [TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  25. [25]

    Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

    [ZBY+24] Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy Hospedales. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207,

  26. [26]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    [ZCS+23] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,