pith. machine review for the scientific record

arxiv: 2404.14219 · v4 · submitted 2024-04-22 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matthew Dixon, Ronen Eldan, Victor Fragoso, Jianfeng Gao, Mei Gao, Min Gao, Amit Garg, Allie Del Giorno, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Wenxiang Hu, Jamie Huynh, Dan Iter, Sam Ade Jacobs, Mojan Javaheripi, Xin Jin, Nikos Karampatziakis, Piero Kauffmann, Mahoud Khademi, Dongwoo Kim, Young Jin Kim, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Xihui Lin, Zeqi Lin, Ce Liu, Liyuan Liu, Mengchen Liu, Weishung Liu, XiaoDong Liu, Chong Luo, Piyush Madan, Ali Mahmoudzadeh, David Majercak, Matt Mazzola, Caio César Teodoro Mendes, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Liliang Ren, Gustavo de Rosa, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Yelong Shen, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Praneetha Vaddamanu, Chunyu Wang, Guanhua Wang, Lijuan Wang, Shuohang Wang, Xin Wang, Yu Wang, Rachel Ward, Wen Wen, Philipp Witte, Haiping Wu, Xiaoxia Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Jilong Xue, Sonali Yadav, Fan Yang, Jianwei Yang, Yifan Yang, Ziyi Yang, Donghan Yu, Lu Yuan, Chenruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:15 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords small language models · on-device inference · filtered web data · synthetic data · mixture of experts · multimodal reasoning · MMLU benchmark · parameter scaling

The pith

A 3.8 billion parameter model matches the performance of much larger models like Mixtral 8x7B while running on a phone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents phi-3-mini, a 3.8B parameter language model trained on 3.3 trillion tokens of filtered web and synthetic data that reaches 69 percent on MMLU and 8.38 on MT-bench. These scores place it on par with Mixtral 8x7B and GPT-3.5 even though the model is compact enough for direct phone deployment. Scaling to 7B and 14B versions yields further gains, while the phi-3.5 series adds a mixture-of-experts model and a vision variant that handle reasoning, math, code, and image prompts at competitive levels. The work shows that careful data selection can produce strong capability in models small enough to run without cloud support.

Core claim

Phi-3-mini is a 3.8 billion parameter language model trained on 3.3 trillion tokens whose performance rivals Mixtral 8x7B and GPT-3.5, achieving 69 percent on MMLU and 8.38 on MT-bench while fitting on a phone. Its training set is a scaled-up version of the phi-2 dataset built from heavily filtered public web data and synthetic data, followed by alignment for safety and chat use. The 7B and 14B phi-3 variants reach 75 percent and 78 percent on MMLU respectively, and the phi-3.5-MoE model with 6.6 billion active parameters exceeds similar-scale open models on language, math, and code tasks.
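
The 6.6 billion active-parameter figure is ordinary top-k routing arithmetic: each token touches the shared weights plus only the experts its router selects. A minimal sketch follows; the 16-expert shape and the roughly 6.6B active figure come from the paper, while top-2 routing and the shared/expert split are illustrative assumptions.

```python
# Back-of-envelope active-parameter count for a top-k MoE.
# Only the 16-expert shape and the ~6.6B active-parameter figure
# come from the report; top-2 routing and the shared/expert split
# below are hypothetical.

def active_params(expert_params: float, shared_params: float,
                  n_experts: int, top_k: int) -> float:
    """Weights touched per token: shared layers plus top-k experts."""
    return shared_params + expert_params * top_k / n_experts

# e.g. ~41.9B of expert weights and ~1.4B shared, routed top-2 of 16:
print(active_params(41.9e9, 1.4e9, n_experts=16, top_k=2) / 1e9)  # ≈ 6.64
```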

What carries the argument

The phi-3-mini model and its training dataset of heavily filtered web data plus synthetic data, which together enable high benchmark scores at a small parameter count.
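
The report does not publish the filter itself, so as orientation only, here is a minimal sketch of the generic approach: a lightweight classifier trained to score web documents against a curated high-quality seed set, then used to prune a crawl. The features, model choice, and threshold are assumptions, not the Phi-3 recipe.

```python
# Sketch of quality-classifier web filtering; NOT the actual Phi-3
# pipeline, which is unpublished. Assumes scikit-learn and two small
# seed corpora: curated "good" text and random web text.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_filter(seed_docs, random_web_docs):
    """Train a scorer separating curated text from generic web text."""
    vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
    X = vec.transform(seed_docs + random_web_docs)
    y = [1] * len(seed_docs) + [0] * len(random_web_docs)
    return vec, LogisticRegression(max_iter=1000).fit(X, y)

def keep(doc, vec, clf, threshold=0.5):
    # The threshold trades corpus size against average quality.
    return clf.predict_proba(vec.transform([doc]))[0, 1] >= threshold
```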

If this is right

  • The 7B and 14B phi-3 models deliver higher scores than phi-3-mini on the same benchmarks, reaching 75 percent and 78 percent on MMLU.
  • Phi-3.5-MoE with 6.6 billion active parameters outperforms other open models of similar scale on reasoning, math, and code.
  • Phi-3.5-Vision handles both single-image and multi-image prompts in reasoning tasks at competitive levels.
  • All models support local deployment, removing the need for constant cloud connectivity for inference; the memory arithmetic behind this is sketched below.
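
The deployment bullet reduces to weight-memory arithmetic. The 4-bit line below is consistent with the report's description of running a quantized phi-3-mini on a phone; activation and KV-cache overheads are ignored, so treat these as lower bounds.

```python
# Weight-memory footprint of a 3.8B-parameter model at several
# quantization widths; overheads (activations, KV cache) excluded.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"3.8B weights @ {bits}-bit: {weight_memory_gb(3.8e9, bits):.2f} GB")
# 16-bit: 7.60 GB  (beyond a typical phone's memory budget)
#  8-bit: 3.80 GB
#  4-bit: 1.90 GB  (small enough to hold in memory on a modern phone)
```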

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Emphasis on data quality over raw scale may shift research priorities toward curation techniques that work across different model sizes.
  • On-device models of this class could support privacy-sensitive applications where user data never leaves the phone.
  • The same data-filtering recipe might be tested on other base architectures to check whether the performance-to-size gains transfer.

Load-bearing premise

The filtered web and synthetic data produce real capability gains instead of benchmark-specific optimization or undetected contamination.

What would settle it

Performance of phi-3-mini falling well below the reported levels on a new set of benchmarks created after the training data cutoff and free of any possible overlap with the training corpus.

read the original abstract

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with a 7B, 14B models trained for 4.8T tokens, called phi-3-small, phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Phi-3 family of language models. Phi-3-mini (3.8B parameters, trained on 3.3T tokens of filtered web + synthetic data) is reported to reach 69% MMLU and 8.38 MT-bench, rivaling Mixtral 8x7B and GPT-3.5 while fitting on a phone. Larger Phi-3-small/medium variants (7B/14B) and the Phi-3.5 series (mini, 16x3.8B MoE with 6.6B active params, and Vision) are also presented, with claims of superior or competitive performance on reasoning, math, code, multilingual, and multimodal tasks relative to open models of similar scale.

Significance. If the headline performance numbers reflect genuine generalization rather than contamination or benchmark-specific optimization, the work provides concrete evidence that carefully filtered and synthetic data can close much of the capability gap between small and large models, with direct implications for on-device deployment. The inclusion of parameter-scaling curves (mini to medium) and the MoE/vision extensions supplies useful empirical data points on efficiency trade-offs.

major comments (2)
  1. [Abstract / §2] Abstract and §2 (Training Data): the central claim that Phi-3-mini rivals much larger models rests on the training corpus being verifiably disjoint from MMLU, MT-bench, and related test sets. The text states only that the data consists of “heavily filtered publicly available web data and synthetic data” with no reported n-gram overlap statistics, embedding-based decontamination procedure, or per-benchmark contamination rates. Without these, the generalization interpretation of the 69% MMLU / 8.38 MT-bench numbers cannot be assessed.
  2. [§4 / Table 1] §4 (Evaluation) and Table 1: benchmark scores are given as point estimates (e.g., 69% MMLU, 8.38 MT-bench) with no error bars, standard deviations across runs, or explicit statement of the evaluation protocol and data-exclusion criteria. This makes it impossible to judge whether the reported rivalry with Mixtral 8x7B and GPT-3.5 is statistically robust.
minor comments (2)
  1. [Figure 1] Figure 1 and scaling plots: axis labels and legend entries for the 3.8B/7B/14B curves could be made more explicit to avoid ambiguity when comparing token counts and model sizes.
  2. [§3.2] §3.2 (Phi-3.5-MoE): the active-parameter count (6.6B) is stated but the routing details and expert utilization statistics during inference are not provided, which would help readers interpret the efficiency claims.
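
To make major comment 1 concrete, the sketch below shows a generic long-n-gram decontamination check of the kind the referee requests. It reflects common practice, not the report's (unreported) procedure; the 13-token window is a conventional choice.

```python
# Generic n-gram overlap decontamination check; a sketch of common
# practice, not the Phi-3 report's procedure (which is not reported).

def ngrams(text: str, n: int = 13) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs, test_examples, n: int = 13) -> float:
    """Fraction of test examples sharing any n-gram with training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / max(len(test_examples), 1)
```

At 3.3T training tokens an exact in-memory set like this is infeasible, which is presumably the cost the rebuttal cites; production pipelines approximate it with Bloom filters or MinHash signatures.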

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on data transparency and evaluation robustness. We address each major comment below and will revise the manuscript accordingly to improve clarity without altering the core technical claims.

read point-by-point responses
  1. Referee: [Abstract / §2] Abstract and §2 (Training Data): the central claim that Phi-3-mini rivals much larger models rests on the training corpus being verifiably disjoint from MMLU, MT-bench, and related test sets. The text states only that the data consists of “heavily filtered publicly available web data and synthetic data” with no reported n-gram overlap statistics, embedding-based decontamination procedure, or per-benchmark contamination rates. Without these, the generalization interpretation of the 69% MMLU / 8.38 MT-bench numbers cannot be assessed.

    Authors: We agree that greater transparency on potential contamination would strengthen the presentation. In the revised version we will expand §2 with additional description of the filtering pipeline, including the specific heuristics and quality classifiers applied to the web data and the generation process for the synthetic portion. However, exhaustive n-gram overlap statistics or per-benchmark contamination rates were not computed at the full 3.3T-token scale, as doing so is computationally prohibitive; we therefore cannot supply those exact figures. The synthetic data is produced by teacher models whose training cutoffs predate the relevant benchmarks, which substantially reduces the risk of direct leakage. We view this as a partial but honest response to the concern. revision: partial

  2. Referee: [§4 / Table 1] §4 (Evaluation) and Table 1: benchmark scores are given as point estimates (e.g., 69% MMLU, 8.38 MT-bench) with no error bars, standard deviations across runs, or explicit statement of the evaluation protocol and data-exclusion criteria. This makes it impossible to judge whether the reported rivalry with Mixtral 8x7B and GPT-3.5 is statistically robust.

    Authors: We accept that the current presentation lacks sufficient detail on evaluation methodology. In the revision we will add an explicit subsection in §4 describing the evaluation protocol (standard benchmark implementations, prompt formats, and any data-exclusion rules applied). Because each model variant was trained only once, standard deviations across independent training runs are unavailable; this is standard practice for models of this scale given the prohibitive compute cost. Where internal multi-prompt or few-shot variance estimates exist, we will report them as approximate indicators of stability. These additions should allow readers to assess the robustness of the reported comparisons. revision: yes
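
On major comment 2, even a single training run supports per-benchmark uncertainty estimates: resampling test items gives a bootstrap confidence interval around a point estimate such as 69% MMLU. A sketch with illustrative counts follows.

```python
# Percentile-bootstrap confidence interval over benchmark items; an
# illustrative sketch of the uncertainty estimate the referee asks
# for, not the authors' evaluation code.
import random

def bootstrap_ci(per_item_correct, n_boot=2_000, alpha=0.05, seed=0):
    """95% CI (by default) for mean accuracy over test items."""
    rng = random.Random(seed)
    n = len(per_item_correct)
    means = sorted(
        sum(rng.choices(per_item_correct, k=n)) / n for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical per-item results at 69% accuracy on an MMLU-sized set:
items = [1] * 9689 + [0] * 4353
print(bootstrap_ci(items))  # roughly (0.68, 0.70)
```

An interval this tight suggests sampling error on a single benchmark is small relative to prompt-format and protocol variance, which is why the requested protocol details matter as much as error bars.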

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivations or self-referential fits

full rationale

The paper is a technical report on training and evaluating Phi-3 language models. It reports empirical performance numbers (e.g., 69% MMLU, 8.38 MT-bench for phi-3-mini) obtained from external benchmarks after training on filtered web + synthetic data. No mathematical derivations, equations, predictions from fitted parameters, or first-principles results are present. There are no self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via citation. All central claims rest on independent external evaluations rather than quantities defined internally by the paper itself. The data composition is described at a high level without any internal fitting loops that could create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical technical report on model training and evaluation. No mathematical derivations, fitted constants in equations, or postulated entities appear.

pith-pipeline@v0.9.0 · 6234 in / 1168 out tokens · 55992 ms · 2026-05-10T20:15:08.916274+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data."

  • Foundation.PhiForcing phi_equation · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench, rivaling Mixtral 8x7B and GPT-3.5"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

    cs.CR 2026-05 unverdicted novelty 8.0

    Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...

  2. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  3. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  4. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  5. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  6. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  7. Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...

  8. DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-depend...

  9. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

  10. FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

    cs.AI 2026-05 conditional novelty 7.0

    FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.

  11. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  12. RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...

  13. MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

    cs.CR 2026-04 unverdicted novelty 7.0

    MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates su...

  14. Evaluating Temporal Consistency in Multi-Turn Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.

  15. Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

    eess.AS 2026-04 unverdicted novelty 7.0

    LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.

  16. Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

    cs.CL 2026-04 conditional novelty 7.0

    Clinical narrative format beats raw JSON for LLMs up to 8B parameters on medication reconciliation but raw JSON wins at 70B scale, with omissions as the main error type.

  17. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  18. DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion

    cs.IR 2026-04 conditional novelty 7.0

    Adaptive trie-guided decoding with document context and tunable penalties improves in-document query auto-completion, outperforming baselines and larger models like LLaMA-3 on seen queries.

  19. Towards Unconstrained Human-Object Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

  20. Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

    cs.CV 2026-04 conditional novelty 7.0

    Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

  21. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  22. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    cs.CL 2024-10 unverdicted novelty 7.0

    LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

  23. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    cs.LG 2024-10 accept novelty 7.0

    LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.

  24. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  25. Language-Conditioned Visual Grounding with CLIP Multilingual

    cs.CL 2026-05 unverdicted novelty 6.0

    Fixing the visual encoder in multilingual CLIP isolates text-branch deficits as the cause of lower visual grounding performance for low-resource languages, with model scaling widening some gaps but not others.

  26. MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

    cs.AR 2026-05 unverdicted novelty 6.0

    MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.

  27. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  28. Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

  29. Edge-Efficient Image Restoration: Transformer Distillation into State-Space Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Hybrid transformer-SSM networks found by multi-objective search run 1.17x to 3.4x faster on edge CPUs for image restoration tasks with competitive quality.

  30. DocSync: Agentic Documentation Maintenance via Critic-Guided Reflexion

    cs.SE 2026-05 unverdicted novelty 6.0

    DocSync fuses AST-aware retrieval with an iterative critic loop to update documentation, outperforming CodeT5-base on semantic alignment and automated judge scores in a proxy code-to-text task.

  31. Test-Time Safety Alignment

    cs.CL 2026-04 unverdicted novelty 6.0

    Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.

  32. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 6.0

    Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.

  33. Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

    cs.CL 2026-04 unverdicted novelty 6.0

    SHADE adaptively combines coverage and spectral signals to estimate semantic alphabet size from few LLM samples, yielding better performance than baselines in low-sample regimes for alphabet estimation and QA error detection.

  34. Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

  35. Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving

    cs.DC 2026-04 unverdicted novelty 6.0

    In long-context LLM serving, accuracy becomes speed via retry dynamics, and accuracy-aware routing reduces time-to-correct-answer.

  36. SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding

    cs.CV 2026-04 unverdicted novelty 6.0

    SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.

  37. Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

    cs.LG 2026-04 unverdicted novelty 6.0

    DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.

  38. From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    MAGE builds a memory graph from a user anchor to generate its own supervision signals for corpus-free unlearning, matching the effectiveness of methods that use external reference data on TOFU and RWKU benchmarks.

  39. MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

    cs.CL 2026-04 unverdicted novelty 6.0

    MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.

  40. Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.

  41. MLLM-as-a-Judge Exhibits Model Preference Bias

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.

  42. EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

    cs.AR 2026-04 unverdicted novelty 6.0

    A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...

  43. Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visua...

  44. Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    A feedforward graph of heterogeneous frozen LLMs linked by linear projections in a shared latent space outperforms single models on ARC-Challenge, OpenBookQA, and MMLU using just 17.6M trainable parameters.

  45. ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

    cs.IR 2026-04 unverdicted novelty 6.0

    ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.

  46. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  47. Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

    cs.CL 2026-03 unverdicted novelty 6.0

    Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.

  48. SAM 3D: 3Dfy Anything in Images

    cs.CV 2025-11 unverdicted novelty 6.0

    SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

  49. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  50. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  51. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    cs.CL 2024-06 unverdicted novelty 6.0

    FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

  52. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  53. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  54. A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

    cs.CV 2026-05 conditional novelty 5.0

    Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...

  55. Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

    cs.SE 2026-05 unverdicted novelty 5.0

    Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.

  56. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

    cs.CV 2026-05 unverdicted novelty 5.0

    LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

  57. A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction

    cs.CL 2026-05 unverdicted novelty 5.0

    Domain-trained small language model Olava Extract outperforms frontier LLMs on structured contract extraction with macro F1 0.812, micro F1 0.842, highest precision, and 78-97% lower inference cost.

  58. Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection

    cs.AI 2026-05 unverdicted novelty 5.0

    Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and...

  59. RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

    cs.CL 2026-05 unverdicted novelty 5.0

    LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.

  60. Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

    cs.LG 2026-04 unverdicted novelty 5.0

    Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 73 Pith papers · 15 internal anchors

  1. [1]

    Program Synthesis with Large Language Models

    [AON+21] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 ,

  2. [2]

    Piqa: Reasoning about physical commonsense in natural language

    [BZGC19] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641 ,

  3. [3]

    Training Verifiers to Solve Math Word Problems

    [CKB+21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  4. [4]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    [CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...

  5. [5]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    [CWT+24] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821 ,

  6. [6]

    InternLM- XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    [DZZ+24b] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512,

  7. [7]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    [FDL+24] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075,

  8. [8]

    Blink: Multimodal large language models can see but not perceive

    [FHL+24] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390 ,

  9. [9]

    Textbooks Are All You Need

    [GZA+23] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Gustavo de Rosa, Piero Kauffmann, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. arXiv pre...

  10. [10]

    Training Compute-Optimal Large Language Models

    [HBM+22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Eliza Rutherford, Trevor Cai, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and ...

  11. [11]

    Scaling Laws for Neural Language Models

    [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 ,

  12. [12]

    Textbooks Are All You Need II: phi-1.5 technical report

    [LBE+23] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463,

  13. [13]

    Bridging discrete and backpropagation: Straight-through and beyond

    [LDL+23] Liyuan Liu, Chengyu Dong, Xiaodong Liu, Bin Yu, and Jianfeng Gao. Bridging discrete and backpropagation: Straight-through and beyond. arXiv:2304.08612,

  14. [14]

    Sparse backpropagation for moe training

    [LGC23] Liyuan Liu, Jianfeng Gao, and Weizhu Chen. Sparse backpropagation for moe training. arXiv:2310.00811,

  15. [15]

    Improved Baselines with Visual Instruction Tuning

    [LLLL23] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 ,

  16. [16]

    Red teaming visual language models

    [LLY+24] Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, and Qi Liu. Red teaming visual language models. arXiv preprint arXiv:2401.12915 ,

  17. [17]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    [LZZ+24] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895,

  18. [18]

    Mm1: Methods, analysis & insights from multimodal llm pre-training

    [MGF+24] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark...

  19. [19]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    [MLT+22] Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May

  20. [20]

    Scaling Data-Constrained Language Models

    [MRB+23] Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264,

  21. [21]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    [SLBBC19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641 ,

  22. [22]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    [SRR+22] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,

  23. [23]

    Gemini: A Family of Highly Capable Multimodal Models

    [TAB+23] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 ,

  24. [24]

    LLaMA: Open and Efficient Foundation Language Models

    [TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  25. [25]

    Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

    [ZBY+24] Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy Hospedales. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207,

  26. [26]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    [ZCS+23] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,