arxiv: 2605.04084 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.AI· cs.AR

Recognition: unknown

FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

Ye Qiao , Yian Wang , Zhiheng Chen , Hyoukjun Kwon , Sitao Huang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 00:31 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.AR

keywords LLM compressionproduct quantizationcalibration-freeinference accelerationCUDA kernelssubspace quantizationmodel size reductiondecode throughput

0 comments

The pith

Product quantization on LLM weights with two tunable parameters delivers calibration-free compression that exceeds 4-bit GPTQ and AWQ accuracy at 37-42% model size while accelerating decode past FP16 speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FASQ as a framework that applies product quantization directly to the weight matrices of large language models. By adjusting only sub-vector size and codebook cardinality, it creates a continuous range of model sizes from 27 to 49 percent of the original FP16 footprint without requiring any calibration data. This approach claims to match or surpass the accuracy of established calibrated methods such as GPTQ and AWQ on models like Meta-Llama-3-8B, Qwen3-8B, and Qwen3.5-9B-Base. Custom CUDA kernels enable the practical speedups, including a direct-compute GEMV for decoding that pushes throughput beyond FP16 tensor-core performance on a single RTX 3090 GPU.

Core claim

FASQ demonstrates that product quantization applied to LLM weight matrices, controlled solely by sub-vector size and codebook cardinality, produces compressed models with 27-49% of original FP16 size that achieve higher average accuracy than 4-bit GPTQ and AWQ at 37-42% size on Meta-Llama-3-8B while delivering 45.2 tokens per second decode at effective 4-bit compression, surpassing the 43.9 tokens per second of FP16 tensor cores through LUT-free direct-compute GEMV and output-stationary double-buffered LUT GEMM kernels with split-K parallelism.

What carries the argument

Product quantization applied directly to LLM weight matrices, tuned by sub-vector size and codebook cardinality, executed via custom CUDA kernels consisting of a LUT-free direct-compute GEMV for decode and an output-stationary double-buffered LUT GEMM for prefill with split-K parallelism.

If this is right

Exposes continuous compression ratios between fixed bit-width points that scalar quantization cannot reach.
Achieves higher inference throughput than FP16, GPTQ, AWQ, and RTN on consumer GPUs for both prefill and decode phases.
Delivers consistent accuracy and speed results across multiple 8B-scale models without model-specific calibration.
Enables single-GPU real-time inference of compressed LLMs at effective 3-bit and 4-bit densities with 2.5-5x throughput gains over prior methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Calibration data may be unnecessary for high-accuracy quantization when subspace methods replace scalar ones.
The two-parameter control could allow automated selection of compression points based on target hardware memory limits.
Kernel-level optimizations shown here might transfer to other matrix operations in transformer inference pipelines.

Load-bearing premise

Product quantization applied directly to LLM weight matrices using only sub-vector size and codebook cardinality can maintain or exceed the accuracy of calibrated methods without any calibration data or post-training adjustments.

What would settle it

Measuring whether FASQ accuracy falls below 4-bit GPTQ or AWQ on Meta-Llama-3-8B at 37-42% size, or whether decode throughput drops below 43.9 tokens per second on the RTX 3090, when using the same two-parameter settings.

Figures

Figures reproduced from arXiv: 2605.04084 by Hyoukjun Kwon, Sitao Huang, Ye Qiao, Yian Wang, Zhiheng Chen.

**Figure 1.** Figure 1: Example FASQ quantization process with 𝑁𝑠𝑠 = 4. ①: The weight matrix is split into 𝑁𝑠𝑠 subspaces along the chosen dimension; ②: K-means clustering is applied independently within each subspace; ③: Centroids and indices are stored as the compressed representation. 𝑑𝑖𝑚𝑑𝑝 . Each subspace slice has shape: 𝑊𝑠𝑠 ∈    R 𝑑 𝑁𝑠𝑠 ×𝑑 ′ , if 𝑑𝑖𝑚 = 0 R 𝑑× 𝑑 ′ 𝑁𝑠𝑠 , if 𝑑𝑖𝑚 = 1 (2) with subspace size 𝑆𝑍𝑠𝑠 = 𝑑𝑖𝑚𝑠𝑠/𝑁… view at source ↗

**Figure 2.** Figure 2: FASQ GEMV kernel design for autoregressive decode (𝐵=1). Left: the 3D grid layout, where grid-𝑦 = ⌈𝐹out/128⌉ partitions outputs into 128-thread blocks and grid-𝑧 implements Split-K by dividing 𝑁𝑠𝑠 subspaces across 𝐾split blocks (auto-tuned for ∼8 blocks/SM). Right: per-thread computation, each thread reads a uint8 index from 𝑇index, performs a direct dot product with the corresponding centroid from 𝑇cluste… view at source ↗

**Figure 3.** Figure 3: Sensitivity analysis of FASQ kernels on a single 4096×4096 layer (RTX 3090). 5.1 Experimental Settings 5.1.1 Baselines. We compare against five post-training quantization methods: GPTQ [1] (Hessian-guided INT4/INT3), AWQ [2] (activation-aware weight quantization), SmoothQuant [3] (joint weight-activation quantization at W8A8, W6A6, and W4A4), QuIP [23] (incoherence processing before quantization), and RTN… view at source ↗

**Figure 4.** Figure 4: plots each configuration by model size (% of FP16) versus average zero-shot accuracy, with scalar quantization baselines overlaid for reference. Configurations with 𝑆𝑍𝑠𝑠≤2 form a smooth 1https://github.com/ist-daslab/gptq 2https://github.com/mit-han-lab/llm-awq 3https://github.com/mit-han-lab/smoothquant 26 30 34 38 42 46 50 54 58 Model Size (% of FP16) 55.0 57.5 60.0 62.5 65.0 67.5 70.0 Avg. Zero-shot Ac… view at source ↗

read the original abstract

Compressing large language models (LLMs) for deployment on commodity GPUs remains challenging: conventional scalar quantization is limited to fixed bit-widths (e.g., 8/4/3-bit), offers only a few discrete compression points, and typically requires calibration data. We present FASQ (Flexible Accelerated Subspace Quantization), a calibration-free framework that applies product quantization to LLM weight matrices. By tuning two parameters, sub-vector size and codebook cardinality, FASQ exposes a continuous design space spanning 27-49% of the original FP16 model size, filling compression gaps that fixed-bit schemes cannot reach. On Meta-Llama-3-8B, FASQ surpasses 4-bit GPTQ and AWQ in accuracy (67.1-67.7 avg.) at 37-42% model size, with consistent results on Qwen3-8B and Qwen3.5-9B-Base. To make product quantization practical at inference time, we design custom CUDA kernels: a LUT-free direct-compute GEMV for decode and an output-stationary double-buffered LUT GEMM for prefill, both with split-K parallelism. On an RTX~3090, FASQ achieves 45.2 tok/s decode at effective 4-bit (2.56x memory reduction) and 51.8 tok/s at effective 3-bit (2.80x), both surpassing FP16 tensor-core performance (43.9 tok/s) and delivering 1.6 to 1.8x the throughput of AWQ, 2.5 to 2.5x of GPTQ, and 4.3 to 5x of RTN. FASQ is the only compressed method that accelerates decode beyond FP16, offering calibration-free compression, continuous size-quality trade-offs, and real-time inference on a single consumer GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FASQ shows product quantization on weights alone with two parameters can beat calibrated 4-bit methods in accuracy while its kernels push decode past FP16 on a 3090, but the no-calibration claim needs direct checks on tuning and evaluation details.

read the letter

The main takeaway is that FASQ applies product quantization directly to LLM weights using only two tunable parameters for a continuous compression range, skips calibration, and still reports higher accuracy than 4-bit GPTQ and AWQ on Llama-3-8B at 37-42% size, while its custom CUDA kernels deliver decode speeds above FP16 on consumer GPUs. The paper does a good job on the practical side. The LUT-free GEMV kernel for decode and the output-stationary double-buffered GEMM for prefill with split-K parallelism are designed to minimize overheads that usually make product quantization slow at inference. The throughput numbers, like 45.2 tokens per second at effective 4-bit on an RTX 3090, exceed the FP16 tensor core baseline and the other quantized baselines. Showing consistent behavior on Qwen models adds some breadth. The accuracy superiority without calibration is the part that stands out as needing more evidence. Product quantization with k-means on sub-vectors has been around, but outperforming methods that use activation statistics or Hessians is not the usual outcome. The two parameters allow flexibility, but the paper must show that selecting them did not involve any hidden use of data that functions as calibration. The average scores of 67.1-67.7 should be compared task-by-task with identical settings to the baselines to rule out differences in evaluation. Overall the work is solid on the kernel implementation and the idea of continuous trade-offs, but the central no-calibration claim is the one that could shift if the details do not check out. This is for practitioners deploying LLMs on single GPUs who want to avoid calibration datasets and get both size reduction and speed. Readers working on quantization or inference optimization will find the kernel descriptions and results useful. It deserves peer review because the hardware results are specific and the flexible approach addresses a real deployment need. I would send it to referees.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FASQ, a calibration-free product quantization framework for LLM weight matrices that exposes a continuous compression space (27-49% of FP16 size) by tuning only sub-vector size and codebook cardinality. On Meta-Llama-3-8B it reports higher average accuracy (67.1-67.7) than 4-bit GPTQ/AWQ at 37-42% model size, with analogous results on Qwen models; custom CUDA kernels (LUT-free direct-compute GEMV for decode, output-stationary double-buffered LUT GEMM for prefill) are claimed to deliver 45.2 tok/s decode at effective 4-bit and 51.8 tok/s at effective 3-bit on an RTX 3090, exceeding FP16 tensor-core throughput.

Significance. If the central accuracy and speedup claims are reproducible, the work would be significant for demonstrating that a simple two-parameter subspace quantization approach can match or exceed calibrated methods while providing flexible ratios and inference acceleration beyond FP16 on consumer GPUs, reducing dependence on calibration data and fixed-bit schemes.

major comments (2)

[Abstract / Experimental section] Abstract: the central claim that product quantization with only two global parameters (sub-vector size, codebook cardinality) yields higher task accuracy than activation-aware 4-bit GPTQ and AWQ at comparable effective bit-width is load-bearing for the calibration-free advantage. The manuscript must explicitly document in the experimental section (including any hyper-parameter search procedure) that parameter selection used no validation or test data that could function as implicit calibration, and must report error bars or multiple runs on the identical zero-shot/few-shot suites used for the baselines.
[Abstract / Kernel implementation section] Abstract: the reported decode throughput of 45.2 tok/s (effective 4-bit) and 51.8 tok/s (effective 3-bit) surpassing FP16 tensor-core performance (43.9 tok/s) rests on the custom kernels; the paper should provide pseudocode or kernel launch parameters and confirm that the comparison uses identical batch size, sequence length, and hardware configuration for all methods.

minor comments (2)

[Abstract] Clarify the precise definition of 'effective 4-bit' and 'effective 3-bit' (including any overhead from codebooks or indices) when stating model-size percentages and throughput numbers.
[Abstract] The abstract states 'consistent results on Qwen3-8B and Qwen3.5-9B-Base' but does not list the numerical averages; adding a compact table or explicit numbers would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our calibration-free claims and kernel details.

read point-by-point responses

Referee: [Abstract / Experimental section] Abstract: the central claim that product quantization with only two global parameters (sub-vector size, codebook cardinality) yields higher task accuracy than activation-aware 4-bit GPTQ and AWQ at comparable effective bit-width is load-bearing for the calibration-free advantage. The manuscript must explicitly document in the experimental section (including any hyper-parameter search procedure) that parameter selection used no validation or test data that could function as implicit calibration, and must report error bars or multiple runs on the identical zero-shot/few-shot suites used for the baselines.

Authors: We agree that explicit documentation is essential to substantiate the calibration-free nature of FASQ. In the revised experimental section we will add a dedicated paragraph describing the hyper-parameter selection procedure: sub-vector size and codebook cardinality were selected solely from target compression ratios and matrix dimensions, with no access to any validation or test data at any stage. No implicit calibration was performed. We will also re-run the zero-shot and few-shot evaluations on the same benchmarks used for GPTQ and AWQ baselines across three independent random seeds for codebook initialization and report mean accuracy with standard deviation error bars. revision: yes
Referee: [Abstract / Kernel implementation section] Abstract: the reported decode throughput of 45.2 tok/s (effective 4-bit) and 51.8 tok/s (effective 3-bit) surpassing FP16 tensor-core performance (43.9 tok/s) rests on the custom kernels; the paper should provide pseudocode or kernel launch parameters and confirm that the comparison uses identical batch size, sequence length, and hardware configuration for all methods.

Authors: We will expand the kernel implementation section with pseudocode for both the LUT-free direct-compute GEMV decode kernel and the output-stationary double-buffered LUT GEMM prefill kernel, including the exact CUDA launch parameters (block sizes, grid dimensions, and split-K factors). We confirm and will explicitly state that all throughput numbers—including the FP16 tensor-core baseline—were measured under identical conditions: batch size 1, prefill sequence length 2048, generation of 128 tokens, on the same RTX 3090 with identical CUDA 12.1 and driver settings. These details will be added to both the abstract-adjacent experimental description and the kernel section. revision: yes

Circularity Check

0 steps flagged

No circularity: FASQ presents an empirical method with independent parameters and kernels.

full rationale

The paper introduces FASQ as a calibration-free product quantization approach on LLM weights, controlled by two explicit tunable parameters (sub-vector size and codebook cardinality) plus custom CUDA kernels for inference. No equations, derivations, or self-citations in the provided text reduce the accuracy or throughput claims to fitted inputs by construction, self-definitional loops, or load-bearing prior work by the same authors. The reported results on Llama-3-8B, Qwen models, and comparisons to GPTQ/AWQ are framed as direct empirical measurements from applying the method, not as predictions forced by the inputs themselves. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that product quantization preserves sufficient accuracy for LLMs when tuned only by sub-vector size and codebook cardinality, plus the engineering claim that the described CUDA kernels deliver the stated speedups; no new physical entities or first-principles derivations are introduced.

free parameters (2)

sub-vector size
Tunable integer parameter that sets the dimensionality of each quantized sub-vector and directly controls the compression ratio.
codebook cardinality
Tunable integer parameter that sets the number of codewords per codebook and directly controls the compression ratio.

axioms (1)

domain assumption Product quantization applied to LLM weight matrices can achieve competitive accuracy without calibration data when sub-vector size and codebook cardinality are chosen appropriately.
The framework treats this as given for the reported accuracy results on Llama-3 and Qwen models.

pith-pipeline@v0.9.0 · 5660 in / 1359 out tokens · 42930 ms · 2026-05-10T00:31:11.853269+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 10 canonical work pages · 6 internal anchors

[1]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review arXiv 2022
[2]

Awq: Activation- aware weight quantization for llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation- aware weight quantization for llm compression and acceleration. InMLSys, 2024

2024
[3]

Smoothquant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational Conference on Machine Learning, pages 38087– 38099. PMLR, 2023

2023
[4]

APEX-Q: Arbitrary- dimension product-EXtension quantization for accelerated LLM deployment

Yuxuan Wang, Ye Qiao, Sheldon Huang, and Hyoukjun Kwon. APEX-Q: Arbitrary- dimension product-EXtension quantization for accelerated LLM deployment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, page 41424, 2026

2026
[5]

Vector quantization.IEEE Assp Magazine, 1(2):4–29, 1984

Robert Gray. Vector quantization.IEEE Assp Magazine, 1(2):4–29, 1984

1984
[6]

Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010

Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010

2010
[7]

Multiplying matrices without multiplying

Davis Blalock and John Guttag. Multiplying matrices without multiplying. In International Conference on Machine Learning, pages 992–1004. PMLR, 2021

2021
[8]

Pecan: A product-quantized content addressable memory network

Jie Ran, Rui Lin, Jason Chun Lok Li, Jiajun Zhou, and Ngai Wong. Pecan: A product-quantized content addressable memory network. In2023 Design, Au- tomation & Test in Europe Conference & Exhibition (DATE), pages 1–6. IEEE, 2023

2023
[9]

Differentiable product quantization for end-to-end embedding compression

Ting Chen, Lala Li, and Yizhou Sun. Differentiable product quantization for end-to-end embedding compression. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 1617–1626. PMLR, 13–18 Jul 2020

2020
[10]

Lut-nn: Empower efficient neural network inference with centroid learning and table lookup

Xiaohu Tang, Yang Wang, Ting Cao, Li Lyna Zhang, Qi Chen, Deng Cai, Yunxin Liu, and Mao Yang. Lut-nn: Empower efficient neural network inference with centroid learning and table lookup. InProceedings of the 29th Annual International Conference on Mobile Computing and Networking, pages 1–15, 2023

2023
[11]

Pqa: Exploring the potential of product quantization in dnn hardware acceleration.ACM Transactions on Reconfigurable Technology and Systems, 18(1):1–29, 2024

Ahmed Abouelhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas Lane, and Mohamed Abdelfattah. Pqa: Exploring the potential of product quantization in dnn hardware acceleration.ACM Transactions on Reconfigurable Technology and Systems, 18(1):1–29, 2024

2024
[12]

Lut-dla: Lookup table as efficient extreme low-bit deep learning accelerator

Guoyu Li, Shengyu Ye, Chunyun Chen, Yang Wang, Fan Yang, Ting Cao, Cheng Liu, Mohamed M Sabry Aly, and Mao Yang. Lut-dla: Lookup table as efficient extreme low-bit deep learning accelerator. In2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 671–684. IEEE, 2025

2025
[13]

Tellme: An efficient end-to-end ternary llm prefill and decode accelerator with table-lookup matmul on edge fpgas

Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, and Sitao Huang. Tellme: An efficient end-to-end ternary llm prefill and decode accelerator with table-lookup matmul on edge fpgas. InProceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 247–257, 2026

2026
[14]

PD-Swap: Prefill-decode logic swapping for end-to-end LLM inference on edge FPGAs via dynamic partial reconfiguration.arXiv preprint arXiv:2512.11550, 2025

Yufei Zhang, Zheyu Chen, Ye Qiao, and Sheldon Huang. PD-Swap: Prefill-decode logic swapping for end-to-end LLM inference on edge FPGAs via dynamic partial reconfiguration.arXiv preprint arXiv:2512.11550, 2025

work page arXiv 2025
[15]

Cobra: Algorithm-architecture co-optimized binary transformer accelerator for edge inference

Ye Qiao, Zhiheng Chen, Yian Wang, Yifan Zhang, Yunzhe Deng, and Sitao Huang. Cobra: Algorithm-architecture co-optimized binary transformer accelerator for edge inference. In2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pages 1–8. IEEE, 2025

2025
[16]

Micronas: Zero-shot neural architecture search for mcus

Ye Qiao, Haocheng Xu, Yifan Zhang, and Sitao Huang. Micronas: Zero-shot neural architecture search for mcus. In2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1–2. IEEE, 2024

2024
[17]

MONAS: Efficient zero-shot neural architecture search for MCUs

Ye Qiao, Hongwei Xu, Yufei Zhang, and Sheldon Huang. MONAS: Efficient zero-shot neural architecture search for MCUs. InInternational Joint Conference on Neural Networks (IJCNN), pages 1–8, 2025

2025
[18]

TG-NAS: Generalizable zero-cost proxies with operator description embedding and graph learning for efficient neural architecture search.arXiv preprint arXiv:2404.00271, 2024

Ye Qiao, Jingyi Li, Hongwei Xu, and Sheldon Huang. TG-NAS: Generalizable zero-cost proxies with operator description embedding and graph learning for efficient neural architecture search.arXiv preprint arXiv:2404.00271, 2024

work page arXiv 2024
[19]

BNN an ideal architecture for accel- eration with resistive in memory computation.IEEE Transactions on Emerging Topics in Computing, 11(2):281–291, 2023

Ye Qiao, Ao Ding, and Nader Bagherzadeh. BNN an ideal architecture for accel- eration with resistive in memory computation.IEEE Transactions on Emerging Topics in Computing, 11(2):281–291, 2023

2023
[20]

Optimized spatial architecture mapping flow for transformer accelerators.arXiv preprint arXiv:2410.07407, 2024

Hongwei Xu, Fatemeh Tahmasebi, Ye Qiao, Haoyu Tian, Hyoukjun Kwon, and Sheldon Huang. Optimized spatial architecture mapping flow for transformer accelerators.arXiv preprint arXiv:2410.07407, 2024

work page arXiv 2024
[21]

Q-ROAR: Outlier-aware rescaling for RoPE posi- tion interpolation in quantized long-context LLMs

Ye Qiao and Sheldon Huang. Q-ROAR: Outlier-aware rescaling for RoPE posi- tion interpolation in quantized long-context LLMs. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, page 41359, 2026

2026
[22]

Rethinking RoPE scaling in quantized LLM: Theory, outlier, and channel-band analysis with weight rescaling.arXiv preprint arXiv:2510.00028, 2025

Ye Qiao, Hongwei Xu, Xing Zhang, and Sheldon Huang. Rethinking RoPE scaling in quantized LLM: Theory, outlier, and channel-band analysis with weight rescaling.arXiv preprint arXiv:2510.00028, 2025

work page arXiv 2025
[23]

Quip: 2-bit quantization of large language models with guarantees

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantization of large language models with guarantees. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023
[24]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aieleen Letman, Akhil Mathur, Alan Schelten, Amy Yang, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Qwen3, April 2025

Qwen Team. Qwen3, April 2025

2025
[26]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas- mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos- ale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016

work page internal anchor Pith review arXiv 2016
[28]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?arXiv preprint arXiv:1905.07830, 2019

work page internal anchor Pith review arXiv 1905
[30]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

2020
[31]

Wino- grande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Wino- grande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

2021
[32]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

2024