super hub Mixed citations

Title resolution pending

Mistral 7B · 2023 · cs.CL · arXiv 2310.06825

Mixed citation behavior. Most common role is background (62%).

668 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 668 citing papers more from Mistral 7B arXiv PDF

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

abstract

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 58 method 15 baseline 10 other 6 dataset 2

citation-polarity summary

background 56 use method 15 baseline 10 unclear 8 use dataset 2

claims ledger

abstract We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and auto

authors

author = Mistral 7B

co-cited works

representative citing papers

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

cs.CL · 2026-06-18 · unverdicted · novelty 8.0 · 2 refs

Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.

TW-LegalBench: Measuring Taiwanese Legal Understanding

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.

Entropy-Gated Latent Recursion

cs.LG · 2026-06-15 · unverdicted · novelty 8.0 · 2 refs

EGLR adds a deterministic layer-recursion axis gated by entropy that is complementary to temperature sampling, raising joint oracle accuracy on MATH-500 from 83.4% to 91.6% for a 3B model.

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

cs.AI · 2026-06-04 · accept · novelty 8.0

Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 8.0

A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · novelty 8.0

Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.

RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis

cs.CL · 2026-05-16 · accept · novelty 8.0

RTI-Bench is the first publicly released structured dataset of CIC administrative decisions with outcome labels, exemption citations, IRAC reasoning, and timelines, built from 1,218 corpus cases and 298 PDFs, achieving 95.3% label precision on manual review and 57.3% accuracy on a Mistral 7B zero-Sh

Privacy Auditing with Zero (0) Training Run

cs.CR · 2026-05-14 · unverdicted · novelty 8.0

Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

cs.LG · 2026-05-04 · conditional · novelty 8.0 · 2 refs

INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

Backdoor Attacks on Decentralised Post-Training

cs.CR · 2026-03-31 · conditional · novelty 8.0

An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequent safety training.

CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs

cs.CR · 2025-11-27 · conditional · novelty 8.0

CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without the trigger.

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

cs.CL · 2025-07-28 · accept · novelty 8.0

MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

Information Dynamics of Language Communication

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

The paper defines STE and SPID, two information-theoretic measures of semantic flow and decomposition in language exchanges, and applies them to four dialogue datasets.

Anisotropy Decides Cosine vs. Rank Metrics for Text Embeddings

cs.CL · 2026-06-28 · conditional · novelty 7.0

Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.

MultiHashFormer: Hash-based Generative Language Models

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.

Structure Before Collapse: Transient semantic geometry in next-token prediction

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

Semantic geometry emerges transiently early in next-token prediction training before collapsing to Neural Collapse symmetry in synthetic settings with latent semantic factors.

SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning

cs.LG · 2026-06-24 · unverdicted · novelty 7.0

HRM adapters via Hankel reduced-order modeling outperform LoRA on long-context tasks in Mistral-7B when used as SSM residual modules with FFT-based parallel scan.

The Topology of Ill-Posed Questions: Persistent Homology for Detection and Steering in LLMs

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

Zero-dimensional persistent homology on transformer layer hidden states yields three descriptors per layer whose concatenation improves ill-posedness classification and enables topology-conditioned activation steering across three LLMs.

citing papers explorer

Showing 50 of 58 citing papers after filters.

DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning cs.LG · 2026-05-04 · conditional · none · ref 4 · 2 links · internal anchor
INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.
Backdoor Attacks on Decentralised Post-Training cs.CR · 2026-03-31 · conditional · none · ref 9 · internal anchor
An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequent safety training.
CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs cs.CR · 2025-11-27 · conditional · none · ref 43 · internal anchor
CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without the trigger.
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? cs.CV · 2024-08-23 · conditional · none · ref 28 · internal anchor
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
ORPO: Monolithic Preference Optimization without Reference Model cs.CL · 2024-03-12 · conditional · none · ref 27 · internal anchor
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Anisotropy Decides Cosine vs. Rank Metrics for Text Embeddings cs.CL · 2026-06-28 · conditional · none · ref 9 · internal anchor
Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.
4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking cs.CV · 2026-06-21 · conditional · none · ref 19 · internal anchor
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
Locality Does Not Imply Reachability: Boundary Repair in Block-Sparse Causal Attention cs.LG · 2026-06-01 · conditional · none · ref 12 · internal anchor
Fixed block causal masks create reachability boundaries where representations depend only on block prefixes, formalized via dependency sets and phase-conditioned coverage functions, with a parameter-free boundary bridge repair.
Widening the Gap: Exploiting LLM Quantization via Outlier Injection cs.LG · 2026-05-14 · conditional · none · ref 33 · internal anchor
The paper introduces an outlier-injection attack that induces targeted weight collapse in LLMs under advanced quantization schemes including AWQ, GPTQ, and GGUF I-quants.
Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization math.OC · 2026-05-12 · conditional · none · ref 42 · internal anchor
Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts cs.AI · 2026-05-12 · conditional · none · ref 43 · internal anchor
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
SoK: Unlearnability and Unlearning for Model Dememorization cs.LG · 2026-05-12 · conditional · none · ref 89 · internal anchor
The first integrated taxonomy, empirical study of interplay and shallow dememorization, plus a theoretical guarantee on dememorization depth for certified unlearning.
One Pass, Any Order: Position-Invariant Listwise Reranking for LLM-Based Recommendation cs.IR · 2026-04-30 · conditional · none · ref 15 · internal anchor
InvariRank achieves permutation-invariant listwise reranking for LLM-based recommendations via a structured attention mask that blocks cross-candidate interactions and shared positional framing under RoPE, enabling stable rankings in one forward pass.
Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels cs.LG · 2026-04-27 · conditional · none · ref 11 · internal anchor
COVERCAL selects PTQ calibration samples via weighted set cover over outlier channels, with a stylized clipping model showing missed coverage upper-bounds surrogate loss, yielding gains over random and other baselines on LLaMA and Mistral models.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective cs.CL · 2026-04-25 · conditional · none · ref 30 · 2 links · internal anchor
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
Provably Secure Steganography Based on List Decoding cs.CR · 2026-04-23 · conditional · none · ref 9 · internal anchor
List decoding enables a provably secure steganography scheme with higher embedding capacity for LLMs via candidate sets and suffix matching.
Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation cs.CL · 2026-04-22 · conditional · none · ref 23 · internal anchor
Clinical narrative format beats raw JSON for LLMs up to 8B parameters on medication reconciliation but raw JSON wins at 70B scale, with omissions as the main error type.
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows cs.CL · 2026-04-17 · conditional · none · ref 58 · internal anchor
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data cs.CL · 2025-12-11 · conditional · none · ref 2 · internal anchor
PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with modest compute.
When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation cs.LG · 2025-12-09 · conditional · none · ref 27 · internal anchor
LLM tabular generators leak memorized numeric strings, allowing a no-box attack to achieve near-perfect membership inference on some state-of-the-art models.
Soft Head Selection for Injecting ICL-Derived Task Embeddings cs.CL · 2025-07-28 · conditional · none · ref 8 · internal anchor
SITE applies soft gradient-based head selection to inject ICL-derived task embeddings, outperforming prior embedding adaptation and few-shot ICL across generation, reasoning, and NLU tasks on 12 LLMs from 4B to 70B parameters.
SpinQuant: LLM quantization with learned rotations cs.LG · 2024-05-26 · conditional · none · ref 7 · internal anchor
SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
Jamba: A Hybrid Transformer-Mamba Language Model cs.CL · 2024-03-28 · conditional · none · ref 23 · internal anchor
Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
KTO: Model Alignment as Prospect Theoretic Optimization cs.LG · 2024-02-02 · conditional · none · ref 10 · internal anchor
KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software cs.CR · 2026-06-18 · conditional · none · ref 36 · internal anchor
A curated temporal-split benchmark shows LLMs achieve at most 52.1% vulnerability detection accuracy after fine-tuning because backbone directional biases resist correction and contamination provides no benefit.
Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation cs.CL · 2026-06-06 · conditional · none · ref 30 · internal anchor
A systematic analysis of 284 manually reviewed papers plus 1.8k+ others from 2023-2025 reveals under-reporting of human evaluation study design details, creating ambiguity in what was measured and how.
Caliper: Probing Lexical Anchors versus Causal Structure in LLMs cs.CL · 2026-06-03 · conditional · none · ref 15 · internal anchor
Lexical anonymization via Caliper causes consistent accuracy drops of 7-30 percentage points across LLMs on causal benchmarks, indicating reliance on lexical anchors rather than structural causal reasoning.
Do Transformers Need Three Projections? Systematic Study of QKV Variants cs.LG · 2026-06-01 · conditional · none · ref 73 · internal anchor
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.
Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence cs.CL · 2026-05-30 · conditional · none · ref 41 · internal anchor
Parameter-based knowledge editing in LLMs induces reasoning collapse via dimensional collapse and is consistently outperformed by a retrieval baseline across varied edit counts, knowledge complexity, and evaluation metrics.
Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation cs.CL · 2026-05-21 · conditional · none · ref 138 · 2 links · internal anchor
LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing cs.CL · 2026-05-09 · conditional · none · ref 12 · internal anchor
ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61x at 128k context.
CoNewsReader: Supporting Comprehensive Understanding and Raising Critical Thoughts on Social Media News Through Comments cs.HC · 2026-04-30 · conditional · none · ref 48 · 2 links · internal anchor
CoNewsReader integrates user comments with an LLM to improve critical news reading on social media, with a 24-participant study showing gains in comprehension and critical thinking over baseline interfaces.
The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation cs.LG · 2026-04-26 · conditional · none · ref 19 · 2 links · internal anchor
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accuracy to 71-72.5% on Gemma-2B and Mistral-7B.
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale cs.CV · 2026-04-20 · conditional · none · ref 43 · internal anchor
Cross-modal representational alignment between text, image, audio, and video models degrades substantially when evaluated at million-sample scale and under realistic many-to-many pairing assumptions.
In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 32 · internal anchor
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference cs.IR · 2026-04-03 · conditional · none · ref 9 · internal anchor
LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
Neuro-Symbolic Proof Generation for Scaling Systems Software Verification cs.AI · 2026-03-20 · conditional · none · ref 12 · internal anchor
A neuro-symbolic system using LLM-guided best-first search and Isabelle tools proves up to 77.6% of theorems on the seL4 benchmark, outperforming prior LLM methods and Sledgehammer.
Why Attend to Everything? Focus is the Key cs.CL · 2026-03-12 · conditional · none · ref 15 · internal anchor
Focus learns a few centroids to gate long-range token attention, producing sparse attention that matches or beats full attention quality with up to 8.6x speedup at million-token lengths.
Graph-Based Alternatives to LLMs for Human Simulation cs.CL · 2025-11-03 · conditional · none · ref 36 · internal anchor
GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three evaluation settings.
Causal2Vec: Improving Decoder-only LLMs as Embedding Models through a Contextual Token cs.CL · 2025-07-31 · conditional · none · ref 41 · internal anchor
Causal2Vec prepends a BERT-generated contextual token to decoder-only LLMs and pools its hidden state with the EOS token to reach new SOTA on MTEB among public-data-trained embedding models.
Accelerating Prefilling via Decoding-time Contribution Sparsity cs.CL · 2025-07-29 · conditional · none · ref 9 · internal anchor
TriangleMix exploits decoding-time contribution sparsity via a training-free static attention pattern to accelerate LLM prefilling with nearly lossless performance.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 295 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling cs.CL · 2024-06-04 · conditional · none · ref 14 · internal anchor
PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark cs.CL · 2024-06-03 · conditional · none · ref 20 · internal anchor
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies cs.CL · 2024-04-09 · conditional · none · ref 26 · internal anchor
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 18 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache cs.CL · 2024-02-05 · conditional · none · ref 8 · internal anchor
KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations cs.AI · 2023-12-14 · conditional · none · ref 65 · internal anchor
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! cs.IR · 2023-12-05 · conditional · none · ref 14 · internal anchor
RankZephyr is a new open-source LLM that closes the effectiveness gap with GPT-4 for zero-shot listwise reranking while showing robustness to input ordering and document count.
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models cs.CL · 2023-09-21 · conditional · none · ref 30 · internal anchor
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

Title resolution pending

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer