Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
hub
Language Models are Multilingual Chain-of-Thought Reasoners
32 Pith papers cite this work. Polarity classification is still indexing.
abstract
We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment. The MGSM benchmark is publicly available at https://github.com/google-research/url-nlp.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Fine-tuning on matched native and English-pivoted multilingual reasoning datasets across six languages reduces the native reasoning gap to 1.9-3.5%; layer swap of English mid-layers largely closes the remaining gap while preserving target-language CoT.
UL-XCoT maintains competitive accuracy on multilingual benchmarks while cutting decoding tokens by over 50% through per-query language selection and logic-space trajectory pruning.
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.
RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.
CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
SuperValid synthesizes capability-aligned OOD validation data to produce a training-free loss metric that correlates with downstream benchmark performance across model architectures, scales, and data distributions.
PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
x1 models adaptively select an advantageous language for reasoning per instance, yielding gains on multilingual math and cultural tasks while showing that scaling does not erase culture-language advantages.
Systematic experiments demonstrate that multilingual coverage in LLM post-training improves results for all languages and tasks compared to English-only, with low-resource languages gaining most and zero-shot transfer emerging at high diversity.
Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.
In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU, GPQA, HumanEval+, MATH, MGSM and ARC.
MGSM-Pro creates five digit-and-context varied instantiations per MGSM question to expose robustness gaps in multilingual LLM math reasoning, with larger drops in low-resource languages.
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
FLy is a training-free method that speeds up LLM generation by accepting semantically correct but non-exact draft tokens via an entropy gate and deferred verification window.
Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.
citing papers explorer
-
Toward Calibrated, Fair, and accurate Deepfake Detection
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
-
Rethinking the Multilingual Reasoning Gap with Layer Swap
Fine-tuning on matched native and English-pivoted multilingual reasoning datasets across six languages reduces the native reasoning gap to 1.9-3.5%; layer swap of English mid-layers largely closes the remaining gap while preserving target-language CoT.
-
Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework
UL-XCoT maintains competitive accuracy on multilingual benchmarks while cutting decoding tokens by over 50% through per-query language selection and logic-space trajectory pruning.
-
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.
-
RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)
RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.
-
CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
SuperValid: Capability-Aligned OOD Validation for Generalizable Downstream Scaling
SuperValid synthesizes capability-aligned OOD validation data to produce a training-free loss metric that correlates with downstream benchmark performance across model architectures, scales, and data distributions.
-
PACE: Two-Timescale Self-Evolution for Small Language Model Agents
PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
-
x1: Learning to Think Adaptively Across Languages and Cultures
x1 models adaptively select an advantageous language for reasoning per instance, yielding gains on multilingual math and cultural tasks while showing that scaling does not erase culture-language advantages.
-
English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training
Systematic experiments demonstrate that multilingual coverage in LLM post-training improves results for all languages and tasks compared to English-only, with low-resource languages gaining most and zero-shot transfer emerging at high diversity.
-
Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance
Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.
-
Sensitivity-Positional Co-Localization in GQA Transformers
In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU, GPQA, HumanEval+, MATH, MGSM and ARC.
-
MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation
MGSM-Pro creates five digit-and-context varied instantiations per MGSM question to expose robustness gaps in multilingual LLM math reasoning, with larger drops in low-resource languages.
-
CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
-
Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match
FLy is a training-free method that speeds up LLM generation by accepting semantically correct but non-exact draft tokens via an entropy gate and deferred verification window.
-
Routing-Based Continual Learning for Multimodal Large Language Models
Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
-
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
-
A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation
Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
-
Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.
-
Attention Residuals
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.
-
GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO
GanitLLM improves Bengali math accuracy by 6-8 points over its base model by training on a difficulty-tagged corpus with Curriculum-GRPO that boosts Bengali reasoning tokens from 14% to 88%.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data
JT-Safe-V2 is a safety-by-design LLM that reports SOTA scores on both capability and safety benchmarks while Safe-MoMA cuts inference cost over 30 percent.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Phi-4-reasoning Technical Report
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
-
Baichuan 2: Open Large-scale Language Models
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.