Canonical reference

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. · 2020

Canonical reference. 92% of citing Pith papers cite this work as background.

44 Pith papers citing it

Background 92% of classified citations

citation-role summary

background 13

citation-polarity summary

background 12 · unclear 1

representative citing papers

Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

cs.CR · 2026-05-19 · conditional · novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

State-Centric Decision Process

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

BadDLM: Backdooring Diffusion Language Models with Diverse Targets

cs.CR · 2026-05-10 · unverdicted · novelty 7.0

BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

cs.RO · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.

Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-monotonic updates that affect acquisition and regret.

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

A multi-agent framework reconstructs the evolutionary graph of post-training LLM datasets, revealing domain patterns like vertical refinement in math data and systemic issues like redundancy and benchmark contamination, then applies it to create a more diverse lineage-aware dataset.

Scaling Vision Transformers for Functional MRI with Flat Maps

cs.CV · 2025-10-15 · conditional · novelty 7.0

CortexMAE adapts Vision Transformers to fMRI via cortical flat maps, shows power-law scaling on 2.1K hours of data, and outperforms priors on cognitive state decoding while failing to beat a simple functional connectivity baseline on subject-level trait prediction.

Pre-trained Large Language Models Learn Hidden Markov Models In-context

cs.LG · 2025-06-08 · unverdicted · novelty 7.0

Pre-trained LLMs learn to predict HMM-generated sequences via in-context learning, approaching theoretical optimum on synthetic HMMs and matching expert models on real animal decision data.

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

cs.CL · 2025-06-08 · unverdicted · novelty 7.0

VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

cs.AI · 2025-06-04 · unverdicted · novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

cs.AI · 2025-03-14 · conditional · novelty 7.0

Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

cs.CL · 2025-02-04 · unverdicted · novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

cs.LG · 2024-10-05 · unverdicted · novelty 7.0

TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

Ring Attention with Blockwise Transformers for Near-Infinite Context

cs.CL · 2023-10-03 · unverdicted · novelty 7.0

Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

LAION-5B: An open large-scale dataset for training next generation image-text models

cs.CV · 2022-10-16 · accept · novelty 7.0

LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

cs.LG · 2022-05-27 · accept · novelty 7.0

FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on sequences up to 64K long.

In-context learning enables continental-scale subsurface temperature prediction from sparse local observations

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

A transformer-based in-context learning model predicts continental-scale subsurface temperatures from sparse borehole observations, outperforming physics and interpolation baselines while adapting to new regions with 20 examples.

Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.

DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

cs.CR · 2026-05-15 · unverdicted · novelty 6.0

DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.

CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-driven research.

citing papers explorer

Showing 44 of 44 citing papers.

Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing cs.CR · 2026-04-07 · unverdicted · none · ref 15
The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models cs.CR · 2026-05-19 · conditional · none · ref 3
ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
State-Centric Decision Process cs.AI · 2026-05-12 · unverdicted · none · ref 7
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents cs.AI · 2026-05-11 · unverdicted · none · ref 4
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
BadDLM: Backdooring Diffusion Language Models with Diverse Targets cs.CR · 2026-05-10 · unverdicted · none · ref 2
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents cs.RO · 2026-05-08 · unverdicted · none · ref 35 · 2 links
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations cs.CL · 2026-05-06 · unverdicted · none · ref 2
LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-monotonic updates that affect acquisition and regret.
Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs cs.AI · 2026-04-12 · unverdicted · none · ref 5
A multi-agent framework reconstructs the evolutionary graph of post-training LLM datasets, revealing domain patterns like vertical refinement in math data and systemic issues like redundancy and benchmark contamination, then applies it to create a more diverse lineage-aware dataset.
Scaling Vision Transformers for Functional MRI with Flat Maps cs.CV · 2025-10-15 · conditional · none · ref 9
CortexMAE adapts Vision Transformers to fMRI via cortical flat maps, shows power-law scaling on 2.1K hours of data, and outperforms priors on cognitive state decoding while failing to beat a simple functional connectivity baseline on subject-level trait prediction.
Pre-trained Large Language Models Learn Hidden Markov Models In-context cs.LG · 2025-06-08 · unverdicted · none · ref 7
Pre-trained LLMs learn to predict HMM-generated sequences via in-context learning, approaching theoretical optimum on synthetic HMMs and matching expert models on real animal decision data.
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs cs.CL · 2025-06-08 · unverdicted · none · ref 6
VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games cs.AI · 2025-06-04 · unverdicted · none · ref 24
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation cs.AI · 2025-03-14 · conditional · none · ref 23
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression cs.CL · 2025-02-04 · unverdicted · none · ref 2
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis cs.LG · 2024-10-05 · unverdicted · none · ref 21
TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding cs.CV · 2024-06-13 · conditional · none · ref 5
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
Ring Attention with Blockwise Transformers for Near-Infinite Context cs.CL · 2023-10-03 · unverdicted · none · ref 3
Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.
LAION-5B: An open large-scale dataset for training next generation image-text models cs.CV · 2022-10-16 · accept · none · ref 8
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness cs.LG · 2022-05-27 · accept · none · ref 5
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on sequences up to 64K long.
In-context learning enables continental-scale subsurface temperature prediction from sparse local observations cs.LG · 2026-05-15 · unverdicted · none · ref 21
A transformer-based in-context learning model predicts continental-scale subsurface temperatures from sparse borehole observations, outperforming physics and interpolation baselines while adapting to new regions with 20 examples.
Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction cs.CV · 2026-05-15 · unverdicted · none · ref 32
Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.
DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models cs.CR · 2026-05-15 · unverdicted · none · ref 4
DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.
CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment cs.AI · 2026-05-05 · unverdicted · none · ref 75
CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists cs.AI · 2026-04-30 · unverdicted · none · ref 14
Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-driven research.
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling cs.CL · 2026-04-27 · unverdicted · none · ref 5
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention cs.CL · 2025-06-16 · unverdicted · none · ref 2
PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.
LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers cs.LG · 2025-03-18 · unverdicted · none · ref 4
LLM-FE is a framework that treats feature engineering as LLM-driven program search with data feedback, reporting consistent gains over baselines on classification and regression tabular tasks.
Learning Spatial-Preserving Hierarchical Representations for Digital Pathology cs.CV · 2024-06-13 · unverdicted · none · ref 5
SPAN is a hierarchical attention framework that constructs multi-scale pyramid representations from single-scale patch inputs for WSI classification and segmentation while preserving spatial relationships.
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation cs.CV · 2024-04-22 · unverdicted · none · ref 10
SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
ConfusionPrompt: Practical Private Inference for Online Large Language Models cs.CR · 2023-12-30 · unverdicted · none · ref 1
ConfusionPrompt enables private black-box LLM inference via prompt decomposition and pseudo-prompt mixing, claiming better privacy-utility trade-off than perturbation methods and lower memory use than open-source local models.
SGLang: Efficient Execution of Structured Language Model Programs cs.AI · 2023-12-12 · conditional · none · ref 5
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
S³: Social-network Simulation System with Large Language Model-Empowered Agents cs.SI · 2023-07-27 · unverdicted · none · ref 6
S³ uses LLM agents to simulate social networks by modeling emotion, attitude, and interaction, producing emergent propagation phenomena with promising accuracy on real data.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society cs.AI · 2023-03-31 · conditional · none · ref 12
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
CoCa: Contrastive Captioners are Image-Text Foundation Models cs.CV · 2022-05-04 · accept · none · ref 4
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 4
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents cs.AI · 2026-05-11 · unverdicted · none · ref 3
EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.
Higher-Order LaSDI: Reduced Order Modeling with Multiple Time Derivatives cs.LG · 2025-12-17 · unverdicted · none · ref 9
Higher-order LaSDI uses a high-order finite-difference scheme and rollout loss to improve long-term prediction accuracy in reduced-order models for parameterized PDEs, shown on the 2D Burgers equation.
GENIUS: An Agentic AI Framework for Autonomous Design and Execution of Simulation Protocols cs.AI · 2025-12-06 · unverdicted · none · ref 40
GENIUS is an agentic AI framework that automates generation, validation, and repair of Quantum ESPRESSO DFT input files, succeeding on ~80% of 295 benchmarks with 76% autonomous repairs and lower cost than LLM-only baselines.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 50
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning cs.CV · 2023-10-14 · unverdicted · none · ref 4
MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment cs.AI · 2023-08-10 · accept · none · ref 8
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
Chat Modeling: Interaction-Enhanced Agent Framework for Visualizing Literature-Grounded Biological Structures cs.HC · 2024-04-01 · unverdicted · none · ref 7
Chat Modeling is a multi-agent LLM framework with modeling memory and dynamic chat widgets that translates text inputs into interactive 3D modeling operations for literature-grounded biological structures.
PaLI-X: On Scaling up a Multilingual Vision and Language Model cs.CV · 2023-05-29 · unverdicted · none · ref 13
Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.
Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba cs.LG · 2025-03-22 · unverdicted · none · ref 39
A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other domains.
