super hub Canonical reference

GPT-4 Technical Report

Florencia Leoni Aleman, Ilge Akkaya, Josh Achiam, Lama Ahmad, Sandhini Agarwal, Steven Adler · 2023 · cs.CL · arXiv 2303.08774

Canonical reference. 75% of citing Pith papers cite this work as background.

865 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 865 citing papers more from Florencia Leoni Aleman arXiv PDF

abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 15 baseline 5

citation-polarity summary

background 15 baseline 5

claims ledger

abstract We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core compone

authors

Floren- cia Leoni Aleman Ilge Akkaya Josh Achiam Lama Ahmad Sandhini Agarwal Steven Adler

co-cited works

representative citing papers

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

Pretraining Exposure Explains Popularity Judgments in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

cs.CV · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

Approximation Error Upper and Lower Bounds for H\"{o}lder Class with Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

A standard Transformer with O(ε^{-d0/α}) blocks can approximate any bounded d0-dimensional Hölder function of smoothness α to accuracy ε, but at least Ω(ε^{-d0/(4α)}) blocks are required.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

Nearly Optimal Attention Coresets

cs.DS · 2026-05-07 · unverdicted · novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

Efficient Preference Poisoning Attack on Offline RLHF

cs.LG · 2026-05-04 · unverdicted · novelty 8.0

Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.

Characterizing the Expressivity of Local Attention in Transformers

cs.CL · 2026-05-01 · unverdicted · novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressively complementary.

From Context to Skills: Can Language Models Learn from Context Skillfully?

cs.AI · 2026-04-30 · unverdicted · novelty 8.0

Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

Revisable by Design: A Theory of Streaming LLM Agent Execution

cs.LG · 2026-04-25 · unverdicted · novelty 8.0

LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.

RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

RespondeoQA is the first benchmark dataset for question answering and translation between Latin and English, with 7,800 pairs from pedagogical sources and initial LLM evaluations.

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

cs.AI · 2026-04-20 · accept · novelty 8.0

MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmentation yielding up to 12% gains.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

PhysInOne: Visual Physics Learning and Reasoning in One Suite

cs.CV · 2026-04-10 · unverdicted · novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

Disentangling MLP Neuron Weights in Vocabulary Space

cs.CL · 2026-04-07 · unverdicted · novelty 8.0

ROTATE disentangles MLP neurons into faithful vocabulary channels by optimizing weight rotations to maximize vocabulary-space kurtosis, outperforming activation-based baselines for neuron descriptions.

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

cs.CV · 2026-04-04 · unverdicted · novelty 8.0

ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks

cs.AI · 2026-04-01 · unverdicted · novelty 8.0

AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.

Adaptive Stopping for Multi-Turn LLM Reasoning

cs.CL · 2026-04-01 · unverdicted · novelty 8.0

MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

citing papers explorer

Showing 50 of 865 citing papers.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence cs.CL · 2026-05-13 · accept · none · ref 1 · internal anchor
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
Pretraining Exposure Explains Popularity Judgments in Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts cs.CV · 2026-05-12 · unverdicted · none · ref 1 · 2 links · internal anchor
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
Approximation Error Upper and Lower Bounds for H\"{o}lder Class with Transformers cs.LG · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
A standard Transformer with O(ε^{-d0/α}) blocks can approximate any bounded d0-dimensional Hölder function of smoothness α to accuracy ε, but at least Ω(ε^{-d0/(4α)}) blocks are required.
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds cs.LG · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
LLM Translation of Compiler Intermediate Representation cs.PL · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
Nearly Optimal Attention Coresets cs.DS · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
Efficient Preference Poisoning Attack on Offline RLHF cs.LG · 2026-05-04 · unverdicted · none · ref 38 · internal anchor
Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
Characterizing the Expressivity of Local Attention in Transformers cs.CL · 2026-05-01 · unverdicted · none · ref 26 · internal anchor
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressively complementary.
From Context to Skills: Can Language Models Learn from Context Skillfully? cs.AI · 2026-04-30 · unverdicted · none · ref 31 · internal anchor
Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
Revisable by Design: A Theory of Streaming LLM Agent Execution cs.LG · 2026-04-25 · unverdicted · none · ref 5 · internal anchor
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering cs.CL · 2026-04-22 · unverdicted · none · ref 16 · internal anchor
RespondeoQA is the first benchmark dataset for question answering and translation between Latin and English, with 7,800 pairs from pedagogical sources and initial LLM evaluations.
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval cs.AI · 2026-04-20 · accept · none · ref 36 · internal anchor
MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmentation yielding up to 12% gains.
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks cs.CL · 2026-04-19 · unverdicted · none · ref 48 · internal anchor
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 35 · internal anchor
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
PhysInOne: Visual Physics Learning and Reasoning in One Suite cs.CV · 2026-04-10 · unverdicted · none · ref 64 · internal anchor
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing cs.CV · 2026-04-10 · accept · none · ref 1 · internal anchor
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
Disentangling MLP Neuron Weights in Vocabulary Space cs.CL · 2026-04-07 · unverdicted · none · ref 2 · internal anchor
ROTATE disentangles MLP neurons into faithful vocabulary channels by optimizing weight rotations to maximize vocabulary-space kurtosis, outperforming activation-based baselines for neuron descriptions.
ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos cs.CV · 2026-04-04 · unverdicted · none · ref 23 · internal anchor
ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations physics.chem-ph · 2026-04-03 · conditional · none · ref 7 · internal anchor
FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems cs.CR · 2026-04-03 · unverdicted · none · ref 1 · internal anchor
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks cs.AI · 2026-04-01 · unverdicted · none · ref 2 · internal anchor
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
Adaptive Stopping for Multi-Turn LLM Reasoning cs.CL · 2026-04-01 · unverdicted · none · ref 1 · internal anchor
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 93 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments cs.AI · 2024-04-11 · accept · none · ref 40 · internal anchor
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
RULER: What's the Real Context Size of Your Long-Context Language Models? cs.CL · 2024-04-09 · accept · none · ref 28 · internal anchor
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
The Linear Representation Hypothesis and the Geometry of Large Language Models cs.CL · 2023-11-07 · conditional · none · ref 20 · internal anchor
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models cs.CL · 2023-05-17 · accept · none · ref 24 · internal anchor
Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction cs.LG · 2026-05-13 · unverdicted · none · ref 33 · internal anchor
GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.
Sampling from Flow Language Models via Marginal-Conditioned Bridges cs.LG · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
Marginal-conditioned bridges enable training-free sampling from Flow Language Models by drawing clean one-hot endpoints from factorized posteriors and using Ornstein-Uhlenbeck bridges, preserving token marginals and reducing denoising error versus conditional-mean bridges.
Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models cs.LG · 2026-05-13 · conditional · none · ref 38 · internal anchor
DyGFM introduces decoupled pre-training and divergence-conditioned prompts to create the first multi-domain dynamic graph foundation model that outperforms baselines on node classification and link prediction.
Query-Conditioned Test-Time Self-Training for Large Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations cs.CL · 2026-05-13 · unverdicted · none · ref 21 · internal anchor
IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.
STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition cs.CV · 2026-05-13 · conditional · none · ref 69 · internal anchor
STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models cs.CV · 2026-05-13 · conditional · none · ref 1 · internal anchor
LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without any training.
OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems cs.DB · 2026-05-13 · conditional · none · ref 1 · internal anchor
OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
ImageAttributionBench: How Far Are We from Generalizable Attribution? cs.CV · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation cs.CV · 2026-05-13 · unverdicted · none · ref 2 · internal anchor
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge stat.ML · 2026-05-13 · unverdicted · none · ref 2 · internal anchor
In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
State-Centric Decision Process cs.AI · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models cs.CV · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation cs.CV · 2026-05-12 · unverdicted · none · ref 22 · internal anchor
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models cs.CV · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.
CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference cs.IT · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters cs.CV · 2026-05-12 · unverdicted · none · ref 51 · internal anchor
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters cs.LG · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
Online Continual Learning with Dynamic Label Hierarchies cs.LG · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
HALO improves online continual learning under evolving label hierarchies by adaptively combining classification heads regularized with organized learnable prototypes for better adaptation and reduced forgetting.
SoK: Unlearnability and Unlearning for Model Dememorization cs.LG · 2026-05-12 · conditional · none · ref 6 · internal anchor
The first integrated taxonomy, empirical study of interplay and shallow dememorization, plus a theoretical guarantee on dememorization depth for certified unlearning.
PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments cs.RO · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
Kairos: A Scalable Serving System for Physical AI cs.RO · 2026-05-12 · unverdicted · none · ref 41 · internal anchor
Kairos is the first multi-robot serving system that treats the generate-execute loop as a first-class citizen and reduces average task latency by 31.8-66.5% versus digital AI serving systems.

GPT-4 Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer