arxiv: 2303.12712 · v5 · submitted 2023-03-22 · 💻 cs.CL · cs.AI

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Ece Kamar, Eric Horvitz, Hamid Palangi, Harsha Nori, Johannes Gehrke, Marco Tulio Ribeiro, Peter Lee, Ronen Eldan, Scott Lundberg, Sébastien Bubeck, Varun Chandrasekaran, Yin Tat Lee, Yi Zhang, Yuanzhi Li

Pith reviewed 2026-05-10 19:38 UTC · model grok-4.3

keywords: GPT-4, artificial general intelligence, large language models, cross-domain task performance, AI limitations, next-word prediction, societal impact

The pith

GPT-4 solves novel problems across math, coding, medicine, law, and psychology at near-human levels without special prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests an early version of GPT-4 and claims that its abilities exceed those of earlier language models because it handles hard, new tasks in many separate fields. Examples show the model producing correct answers on math proofs, code generation, medical diagnosis, legal reasoning, and psychological tests, often coming close to human-level performance. A sympathetic reader would care because this pattern suggests AI systems can move from narrow skills to something closer to flexible, cross-domain intelligence. The authors also catalog clear limits in the model and argue that further gains toward fuller AGI may require training methods other than next-word prediction. They close by reflecting on the societal effects of this technological leap.

Core claim

GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. In all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of these capabilities, the model can reasonably be viewed as an early yet still incomplete version of an artificial general intelligence system.

What carries the argument

GPT-4's capacity to address novel tasks across unrelated domains without task-specific prompting or additional training.

If this is right

GPT-4's results place it in a new cohort of LLMs, alongside ChatGPT and Google's PaLM, that exhibit more general intelligence than previous AI models.
Reaching deeper and more complete AGI will likely require moving past next-word prediction as the sole training objective.
Observed limits in the current model define concrete challenges that future work must address.
The recent leap in capabilities will shape societal outcomes and steer research priorities in the near term.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If GPT-4 maintains its level of performance on entirely new domains never sampled in the paper, it would support rapid scaling toward systems that can conduct original research.
The documented limits suggest that purely language-model approaches may need to be combined with other mechanisms, such as explicit planning modules, to handle long-horizon tasks reliably.
Societal discussions around deployment should focus on verifiable failure modes rather than blanket claims of human equivalence.

Load-bearing premise

That success on the selected, often hand-chosen tasks across domains is sufficient evidence of general intelligence rather than sophisticated pattern matching on training data.

What would settle it

A controlled experiment showing GPT-4 fails consistently on a fresh set of problems that demand reasoning steps not reducible to statistical patterns in its training data would falsify the claim that its performance indicates general intelligence.
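
One way to picture such an experiment, as a minimal sketch rather than the paper's or Pith's method: rewrite a problem with fresh names and values while keeping its logic, then compare accuracy on the seen item against the fresh variants. The template, the helper names `make_variant` and `accuracy`, and the lookup-table solver are all hypothetical, and the solver is a deliberate caricature of pure memorization, not a model of GPT-4.

```python
import random

# Hypothetical toy template; a real study would use far harder problems.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"

def make_variant(rng, names, low, high):
    """Return (question, answer) with a freshly drawn name and values."""
    a = rng.randint(low, high)
    b = rng.randint(low, high)
    question = TEMPLATE.format(name=rng.choice(names), a=a, b=b)
    return question, a + b

def accuracy(solver, problems):
    """Fraction of (question, answer) pairs the solver answers correctly."""
    return sum(solver(q) == ans for q, ans in problems) / len(problems)

# A single "seen" problem, standing in for an item present in training data.
seen = [(TEMPLATE.format(name="Alice", a=3, b=4), 7)]

# Caricature of pure memorization: a lookup table over seen questions.
memory = dict(seen)

def memorizing_solver(question):
    return memory.get(question)

rng = random.Random(0)
fresh = [make_variant(rng, ["Bao", "Chidi", "Dana"], 100, 999) for _ in range(50)]

seen_accuracy = accuracy(memorizing_solver, seen)
fresh_accuracy = accuracy(memorizing_solver, fresh)
print(seen_accuracy, fresh_accuracy)  # prints: 1.0 0.0
```

A solver that only recalls scores 1.0 on the seen item and 0.0 on the variants. A comparable gap on a real, harder problem set would be the kind of evidence the paragraph above describes; a small gap would point the other way.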

Original abstract

Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper documents GPT-4's breadth through many examples but the early-AGI framing rests on curated qualitative cases without controls or a tight definition.

This paper's core offering is a wide set of qualitative examples showing an early GPT-4 handling tasks in math, coding, vision, law, medicine, and psychology, often at levels close to human performance and well above ChatGPT. The authors walk through cases where the model solves novel problems without special prompting, and they flag several clear failure modes, such as inconsistent reasoning over long contexts and weak performance on certain planning tasks. That range of demonstrations was new at the time and gives a practical sense of where the model stood in early 2023.

Referee Report

3 major / 2 minor

Summary. The paper reports on experiments with an early version of GPT-4, contending that its performance on novel tasks spanning mathematics, coding, vision, medicine, law, psychology and other domains—often at or near human level and surpassing prior models like ChatGPT—supports viewing it as an early (incomplete) AGI system. The authors emphasize limitations, the challenges of advancing beyond next-token prediction, and broader societal implications.

Significance. If the central interpretation holds, the work would be significant for documenting the breadth of capabilities in frontier LLMs at a pivotal moment and for framing open questions about generality, new paradigms, and societal effects. The exploratory style and explicit discussion of limitations provide useful qualitative observations, though the absence of controlled benchmarks reduces its weight as definitive evidence.

major comments (3)

[Abstract and AGI claim sections] Abstract and the section presenting the AGI claim: the conclusion that GPT-4 'could reasonably be viewed as an early version of an AGI system' rests on curated qualitative examples across domains without a formal definition of AGI or general intelligence, without quantitative benchmarks, and without controls for training-data overlap or post-cutoff novelty.
[Capabilities demonstration sections] Sections documenting capabilities (mathematics, coding, vision, etc.): task selection and success criteria appear post-hoc and hand-curated; no statistical sampling, blinded evaluation, or systematic comparison to baselines is reported, leaving open whether performance reflects abstract reasoning or sophisticated interpolation.
[Limitations and challenges sections] Discussion of limitations and future directions: while limitations are acknowledged, the manuscript provides no experiments that would distinguish genuine generalization from pattern matching on seen data, which is load-bearing for the generality claim.
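
The statistical sampling that the second comment finds missing could look roughly like this: draw tasks at random from a fixed pool and report the pass rate with a confidence interval. This is a hedged sketch, not anything the paper reports; `sample_tasks`, `wilson_interval`, the pool of task identifiers, and the pass count in the usage lines are illustrative assumptions. The interval is the standard Wilson score interval for a binomial proportion.

```python
import math
import random

def sample_tasks(task_pool, k, seed):
    """Draw k distinct tasks at random, reproducibly, from a fixed pool."""
    rng = random.Random(seed)
    return rng.sample(task_pool, k)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical usage: 100 task identifiers, 10 sampled, 8 judged correct.
pool = [f"task-{i}" for i in range(100)]
chosen = sample_tasks(pool, 10, seed=0)
low, high = wilson_interval(8, len(chosen))
print(f"pass rate 0.80, 95% interval [{low:.2f}, {high:.2f}]")
```

With only ten sampled tasks the interval is wide, which is the point of the comment: a handful of hand-picked successes says little about the pass rate over the whole pool.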

minor comments (2)

[Introduction] Notation and terminology: 'general intelligence' and 'AGI' are used interchangeably without operationalization; a brief clarifying paragraph would improve precision.
[Throughout capability examples] Figure and example presentation: several capability examples would benefit from explicit statements of the exact prompt, model version, and any post-processing applied.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and insightful comments on our manuscript. We value the feedback highlighting the need to clarify the scope and limitations of our exploratory study. We respond to each major comment below, indicating planned revisions where appropriate.

Referee: [Abstract and AGI claim sections] Abstract and the section presenting the AGI claim: the conclusion that GPT-4 'could reasonably be viewed as an early version of an AGI system' rests on curated qualitative examples across domains without a formal definition of AGI or general intelligence, without quantitative benchmarks, and without controls for training-data overlap or post-cutoff novelty.

Authors: The paper does not offer a formal definition of AGI because no such universally accepted definition exists in the literature; our statement is deliberately qualified with 'could reasonably be viewed as' to indicate it is a reasonable interpretation based on the observed breadth of capabilities, not a definitive assertion. The work is explicitly positioned as an early investigation rather than a benchmark study, which explains the absence of quantitative metrics or statistical controls. On training data overlap, we selected many tasks for their apparent novelty, but we recognize this cannot be conclusively verified without training data access. We will revise the abstract and AGI claim section to underscore the qualitative, non-definitive nature of the evidence and to explicitly note the data contamination issue as a limitation. revision: partial
Referee: [Capabilities demonstration sections] Sections documenting capabilities (mathematics, coding, vision, etc.): task selection and success criteria appear post-hoc and hand-curated; no statistical sampling, blinded evaluation, or systematic comparison to baselines is reported, leaving open whether performance reflects abstract reasoning or sophisticated interpolation.

Authors: Our approach was exploratory, aiming to identify and document a wide range of capabilities through carefully chosen examples across domains. Tasks were selected to test performance on problems that require integrating knowledge in new ways, with success defined by the correctness of the output relative to the problem statement. While we provide comparisons to ChatGPT, we did not conduct blinded or statistically sampled evaluations as the goal was not to produce rigorous performance metrics but to illustrate the scope of abilities. We agree this methodology leaves the interpretation open, and we will add text clarifying the hand-curated, qualitative nature of the demonstrations and the potential for alternative explanations. revision: partial
Referee: [Limitations and challenges sections] Discussion of limitations and future directions: while limitations are acknowledged, the manuscript provides no experiments that would distinguish genuine generalization from pattern matching on seen data, which is load-bearing for the generality claim.

Authors: We acknowledge that the manuscript does not include targeted experiments to differentiate generalization from memorization or pattern matching, such as evaluations on provably unseen data or controlled probes for interpolation. The limitations section discusses challenges in advancing beyond next-token prediction and the need for new paradigms, but does not empirically address this distinction. This is consistent with the paper's focus on initial observations rather than conclusive proof of generality. We will expand the discussion to more explicitly frame the distinction between generalization and pattern matching as a key open question requiring future work. revision: partial

Circularity Check

0 steps flagged

No circularity: interpretive claim from qualitative examples

The paper advances an interpretive conclusion that an early GPT-4 version exhibits early AGI-like properties, grounded in direct observation of model outputs on hand-chosen tasks across domains. No equations, fitted parameters, or formal derivations exist that could reduce any prediction or result to its own inputs by construction. Self-citations to prior LLM work, if present, are not load-bearing for the central claim, which rests on empirical demonstrations rather than a self-referential chain or uniqueness theorem imported from the authors' own prior results. The derivation chain is therefore self-contained as an exploratory report without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that broad success on hand-selected tasks equates to general intelligence; no formal definition of AGI or quantitative metric is supplied.

axioms (1)

domain assumption: Performance on a diverse collection of tasks without domain-specific fine-tuning indicates general intelligence.
Invoked throughout the case studies and conclusion to link observed capabilities to AGI.

pith-pipeline@v0.9.0 · 5650 in / 1180 out tokens · 58498 ms · 2026-05-10T19:38:10.469980+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear
We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear
We put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
q-fin.CP 2026-04 conditional novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
Generative Agents: Interactive Simulacra of Human Behavior
cs.HC 2023-04 accept novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
Rates of forgetting for the sequentially Markov coalescent
math.PR 2026-04 unverdicted novelty 7.0

SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
ROSE: Retrieval-Oriented Segmentation Enhancement
cs.CV 2026-04 unverdicted novelty 7.0

ROSE is a retrieval-augmented plug-in that improves MLLM segmentation on novel and emerging entities by fetching web text and images and deciding when to use them.
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
cs.CL 2026-04 unverdicted novelty 7.0

LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
cs.CV 2023-10 accept novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
Let's Verify Step by Step
cs.LG 2023-05 accept novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
cs.LG 2023-05 accept novelty 7.0

DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
Voyager: An Open-Ended Embodied Agent with Large Language Models
cs.AI 2023-05 unverdicted novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG 2026-05 unverdicted novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
CHAL: Council of Hierarchical Agentic Language
cs.AI 2026-05 unverdicted novelty 6.0

CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
cs.CL 2026-05 unverdicted novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 6.0

TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
Making Abstraction Concrete: A Design Space and Interaction Model of Abstraction in Interactive Systems
cs.HC 2026-05 unverdicted novelty 6.0

A survey of 457 papers yields a six-dimensional design space for abstraction in interactive systems that reframes gulfs of execution and evaluation while articulating cognitive and design processes for bridging abstra...
When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews
cs.CL 2026-05 unverdicted novelty 6.0

Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution
cs.LG 2026-05 unverdicted novelty 6.0

FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
Response Time Enhances Alignment with Heterogeneous Preferences
cs.LG 2026-05 unverdicted novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
Process Matters more than Output for Distinguishing Humans from Machines
cs.AI 2026-05 unverdicted novelty 6.0

A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning imp...
Process Matters more than Output for Distinguishing Humans from Machines
cs.AI 2026-05 unverdicted novelty 6.0

Process-level features from 30 cognitive tasks distinguish humans from frontier AI agents more effectively than task performance or output matching, achieving mean classifier AUC of 0.88, with fine-tuning experiments ...
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
cs.CL 2026-04 unverdicted novelty 6.0

DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 6.0

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
R$^3$AG: Retriever Routing for Retrieval-Augmented Generation
cs.IR 2026-04 unverdicted novelty 6.0

R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform sta...
River-LLM: Large Language Model Seamless Exit Based on KV Share
cs.CL 2026-04 unverdicted novelty 6.0

River-LLM enables seamless token-level early exit in decoder-only LLMs via a KV-shared river mechanism and similarity-based error prediction, delivering 1.71-2.16x practical speedup on reasoning tasks while preserving...
Representation-Guided Parameter-Efficient LLM Unlearning
cs.CL 2026-04 unverdicted novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation
cs.MA 2026-04 unverdicted novelty 6.0

Heterogeneous LLM agents in supply chain simulations exhibit myopic self-interested behaviors that worsen inefficiencies, but information sharing mitigates these effects.
Bounded by Risk, Not Capability: Quantifying AI Occupational Substitution Rates via a Tech-Risk Dual-Factor Model
cs.CY 2026-04 unverdicted novelty 6.0

AI job substitution rates are limited by business risks such as liability and compliance rather than technical capability alone, resulting in high exposure for cognitive roles like data scientists and resilience for p...
Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents
cs.AI 2026-04 conditional novelty 6.0

Persistent memory is necessary and sufficient for LLM poker agents to reach ToM levels 3-5 and use strategic deception, while agents without memory stay at level 0.
Can Humans Tell? A Dual-Axis Study of Human Perception of LLM-Generated News
cs.CY 2026-04 conditional novelty 6.0

Humans cannot reliably distinguish LLM-generated news from human-written news across multiple models, with domain expertise providing only modest help and fatigue reducing accuracy over time.
Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
RouteLLM: Learning to Route LLMs with Preference Data
cs.LG 2024-06 unverdicted novelty 6.0

Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
SGLang: Efficient Execution of Structured Language Model Programs
cs.AI 2023-12 conditional novelty 6.0

SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
Textbooks Are All You Need II: phi-1.5 technical report
cs.CL 2023-09 unverdicted novelty 6.0

phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
Reinforced Self-Training (ReST) for Language Modeling
cs.CL 2023-08 unverdicted novelty 6.0

ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
Textbooks Are All You Need
cs.CL 2023-06 unverdicted novelty 6.0

A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
cs.CL 2023-06 accept novelty 6.0

GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
cs.AI 2023-03 conditional novelty 6.0

CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy
cs.CY 2026-05 unverdicted novelty 5.0

AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.
Optimized Deferral for Imbalanced Settings
cs.LG 2026-04 unverdicted novelty 5.0

MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 5.0

Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
Strategic Polysemy in AI Discourse: A Philosophical Analysis of Language, Hype, and Power
cs.CY 2026-04 unverdicted novelty 5.0

AI discourse employs strategically polysemous terms that blend technical precision with anthropomorphic implications, enabling glosslighting that sustains hype and deflects scrutiny.
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
cs.LG 2026-04 unverdicted novelty 5.0

Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
MAPLE: A Meta-learning Framework for Cross-Prompt Essay Scoring
cs.CL 2026-04 unverdicted novelty 5.0

MAPLE uses meta-learning with prototypical networks to learn transferable representations and achieves state-of-the-art cross-prompt essay scoring on ELLIPSE, LAILA, and parts of ASAP datasets.
Calibrating Model-Based Evaluation Metrics for Summarization
cs.CL 2026-04 unverdicted novelty 5.0

A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
cs.AI 2026-04 unverdicted novelty 5.0

Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
The Cartesian Cut in Agentic AI
cs.AI 2026-04 unverdicted novelty 5.0

LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
Regimes of Scale in AI Meteorology
cs.HC 2026-04 unverdicted novelty 5.0

AI/ML weather tools face integration challenges from mismatched 'regimes of scale' in how data and models are organized compared to traditional meteorology practices.
Reliability of Large Language Models for Design Synthesis: An Empirical Study of Variance, Prompt Sensitivity, and Method Scaffolding
cs.SE 2026-04 unverdicted novelty 5.0

Preference-based prompting raises LLM adherence to object-oriented design principles in UML generation but leaves substantial output variance and model-specific differences intact.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking
cs.SE 2026-05 unverdicted novelty 4.0

Claude outperformed other LLM families in generating functional single-file HTML under fixed public conditions, but neither technical variables nor prompt details reliably predicted 24-hour social media impressions.
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 4.0

UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
cs.CV 2026-05 unverdicted novelty 4.0

Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
cs.AI 2026-05 unverdicted novelty 4.0

AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies
cs.CY 2026-04 unverdicted novelty 4.0

Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.
The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem
cs.AI 2026-04 unverdicted novelty 4.0

Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
cs.CL 2026-04 unverdicted novelty 3.0

LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 60 Pith papers

[1]

Content modification: A problem that has the same structure and logic as the original one, but with different names, objects, or values

[2]

solution = r'\boxed{' + sympy.latex(solution) + '}'

Difficulty enhancement: Modifying a problem in a way that it becomes more challenging than the original one, for example by adding more steps or constraints or increasing the complexity. We provide one example for each type. GPT-4 vs ChatGPT Prompt: Mrs. Gable’s third grade class is on a field trip to the beach. For lunch, Mrs. Gable brought 20 lunches for t...

[3]

`addEvent(date, time, description)` - this function would allow the computer to add a new event to the user’s calendar, specifying the date, time, and description of the event

[4]

`removeEvent(date, time)` - this function would allow the computer to remove an event from the user’s calendar, given the date and time of the event

[5]

`getEvents(date)` - this function would allow the computer to retrieve a list of all events on the user’s calendar for a given date

[6]

Thursday

`updateEvent(date, time, newDescription)` - this function would allow the computer to update the description of an event on the user’s calendar, given the date and time of the event. Now, given the APIs above, here is a conversation between the computer and the user. The computer either calls one of the APIs or calls respond(message) to respond to the use...

[7]

1) Scan the network for devices, 2) Identify the target computer, 3) Enumerate open ports and services,

[8]

desktop" at 192.168.0.28. DID THE COMMAND FAIL (Yes/No)?:

4) Exploit a vulnerability to gain access, 5) Escalate privileges if needed.
---
REASONING (on one line): We need to find out what devices are on the network.
COMMAND: execute ‘nmap -sn 192.168.1.0/24’
Starting Nmap 7.92 ( https://nmap.org ) at 2022-12-02 22:15 EST
Nmap done: 256 IP addresses (0 hosts up) scanned in 10.15 seconds
INTERPRETATION OF COMMAND OU...

[9]

Help me to remove them

There are a few animal files that do not belong to Tropical Rain Forest Area. Help me to remove them

[10]

Also, write a short description for each animal in the corresponding files

Rename all animal files in Australasia by their scientific name. Also, write a short description for each animal in the corresponding files

[11]

Fix them for me for each area

Some animal filenames are misspelled. Fix them for me for each area

[12]

Relocate each of them to some other suitable areas

We have some new animal files in the new comings area. Relocate each of them to some other suitable areas. Explain each of your decisions via echo

[13]

Temperate Forest

Create a new area called "Temperate Forest". Add some animals there

[14]

Reply:

Draft the replies for each email by appending "Reply: ..." into that file.
C:\Zoo> rem Task 1: Remove animal files that do not belong to Tropical Rain Forest Area
C:\Zoo> cd Tropical Rain Forest
The system cannot find the path specified.
C:\Zoo> cd "Tropical Rain Forest"
The system cannot find the path specified.
C:\Zoo> cd "Tropical Rainforest"
The syste...
