Sparks of Artificial General Intelligence: Early experiments with GPT-4
Canonical reference. 89% of citing Pith papers cite this work as background.
abstract
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.
citing papers
-
PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
-
Generative Agents: Interactive Simulacra of Human Behavior
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
-
Rates of forgetting for the sequentially Markov coalescent
SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
-
ROSE: Retrieval-Oriented Segmentation Enhancement
ROSE is a retrieval-augmented plug-in that improves MLLM segmentation on novel and emerging entities by fetching web text and images and deciding when to use them.
-
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
-
The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
This systematic survey organizes prompt engineering into a taxonomy of 58 text-based LLM prompting techniques and 40 techniques for other modalities, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
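The draft-then-verify pattern that Medusa instantiates can be illustrated with a toy greedy decoder. This is a sketch of the general idea only: the `model` and `heads` callables below are stand-ins for a base LLM and its speculative heads, not Medusa's actual decoding heads or tree-based attention.

```python
def draft_and_verify(model, heads, prefix):
    """One speculative decoding step: each head drafts a future token,
    then the base model keeps the longest draft prefix it would itself
    have produced greedily. Always emits at least one verified token."""
    ctx = tuple(prefix)
    draft = []
    for head in heads:                      # heads propose a chain of tokens
        draft.append(head(ctx + tuple(draft)))
    accepted = []
    for tok in draft:
        base = model(ctx + tuple(accepted))  # what the base model would emit
        if base == tok:
            accepted.append(tok)             # draft verified, keep it
        else:
            accepted.append(base)            # mismatch: fall back to base token
            break
    else:
        # every draft token was accepted; the final verification pass
        # yields one extra base-model token for free
        accepted.append(model(ctx + tuple(accepted)))
    return accepted
```

With well-calibrated heads this accepts several tokens per base-model pass, which is the source of the reported speedup; with poor heads it degrades gracefully to one token per step.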
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
Let's Verify Step by Step
Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
-
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
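The DPO objective is compact enough to sketch directly: a binary cross-entropy loss on the implicit reward margin between a chosen and a rejected response. This minimal per-example version uses plain floats; the function and argument names are chosen here for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss. The implicit reward of a response is
    beta * (policy log-prob minus reference log-prob); the loss is
    -log sigmoid of the chosen-minus-rejected reward margin."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): minimized when the policy assigns the
    # chosen response a larger reward margin than the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss is log 2; it falls as the policy moves probability mass toward the preferred response relative to the reference, which is all the "secret reward model" amounts to.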
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing its learned skills to novel tasks.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
CHAL: Council of Hierarchical Agentic Language
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
-
Making Abstraction Concrete: A Design Space and Interaction Model of Abstraction in Interactive Systems
A survey of 457 papers yields a six-dimensional design space for abstraction in interactive systems that reframes gulfs of execution and evaluation while articulating cognitive and design processes for bridging abstraction gaps.
-
When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews
Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
-
FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Process Matters more than Output for Distinguishing Humans from Machines
A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning improving mimicry but limited cross-task transfer.
-
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N, P, K) as N, P, K → ∞, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obtained from covering numbers.
-
R³AG: Retriever Routing for Retrieval-Augmented Generation
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
-
River-LLM: Large Language Model Seamless Exit Based on KV Share
River-LLM enables seamless token-level early exit in decoder-only LLMs via a KV-shared river mechanism and similarity-based error prediction, delivering 1.71-2.16x practical speedup on reasoning tasks while preserving generation quality.
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
-
Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation
Heterogeneous LLM agents in supply chain simulations exhibit myopic self-interested behaviors that worsen inefficiencies, but information sharing mitigates these effects.
-
Bounded by Risk, Not Capability: Quantifying AI Occupational Substitution Rates via a Tech-Risk Dual-Factor Model
AI job substitution rates are limited by business risks such as liability and compliance rather than technical capability alone, resulting in high exposure for cognitive roles like data scientists and resilience for physical trades.
-
Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents
Persistent memory is necessary and sufficient for LLM poker agents to reach ToM levels 3-5 and use strategic deception, while agents without memory stay at level 0.
-
Can Humans Tell? A Dual-Axis Study of Human Perception of LLM-Generated News
Humans cannot reliably distinguish LLM-generated news from human-written news across multiple models, with domain expertise providing only modest help and fatigue reducing accuracy over time.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
RouteLLM: Learning to Route LLMs with Preference Data
Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
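The routing idea reduces to a thresholded predictor: a learned scorer estimates whether the cheap model will answer acceptably, and only uncertain queries pay for the strong model. A minimal sketch follows; all names are illustrative, not RouteLLM's actual API, and the win-probability predictor is the part RouteLLM trains on preference data.

```python
def route(query, win_prob, strong_llm, weak_llm, threshold=0.7):
    """Cost-aware routing: send the query to the weak (cheap) model
    when the predictor says it is likely to match the strong model,
    otherwise fall back to the strong model."""
    if win_prob(query) >= threshold:
        return weak_llm(query)      # cheap path: predicted good enough
    return strong_llm(query)        # expensive path: predicted too hard
```

Sweeping `threshold` traces out the cost-quality frontier: higher thresholds route more traffic to the strong model, recovering its quality at higher cost.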
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
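The automatic process labels can be sketched in a few lines: the quality of a partial solution is estimated by how often sampled completions from it reach the known final answer, so no human step annotations are needed. The callables below are stand-ins, not Math-Shepherd's actual pipeline.

```python
def step_score(complete, prefix, answer, n_samples=8):
    """Monte Carlo process label for a partial solution: sample
    completions of `prefix` and return the fraction that reach the
    known final `answer`. This score supervises a process reward
    model without human step-level annotation."""
    hits = sum(complete(prefix) == answer for _ in range(n_samples))
    return hits / n_samples
```

Steps scored this way then train a process reward model used both to verify candidate solutions and as a reward signal for reinforcement.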
-
SGLang: Efficient Execution of Structured Language Model Programs
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
-
Textbooks Are All You Need II: phi-1.5 technical report
phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
-
Reinforced Self-Training (ReST) for Language Modeling
ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
-
Textbooks Are All You Need
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
-
Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy
AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.
-
Optimized Deferral for Imbalanced Settings
MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classification and LLM routing tasks.
-
Strategic Polysemy in AI Discourse: A Philosophical Analysis of Language, Hype, and Power
AI discourse employs strategically polysemous terms that blend technical precision with anthropomorphic implications, enabling glosslighting that sustains hype and deflects scrutiny.
-
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
-
MAPLE: A Meta-learning Framework for Cross-Prompt Essay Scoring
MAPLE uses meta-learning with prototypical networks to learn transferable representations and achieves state-of-the-art cross-prompt essay scoring on ELLIPSE, LAILA, and parts of ASAP datasets.
-
Calibrating Model-Based Evaluation Metrics for Summarization
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.
-
The Cartesian Cut in Agentic AI
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
-
Regimes of Scale in AI Meteorology
AI/ML weather tools face integration challenges from mismatched 'regimes of scale' in how data and models are organized compared to traditional meteorology practices.
-
Reliability of Large Language Models for Design Synthesis: An Empirical Study of Variance, Prompt Sensitivity, and Method Scaffolding
Preference-based prompting raises LLM adherence to object-oriented design principles in UML generation but leaves substantial output variance and model-specific differences intact.
-
HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization
HUOZIIME is an on-device LLM-powered input method with post-training on synthesized data and hierarchical memory that achieves efficient execution and memory-driven personalization.