DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
4 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: baseline (1)
polarities: baseline (1)
citing papers explorer
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
  A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
- Region4Web: Rethinking Observation Space Granularity for Web Agents
  Region4Web shows that shifting web agent observation to functional regions instead of element-level granularity produces shorter, more effective state representations and raises task success on WebArena across multiple LLMs and agent methods.
- Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
  Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.
- JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
  JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.