BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, et al. · 2025 · arXiv 2504.19314

13 Pith papers cite this work. Polarity classification is still being indexed.


representative citing papers

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

A new workflow for multilingual agent benchmark adaptation using functional, cultural, and difficulty alignments improves non-English agent success rates by up to 32.7% over simple machine translation, indicating substantial benchmark-induced measurement error in prior multilingual evaluations.

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.

GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

cs.CL · 2026-04-18 · unverdicted · novelty 6.0

GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.

MARCA: A Checklist-Based Benchmark for Multilingual Web Search

cs.CL · 2026-04-15 · accept · novelty 6.0

MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.

Towards Knowledgeable Deep Research: Framework and Benchmark

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

LightThinker++: From Reasoning Compression to Memory Management

cs.CL · 2026-04-04 · unverdicted · novelty 6.0

LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

Mind DeepResearch Technical Report

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

GLM-5: from Vibe Coding to Agentic Engineering

cs.LG · 2026-02-17 · unverdicted · novelty 5.0

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

cs.CL · 2025-12-02 · unverdicted · novelty 5.0

DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

cs.AI · 2025-09-02 · conditional · novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

cs.AI · 2026-04-07 · unverdicted · novelty 4.0

A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.

citing papers explorer

Showing 13 of 13 citing papers.

Learning Agentic Policy from Action Guidance · cs.CL · 2026-05-12 · unverdicted · none · ref 84
GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation · cs.CL · 2026-04-27 · unverdicted · none · ref 7
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents · cs.AI · 2026-05-06 · unverdicted · none · ref 16
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data · cs.LG · 2026-04-21 · unverdicted · none · ref 15
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0) · cs.CL · 2026-04-18 · unverdicted · none · ref 27
MARCA: A Checklist-Based Benchmark for Multilingual Web Search · cs.CL · 2026-04-15 · accept · none · ref 32
Towards Knowledgeable Deep Research: Framework and Benchmark · cs.AI · 2026-04-09 · unverdicted · none · ref 45
LightThinker++: From Reasoning Compression to Memory Management · cs.CL · 2026-04-04 · unverdicted · none · ref 54
Mind DeepResearch Technical Report · cs.AI · 2026-04-16 · unverdicted · none · ref 56
GLM-5: from Vibe Coding to Agentic Engineering · cs.LG · 2026-02-17 · unverdicted · none · ref 63
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models · cs.CL · 2025-12-02 · unverdicted · none · ref 17
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning · cs.AI · 2025-09-02 · conditional · none · ref 89
Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration · cs.AI · 2026-04-07 · unverdicted · none · ref 30
