Batch-ICL: Effective, efficient, and order-agnostic in-context learning

Jianxiang Yu, Zichen Ding, Jiaqi Tan, Kangyang Luo, Zhenmin Weng, Chenghua Gong, Long Zeng, RenJing Cui, Chengcheng Han, Qiushi Sun, Zhiyong Wu, Yunshi Lan, Xiang Li · 2024 · DOI 10.18653/v1/2024.findings-

Canonical reference. 100% of citing Pith papers cite this work as background.

21 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

ActPlane: Programmable OS-Level Policy Enforcement for Agent Harnesses

cs.OS · 2026-06-23 · unverdicted · novelty 7.0

ActPlane enforces agent-declared policies at OS level using IFC DSL and eBPF, improving compliance on indirect paths with 1.9-8.4% overhead.

A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy

cs.CV · 2026-06-23 · unverdicted · novelty 7.0

White-box method ReXTrust achieves highest AUC (peak 93.0) on Gut-VLM across five VLMs, outperforming alternatives by statistically significant margins while black-box and some gray-box methods collapse on certain models.

Self-Improving In-Context Learning

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

A test-time zeroth-order optimization of prompt embeddings using a bounded self-supervised proxy from demonstration log-probabilities improves ICL accuracy and correlates with gains across tasks.

Code Generation by Differential Test Time Scaling

cs.SE · 2026-05-19 · unverdicted · novelty 7.0

DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.

SimDiff: Depth Pruning via Similarity and Difference

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

cs.IR · 2026-04-17 · unverdicted · novelty 7.0

LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,

PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

PRISM benchmark finds LLMs match or exceed humans on isolated review dimensions like novelty verification but none achieve the balanced performance of human reviewers across depth, flaw prioritization, and constructiveness.

Tracking Capabilities for Safer Agents

cs.AI · 2026-03-01 · unverdicted · novelty 6.0

AI agents can generate code in a capability-safe Scala dialect that statically prevents information leakage and malicious side effects while preserving task performance.

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

cs.CL · 2025-09-09 · unverdicted · novelty 6.0

VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserving normal performance.

What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience

cs.HC · 2026-05-18 · unverdicted · novelty 5.0

GPT produces click distributions significantly different from real humans in 53% of UX first-click tasks, with prompting techniques like personas and chain-of-thought failing to improve alignment.

Context Convergence Improves Answering Inferential Questions

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

Passages made from high-convergence sentences improve LLM performance on inferential questions compared to cosine similarity selection.

Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

cs.MM · 2026-04-28 · unverdicted · novelty 5.0

CUCI-Net abstracts context-utterance dependency into an interpretation cue that combines local modality signals with global context and feeds it into the final multimodal interaction for context-conditioned predictions.

STAR: Semantic-Tuned and Tail-Adaptive Retriever for Graph-Augmented Generation

cs.IR · 2026-04-11 · unverdicted · novelty 5.0

STAR is a semantic-tuned and tail-adaptive retriever for GraphRAG that uses cross-attention interaction learning and path-weighted contrastive learning to mitigate Semantic Shortcut Bias and Long-Tail Path Bias, reporting 1.8% retrieval and 2.2% QA gains.

A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems

cs.CY · 2026-04-06 · unverdicted · novelty 5.0

A multi-agent generate-validate-revise framework reduces failures in realism and authenticity for LLM-personalized math problems, with one iteration helping and different strategies varying by criterion.

Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

cs.AI · 2026-03-12 · unverdicted · novelty 5.0

Introduces Explicit Logic Channel (ELC) with LLM, VFM and probabilistic inference for validating, selecting and enhancing MLLMs on zero-shot tasks using Consistency Rate and cross-channel integration.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

cs.AI · 2025-01-27 · unverdicted · novelty 5.0

A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.

RAG-Enhanced Large Language Models for Dynamic Content Expiration Prediction in Web Search

cs.IR · 2026-05-13 · unverdicted · novelty 4.0

An LLM framework with RAG predicts query-specific validity horizons for web content expiration and shows gains in production A/B tests.

Multilingual Vision-Language Models, A Survey

cs.CL · 2025-09-26 · accept · novelty 3.0

The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.

IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research

cs.CL · 2025-07-21
