SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
Title resolution pending
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
HL-OutPaint enables high-resolution outpainting of long video sequences via a coarse-to-fine pipeline that first builds Global Coarse Guidance through global-local frame swapping then synthesizes details.
Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.
HORIZON creates a cross-domain, long-horizon user modeling benchmark from Amazon Reviews that tests generalization across time, domains, and unseen users, exposing gaps in sequential and LLM-based recommendation models.
OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.
BaLoRA is a Bayesian LoRA variant with input-adaptive noise that improves accuracy over standard LoRA and supplies well-calibrated uncertainty estimates on language, vision, and scientific prediction tasks.
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Agentic AI systems with DAG topologies are claimed to deliver exponentially superior generalization and sample efficiency compared to monolithic scaling for achieving AGI.
citing papers explorer
-
HORIZON: A Benchmark for In-the-wild User Behaviour Modeling
HORIZON creates a cross-domain, long-horizon user modeling benchmark from Amazon Reviews that tests generalization across time, domains, and unseen users, exposing gaps in sequential and LLM-based recommendation models.