Signature filtering learns unreliable tokens with MILP and removes them at detection time, raising true positive rates from 8-31% to 78-99% across Kgw, Sweet, Unigram, and Exp watermarks on multiple corpora and LLMs while controlling false positives.
hub Mixed citations
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
Mixed citation behavior. Most common role is background (67%).
abstract
Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. We hope that CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further and will host a competition and leaderboard to track the progress on the challenge. We are also keen on extending CodeSearchNet Challenge to more queries and programming languages in the future.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
On heterogeneous document collections, only query expansion and a newly introduced per-source calibrated corrector (SSCC) deliver reliable gains beyond a strong cross-encoder reranker; other common retrieval enhancements do not.
HybridCodeAuthorship is a new benchmark dataset of interleaved human-AI Python code that shows existing detection algorithms reach at most 0.48 F1 at chunk level and 0.56 F1 at line level.
CORE-Bench is a benchmark for code retrieval in agentic coding settings, built from curated tasks and SWE-bench instances, showing performance drops and gains from fine-tuning.
SWE-Explore is a new benchmark evaluating repository exploration by coding agents on 848 issues across 203 repositories, using line-level ground truth from successful agent trajectories and showing agentic methods outperform classical retrieval on coverage and ranking.
VISTA is a new benchmark for end-to-end visual spec-to-web-app generation by LLM agents, featuring five prompt conditions, manual UI annotations, multi-metric evaluation, and results on four agent systems showing partial decoupling of visual and functional performance.
Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
Real developer IDE traces differ substantially from LLM simulations in behavior and structure; current proactive assistants are unreliable on real traces, and simulated data cannot substitute for real data in training.
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
PuzzleMark provides a robust and imperceptible watermarking method for code datasets using adaptive variable name concatenation and statistical verification, achieving perfect detection rates with minimal performance impact.
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.
CodeMMR creates a unified embedding space for text, code, and images, outperforming baselines by 10 nDCG@10 points and boosting RAG code generation quality.
LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new BinDeObfBench.
Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.
OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.
InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on type inference, comment generation, and variable renaming.
CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.
A per-component SimHash fingerprint supplies structural identity for AI agent skills, recovering family membership under paraphrase and refactoring with AUC 0.974 while localizing changes.
Introduces AMALIA-VL, the first open-source instruction-tuned LVLM for European Portuguese, using a high-resolution vision encoder, pt-PT language model, learned connector, and three-stage training on a custom data mix.
Natural backdoors are prevalent in CodeLMs; the authors propose ScanNBT to detect them after analyzing differences from injected backdoors, transferability, and causes.
UniRTL unifies RTL code and CDFG through mutual masked modeling and hierarchical training with a graph-aware tokenizer, outperforming prior single-modality methods on performance prediction and code retrieval.
Code-QA-Bench uses an answer-first pipeline and three-condition experiments to generate 628 tasks across 10 Python repositories and quantify that code access drives most performance gains while documentation adds only modest benefit on doc-dependent tasks.
citing papers explorer
-
Test-Time Speculation
TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.
-
Nomic Embed: Training a Reproducible Long Context Text Embedder
Nomic AI produced and open-sourced a reproducible 8192-context English text embedder that exceeds OpenAI Ada-002 and text-embedding-3-small performance on MTEB short-context and LoCo long-context benchmarks.
-
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
CodeT5+ is a flexible encoder-decoder LLM family for code pretrained with diverse objectives on multilingual corpora and initialized from existing LLMs, achieving state-of-the-art results on code generation, completion, math programming, and retrieval tasks including new SoTA on HumanEval with the 1
-
Text and Code Embeddings by Contrastive Pre-Training
Contrastive pre-training on unsupervised data at scale creates text and code embeddings that set new state-of-the-art results on classification and semantic search benchmarks.
-
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
CodeT5 adds identifier-aware pre-training and bimodal dual generation to a T5-style encoder-decoder, yielding better results on defect detection, clone detection, and code-to-text, text-to-code, and code-to-code tasks than prior encoder-only or decoder-only models.
-
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
-
Search-R3: Unifying Reasoning and Embedding in Large Language Models
Search-R3 trains LLMs to output search embeddings as a direct product of step-by-step reasoning via supervised pre-training and a specialized RL environment that avoids full corpus re-encoding.
-
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
-
Towards General Text Embeddings with Multi-stage Contrastive Learning
GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
-
Retrieval-Augmented Generation for Large Language Models: A Survey
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.