super hub Mixed citations

Scalable training of

Andrew, Galen and Gao, Jianfeng

Mixed citation behavior. Most common role is unclear (64%).

132 Pith papers citing it

unclear 64% of classified citations


citation-role summary

background 9 other 2

citation-polarity summary

unclear 7 background 4
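
The 64% headline figure appears consistent with the polarity counts above (7 of 11 classified citations). A minimal sketch of that arithmetic follows. It assumes the percentage is each label's share of classified citations, rounded to the nearest whole number; the function name `label_shares` is illustrative and not part of the site.

```python
def label_shares(counts):
    """Return each label's share of the classified total as a whole-number percentage."""
    total = sum(counts.values())
    if total == 0:
        # No classified citations: nothing to report.
        return {}
    return {label: round(100 * n / total) for label, n in counts.items()}


# Counts copied from the citation-polarity summary above.
polarity_counts = {"unclear": 7, "background": 4}
shares = label_shares(polarity_counts)
print(shares)  # {'unclear': 64, 'background': 36}
```

Under this assumption, 7 / (7 + 4) ≈ 63.6%, which rounds to the 64% shown for "unclear".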

claims ledger

background vey of graph meets large language model: progress and future directions. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 8123-8131. Andrés Montoyo, Patricio Martínez-Barco, and Alexandra Balahur. 2012. Subjectivity and sentiment analysis: An overview of the current state of the area and envisaged developments. Decision Support Systems, 53(4):675-679. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? sentiment classificatio
background whether u is semantically broader than k. The two samples are expressed as $X=\{x_i\}_{i=1}^{n}$ and $Y=\{y_j\}_{j=1}^{m}$, (3) where $x_i = x_{u,i}$ and $y_j = x_{k,j}$, with $n$, $m$ fixed (typically, we subsample to a common size to control the variance across words). A natural null hypothesis is that the two words have the same dispersion but different mean directions. $H_0: \mathrm{disp}(X) = \mathrm{disp}(Y)$ with $\mathbb{E}[X] \neq \mathbb{E}[Y]$ allowed. (4) This is because the mean direction is a strong nuisance factor in contextual embedding spaces. Even if two wo
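background domain. Given $X_v \in \mathbb{R}^{S_v \times D_v}$ and $X_t \in \mathbb{R}^{S_t \times D_t}$, the goal is to refine $X_v$ by aggregating contextual information across scales. We define $N$ scales with two adapter sets: $\mathcal{G}=\{\mathcal{G}_1,\dots,\mathcal{G}_N\}$ (MGFA) and $\mathcal{C}=\{\mathcal{C}_1,\dots,\mathcal{C}_N\}$ (MCFA). At each scale $n$, features are reshaped to a grid $X_v^{(0)} \in \mathbb{R}^{H \times W \times D_v}$ and downsampled by $\mathrm{Down}(\cdot, 2^{n-1})$: $X_v^{(n)} = \mathrm{Down}(X_v^{(0)}, 2^{n-1})$. (4) Let $X_{v,n} = \mathrm{Seq}(X_v^{(n)})$ denote the flattened sequence. We then refine and fuse: $G_n = \mathcal{G}_n(X_{v,n})$, $C_n = \mathcal{C}_n(X_{v,n}, X_t)$, (5) $\tilde{X}_{v,n} = G_n + w\,C_n$, (6)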
other Question: Eukaryotic genes tend to consist of coding regions (exons) and non-coding regions (introns). The figure shows how such a gene leads to the production of a protein. Which of the following statements is true? A. Thymine content of (1) and (2) is approximately equal. B. The process occurring between (2) and (3) takes place in the cytosol. C. (4) can hybridise with (2). D. The number of amino acid residues in (5) must equal the number of nucleotide residues in (2). E. All processes occurri
background Question: Eukaryotic genes tend to consist of coding regions (exons) and non-coding regions (introns). The figure shows how such a gene leads to the production of a protein. Which of the following statements is true? A. Thymine content of (1) and (2) is approximately equal. B. The process occurring between (2) and (3) takes place in the cytosol. C. (4) can hybridise with (2). D. The number of amino acid residues in (5) must equal the number of nucleotide residues in (2). E. All processes occurri
other sharing & image reaction functions are integrated to add a multi-modal dimension to the long-term dialogues. The image sharing function is called when the agent decides to send an image. This process includes: (1) Generate a caption c for the intended image using M; (2) Convert the caption c into relevant keywords w using M; (3) Use the keywords k to find an image through web search WEB(k); (4) Share the chosen image. Conversely, the image reaction function is triggered upon receiving an im

authors

Andrew, Galen and Gao, Jianfeng

co-cited works

representative citing papers

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.

From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Introduces Causal Functional Signatures grounded in causal evidence and ILP-learned architectural signatures to enable explicit, comparable, and portable mechanistic claims across model scales.

Language-Switching Triggers Take a Latent Detour Through Language Models

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

An 8B autoregressive LM implements a language-switching backdoor via a three-phase circuit with early trigger composition, orthogonal mid-layer propagation, and final-layer MLP conversion, routed through a single-position serial bottleneck.

An Efficient Streaming Video Understanding Framework with Agentic Control

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

LongBEL improves biomedical entity linking consistency by combining full-document context with memory of previous predictions trained via cross-validation rather than gold labels.

Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

Semantic Softmax aggregates probabilities from semantic synonyms around target labels to correct renormalization bias in zero-shot LLM classification, lowering calibration error and raising AUROC and F1.

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.

Accurate and Efficient Statistical Testing for Word Semantic Breadth

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

A new permutation test uses Householder reflection to align word embedding clouds before testing dispersion differences, cutting Type-I error by 32.5% and speeding up 23x on GPU.

Logic-Regularized Verifier Elicits Reasoning from LLMs

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

S²R² improves robustness of LoRA-tuned LLMs to prompt perturbations by penalizing semantic-segment drift while preserving clean performance and cross-dataset transfer.

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.

Decoding Text Spans for Efficient and Accurate Named-Entity Recognition

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

SpanDec achieves competitive NER accuracy with improved efficiency by using a final-stage lightweight decoder for span representations and early candidate filtering to reduce redundant computation.

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

cs.SD · 2026-04-22 · unverdicted · novelty 7.0

Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.

Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

Structure Guided Retrieval-Augmented Generation for Factual Queries

cs.IR · 2026-04-21 · unverdicted · novelty 7.0

SG-RAG frames retrieval as subgraph matching to ensure LLMs meet every condition in factual queries and reports large gains over baselines on a new 120k-pair ERQA dataset.

From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and evaluation protocol.

Cell-Based Representation of Relational Binding in Language Models

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.

citing papers explorer

Showing 50 of 74 citing papers after filters.

Evaluating Very Long-Term Conversational Memory of LLM Agents cs.CL · 2024-02-27 · unverdicted · none · ref 4
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents cs.CL · 2026-05-21 · unverdicted · none · ref 7
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
Language-Switching Triggers Take a Latent Detour Through Language Models cs.CL · 2026-05-18 · unverdicted · none · ref 4
An 8B autoregressive LM implements a language-switching backdoor via a three-phase circuit with early trigger composition, orthogonal mid-layer propagation, and final-layer MLP conversion, routed through a single-position serial bottleneck.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation cs.CL · 2026-05-14 · unverdicted · none · ref 4
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking cs.CL · 2026-05-13 · unverdicted · none · ref 47
LongBEL improves biomedical entity linking consistency by combining full-document context with memory of previous predictions trained via cross-validation rather than gold labels.
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability cs.CL · 2026-05-12 · unverdicted · none · ref 4
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods cs.CL · 2026-05-10 · unverdicted · none · ref 22
Semantic Softmax aggregates probabilities from semantic synonyms around target labels to correct renormalization bias in zero-shot LLM classification, lowering calibration error and raising AUROC and F1.
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation cs.CL · 2026-05-08 · unverdicted · none · ref 4
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.
Accurate and Efficient Statistical Testing for Word Semantic Breadth cs.CL · 2026-05-08 · unverdicted · none · ref 4
A new permutation test uses Householder reflection to align word embedding clouds before testing dispersion differences, cutting Type-I error by 32.5% and speeding up 23x on GPU.
Logic-Regularized Verifier Elicits Reasoning from LLMs cs.CL · 2026-05-07 · unverdicted · none · ref 21
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models cs.CL · 2026-05-02 · unverdicted · none · ref 4
S²R² improves robustness of LoRA-tuned LLMs to prompt perturbations by penalizing semantic-segment drift while preserving clean performance and cross-dataset transfer.
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving cs.CL · 2026-04-23 · unverdicted · none · ref 130
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
Decoding Text Spans for Efficient and Accurate Named-Entity Recognition cs.CL · 2026-04-22 · unverdicted · none · ref 17
SpanDec achieves competitive NER accuracy with improved efficiency by using a final-stage lightweight decoder for span representations and early candidate filtering to reduce redundant computation.
Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context cs.CL · 2026-04-22 · unverdicted · none · ref 4
Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.
Cell-Based Representation of Relational Binding in Language Models cs.CL · 2026-04-21 · unverdicted · none · ref 4
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation cs.CL · 2026-04-20 · unverdicted · none · ref 4
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning cs.CL · 2026-04-19 · accept · none · ref 4
Cross-modal agreement between chain-of-thought and program-of-thought reasoning enables self-consistency with only two LLM samples, reducing sampling cost by 9.3x while improving accuracy.
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation cs.CL · 2025-02-28 · unverdicted · none · ref 5
CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.
Longformer: The Long-Document Transformer cs.CL · 2020-04-10 · accept · none · ref 30
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
On the Limits of Steering Vectors for Preference-Aligned Generation cs.CL · 2026-07-02 · unverdicted · none · ref 79
Empirical evaluation on the PLUME benchmark shows steering vectors vary widely in trait expressibility, degrade on task transfer, and lose effectiveness when multiple vectors are composed.
Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions cs.CL · 2026-07-01 · unverdicted · none · ref 48
Persona-driven generations by LLMs in MCQA tasks exhibit instability that differs systematically by model family, size, domain, and prompt format.
MetaHOPE: A Metaphor-Oriented Evaluation Framework for Analysing MT and LLM Translation Errors cs.CL · 2026-07-01 · unverdicted · none · ref 82
MetaHOPE is an error severity-aware annotation framework for metaphor translations, applied to three MT/LLM systems on English-Chinese and Chinese-English metaphor corpora with new parallel resources created.
YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese cs.CL · 2026-07-01 · unverdicted · none · ref 28
YOMI-Bench is a new benchmark of four tasks for kanji reading and phonological understanding in LLMs, showing low performance even for Japanese-specific and commercial models.
Multi-Turn Agentic Scientific Literature Search via Workflow Induction cs.CL · 2026-07-01 · unverdicted · none · ref 8
PaperPilot induces executable DAG workflows for multi-turn literature search and trains via imitation plus preference optimization, raising Hit@5 from 58.0 to 77.0 over a baseline agent.
GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis cs.CL · 2026-05-21 · unverdicted · none · ref 4
GHI introduces an incidence-based structural reasoning layer using Graphormer on conditioned hypergraphs for ABSA, reporting outperformance on SemEval benchmarks, near-parity with 11B models at 247M parameters, and robustness on ARTS.
Token-weighted Direct Preference Optimization with Attention cs.CL · 2026-05-21 · unverdicted · none · ref 4
AttentionPO weights tokens in DPO using LLM attention as a pairwise judge, yielding better results on AlpacaEval, MT-Bench, and ArenaHard than prior preference optimization methods.
Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation cs.CL · 2026-05-20 · unverdicted · none · ref 7
DPR-BAG generates biomedical abstracts from full texts via BOMRC decomposition, parallel LLM summarization, and refinement, showing higher abstractive novelty than baselines while preserving factual consistency on a 46k-article PMC dataset.
ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation cs.CL · 2026-05-19 · unverdicted · none · ref 21
ContextRAG constructs extraction-free hierarchical graphs via residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic on embeddings, using 30 LLM calls and 22k tokens to reach 33.6% F1 on a 130-task UltraDomain subset.
AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code cs.CL · 2026-05-18 · unverdicted · none · ref 4
AutoVecCoder combines VecPrompt for automated intrinsic knowledge synthesis and VecRL for efficiency-aligned RL to train an 8B LLM that achieves SOTA on SimdBench SSE/AVX subsets and sometimes exceeds -O3 compiler results.
Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory cs.CL · 2026-05-15 · unverdicted · none · ref 4
Proposes a three-level taxonomy of Cultural Awareness, Cultural Sensitivity, and Cultural Competence for AI evaluation, grounded in intercultural communication scholarship to improve validity in multicultural contexts.
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents cs.CL · 2026-05-13 · unverdicted · none · ref 196 · 2 links
A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, outperforming baselines.
STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes cs.CL · 2026-05-13 · unverdicted · none · ref 4
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation cs.CL · 2026-05-12 · unverdicted · none · ref 37
Summing outputs from separately trained QLoRA PEFT modules provides strong performance for attribute-controlled text generation, often matching or exceeding single-task modules even on single-attribute tests.
Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling cs.CL · 2026-05-09 · unverdicted · none · ref 11
Context-Aligned Contrastive Regression combines cross-view context alignment and ordinal soft contrastive learning with ridge ensembles to improve lexical difficulty prediction across L1 backgrounds on three datasets.
Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls cs.CL · 2026-05-04 · unverdicted · none · ref 95
Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning cs.CL · 2026-05-02 · unverdicted · none · ref 4
Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
DWTSumm: Discrete Wavelet Transform for Document Summarization cs.CL · 2026-04-22 · unverdicted · none · ref 4
DWT decomposes sentence- or word-level embeddings into multi-resolution components that preserve semantics for direct or LLM-guided summarization, yielding up to 97% fidelity and gains in BERTScore and semantic metrics over GPT-4o baselines on clinical and legal benchmarks.
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives cs.CL · 2026-04-22 · unverdicted · none · ref 173
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
mllm-shap: A Shapley Value Explainability Platform for Text-Audio Multimodal Large Language Models cs.CL · 2026-04-21 · unverdicted · none · ref 31
mllm-shap is an open-source Python platform extending Shapley Value explainability to text-audio Multimodal LLMs via modality-aware masking and phonetic token grouping.
Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs cs.CL · 2026-04-21 · unverdicted · none · ref 179
Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.
Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA cs.CL · 2026-04-19 · unverdicted · none · ref 57
Social identity markers in medical questions degrade LLM accuracy and uncertainty calibration, producing a calibration crisis that is non-additive for intersectional cases.
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback cs.CL · 2023-09-01 · conditional · none · ref 60
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 63
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation cs.CL · 2021-09-02 · conditional · none · ref 81
CodeT5 adds identifier-aware pre-training and bimodal dual generation to a T5-style encoder-decoder, yielding better results on defect detection, clone detection, and code-to-text, text-to-code, and code-to-code tasks than prior encoder-only or decoder-only models.
BamiBERT: A New BERT-based Language Model for Vietnamese cs.CL · 2026-07-02 · unverdicted · none · ref 49
BamiBERT is a new base-sized Vietnamese BERT model trained on raw text that outperforms PhoBERT on 11 of 15 metrics across 8 benchmarks.
FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning cs.CL · 2026-07-01 · unverdicted · none · ref 4
FaithMed applies reinforcement learning with process-level rewards derived from evidence-based medicine rubrics to improve both task performance and reasoning faithfulness in medical LLMs.
CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models cs.CL · 2026-07-01 · unverdicted · none · ref 51
CAT uses intrinsic confidence signals in preference optimization to adapt reasoning length in LRMs, outperforming uniform compression baselines on accuracy across benchmarks.
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media cs.CL · 2026-05-20 · unverdicted · none · ref 4
Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 4
Unlearned language models retain low calibration error but show increased shortcut reliance on the TOFU benchmark, extending the reliability paradox to machine unlearning.
PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling cs.CL · 2026-05-19 · unverdicted · none · ref 4 · 2 links
PromptRad reformulates multi-label radiology report classification as masked language modeling and enriches verbalizers with UMLS synonyms, outperforming baselines with only 32 training examples.
