super hub Mixed citations

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Bin Xu, Bowen Wang, Chenhui Zhang, Dan Zhang, Da Yin, Team GLM: Aohan Zeng · 2024 · cs.CL · arXiv 2406.12793

Mixed citation behavior. Most common role is background (58%).

131 Pith papers citing it

Background 58% of classified citations

open full Pith review browse 131 citing papers more from Bin Xu arXiv PDF

abstract

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) touse -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging face in the year 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 14 baseline 9 method 2 other 1

citation-polarity summary

background 15 baseline 9 use method 2

claims ledger

abstract We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achiev

authors

Bin Xu Bowen Wang Chenhui Zhang Dan Zhang Da Yin Team GLM: Aohan Zeng

co-cited works

representative citing papers

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

cs.CL · 2026-06-15 · unverdicted · novelty 8.0 · 2 refs

MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.

Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

Asuka-Bench is a new benchmark of 50 web tasks with 784 criteria that evaluates 8 LLMs in 2 frameworks on multi-round refinement, finding a 38-point spread in weighted task pass rate and a top score of only 52% after three rounds.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

Defines representational capacity as the upper bound on distinguishable near-orthogonal directions in transformer latent spaces, derived from embedding similarity distributions and an adjusted Johnson-Lindenstrauss formula dependent on the k/d ratio.

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

cs.DL · 2026-05-30 · unverdicted · novelty 7.0

RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.

Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Text2CAD-Bench supplies 600 dual-prompt examples across four geometric and domain levels to test LLMs on text-to-parametric CAD, finding solid basic performance but sharp drops on complex topology and advanced features.

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Language models show a scale-dependent switch from anticorrelated to correlated reasoning-truthfulness coupling at a family-specific critical parameter count, with architecture and data choices shifting the transition point.

PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

cs.CL · 2026-05-10 · conditional · novelty 7.0

K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

cs.CL · 2026-05-10 · unverdicted · novelty 7.0 · 2 refs

LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving

cs.OS · 2026-05-05 · unverdicted · novelty 7.0

Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.

OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

FinSafetyBench shows that LLMs remain vulnerable to adversarial prompts that bypass financial compliance safeguards, with notably higher failure rates in Chinese-language scenarios.

citing papers explorer

Showing 50 of 131 citing papers.

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio cs.CL · 2026-06-15 · unverdicted · none · ref 11 · 2 links · internal anchor
MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.
CHASM: Unveiling Covert Advertisements on Chinese Social Media cs.LG · 2026-04-22 · unverdicted · none · ref 36 · internal anchor
CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems cs.CR · 2026-04-03 · unverdicted · none · ref 16 · internal anchor
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing cs.CV · 2026-02-04 · unverdicted · none · ref 12 · internal anchor
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 18 · internal anchor
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration cs.CV · 2026-06-26 · unverdicted · none · ref 2 · internal anchor
AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.
CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes cs.CR · 2026-06-24 · unverdicted · none · ref 33 · internal anchor
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement cs.SE · 2026-06-04 · unverdicted · none · ref 45 · internal anchor
Asuka-Bench is a new benchmark of 50 web tasks with 784 criteria that evaluates 8 LLMs in 2 frameworks on multi-round refinement, finding a 38-point spread in weighted task pass rate and a top score of only 52% after three rounds.
Diffusing in the Right Space: A Systematic Study of Latent Diffusability cs.CV · 2026-06-02 · unverdicted · none · ref 18 · internal anchor
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs cs.CV · 2026-06-01 · unverdicted · none · ref 78 · internal anchor
AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.
Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models cs.LG · 2026-06-01 · unverdicted · none · ref 21 · internal anchor
Defines representational capacity as the upper bound on distinguishable near-orthogonal directions in transformer latent spaces, derived from embedding similarity distributions and an adjusted Johnson-Lindenstrauss formula dependent on the k/d ratio.
RWGBench: Evaluating Scholarly Positioning in Related Work Generation cs.DL · 2026-05-30 · unverdicted · none · ref 14 · internal anchor
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 97 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality? cs.AI · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation cs.LG · 2026-05-18 · unverdicted · none · ref 36 · internal anchor
Text2CAD-Bench supplies 600 dual-prompt examples across four geometric and domain levels to test LLMs on text-to-parametric CAD, finding solid basic performance but sharp drops on complex topology and advanced features.
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 37 · internal anchor
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.
Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling cs.LG · 2026-05-13 · unverdicted · none · ref 23 · internal anchor
Language models show a scale-dependent switch from anticorrelated to correlated reasoning-truthfulness coupling at a family-specific critical parameter count, with architecture and data choices shifting the transition point.
PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments cs.RO · 2026-05-12 · unverdicted · none · ref 20 · internal anchor
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs cs.CL · 2026-05-10 · conditional · none · ref 5 · internal anchor
K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs cs.CL · 2026-05-10 · unverdicted · none · ref 23 · 2 links · internal anchor
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 135 · internal anchor
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving cs.OS · 2026-05-05 · unverdicted · none · ref 14 · internal anchor
Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice cs.CL · 2026-05-02 · unverdicted · none · ref 41 · internal anchor
OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.
FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios cs.CL · 2026-05-01 · unverdicted · none · ref 3 · internal anchor
FinSafetyBench shows that LLMs remain vulnerable to adversarial prompts that bypass financial compliance safeguards, with notably higher failure rates in Chinese-language scenarios.
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework cs.LG · 2026-05-01 · unverdicted · none · ref 4 · internal anchor
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction cs.CV · 2026-04-26 · unverdicted · none · ref 7 · internal anchor
ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning for VRDU.
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs cs.CV · 2026-04-25 · unverdicted · none · ref 7 · internal anchor
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts cs.CL · 2026-04-20 · unverdicted · none · ref 74 · internal anchor
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment cs.CL · 2026-04-17 · unverdicted · none · ref 11 · internal anchor
C-Mining automatically mines high-fidelity Culture Points from raw multilingual text by treating cross-lingual geometric isolation in embeddings as a quantifiable signal for cultural specificity, then uses them to synthesize better instruction data.
TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice cs.CL · 2026-04-10 · unverdicted · none · ref 19 · internal anchor
TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace cs.AI · 2026-04-09 · unverdicted · none · ref 16 · internal anchor
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill cs.CL · 2026-04-08 · conditional · none · ref 14 · internal anchor
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation cs.CL · 2026-03-05 · unverdicted · none · ref 1 · internal anchor
IF-RewardBench uses preference graphs for listwise evaluation of judge models on instruction-following, exposing deficiencies in current judges and achieving stronger correlation with downstream task performance than existing benchmarks.
GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning cs.AI · 2026-03-02 · unverdicted · none · ref 16 · internal anchor
GraphScout trains LLMs to autonomously synthesize structured training data from knowledge graphs via flexible exploration tools, enabling a 4B model to outperform larger LLMs by 16.7% on average with fewer inference tokens and strong cross-domain transfer.
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body cs.CV · 2025-12-16 · unverdicted · none · ref 30 · internal anchor
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition cs.AI · 2025-11-26 · unverdicted · none · ref 25 · internal anchor
SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving cs.AI · 2025-09-22 · unverdicted · none · ref 16 · internal anchor
EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning cs.CV · 2024-12-31 · accept · none · ref 57 · internal anchor
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation cs.CL · 2024-12-19 · unverdicted · none · ref 14 · internal anchor
S^2-Bench is a new one-to-many benchmark for natural language-driven molecule generation with three tasks, and OpenMolIns is an instruction dataset enabling Llama3.1-8B to outperform GPT-4o and Claude-3.5 on it.
Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM-Generated Code cs.SE · 2026-06-28 · unverdicted · none · ref 15 · internal anchor
Mandatory per-line citations in SDD frameworks reduce LLM output determinism but enable reliable automated hallucination detection (TDR 86-88%, FPR 0%), a trade-off replicated across Claude and GLM models.
Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory cs.CL · 2026-06-27 · unverdicted · none · ref 19 · internal anchor
A hybrid attention mechanism with editable request-local memory slots and sparse fallback achieves high accuracy on synthetic overwrite, version, and anti-pollution tasks where pure fixed-state or sparse methods fail, while identifying open-domain selection as the remaining bottleneck.
Express Language Modeling cs.LG · 2026-06-09 · unverdicted · none · ref 12 · internal anchor
Express converts non-causal attention approximations to causal versions, achieving log^{3/2}(n)/s error with O(s) memory and O(s^2 log^2(n)) overhead when combined with Thinformer.
Large-scale semantic mapping of learner agency and autonomy reveals what measurement and generative AI research overlook cs.AI · 2026-06-09 · unverdicted · none · ref 49 · internal anchor
Semantic mapping of 8,954 definitions and 2,700 scales from 14,000+ papers shows learner agency and autonomy span task regulation, personal motivation, and sociocultural dimensions, with existing scales and generative AI research underrepresenting the sociocultural dimension.
IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking cs.CL · 2026-06-08 · unverdicted · none · ref 27 · internal anchor
IS-CoT framework interleaves planning, writing, and reflection in LLMs to prevent length collapse, yielding IS-Writer-8B that outperforms larger models on long-form benchmarks with better length compliance.
Cross-LLM Consistency in Inference: Evidence from Shared Interactions cs.AI · 2026-06-06 · unverdicted · none · ref 14 · internal anchor
LLMs share lower-order interaction patterns for token prediction from identical prompts, with stronger consistency in advanced models, identified via interaction-based explanations.
The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs cs.CL · 2026-06-05 · unverdicted · none · ref 53 · internal anchor
Using a 1PL IRT model on real cultural questions across 13 locales, the study identifies a local-language knowledge-access advantage masked by lower proficiency in raw accuracy.
GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning cs.CV · 2026-06-03 · unverdicted · none · ref 13 · internal anchor
GOPAgen proposes integrating video codec GOPs with a motion agent, GOP tree reasoning, structural memory, and motion vector database to improve efficiency and motion detail in agentic long-video VQA, reporting gains on MotionBench and EgoSchema.
Do Gender Cues Affect LLM Value Trade-offs? Evidence from a Controlled Decision Benchmark cs.CL · 2026-06-01 · unverdicted · none · ref 45 · internal anchor
Explicit gender cues induce bounded but systematic decision flips in LLMs on value trade-offs, with self-attributions frequently denying gender influence.
Towards Efficient LLMs Annealing with Principled Sample Selection cs.CL · 2026-05-29 · unverdicted · none · ref 16 · internal anchor
DiReCT reformulates LLM annealing sample selection as a constrained optimization problem that enforces per-sample gradient directions aligned with the loss landscape's curvature.
EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs cs.AI · 2026-05-28 · unverdicted · none · ref 25 · internal anchor
EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer