super hub Mixed citations

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Bin Xu, Bowen Wang, Chenhui Zhang, Dan Zhang, Da Yin, Team GLM: Aohan Zeng · 2024 · cs.CL · arXiv 2406.12793

Mixed citation behavior. Most common role is background (58%).

126 Pith papers citing it

Background 58% of classified citations

open full Pith review browse 126 citing papers more from Bin Xu arXiv PDF

abstract

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) touse -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging face in the year 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 14 baseline 9 method 2 other 1

citation-polarity summary

background 15 baseline 9 use method 2

claims ledger

abstract We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achiev

authors

Bin Xu Bowen Wang Chenhui Zhang Dan Zhang Da Yin Team GLM: Aohan Zeng

co-cited works

representative citing papers

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.

Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

Asuka-Bench is a new benchmark of 50 web tasks with 784 criteria that evaluates 8 LLMs in 2 frameworks on multi-round refinement, finding a 38-point spread in weighted task pass rate and a top score of only 52% after three rounds.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

Defines representational capacity as the upper bound on distinguishable near-orthogonal directions in transformer latent spaces, derived from embedding similarity distributions and an adjusted Johnson-Lindenstrauss formula dependent on the k/d ratio.

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

cs.DL · 2026-05-30 · unverdicted · novelty 7.0

RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.

Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Text2CAD-Bench supplies 600 dual-prompt examples across four geometric and domain levels to test LLMs on text-to-parametric CAD, finding solid basic performance but sharp drops on complex topology and advanced features.

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Language models show a scale-dependent switch from anticorrelated to correlated reasoning-truthfulness coupling at a family-specific critical parameter count, with architecture and data choices shifting the transition point.

PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

cs.CL · 2026-05-10 · conditional · novelty 7.0

K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

cs.CL · 2026-05-10 · unverdicted · novelty 7.0 · 2 refs

LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving

cs.OS · 2026-05-05 · unverdicted · novelty 7.0

Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.

OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

FinSafetyBench shows that LLMs remain vulnerable to adversarial prompts that bypass financial compliance safeguards, with notably higher failure rates in Chinese-language scenarios.

From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.

citing papers explorer

Showing 26 of 126 citing papers.

Disposition Distillation at Small Scale: A Three-Arc Negative Result cs.LG · 2026-04-13 · accept · none · ref 7 · internal anchor
Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.
MAFIG: Multi-agent Driven Formal Instruction Generation Framework cs.AI · 2026-04-13 · unverdicted · none · ref 41 · internal anchor
MAFIG uses a Perception Agent and Emergency Decision Agent plus span-focused local distillation to let lightweight models rapidly generate formal instructions that fix local scheduling failures, achieving over 94% success with sub-second latency on port, warehousing, and deck datasets.
A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement cs.CL · 2025-07-14 · unverdicted · none · ref 21 · internal anchor
SMCS coordinates 15 open-source LLMs via retrieval-based prior selection and exploration-exploitation posterior enhancement, outperforming GPT-4.1 by 5.36% and GPT-o3-mini by 5.28% on eight benchmarks.
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models cs.CV · 2025-05-21 · unverdicted · none · ref 60 · internal anchor
LENS is a new multi-level benchmark dataset for evaluating MLLMs on perception-to-reasoning tasks using the same images across all levels with recent social media content.
Advancing AI Research Assistants with Expert-Involved Learning cs.AI · 2025-05-03 · unverdicted · none · ref 36 · internal anchor
ARIEL evaluates LLMs and LMMs on full-length biomedical summarization and figure interpretation with blinded expert review, identifies limitations, and demonstrates gains from prompt engineering, fine-tuning, and an integrated agent for hypothesis generation.
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs cs.IR · 2025-04-22 · unverdicted · none · ref 33 · internal anchor
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model eess.IV · 2025-04-09 · unverdicted · none · ref 11 · internal anchor
Q-Agent uses CoT decomposition on a fine-tuned MLLM for multi-degradation perception plus IQA-driven greedy selection of restoration algorithms to claim better performance than All-in-One IR models.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 214 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling cs.CV · 2025-01-21 · unverdicted · none · ref 11 · internal anchor
InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities likeobject
How Does A Text Preprocessing Pipeline Affect Ontology Matching? cs.CL · 2024-11-06 · unverdicted · none · ref 56 · internal anchor
Phase 1 preprocessing steps improve ontology matching more than Phase 2 steps on OAEI tracks, and two new repair methods reduce false mappings.
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling cs.CV · 2024-10-08 · unverdicted · none · ref 52 · internal anchor
PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.
CogVLM2: Visual Language Models for Image and Video Understanding cs.CV · 2024-08-29 · conditional · none · ref 17 · internal anchor
CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 44 · internal anchor
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning cs.CL · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
LegalGraphRAG adds hierarchical organization to legal knowledge graphs and a multi-agent verification loop to reach claimed state-of-the-art accuracy and trustworthiness on legal reasoning benchmarks.
Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate cs.AI · 2026-05-17 · unverdicted · none · ref 101 · internal anchor
TIDE integrates trial and debate mechanisms to improve criteria-based prompt optimization for argumentative essay tasks including automated scoring, component detection, and relation identification.
The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next cs.LG · 2026-05-13 · unverdicted · none · ref 23 · internal anchor
Frontier models show positive capability coupling (r=0.72) across SWE-bench and GPQA, with lab-specific emphasis shifts measured by an h-field residual that distinguishes permanent pretraining changes from reversible post-training ones.
From Memorization to Creation: Evaluating the Cognitive Depth of LLM-Generated Educational Questions cs.HC · 2026-05-06 · unverdicted · none · ref 11 · internal anchor
Evaluation of LLMs shows that specific prompting can increase higher-order questions by 11.53% and reduce repetitiveness by 24.45% in question generation for education.
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models cs.CL · 2024-03-20 · unverdicted · none · ref 11 · internal anchor
LlamaFactory provides a unified no-code framework for efficient fine-tuning of 100+ LLMs via an integrated web UI and has been released on GitHub.
LexPath: A domain-oriented multi-path framework for legal article retrieval cs.IR · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
LexPath combines IRAC-guided sparse retrieval, structure-guided dense retrieval, and intent-aware reranking to outperform standard lexical, dense, hybrid, and RAG baselines on legal article retrieval benchmarks.
XekRung Technical Report cs.CR · 2026-04-30 · unverdicted · none · ref 14 · internal anchor
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain cs.CV · 2026-01-14 · unverdicted · none · ref 30 · internal anchor
A 7B-parameter domain-specific image captioning model for ICT, trained in three stages on synthesized and annotated data, outperforms 32B-parameter general models on BLEU and expert accuracy metrics.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey cs.CV · 2025-03-16 · unverdicted · none · ref 2 · internal anchor
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio cs.CL · 2026-06-15 · unreviewed · ref 13 · internal anchor
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems cs.CR · 2026-04-08 · unreviewed · ref 6 · internal anchor
Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation cs.CV · 2026-04-02 · unreviewed · ref 13 · internal anchor
OM4OV: Leveraging Ontology Matching for Ontology Versioning cs.AI · 2024-09-30 · unreviewed · ref 23 · internal anchor

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer