SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heterogeneous sources.
C-pack: Packaged resources to advance general chinese embedding
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
A VLM-based emotion recognition system for human-robot collaboration achieves higher semantic and sentiment alignment with human annotations than a CNN baseline and results in preferred adaptive robot behavior in a user study.
LASAR uses two-stage supervised training plus reinforcement learning to ground semantic IDs, align latent reasoning trajectories to CoT hidden states via KL divergence, and adaptively choose reasoning depth, halving average steps while improving quality on three datasets.
MAS-Algorithm is a multi-agent workflow that improves AI acceptance rates on algorithmic problems by 6.48% on average, outperforming parameter-efficient fine-tuning.
Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
Nomic AI produced and open-sourced a reproducible 8192-context English text embedder that exceeds OpenAI Ada-002 and text-embedding-3-small performance on MTEB short-context and LoCo long-context benchmarks.
citing papers explorer
-
SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory
SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heterogeneous sources.
-
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
-
"I'm Not Mad, Just Focused'': Understanding Human Emotions in Human-Robot Collaboration
A VLM-based emotion recognition system for human-robot collaboration achieves higher semantic and sentiment alignment with human annotations than a CNN baseline and results in preferred adaptive robot behavior in a user study.
-
LASAR: Latent Adaptive Semantic Aligned Reasoning for Generative Recommendation
LASAR uses two-stage supervised training plus reinforcement learning to ground semantic IDs, align latent reasoning trajectories to CoT hidden states via KL divergence, and adaptively choose reasoning depth, halving average steps while improving quality on three datasets.
-
MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System
MAS-Algorithm is a multi-agent workflow that improves AI acceptance rates on algorithmic problems by 6.48% on average, outperforming parameter-efficient fine-tuning.
-
Supervising the search process produces reliable and generalizable information-seeking agents
Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
Nomic Embed: Training a Reproducible Long Context Text Embedder
Nomic AI produced and open-sourced a reproducible 8192-context English text embedder that exceeds OpenAI Ada-002 and text-embedding-3-small performance on MTEB short-context and LoCo long-context benchmarks.