LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
hub Canonical reference
Attention is all you need
Canonical reference. 70% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CLAIR recovers the shared LoRA subspace and detects contaminated clients in heterogeneous federated settings through structured low-rank plus block-sparse decomposition, with theoretical recovery guarantees and empirical gains over local fine-tuning.
Proposes Rodrigues Network using a learnable Neural Rodrigues Operator to add kinematic inductive biases for improved robot action learning and prediction.
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.
PEPS decomposes positional encodings into projected points with unique frequency-dependent motions to support more efficient learned grid-based encodings in INRs, outperforming prior methods on image, texture, and SDF tasks with often 25% fewer parameters.
GraphCBMs extend concept bottleneck models by building latent concept graphs to model correlations between concepts, yielding better image classification accuracy, more informative structure for interpretability, and stronger intervention results.
AttnTrace is an attention-weight-based context traceback method for LLMs that claims higher accuracy and efficiency than prior art like TracLLM while aiding prompt injection detection.
InfoRidge reveals a non-monotonic pattern in which predictive mutual information between hidden states and outputs peaks in intermediate layers before declining in final layers.
PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.
BAC accelerates transformer-based Diffusion Policy up to 3x by block-level adaptive feature caching using an Adaptive Caching Scheduler and Bubbling Union Algorithm to control error propagation.
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
Hunyuan-DiT is a new multi-resolution diffusion transformer that achieves state-of-the-art Chinese text-to-image generation through custom architecture, data pipelines, and multimodal caption refinement.
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
HoloMotion-1 trains a MoE Transformer policy on hybrid video and MoCap motion data to achieve robust zero-shot tracking that transfers directly to real humanoid robots.
StreamPhy introduces an end-to-end streaming framework using state-space models and an expressive FT-FiLM decoder to infer continuous physical dynamics from irregular sparse data, claiming 48% better accuracy and 20-100X faster inference than diffusion baselines.
NEMORI is an adaptive memory distillation framework for LLM agents that transforms raw interactions into narratives and extracts insights via prediction error to decide what deserves retention.
A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.
citing papers explorer
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment
CLAIR recovers the shared LoRA subspace and detects contaminated clients in heterogeneous federated settings through structured low-rank plus block-sparse decomposition, with theoretical recovery guarantees and empirical gains over local fine-tuning.
-
Rodrigues Network for Learning Robot Actions
Proposes Rodrigues Network using a learnable Neural Rodrigues Operator to add kinematic inductive biases for improved robot action learning and prediction.
-
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
-
CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment
CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.
-
PEPS: Positional Encoding Projected Sampling -- Extended
PEPS decomposes positional encodings into projected points with unique frequency-dependent motions to support more efficient learned grid-based encodings in INRs, outperforming prior methods on image, texture, and SDF tasks with often 25% fewer parameters.
-
Graph Concept Bottleneck Models
GraphCBMs extend concept bottleneck models by building latent concept graphs to model correlations between concepts, yielding better image classification accuracy, more informative structure for interpretability, and stronger intervention results.
-
AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption
AttnTrace is an attention-weight-based context traceback method for LLMs that claims higher accuracy and efficiency than prior art like TracLLM while aiding prompt injection detection.
-
The Generalization Ridge: Information Flow in Natural Language Generation
InfoRidge reveals a non-monotonic pattern in which predictive mutual information between hidden states and outputs peaks in intermediate layers before declining in final layers.
-
PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention
PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.
-
Block-wise Adaptive Caching for Accelerating Diffusion Policy
BAC accelerates transformer-based Diffusion Policy up to 3x by block-level adaptive feature caching using an Adaptive Caching Scheduler and Bubbling Union Algorithm to control error propagation.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
-
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Hunyuan-DiT is a new multi-resolution diffusion transformer that achieves state-of-the-art Chinese text-to-image generation through custom architecture, data pipelines, and multimodal caption refinement.
-
SGLang: Efficient Execution of Structured Language Model Programs
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
-
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
-
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
-
CoCa: Contrastive Captioners are Image-Text Foundation Models
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
-
HoloMotion-1 Technical Report
HoloMotion-1 trains a MoE Transformer policy on hybrid video and MoCap motion data to achieve robust zero-shot tracking that transfers directly to real humanoid robots.
-
StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models
StreamPhy introduces an end-to-end streaming framework using state-space models and an expressive FT-FiLM decoder to infer continuous physical dynamics from irregular sparse data, claiming 48% better accuracy and 20-100X faster inference than diffusion baselines.
-
What Deserves Memory: Adaptive Memory Distillation for LLM Agents
NEMORI is an adaptive memory distillation framework for LLM agents that transforms raw interactions into narratives and extracts insights via prediction error to decide what deserves retention.
-
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.
-
TrustLLM: Trustworthiness in Large Language Models
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
-
Mixtral of Experts
Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.
-
Mistral 7B
Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.
-
Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.
-
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
ROS-LLM integrates LLMs with ROS to let non-experts specify robot tasks in natural language, supporting sequence, behavior tree, and state machine modes plus imitation learning and reflection on feedback.
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
-
Ligandformer: A Graph Neural Network for Predicting Compound Property with Robust Interpretation
Ligandformer is a self-attention graph neural network framework that predicts compound properties, outputs attention maps for local structural interpretation, and claims improved robustness and generalization over prior methods.
-
Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.