The first public dataset of 10,217 GPT-Image-2 generated images sourced from Twitter in the week after release, with CLIP taxonomy, OCR, face detection, clustering analyses, and a finding that C2PA provenance data is stripped on upload.
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
Mixed citation behavior. Most common role is background (50%).
abstract
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.
representative citing papers
t-SNE converges in the large-data limit to a non-convex variational energy with attraction and repulsion terms that admits a unique smooth minimizer but infinitely many discontinuous ones in one dimension.
Adversarial smuggling attacks encode harmful content into human-readable visuals that evade MLLM detection, achieving over 90% attack success rates on models like GPT-5 and Qwen3-VL via the new SmuggleBench benchmark.
FPR manipulation attack perturbs benign MQTT packets to flip labels to attacks in NIDS with 80-100% success, increasing SOC delays without gradient-based methods.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
The paper introduces #PraCegoVer, the first large-scale image captioning dataset in Portuguese sourced from Instagram posts with single user-generated captions per image.
GoodQ uses generative models with information-dense prompting, distribution-aware selection, and teacher-guided noise reduction to achieve SOTA low-bit (W4A4) and extreme-bit (W3A3) zero-shot quantization for object detectors.
SOCP uses self-organizing maps for unsupervised group discovery to enable local calibration in conformal prediction, reducing regional coverage gaps on benchmarks with small set-size increases while preserving validity guarantees.
On heterogeneous document collections, only query expansion and a newly introduced per-source calibrated corrector (SSCC) deliver reliable gains beyond a strong cross-encoder reranker; other common retrieval enhancements do not.
A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.
Transformer representations form trajectories showing semantic convergence in middle-to-late layers, higher curvature on reasoning tasks, bifurcation on ambiguous tokens, and a consistent three-phase cosine similarity pattern across GPT-2, TinyLlama, and Qwen2.5.
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
An exact algebraic identity plus low-rank SVD and Haar-measure null-space approximation reduce per-point mean curvature cost from O(m^4) to O(k^2 m + k m p^2) with 50-300x speedups and negligible accuracy loss.
RedZeD presents a new theoretical framework and algorithm for faster persistent homology computation on Vietoris-Rips filtrations.
FiSeR uses coarse contrastive separation of natural vs synthetic images plus fine contrastive grouping by generator identity to improve cross-domain AUROC by +10.22 over DIRE baseline on multiple test sets.
LLM residual streams during addition form an Iso-Raw-Sum Trajectory anchored by digit semantics and modulated by continuous carry signals, with errors arising as geometric slippages across quantization thresholds in a noisy model.
GlucoFM decomposes CGM traces into dual state-event streams, pretrains on 109k hours of unlabeled data, and reports superior subject-disjoint performance on seven clinical tasks across four cohorts.
ScaleMAP is a dimensionality-reduction method that preserves both neighborhood structure and local density by scaling embedding displacements with original local radii, matching DensMAP on density while retaining UMAP-level neighborhood fidelity.
CoP achieves over 90% of per-instance SAM performance on cell-type benchmarks with one click per type via recursive non-parametric expansion of reliable same-type points.
COLAGUARD matches explicit-reasoning guardrail performance on safety benchmarks while delivering 12.9X speedup and 22.4X token reduction by propagating hidden states instead of generating text.
A Riemannian geodesic framework for label-free manifold steering in language models via a schema-supervised encoder approximating output Hellinger distance on activations.
Successor representation training on natural language causes part-of-speech categories to emerge spontaneously in the learned embeddings, with structure varying by predictive horizon.
A 527-item GDPR-aligned privacy preference item bank was developed by extracting 669 statements from 99 GDPR articles and validating them through multi-round expert consensus and semantic clustering.
citing papers explorer
- Physics-informed, Generative Adversarial Design of Funicular Shells
  A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.
- Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning
  SABLE shows that semantics-aware natural triggers enable effective backdoor attacks in federated learning against multiple aggregation rules while preserving benign accuracy.
- A Large-Scale Comparative Analysis of Imputation Methods for Single-Cell RNA Sequencing Data
  A large benchmark finds traditional imputation methods for scRNA-seq data generally outperform deep learning ones, but numerical recovery does not reliably improve biological downstream analyses and no method wins across all settings.
- Behavioral Integrity Verification for AI Agent Skills
  BIV audits AI agent skills at scale, finding 80% deviate from declared behavior on 49,943 skills and achieving 0.946 F1 for malicious skill detection.
- LEGO-MOF: Equivariant Latent Manipulation for Editable, Generative, and Optimizable MOF Design
  LEGO-MOF maps MOF linkers to an equivariant latent space for continuous editing and uses test-time optimization to achieve a 147.5% average boost in pure CO2 uptake while preserving structural validity.
- TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection
  TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts while matching supervised ML on lung cancer and outperforming single-agent baselines.
- Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
  DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
- Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments
  AMC models memory consolidation via a Liquid-Glass-Crystal process governed by an SDE with proven convergence to a Beta distribution, yielding 34-43% better forward transfer and 67-80% less forgetting on standard continual RL benchmarks.
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
  FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
- A foundation model for atomistic materials chemistry
  MACE-MP-0 is a general-purpose atomistic ML force field trained on public data that enables stable simulations of diverse chemical systems with qualitative and sometimes quantitative accuracy, serving as a starting point for fine-tuning.
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
  RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
- CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection
  CAST selects better multimodal coresets by fusing collapse-aware topologies across modalities and matching distributions at multiple scales in the diffusion wavelet domain.
- Collaboration, Integration, and Thematic Exploration in European Framework Programmes: A Longitudinal Network Analysis
  EU Framework Programmes have increased participation equity and integrated new countries through collaboration, yet research remains concentrated on established trajectories rather than broadly exploratory.
- FastUMAP: Scalable Dimensionality Reduction via Bipartite Landmark Sampling
  FastUMAP approximates UMAP via sparse bipartite point-landmark graphs and Nystrom initialization to deliver lower runtimes than Barnes-Hut t-SNE on most tested datasets while retaining competitive kNN accuracy.
- AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
  AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.
- Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation
  A distillation technique embeds LLM-generated textual user profiles into efficient sequential recommenders without runtime LLM inference, architectural changes, or fine-tuning.