{"total":30,"items":[{"citing_arxiv_id":"2605.21217","ref_index":34,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment","primary_cat":"stat.ML","submitted_at":"2026-05-20T14:12:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CLAIR recovers the shared LoRA subspace and detects contaminated clients in heterogeneous federated settings through structured low-rank plus block-sparse decomposition, with theoretical recovery guarantees and empirical gains over local fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15336","ref_index":7,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HoloMotion-1 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-14T18:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HoloMotion-1 trains a MoE Transformer policy on hybrid video and MoCap motion data to achieve robust zero-shot tracking that transfers directly to real humanoid robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07384","ref_index":11,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T07:40:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StreamPhy introduces an end-to-end streaming framework using state-space models and an expressive FT-FiLM decoder to infer continuous physical dynamics from irregular sparse data, claiming 48% better accuracy and 20-100X faster inference than diffusion 
baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"or joint embedding predictive architectures [8], and then learn a generative model over the latent variables. During inference, techniques such as diffusion posterior sampling (DPS) [9] are employed to incorporate observations. Despite their strong empirical performance, these approaches typically rely on vectorized or matrix-form representations, making them readily compatible with off-the-shelf AI models (e.g., CNNs [10] and Transformers [11]) but less effective at explicitly capturing the intrinsic spatiotemporal structure of physical fields. Another line of work adopts a tensor decomposition perspective, often combined with functional representation learners such as implicit neural representations (INRs) to accommodate continuous domains [12, 13, 14, 15]. By explicitly modeling the multilinear structure of spatiotemporal data, tensor-based methods are better suited to"},{"citing_arxiv_id":"2605.06702","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment","primary_cat":"cs.AI","submitted_at":"2026-05-05T12:16:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24167","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PEPS: Positional Encoding Projected Sampling -- 
Extended","primary_cat":"cs.CV","submitted_at":"2026-04-27T08:23:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEPS decomposes positional encodings into projected points with unique frequency-dependent motions to support more efficient learned grid-based encodings in INRs, outperforming prior methods on image, texture, and SDF tasks with often 25% fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07392","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making","primary_cat":"cs.LG","submitted_at":"2026-04-08T06:14:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.14255","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Graph Concept Bottleneck Models","primary_cat":"cs.LG","submitted_at":"2025-08-19T20:23:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GraphCBMs extend concept bottleneck models by building latent concept graphs to model correlations between concepts, yielding better image classification accuracy, more informative structure for interpretability, and stronger intervention 
results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.03793","ref_index":56,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption","primary_cat":"cs.CL","submitted_at":"2025-08-05T17:56:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AttnTrace is an attention-weight-based context traceback method for LLMs that claims higher accuracy and efficiency than prior art like TracLLM while aiding prompt injection detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.03341","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"What Deserves Memory: Adaptive Memory Distillation for LLM Agents","primary_cat":"cs.AI","submitted_at":"2025-08-05T11:41:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NEMORI is an adaptive memory distillation framework for LLM agents that transforms raw interactions into narratives and extracts insights via prediction error to decide what deserves retention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.05387","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Generalization Ridge: Information Flow in Natural Language Generation","primary_cat":"cs.CL","submitted_at":"2025-07-07T18:18:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InfoRidge reveals a non-monotonic pattern in which predictive mutual information between hidden states and outputs peaks in intermediate layers before declining in final 
layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.13674","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention","primary_cat":"cs.CL","submitted_at":"2025-06-16T16:30:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.13456","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Block-wise Adaptive Caching for Accelerating Diffusion Policy","primary_cat":"cs.AI","submitted_at":"2025-06-16T13:14:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BAC accelerates transformer-based Diffusion Policy up to 3x by block-level adaptive feature caching using an Adaptive Caching Scheduler and Bubbling Union Algorithm to control error propagation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.02618","ref_index":48,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rodrigues Network for Learning Robot Actions","primary_cat":"cs.RO","submitted_at":"2025-06-03T08:34:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Rodrigues Network using a learnable Neural Rodrigues Operator to add kinematic inductive biases for improved robot action learning and 
prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.08223","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices","primary_cat":"cs.DC","submitted_at":"2025-03-11T09:41:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This marks a transformative shift in AI computation, where the need for computing power is expanding at an unprecedented pace, pushing the limits of existing hardware and infrastructure. Moore's Law is slowing down. Moore's Law, which has driven the growth in computing power for decades, is slowing down as we approach the physical limits of silicon-based chip technology [40]. The difficulty in shrinking transistors has led to diminishing returns in computational performance. As a result, the AI industry is relying more on specialized hardware like GPUs, TPUs, and custom chips to meet growing demands. 
However, this shift has made high-performance hardware even more expensive and exclusive, further intensifying the gap between organizations with the resources to"},{"citing_arxiv_id":"2409.18869","ref_index":86,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Emu3: Next-Token Prediction is All You Need","primary_cat":"cs.CV","submitted_at":"2024-09-27T16:06:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"understanding task, we assess the average scores across twelve benchmarks: SEEDBench-Img [45], OCRBench [59] (with normalized results), MMVet [98], POPE [51], VQAv2 [27], GQA [34], TextVQA [78], ChartQA [61], AI2D [36], RealWorldQA [91], MMMU [99], and MMbench [58]. For the video generation task, we present comparison results of VBench. 1 Introduction Next-token prediction has revolutionized the field of language models [86, 69, 9], enabling breakthroughs like ChatGPT [64] and sparking discussions about the early signs of artificial general intelligence (AGI) [10]. However, the applicability of this paradigm to multimodal models remains unclear, with limited evidence of its efficacy in achieving competitive performance across different tasks. 
In the realm of multimodal models, vision generation has been dominated by complex diffusion"},{"citing_arxiv_id":"2407.01284","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","primary_cat":"cs.AI","submitted_at":"2024-07-01T13:39:08+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"I think, therefore I am. \" - René Descartes Human cognitive and reasoning patterns have profoundly shaped the progress of deep learning [1]. Initially, the design of neural networks [2] is inspired by the brain's neuronal mechanisms. It uses convolution kernels and hierarchical network to mimic human cognitive process of knowledge acquisition. Recently, Transformers [3] employ attention mechanisms to handle multiple information ∗Equal contribution. †Corresponding author Preprint. Under review. 
arXiv:2407.01284v1 [cs.AI] 1 Jul 2024 [Figure labels: Understanding and Conversion of Units; Angles and Length; Calculation of Solid Figures; Understanding of Solid Figures; Calculation of Plane Figures; Understanding of Plane Figures; Basic Transformations of Figures]"},{"citing_arxiv_id":"2406.19741","ref_index":53,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning","primary_cat":"cs.RO","submitted_at":"2024-06-28T08:28:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ROS-LLM integrates LLMs with ROS to let non-experts specify robot tasks in natural language, supporting sequence, behavior tree, and state machine modes plus imitation learning and reflection on feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.17557","ref_index":13,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale","primary_cat":"cs.CL","submitted_at":"2024-06-25T13:50:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"from contributing datasets, we also release datatrove [12], the data processing library we developed to create FineWeb. On the whole, our work represents a significant step towards improving public knowledge and resources for curating LLM pre-training datasets. 
2 Background In this work, we focus on the curation of training datasets for autoregressive Transformer-based large language models (LLMs) [13]. At their core, LLMs aim to produce a distribution over the next token of text conditioned on past tokens, where each token is typically a word or subword unit [ 3]. The generality of this paradigm allows LLMs to be applied to virtually any text-based task by formulating a prefix whose continuation corresponds to performing the task (e.g. \"The cat sat on the mat translated"},{"citing_arxiv_id":"2405.08748","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding","primary_cat":"cs.CV","submitted_at":"2024-05-14T16:33:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hunyuan-DiT is a new multi-resolution diffusion transformer that achieves state-of-the-art Chinese text-to-image generation through custom architecture, data pipelines, and multimodal caption refinement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.16994","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning","primary_cat":"cs.CV","submitted_at":"2024-04-25T19:29:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra 
parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.05561","ref_index":80,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TrustLLM: Trustworthiness in Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-01-10T22:07:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Our findings indicate that: 1) Proprietary LLMs like GPT-4 and open-source LLMs like LLama2 often struggle to provide truthful responses when relying solely on their internal knowledge. This issue is primarily due to noise in their training data, including misinformation or outdated information, and the lack of generalization capability in the underlying Transformer architecture [80]. 2) Furthermore, all LLMs face challenges in zero-shot commonsense reasoning tasks, suggesting difficulty in tasks that are relatively *In this work, utility refers to the functional effectiveness of the model in natural language processing tasks, including abilities in logical reasoning, content summarization, text generation, and so on. 
"},{"citing_arxiv_id":"2401.04088","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Mixtral of Experts","primary_cat":"cs.LG","submitted_at":"2024-01-08T18:47:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.07104","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SGLang: Efficient Execution of Structured Language Model Programs","primary_cat":"cs.AI","submitted_at":"2023-12-12T09:34:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Additionally, different program instances often share some common parts (e.g., system prompts). These scenarios create many shared prompt prefixes during execution, leading to numerous opportunities for reusing the KV cache. During LLM inference, the KV cache stores intermediate tensors from the forward pass, reused for decoding future tokens. They are named after key-value pairs in the self-attention mechanism [51]. KV cache computation depends only on prefix tokens. Therefore, requests with the same prompt prefix can reuse the KV cache, reducing redundant computation and memory usage. More background and some examples are provided in Appendix A. 
Given the KV cache reuse opportunity, a key challenge in optimizing SGLang programs is reusing the KV cache across multiple calls and instances."},{"citing_arxiv_id":"2310.06825","ref_index":27,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Mistral 7B","primary_cat":"cs.CL","submitted_at":"2023-10-10T17:54:58+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.10253","ref_index":58,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts","primary_cat":"cs.AI","submitted_at":"2023-09-19T02:19:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.06571","ref_index":58,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ModelScope Text-to-Video Technical Report","primary_cat":"cs.CV","submitted_at":"2023-08-12T13:53:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation 
metrics.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"follows LDM [46] but modifies it to the video generation task. Text-to-image synthesis via diffusion models. By receiving knowledge from natural language instructions (e.g., CLIP [42] and T5 [43]), diffusion models can be utilized for text-to-image synthesis. LDM [46] designed language-conditioned image generator by augmenting the UNet backbone [47] with cross-attention layers [58]. DALL-E 2 [44] generated image embeddings for a diffusion decoder with CLIP text encoder. The concurrent work, Imagen [48], found the scalability of T5, which means increasing the size of T5 could boost image fidelity and language-image alignment. Building on existing image generation framework [46, 48], Imagic [27] achieves text-based semantic image"},{"citing_arxiv_id":"2306.03310","ref_index":66,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning","primary_cat":"cs.AI","submitted_at":"2023-06-05T23:32:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"process a sequence of encoded visual information. The language instruction is incorporated into the ResNet features using the FiLM method [50] and added to the LSTM inputs, respectively. 
RESNET-T architecture [75] uses a similar ResNet-based visual backbone, but a transformer decoder [66] as the temporal backbone to process outputs from ResNet, which are a temporal sequence of visual tokens. The language embedding is treated as a separate token in inputs to the transformer alongside the visual tokens. The VIT-T architecture [31], which is widely used in visual-language tasks, uses a Vision Transformer (ViT) as the visual backbone and a transformer decoder as the temporal backbone. (Footnote 4: A suite of 10 tasks is enough to observe catastrophic forgetting while maintaining computation efficiency.)"},{"citing_arxiv_id":"2206.10789","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Scaling Autoregressive Models for Content-Rich Text-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2022-06-22T01:11:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":", DALL-E [2] and CogView [3], has made significant progress in generating high-fidelity images and demonstrating generalization capabilities to unseen combinations of objects and concepts. Both treat the task as a form of language modeling, from textual descriptions into visual words, and use modern sequence-to-sequence architectures like Transformers [4] to learn the relationship between language inputs and visual outputs. A key component of these approaches is the conversion of each image into a sequence of discrete units through the use of an image tokenizer such as dVAE [5] or VQ-VAE [6]. 
Visual tokenization essentially unifies the view of text and images so that both can be treated simply as sequences of discrete tokens, and thus amenable to sequence-to-sequence models."},{"citing_arxiv_id":"2205.01917","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CoCa: Contrastive Captioners are Image-Text Foundation Models","primary_cat":"cs.CV","submitted_at":"2022-05-04T07:01:14+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2202.10873","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Ligandformer: A Graph Neural Network for Predicting Compound Property with Robust Interpretation","primary_cat":"q-bio.BM","submitted_at":"2022-02-21T15:46:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Ligandformer is a self-attention graph neural network framework that predicts compound properties, outputs attention maps for local structural interpretation, and claims improved robustness and generalization over prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}