pith. machine review for the scientific record. sign in

arxiv: 2410.21276 · v1 · submitted 2024-10-25 · 💻 cs.CL · cs.AI· cs.CV· cs.CY· cs.LG· cs.SD· eess.AS

Recognition: unknown

GPT-4o System Card

Adam Lerer, Adam Perelman, Adam P. Goucher, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander M\k{a}dry, Alexander Kirillov, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alexi Christakis, Alexis Conneau, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Ali Kamali, Allan Jabri, Allison Moyer, Allison Tam, Amadou Crookes, Amin Tootoochian, Amin Tootoonchian, Ananya Kumar, Andrea Vallone, Andrej Karpathy, Andrew Braunstein, Andrew Cann, Andrew Codispoti, Andrew Galu, Andrew Kondrich, Andrew Tulloch, Andrey Mishchenko, Angela Baek, Angela Jiang, Antoine Pelisse, Antonia Woodford, Anuj Gosalia, Arka Dhar, Ashley Pantuliano, Avi Nayak, Avital Oliver, Barret Zoph, Behrooz Ghorbani, Benjamin Zweig, Ben Leimberger, Ben Rossen, Ben Sokolowsky, Ben Wang, Beth Hoover, Blake Samic, Bobby Spero, Bob McGrew, Bogo Giertler, Bowen Cheng, Brad Lightcap, Brandon Walkin, Brendan Quinn, Brian Guarraci, Brian Hsu, Bright Kellogg, Brydon Eastman, Camillo Lugaresi, Carroll Wainwright, Cary Bassin, Cary Hudson, Casey Chu, Chad Nelson, Chak Li, Chan Jun Shern, Channing Conger, Charlotte Barette, Chelsea Voss, Chen Ding, Cheng Lu, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christina Kim, Christine Choi, Christine McLeavey, Christopher Hesse, Claudia Fischer, Clemens Winter, Coley Czarnecki, Colin Jarvis, Colin Wei, Constantin Koumouzelis, Dane Sherburn, Daniel Kappler, Daniel Levin, Daniel Levy, David Carr, David Farhi, David Mely, David Robinson, David Sasaki, Denny Jin, Dev Valladares, Dimitris Tsipras, Doug Li, Duc Phong Nguyen, Duncan Findlay, Edede Oiwoh, Edmund Wong, Ehsan Asdar, Elizabeth Proehl, Elizabeth Yang, Eric Antonow, Eric Kramer, Eric Peterson, Eric Sigler, Eric Wallace, Eugene Brevdo, Evan Mays, Farzad Khorasani, Felipe Petroski Such, Filippo Raso, Francis Zhang, Freddie Sulit, Fred von Lohmann, Gabriel Goh, Gene Oden, Geoff Salmon, Giulio Starace, Greg Brockman, Hadi Salman, Haiming Bao, Haitang Hu, Hannah Wong, Haoyu Wang, Heather Schmidt, Heather Whitney, Heewoo Jun, Hendrik Kirchner, Henrique Ponde de Oliveira Pinto, Hongyu Ren, Huiwen Chang, Hyung Won Chung, Ian Kivlichan, Ian O'Connell, Ian Osband, Ian Silber, Ian Sohl, Ibrahim Okuyucu, Ikai Lan, Ilya Kostrikov, Ilya Sutskever, Ingmar Kanitscheider, Ishaan Gulrajani, Jacob Coxon, Jacob Menick, Jakub Pachocki, James Aung, James Betker, James Crooks, James Lennon, Jamie Kiros, Jane Park, Jan Leike, Jason Kwon, Jason Phang, Jason Teplitz, Jason Wei, Jason Wolfe, Jay Chen, Jeff Harris, Jenia Varavva, Jessica Gan Lee, Jessica Shieh, Jiahui Yu, Jiayi Weng, Jieqi Yu, Jie Tang, Ji Lin, Joanne Jang, Joaquin Quinonero Candela, Joe Beutler, Joe Landers, Joel Parish, Johannes Heidecke, John Schulman, Jonathan Lachman, Jonathan McKay, Jonathan Uesato, Jonathan Ward, Jong Wook Kim, Joost Huizinga, Jordan Sitkin, Josh Gross, Josh Kaplan, Josh Snyder, Joshua Achiam, Jos Kraaijeveld, Joyce Lee, Joy Jiao, Juntang Zhuang, Justyn Harriman, Kai Fricke, Kai Hayashi, Karan Singhal, Katy Shi, Kavin Karthik, Kayla Wood, Kendra Rimbach, Kenny Hsu, Kenny Nguyen, Keren Gu-Lemberg, Kevin Button, Kevin Liu, Kiel Howe, Krithika Muthukumar, Kyle Luther, Lama Ahmad, Larry Kai, Lauren Itow, Lauren Workman, Leher Pathak, Leo Chen, Lia Guy, Liam Fedus, Liang Zhou, Lien Mamitsuka, Li Jing, Lilian Weng, Lindsay McCallum, Lindsey Held, Long Ouyang, Louis Feuvrier, Lukas Kondraciuk, Lukasz Kaiser, Luke Hewitt, Luke Metz, Lu Zhang, Lyric Doshi, Mada Aflak, Maddie Simens, Madelaine Boyd, Madeleine Thompson, Marat Dukhan, Mark Chen, Mark Gray, Mark Hudnall, Marvin Zhang, Marwan Aljubeh, Mateusz Litwin, Matthew Zeng, Max Johnson, Mayank Gupta, Maya Shetty, Meghan Shah, Mehmet Yatbaz, Mengchao Zhong, Meng Jia Yang, Mia Glaese, Mianna Chen, Michael Janner, Michael Lampe, Michael Petrov, Michael Wu, Michele Wang, Michelle Fradin, Michelle Pokrass, Miguel Castro, Miguel Oom Temudo de Castro, Mikhail Pavlov, Miles Brundage, Miles Wang, Minal Khan, Mira Murati, Mo Bavarian, Molly Lin, Murat Yesildal, Nacho Soto, Natalia Gimelshein, Natalie Cone, Natalie Staudacher, Natalie Summers, Natan LaFontaine, Neil Chowdhury, Nick Ryder, Nick Stathas, Nick Turley, Niko Felix, Nik Tezak, Nithanth Kudige, Nitish Keskar, Noah Deutsch, Noel Bundick, Nora Puckett, Ofir Nachum, Ola Okelola, Oleg Boiko, Oleg Murk, Oliver Jaffe, Olivia Watkins, Olivier Godement, OpenAI: Aaron Hurst, Owen Campbell-Moore, Patrick Chao, Paul McMillan, Pavel Belov, Peng Su, Peter Bak, Peter Bakkum, Peter Deng, Peter Dolan, Peter Hoeschele, Peter Welinder, Philippe Tillet, Philip Pronin, Phil Tillet, Prafulla Dhariwal, Qiming Yuan, Rachel Dias, Rachel Lim, Rahul Arora, Rajan Troll, Randall Lin, Rapha Gontijo Lopes, Raul Puri, Reah Miyara, Reimar Leike, Renaud Gaubert, Reza Zamani, Ricky Wang, Rob Donnelly, Rob Honsby, Rocky Smith, Rohan Sahai, Rohit Ramchandani, Romain Huet, Rory Carmichael, Rowan Zellers, Roy Chen, Ruby Chen, Ruslan Nigmatullin, Ryan Cheu, Saachi Jain, Sam Altman, Sam Schoenholz, Sam Toizer, Samuel Miserendino, Sandhini Agarwal, Sara Culver, Scott Ethersmith, Scott Gray, Sean Grove, Sean Metzger, Shamez Hermani, Shantanu Jain, Shengjia Zhao, Sherwin Wu, Shino Jomoto, Shirong Wu, Shuaiqi (Tony) Xia, Sonia Phene, Spencer Papay, Srinivas Narayanan, Steve Coffey, Steve Lee, Stewart Hall, Suchir Balaji, Tal Broda, Tal Stramer, Tao Xu, Tarun Gogineni, Taya Christianson, Ted Sanders, Tejal Patwardhan, Thomas Cunninghman, Thomas Degry, Thomas Dimson, Thomas Raoux, Thomas Shadwell, Tianhao Zheng, Todd Underwood, Todor Markov, Toki Sherbakov, Tomer Kaftan, Tom Rubin, Tom Stasi, Tristan Heywood, Troy Peterson, Tyce Walters, Tyna Eloundou, Valerie Qi, Veit Moeller, Vinnie Monaco, Vishal Kuo, Vlad Fomenko, Wayne Chang, Weiyi Zheng, Wenda Zhou, Wesam Manassra, Will Sheu, Wojciech Zaremba, Yash Patil, Yilei Qian, Yongjik Kim, Youlong Cheng, Yuchen He, Yuchen Zhang, Yujia Jin, Yunxing Dai, Yury Malkov, Yu Zhang

classification 💻 cs.CL cs.AIcs.CVcs.CYcs.LGcs.SDeess.AS
keywords gpt-4otextaudiocapabilitiescardimagesystemvision
0
0 comments X
read the original abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

    cs.CR 2026-05 unverdicted novelty 8.0

    M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but inco...

  2. From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

    cs.SE 2026-04 unverdicted novelty 8.0

    MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusin...

  3. SecGoal: A Benchmark for Security Goal Extraction and Formalization from Protocol Documents

    cs.CR 2026-04 unverdicted novelty 8.0

    The paper presents SecGoal, the first expert-annotated benchmark for security goal extraction from protocol documents, and demonstrates that fine-tuned 7B/9B parameter models achieve over 80% F1 score, outperforming l...

  4. CHASM: Unveiling Covert Advertisements on Chinese Social Media

    cs.LG 2026-04 unverdicted novelty 8.0

    CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

  5. HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

    cs.SD 2026-04 unverdicted novelty 8.0

    HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...

  6. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  7. HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

    cs.CR 2026-04 unverdicted novelty 8.0

    Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

  8. ReConText3D: Replay-based Continual Text-to-3D Generation

    cs.CV 2026-04 conditional novelty 8.0

    ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

  9. MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

    cs.CV 2026-04 unverdicted novelty 8.0

    MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...

  10. MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

    cs.CV 2026-04 unverdicted novelty 8.0

    MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...

  11. DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

    cs.AI 2026-04 unverdicted novelty 8.0

    DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

  12. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  13. SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

  14. ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

  15. Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.

  16. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

  17. Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.

  18. UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

  19. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  20. FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems

    cs.CR 2026-05 unverdicted novelty 7.0

    FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.

  21. Count Anything at Any Granularity

    cs.CV 2026-05 unverdicted novelty 7.0

    Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...

  22. Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding

    cs.CL 2026-05 unverdicted novelty 7.0

    ChartCF achieves strong chart understanding performance in VLMs using significantly less training data by generating code-based counterfactuals, selecting similar samples, and performing multimodal preference optimization.

  23. OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

    cs.CV 2026-05 conditional novelty 7.0

    OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene g...

  24. PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

    cs.AI 2026-05 unverdicted novelty 7.0

    PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

  25. FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.

  26. How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

    cs.CL 2026-05 unverdicted novelty 7.0

    Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...

  27. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 7.0

    SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.

  28. V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

  29. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  30. Sketch-based Access Control: A Multimodal Interface for Translating User Preferences into Intent-Aligned Policies

    cs.HC 2026-05 unverdicted novelty 7.0

    SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.

  31. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  32. OZ-TAL: Online Zero-Shot Temporal Action Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

  33. TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

  34. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 unverdicted novelty 7.0

    EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.

  35. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 conditional novelty 7.0

    EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.

  36. AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

    cs.AI 2026-05 unverdicted novelty 7.0

    AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.

  37. LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification

    cs.AI 2026-05 conditional novelty 7.0

    LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.

  38. Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

  39. PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    PolarVLM integrates polarimetric physical parameters into VLMs via dual-stream architecture and progressive training, outperforming RGB baselines by 25.4% on a new 75K-pair polarization-aware VQA benchmark.

  40. PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    PolarVLM is the first VLM framework to integrate polarimetric physical parameters via dual-stream architecture and progressive training, delivering 25.4% gains over RGB baselines on reflection and transparency tasks w...

  41. Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.

  42. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  43. HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.

  44. Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.

  45. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  46. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  47. LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

    eess.IV 2026-05 unverdicted novelty 7.0

    LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constr...

  48. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  49. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  50. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.

  51. MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition

    cs.AI 2026-05 unverdicted novelty 7.0

    MolRecBench-Wild reveals that 18 existing OCSR models suffer severe performance drops on complex real-world academic molecular images compared with prior patent benchmarks.

  52. DataDignity: Training Data Attribution for Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.

  53. FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

    cs.CV 2026-05 unverdicted novelty 7.0

    FlowDIS uses flow matching to transport image distributions to mask distributions, optionally conditioned on text, and outperforms prior DIS methods by 5.5% on F_beta^omega and 43% on MAE.

  54. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  55. Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    A technique identifies minimal convergence-divergence points in LLM transformer blocks and calibrates residual-stream directions to achieve targeted ethical-framework control at inference time.

  56. POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

    cs.SE 2026-05 unverdicted novelty 7.0

    POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

  57. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  58. Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios

    cs.AI 2026-05 unverdicted novelty 7.0

    ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.

  59. MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

  60. TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks

    cs.CV 2026-05 unverdicted novelty 7.0

    TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.