{"total":27,"items":[{"citing_arxiv_id":"2606.25478","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TACO: Towards Task-Consistent Open-Vocabulary Adaptation in Video Recognition","primary_cat":"cs.CV","submitted_at":"2026-06-24T07:06:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TACO proposes Relative Structure Distillation and a lightweight specialization projection to mitigate inconsistency between fine-tuning and evaluation objectives in open-vocabulary video recognition, claiming state-of-the-art results on cross-dataset and base-to-novel benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17590","ref_index":112,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization","primary_cat":"cs.CV","submitted_at":"2026-06-16T06:52:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TivTok factorizes video clips into reusable time-invariant tokens and frame-specific time-variant tokens via Scope-Induced Factorization and Invariant Broadcasting, achieving 2.91x better compression for 128-frame videos on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09156","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniGen-AR: AutoRegressive Any-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-08T07:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal 
attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03578","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Diffusing in the Right Space: A Systematic Study of Latent Diffusability","primary_cat":"cs.CV","submitted_at":"2026-06-02T12:47:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25343","ref_index":292,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multi-task long-video understanding. Streaming Understanding OVO-Bench [288] Multi Online perception with backward tracing of past events. StreamingBench [289] Acc./Lat. Video comprehension under latency constraints. OmniMMI [290] Multi Multimodal streaming interaction evaluation. Generation UCF-101 [291] FVD Action-class video generation distributional metric. 
Kinetics-600 [292] FVD Large-scale action distribution for video FVD. VBench [150] Multi Temporal consistency, motion smoothness, aesthetics. SeedVideoBench 2.0 [293] 6-dim Motion, prompt adherence, A/V sync. Arena.AI [294] Elo Community-scale human-preference Elo ranking. Table 3.Summary of major evaluation benchmarks for native multimodal models. Each benchmark is shown on its own row to"},{"citing_arxiv_id":"2605.23288","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-22T07:01:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SimVA constructs a 4D similarity volume over video tokens and action classes then applies spatial, motion-aware, and Mamba-based temporal aggregation to achieve competitive zero-shot and few-shot performance on open-vocabulary action recognition benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22819","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cambrian-P: Pose-Grounded Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-21T17:59:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age.T-RO, 2017. 
[17] Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025. [18] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600.arXiv preprint arXiv:1808.01340, 2018. [19] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset.arXiv preprint arXiv:1907.06987, 2019. [20] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia."},{"citing_arxiv_id":"2605.20838","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"USV: Towards Understanding the User-generated Short-form Videos","primary_cat":"cs.CV","submitted_at":"2026-05-20T07:27:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07859","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-08T15:20:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00434","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations","primary_cat":"cs.CV","submitted_at":"2026-05-01T06:11:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LIMSSR reformulates incomplete multimodal learning as LLM-driven sequence-to-score reasoning with prompt-guided imputation and mask-aware aggregation, outperforming baselines on action quality assessment without complete training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18367","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EAST: Early Action Prediction Sampling Strategy with Token Masking","primary_cat":"cs.CV","submitted_at":"2026-04-20T14:57:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU60, SSv2, and UCF101.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17062","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition","primary_cat":"cs.CV","submitted_at":"2026-04-18T16:34:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Motion separation modules plus negative prompts improve 
CLIP-based zero-shot video action recognition on standard benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13667","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage","primary_cat":"cs.CV","submitted_at":"2026-04-15T09:35:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HELIX is the first end-to-end neural codec jointly optimizing video compression and DNA encoding via tokens, achieving 1.91 bits per nucleotide with Kronecker mixing and FSM mapping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08050","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning","primary_cat":"cs.CV","submitted_at":"2026-04-09T09:58:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Transformer MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.09985","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","primary_cat":"cs.AI","submitted_at":"2025-06-11T17:57:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-JEPA 2 pre-trained on massive 
unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.11149","ref_index":107,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Survey of Action Quality Assessment: Method and Benchmark","primary_cat":"cs.CV","submitted_at":"2024-12-15T10:47:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey proposes a modality-driven hierarchical taxonomy for AQA methods, establishes a unified benchmark for video-based approaches across datasets, and outlines research trends and challenges.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"P3D [45] 3D CNN [103], [104], [105] Kinetics-400 [46] ⋆⋆⋆⋆⋆: Balanced for separate spatial and temporal conv. layers ⋆⋆⋆⋆⋆: Higher than 3D CNNs ⋆⋆⋆⋆⋆: Less for its hybrid design I3D [46] 3D CNN [24], [50], [58], [59], [62], [106] Kinetics-400 [46] ⋆⋆⋆⋆⋆: Strong by inflating 2D conv. into 3D conv. ⋆⋆⋆⋆⋆: High re- sources for 3D conv. ⋆⋆⋆⋆⋆: Good for various tasks VST [47] Transformer [36], [56], [65], [84] Kinetics-600 [107] ⋆⋆⋆⋆⋆: Good spatial, advanced long-range temporal dependencies ⋆⋆⋆⋆⋆: High for self-attention ⋆⋆⋆⋆⋆: Good for complex scenarios 3 M ETHODS WITH A HIERARCHICAL TAXONOMY This section presents a taxonomy of representative AQA methods categorized by input modalities: video-based (see Sec. 3.1), skeleton-based (see Sec. 
3.2), and multi-modality (see Sec."},{"citing_arxiv_id":"2411.17690","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis","primary_cat":"cs.MM","submitted_at":"2024-11-26T18:57:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Experiments with a video-text-to-speech transformer show co-temporal positional indexing enables synchronization without timestamps, text and video supply complementary signals, and modality ordering creates a trade-off between in-domain accuracy and cross-domain generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.05615","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives","primary_cat":"cs.CL","submitted_at":"2024-06-09T02:36:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Survey summarizing video-language understanding tasks, challenges, and methods from architecture, training, and data perspectives, including performance comparisons and future directions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.14238","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","submitted_at":"2023-12-21T18:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to 
achieve state-of-the-art results on 32 visual-linguistic benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"These results underscore the strong out-of-the- box pixel-level perceptual capacity of our InternViT-6B. 4.3. Vision-Language Benchmarks In this section, we evaluate the inherent capabilities of In- ternVL on various vision-language tasks. Zero-Shot Image Classification. We conduct thorough validation of the zero-shot image classification capabil- K400 [17] K600 [18] K700 [19]method #F top-1 avg. top-1 avg. top-1 avg. OpenCLIP-g [67] 1 − 63.9 − 64.1 − 56.9 OpenCLIP-G [67] 1 − 65.9 − 66.1 − 59.2 EV A-01-CLIP-g+ [130] 1 − 66.7 − 67.0 − 60.9 EV A-02-CLIP-E+ [130] 1 − 69.8 − 69.3 − 63.4 InternVL-C (ours) 1 65.9 76.1 65.5 75.5 56.8 67.5 ViCLIP [152] 8 64.8 75.7 62.2 73.5 54.3 66.4 InternVL-C (ours) 8 69.1 79.4 68.9 78.8 60.6 71."},{"citing_arxiv_id":"2312.14125","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VideoPoet: A Large Language Model for Zero-Shot Video Generation","primary_cat":"cs.CV","submitted_at":"2023-12-21T18:46:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.05737","ref_index":266,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation","primary_cat":"cs.CV","submitted_at":"2023-10-09T14:10:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual 
generation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.15389","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EVA-CLIP: Improved Training Techniques for CLIP at Scale","primary_cat":"cs.CV","submitted_at":"2023-03-27T17:02:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"size and ∼1/5 image-text pairs, achieves a 1.2-point averaged improvement over OpenCLIP-H/14. For video classification, we sample only a single center frame in each video, making it an image classification task. Following the conventional settings, we report the top-1 accuracy for UCF-101 [47] and the mean of top-1 and top-5 accuracy for Kinetics-400 [9], Kinetics-600 [7] and Kinetics-700 [8]. In Table 3 we show that EVA-CLIP is also quite effective in zero-shot video recognition benchmarks. Table 4 reports the zero-shot image and text retrieval results on Flickr30K [53] and COCO [34]. EVA-CLIP outperforms all the competitors at the base and large model size. 
While the zero-shot retrieval performance of EV A- 02-CLIP-E/14 is slightly lower than OpenCLIP-G/14, the"},{"citing_arxiv_id":"2212.03191","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InternVideo: General Video Foundation Models via Generative and Discriminative Learning","primary_cat":"cs.CV","submitted_at":"2022-12-06T18:09:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.15868","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers","primary_cat":"cs.CV","submitted_at":"2022-05-29T19:02:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"CogVideo (Ours) 50.46 626 CogVideo (Ours)** - 545 Method FVD( ↓) Latent Video Tranformer[17] 224.73 Video Transformer[33] 170 DVD-GAN-FP[4] 69.15 TriVD-GAN-FP[15] 25.74 CogVideo (Ours) 109.23 CogVideo (Ours)** 59.55 5 Experiments 5.1 Machine Evaluation Machine evaluation is conducted on two popular benchmarks for video generation, i.e., UCF101 [22] and Kinetics-600 [3]. Following Rakhimov et al. [17], Yu et al. [37], we use Fréchet Video Distance (FVD) [27] and Inception score (IS) [21] as metrics in the evaluation. 
FVD is calculated based on I3D model[2] trained on Kinetics-400, and IS is based on C3D model [25] which was first trained on the Sports-1M dataset [12] and then finetuned on the UCF101 dataset. Our evaluation code is the"},{"citing_arxiv_id":"2205.01917","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoCa: Contrastive Captioners are Image-Text Foundation Models","primary_cat":"cs.CV","submitted_at":"2022-05-04T07:01:14+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.03458","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video Diffusion Models","primary_cat":"cs.CV","submitted_at":"2022-04-07T14:08:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"[7] Kaylee Burns, Lisa Hendricks, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision (ECCV), 2018. [8] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299-6308, 2017. 
[9] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018. [10] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. International Conference on Learn- ing Representations, 2021."},{"citing_arxiv_id":"1911.11641","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PIQA: Reasoning about Physical Commonsense in Natural Language","primary_cat":"cs.CL","submitted_at":"2019-11-26T15:31:46+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PIQA is a new benchmark showing that current AI models achieve 77% on physical commonsense questions versus humans at 95%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}