Mixed citations
Language models are few-shot learners
Mixed citation behavior. Most common role is background (60%).
citing papers explorer
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
  SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
- Mutation-Guided Unit Test Generation with a Large Language Model
  MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.
- Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering
  A hierarchical QA framework converts RST discourse trees into enhanced sentence representations for structure-guided retrieval and reports consistent gains over baselines on four datasets across genres and languages.
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
  GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
- Ignore Previous Prompt: Attack Techniques For Language Models
  PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.
- HoloMotion-1 Technical Report
  HoloMotion-1 trains a MoE Transformer policy on hybrid video and MoCap motion data to achieve robust zero-shot tracking that transfers directly to real humanoid robots.
- Yi: Open Foundation Models by 01.AI
  Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
- ModelScope Text-to-Video Technical Report
  ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.