{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:L5DDZ2QZIV2FS2B7UNNWJLPIKJ","short_pith_number":"pith:L5DDZ2QZ","schema_version":"1.0","canonical_sha256":"5f463cea19457459683fa35b64ade85279a5e94f291864f9f7ba95e465291165","source":{"kind":"arxiv","id":"2504.13181","version":2},"attestation_state":"computed","paper":{"title":"Perception Encoder: The best visual embeddings are not at the output of the network","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Andrea Madotto, Chen Wei, Christoph Feichtenhofer, Daniel Bolya, Daniel Li, Hanoona Rasheed, Hu Xu, Jang Hyun Cho, Jathushan Rajasegaran, Jiale Zhi, Junke Wang, Marco Monteiro, Nikhila Ravi, Peize Sun, Piotr Doll\\'ar, Po-Yao Huang, Shiyu Dong, Tengyu Ma","submitted_at":"2025-04-17T17:59:57Z","abstract_excerpt":"We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one c"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2504.13181","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-04-17T17:59:57Z","cross_cats_sorted":[],"title_canon_sha256":"451081b8c383b7d3d716be07c800834b1ca1e73ae180cef305cbd11d15d32e78","abstract_canon_sha256":"03ab2b855c72230580f5d0a2e514a039a1a78cf6641bf2d4636445f57968f8e2"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T04:23:23.598441Z","signature_b64":"n14HSx3OmKUc6X+sWQYyPTi8LbYIfO4cdvxdaSoDk6lsyXrDmWIUg1oNse1Y4scPiW69m6lywty+L8Raq1qWBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"5f463cea19457459683fa35b64ade85279a5e94f291864f9f7ba95e465291165","last_reissued_at":"2026-05-18T04:23:23.597930Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T04:23:23.597930Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Perception Encoder: The best visual embeddings are not at the output of the network","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Andrea Madotto, Chen Wei, Christoph Feichtenhofer, Daniel Bolya, Daniel Li, Hanoona Rasheed, Hu Xu, Jang Hyun Cho, Jathushan Rajasegaran, Jiale Zhi, Junke Wang, Marco Monteiro, Nikhila Ravi, Peize Sun, Piotr Doll\\'ar, Po-Yao Huang, Shiyu Dong, Tengyu Ma","submitted_at":"2025-04-17T17:59:57Z","abstract_excerpt":"We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one c"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the intermediate-layer embeddings remain superior after the two alignment procedures without post-hoc data selection or task-specific hyperparameter tuning that would undermine the claim of a single general pretraining recipe.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"093cd69663750ce0aee377802b1bfc3b5bbbead4847c43bfc5ffb4d104da318d"},"source":{"id":"2504.13181","kind":"arxiv","version":2},"verdict":{"id":"9a4b5b32-b4ac-433f-91cd-ace723ab191d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-13T22:16:59.879451Z","strongest_claim":"after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network.","one_line_summary":"Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the intermediate-layer embeddings remain superior after the two alignment procedures without post-hoc data selection or task-specific hyperparameter tuning that would undermine the claim of a single general pretraining recipe.","pith_extraction_headline":"The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output."},"references":{"count":169,"sample":[{"doi":"","year":2019,"title":"Nocaps: Novel object captioning at scale","work_id":"041edd2d-2995-46f1-a2f6-1b15274c0edf","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, P","work_id":"9ad2b071-82d8-4cfa-b994-b9975094b575","ref_index":2,"cited_arxiv_id":"2410.07073","is_internal_anchor":true},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":3,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":2019,"title":"ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models","work_id":"073ccda3-4075-4094-b532-d808f9ecd0b4","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"PaliGemma: A versatile 3B VLM for transfer","work_id":"df6f48b3-5792-47c7-9614-cb856ea31ad9","ref_index":5,"cited_arxiv_id":"2407.07726","is_internal_anchor":true}],"resolved_work":169,"snapshot_sha256":"281e242da9f17678256c5f9a0aff02d3e8b2bd788438d8cb4c55d170b9f115db","internal_anchors":20},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a71aca8f04904d580988c3689748b903e924d78d88b0b76a7db9bc0196e29351"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2504.13181","created_at":"2026-05-18T04:23:23.598016+00:00"},{"alias_kind":"arxiv_version","alias_value":"2504.13181v2","created_at":"2026-05-18T04:23:23.598016+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2504.13181","created_at":"2026-05-18T04:23:23.598016+00:00"},{"alias_kind":"pith_short_12","alias_value":"L5DDZ2QZIV2F","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"L5DDZ2QZIV2FS2B7","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"L5DDZ2QZ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":41,"internal_anchor_count":41,"sample":[{"citing_arxiv_id":"2605.23028","citing_title":"RADAR: Relative Angular Divergence Across Representations","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23033","citing_title":"Uncovering the Latent Potential of Deep Intermediate Representations","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23556","citing_title":"Is Dimensionality a Barrier for Retrieval Models?","ref_index":173,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17630","citing_title":"SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21028","citing_title":"DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17283","citing_title":"OProver: A Unified Framework for Agentic Formal Theorem Proving","ref_index":98,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17630","citing_title":"SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17633","citing_title":"SparseSAM: Structured Sparsification of Activations in Segment Anything Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18324","citing_title":"Improved Baselines with Representation Autoencoders","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20085","citing_title":"Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2508.06248","citing_title":"Deepfake Detection that Generalizes Across Benchmarks","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2509.20899","citing_title":"Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2510.18457","citing_title":"VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2511.16719","citing_title":"SAM 3: Segment Anything with Concepts","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2512.08730","citing_title":"SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2512.13511","citing_title":"Adapting MLLMs for Nuanced Video Retrieval","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2512.17817","citing_title":"Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2602.01738","citing_title":"Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2603.03577","citing_title":"From Local Matches to Global Masks: Template-Guided Instance Detection and Segmentation in Open-World Scenes","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13565","citing_title":"Qwen-Image-VAE-2.0 Technical Report","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02320","citing_title":"Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04133","citing_title":"Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08298","citing_title":"What Cohort INRs Encode and Where to Freeze Them","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10404","citing_title":"Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25184","citing_title":"Enabling High Error Tolerance in Satellite Video Transmissions by Generative Semantic Communication","ref_index":23,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ","json":"https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ.json","graph_json":"https://pith.science/api/pith-number/L5DDZ2QZIV2FS2B7UNNWJLPIKJ/graph.json","events_json":"https://pith.science/api/pith-number/L5DDZ2QZIV2FS2B7UNNWJLPIKJ/events.json","paper":"https://pith.science/paper/L5DDZ2QZ"},"agent_actions":{"view_html":"https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ","download_json":"https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ.json","view_paper":"https://pith.science/paper/L5DDZ2QZ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2504.13181&json=true","fetch_graph":"https://pith.science/api/pith-number/L5DDZ2QZIV2FS2B7UNNWJLPIKJ/graph.json","fetch_events":"https://pith.science/api/pith-number/L5DDZ2QZIV2FS2B7UNNWJLPIKJ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ/action/storage_attestation","attest_author":"https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ/action/author_attestation","sign_citation":"https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ/action/citation_signature","submit_replication":"https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ/action/replication_record"}},"created_at":"2026-05-18T04:23:23.598016+00:00","updated_at":"2026-05-18T04:23:23.598016+00:00"}