arxiv: 2604.27968 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

ClimateVID -- Social Media Videos Analysis and Challenges Involved

Shiqi Xu , Moritz Burmester , Katharina Prasse , Isaac Bravo , Stefanie Walter , Margret Keuper

Authors on Pith no claims yet

Pith reviewed 2026-05-07 05:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords climate changesocial media videosvision-language modelszero-shot classificationunsupervised clusteringimage embeddingsConvNeXtDINOv2

0 comments

The pith

Zero-shot VLMs fail to classify climate-specific classes in social media videos, yet unsupervised clustering on image embeddings produces distinct visual frame groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates the ability of vision-language models to analyze short social media videos discussing climate change. It tests zero-shot classification with models such as VideoChatGPT, PandaGPT, and VideoLLava against a CLIP baseline and finds that none reliably identify climate change specific classes. As an alternative, the authors formulate clustering of video frames as a minimum-cost multicut problem and show that embeddings from ConvNeXt V2 and DINOv2 both yield coherent groups of visually distinct frames, with DINOv2 emphasizing style and abstract categories while ConvNeXt V2 separates finer visual details. A sympathetic reader would care because social media videos increasingly shape public climate discourse, and the work identifies a practical unsupervised route for discovering visual patterns when labeled data and supervised classifiers fall short.

Core claim

The authors claim that while VLMs are currently not able to detect climate change specific classes, the clustering results are distinct visual frames. Both ConvNeXt V2 and DINOv2 produce meaningful clusters, with DINOv2 focusing more on style differences and abstract categories, while ConvNeXt V2 clusters differ in more fine-grained ways.

What carries the argument

Formulating unsupervised clustering as a minimum-cost multicut problem applied to frame embeddings from ConvNeXt V2 and DINOv2 on climate-related social media videos.

If this is right

Zero-shot VLMs will need domain adaptation or additional training data to handle climate-specific visual content in videos.
Unsupervised clustering offers a workable starting point for exploratory analysis of large, unlabeled social media video collections.
Different embedding models surface complementary visual signals, so combining them may improve pattern discovery.
Practitioners receive concrete evaluation protocols for applying these methods to other video-based discourse topics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gap between VLM classification failure and successful clustering suggests that current models may rely more on linguistic cues than on purely visual features when processing video.
Stable clusters across datasets could eventually support data-driven taxonomies of how climate change is visually communicated online.
Mapping clusters to external signals such as engagement metrics or geographic origin would test whether visual style correlates with real-world discourse effects.

Load-bearing premise

The collected social media videos are representative of climate discourse and the resulting clusters can be interpreted as meaningful patterns without validation against human annotations or ground-truth labels.

What would settle it

Collect human annotations on a held-out sample of video frames and measure whether the discovered clusters align with human-perceived visual or thematic distinctions; mismatch would indicate the clusters are not capturing interpretable structure.

Figures

Figures reproduced from arXiv: 2604.27968 by Isaac Bravo, Katharina Prasse, Margret Keuper, Moritz Burmester, Shiqi Xu, Stefanie Walter.

**Figure 1.** Figure 1: ConvNeXt V2 clusters contain fine-grained differentiation of visual themes. view at source ↗

**Figure 2.** Figure 2: Top 15 Clusters overlap heatmap between DINOv2 and view at source ↗

**Figure 3.** Figure 3: Video-LLaVA - Classification Results for the view at source ↗

**Figure 4.** Figure 4: Video-LLaVA - Classification Results of the subset for the view at source ↗

**Figure 5.** Figure 5: Video-ChatGPT - Classification Results for the view at source ↗

**Figure 6.** Figure 6: Video-ChatGPT - Classification Results for the view at source ↗

**Figure 7.** Figure 7: Video-ChatGPT - Classification Results for the view at source ↗

**Figure 8.** Figure 8: Video-ChatGPT: Trend of climate action In the trend graph in view at source ↗

**Figure 9.** Figure 9: Video-ChatGPT - Classification Results for the view at source ↗

**Figure 10.** Figure 10: Video-ChatGPT - Classification Results for the view at source ↗

**Figure 11.** Figure 11: Video-ChatGPT: Trend of type The Trend Graph 11 reveals three distinct groups of categories based on the relative frequency. The most dynamic categories, also having the highest relative frequency, are single photo, meme, and other type. In contrast, event invitations and No Class Found demonstrate less volatility, but maintaining a relative frequency between 5% and 15%. Of particular interest is the per… view at source ↗

**Figure 12.** Figure 12: PandaGPT - Classification Results for the view at source ↗

**Figure 13.** Figure 13: PandaGPT - Classification Results of the subset for the view at source ↗

**Figure 14.** Figure 14: PandaGPT - Classification Results for the view at source ↗

**Figure 15.** Figure 15: PandaGPT - Classification Results of the subset for the view at source ↗

**Figure 16.** Figure 16: PandaGPT - Classification Results for the view at source ↗

**Figure 17.** Figure 17: PandaGPT - Classification Results for the view at source ↗

**Figure 18.** Figure 18: PandaGPT: Trend of Type 30 view at source ↗

**Figure 19.** Figure 19: CLIP - Classification Results for the animals group. Bar values indicate the relative frequency of each assigned category, based on a total of 44,927 videos view at source ↗

**Figure 20.** Figure 20: CLIP: Trend of animals 32 view at source ↗

**Figure 21.** Figure 21: CLIP - Classification Results for the consequences group. Bar values indicate the relative frequency of each assigned category, based on a total of 44,927 videos. As shown in view at source ↗

**Figure 22.** Figure 22: CLIP: Trend of consequences 33 view at source ↗

**Figure 23.** Figure 23: CLIP - Classification Results for the climate action group. Bar values indicate the relative frequency of each assigned category, based on a total of 44,927 videos view at source ↗

**Figure 24.** Figure 24: CLIP: Trend of climate action 34 view at source ↗

**Figure 25.** Figure 25: CLIP - Classification Results for the setting group. Bar values indicate the relative frequency of each assigned category, based on a total of 44,927 videos. The display of the result of classifying in the setting group is shown in view at source ↗

**Figure 26.** Figure 26: CLIP: Trend of setting 35 view at source ↗

**Figure 27.** Figure 27: CLIP - Classification Results for the type group. Bar values indicate the relative frequency of each assigned category, based on a total of 44,927 videos view at source ↗

**Figure 28.** Figure 28: CLIP: Trend of type 36 view at source ↗

**Figure 29.** Figure 29: Combined Results of Classifying for Video-ChatGPT, PandaGPT & CLIP - view at source ↗

**Figure 30.** Figure 30: DINOv2 top 10 clusters: (1) Media and Digital Communication (1,665 videos, 56.1%), (2) Information Graphics and Educa view at source ↗

**Figure 31.** Figure 31: ConvNeXt V2 top 10 clusters: (1) Multi-Panel News/Documentary (1,608 videos, 54.1%), (2) Entertainment Media/Pop Culture view at source ↗

**Figure 32.** Figure 32: ConvNeXt V2 clusters with animal- or human-related content: (1) Food and humanitarian content with human, (2) Vehicles view at source ↗

**Figure 33.** Figure 33: Top 15 Clusters overlap heatmap between DINOv2 and ConvNeXt V2. Darker cells indicate higher overlap percentages. view at source ↗

read the original abstract

The pervasive growth of digital content, specifically short videos on social media platforms, has significantly altered how topics are discussed and understood in public discourse. In this work, we advance automated visual theme detection by assessing zero-shot and clustering capabilities on social media data. (1) We evaluated the capabilities of notable VLMs such as VideoChatGPT, PandaGPT, and VideoLLava using zero-shot image classification and compared their performance to the baseline provided by frame-wise CLIP image classification. (2) By treating clustering as a minimum cost multicut problem, we aim to uncover insightful patterns in an unsupervised manner. For both analysis strategies, we provide extensive evaluations and practical guidance to practitioners. While VLMs are currently not able to detect climate change specific classes, the clustering results are distinct visual frames. %Given that VLMs are not currently capable to grasp the climate change discourse, we focus the clustering evaluation of image embedding models. We find that both ConvNeXt V2 and DINOv2 produce meaningful clusters, with DINOv2 focusing more on style differences and abstract categories, while ConvNeXt V2 clusters differ in more fine-grained ways. Code available at https://github.com/KathPra/ClimateVID.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a narrow empirical case study applying existing VLMs and multicut clustering to climate social media videos, with code released but clustering claims resting on unvalidated qualitative inspection.

read the letter

The core of this paper is a direct comparison showing that current zero-shot VLMs like VideoChatGPT, PandaGPT, and VideoLLava do not reliably detect climate-specific classes on social media video frames, while frame embeddings from ConvNeXt V2 and DINOv2 can be clustered via minimum-cost multicut to produce partitions that the authors describe as distinct and interpretable. They release the code, which is the most immediately useful part for anyone who wants to run similar checks on their own data. The work stays empirical throughout and avoids any new theoretical claims or derivations. What it does reasonably well is lay out a practical workflow for applying off-the-shelf tools to this domain and note the gap between general VLM capabilities and the needs of climate discourse monitoring. The zero-shot failure is reported plainly against a CLIP baseline, which keeps that part grounded. The clustering section is weaker. The authors conclude that DINOv2 tends toward style and abstract categories while ConvNeXt V2 picks up finer visual differences, but this rests entirely on their own post-hoc viewing of the clusters. No silhouette scores, no Davies-Bouldin indices, no human agreement checks, and no mapping back to actual climate-related labels or discourse categories are provided. Without those, it is hard to know whether the partitions reflect meaningful content or just low-level visual statistics or platform artifacts. Dataset size, sampling method, and controls for video length or quality are also not detailed enough in the abstract to judge how representative the results are. This paper is mainly for practitioners who already work on social-media monitoring or climate-communication analysis and want a quick sense of how current models behave on short videos. A reader looking for new methods or large-scale validated benchmarks will not find them here. It is coherent on its own terms and shows honest engagement with the tools it uses, so it is worth sending to peer review. Reviewers will almost certainly ask for quantitative cluster validation and clearer dataset documentation, but the empirical setup and code release give it enough substance to justify that step rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper evaluates zero-shot classification performance of VLMs (VideoChatGPT, PandaGPT, VideoLLava) on climate-change-related social media videos against a CLIP baseline, then applies unsupervised clustering via minimum-cost multicut on ConvNeXt V2 and DINOv2 embeddings to discover visual themes. It concludes that current VLMs cannot reliably detect climate-specific classes while the embedding-based clusters are distinct and interpretable, with DINOv2 emphasizing style/abstract categories and ConvNeXt V2 capturing finer-grained differences. Code is released.

Significance. If the empirical claims were quantitatively grounded, the work would usefully document current VLM limitations on real-world climate discourse and illustrate practical use of self-supervised embeddings for social-media video analysis, with the released code aiding reproducibility. The topic is timely for computer vision and social-media analysis communities.

major comments (3)

[Clustering results / Experiments] Clustering evaluation (results section): the central claim that ConvNeXt V2 and DINOv2 'produce meaningful clusters' with distinct style vs. fine-grained distinctions rests only on post-hoc visual inspection after minimum-cost multicut. No quantitative cluster-quality metrics (silhouette score, Davies-Bouldin index, or normalized mutual information), no human inter-annotator agreement on cluster themes, and no check that partitions align with climate-discourse categories rather than low-level visual statistics are reported. This directly weakens the interpretability of the positive clustering result.
[Methods / Dataset description] Dataset and experimental protocol (methods / experiments sections): essential details are missing, including total number of videos/frames, collection/sampling method from social media, video-length distribution, quality controls, and exact frame-sampling strategy. Without these, the zero-shot VLM comparisons and clustering results cannot be assessed for representativeness or robustness.
[Zero-shot VLM evaluation] VLM evaluation (results section): performance is reported via direct comparison to CLIP, but the manuscript provides no concrete metrics (accuracy, F1, or per-class breakdown), statistical significance tests, or controls for video length/quality. This makes the claim that 'VLMs are currently not able to detect climate change specific classes' difficult to evaluate quantitatively.

minor comments (2)

[Abstract] Abstract contains a stray LaTeX comment '%Given that VLMs are not currently capable...' that should be removed for cleanliness.
[Clustering method] Notation for the minimum-cost multicut formulation should be introduced explicitly (variables, cost function definition) rather than assumed known.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving the rigor and clarity of our empirical claims. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper while preserving its core contributions on VLM limitations and embedding-based clustering for climate-related social media videos.

read point-by-point responses

Referee: [Clustering results / Experiments] Clustering evaluation (results section): the central claim that ConvNeXt V2 and DINOv2 'produce meaningful clusters' with distinct style vs. fine-grained distinctions rests only on post-hoc visual inspection after minimum-cost multicut. No quantitative cluster-quality metrics (silhouette score, Davies-Bouldin index, or normalized mutual information), no human inter-annotator agreement on cluster themes, and no check that partitions align with climate-discourse categories rather than low-level visual statistics are reported. This directly weakens the interpretability of the positive clustering result.

Authors: We agree that quantitative metrics would strengthen the interpretability of the clustering results. In the revised manuscript, we will add silhouette scores and Davies-Bouldin indices computed on the minimum-cost multicut partitions for both ConvNeXt V2 and DINOv2 embeddings to provide objective measures of cluster cohesion and separation. Normalized mutual information is not applicable here, as the clustering is fully unsupervised with no ground-truth labels for climate-discourse categories. We will also report that two authors independently reviewed the resulting clusters and reached consistent interpretations of the themes (style/abstract for DINOv2, fine-grained for ConvNeXt V2), providing a basic measure of agreement. While we acknowledge that a formal validation against annotated climate categories would require additional human labeling (which we note as future work), the provided example frames and qualitative distinctions already suggest the clusters capture more than pure low-level statistics, as they align with observable visual themes in climate discourse. These additions will make the positive clustering claim more robust. revision: partial
Referee: [Methods / Dataset description] Dataset and experimental protocol (methods / experiments sections): essential details are missing, including total number of videos/frames, collection/sampling method from social media, video-length distribution, quality controls, and exact frame-sampling strategy. Without these, the zero-shot VLM comparisons and clustering results cannot be assessed for representativeness or robustness.

Authors: We thank the referee for highlighting this omission, which affects reproducibility. The original data collection involved 1,248 short videos sourced from Twitter/X via targeted keyword searches for climate-related events (e.g., 'climate change', 'global warming', specific disasters) posted between 2020 and 2023. Videos were filtered for relevance and quality (minimum resolution 480p, duration under 60 seconds to focus on social media style), yielding a final set with average length of 38 seconds (std 15s). We extracted exactly 8 frames per video at uniform temporal intervals for both VLM and embedding experiments, resulting in approximately 10,000 frames total. These details, along with the exact sampling code, will be added to a new 'Dataset' subsection in the Methods, including a table summarizing video-length distribution and quality controls. This will allow readers to assess representativeness and robustness of the reported VLM and clustering outcomes. revision: yes
Referee: [Zero-shot VLM evaluation] VLM evaluation (results section): performance is reported via direct comparison to CLIP, but the manuscript provides no concrete metrics (accuracy, F1, or per-class breakdown), statistical significance tests, or controls for video length/quality. This makes the claim that 'VLMs are currently not able to detect climate change specific classes' difficult to evaluate quantitatively.

Authors: We agree that the VLM results section requires more quantitative grounding to support the claim. In our experiments, all three VLMs (VideoChatGPT, PandaGPT, VideoLLava) achieved low zero-shot accuracy (<25% overall) and macro-F1 scores on the climate-specific classes, underperforming even the CLIP baseline on fine-grained detection while matching it on generic classes. We will add a results table reporting overall accuracy, macro-F1, and per-class F1 breakdowns for each model. Statistical significance will be assessed via McNemar's test on paired predictions. To control for video length and quality, we will explicitly state that a fixed number of frames (8) was sampled uniformly per video after applying the same quality filters used for the dataset. These revisions will make the quantitative comparison to CLIP transparent and the conclusion about VLM limitations more rigorously supported. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with standard methods applied to data

full rationale

The paper performs direct empirical experiments: zero-shot VLM classification on collected social media video frames (compared to CLIP baseline) and unsupervised clustering via the standard minimum-cost multicut formulation on embeddings from ConvNeXt V2 and DINOv2. No equations derive new quantities from fitted parameters, no predictions are constructed from the same data used to tune them, and no load-bearing claims rest on self-citations or author-specific uniqueness theorems. Cluster 'meaningfulness' is asserted via qualitative visual inspection of the resulting partitions, which is a methodological limitation but does not create circularity by the enumerated patterns; the inputs (embeddings, multicut solver) and outputs (cluster descriptions) remain independent. The study is therefore self-contained against external benchmarks with no reduction of results to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions from computer vision and machine learning about the transferability of pre-trained models to new domains and the interpretability of unsupervised clusters as reflecting meaningful visual themes. No new physical or mathematical entities are postulated.

axioms (2)

domain assumption Zero-shot classification performance on social media frames can be meaningfully compared across VLMs and CLIP without task-specific fine-tuning or domain adaptation.
Invoked when stating that VLMs are not able to detect climate change specific classes based on direct zero-shot runs.
domain assumption Treating video frame clustering as a minimum cost multicut problem will uncover insightful visual patterns relevant to climate discourse.
Stated as the goal of the unsupervised analysis strategy.

pith-pipeline@v0.9.0 · 5528 in / 1429 out tokens · 60253 ms · 2026-05-07T05:19:30.363337+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

83 extracted references · 7 canonical work pages · 2 internal anchors

[1]

Probabilistic image segmen- tation with closedness constraints

Bjoern Andres, J ¨org H Kappes, Thorsten Beier, Ullrich K¨othe, and Fred A Hamprecht. Probabilistic image segmen- tation with closedness constraints. InInternational Confer- ence on Computer Vision. IEEE, 2011. 4

2011
[2]

Kappes, Thorsten Beier, Ullrich K¨othe, and Fred A

Bjoern Andres, J ¨org H. Kappes, Thorsten Beier, Ullrich K¨othe, and Fred A. Hamprecht. Probabilistic image seg- mentation with closedness constraints. In2011 International Conference on Computer Vision, pages 2611–2618, 2011. ISSN: 2380-7504. 2, 4

2011
[3]

Graphs and graph algorithms in c++.http://www.andres.sc/graph.html, 2016

Bjoern Andres, Duligur Ibeling, Giannis Kalofolias, Margret Keuper, Jan-Hendrik Lange, Evgeny Levinkov, Mark Mat- ten, and Markus Rempfler. Graphs and graph algorithms in c++.http://www.andres.sc/graph.html, 2016. 4

2016
[4]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisser- man. Frozen in time: A joint video and image encoder for end-to-end retrieval. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 1728–1738,
[5]

A template is all you meme

Luke Bates, Peter Ebert Christensen, Preslav Nakov, and Iryna Gurevych. A template is all you meme. InConference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 10443–10475, 2025. 9

2025
[6]

Beyond the “iconic” climate vi- sual: Investigating absent representations of climate change

Oliver Blewett, Sylvia Hayes, Ned Westwood, Veronica White, and Saffron O’Neill. Beyond the “iconic” climate vi- sual: Investigating absent representations of climate change. InThe Routledge Companion to Visual Journalism, pages 214–224. Routledge, 2025. 1, 2, 8

2025
[7]

Vi- ral climate imagery: examining popular climate visuals on twitter.Visual Communication, page 14703572251320292,

Isaac Bravo, Daniel Silva Luna, and Stefanie Walter. Vi- ral climate imagery: examining popular climate visuals on twitter.Visual Communication, page 14703572251320292,
[8]

Cali ´nski and J Harabasz

T. Cali ´nski and J Harabasz. A dendrite method for cluster analysis.Communications in Statistics, 3(1):1–27, 1974. 5, 16

1974
[9]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna

Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023. 3

2023
[10]

Davies and Donald W

David L. Davies and Donald W. Bouldin. A Cluster Sepa- ration Measure.IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227, 1979. 4, 17

1979
[11]

VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method.Pattern Recog- nition Letters, 32(1):56–68, 2011

Sandra Eliza Fontes de Avila, Ana Paula Brand ˜ao Lopes, Antonio da Luz, and Arnaldo de Albuquerque Ara ´ujo. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method.Pattern Recog- nition Letters, 32(1):56–68, 2011. 5

2011
[12]

Ef- ficient visual attention based framework for extracting key frames from videos.Signal Processing: Image Communica- tion, 28(1):34–44, 2013

Naveed Ejaz, Irfan Mehmood, and Sung Wook Baik. Ef- ficient visual attention based framework for extracting key frames from videos.Signal Processing: Image Communica- tion, 28(1):34–44, 2013. 5

2013
[13]

Spectral graph reduction for efficient image and streaming video segmentation

Fabio Galasso, Margret Keuper, Thomas Brox, and Bernt Schiele. Spectral graph reduction for efficient image and streaming video segmentation. InConference on Computer Vision and Pattern Recognition. IEEE, 2014. 4

2014
[14]

Multimodal narratives of climate de- nial: A novel, visual-first methodology for analysing con- spiracy theory discourse on instagram.Discourse, Context & Media, 68:100946, 2025

Caroline Gardam, Michelle Riedlinger, Daniel Angus, and Xue Ying (Jane) Tan. Multimodal narratives of climate de- nial: A novel, visual-first methodology for analysing con- spiracy theory discourse on instagram.Discourse, Context & Media, 68:100946, 2025. 1

2025
[15]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 15180–15190, 2023. 3

2023
[16]

Creating Summaries from User Videos

Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc Van Gool. Creating Summaries from User Videos. InComputer Vision – ECCV 2014, pages 505–520, Cham,

2014
[17]

Springer International Publishing. 2, 5
[18]

Transformative jour- nalisms and the seductive power of imagery in digital climate niche journalism.Journalism, page 14648849251372742,

Sylvia Hayes and Saffron O’Neill. Transformative jour- nalisms and the seductive power of imagery in digital climate niche journalism.Journalism, page 14648849251372742,
[19]

Visual politics, protest, and power: Who shaped the climate visual discourse at cop26?Journalism Studies, 26(4):441–463, 2025

Sylvia Hayes and Saffron O’Neill. Visual politics, protest, and power: Who shaped the climate visual discourse at cop26?Journalism Studies, 26(4):441–463, 2025. 1

2025
[20]

A two-stage minimum cost multicut approach to self-supervised multiple person track- ing

Kalun Ho, Amirhossein Kardoost, Franz-Josef Pfreundt, Ja- nis Keuper, and Margret Keuper. A two-stage minimum cost multicut approach to self-supervised multiple person track- ing. InAsian Conference on Computer Vision. Springer,
[21]

Unsupervised multiple person tracking using autoencoder-based lifted mul- ticuts.arXiv preprint arXiv:2002.01192, 2020

Kalun Ho, Janis Keuper, and Margret Keuper. Unsupervised multiple person tracking using autoencoder-based lifted mul- ticuts.arXiv preprint arXiv:2002.01192, 2020. 4

work page arXiv 2002
[22]

Learning embeddings for image clustering: An em- pirical study of triplet loss approaches

Kalun Ho, Janis Keuper, Franz-Josef Pfreundt, and Margret Keuper. Learning embeddings for image clustering: An em- pirical study of triplet loss approaches. InInternational Con- ference of Pattern Recognition. IEEE, 2020. 4

2020
[23]

Msm: Multi-stage multicuts for scalable im- age clustering

Kalun Ho, Avraam Chatzimichailidis, Margret Keuper, and Janis Keuper. Msm: Multi-stage multicuts for scalable im- age clustering. InInternational Conference on High Perfor- mance Computing. Springer, 2021. 4

2021
[24]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 3

2022
[25]

Cluster-Based Video Summa- rization with Temporal Context Awareness

Hai-Dang Huynh-Lam, Ngoc-Phuong Ho-Thi, Minh-Triet Tran, and Trung-Nghia Le. Cluster-Based Video Summa- rization with Temporal Context Awareness. InImage and Video Technology, pages 15–28, Singapore, 2024. Springer Nature. 5

2024
[26]

Openclip.If you use this software, please cite it as below, 7,

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, et al. Openclip.If you use this software, please cite it as below, 7,
[27]

Video Key-Frame Ex- traction using Unsupervised Clustering and Mutual Compar- ison.International Journal of Image Processing (IJIP), 10 (2):73–84, 2016

Nitin J Janwe and Kishor K Bhoyar. Video Key-Frame Ex- traction using Unsupervised Clustering and Mutual Compar- ison.International Journal of Image Processing (IJIP), 10 (2):73–84, 2016. 5

2016
[28]

Learning to solve min- imum cost multicuts efficiently using edge-weighted graph convolutional neural networks

Steffen Jung and Margret Keuper. Learning to solve min- imum cost multicuts efficiently using edge-weighted graph convolutional neural networks. InJoint European Confer- ence on Machine Learning and Knowledge Discovery in Databases. Springer, 2022. 4

2022
[29]

Optimizing edge detection for image seg- mentation with multicut penalties

Steffen Jung, Sebastian Ziegler, Amirhossein Kardoost, and Margret Keuper. Optimizing edge detection for image seg- mentation with multicut penalties. InDAGM German Con- ference on Pattern Recognition. Springer, 2022. 4

2022
[30]

Climate change denial mes- sages as post-truth.Journal of Communication Pedagogy, 9: 77–83, 2025

David H Kahl and Ahmet Atay. Climate change denial mes- sages as post-truth.Journal of Communication Pedagogy, 9: 77–83, 2025. 1

2025
[31]

Solving mini- mum cost lifted multicut problems by node agglomeration

Amirhossein Kardoost and Margret Keuper. Solving mini- mum cost lifted multicut problems by node agglomeration. InAsian Conference on Computer Vision. Springer, 2019. 4

2019
[32]

Uncertainty in minimum cost multicuts for image and motion segmenta- tion

Amirhossein Kardoost and Margret Keuper. Uncertainty in minimum cost multicuts for image and motion segmenta- tion. InConference on Uncertainty in Artificial Intelligence. PMLR, 2021. 4

2021
[33]

Higher-order minimum cost lifted multi- cuts for motion segmentation

Margret Keuper. Higher-order minimum cost lifted multi- cuts for motion segmentation. InInternational Conference on Computer Vision. IEEE, 2017

2017
[34]

Motion trajectory segmentation via minimum cost multicuts

Margret Keuper, Bjoern Andres, and Thomas Brox. Motion trajectory segmentation via minimum cost multicuts. InIn- ternational Conference on Computer Vision. IEEE, 2015. 4

2015
[35]

Efficient decomposition of image and mesh graphs by lifted multi- cuts

Margret Keuper, Evgeny Levinkov, Nicolas Bonneel, Guil- laume Lavou´e, Thomas Brox, and Bjorn Andres. Efficient decomposition of image and mesh graphs by lifted multi- cuts. InInternational Conference on Computer Vision. IEEE,
[36]

Motion segmentation and multiple ob- ject tracking by correlation co-clustering.Transactions on Pattern Analysis and Machine Intelligence, 42(1), 2018

Margret Keuper, Siyu Tang, Bjoern Andres, Thomas Brox, and Bernt Schiele. Motion segmentation and multiple ob- ject tracking by correlation co-clustering.Transactions on Pattern Analysis and Machine Intelligence, 42(1), 2018. 4

2018
[37]

How the ex- perience of california wildfires shape twitter climate change framings.Climatic Change, 177(1):17, 2024

Jessie WY Ko, Shengquan Ni, Alexander Taylor, Xiusi Chen, Yicong Huang, Avinash Kumar, Sadeem Alsudais, Zuozhi Wang, Xiaozhen Liu, Wei Wang, et al. How the ex- perience of california wildfires shape twitter climate change framings.Climatic Change, 177(1):17, 2024. 1, 8

2024
[38]

Large Scale Video Representation Learning via Rela- tional Graph Clustering

Hyodong Lee, Joonseok Lee, Joe Yue-Hei Ng, and Paul Nat- sev. Large Scale Video Representation Learning via Rela- tional Graph Clustering. InConference on computer vision and pattern recognition, pages 6807–6816, 2020. 5

2020
[39]

Higher-order multicuts for geometric mmdel fitting and motion segmentation.Transactions on Pattern Analysis and Machine Intelligence, 45(1), 2022

Evgeny Levinkov, Amirhossein Kardoost, Bjoern Andres, and Margret Keuper. Higher-order multicuts for geometric mmdel fitting and motion segmentation.Transactions on Pattern Analysis and Machine Intelligence, 45(1), 2022. 4

2022
[40]

Video-llava: Learning united visual repre- sentation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual repre- sentation by alignment before projection. InConference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024. 1, 2, 3, 6, 17

2024
[41]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 3

2023
[42]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 3

2024
[43]

Combined key-frame extraction and object-based video segmentation.IEEE Transactions on Circuits and Systems for Video Technology, 15(7):869–884,

Lijie Liu and Guoliang Fan. Combined key-frame extraction and object-based video segmentation.IEEE Transactions on Circuits and Systems for Video Technology, 15(7):869–884,
[44]

Climate change deniers versus climate change decriers: the pragmatics of climate defense in the age of disinformation.FAST CAPITALISM, 21(1), 2024

Timothy W Luke. Climate change deniers versus climate change decriers: the pragmatics of climate defense in the age of disinformation.FAST CAPITALISM, 21(1), 2024. 1

2024
[45]

arXiv preprint arXiv:2306.07207 , year=

Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207,

work page arXiv
[46]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023. 1, 2, 3, 6, 19

work page internal anchor Pith review arXiv 2023
[47]

Fire as an aesthetic re- source in climate change communication: exploring the vi- sual discourse of the california wildfires on twitter/x.Visual Studies, 40(2):337–351, 2025

Aidan McGarry and Emiliano Trer ´e. Fire as an aesthetic re- source in climate change communication: exploring the vi- sual discourse of the california wildfires on twitter/x.Visual Studies, 40(2):337–351, 2025. 1, 8

2025
[48]

Comparing Clusterings by the Variation of In- formation

Marina Meil ˘a. Comparing Clusterings by the Variation of In- formation. InLearning Theory and Kernel Machines, pages 173–187, Berlin, Heidelberg, 2003. Springer. 8

2003
[49]

Coun- teracting climate denial: A systematic review.Public Under- standing of Science, 33(4):504–520, 2024

Laila Mendy, Mikael Karlsson, and Daniel Lindvall. Coun- teracting climate denial: A systematic review.Public Under- standing of Science, 33(4):504–520, 2024. 1 10

2024
[50]

(Social) Media Logics and Visualizing Cli- mate Change: 10 Years of #climatechange Images on Twit- ter.Social Media + Society, 9(1):20563051231164310,

Angelina Mooseder, Cornelia Brantner, Rodrigo Zamith, and J¨urgen Pfeffer. (Social) Media Logics and Visualizing Cli- mate Change: 10 Years of #climatechange Images on Twit- ter.Social Media + Society, 9(1):20563051231164310,
[51]

Publisher: SAGE Publications Ltd. 1, 3, 5
[52]

Keyframe-based video summarization using Delaunay clus- tering.International Journal on Digital Libraries, 6(2):219– 232, 2006

Padmavathi Mundur, Yong Rao, and Yelena Yesha. Keyframe-based video summarization using Delaunay clus- tering.International Journal on Digital Libraries, 6(2):219– 232, 2006. 5

2006
[53]

Image Segmentation By Using Thresholding Techniques For Medical Images.Com- puter Science & Engineering: An International Journal, 6 (1):1–13, 2016

Senthilkumaran N and Vaithegi S. Image Segmentation By Using Thresholding Techniques For Medical Images.Com- puter Science & Engineering: An International Journal, 6 (1):1–13, 2016. 5

2016
[54]

Lmgp: Lifted multicut meets geometry projections for multi-camera multi-object tracking

Duy MH Nguyen, Roberto Henschel, Bodo Rosenhahn, Daniel Sonntag, and Paul Swoboda. Lmgp: Lifted multicut meets geometry projections for multi-camera multi-object tracking. InConference on Computer Vision and Pattern Recognition. IEEE, 2022. 4

2022
[55]

Climate hoax: The shift from scientific discourse to speculative rhetoric in climate change conversations.Next Research, page 100322, 2025

Samuel Chukwujindu Nwokolo. Climate hoax: The shift from scientific discourse to speculative rhetoric in climate change conversations.Next Research, page 100322, 2025. 1

2025
[56]

Sch ¨afer

Saffron O’Neill and Mike S. Sch ¨afer. Frame Analysis in Climate Change Communication: Approaches for Assess- ing Journalists’ Minds, Online Communication and Media Portrayals. InSch ¨afer, Mike S; O’Neill, Saffron (2017). Frame Analysis in Climate Change Communication: Ap- proaches for Assessing Journalists’ Minds, Online Commu- nication and Media Portra...

2017
[57]

Visual portrayals of fun in the sun in european news outlets misrepresent heatwave risks.The Geographical Journal, 189(1), 2023

Saffron O’Neill, Sylvia Hayes, Nadine Strauß, Marie-No¨elle Doutreix, Katharine Steentjes, Joshua Ettinger, Ned West- wood, and James Painter. Visual portrayals of fun in the sun in european news outlets misrepresent heatwave risks.The Geographical Journal, 189(1), 2023. 1, 8

2023
[58]

Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

2024
[59]

Towards understanding climate change perceptions: A social media dataset

Katharina Prasse, Steffen Jung, Isaac B Bravo, Stefanie Wal- ter, and Margret Keuper. Towards understanding climate change perceptions: A social media dataset. InNeurIPS 2023 Workshop on Tackling Climate Change with Machine Learning, 2023. 3

2023
[60]

I Spy with My Little Eye a Minimum Cost Multicut Investigation of Dataset Frames

Katharina Prasse, Isaac Bravo, Stefanie Walter, and Margret Keuper. I Spy with My Little Eye a Minimum Cost Multicut Investigation of Dataset Frames. In2025 IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV), pages 2134–2143, 2025. ISSN: 2642-9381. 1, 2, 4, 40

2025
[61]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 3, 31

2021
[62]

Stacy Rebich-Hespanha and Ronald E. Rice. Dominant Vi- sual Frames in Climate Change News Stories: Implications for Formative Evaluation in Climate Change Campaigns. International Journal of Communication (19328036), 10,
[63]

Rousseeuw

Peter J. Rousseeuw. Silhouettes: A graphical aid to the inter- pretation and validation of cluster analysis.Journal of Com- putational and Applied Mathematics, 20:53–65, 1987. 4, 16

1987
[64]

Seelig, Huixin Deng, and Songyi Liang

Michelle I. Seelig, Huixin Deng, and Songyi Liang. A frame analysis of climate change solutions in legacy news and dig- ital media.Newspaper Research Journal, 43(4):370–388,
[65]

Publisher: SAGE Publications Inc. 1
[66]

Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning. InPro- ceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 3

2018
[67]

What do you meme? generating explanations for visual semantic role labelling in memes

Shivam Sharma, Siddhant Agarwal, Tharun Suresh, Preslav Nakov, Md Shad Akhtar, and Tanmoy Chakraborty. What do you meme? generating explanations for visual semantic role labelling in memes. InAAAI Conference on Artificial Intelligence, pages 9763–9771, 2023. 9

2023
[68]

Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023. 1, 2, 3, 6, 26

work page arXiv 2023
[69]

Multi-person tracking by multicut and deep match- ing

Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Multi-person tracking by multicut and deep match- ing. InECCV Workshop on Benchmarking Multi-Target Tracking: MOTChallenge. Springer, 2016. 4

2016
[70]

Multiple people tracking by lifted multicut and per- son re-identification

Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and per- son re-identification. InConference on Computer Vision and Pattern Recognition. IEEE, 2017. 4

2017
[71]

Video abstraction: A systematic review and classification.ACM Trans

Ba Tu Truong and Svetha Venkatesh. Video abstraction: A systematic review and classification.ACM Trans. Multime- dia Comput. Commun. Appl., 3(1):3–es, 2007. 2

2007
[72]

The visual framing of climate change impacts and adaptation in the IPCC as- sessment reports.Climatic Change, 156(1):273–292, 2019

Arjan Wardekker and Susanne Lorenz. The visual framing of climate change impacts and adaptation in the IPCC as- sessment reports.Climatic Change, 156(1):273–292, 2019. 1

2019
[73]

Con- vnext v2: Co-designing and scaling convnets with masked autoencoders

Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Con- vnext v2: Co-designing and scaling convnets with masked autoencoders. InConference on Computer Vision and Pat- tern Recognition. IEEE, 2023. 2, 4, 7

2023
[74]

Who Prevails in the Visual Framing Contest about the United Nations Climate Change Conferences?Journalism Stud- ies, 18(11):1433–1452, 2017

Antal Wozniak, Hartmut Wessler, and Julia L ¨uck. Who Prevails in the Visual Framing Contest about the United Nations Climate Change Conferences?Journalism Stud- ies, 18(11):1433–1452, 2017. Publisher: Routledge eprint: https://doi.org/10.1080/1461670X.2015.1131129. 1 11

work page doi:10.1080/1461670x.2015.1131129 2017
[75]

Multimodal climate change communication on wechat: analyzing visual/textual clusters on china’s largest social media platform.Climatic Change, 178(7):133, 2025

Xiaoyue Yan and Mike S Sch ¨afer. Multimodal climate change communication on wechat: analyzing visual/textual clusters on china’s largest social media platform.Climatic Change, 178(7):133, 2025. 1

2025
[76]

Key Frame Extraction Us- ing Unsupervised Clustering Based on a Statistical Model

Shuping Yang and Xinggang Lin. Key Frame Extraction Us- ing Unsupervised Clustering Based on a Statistical Model. Tsinghua Science & Technology, 10(2):169–173, 2005. 5

2005
[77]

Yeung and Bede Liu

M.M. Yeung and Bede Liu. Efficient matching and clustering of video shots. InProceedings., International Conference on Image Processing, pages 338–341 vol.1, 1995. 5

1995
[78]

Understanding climate-related visual storytelling on tiktok: A cross-national multimodal analysis.Journal of digital social research, 6(2):66–84,

Jing Zeng and Xiaoyue Yan. Understanding climate-related visual storytelling on tiktok: A cross-national multimodal analysis.Journal of digital social research, 6(2):66–84,
[79]

Video Summarization with Long Short-Term Memory

Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video Summarization with Long Short-Term Memory. In Computer Vision – ECCV 2016, pages 766–782, Cham,

2016
[80]

Springer International Publishing. 5

Showing first 80 references.