pith. machine review for the scientific record.

arxiv: 2604.11730 · v3 · submitted 2026-04-13 · 💻 cs.CV · cs.HC · cs.LG

Recognition: 2 theorem links · Lean Theorem

Multimodal Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:56 UTC · model grok-4.3

classification 💻 cs.CV cs.HC cs.LG
keywords ambivalence and hesitancy recognition · multimodal video analysis · digital health interventions · affective computing · deep learning fusion · BAH dataset · personalized behavior change

The pith

Standard deep learning models show limited success in recognizing ambivalence and hesitancy in videos, indicating that better methods for handling conflicts within and across modalities are needed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores automatic recognition of ambivalence and hesitancy, which are subtle conflicting emotions that can delay or prevent people from adopting healthy behaviors in digital health interventions. These emotions appear as inconsistencies in facial expressions, voice, body language, or spoken words. The authors test three setups on the BAH video dataset: ordinary supervised training, unsupervised domain adaptation to personalize the model to new users, and zero-shot inference with large language models. Performance across all setups remains low, which leads them to conclude that current video architectures cannot reliably capture the required spatio-temporal patterns or cross-modal conflicts.

Core claim

Applying standard deep learning pipelines to multimodal video for ambivalence and hesitancy recognition produces only limited accuracy, demonstrating that existing architectures are insufficient to exploit affective inconsistencies within and across modalities and that specialized spatio-temporal and multimodal fusion techniques will be required before such recognition can support personalized digital health interventions.

What carries the argument

Multimodal video analysis pipelines that combine facial, vocal, linguistic, and body cues to detect affective inconsistency, evaluated in supervised, domain-adaptation, and large-language-model zero-shot regimes on the BAH dataset.
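The review describes this machinery only at a high level; the paper's exact pipeline is not reproduced here. As a rough illustration of the kind of late-fusion baseline being discussed, the PyTorch sketch below projects per-modality clip features (visual, audio, text), concatenates them, and classifies. It is an assumption for illustration only, with arbitrary module names and dimensions, not the authors' implementation.

    # Illustrative late-fusion sketch (assumed, not the authors' code).
    # Inputs are pre-extracted clip-level features per modality.
    import torch
    import torch.nn as nn

    class LateFusionAHClassifier(nn.Module):
        """Concatenates per-modality embeddings and classifies A/H vs. not."""

        def __init__(self, d_visual=512, d_audio=128, d_text=768,
                     d_hidden=256, n_classes=2):
            super().__init__()
            # One small projection per modality, then a shared classification head.
            self.proj_v = nn.Linear(d_visual, d_hidden)
            self.proj_a = nn.Linear(d_audio, d_hidden)
            self.proj_t = nn.Linear(d_text, d_hidden)
            self.head = nn.Sequential(
                nn.ReLU(),
                nn.Linear(3 * d_hidden, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, n_classes),
            )

        def forward(self, v, a, t):
            # v: (B, d_visual), a: (B, d_audio), t: (B, d_text) clip-level features.
            z = torch.cat([self.proj_v(v), self.proj_a(a), self.proj_t(t)], dim=-1)
            return self.head(z)

    # Forward pass with random tensors standing in for real per-clip features.
    model = LateFusionAHClassifier()
    logits = model(torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 768))
    print(logits.shape)  # torch.Size([4, 2])

A concatenation head of this kind only lets modalities interact in the final dense layers, which is precisely the style of fusion the review argues is too weak to surface conflicts between channels.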

If this is right

  • Improved fusion methods would allow digital health systems to detect when a user is wavering between acceptance and refusal of a recommended behavior.
  • Such detection would enable real-time personalization of interventions, for example by adjusting message framing or timing when ambivalence is flagged.
  • Domain adaptation and zero-shot LLM routes both inherit the same fusion shortcomings, so gains would require changes to the underlying video representation rather than only the training regime.
  • Accurate A/H recognition could reduce the cost and improve the scalability of behavior-change support in settings where in-person experts are unavailable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar fusion limitations may appear in other video tasks that rely on detecting internal contradictions, such as multimodal deception detection or conflicting sentiment in conversation.
  • If better models are built, they could be tested for transfer to related affective states like uncertainty or mixed emotions in clinical or educational video data.
  • The finding suggests that progress on this task may depend more on new architectural primitives for inconsistency modeling than on simply scaling data or model size.

Load-bearing premise

That off-the-shelf deep learning video models can detect subtle emotional conflicts across and within modalities without new architectural adaptations for spatio-temporal fusion.

What would settle it

A new model that adds explicit spatio-temporal and cross-modal fusion layers and then achieves substantially higher accuracy on the same BAH test videos would show that the current limited performance is not an inherent limit of the task.
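One way to set up that experiment is to swap the concatenation step for explicit cross-modal interaction. The sketch below is again an assumption rather than anything taken from the paper: it uses standard multi-head attention so that visual tokens can query the audio and text streams across time, which is the minimal ingredient the proposed test would require.

    # Hedged sketch of a cross-modal attention fusion block (illustrative
    # assumption, not a method from the paper). Each modality is a temporal
    # token sequence; the visual stream attends to audio and text so that
    # inconsistencies between channels can shape the fused representation.
    import torch
    import torch.nn as nn

    class CrossModalFusionBlock(nn.Module):
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.attn_va = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.attn_vt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                    nn.Linear(d_model, d_model))

        def forward(self, vis, aud, txt):
            # vis: (B, Tv, d), aud: (B, Ta, d), txt: (B, Tt, d) token sequences.
            va, _ = self.attn_va(query=vis, key=aud, value=aud)  # visual attends to audio
            vt, _ = self.attn_vt(query=vis, key=txt, value=txt)  # visual attends to text
            fused = self.norm1(vis + va + vt)
            return self.norm2(fused + self.ff(fused))

    block = CrossModalFusionBlock()
    out = block(torch.randn(2, 16, 256), torch.randn(2, 40, 256), torch.randn(2, 12, 256))
    print(out.shape)  # torch.Size([2, 16, 256])

Whether a block like this would actually lift performance on the same BAH test videos is exactly what the proposed experiment would determine.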

Figures

Figures reproduced from arXiv: 2604.11730 by Alessandro Lameiras Koerich, Eric Granger, Lorenzo Sia, Manuela González-González, Marco Pedersoli, Masoumeh Sharafi, Muhammad Haseeb Aslam, Muhammad Osama Zeeshan, Nicolas Richet, Simon L Bacon, Soufiane Belharbi.

Figure 1. Conceptual illustration of the theoretical pathway.

Figure 2. BAH examples of frames with (green) and without (orange) A/H, with cues detailed in [22].

Figure 3. Multimodal model used for baseline evaluation.
read the original abstract

Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role for individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial, vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript explores the application of deep learning to multimodal ambivalence/hesitancy (A/H) recognition in videos from the BAH dataset. It evaluates three setups—supervised learning, unsupervised domain adaptation for personalization, and LLM zero-shot inference—and reports limited performance, concluding that more adapted models are needed for spatio-temporal and cross-modal fusion to capture affective inconsistencies.

Significance. If the reported limited performance is substantiated with quantitative evidence, the work usefully identifies open challenges in affective computing for digital health interventions, particularly the difficulty of modeling subtle conflicts within and across modalities. The inclusion of multiple learning paradigms (supervised, domain adaptation, zero-shot) is a positive aspect that broadens the empirical scope.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'our results show limited performance' is load-bearing for the recommendation of better spatio-temporal and multimodal fusion methods, yet the abstract supplies no accuracy, F1, or other quantitative metrics, no baseline comparisons, no dataset statistics (e.g., number of videos, class balance), and no error bars. Without these, the claim that standard architectures are insufficient cannot be evaluated.
  2. [Experiments] Results/Experiments section (inferred from the three learning setups described): The manuscript states that standard deep learning architectures yield limited performance on A/H recognition but provides no details on the specific video models used (e.g., which spatio-temporal backbones, fusion strategies, or loss functions), making it impossible to assess whether the 'limited performance' stems from architectural limitations or from implementation choices.
minor comments (2)
  1. [Abstract] The abstract and introduction use 'A/H' without an initial definition on first use; expand the acronym at first mention for clarity.
  2. [Introduction] The manuscript refers to the 'unique and recently published BAH video dataset' but does not cite its source or provide a reference; add the appropriate citation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the clarity and substantiation of our claims about the challenges in multimodal A/H recognition. We address each major comment below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'our results show limited performance' is load-bearing for the recommendation of better spatio-temporal and multimodal fusion methods, yet the abstract supplies no accuracy, F1, or other quantitative metrics, no baseline comparisons, no dataset statistics (e.g., number of videos, class balance), and no error bars. Without these, the claim that standard architectures are insufficient cannot be evaluated.

    Authors: We agree that the abstract should be more self-contained to support the central claim. In the revised version, we will expand the abstract to report key quantitative results (accuracy, F1, and other relevant metrics for the supervised, domain adaptation, and zero-shot setups), include BAH dataset statistics (number of videos, class balance), reference baseline comparisons, and note error bars or variability from our runs. This will allow readers to directly evaluate the evidence for needing improved spatio-temporal and cross-modal fusion methods. revision: yes

  2. Referee: [Experiments] Results/Experiments section (inferred from the three learning setups described): The manuscript states that standard deep learning architectures yield limited performance on A/H recognition but provides no details on the specific video models used (e.g., which spatio-temporal backbones, fusion strategies, or loss functions), making it impossible to assess whether the 'limited performance' stems from architectural limitations or from implementation choices.

    Authors: We acknowledge the need for greater specificity in the experimental description. We will revise the Experiments section to explicitly detail the spatio-temporal backbones (e.g., the video encoders used for visual features), cross-modal fusion strategies (e.g., late fusion, attention mechanisms, or other approaches), and loss functions applied in each of the three learning setups. These additions will enable assessment of whether the observed limited performance arises from the task's inherent difficulties (affective inconsistencies across modalities) or from the particular implementation choices, thereby strengthening the motivation for more adapted models. revision: yes
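On the metric-reporting promise in response 1, one lightweight way to attach error bars to video-level results is a bootstrap over test videos. The snippet below is purely illustrative: the labels and predictions are synthetic stand-ins rather than BAH numbers, and the procedure is only an assumption about how the revised reporting might be done, using scikit-learn and NumPy.

    # Illustrative only: bootstrap confidence interval for macro-F1 over test videos.
    # y_true / y_pred are synthetic stand-ins, not BAH results.
    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    y_true = (rng.random(200) < 0.3).astype(int)                   # minority-positive labels
    y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)   # noisy predictions

    scores = []
    for _ in range(1000):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample videos with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))

    lo, hi = np.percentile(scores, [2.5, 97.5])
    print(f"macro-F1 = {f1_score(y_true, y_pred, average='macro'):.3f} "
          f"(95% bootstrap CI {lo:.3f} to {hi:.3f})")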

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a purely empirical application paper that applies off-the-shelf supervised video models, domain-adaptation techniques, and LLM zero-shot inference to the BAH dataset and reports the resulting performance numbers. No derivations, equations, parameter fittings, or self-referential definitions appear in the work; the modest conclusion that current architectures yield limited performance and that better spatio-temporal fusion is needed follows directly from the tabulated experimental outcomes without any reduction to the paper's own inputs or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract supplies no explicit free parameters, invented entities, or detailed axioms beyond standard machine-learning assumptions for video processing.

axioms (1)
  • domain assumption Deep learning models can learn representations of subtle emotional states from multimodal video data when trained on sufficient examples.
    Implicit premise required to justify applying off-the-shelf video models to A/H detection.

pith-pipeline@v0.9.0 · 5644 in / 1098 out tokens · 68609 ms · 2026-05-10T15:56:49.245876+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

116 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1] C. J. Armitage and M. Conner. Attitudinal ambivalence: A test of three key hypotheses. Personality and Social Psychology Bulletin, 26(11):1421–1432, 2000.
  2. [2] M. Aslam, M. Zeeshan, S. Belharbi, M. Pedersoli, A. L. Koerich, S. Bacon, and E. Granger. Distilling privileged multimodal information for expression recognition using optimal transport. In International Conference on Automatic Face and Gesture Recognition (FG), 2024.
  3. [3] S. Bai, J. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR, abs/1803.01271, 2018.
  4. [4] S. Belharbi, M. Pedersoli, A. L. Koerich, S. Bacon, and E. Granger. Guided interpretable facial expression recognition via spatial action unit cues. In International Conference on Automatic Face and Gesture Recognition (FG), 2024.
  5. [5] S. Belharbi, M. Pedersoli, A. L. Koerich, S. Bacon, and E. Granger. Spatial action unit cues for interpretable deep facial expression recognition. In AI and Digital Health Symposium, 2024.
  6. [6] A. Bucher, E. S. Blazek, and C. T. Symons. How are machine learning and artificial intelligence used in digital behavior change interventions? A scoping review. Mayo Clinic Proceedings: Digital Health, 2(3):375–404, 2024.
  7. [7] C. Buttorff, T. Ruder, and M. Bauman. Multiple chronic conditions in the United States, volume 10. 2017.
  8. [8] H. Chaptoukaev, V. Strizhkova, M. Panariello, B. Dalpaos, A. Reka, V. Manera, S. Thummler, E. Ismailova, N. Evans, F. Bremond, M. Todisco, M. A. Zuluaga, and L. M. Ferrari. StressID: a multimodal dataset for stress identification. In NeurIPS, 2023.
  9. [9] M. Conner and C. Armitage. Attitudinal ambivalence. 2008.
  10. [10] M. Conner and P. Sparks. Ambivalence and attitudes. European Review of Social Psychology, 12(1):37–70, 2002.
  11. [11] D. Kollias, P. Tzirakis, A. Cowen, S. Zafeiriou, I. Kotsia, E. Granger, M. Pedersoli, S. Bacon, A. Baird, C. Gagne, C. Shao, G. Hu, S. Belharbi, and M. H. Aslam. Advancements in affective and behavior analysis: The 8th ABAW workshop and competition. In CVPR Workshop, 2025.
  12. [12] K. Davidson and U. Scholz. Understanding and predicting health behaviour change: a contemporary view through the lenses of meta-reviews. Health Psychology Review, 14(1):1–5, 2020.
  13. [13] B. R. Delazeri, A. G. Hochuli, J. P. Barddal, A. L. Koerich, and A. de S. Britto Jr. Representation ensemble learning applied to facial expression recognition. Neural Computing and Applications, 37(1):417–438, 2025.
  14. [14] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou. RetinaFace: Single-stage dense face localisation in the wild. CoRR, abs/1905.00641, 2019.
  15. [15] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.
  16. [16] Diabetes Prevention Program (DPP) Research Group. The diabetes prevention program (DPP): description of lifestyle intervention. Diabetes Care, 25(12):2165–2171, Dec. 2002.
  17. [17] L. Dong, X. Wang, S. Setlur, V. Govindaraju, and I. Nwogu. Ig3D: Integrating 3D face representations in facial expression inference. In ECCV, 2024.
  18. [18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  19. [19] Y. Fan, J. Lam, and V. Li. Facial action unit intensity estimation via semantic correspondence learning with dynamic graph convolution. In AAAI, 2020.
  20. [20] N. M. Foteinopoulou and I. Patras. EmoCLIP: A vision-language method for zero-shot video facial expression recognition. In International Conference on Automatic Face and Gesture Recognition (FG), 2024.
  21. [21] M. González-González, J. Almeida, L. Ortiz, S. Belharbi, K. Lavoie, E. Granger, and S. Bacon. Identifying multimodal cues of ambivalence and hesitancy for digital health behaviour change interventions. In Annals of Behavioral Medicine, 2026.
  22. [22] M. González-González, S. Belharbi, M. O. Zeeshan, M. Sharafi, M. H. Aslam, M. Pedersoli, A. L. Koerich, S. L. Bacon, and E. Granger. BAH dataset for ambivalence/hesitancy recognition in videos for digital behavioural change. In ICLR, 2026.
  23. [23] X. Guo, B. Zhu, L. Polanía, C. Boncelet, and K. Barner. Group-level emotion recognition using hybrid deep models based on faces, scenes, skeletons and visual attentions. In ACM International Conference on Multimodal Interaction, pages 635–639, 2018.
  24. [24] K. Hacker. The burden of chronic disease. Mayo Clinic Proceedings: Innovations, Quality & Outcomes, 8(1):112–119, 2024.
  25. [25] J. Hall, J. Harrigan, and R. Rosenthal. Nonverbal behavior in clinician–patient interaction. Applied and Preventive Psychology, 4(1):21–37, 1995.
  26. [26] J. Han, L. Xie, J. Liu, and X. Li. Personalized broad learning system for facial expression. Multimedia Tools and Applications, 2020.
  27. [27] J. Hassan, M. H. Gani, N. Hussein, M. U. Khattak, M. M. Naseer, F. Shahbaz Khan, and S. H. Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. In NeurIPS, 2023.
  28. [28] D. Hayashi, S. Carvalho, P. Ribeiro, R. Rodrigues, T. São-João, K. Lavoie, S. Bacon, and M. E. Cornelio. Methods to assess ambivalence towards food and diet: a scoping review. Journal of Human Nutrition and Dietetics, 36(5):2010–2025, 2023.
  29. [29] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  30. [30] Z. He, Z. Li, F. Yang, L. Wang, J. Li, C. Zhou, and J. Pan. Advances in multimodal emotion recognition based on brain–computer interfaces. Brain Sciences, 10(10):687, 2020.
  31. [31] M. Heisel and M. Mongrain. Facial expressions and ambivalence: Looking for conflict in all the right faces. Journal of Nonverbal Behavior, 28:35–52, 2004.
  32. [32] S. Hershey, S. Chaudhuri, D. Ellis, J. Gemmeke, A. Jansen, R. Moore, M. Plakal, D. Platt, R. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson. CNN architectures for large-scale audio classification. In ICASSP, 2017.
  33. [33] P. Heuveline. Global and national declines in life expectancy: An end-of-2021 assessment. Population and Development Review, 48(1):31–50, 2022.
  34. [34] Z. Hohman, W. Crano, and E. Niedbala. Attitude ambivalence, social norms, and behavioral intentions: Developing effective antitobacco persuasive communications. Psychology of Addictive Behaviors, 30(2):209, 2016.
  35. [35] S. Hornstein, K. Zantvoort, U. Lueken, B. Funk, and K. Hilbert. Personalization strategies in digital mental health interventions: a systematic review and conceptual framework for depressive symptoms. Frontiers in Digital Health, 5:1170002, 2023.
  36. [36] T.-C. C. Hsu, P. Whelan, J. Gandrup, C. J. Armitage, L. Cordingley, and J. McBeth. Personalized interventions for behaviour change: A scoping review of just-in-time adaptive interventions. British Journal of Health Psychology, 30(1):e12766, 2025.
  37. [37] J. L. Kaar, C. M. Luberto, K. A. Campbell, and J. C. Huffman. Sleep, health behaviors, and behavioral interventions: Reducing the risk of cardiovascular disease in adults. World Journal of Cardiology, 9(5):396, 2017.
  38. [38] A. Karmanov, D. Guan, S. Lu, A. El Saddik, and E. Xing. Efficient test-time adaptation of vision-language models. In CVPR, 2024.
  39. [39] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, and M. Suleyman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  40. [40] R. Khanna, N. Robinson, M. O'Donnell, H. Eyre, and E. Smith. Affective computing in psychotherapy. Advances in Psychiatry and Behavioral Health, 2(1):95–105, Sep. 2022.
  41. [41] D. Kollias. Multi-label compound expression recognition: C-EXPR database & network. In CVPR, 2023.
  42. [42] D. Kollias and S. Zafeiriou. Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace. CoRR, 2019.
  43. [43] K. Kraack. A multimodal emotion recognition system: Integrating facial expressions, body movement, speech, and spoken language. CoRR, abs/2412.17907, 2024.
  44. [44] S. Labbé, I. Colmegna, V. Valerio, V. Boucher, S. Peláez, A. Dragomir, C. Laurin, E. Hazel, S. Bacon, and K. Lavoie. Training physicians in motivational communication to address influenza vaccine hesitation: a proof-of-concept study. Vaccines, 10(2):143, 2022.
  45. [45] S. Li and W. Deng. Deep emotion transfer network for cross-database facial expression recognition. In ICPR, 2018.
  46. [46] S. Li, W. Deng, and J. Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, 2017.
  47. [47] Z. Lian, H. Chen, L. Chen, H. Sun, L. Sun, Y. Ren, Z. Cheng, B. Liu, R. Liu, X. Peng, J. Yi, and J. Tao. AffectGPT: A new dataset, model, and benchmark for emotion understanding with multimodal large language models. In ICML, 2025.
  48. [48] J. Liang, D. Hu, and J. Feng. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In ICML, 2020.
  49. [49] Y. Liang, H. Chen, Y. Xiong, Z. Zhou, M. Lyu, Z. Lin, S. Niu, S. Zhao, J. Han, and G. Ding. Advancing reliable test-time adaptation of vision-language models under visual variations. In ACM Multimedia, 2025.
  50. [50] B. Liberatori, A. Conti, P. Rota, Y. Wang, and E. Ricci. Test-time zero-shot temporal action localization. In CVPR, 2024.
  51. [51] B. Lin, Y. Ye, B. Zhu, J. Cui, M. Ning, P. Jin, and L. Yuan. Video-LLaVA: Learning united visual representation by alignment before projection, 2024.
  52. [52] A. Lisowska, S. Wilk, and M. Peleg. Personalising digital health behaviour change interventions using machine learning and domain knowledge. CoRR, abs/2304.03392, 2023.
  53. [53] C. Liu, X. Zhang, X. Liu, T. Zhang, L. Meng, Y. Liu, Y. Deng, and W. Jiang. Facial expression recognition based on multi-modal features for videos in the wild. In CVPR, 2023.
  54. [54] H. Liu, R. An, Z. Zhang, B. Ma, W. Zhang, Y. Song, Y. Hu, W. Chen, and Y. Ding. Norface: Improving facial expression analysis by identity normalization. ECCV, 2024.
  55. [55] X. Liu, L. Jin, X. Han, J. Lu, J. You, and L. Kong. Identity-aware facial expression recognition in compressed video. In ICPR, 2021.
  56. [56] Y. Liu, W. Wang, C. Feng, H. Zhang, Z. Chen, and Y. Zhan. Expression snippet transformer for robust video-based facial expression recognition. Pattern Recognition, 138:109368, 2023.
  57. [57] H. Lokhande, C. Garware, T. Kudale, and R. Kumar. Personalized well-being interventions (PWIs): A new frontier in mental health. In Affective Computing for Social Good: Enhancing Well-being, Empathy, and Equity, pages 183–200. 2024.
  58. [58] C. Luo, S. Song, W. Xie, L. Shen, and H. Gunes. Learning multi-dimensional edge feature-based AU relation graph for facial action unit recognition. In IJCAI, 2022.
  59. [59] N. MacDonald. Vaccine hesitancy: Definition, scope and determinants. Vaccine, 33(34):4161–4164, 2015.
  60. [60] S. Mantena, A. Johnson, M. Oppezzo, N. Schütz, A. Tolas, R. Doijad, C. M. Mattson, A. Lawrie, M. Ramirez-Posada, P. Schmiedmayer, et al. Fine-tuning LLMs in behavioral psychology for scalable health coaching. npj Cardiovascular Health, 2(1):48, 2025.
  61. [61] J. Manuel and T. Moyers. The role of ambivalence in behavior change. Addiction, 111(11):1910–1912, Nov. 2016.
  62. [62] J. Mao, R. Xu, X. Yin, Y. Chang, B. Nie, A. Huang, and Y. Wang. POSTER++: A simpler and stronger facial expression recognition network. Pattern Recognition, page 110951, 2024.
  63. [63] M. Mather and P. Scommegna. Up to half of U.S. premature deaths are preventable; behavioral factors key, 2015.
  64. [64] J. A. Matthews, S. Matthews, M. D. Faries, and R. Q. Wolever. Supporting sustainable health behavior change: the whole is greater than the sum of its parts. Mayo Clinic Proceedings: Innovations, Quality & Outcomes, 8(3):263–275, 2024.
  65. [65] C. McCord, F. Ullrich, K. A. S. Merchant, D. Bhagianadh, K. D. Carter, E. Nelson, J. P. Marcin, K. B. Law, J. Neufeld, A. Giovanetti, and M. M. Ward. Comparison of in-person vs. telebehavioral health outcomes from rural populations across America. BMC Psychiatry, 22(1):778, Dec. 2022.
  66. [66] S. Michie, M. Richardson, M. Johnston, C. Abraham, J. Francis, W. Hardeman, M. Eccles, J. Cane, and C. Wood. The behavior change technique taxonomy (v1) of 93 hierarchically clustered techniques: building an international consensus for the reporting of behavior change interventions. Annals of Behavioral Medicine, 46(1):81–95, 2013.
  67. [67] S. Michie, R. West, and B. Spring. Moving from theory to practice and back in social and health psychology. 2013.
  68. [68] K. R. Middleton, S. D. Anton, and M. G. Perri. Long-term adherence to health behavior change. American Journal of Lifestyle Medicine, 7(6):395–404, 2013.
  69. [69] D. D. Miller. Can AI help with the hardest thing: pro health behavior change. npj Cardiovascular Health, 3(1):3, 2026.
  70. [70] W. Miller and G. Rose. Motivational interviewing and decisional balance: contrasting responses to client ambivalence. Behavioural and Cognitive Psychotherapy, 43(2):129–141, 2015.
  71. [71] A. Mollahosseini, B. Hassani, and M. H. Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild. TAFFC, 10(1):18–31, 2019.
  72. [72] B. C. Mulder, H. Algra, E. Cruijsen, J. M. Geleijnse, R. M. Winkels, and W. Kroeze. Beyond motivation: Creating supportive healthcare environments for engaging in therapeutic patient education according to healthcare providers. PEC Innovation, 6:100405, 2025.
  73. [73] S. Murtaza, S. Belharbi, M. Pedersoli, and E. Granger. A realistic protocol for evaluation of weakly supervised object localization. In WACV, 2025.
  74. [74] J. Nasimzada, J. Kleesiek, K. Herrmann, A. Roitberg, and C. Seibold. Towards synthetic data generation for improved pain recognition in videos under patient constraints. CoRR, abs/2409.16382, 2024.
  75. [75] L. Nielsen, M. Riddle, W. M. King, J. W. Aklin, W. Chen, D. Clark, E. Collier, S. Czajkowski, L. Esposito, R. Ferrer, et al. The NIH science of behavior change program: Transforming the science through a focus on mechanisms of change. Behaviour Research and Therapy, 101:3–11, July 2017.
  76. [76] A. O'Donnell, M. Addison, L. Spencer, H. Zurhold, M. Rosenkranz, R. McGovern, E. Gilvarry, M.-S. Martens, U. Verthein, and E. Kaner. Which individual, social and environmental influences shape key phases in the amphetamine type stimulant use trajectory? A systematic narrative review and thematic synthesis of the qualitative literature. Addiction, 114(1):...
  77. [77] V. Olie, C. Grave, G. Helft, V. Nguyen-Thanh, R. Andler, G. Quatremere, A. Pasquereau, E. Lahaie, G. Lailler, C. Verdot, et al. Epidemiology of cardiovascular risk factors: Behavioural risk factors. Archives of Cardiovascular Diseases, 117(12):770–784, 2024.
  78. [78] C. Ortiz, T. López-Cuadrado, A. Ayuso-Álvarez, C. Rodríguez-Blázquez, and I. Galán. Co-occurrence of behavioural risk factors for non-communicable diseases and mortality risk in Spain: a population-based cohort study. BMJ Open, 15(1):e093037, 2025.
  79. [79] J. A. Parkinson. Promoting behavioral change to improve health outcomes, 2025.
  80. [80] R. G. Praveen, P. Cardinal, and E. Granger. Audio–visual fusion for emotion recognition in the valence–arousal space using joint cross-attention. IEEE Transactions on Biometrics, Behavior, and Identity Science, 5(3):360–373, 2023.

Showing first 80 references.