pith. machine review for the scientific record.

arxiv: 2604.21369 · v1 · submitted 2026-04-23 · 💻 cs.LG · cs.HC

Recognition: unknown

Channel-Free Human Activity Recognition via Inductive-Bias-Aware Fusion Design for Heterogeneous IoT Sensor Environments

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:32 UTC · model grok-4.3

classification: 💻 cs.LG · cs.HC
keywords: human activity recognition · channel-free processing · IoT sensor fusion · conditional batch normalization · heterogeneous sensors · metadata conditioning · joint optimization

The pith

A single shared model can recognize human activities from any combination of IoT sensors by processing channels independently and using metadata to guide fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that human activity recognition can operate without assuming any fixed number, order, or semantic arrangement of input channels from heterogeneous IoT sensors. It does so by encoding each channel separately, feeding sensor metadata such as body location and modality into a conditional batch normalization step for late fusion, and training with a joint loss on both per-channel and fused outputs. A sympathetic reader would care because conventional models tie their input layers to specific dataset channel templates, rendering them unusable when sensor setups change across devices or environments. The design therefore aims at reusable inference that preserves discriminability even as channel compositions vary.
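
Under these constraints the forward pass is straightforward to picture. Below is a minimal, hypothetical PyTorch sketch of the design as described: a shared per-channel encoder, a FiLM-style scale-and-shift standing in for the paper's conditional batch normalization (sketched separately further down), order-invariant mean pooling over channels, and separate per-channel and fused heads. Every layer width, the metadata vocabulary, and the pooling choice are my assumptions, not the paper's.

    # Hypothetical sketch only; layer sizes and metadata handling are assumed.
    import torch
    import torch.nn as nn

    class ChannelFreeHAR(nn.Module):
        def __init__(self, n_classes: int, n_meta: int, d: int = 64):
            super().__init__()
            # Shared 1-D conv encoder applied to every channel independently,
            # so no fixed channel count, order, or semantics is assumed.
            self.encoder = nn.Sequential(
                nn.Conv1d(1, d, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.meta_embed = nn.Embedding(n_meta, d)  # ids for location/modality/axis
            self.gamma = nn.Linear(d, d)  # FiLM-style stand-in for CBN
            self.beta = nn.Linear(d, d)
            self.channel_head = nn.Linear(d, n_classes)  # per-channel predictions
            self.fused_head = nn.Linear(d, n_classes)    # fused prediction

        def forward(self, x: torch.Tensor, meta: torch.Tensor):
            # x: (batch, channels, time); meta: (batch, channels) metadata ids.
            b, c, t = x.shape
            z = self.encoder(x.reshape(b * c, 1, t)).squeeze(-1)  # (b*c, d)
            m = self.meta_embed(meta.reshape(b * c))
            z = self.gamma(m) * z + self.beta(m)  # metadata conditioning
            per_channel = self.channel_head(z).reshape(b, c, -1)
            fused = self.fused_head(z.reshape(b, c, -1).mean(dim=1))  # order-invariant
            return per_channel, fused

    model = ChannelFreeHAR(n_classes=12, n_meta=32)
    x = torch.randn(4, 7, 128)              # 7 channels here; any count works
    meta = torch.randint(0, 32, (4, 7))
    per_channel, fused = model(x, meta)
    print(per_channel.shape, fused.shape)   # (4, 7, 12) and (4, 12)

Because the encoder weights are shared across channels and the fusion pools over the channel axis, the same parameters serve any channel composition, which is the property the paper calls strict channel-freedom.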

Core claim

The central claim is that strict channel-free HAR becomes feasible through channel-wise encoding paired with a shared encoder, metadata-conditioned late fusion via conditional batch normalization, and a combination loss that jointly optimizes individual channel predictions and the final fused result. Sensor metadata recovers structural relations that independent channel processing would otherwise discard, allowing one model to handle arbitrary channel counts and arrangements across datasets.

What carries the argument

Metadata-conditioned late fusion via conditional batch normalization, which adapts the fusion step using sensor details such as body location, modality, and axis to restore information lost when channels are processed independently.
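
Conditional batch normalization, introduced by de Vries et al. for modulating visual processing with language, normalizes features and then applies a scale and shift predicted from a conditioning vector. A minimal sketch under assumed sizes (the excerpt does not give the paper's exact parameterization):

    # Minimal CBN sketch; feature and condition widths are illustrative.
    import torch
    import torch.nn as nn

    class ConditionalBatchNorm1d(nn.Module):
        def __init__(self, num_features: int, cond_dim: int):
            super().__init__()
            # Normalize without learned affine; the condition supplies it.
            self.bn = nn.BatchNorm1d(num_features, affine=False)
            self.gamma = nn.Linear(cond_dim, num_features)  # per-feature scale
            self.beta = nn.Linear(cond_dim, num_features)   # per-feature shift

        def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
            # x: (batch, features); cond: (batch, cond_dim) metadata embedding.
            return self.gamma(cond) * self.bn(x) + self.beta(cond)

    cbn = ConditionalBatchNorm1d(num_features=64, cond_dim=24)
    features = torch.randn(8, 64)   # channel features from the shared encoder
    metadata = torch.randn(8, 24)   # embedded (location, modality, axis)
    print(cbn(features, metadata).shape)  # torch.Size([8, 64])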

Load-bearing premise

Sensor metadata such as body location, modality, and axis is available and sufficient to recover the structural information that channel-independent processing alone cannot retain.
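
If the premise holds, each channel's metadata reduces to a few small categorical fields. One plausible encoding, with assumed field names and vocabulary sizes, embeds each field and concatenates the results into the conditioning vector the CBN layer above consumes:

    # Assumed metadata fields and vocabulary sizes, for illustration only.
    import torch
    import torch.nn as nn

    class SensorMetadataEmbedding(nn.Module):
        def __init__(self, n_locations=8, n_modalities=4, n_axes=3, d=8):
            super().__init__()
            self.location = nn.Embedding(n_locations, d)   # e.g. wrist, chest, ankle
            self.modality = nn.Embedding(n_modalities, d)  # e.g. accel, gyro, magnetometer
            self.axis = nn.Embedding(n_axes, d)            # x / y / z

        def forward(self, loc, mod, axis):
            # Each argument: (batch,) integer ids for one channel's metadata.
            return torch.cat(
                [self.location(loc), self.modality(mod), self.axis(axis)], dim=-1)

    embed = SensorMetadataEmbedding()
    cond = embed(torch.tensor([0, 2]), torch.tensor([1, 0]), torch.tensor([2, 1]))
    print(cond.shape)  # torch.Size([2, 24]): the CBN condition above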

What would settle it

A direct comparison on the same heterogeneous datasets in which the version without metadata conditioning matches or exceeds the full model's accuracy and cross-dataset transfer performance.
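
Operationally that is a paired ablation over identical splits. A skeletal harness follows; `train_and_eval` is a hypothetical placeholder that a real run would replace with training the full model and the metadata-free variant on the same folds:

    # Skeleton only: train_and_eval is a stub standing in for real training.
    import random

    def train_and_eval(dataset: str, use_metadata: bool, seed: int) -> float:
        # Placeholder; a real implementation trains and returns test accuracy.
        random.seed(hash((dataset, use_metadata, seed)) & 0xFFFF)
        return random.uniform(0.6, 0.9)

    def ablation(datasets, seeds=(0, 1, 2)):
        results = {}
        for name in datasets:
            for use_meta in (True, False):
                accs = [train_and_eval(name, use_meta, s) for s in seeds]
                results[(name, use_meta)] = sum(accs) / len(accs)
        return results

    print(ablation(["PAMAP2", "UCI-HAR"]))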

Figures

Figures reproduced from arXiv: 2604.21369 by Tatsuhito Hasegawa.

Figure 1: Comparison of EF, MF, and LF for channel-free HAR. …
Figure 2: Architecture of the proposed channel-free HAR model. …
Figure 3: Accuracy (%) distributions on PAMAP2 for Baseline, EF, MF, LF, …
Figure 4: Inference time as a function of the number of input channels for …
Figure 5: Accuracy as a function of perturbation intensity under six conditions on PAMAP2 (LOSO-CV; shaded: …)
Figure 6: Sensitivity analysis on PAMAP2 (single trial, error bars: …)
Original abstract

Human activity recognition (HAR) in Internet of Things (IoT) environments must cope with heterogeneous sensor settings that vary across datasets, devices, body locations, sensing modalities, and channel compositions. This heterogeneity makes conventional channel-fixed models difficult to reuse across sensing environments because their input representations are tightly coupled to predefined channel structures. To address this problem, we investigate strict channel-free HAR, in which a single shared model performs inference without assuming a fixed number, order, or semantic arrangement of input channels, and without relying on sensor-specific input layers or dataset-specific channel templates. We argue that fusion design is the central issue in this setting. Accordingly, we propose a channel-free HAR framework that combines channel-wise encoding with a shared encoder, metadata-conditioned late fusion via conditional batch normalization, and joint optimization of channel-level and fused predictions through a combination loss. The proposed model processes each channel independently to handle varying channel configurations, while sensor metadata such as body location, modality, and axis help recover structural information that channel-independent processing alone cannot retain. In addition, the joint loss encourages both the discriminability of individual channels and the consistency of the final fused prediction. Experiments on PAMAP2, together with robustness analysis on six HAR datasets, ablation studies, sensitivity analysis, efficiency evaluation, and cross-dataset transfer learning, demonstrate three main findings...

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a channel-free HAR framework for heterogeneous IoT sensor environments that processes each input channel independently via a shared encoder, performs metadata-conditioned late fusion using conditional batch normalization (conditioned on sensor metadata such as body location, modality, and axis), and jointly optimizes channel-level and fused predictions with a combination loss. This design aims to enable a single reusable model that handles arbitrary channel counts, orders, and compositions without fixed input layers or dataset-specific templates. The approach is evaluated via experiments on PAMAP2, robustness analysis across six HAR datasets, ablation studies, sensitivity analysis, efficiency evaluation, and cross-dataset transfer learning, with the abstract indicating these demonstrate three main findings on the fusion design's effectiveness.

Significance. If the empirical results hold and the framework generalizes, this could be significant for practical HAR deployment in variable IoT settings, as it reduces the need for per-environment model redesign or retraining. The combination of channel-independent processing with metadata-driven inductive biases via CBN offers a concrete mechanism to retain structural information without sacrificing flexibility, and the joint loss provides a principled way to balance per-channel discriminability with fused consistency. Cross-dataset transfer experiments are a positive element for assessing real-world reusability.

major comments (1)
  1. [Robustness analysis on six HAR datasets] Robustness analysis and ablation studies sections: The central 'strict channel-free' claim depends on the assumption that complete, accurate sensor metadata is always available at inference to condition the CBN fusion and recover structural priors discarded by channel-wise encoding. However, the described PAMAP2 and cross-dataset experiments supply full metadata by construction, with no reported ablation using masked, noisy, or absent metadata. This leaves untested the regime where the late-fusion path would collapse to the weaker channel-independent baseline, which is load-bearing for claims about arbitrary heterogeneous IoT deployments.
minor comments (1)
  1. [Abstract] The abstract states that the experiments 'demonstrate three main findings' but does not enumerate or summarize those findings, which reduces clarity when assessing whether the data supports the central claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the robustness analysis and metadata assumptions below.

Point-by-point responses
  1. Referee: [Robustness analysis on six HAR datasets] Robustness analysis and ablation studies sections: The central 'strict channel-free' claim depends on the assumption that complete, accurate sensor metadata is always available at inference to condition the CBN fusion and recover structural priors discarded by channel-wise encoding. However, the described PAMAP2 and cross-dataset experiments supply full metadata by construction, with no reported ablation using masked, noisy, or absent metadata. This leaves untested the regime where the late-fusion path would collapse to the weaker channel-independent baseline, which is load-bearing for claims about arbitrary heterogeneous IoT deployments.

    Authors: We appreciate this observation. In the proposed framework, sensor metadata (body location, modality, axis) is treated as auxiliary configuration information that is known a priori in IoT deployments and supplied at inference; it is not inferred from the raw signals. The 'strict channel-free' property refers specifically to the absence of fixed input-layer assumptions or dataset-specific channel templates, allowing arbitrary channel counts/orders/compositions via per-channel encoding. Metadata-conditioned CBN then injects the structural priors needed for effective late fusion. We agree that an explicit test of the fallback regime is valuable. In the revision we will add an ablation that simulates missing/noisy metadata (randomly masking 30% of fields and injecting categorical noise) across PAMAP2 and two additional datasets, reporting both fused and per-channel accuracies to quantify graceful degradation to the channel-independent baseline. This will be included in the robustness analysis section without changing the core experimental claims. revision: partial
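
For concreteness, the corruption the rebuttal proposes (masking 30% of metadata fields, plus categorical noise) could look like the sketch below; the reserved UNK id and the 10% noise rate are assumptions, not values from the paper:

    # Hypothetical metadata corruption for the proposed ablation.
    import torch

    UNK = 0  # assumed reserved id meaning "metadata unavailable"

    def corrupt_metadata(meta: torch.Tensor, vocab_size: int,
                         mask_rate: float = 0.3, noise_rate: float = 0.1) -> torch.Tensor:
        out = meta.clone()
        # Mask ~30% of fields, as in the rebuttal.
        mask = torch.rand_like(meta, dtype=torch.float) < mask_rate
        out[mask] = UNK
        # Flip a further fraction of surviving fields to random categories.
        noisy = (torch.rand_like(meta, dtype=torch.float) < noise_rate) & ~mask
        out[noisy] = torch.randint(1, vocab_size, (int(noisy.sum()),))
        return out

    meta = torch.randint(1, 32, (4, 7))  # (batch, channels) metadata ids
    print(corrupt_metadata(meta, vocab_size=32))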

Circularity Check

0 steps flagged

No significant circularity; proposal is self-contained methodological design

Full rationale

The paper introduces a channel-free HAR framework as an explicit architectural proposal combining independent channel encoding, a shared encoder, metadata-conditioned late fusion via conditional batch normalization, and a joint loss for channel-level and fused predictions. This construction is presented as a new design choice whose value is demonstrated empirically on PAMAP2 and cross-dataset experiments rather than derived from or reduced to prior fitted parameters, self-citations, or definitional loops. The role of sensor metadata (body location, modality, axis) is stated as an inductive bias to recover structure, not as a hidden redefinition of the input channels or a prediction forced by construction. No equations or claims in the provided text exhibit self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation chains. The framework remains falsifiable via ablation on metadata availability, consistent with a non-circular proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the effectiveness of metadata-conditioned fusion and joint optimization. No new physical entities are postulated. The approach assumes metadata availability and that the proposed loss improves both per-channel and fused discriminability.

axioms (2)
  • domain assumption: Sensor metadata (body location, modality, axis) is available at both training and inference time.
    Required to condition the late fusion step via conditional batch normalization.
  • domain assumption: Joint optimization via a combination loss simultaneously improves channel-level discriminability and fused prediction consistency.
    The training procedure is built on this assumption; a plausible form of the loss is sketched below.
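
A plausible form of that combination loss, with a hypothetical weight `lam` (the excerpt does not state the paper's exact weighting): cross-entropy on the fused prediction plus the mean cross-entropy over per-channel predictions.

    # Assumed form of the combination loss; the weighting is hypothetical.
    import torch
    import torch.nn.functional as F

    def combination_loss(per_channel: torch.Tensor, fused: torch.Tensor,
                         target: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
        # per_channel: (batch, channels, classes); fused: (batch, classes).
        b, c, k = per_channel.shape
        channel_loss = F.cross_entropy(per_channel.reshape(b * c, k),
                                       target.repeat_interleave(c))
        fused_loss = F.cross_entropy(fused, target)
        return fused_loss + lam * channel_loss

    loss = combination_loss(torch.randn(4, 7, 12), torch.randn(4, 12),
                            torch.randint(0, 12, (4,)))
    print(loss.item())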

pith-pipeline@v0.9.0 · 5537 in / 1328 out tokens · 31009 ms · 2026-05-09T22:32:35.361796+00:00 · methodology

