Continual Visual and Verbal Learning Through a Child's Egocentric Input

Brenden Lake; Kenneth A. Norman; Mengye Ren; Xiaoyang Jiang; Yanlai Yang

arxiv: 2606.05115 · v1 · pith:RLODB23Pnew · submitted 2026-06-03 · cs.CV · cs.AI · cs.CL

Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren

Pith reviewed 2026-06-28 06:37 UTC · model grok-4.3

classification: cs.CV, cs.AI, cs.CL

keywords: continual learning, egocentric vision, multimodal learning, word learning, contrastive learning, streaming data, SAYCam, child development

The pith

Neural networks learn word meanings from a child's egocentric video in one chronological pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BabyCL, a framework that trains on child-worn camera footage without reshuffling or repeating the data across many epochs. It breaks the continuous stream into temporal segments and uses separate replay buffers for visual features and image-text pairs, then applies three contrastive losses on a shared network. This produces word-referent mappings that beat other streaming baselines and close much of the gap to full offline training. The work matters because it tests whether word learning can occur under constraints that more closely match a child's actual, single-exposure experience rather than artificial batch training.

Core claim

BabyCL processes the SAYCam dataset in a single chronological pass by combining streaming visual representation learning with an image-text contrastive objective. It uses multi-stage temporal segmentation of the stream together with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget this yields better performance than streaming baselines on the SAYCam Labeled-S 4AFC benchmark while narrowing the gap to offline training; the gains remain stable across changes in segmentation window length and buffer eviction rules.
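To make the image-text objective concrete, here is a minimal sketch of a symmetric InfoNCE loss in plain Python. It is a common form of image-text contrastive loss, not necessarily the one BabyCL uses: the function names, the temperature of 0.07, and the similarity-matrix input format are assumptions made for illustration. The summary does not state how the three losses are defined or weighted, so only this one is sketched.

```python
import math

def info_nce(sim, temperature=0.07):
    """Mean cross-entropy where row i's correct match is column i.

    `sim` is a square list of lists: sim[i][j] is the similarity between
    frame i and utterance j (hypothetical input format).
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

def symmetric_info_nce(sim, temperature=0.07):
    """Average of the image-to-text and text-to-image directions."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (info_nce(sim, temperature) + info_nce(transposed, temperature))
```

A batch whose matching pairs sit on the diagonal with high similarity yields a lower loss than one where the matches are off-diagonal, which is the pull-together, push-apart behavior the core claim describes.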

What carries the argument

BabyCL's dual replay buffer paired with multi-stage temporal segmentation, which maintains independent visual and multimodal histories to support continual contrastive learning on a single pass through egocentric video.
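The dual-buffer idea can be sketched as two independent stores fed from one chronological stream. The sketch below uses reservoir sampling as the eviction rule, which is an assumption: the paper ablates eviction rules and the summary does not say which one is the default. The class names, capacities, and the `observe` method are hypothetical.

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer that keeps a uniform random sample of the stream."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.items = []
        self.seen = 0  # number of items offered so far

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

class DualReplayBuffer:
    """Separate histories for frames and for (frame, utterance) pairs."""

    def __init__(self, visual_capacity, paired_capacity, seed=0):
        rng = random.Random(seed)
        self.visual = ReservoirBuffer(visual_capacity, rng)
        self.paired = ReservoirBuffer(paired_capacity, rng)

    def observe(self, frame, utterance=None):
        # Every frame enters the visual history; only frames that
        # co-occur with an utterance enter the multimodal history.
        self.visual.add(frame)
        if utterance is not None:
            self.paired.add((frame, utterance))
```

Because the two buffers evict independently, a frame can leave the visual history while its paired version remains in the multimodal history, or the reverse. That independence is what "independently manages visual and multimodal histories" amounts to in this sketch.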

If this is right

Meaningful word-referent mappings can be acquired without cycling through the data for hundreds of epochs.
The approach narrows the performance gap to offline training when the total optimization budget is held constant.
The improvements remain stable when the length of the online temporal segmentation window is varied.
Performance does not depend on any single buffer eviction rule.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same segmentation-plus-dual-buffer pattern could be tested on longer video streams or on data from multiple children to check scalability.
The framework raises the possibility that separate visual and multimodal memory stores help continual learning in other sensory domains such as audio or touch.
If the method generalizes, it could be applied to robotic systems that must learn object names from their own camera streams without offline batch access.

Load-bearing premise

The SAYCam recordings, once segmented and buffered, form a close enough stand-in for the continuous single-exposure experience that supports word learning in children.

What would settle it

Removing the dual replay buffer or the temporal segmentation and finding that BabyCL no longer outperforms other streaming baselines on the SAYCam Labeled-S 4AFC benchmark.

Figures

Figures reproduced from arXiv: 2606.05115 by Brenden Lake, Kenneth A. Norman, Mengye Ren, Xiaoyang Jiang, Yanlai Yang.

**Figure 1.** Overview of BabyCL. The incoming SAYCam stream is partitioned into event segments of roughly three minutes using hierarchical visual clustering, with boundaries adjusted to utterance timestamps. Embeddings of frames within the same event segment are brought together while those from different event segments are pushed apart (ℒ_S). Matching pairs of utterances and frames are brought together while mismatch…

**Figure 2.** Example of the 4AFC task. Given a target category (“ball”) and three foil categories (“cat”, “couch”, and “car”), the model must select the image corresponding to the target category from four candidate images.
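The 4AFC decision in Figure 2 can be sketched as picking the candidate image whose embedding is most similar to the embedding of the target word. The use of cosine similarity, the function names, and the trial format below are assumptions for illustration; the summary does not specify how the benchmark scores candidates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def four_afc_choice(text_emb, image_embs):
    """Index of the candidate image most similar to the word embedding."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)

def four_afc_accuracy(trials):
    """Fraction of trials where the chosen image is the target.

    Each trial is (text_emb, image_embs, target_index), a hypothetical format.
    """
    correct = sum(
        four_afc_choice(text, images) == target
        for text, images, target in trials
    )
    return correct / len(trials)
```

With four candidates per trial, a model that guesses at random scores 25% in expectation, so reported accuracies are best read against that floor.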

**Figure 3.** Cross-video nearest neighbors (BabyCL). Each major column corresponds to one query segment. The top row (yellow) shows representative frames from the query segment, while the rows below (blue) show representative frames from retrieved segments from other sessions.

4.7 Grad-CAM++ localization under noun queries. Nearest-neighbor retrieval summarizes coarse neighborhoods; Grad-CAM++ (Chattopadhay et al., 2018…

**Figure 4.** Grad-CAM++ with noun queries (BabyCL). Four Labeled-S categories; original frames appear above heatmap overlays for each.

Limitations and future work. There remains a notable performance gap between BabyCL and offline CVCL; future works can aim to further close this gap. The replay buffers used here are large relative to the size of the video stream. Finally, additional evaluation benchmarks probing compos…

Original abstract

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BabyCL shows single-pass egocentric word learning is doable with dual replay buffers, but the buffers mean some frames get multiple looks, softening the 'closer to child experience' claim.

BabyCL processes the SAYCam egocentric videos in one chronological pass. It uses multi-stage temporal segmentation plus separate replay buffers for visual features and multimodal pairs, then trains a shared backbone with three contrastive losses. On the labeled 4AFC benchmark it beats streaming baselines and narrows the gap to offline training under a matched budget. Ablations indicate the gains hold across segmentation window sizes and eviction rules.

The dual-buffer design and the joint loss setup are the concrete additions. They let the model keep some history without full shuffling or repeated epochs, which is a reasonable engineering step for this data regime.

The replay buffers create the main softness. Resampling from them means the effective exposure count per frame is greater than one even though the overall traversal stays chronological. The paper needs to report average exposures or sampling rates to show how far this stays from true single-exposure input; without those numbers the central claim about closeness to child experience rests on weaker ground. The abstract also gives no effect sizes or variance, so the full results tables will decide whether the outperformance is reliable.
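The exposure concern can be made concrete with a small simulation that counts how many times each frame is trained on when replay is mixed into a single chronological pass. Everything here is a toy assumption: the reservoir eviction rule, the buffer capacity, the number of replayed items per step, and the function names are illustrative, and none of the values come from the paper.

```python
import random
from collections import Counter

def simulate_exposures(stream_len, capacity, replay_per_step, seed=0):
    """Count training exposures per item under single-pass replay."""
    rng = random.Random(seed)
    buffer = []
    exposures = Counter()
    for t in range(stream_len):
        exposures[t] += 1  # the one chronological exposure
        # Replay items already in the buffer alongside the new item.
        for item in rng.sample(buffer, min(replay_per_step, len(buffer))):
            exposures[item] += 1
        # Reservoir insertion of the new item (assumed eviction rule).
        if len(buffer) < capacity:
            buffer.append(t)
        else:
            j = rng.randrange(t + 1)
            if j < capacity:
                buffer[j] = t
    return exposures

def summarize(exposures):
    """Return (mean, max) exposures per item."""
    counts = list(exposures.values())
    return sum(counts) / len(counts), max(counts)
```

In this toy setup the mean exposure count is roughly one plus the number of replayed items per step, while the maximum can be much larger for items that stay in the buffer a long time. Reporting both numbers for the real buffers would show how far the regime sits from strict single exposure.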

This is for groups working on continual multimodal learning or computational developmental models. The method is specific enough and the data real enough that it deserves a serious referee, though the replay quantification and statistical details will need attention in revision.

Referee Report

2 major / 0 minor

Summary. The paper introduces BabyCL, a continual multimodal learning framework that processes the SAYCam egocentric video dataset in a single chronological pass. It combines multi-stage temporal segmentation with a dual replay buffer (managing visual and multimodal histories independently) and trains a shared backbone with three contrastive losses. BabyCL outperforms streaming baselines on the SAYCam Labeled-S 4AFC word-referent mapping benchmark under matched optimization budgets and narrows the gap to an offline upper bound; ablations indicate robustness to segmentation window length and replay eviction rules. The central claim is that meaningful word-referent mappings emerge under training conditions much closer to a child's single-exposure egocentric experience.

Significance. If the results hold after addressing the exposure-count issue, the work would be significant for bridging machine learning and developmental science by showing that grounded language acquisition is feasible without hundreds of shuffled epochs, using a more realistic single-pass regime on real child data. The ablations on segmentation and eviction provide some robustness evidence.

major comments (2)

Abstract and Methods (training procedure): The central claim that results demonstrate mappings 'under training conditions much closer to a child's actual experience' of single-exposure input rests on the assumption that the dual replay buffer preserves near-single exposure during the chronological traversal. However, independent resampling of visual and multimodal histories means non-trivial replay rates produce multiple exposures per frame; without explicit reporting of effective exposure counts, sampling probabilities, or buffer statistics, this assumption remains unverified and directly weakens the proxy-for-child-experience interpretation.
Results section: The abstract asserts outperformance over streaming baselines and ablation robustness but supplies no numerical accuracies, error bars, baseline specifications, or statistical tests on the 4AFC benchmark. The full paper must include these quantitative details (with comparisons to the offline upper bound) to substantiate the performance claims.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and commit to revisions that strengthen the manuscript's transparency and clarity.

Referee: Abstract and Methods (training procedure): The central claim that results demonstrate mappings 'under training conditions much closer to a child's actual experience' of single-exposure input rests on the assumption that the dual replay buffer preserves near-single exposure during the chronological traversal. However, independent resampling of visual and multimodal histories means non-trivial replay rates produce multiple exposures per frame; without explicit reporting of effective exposure counts, sampling probabilities, or buffer statistics, this assumption remains unverified and directly weakens the proxy-for-child-experience interpretation.

Authors: We agree that quantifying the effective exposure counts is essential for rigorously supporting the single-pass interpretation. In the revised manuscript we will add a dedicated paragraph and supplementary table reporting average exposures per frame (computed from the dual replay sampling probabilities), buffer statistics over the full chronological traversal, and a direct comparison to the offline multi-epoch regime. These additions will make the degree of deviation from pure single exposure explicit while preserving the central argument that the regime remains far closer to child experience than hundreds of shuffled epochs. revision: yes

Referee: Results section: The abstract asserts outperformance over streaming baselines and ablation robustness but supplies no numerical accuracies, error bars, baseline specifications, or statistical tests on the 4AFC benchmark. The full paper must include these quantitative details (with comparisons to the offline upper bound) to substantiate the performance claims.

Authors: The Results section already reports the 4AFC accuracies, error bars across seeds, baseline specifications, and comparisons to the offline upper bound. To directly address the concern we will insert a compact summary table of the key metrics (including statistical significance where applicable) into the main Results section and ensure the abstract references the primary numerical gains. If any requested detail is currently only in the supplement, it will be moved to the main text. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical comparisons are independent of inputs

The paper introduces an empirical continual learning framework (BabyCL) with temporal segmentation and dual replay buffers, then reports performance on the SAYCam Labeled-S benchmark against streaming baselines and an offline upper bound. No mathematical derivation chain exists that reduces predictions to fitted parameters or self-definitions by construction. Claims rest on experimental outcomes under matched optimization budgets rather than any self-citation load-bearing premise or ansatz smuggled via prior work. The single-pass processing is an explicit design choice whose effects are measured externally, not presupposed.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Abstract-only review limits visibility into exact parameter counts and background assumptions; listed items are inferred from described components.

free parameters (2)

temporal segmentation window length
Explicitly ablated; value chosen to achieve reported robustness.
replay buffer eviction rule
Ablated and affects performance; specific rule selected for results.

axioms (2)

domain assumption: Image-text contrastive objectives on a shared backbone can extract word-referent mappings from video streams
Central training mechanism invoked throughout the method.
domain assumption: Independent visual and multimodal replay buffers suffice to mitigate forgetting in single-pass continual learning
Core design choice for the dual buffer.

Reference graph

Works this paper leans on

105 extracted references · 12 canonical work pages · 2 internal anchors

[1]

arXiv preprint arXiv:2404.19132 , year=

Integrating Present and Past in Unsupervised Continual Learning , author=. arXiv preprint arXiv:2404.19132 , year=

work page arXiv
[2]

arXiv preprint arXiv:2109.05675 , year =

Online Unsupervised Learning of Visual Representations and Categories , author =. arXiv preprint arXiv:2109.05675 , year =

work page arXiv
[3]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Self-supervised models are continual learners , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[4]

The Tenth International Conference on Learning Representations , year =

Divyam Madaan and Jaehong Yoon and Yuanchun Li and Yunxin Liu and Sung Ju Hwang , title =. The Tenth International Conference on Learning Representations , year =
[5]

Advances in neural information processing systems , volume=

Continual unsupervised representation learning , author=. Advances in neural information processing systems , volume=
[6]

Taylor and Seth Baer and Constantine Dovrolis , title =

James Seale Smith and Cameron E. Taylor and Seth Baer and Constantine Dovrolis , title =. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence , year =
[7]

Open mind , volume=

SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective , author=. Open mind , volume=. 2021 , publisher=

2021
[8]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Ego4d: Around the world in 3,000 hours of egocentric video , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[9]

Advances in neural information processing systems , volume=

How well do unsupervised learning algorithms model human real-time and life-long learning? , author=. Advances in neural information processing systems , volume=
[10]

Advances in Neural Information Processing Systems , volume=

Self-supervised learning through the eyes of a child , author=. Advances in Neural Information Processing Systems , volume=
[11]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Revisiting kernel temporal segmentation as an adaptive tokenizer for long-form video understanding , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[12]

IEEE Transactions on Image Processing , volume=

Dsnet: A flexible detect-to-summarize network for video summarization , author=. IEEE Transactions on Image Processing , volume=. 2020 , publisher=

2020
[13]

Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part VII 14 , pages=

Video summarization with long short-term memory , author=. Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part VII 14 , pages=. 2016 , organization=

2016
[14]

Proceedings of the European conference on computer vision (ECCV) , pages=

Video summarization using fully convolutional sequence networks , author=. Proceedings of the European conference on computer vision (ECCV) , pages=
[15]

Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13 , pages=

Category-specific video summarization , author=. Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13 , pages=. 2014 , organization=

2014
[16]

International conference on machine learning , pages=

A simple framework for contrastive learning of visual representations , author=. International conference on machine learning , pages=. 2020 , organization=

2020
[17]

Representation Learning with Contrastive Predictive Coding

Representation learning with contrastive predictive coding , author=. arXiv preprint arXiv:1807.03748 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Self-supervised learning of pretext-invariant representations , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[19]

Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XI 16 , pages=

Contrastive multiview coding , author=. Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XI 16 , pages=. 2020 , organization=

2020
[20]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Momentum contrast for unsupervised visual representation learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[21]

Advances in neural information processing systems , volume=

Unsupervised learning of visual features by contrasting cluster assignments , author=. Advances in neural information processing systems , volume=
[22]

Advances in neural information processing systems , volume=

Bootstrap your own latent-a new approach to self-supervised learning , author=. Advances in neural information processing systems , volume=
[23]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[24]

Proceedings of the European conference on computer vision (ECCV) , pages=

Deep clustering for unsupervised learning of visual features , author=. Proceedings of the European conference on computer vision (ECCV) , pages=
[25]

International conference on machine learning , pages=

Barlow twins: Self-supervised learning via redundancy reduction , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[26]

The Tenth International Conference on Learning Representations , year =

Adrien Bardes and Jean Ponce and Yann LeCun , title =. The Tenth International Conference on Learning Representations , year =
[27]

Proceedings of the IEEE international conference on computer vision , pages=

Unsupervised visual representation learning by context prediction , author=. Proceedings of the IEEE international conference on computer vision , pages=
[28]

European conference on computer vision , pages=

Unsupervised learning of visual representations by solving jigsaw puzzles , author=. European conference on computer vision , pages=. 2016 , organization=

2016
[29]

6th International Conference on Learning Representations , year =

Spyros Gidaris and Praveer Singh and Nikos Komodakis , title =. 6th International Conference on Learning Representations , year =
[30]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Context encoders: Feature learning by inpainting , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[31]

European Conference on Computer Vision , pages=

The challenges of continuous self-supervised learning , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022
[32]

Advances in neural information processing systems , volume=

Supervised contrastive learning , author=. Advances in neural information processing systems , volume=
[33]

Current directions in psychological science , volume=

Event segmentation , author=. Current directions in psychological science , volume=. 2007 , publisher=

2007
[34]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Learning from One Continuous Video Stream , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[35]

, author=

The objective basis of behavior units. , author=. Journal of Personality and social psychology , volume=. 1977 , publisher=

1977
[36]

, author=

Perceiving, remembering, and communicating structure in events. , author=. Journal of experimental psychology: General , volume=. 2001 , publisher=

2001
[37]

Child development , volume=

Infants parse dynamic action , author=. Child development , volume=. 2001 , publisher=

2001
[38]

Journal of Cognition and Development , volume=

Infants' on-line segmentation of dynamic human action , author=. Journal of Cognition and Development , volume=. 2007 , publisher=

2007
[39]

2016 IEEE Winter Conference on Applications of Computer Vision , pages=

Krishnacam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks , author=. 2016 IEEE Winter Conference on Applications of Computer Vision , pages=. 2016 , organization=

2016
[40]

Science , volume=

Grounded language acquisition through the eyes and ears of a single child , author=. Science , volume=. 2024 , publisher=

2024
[41]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Exploring simple siamese representation learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[42]

ACM Transactions on Mathematical Software (TOMS) , volume=

Random sampling with a reservoir , author=. ACM Transactions on Mathematical Software (TOMS) , volume=. 1985 , publisher=

1985
[43]

2009 IEEE conference on computer vision and pattern recognition , pages=

Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

2009
[44]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Self-supervised learning from images with a joint-embedding predictive architecture , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[45]

Nature neuroscience , volume=

Human brain activity time-locked to perceptual event boundaries , author=. Nature neuroscience , volume=. 2001 , publisher=

2001
[46]

, author=

Event understanding and memory in healthy aging and dementia of the Alzheimer type. , author=. Psychology and aging , volume=. 2006 , publisher=

2006
[47]

Cognitive, Affective, & Behavioral Neuroscience , volume=

Activation of human motion processing areas during event perception , author=. Cognitive, Affective, & Behavioral Neuroscience , volume=. 2003 , publisher=

2003
[48]

Psychology of Learning and Motivation , volume=

Catastrophic interference in connectionist networks: The sequential learning problem , author=. Psychology of Learning and Motivation , volume=. 1989 , publisher=

1989
[49]

The Tenth International Conference on Learning Representations , year =

Dapeng Hu and Shipeng Yan and Qizhengqiu Lu and Lanqing Hong and Hailin Hu and Yifan Zhang and Zhenguo Li and Xinchao Wang and Jiashi Feng , title =. The Tenth International Conference on Learning Representations , year =
[50]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Scale: Online self-supervised lifelong learning without prior knowledge , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[51]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Wanderlust: Online continual object detection in the real world , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[52]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[53]

Advances in neural information processing systems , volume=

Improved deep metric learning with multi-class n-pair loss objective , author=. Advances in neural information processing systems , volume=
[54]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Emerging properties in self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[55]

Proceedings of the European conference on computer vision (ECCV) , pages=

Group normalization , author=. Proceedings of the European conference on computer vision (ECCV) , pages=
[56]

31st British Machine Vision Conference , year =

Diganta Misra , title =. 31st British Machine Vision Conference , year =
[57]

Proceedings of the 32nd International Conference on Machine Learning , year =

Sergey Ioffe and Christian Szegedy , title =. Proceedings of the 32nd International Conference on Machine Learning , year =
[58]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages=

Stream-51: Streaming classification and novelty detection from videos , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , pages=
[59]

2019 International Conference on Robotics and Automation (ICRA) , pages=

Memory efficient experience replay for streaming learning , author=. 2019 International Conference on Robotics and Automation (ICRA) , pages=. 2019 , organization=

2019
[60]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops , pages=

Lifelong machine learning with deep streaming linear discriminant analysis , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops , pages=
[61]

arXiv preprint arXiv:2110.10741 , year=

Class incremental online streaming learning , author=. arXiv preprint arXiv:2110.10741 , year=

work page arXiv
[62]

European conference on computer vision , pages=

Remind your neural network to prevent catastrophic forgetting , author=. European conference on computer vision , pages=. 2020 , organization=

2020
[63]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Supervised contrastive replay: Revisiting the nearest class mean classifier in online class-incremental continual learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[64]

International conference on machine learning , pages=

Online continual learning through mutual information maximization , author=. International conference on machine learning , pages=. 2022 , organization=

2022
[65]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Online prototype learning for online continual learning , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[66]

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , volume =

Ren, Shaoqing and He, Kaiming and Girshick, Ross and Sun, Jian , booktitle =. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , volume =
[67]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Label-efficient online continual object detection in streaming video , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[68]

Kingma and Jimmy Ba , title =

Diederik P. Kingma and Jimmy Ba , title =. 3rd International Conference on Learning Representations,
[69]

Proceedings of the National Academy of Sciences , volume=

Neural event segmentation of continuous experience in human infants , author=. Proceedings of the National Academy of Sciences , volume=. 2022 , publisher=

2022
[70]

Journal of Neuroscience , volume=

Rapid memory reactivation at movie event boundaries promotes episodic encoding , author=. Journal of Neuroscience , volume=. 2019 , publisher=

2019
[71]

and Slaw, David , year =

Lassiter, G. and Slaw, David , year =. The Unitization and Memory of Events , volume =. Journal of Experimental Psychology: General , doi =
[72]

Measuring event segmentation: An investigation into the stability of event boundary agreement across groups , volume =

Sasmita, Karen and Swallow, Khena , year =. Measuring event segmentation: An investigation into the stability of event boundary agreement across groups , volume =. Behavior Research Methods , doi =
[73]

The Influence of Context Boundaries on Memory for the Sequential Order of Events , volume =

DuBrow, Sarah and Davachi, Lila , year =. The Influence of Context Boundaries on Memory for the Sequential Order of Events , volume =. Journal of Experimental Psychology: General , doi =
[74]

Ezzyat, Youssef and Davachi, Lila. What Constitutes an Episode in Episodic Memory? Psychological Science.
[75]

Discovering event structure in continuous narrative perception and memory. Neuron, 2017.
[76]

Forgetting Order of Continual Learning: Examples That are Learned First are Forgotten Last. arXiv preprint arXiv:2406.09935.
[77]

GCR: Gradient coreset based replay buffer selection for continual learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[78]

Entropy-based sample selection for online continual learning. 2020 28th European Signal Processing Conference (EUSIPCO), 2021.
[79]

Gradient based sample selection for online continual learning. Advances in Neural Information Processing Systems.
[80]

Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
