Making Sense of Touch from the Child’s View for Contrastive Learning

Max Whitton, Zecheng Wang^⋆, Puchen Liu^⋆, Quang Tuan Truong, Shengao Wang, Manaswi Yadamreddy,
Oktay Ozel, Visista Jayanti, Saniya Sekhon, Hanna Samuel Tadesse, Lawrence Miao, Junjie Wang,
Jiasen Lu, Chen Yu¹, and Boqing Gong
Project Homepage: https://max-whitton.github.io/Contrastive_Learning_From_Touch
Boston University, {maxwh,bgong}@bu.edu
¹University of Texas, Austin, chen.yu@austin.utexas.edu

Abstract

Is the sense of touch a mechanism for human babies’ learning of visual concepts? If so, can we quantify its importance, and to what extent do babies rely on their sense of touch for visual learning? To approach these questions in a principled way, we propose a structured coding system for baby-centric touch events, yielding a dataset of 264k two-second clips of touch events coded according to this system. Using this dataset, we pretrain developmentally grounded models that reveal promising insights into the nature of baby learning from touch.

I Introduction

Developmental psychology has a long history of linking motor skills and physical interactions with babies’ perceptual abilities, tracing back to the foundational theories of Piaget [10], Gibson [4], and Thelen & Smith [17]. For instance, developmental changes in motor skills, such as sitting and manual exploration, contribute directly to three-dimensional object completion [14]. Furthermore, self-generated variability in object images—achieved through holding, moving, or stacking objects—strongly predicts vocabulary growth [12]. These studies suggest that reentrant mappings [3, 2, 13], in which activity in the visual system is correlated in real time with that of the haptic system, provide qualitatively different glosses on the world that educate each other and facilitate robust conceptual learning.

We present a computational framework that investigates the role of touch in babies’ acquisition of visual concepts, asking whether, where, and to what extent touch signals improve visual learning.

We build the work upon recent computational efforts to model babies’ audiovisual intake [18, 19, 20]. Notably, the child’s view for contrastive learning (CVCL) [18] successfully demonstrated how a generic neural network can learn representations by mapping transcribed spoken language to corresponding raw, egocentric visual streams. While this audiovisual paradigm provides a strong foundation for modeling a baby’s language acquisition, by relying exclusively on seeing and hearing, such models omit the physical interactions (e.g., grasps, pokes, and manipulations) that babies actively use to disambiguate, reinforce, and educate their visual systems.

Ideally, our framework will mirror a baby’s sensory experience as closely as possible, recovering as much haptic input as we can from their real life, without adding any extra data that a baby would not have access to. However, it is prohibitively difficult to capture a baby’s high-fidelity tactile sensations at scale directly. We approach this problem in a principled way, without introducing intrusive physical sensors, by leveraging a structured coding system for babies’ touch events. We apply this taxonomy of touch to longitudinal, egocentric video recordings from three infants’ perspectives, yielding a novel dataset of 264k carefully coded video clips of touch events. This dataset effectively captures tactile experience by discretizing visual evidence of intentional physical interactions by or with the baby.

With this dataset, we extend CVCL by contrasting time-locked visual-speech and visual-touch pairs with time-misaligned ones. We then evaluate the learned representations on existing approximations of baby perception tests. Our findings indicate that touch is a highly effective learning signal, particularly when data from the audiovisual modalities is limited: the sense of touch enables the model to achieve superior linear probing and zero-shot performance on visual recognition tasks compared to baselines that learn solely from visual-speech pairs. These findings provide promising insights into the data-efficient nature of baby learning. Our data processing, training, and evaluation code, along with our full dataset and some of our pretrained model weights, can be found on our project homepage. Together, these serve as a computational sandbox to study related theories of development and learning.

II Coding Touch

We describe our approach to coding touch events from babies’ headcam videos. Figure 1 illustrates this pipeline.

II-A Overview and Design Considerations

II-A1 Data source

To align our data with what a real baby would have access to, we use SAYCam [15], a corpus of 478 hours of longitudinal, egocentric recordings from three infants, taken once every week from roughly 6 to 32 months of age. While similar data sources exist, such as BabyView [8], we choose SAYCam because of the availability of existing in-domain benchmarks, pretraining splits, and comparable models [18, 20]. In fact, we use the exact same $[\texttt{video frame, speech transcript}]$ training pairs as BabyVLM-V2 and supplement them with our $[\texttt{vision, touch}]$ learning events.

II-A2 Coding vs. physical sensors for touch

Section II-B explains the coding system for touch from babies’ headcam videos, that is, how we define and code touch events specific to babies. While collecting touch sensations using physical sensors initially seems elegant, we choose to code touch events manually because a) the scale of paired $[\texttt{vision, sensor}]$ data required for training a computational model would be prohibitively expensive to acquire using physical sensors, and b) we hypothesize that the mechanism behind babies’ touch learning is more discrete and categorical than the continuous measurements that physical sensors provide.

II-A3 Coding touch events at scale

To obtain a large-scale dataset of such events (Section II-C), we develop a pipeline that semi-automatically annotates video clips using our baby touch taxonomy, yielding a text caption for each touch event. Examples from the completed dataset are shown in Figure 3A.

II-A4 Mapping text descriptions of touch to cluster indices

While the text descriptions of touch events facilitate coding and discussion among raters, they are semantic rather than tactile. Training a computational model on generated text descriptions would give it more semantic supervision than a baby would receive. Hence, we map the “touch captions” to discrete cluster indices, with each cluster index corresponding to a distinct type of physical experience.

Altogether, our approach aims to construct the most accurate possible reconstruction of baby touch learning to reliably explore the role of touch in developmental learning.

Figure 1: Three-stage pipeline to code a large-scale dataset of baby-centric touch events.

II-B Human-coding of touch (Figure 1, stage 1)

To model learning from touch sensations, we need a structured touch taxonomy that is simultaneously consistent, such that the same touch event is generally mapped to a single code, and detailed, such that the set of codes contains sufficient granularity to describe the variations that a baby experiences between touch events. To help design such a taxonomy, we begin by recruiting a team of fifteen university participants. The importance of human annotators is twofold. First, we need reliable, intelligent annotators who are sensitive to edge cases to design a robust structure for the taxonomy in the first place. Second, we need a set of trusted human annotations as a validation check for the quality of the semi-automatically generated touch event detections, detailed later.

II-B1 Iterative design process

We met with our raters in person weekly to design a consistent, detailed partition of a baby’s touch experience. They first coded a small set of video frames using a basic handcrafted taxonomy. From their results, we computed frequency and inter-rater agreement statistics for each category. Based on common points of ambiguity or lack of detail, we updated the taxonomy by redefining, merging, or splitting unclear categories. For example, the categories “holding” and “manipulating” had high inter-rater disagreement, so we merged them into “holding/manipulating”, trading expressiveness for consistency. Conversely, “lightly touching” initially represented too many different kinds of touch sensations, so we added a new label, “poking”, to increase the early taxonomy’s expressiveness. We repeated this process three times on fresh sets of video clips until we achieved satisfactory label distributions (shown in Figure 2) and inter-rater agreement, and then moved on to a scaled labeling effort.

II-B2 Inclusion vs. Exclusion

We define an infant touch event as a discrete, intentional action done by or to a baby. Static touch sensations (e.g., the sensation of a chair on one’s butt) and observational touch (e.g., the baby watching another person perform an action) are not classified as touch events in our framework. However, if an outside agent (usually a parent) intentionally touches the baby’s body, such as with a rub or pat, we consider it a learnable touch event. Whenever the person performing or receiving an action, or the action itself, is not immediately obvious, we exclude the clip from our dataset.

II-B3 Taxonomy structure

From the results of our iterative design process, we find that we can effectively parameterize a touch event by its verb, noun, body part, and additional conditions. For example, a touch event could be classified as [manipulating, spoon, hand, eating setting]. Note that noun and body part are meaningfully distinct: the noun is what the subject is touching, and the body part is what they are touching it with. Each of verb, noun, and body part can come either from a predefined list known to be an effective partition of the range of baby touch or, for special cases, from an open-ended text input. A touch event can also have multiple annotations for a single field to smooth the decision boundary between similar categories. We provide our 9 predefined verbs, 10 frequently occurring nouns, 5 body parts, and 5 additional conditions in Appendix V-A. Previously, Lederman and Klatzky [6] parameterized touch events by the attributes of the object being touched, such as shape, size, and weight, and they found that a given child relies on only one attribute to classify objects. Our touch taxonomy aligns with their finding in that the object name subsumes these attributes. Moreover, we argue that a touch event’s verb, body part, and other sensory conditions are important for describing the whole touch experience. Altogether, we believe that because of our iterative and data-guided design approach, our taxonomy achieves satisfactory specificity and clarity for describing a baby’s touch.
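
As an illustration, the four-field parameterization can be written as a small record type. This is a minimal sketch rather than the annotation tooling used in this work: the class name `TouchEvent`, its field names, and the `caption()` helper are hypothetical, and the first example reuses the bracketed event from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TouchEvent:
    """One coded touch event (hypothetical container, for illustration only).

    Each field is a list because a single field may carry multiple annotations.
    """
    verbs: List[str]
    nouns: List[str]
    body_parts: List[str]
    conditions: List[str] = field(default_factory=list)

    def caption(self) -> str:
        """Render the event in the bracketed form used in the text."""
        fields = [self.verbs, self.nouns, self.body_parts, self.conditions]
        # Multiple labels within one field are joined with " + ";
        # empty fields are skipped.
        return "[" + ", ".join(" + ".join(f) for f in fields if f) + "]"


event = TouchEvent(["manipulating"], ["spoon"], ["hand"], ["eating setting"])
print(event.caption())  # [manipulating, spoon, hand, eating setting]

# A borderline event that carries two verb annotations and no extra condition.
ambiguous = TouchEvent(["grasping", "holding/manipulating"], ["toy"], ["hand"])
print(ambiguous.caption())  # [grasping + holding/manipulating, toy, hand]
```

Storing each field as a list is one simple way to keep multiple annotations per field without forcing an early choice between similar categories.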

Human-coded set: After finalizing the taxonomy, we obtain a set of 2,000 video clips from our final design round, each annotated by three human annotators. Of these, 858 contain detected touch events, while the remaining 1,142 meet our exclusion criteria. For 79% of the video clips, the three annotators unanimously agree to either include or exclude the clip (for any reason); most of the remaining 21% may contain a borderline touch event that one or two annotators voted to exclude because of the inherently subjective nature of the “Unclear” exclusion criterion. For events detected by the majority of annotators, there is nearly always a consensus on the categorization: 97% of the time for verbs and 100% of the time for body parts. We construct a human-annotated set via majority voting among the three annotators for each clip. As we are satisfied with the expertise of our trained annotators, we treat this set as the “ground truth” against which we validate generated labels at scale.
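
The majority-voting and unanimity statistics described above can be sketched as follows. This is an illustrative example under stated assumptions: the function names and the four toy clips are hypothetical, and each clip is reduced to a simple include/exclude decision per annotator.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def unanimous_rate(clips):
    """Fraction of clips on which every annotator gave the same decision."""
    return sum(len(set(votes)) == 1 for votes in clips) / len(clips)


# Hypothetical include/exclude decisions from three annotators per clip.
clips = [
    ("include", "include", "include"),
    ("exclude", "exclude", "exclude"),
    ("include", "exclude", "include"),
    ("exclude", "exclude", "include"),
]
consensus = [majority_vote(votes) for votes in clips]
print(consensus)              # ['include', 'exclude', 'include', 'exclude']
print(unanimous_rate(clips))  # 0.5
```

With three annotators and a binary decision a strict majority always exists; the `None` branch only matters for multi-way fields such as the verb.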

II-C Automated coding at scale (Figure 1, stage 2)

The iterations with human raters allowed us to gain sufficient understanding of the touch events and to write clear coding instructions that engaged more raters with no background in our work. To scale up this effort, however, we resort to Gemini [16] coding with humans in the loop.

II-C1 Gemini coding

Using the high-quality human-coded set for validation, we iterate on the prompt for Gemini until we find one that generates touch labels with high performance; we include the final prompt in Appendix V-B. With our curated prompt and three evenly sampled frames per video clip, Gemini-3-pro [16] detects touch events with 77% accuracy, 68% precision, and 87% recall relative to the human majority votes. Further, Gemini-3-pro predicts the human-coded verb, noun, and body part with 62%, 82%, and 93% accuracy, respectively.
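
For concreteness, the detection metrics can be computed as in the sketch below, where the human majority vote plays the role of the reference. The function name and the toy predictions are hypothetical. The final lines are a rough consistency check on the reported figures, assuming that the 858 touch clips out of 2,000 are the human-majority positives; that assumption is ours and is not stated explicitly in the text.

```python
def detection_metrics(predicted, reference):
    """Accuracy, precision, and recall of binary touch detections."""
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)          # both say "touch"
    fp = sum(p and not r for p, r in pairs)      # model says touch, humans do not
    fn = sum(r and not p for p, r in pairs)      # humans say touch, model does not
    tn = sum(not p and not r for p, r in pairs)  # both say "no touch"
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical model detections vs. human majority votes for six clips.
predicted = [True, True, True, False, False, True]
reference = [True, True, False, False, True, False]
metrics = detection_metrics(predicted, reference)
print(metrics["accuracy"], metrics["precision"])  # 0.5 0.5

# Consistency check: reconstruct approximate confusion counts from the
# reported recall and precision, then recompute the implied accuracy.
positives, total = 858, 2000
recall, precision = 0.87, 0.68
tp = recall * positives
fp = tp / precision - tp
tn = (total - positives) - fp
implied_accuracy = (tp + tn) / total
print(round(implied_accuracy, 2))  # 0.77
```

Under that assumption, the reported accuracy, precision, and recall are mutually consistent.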

II-C2 Filtering

To improve the precision of our generated labels, we add a quality-control step that removes low-quality $[\texttt{frame, touch caption}]$ pairs from our training dataset. Specifically, we remove all pairs whose pretrained CLIP [11] similarity score is below 0.2 or, because touch is inherently motion-dependent, whose pretrained BLIP [7] (a video-based caption similarity model) score is below 0.04. We choose the BLIP threshold so that it removes approximately the same fraction of frames (about 10%) as the CLIP threshold.
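
The filtering rule can be sketched as a simple predicate over precomputed scores. The two thresholds come from the text; everything else (the function name, the record layout, and the toy scores) is hypothetical, and in practice the scores would come from the pretrained CLIP and BLIP models rather than being written by hand.

```python
CLIP_MIN = 0.2   # CLIP similarity threshold from the text
BLIP_MIN = 0.04  # BLIP score threshold from the text


def keep_pair(clip_score, blip_score):
    """Keep a pair only if neither score falls below its threshold."""
    return clip_score >= CLIP_MIN and blip_score >= BLIP_MIN


# Hypothetical, precomputed scores for four [frame, touch caption] pairs.
scored_pairs = [
    {"pair_id": "a", "clip": 0.31, "blip": 0.10},
    {"pair_id": "b", "clip": 0.15, "blip": 0.12},  # removed: CLIP score too low
    {"pair_id": "c", "clip": 0.27, "blip": 0.02},  # removed: BLIP score too low
    {"pair_id": "d", "clip": 0.20, "blip": 0.04},  # kept: exactly at both thresholds
]
kept = [p["pair_id"] for p in scored_pairs if keep_pair(p["clip"], p["blip"])]
print(kept)  # ['a', 'd']
```

A pair is dropped as soon as either score is below its threshold, which matches the "or" in the removal rule.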

Gemini-coded set: After classifying SAYCam at scale, processing the outputs into meaningful codes, and filtering those codes, we are left with 263,604 $[\texttt{image, touch}]$ training examples. Figure 2 displays the relative frequencies of each predefined category, with open-ended categories counted as “other”. Note that these percentages are calculated only over the valid touch events. For instance, 1.6% for Turning a Page should be interpreted as 1.6% of touch learning events taking the form of turning a page, not as babies spending 1.6% of their waking hours turning pages. Likewise, 53.6% for “Play Setting” means that 53.6% of touch events happen in a Play Setting, not that 53.6% of a baby’s total sensory input happens in a Play Setting. Our touch annotations disproportionately reflect specific conditions by nature, because touch learning in the wild occurs disproportionately under such conditions. We combine these examples with BabyVLM-V2’s 768k speech labels, yielding a dataset of 1.03M total captions over 936k unique frames. We distinguish three types of learning events: 96k [vision, speech, touch] triplets, 672k [vision, speech] pairs, and 168k [vision, touch] pairs. We show illustrative examples of each kind of event in Figure 3A.
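
The dataset sizes quoted above fit together as shown in this short arithmetic check. It uses the rounded figures from the text (in thousands), so the totals are approximate; the variable names are ours.

```python
# Rounded event counts from the text.
triplets = 96_000      # [vision, speech, touch]
speech_only = 672_000  # [vision, speech]
touch_only = 168_000   # [vision, touch]

unique_frames = triplets + speech_only + touch_only  # every event has one frame
speech_captions = triplets + speech_only             # frames with a transcript
touch_captions = triplets + touch_only               # frames with a touch code
total_captions = speech_captions + touch_captions    # triplets contribute two

print(unique_frames)                     # 936000
print(speech_captions)                   # 768000
print(touch_captions)                    # 264000
print(f"{total_captions / 1e6:.2f}M")    # 1.03M
```

The 264k touch captions obtained this way agree with the 263,604 filtered training examples after rounding.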

II-D Mapping touch captions to IDs (Figure 1, stage 3)

We map the resultant “touch captions” (e.g., [manipulating, spoon, hand, eating setting]) to discrete cluster indices by running a vanilla K-means algorithm [9] over the embeddings of the touch captions extracted from CLIP’s text encoder [11]. Importantly, we will feed the cluster IDs rather than the “touch captions” to our computational models.

The need for clustering: We want to avoid providing additional supervision to our model that a baby wouldn’t have access to in their daily life. By converting the touch signals from a form like “grasping a spoon” to “experiencing a touch sensation of type 4”, we move closer to how a real baby might process touch. Note that the cluster formation still accounts for each category of touch and includes “fill in the blank” labels in K-means.

Moreover, for proper experimentation on the role of touch, we need to move the touch signals out of the language space to simulate the property that the spoken word “book” is not the same as the touch sensation of “book”. Specifically, when the model sees a training example with a speech transcription containing “book”, it should update its understanding of what the word “book” means; when it sees an example with a touch cluster associated with books, it should update its understanding of what that touch cluster means. If we did not include a clustering step in our experimental design and instead trained with the touch captions, these important properties would be lost, and we would be less certain about the role touch plays in the learning.
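
The caption-to-ID mapping can be illustrated with a small, dependency-free K-means sketch. Several things here are assumptions made for the example: the 2-D vectors are hand-written stand-ins for CLIP text embeddings, the captions are invented, and the farthest-point initialization is chosen only to make the toy run deterministic (it is not claimed to match the initialization used in this work).

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


# Hypothetical 2-D stand-ins for CLIP text embeddings of touch captions.
caption_embeddings = {
    "[manipulating, spoon, hand, eating setting]": [0.9, 0.1],
    "[grasping, fork, hand, eating setting]": [0.8, 0.2],
    "[turning a page, book, hand, book reading setting]": [0.1, 0.9],
    "[lightly touching, book, hand, book reading setting]": [0.2, 0.8],
}
captions = list(caption_embeddings)
cluster_ids = kmeans([caption_embeddings[c] for c in captions], k=2)
print(cluster_ids)  # [0, 0, 1, 1]
```

Only the integer IDs would be passed on to the model; the caption strings are used to form the clusters and are then discarded.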

III Computational Models

With the resultant dataset of [video frame, speech transcript, touch index] triplets (see examples in Figure 3A), we train computational models using a contrastive learning algorithm (see Figure 3B). We extend CVCL’s visual-textual model architecture to three encoders for video frames, speech transcripts, and touch cluster IDs (represented as one-hot vectors), respectively. We then use the dual contrastive loss of the video-audio-text transformer (VATT) [1] as the training objective, contrasting time-locked [frame, transcript] and [frame, touch] pairs against misaligned pairs. We will release our code for reproducibility.
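
To make the idea of contrasting time-locked pairs against misaligned ones concrete, the sketch below uses a generic InfoNCE-style loss summed over the two modality pairs. It is an illustration under assumptions and not the exact objective of [1]: the function names, the temperature value, the `touch_weight` parameter, and the toy embeddings are all hypothetical, and the inputs are taken to be already-encoded vectors for a batch in which every frame has both a transcript and a touch code.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy of matching anchor i to candidate i within a batch."""
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        # Scaled cosine similarity between this anchor and every candidate.
        logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def dual_contrastive_loss(frames, transcripts, touches, touch_weight=1.0):
    """Sum of a frame-transcript term and a frame-touch term."""
    return info_nce(frames, transcripts) + touch_weight * info_nce(frames, touches)


frames = [[1.0, 0.0], [0.0, 1.0]]
aligned = dual_contrastive_loss(
    frames, [[1.0, 0.1], [0.1, 1.0]], [[0.9, 0.0], [0.0, 0.9]]
)
misaligned = dual_contrastive_loss(
    frames, [[0.1, 1.0], [1.0, 0.1]], [[0.0, 0.9], [0.9, 0.0]]
)
print(aligned < misaligned)  # True
```

The loss is small when each frame is most similar to its own time-locked transcript and touch embedding, and large when the pairs are shuffled.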

We evaluate the resulting model’s visual perception on two existing baby learning benchmarks, Labeled-S [18] and Picture Vocabulary (PV) [20], both of which task a model with selecting, from four images, the one corresponding to a given category label. The four stimulus images per trial in Labeled-S are raw video frames randomly drawn from the same domain as our training dataset. PV is more challenging: the test stimuli are crops of SAYCam video frames, which makes them slightly outside the training domain, and the distractors are carefully selected to be similar to the ground-truth option, either semantically or visually. Samples from both benchmarks are shown in Figure 3C. For Labeled-S, we report both zero-shot and linear-probing results; the former selects the image with the highest cosine similarity to the given category label, computed over the embeddings of our models, and the latter adds a linear classification layer trained on the training set of the respective benchmark. For PV, we omit zero-shot accuracy because the input images are out of domain.
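
The zero-shot protocol amounts to a four-way forced choice by cosine similarity, which can be sketched as follows. The function names, the trial layout, and the 2-D toy embeddings are hypothetical; real trials would use embeddings produced by the pretrained encoders.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def choose_image(label_embedding, image_embeddings):
    """Index of the candidate image most similar to the category label."""
    scores = [cosine(label_embedding, e) for e in image_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


def zero_shot_accuracy(trials):
    correct = sum(choose_image(t["label"], t["images"]) == t["answer"] for t in trials)
    return correct / len(trials)


# Hypothetical trials: one label embedding, four image embeddings, and the
# index of the correct image.
trials = [
    {"label": [1.0, 0.0], "images": [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.1, 0.9]], "answer": 0},
    {"label": [0.0, 1.0], "images": [[1.0, 0.0], [0.2, 0.8], [0.6, 0.4], [-1.0, 0.0]], "answer": 1},
    {"label": [1.0, 1.0], "images": [[1.0, 0.9], [0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]], "answer": 1},
]
accuracy = zero_shot_accuracy(trials)
print(round(accuracy, 3))  # 0.667
```

With four candidates per trial, a model that guesses at random scores 25% in expectation, which is the chance level for both benchmarks.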

IV Results

We report results that address our overarching research question: Is touch a mechanism for perceptual learning? If it is, how much does our model gain from it relative to other senses? We also conduct a minimal ablation study by reporting the performance of randomly initialized weights.

TABLE I: Linear-probing accuracy on PV and Labeled-S

Model variant (accuracy, %)	PV	Labeled-S
Random weights (lower bound)	43	65
(vision, speech)	51	84
(vision, speech, touch)	57	85
Random guess	25	25

IV-A Is touch effective as a learning signal?

Table I shows the linear probing accuracy for three model variants. We consider:

• Random weights: a baseline with no pretraining, using the same visual encoder architecture as the other two model variants but with random weights,
• (vision, speech): a contrastive learning model pretrained on our dataset without touch codes, and
• (vision, speech, touch): a contrastive learning model pretrained on our dataset with both speech captions and touch signals discretized into 256 clusters.

We also report the chance rate of a random guess.

For both PV and Labeled-S, we see a promising performance boost from touch codes, as shown in Table I. Moreover, Figure 4B shows that for some data ratios, touch gives rise to a significant improvement on Labeled-S under the zero-shot setting. Since the touch learning events are not encoded in the language space, a zero-shot performance boost indicates that the touch pretraining truly shaped the space of vision embeddings, rather than just projecting touch clusters to visual input.

Moreover, pretrained models on both time-locked datasets significantly outperform the random-weight baseline, echoing Smith and Gasser’s proposal that “multiple overlapping and time-locked sensory systems enable the developing system to educate itself — without defined external tasks or teachers — just by perceiving and acting in the world” [13].

For a qualitative analysis, as shown in Figure 5, we generated attention maps for the (vision, speech, touch) and (vision, speech) models processing pretraining examples. In general, we observe that the model supervised with touch signals learned to favor visual information from regions surrounding hands and the objects they interact with more than the baseline model, qualitatively validating our setup as a computational model of human visual attention.

IV-B Where is touch (not) helpful?

We also evaluate the model variants on DevCV Toolbox [20], a computer vision suite developmentally aligned with the NIH Baby Toolbox [5]. We find that for very challenging tasks (Counting, WhoHasMore, Localization, Spatial Details) and very simple tasks (Left/Right and Memory), learning from the touch signals yields no meaningful increase over visual-speech data alone, suggesting that touch is most useful for the visual-word mapping tasks of Labeled-S and PV. Because of this finding, we focus on Labeled-S and PV for our remaining results; however, we briefly outline our setup and results for the other tasks in Appendix V-C. Altogether, our results suggest that touch is helpful in learning for certain types of visual reasoning, but not all.

IV-C What is an effective ratio of touch to speech supervision?

To what extent could touch be helpful in visual learning? Intuitively, the effectiveness of touch sensations would diminish as the visual-speech data increases in size and quality. Figures 4 and 6 speak to this intuition and show how evaluation accuracy scales with the quantity of speech data available during pretraining. As expected, the more speech data available during training, the higher the performance on both Labeled-S and PV, with and without touch events. However, another general trend is that for our in-domain task, Labeled-S, the less speech data is available during training, the greater the effect of touch data tends to be, and this holds especially true in the zero-shot setting. Simply put, touch is most important when the speech transcripts are limited, as they are for a baby with little life experience. Using this insight, we can think of babies as data-efficient learners, integrating as many senses as possible to inform their innate representation of the world [13].

IV-D How fine-grained should the touch clusters be to provide effective supervision?

IV-D1 Hypothesis

We can use our framework to draw insights into babies’ internal representations of touch sensations, specifically, how many distinct types of touch they distinguish. Intuitively, if we group the touch captions into fewer clusters, we end up treating dissimilar touch sensations as the same and learn a less discriminative representation. This is not necessarily negative, because the model can then learn rapidly and data-efficiently to arrive at simpler decision boundaries. If we define more clusters, we may treat similar touch sensations that vary by trivial factors as different, and the resulting representation can be highly sensitive, implying that a model needs larger-scale data to capture more detailed representations. Real-world visual tasks require drastically different levels of representation granularity. Hence, it is important for both a baby and a computational model to adaptively leverage touch sensations depending on the specific task and situation.

IV-D2 Labeled-S and PV benefit the most from different levels of touch granularity

Performance for different numbers of touch clusters is shown in Table II. Fortunately, our evaluation tasks are well suited to investigating our intuition about the number of clusters: PV requires detailed semantic understanding, as the samples are constructed from crops of semantically similar objects (e.g., sofa and bed), while Labeled-S is more general, like discriminating a video frame with a cat in it from a frame with a window in it. Interestingly, our results are consistent with our intuition that defining more clusters is important for capturing details, while fewer clusters make it easier to capture general semantics: the models trained with 64 and 256 touch clusters perform the best on PV (57% each), and the model trained with 16 clusters performs the best on Labeled-S.

TABLE II: Comparing different numbers K of touch clusters

K	PV	Labeled-S (probe)	Labeled-S (zero-shot)
16	54	86	51
64	57	85	50
256	57	85	49
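
The comparison in Table II can be read off programmatically, which also makes ties explicit. The accuracies below are copied from the table; the dictionary layout and the `best_k` helper are ours.

```python
# Accuracies (%) copied from Table II, keyed by the number of touch clusters K.
table_ii = {
    16: {"PV": 54, "Labeled-S (probe)": 86, "Labeled-S (zero-shot)": 51},
    64: {"PV": 57, "Labeled-S (probe)": 85, "Labeled-S (zero-shot)": 50},
    256: {"PV": 57, "Labeled-S (probe)": 85, "Labeled-S (zero-shot)": 49},
}


def best_k(table, benchmark):
    """All values of K that reach the top accuracy on a benchmark."""
    top = max(row[benchmark] for row in table.values())
    return [k for k, row in table.items() if row[benchmark] == top]


print(best_k(table_ii, "PV"))                     # [64, 256]
print(best_k(table_ii, "Labeled-S (probe)"))      # [16]
print(best_k(table_ii, "Labeled-S (zero-shot)"))  # [16]
```

On PV, K = 64 and K = 256 tie at 57%, while K = 16 is best on both Labeled-S settings.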

V Conclusions

We provide a computational framework for modeling touch behavior in babies and demonstrate its effectiveness through structured experiments. Our results quantitatively reveal the effect of touch on babies’ acquisition of visual concepts. Still, the exact mechanisms that drive it remain open questions: Is there anything inherently important about the experience of touching, or is it just a means of enriching the visual learning stream? How interrelated is the supervision between time-locked touch events and speech utterances? At what developmental stage do babies start using touch to educate their visual system?

Moreover, we leave two intriguing hypotheses for future work. One is that the touching moments contain high-quality visual information (in the whole view) for learning, and the other is that the model learns to use and weigh visual information around a touched area for object learning, since the touched object is likely to be labeled in speech.

Our framework has the potential to serve as an open platform to advance understanding of such mechanisms and accelerate interdisciplinary research between developmental psychology and computer science.

Appendix

V-A Descriptions of touch categories

Here we describe how we partition the range of baby touch into predefined verbs in a way that is both specific and consistent.

Verbs (9)

• Lightly Touching: The most common verb; feeling an object or surface to explore its texture, temperature, or some other property.
• Being touched: Visibly being actively touched by another person.
• Holding/manipulating: Lifting, pushing, or moving an object like a fork or a ball.
• Grasping: Similar to holding/manipulating, but the object is stationary. Also similar to lightly touching, but the hand is wrapped around the object or pushing on it rather than merely touching it. Common for railings, tabletops, etc.
• Poking: A subset of lightly touching that occurs very frequently in the baby domain.
• Turning a page: A subset of holding/manipulating that also shows up frequently enough to warrant its own category.
• Grabbing: The action of initiating a period of holding/manipulating an object.
• Unknown
• Other: All of the “fill in the blank” verbs go here.

Nouns (10)

• Book
• Toy
• Floor
• Your own body
• Bowl
• Spoon
• Plate
• Fork
• Unknown
• Other (fill in the blank)

Body Parts (5)

• Hand
• Foot
• (being touched on an) Ambiguous body part
• (being touched on the) Face
• Other (fill in the blank)

Additional Conditions (5)

• Book Reading Setting
• Eating Setting
• Play Setting
• Actively taking a bite
• Other (fill in the blank)

V-B Prompt used to query touch captions

Prompt: You are a specialized video annotation assistant. You will be provided with 3 frames from a video clip.

Your task: output ONE valid JSON object that follows the schema below.

Think step-by-step internally to decide the correct labels, but DO NOT output your reasoning. Output ONLY the final JSON.

When deciding, consider these aspects internally (not necessarily in any fixed order):

• Perspective: is this clearly the egocentric point-of-view of the INFANT (camera mounted on infant)?
• Infant evidence: are INFANT body parts visible (hands/feet/legs/torso edge)? Do NOT treat adult-only body parts as infant evidence.
• Contact evidence: is there visible physical contact between an INFANT body part and some object/surface/body? If you cannot confidently confirm infant contact, choose Unclear or No touch as appropriate.
• Third-person infant: if you see the infant/toddler in third-person view (the "Blond Toddler Rule"), label Not egocentric.
• Ambiguity: if frames are blurry/dark/occluded and you cannot confidently label the infant’s interaction, you MUST select Unclear.

CRITICAL RULES (must follow):

• You are annotating TOUCH sensations/actions of the EGOCENTRIC INFANT ONLY.
  - NEVER annotate the caregiver’s touch/actions.
  - If the caregiver touches an object but the baby does not, do NOT mark Touch for the baby.
• The "Blond Toddler Rule" overrides everything:
  - If you see a small infant/toddler in third-person view, set mutually_exclusive = "Not egocentric".
• Egocentric requirement:
  - Only label "Touch (default)" or "No touch BY THE BABY" if the camera is clearly infant egocentric POV.
  - If egocentric POV is not clearly confirmed, prefer "Not egocentric" (if third-person infant visible) else "Unclear".
• Use ONLY what is visible in the provided frames. No guessing beyond evidence.

OUTPUT FORMAT: Return ONLY a valid JSON object. No markdown. No extra text. Use exactly this structure and keys:

{
  "mutually_exclusive": "Touch (default) | No touch BY THE BABY | Unclear | Not egocentric",
  "checkbox_q1": ["Book Reading Setting", "Eating Setting", "Play Setting", "ACTIVELY taking a bite"],
  "text_q1": "Other sensory decription",
  "checkbox_q2": ["Lightly Touching (default)", "Being touched", "Holding/manipulating", "Grasping", "Poking", "Turning a page", "Grabbing", "Unknown"],
  "text_q2": "Enter another action not listed",
  "checkbox_q3": ["Your own body", "Book", "Floor", "fork", "spoon", "bowl", "plate", "toy", "unknown"],
  "text_q3": "Another custom object category",
  "checkbox_q4": ["hand (default)", "Being touched on an ambiguous body part", "Being touched on the face", "Foot"],
  "text_q4": "Fill in the blank with some other body part"
}

DATA TO ANALYZE:

• Focus exclusively on the infant’s body parts and what THEY touch.
• Ignore adult hands/movement unless the baby is touching them (adult can be the noun only if baby touches).
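
A response that follows the output format in the prompt could be post-processed roughly as follows. This is a sketch under assumptions, not the pipeline used in this work: the function name `parse_response` and the output field names are hypothetical, and the mapping of `checkbox_q1` through `checkbox_q4` to conditions, verbs, nouns, and body parts is inferred from the option lists in the schema.

```python
import json

# Schema keys from the prompt; the field names on the left are hypothetical.
FIELDS = {
    "conditions": ("checkbox_q1", "text_q1"),
    "verbs": ("checkbox_q2", "text_q2"),
    "nouns": ("checkbox_q3", "text_q3"),
    "body_parts": ("checkbox_q4", "text_q4"),
}


def parse_response(raw):
    """Return a touch-caption dict, or None when the clip should be skipped."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed output
    if not isinstance(data, dict) or data.get("mutually_exclusive") != "Touch (default)":
        return None  # no touch, unclear, or not egocentric
    caption = {}
    for name, (checkbox_key, text_key) in FIELDS.items():
        labels = list(data.get(checkbox_key) or [])
        free_text = str(data.get(text_key) or "").strip()
        if free_text:
            labels.append(free_text)  # open-ended "fill in the blank" entry
        caption[name] = labels
    return caption


raw = json.dumps({
    "mutually_exclusive": "Touch (default)",
    "checkbox_q1": ["Eating Setting"], "text_q1": "",
    "checkbox_q2": ["Holding/manipulating"], "text_q2": "",
    "checkbox_q3": ["spoon"], "text_q3": "",
    "checkbox_q4": ["hand (default)"], "text_q4": "",
})
parsed = parse_response(raw)
print(parsed["verbs"], parsed["nouns"])  # ['Holding/manipulating'] ['spoon']
print(parse_response('{"mutually_exclusive": "Unclear"}'))  # None
print(parse_response("not json"))  # None
```

Returning `None` for anything other than a confirmed touch mirrors the exclusion logic of the prompt, so downstream code only ever sees clips labeled as touch events.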

V-C Evaluation

We model our evaluation pipeline after [18], measuring linear probe accuracy, and zero shot accuracy where applicable. For reproducibility, we now provide brief descriptions of the formats and implementations for each evaluation task.
Picture Vocabulary and Labeled-S: In both of these tasks, the model receives as input four images and a target object name, and is measured on its ability to recognize which image contains the named object. In Picture Vocabulary, all four images are cropped about object bounding boxes, and the incorrect answers are selected to be semantically similar to the target. In Labeled-S, the input images are uncropped SAYCam frames, but the objects found in the input images are unrelated. For zero shot accuracy, we encode each of the images with the pretrained Vision Transformer (ViT), and encode the target word with its pretrained audio embedding. We recognize the image embedding with the highest cosine similarity to the word embedding as the model’s answer. For the linear probe, we learn a single linear projection layer on top of the ViT’s embeddings and a similar layer on top of the noun embeddings, which gives a reasonable performance boost. For Labeled-S, we predefine an 80/10/10 train/test/val split for all linear probing experiments.
Localize: This task evaluates the model’s ability to detect which quadrant of a frame a named object appears in. For the linear probe setup, we train a 3-layer MLP probe to output the quadrant the named object is most likely to be in, given the pretrained embeddings of the input frame and the target object’s name. We exclude the examples with target objects out of our vocabulary (250 examples of the 2100). Localization is a difficult task to learn past random performance- in our implementation, a randomly initialized ViT and the (vision, speech) baseline model both score 35% while the model trained on touch scores 38%, but we omit these results from our interpretation of touch as we consider all of the performance too low to constitute a meaningful learning result.
Counting and Who Has More: Counting evaluates the ViT’s ability to determine the quantity of a named object that occurs in a synthetic input image. We evaluated the counting accuracy of a 7 layer residual MLP probe on the target noun and image embeddings of our pretrained models. Our baseline model and touch model both score 35%; although this is notably above a lower bound (27%) and below an upper bound (59%), we do not include it in our experiments as it is not clear that touch learning had a measurable impact. Based on this finding, we focus on Picture Vocabulary and Labeled-S, leaving evaluation of touch datasets on both Count and Who Has More to future work.
Memory, Left/Right, Spatial Details and Visual Delayed Response: Because the DevCV Toolbox was designed to evaluate vision-language Q&A models, three of its tasks (Memory, Left/Right, and Spatial Details) are too simple, given the ViT’s architecture, to provide a meaningful signal on the quality of a ViT. We leave the evaluation of touch models on Visual Delayed Response, the only video-based task in the DevCV Toolbox, for future work, as we pretrained with single frames only.

ACKNOWLEDGMENT

Special thanks to Jessica Sullivan and Arjun Chandra for their support and feedback throughout the project. Additionally, thanks to Aryan Sharma, Batyrkhan Baimukhanov, Raman Deep Shiva Murthy, and Yu Sheng for their help annotating touch events. The project is supported in part by NSF 2540851 and a Sony Faculty Innovation Award.

References

[1] H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and B. Gong (2021) VATT: transformers for multimodal self-supervised learning from raw video, audio and text. In Advances in Neural Information Processing Systems (NeurIPS).
[2] G. M. Edelman and J. A. Gally (2013) Reentry: a key mechanism for integration of brain function. Frontiers in Integrative Neuroscience 7, pp. 63.
[3] L. Gershkoff-Stowe and D. H. Rakison (2005) Building object categories in developmental time. Psychology Press.
[4] J. J. Gibson (2014) The ecological approach to visual perception: classic edition. Psychology Press.
[5] Y. C. Han, E. M. Dworak, M. Mansolf, H. Adam, L. Yao, M. A. Novack, S. Pila, R. M. Flynn, A. M. Flagg, V. Ustsinovich, et al. (2025) NIH baby toolbox® methodology and norms development. Infant Behavior and Development 80, pp. 102117.
[6] S. J. Lederman and R. L. Klatzky (1987) Hand movements: A window into haptic object recognition. Cognitive Psychology 19 (3), pp. 342–368. ISSN 0010-0285.
[7] J. Li, D. Li, C. Xiong, and S. Hoi (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
[8] B. Long, R. Z. Sparks, V. Xiang, S. Stojanov, Z. Yin, G. E. Keene, A. W. M. Tan, S. Y. Feng, C. Zhuang, V. A. Marchman, D. L. K. Yamins, and M. C. Frank (2025) The BabyView dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences. arXiv:2406.10447 [cs]. Published in the Proceedings of the 8th Annual Conference on Cognitive Computational Neuroscience.
[9] J. B. MacQueen (1967) Some methods for classification and analysis of multivariate observations. In Proc. of 5th Berkeley Symposium on Math. Stat. and Prob., pp. 281–297.
[10] J. Piaget and M. T. Cook (1952) The origins of intelligence in children. WW Norton & Company.
[11] A. Radford, J. W. Kim, C. Hallacy, et al. (2021) Learning transferable visual models from natural language supervision. In ICML.
[12] L. K. Slone, L. B. Smith, and C. Yu (2019) Self-generated variability in object images predicts vocabulary growth. Developmental Science 22 (6), pp. e12816.
[13] L. B. Smith and M. Gasser (2005) The development of embodied cognition: six lessons from babies. Artificial Life 11 (1-2), pp. 13–29.
[14] K. C. Soska, K. E. Adolph, and S. P. Johnson (2010) Systems in development: motor skill acquisition facilitates three-dimensional object completion. Developmental Psychology 46 (1), pp. 129.
[15] J. Sullivan, M. Mei, A. Perfors, E. Wojcik, and M. C. Frank (2021) SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded From the Infant’s Perspective. Open Mind 5, pp. 20–29. ISSN 2470-2986.
[16] G. D. Team (2023) Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
[17] E. Thelen and L. B. Smith (1994) A dynamic systems approach to the development of cognition and action. MIT Press.
[18] W. K. Vong, W. Wang, A. E. Orhan, and B. M. Lake (2024) Grounded language acquisition through the eyes and ears of a single child. Science 383 (6682), pp. 504–511. ISSN 1095-9203.
[19] S. Wang, A. Chandra, A. Liu, V. Saligrama, and B. Gong (2025) BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning. arXiv:2504.09426 [cs]. Accepted to ICCV 2025.
[20] S. Wang, W. Wang, Z. Wang, M. Whitton, M. Wakeham, A. Chandra, J. Huang, P. Zhu, H. Chen, D. Li, J. Li, S. Li, A. Zagula, A. Zhao, A. Zhu, S. Nakamura, Y. Yamamoto, J. J. Yokono, A. Mueller, B. A. Plummer, K. Saenko, V. Saligrama, and B. Gong (2025) BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models. arXiv:2512.10932 [cs].