Multimodality as Supervision:
Self-Supervised Specialization to the Test Environment via Multimodality

Kunal Pratap Singh*, Ali Garjani*, Rishubh Singh, Muhammad Uzair Khattak, Efe Tarhan, Jason Toskov, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir

* Equal Contribution

ICLR 2026

Multimodality as Supervision. The sensed data in a deployment environment is often multimodal: besides RGB images, it can contain various modalities such as depth, motion sensing, surface normals, and tactile feedback. This enables cross-modal learning, i.e., predicting the response of one sensor from another, as a self-supervised method for learning a representation. We leverage this concept to learn a rich joint representation for the test space in a self-supervised manner, with minimal reliance on external data. The standard approach to training and deploying vision models in a desired test space is generalist pre-training, which uses external data, such as images from the Internet or from spaces similar to the test one. As an alternative, we study multimodal Test-Space Training, which performs self-supervised pre-training on unlabeled multimodal data collected in the test space. This leads to a model specialized to that space, achieving state-of-the-art performance using only data from the test space. We evaluate this approach on several downstream tasks (semantic segmentation, here) and show that test-space specialists compare favorably against strong generalist pre-training baselines, including those trained on large-scale Internet-based datasets or on many other external spaces.

Abstract

Many practical applications, e.g., deploying a household robot, involve devices equipped with a rich set of sensors that enable multimodal sensing in their test environment. The multimodal data collected from these sensors can be used to perform cross-modal learning, thereby using multimodality as a source of self-supervision. To study this, we propose a setup where we restrict a user device to a given test environment. This allows us to study the ability of multimodal pre-training to learn a specific distribution, agnostic to its generalization abilities elsewhere. Under this setup, we develop a framework called Test-Space Training (TST), which performs multimodal data collection in the test environment and self-supervised pre-training on it. We evaluate these models on various downstream tasks in the same environment. TST reveals various interesting insights: for instance, by collecting rich multimodal data only from the test environment and leveraging cross-modal learning, we can achieve results competitive with generalist models (e.g., DINOv2 and CLIP) pre-trained on large-scale Internet datasets. We also present a set of analyses and ablations that raise intriguing points on substituting data with (multi)modality, enabling an alternative scenario in which the need for external Internet-scale pre-training datasets is reduced, and on how varying the pre-training data enables a trade-off between a model's ability to specialize to a test environment and to generalize to held-out spaces.


Test-Space Training

Learn how our framework builds specialist models for the test space.

TST vs. Internet

How does test-space training compare to today's internet-based vision backbones?

Multimodality in TST

Scaling data vs modalities, and the role of individual modalities.

Measuring Specialization in TST

Cross-house evaluations to measure test-space specialization.

TST with no external access

Exploring the limits of test-space training with only sensory data.

Introduction


Figure 1. Various platforms like a household robot, or simply a user's mobile phone are often equipped with a rich set of sensors, which can provide multimodal data beyond RGB images, like depth maps and surface normals.


Figure 2. Cross-Modal Learning. Predicting the response of one sensor from another has been shown to be a strong signal for learning representations of the world. Each sensor provides a different representation of the same underlying physical world.

Many user devices, e.g., a household robot, augmented reality glasses, or an iPhone, are equipped with a rich array of sensors that enable multimodal data collection. They can acquire modalities beyond RGB images, such as depth, motion sensing, and haptic feedback. Such data enables cross-modal learning, i.e., predicting the response of one sensor from another, thereby leveraging multimodality as a signal for self-supervised learning (see Figure 2). Evidence from developmental psychology also suggests that biological organisms employ multimodality to bootstrap better representations of their environment.
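As a toy illustration of this idea (not the paper's model), the following sketch fits a linear cross-modal predictor on synthetic paired data; the paired observations themselves provide the supervision, so no labels are needed:

```python
import numpy as np

# Toy cross-modal learning: predict one sensor's response (a "depth" feature
# vector) from another's (an "RGB" feature vector). The alignment between the
# two modalities is the only supervision signal used.
rng = np.random.default_rng(0)

# Simulate 256 aligned RGB/depth pairs related by an unknown linear map.
W_true = rng.normal(size=(16, 8))
rgb = rng.normal(size=(256, 16))                         # "RGB" features
depth = rgb @ W_true + 0.01 * rng.normal(size=(256, 8))  # paired "depth"

# Fit a linear cross-modal predictor by gradient descent on the MSE.
W = np.zeros((16, 8))
for _ in range(500):
    err = rgb @ W - depth                    # prediction error on depth
    W -= 0.05 * rgb.T @ err / len(rgb)       # gradient step on the MSE

initial_loss = np.mean(depth ** 2)           # loss of the all-zero predictor
final_loss = np.mean((rgb @ W - depth) ** 2)
print(final_loss < 0.01 * initial_loss)      # cross-modal prediction learned
```

The same principle scales up when the linear map is replaced by a deep network and the feature vectors by full sensor readings.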

Figure 3. Many practical applications require vision models to operate within a certain test space, for instance a household robot in a user's home. In such scenarios, we primarily care about the performance of the model in the test space, regardless of its generalization performance elsewhere.


In this work, we are interested in studying the potential of multimodality as a source of self-supervision in pre-training vision models. To do so in a controlled manner, we construct a sandbox setup wherein we reduce the operating space of the user device to a specific physical space, or as we refer to it, the test space. This implies that self-supervised pre-training of visual representations and downstream evaluations both happen in the same user space.

Figure 4. In this work, as opposed to the de-facto approach of building models with large-scale internet-based data, which rely on their generalization abilities, we study whether we can build performant models for the test space by collecting modality-rich data in just the test space. This serves as a sandbox to study the potential of multimodality as a source of self-supervision in pre-training vision models, and to understand the trade-offs between specialization to the test space and generalization to other spaces.

This sandbox has several desirable properties. As opposed to the de facto setup of learning a generalist model, pre-trained on large and diverse pre-training data (often based on the internet), and relying on its generalization abilities to perform well in the test space, our setup allows us to study the ability of multimodal pre-training to learn a distribution agnostic to its generalization abilities. Additionally, it also represents practical evaluation scenarios such as household robotics, where user devices are expected to be highly performant in their own space, regardless of their generalization abilities elsewhere.

Figure 5. Generalist models (left) are pre-trained once, on large-scale internet-based datasets, and then deployed to various downstream spaces. Test-space training (right), contrary to generalists, pre-trains the models on unlabelled data from the test space itself. We show that test-space training results in specialist models that outperform generalist models when evaluated on the test space.

To learn representations under this setup, we propose Test-Space Training (Figure 5, right), which enables multimodal pre-training data collection in the same space that the device would be deployed in. We use this data to perform self-supervised pre-training via cross-modal learning, leading to TST-MM.

Method Overview


Figure 6. 1. First, we collect (multimodal) data from the test space. 2. We then use this data for self-supervised multimodal pre-training, resulting in a model specialized to the test space. 3. After pre-training, the model is fine-tuned on a small external transfer dataset to solve a desired downstream task, e.g. semantic segmentation. 4. This model is subsequently deployed and evaluated in the same test space where it was pre-trained, thereby realizing specialization to that space.

Our framework consists of the following steps:

1. Data Collection: We collect pre-training data from the test space, by placing a camera at various points to cover the space densely. In addition to RGB frames, we collect data from various sensors and modalities available in the test space, such as 3D data using a depth sensor. We can optionally process this data to create additional modalities using off-the-shelf pseudolabel networks. This results in an aligned multimodal dataset that we use for pre-training.

2. Self-Supervised Pre-training: We employ self-supervised learning to pre-train a task-agnostic, representation learning model on the multimodal data from the test space. We explore different self-supervised learning objectives: masked modelling (MAE, 4M) and contrastive learning (DINOv2). Among these, we found multimodal masked modelling to be the most effective for our setup. We refer to this multimodal variant as TST-MM.

3. Transfer Learning: We evaluate the effectiveness of the pre-trained model by performing transfer learning evaluations. For this stage, we assume access to a small external dataset with task-specific annotations, e.g., semantic segmentation masks or image captions. We add a task-specific head to the pre-trained model and finetune it on this external dataset. Importantly, we do not have access to any task-specific annotations from the test space itself. We benchmark against various off-the-shelf vision models by finetuning them on the same transfer data.

4. Deployment & Evaluation: The fine-tuned model is deployed in the same test space where it was pre-trained. We benchmark against various internet pre-trained generalists and task-specific models. We evaluate on three vision-based tasks, namely semantic segmentation, object detection, and image captioning. We show that TST-MM outperforms all baselines, including those trained on large-scale internet datasets or on many spaces similar to the test space.
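The masking mechanics behind step 2 can be sketched as follows (a simplified illustration, not the paper's implementation: modality names, token counts, and the mask ratio are illustrative, and the transformer encoder/decoder of MAE/4M-style models is omitted):

```python
import numpy as np

# Minimal sketch of a multimodal masked-modelling objective: patch tokens from
# several aligned modalities are pooled, a large fraction is masked out, and
# the model is trained to reconstruct only the masked tokens.
rng = np.random.default_rng(0)

modalities = {"rgb": rng.normal(size=(64, 32)),      # 64 patch tokens, dim 32
              "depth": rng.normal(size=(64, 32)),
              "normals": rng.normal(size=(64, 32))}

mask_ratio = 0.75
tokens = np.concatenate(list(modalities.values()))   # 192 tokens across modalities
n_mask = int(mask_ratio * len(tokens))
masked_idx = rng.choice(len(tokens), size=n_mask, replace=False)

visible = np.delete(tokens, masked_idx, axis=0)      # encoder input
targets = tokens[masked_idx]                         # reconstruction targets

# A real model would encode `visible` and predict `targets`; the loss is
# computed only on the masked tokens, e.g. mean squared error:
predictions = np.zeros_like(targets)                 # placeholder predictions
loss = np.mean((predictions - targets) ** 2)

print(visible.shape, targets.shape)                  # (48, 32) (144, 32)
```

Because the mask is sampled jointly across modalities, reconstructing a masked depth token from visible RGB tokens (or vice versa) forces the encoder to learn cross-modal structure.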

Test-Space Training vs. Internet-based methods

We compare Test-Space Training, specifically our TST-MM model, against various internet-based methods. We compare against generalist models trained on large-scale datasets, such as CLIP, DINOv2, and 4M. Additionally, we also compare against task-specific models, such as Mask2Former for semantic segmentation and ViTDet for object detection. Please refer to the paper for more results on other datasets and tasks.

I. Semantic Segmentation

We compare against generalist models like CLIP, DINOv2, 4M trained on large-scale internet datasets, and show that TST-MM outperforms them on various downstream tasks. This suggests that we can outperform large-scale generalist models trained on millions of images from the Internet by using data from the test space. We present qualitative video evaluations on semantic segmentation, on Scannet++ and Replica datasets.

RGB Input

Ground Truth

Test-Space Training


We also present a comparison against Mask2Former, an off-the-shelf task specialist model for semantic segmentation. Note how TST-MM predictions are notably more consistent across various viewpoints.

RGB Input

Ground Truth

Test-Space Training

Mask2Former


II. Quantitative Results

We present quantitative comparison on semantic segmentation, object detection, and image captioning tasks. We use Scannet++, Replica, and ProcTHOR datasets. We find that TST-MM outperforms or is on par with all internet-based models, including self-supervised generalists trained on large-scale internet datasets or task specialist models.

| Method | Semantic Segmentation: Scannet++ mIoU | Semantic Segmentation: ProcTHOR mIoU | Semantic Segmentation: Replica mIoU | Object Detection: Scannet++ mAP | Object Detection: ProcTHOR mAP | Captioning: ProcTHOR CIDEr | Captioning: ProcTHOR SPICE |
|---|---|---|---|---|---|---|---|
| Scratch (no pre-training) | 7.49 | 28.62 | 9.23 | 2.35 | 24.59 | 17.1 | 14.8 |
| 4M (RGB-only) / MAE | 13.74 | 46.29 | 18.18 | 18.31 | 37.17 | 30.4 | 19.1 |
| 4M-21 | 27.59 | 53.24 | 26.30 | 25.91 | 41.43 | 36.2 | 20.3 |
| DINOv2 | 28.60 | 54.50 | 26.72 | 23.67 | 40.28 | 14.7 | 13.5 |
| CLIP | 23.02 | 48.66 | 20.92 | 19.75 | 38.47 | 18.4 | 16.2 |
| Task-Specific Methods / SOTA | 34.75 | 56.72 | 28.51 | 23.59 | 44.10 | 40.6 | 21.0 |
| TST-MM | 34.49 | 60.85 | 32.87 | 31.54 | 49.38 | 34.3 | 20.4 |
| TST-MM (Adapted) | 36.44 | 60.59 | 34.53 | 35.83 | 51.25 | 39.9 | 20.5 |

III. Adaptation

For all the results until this point, we start from scratch and train the model from random initialization. However, TST can also serve as an adaptation mechanism for existing generalist models, making them more performant in the test space. Here, we start from a pre-trained 4M-21 model and fine-tune it on data from the test space, using the multimodal masked modeling objective of TST-MM. We find that the resulting model, TST-MM (adapted), significantly improves the performance over 4M-21 in the test space.

Multimodality in TST

I. Scaling data vs Scaling modalities

We study the trade-off between using smaller-scale but modality-rich test-space data and using large-scale unimodal (RGB-only) external data. Starting from unimodal pre-training within the test space, as the plot below shows, adding modalities (scaling modalities) yields significantly better performance than adding unimodal data from external sources (scaling data). This suggests that, for building high-performing models in a specific test space, collecting data within that space with a richer set of modalities is more effective than relying on large-scale unimodal data collected from external sources.

II. Is a single modality doing the job?

TST-MM employs cross-modal learning, using various modalities to pre-train a performant specialist model in the test space. To understand whether a single modality contributes most of the improvement over a unimodal model, we conduct a modality-scaling analysis. We plot a curve: starting from an RGB-only model on the left and ending with TST-MM (9 modalities) on the right, we add one modality at a time and average results over eight random sets of modalities for each modality count. Regardless of which modalities are chosen, performance steadily improves as we include more, demonstrating that simply increasing the modality count, rather than hand-picking modalities, boosts performance, and that TST-MM's gains arise from their collective interplay, i.e., multimodality.

Contribution of different modalities to TST performance. We study the effect of each modality on TST by dropping one modality at a time from TST-MM, and by adding one modality at a time to TST-MAE (RGB-only TST). We use the ViT-S backbone. We find that even though some modalities provide higher gains than others when added to the RGB-only TST-MAE, the performance of TST-MM stays relatively stable, agnostic to the choice of the dropped modality.
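The two ablation protocols can be enumerated as follows (a sketch with an illustrative, hypothetical modality list; the paper's full TST-MM set has 9 modalities):

```python
# Enumerate drop-one and add-one ablation configurations for a modality set.
# The names here are illustrative stand-ins, not the paper's exact list.
full_set = ["rgb", "depth", "normals", "edges", "semantics"]

# Drop-one from TST-MM: pre-train with every modality except one non-RGB one.
drop_one = [[m for m in full_set if m != held_out] for held_out in full_set[1:]]

# Add-one to TST-MAE (RGB-only): pre-train with RGB plus one extra modality.
add_one = [["rgb", extra] for extra in full_set[1:]]

print(len(drop_one), len(add_one))   # one ablation run per non-RGB modality
```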

Measuring Specialization in TST

I. Does it specialize to the test space?

The goal of test-space training is to learn a specialized vision model for a given space. In this analysis, we attempt to measure this specialization using cross-house evaluations. Specifically, we ask: do we really need access to the test space itself, or can we substitute it with a similar space? By a similar space, we mean one that is similar in terms of layout and furniture appearance, but not necessarily the same.

We consider 3 spaces. We pre-train a model in each space and evaluate it in all 3 spaces. As the figure below shows, performance is highest along the diagonal, where pre-training and evaluation occur within the same space, showing the value of pre-training data from the test space.

II. Specialization-Generalization Tradeoff

We observed that the best performance on a given test space is achieved when pre-trained on the same space. However, we would expect this specialized model to not generalize well to new houses. Can we keep (or improve) this specialization performance while gaining generalization capabilities by adding more houses during pre-training in addition to the test house?

As the plot above shows, as we add more houses during pre-training, the performance on the held-out new houses increases, as expected. However, the performance on the original test space drops compared to the specialist single-space pre-training, demonstrating a specialization-generalization trade-off of the pre-trained model.

III. How much external data is 1 test space worth?

We found that pre-training on the test space is important for specialization and cannot be substituted with a single similar space. Here, we take this one step further and ask: can we substitute test-space data with data from many similar spaces? How many spaces would we need? We use ProcTHOR to generate a large number of similar houses (IID to the test space) and pre-train models using an increasing number of them.

The curve below compares the performance of each model on a test space not seen during pre-training against pre-training on the corresponding test space itself. We find that even thousands of similar spaces are not enough to substitute pre-training on the exact space that we deploy in.

TST with no external access

I. How far can we go with just access to the test space?

We explore how far we can go when bootstrapping a vision representation for the test space with no external access. This means no external access either for pre-training data or for pseudo-labelling networks that generate additional modalities. To probe this, we pre-train a model with multimodal masked modeling using only sensory modalities (TST-MM, Sensors only) collected from the test space. Here, we refer to RGB images, depth maps, surface normals, and Canny edges as sensory modalities, as they can either be directly collected using hardware sensors or are simple mathematical transformations of them.
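As an illustration of such transformations (a simplified sketch; an actual pipeline would use a proper Canny detector, e.g. OpenCV's, on the RGB image, and calibrated depth rather than synthetic values), surface normals and an edge map can be derived from a depth map as follows:

```python
import numpy as np

# Derive "sensory" modalities from a raw depth map via simple mathematical
# transformations. The depth map here is synthetic for illustration.
rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 5.0, size=(64, 64))      # synthetic depth map (metres)

# Surface normals from depth gradients: n is proportional to
# (-dz/dx, -dz/dy, 1), then normalized to unit length per pixel.
dz_dy, dz_dx = np.gradient(depth)                 # axis 0 = y, axis 1 = x
normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
normals /= np.linalg.norm(normals, axis=2, keepdims=True)

# Simple edge map: threshold the depth-gradient magnitude (a stand-in for a
# real Canny edge detector with smoothing and hysteresis thresholding).
grad_mag = np.hypot(dz_dx, dz_dy)
edges = grad_mag > grad_mag.mean()

print(normals.shape, edges.dtype)                 # (64, 64, 3) bool
```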


We find that pre-training with multimodal masked modeling using only sensory data collected from the test space performs competitively with DINOv2, which is pre-trained on millions of frames collected from the Internet.

II. Sensory Data Collection App

To support future research on test-space training with only sensory data, we release a Swift-based iOS application that leverages the ARKit API and enables any user to collect multimodal sensory data with a consumer iPhone in their own space. The application records data from various sensors including, but not limited to, RGB images, LiDAR depth maps, the IMU, ambient lighting, the magnetometer, and the gyroscope. Check out this webpage for more details.

Discussion & future work

TST presents an alternative to the conventional approach of collecting large-scale, diverse data to train generalist models. It leverages multimodality as supervision to enable an intriguing, yet practically useful, scenario in which performant vision models can be developed for a space without directly accessing external data. We present a consolidated picture of our results in the figure below.

Quantitative summary of TST. The lower bound is scratch (i.e., no pre-training, learning from the external transfer set only). The upper bound is approximated by a fully supervised model trained with a large number of annotated segmentation images. The gap between the lower and upper bounds is the playing field for pre-training methods to fill. All methods share the same model architecture. The results here correspond to semantic segmentation on Scannet++.

We highlight the following key takeaways:


This work serves as a formulation for leveraging data from the deployment space, as opposed to a method that fully solves the problem. The goal of future research is to advance the current methodology towards a paradigm that inches closer to the upper bound without any external access. We are particularly interested in future directions that build better pre-training objectives leveraging multi-view consistency and various hardware sensors, such as the IMU, gyroscope, and magnetometer.

Citation

@inproceedings{singh2026tst,
  title={Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality},
  author={Kunal Pratap Singh and Ali Garjani and Rishubh Singh and Muhammad Uzair Khattak and Efe Tarhan and Jason Toskov and Andrei Atanov and O{\u{g}}uzhan Fatih Kar and Amir Zamir},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}