• Description of our method for performing test-time adaptation.
• Examples of test-time signals that we employ.
• Results of adaptation with RNA vs. TTO on different tasks.
Neural networks are known to be unreliable under distribution shifts. Examples of such shifts include blur due to camera motion, object occlusions, and changes in weather conditions and lighting. Dealing with such shifts is difficult as they are numerous and unpredictable. Therefore, training-time strategies that attempt to take anticipatory measures for every possible shift (e.g., augmenting the training data or building corresponding robustness inductive biases into the architecture) have inherent limitations. This is the main motivation behind test-time adaptation methods, which instead aim to adapt to such shifts as they occur. In other words, these methods choose adaptation over anticipation. In this work, we propose a test-time adaptation framework that efficiently adapts a main network using a feedback signal.
Adaptive vs non-adaptive neural network pipelines.
To be robust, non-adaptive methods rely on training-time interventions that anticipate and counter the distribution shifts expected at test time (e.g., via data augmentation). Upon encountering an out-of-distribution input that these interventions did not anticipate, their predictions may collapse.
Adaptive methods instead create a closed loop and use an adaptation signal at test time, i.e., a quantity that can be computed from the environment at test time. \(h_\phi\) acts as a "controller" by taking in error feedback, computed from the adaptation signal and the model predictions, and adapting \(f_\theta\) accordingly. It can be implemented as (i) a standard optimizer (e.g., SGD) or (ii) a neural network. The former is equivalent to test-time optimization (TTO), while the latter amortizes the optimization process by training a controller network to adapt \(f_\theta\); thus, it can be more efficient and flexible. In this work, we study the latter approach and show its efficiency and flexibility.
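The contrast between the two controller choices can be sketched on a toy problem. Below, `tto_step` runs one step of standard gradient descent on the error to the adaptation signal, while `rna_step` stands in for an amortized controller that maps the error feedback to a parameter update in a single forward pass. The linear model and the hand-crafted controller are purely hypothetical, for illustration; in RNA, \(h_\phi\) is a trained network.

```python
def tto_step(theta, x, signal, lr=0.05):
    """One test-time optimization (TTO) step: gradient descent on the
    squared error between the model prediction and the adaptation signal."""
    pred = theta * x                       # toy model f_theta(x) = theta * x
    grad = 2.0 * (pred - signal) * x       # d/dtheta (pred - signal)^2
    return theta - lr * grad

def rna_step(theta, x, signal, controller):
    """One amortized step: a controller maps the error feedback
    directly to a parameter update in a single forward pass."""
    error = theta * x - signal
    return theta + controller(error, x)

# Hand-crafted "controller" that implements the exact least-squares
# correction for this toy model (a stand-in for a trained h_phi).
controller = lambda error, x: -error / x

theta, x, signal = 1.0, 2.0, 5.0           # ideal theta would be 2.5

theta_rna = rna_step(theta, x, signal, controller)  # one forward pass

theta_tto = theta
for _ in range(20):                        # TTO needs many iterations
    theta_tto = tto_step(theta_tto, x, signal)
```

The amortized update reaches the corrected parameters in one step, while the SGD loop converges gradually; this mirrors the efficiency gap between RNA and TTO reported later.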
Architecture of RNA.
While developing adaptation signals is not the main focus of this study and is independent of the RNA method, we still need to choose some for experimentation.
Existing test-time adaptation signals, or proxies, in the literature include prediction entropy (Wang et al.), spatial autoencoding (Gandelsman et al.) and self-supervised tasks like rotation prediction (Sun et al.), contrastive (Liu et al.) or clustering (Boudiaf et al.) objectives.
The more aligned the adaptation signal is with the target task, the better the performance on the target task (Sun et al., Liu et al.). More importantly, a poor signal can cause the adaptation to fail silently (Boudiaf et al., Gandelsman et al.).
The plot below shows how the loss on the target task changes as different proxy losses from the literature, e.g., entropy or consistency between different middle domains, are minimized.
In all cases the proxy loss decreases; however, the improvement in the target loss varies. Thus, successful optimization of existing proxy losses does not necessarily lead to better performance on the target task.
Adaptation using different signals. Not all improvements in the proxy loss translate into improvements on the target task. We show the results of adapting a pre-trained depth estimation model to a defocus blur corruption by optimizing different adaptation signals: prediction entropy, a self-supervised task (Sobel edge prediction error), and sparse depth obtained from SFM. The plots show how the \(\ell_1\) target error with respect to ground-truth depth (green, left axis) changes as the proxy losses (blue, right axis) are optimized (shaded regions represent the 95% confidence intervals across multiple runs of SGD with different learning rates). Only adaptation with the sparse depth (SFM) proxy leads to a reduction of the target error. This signifies the importance of employing proper signals in an adaptation framework.
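This failure mode can be reproduced on a two-parameter toy problem: gradient descent drives a misaligned proxy loss to zero without improving the target loss, while an aligned proxy (standing in for a signal like sparse SFM depth) improves both. The losses here are hypothetical, chosen only to illustrate the misalignment.

```python
import numpy as np

def gd(loss_grad, w, lr=0.1, steps=50):
    """Plain gradient descent on a given loss gradient."""
    for _ in range(steps):
        w = w - lr * loss_grad(w)
    return w

w_star = np.array([1.0, 1.0])                      # true task optimum
target = lambda w: np.sum((w - w_star) ** 2)       # target task error

# Misaligned proxy: its minimum (w0 + w1 = 0) does not coincide
# with the target's optimum.
proxy = lambda w: (w[0] + w[1]) ** 2
proxy_grad = lambda w: 2.0 * (w[0] + w[1]) * np.ones(2)

# Aligned proxy: shares the target's minimizer.
aligned_grad = lambda w: 2.0 * (w - w_star)

w0 = np.array([2.0, 2.0])
w_mis = gd(proxy_grad, w0)      # proxy loss -> 0, target loss unchanged
w_ali = gd(aligned_grad, w0)    # both losses -> 0
```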
We show some examples of test-time adaptation signals for several geometric and semantic tasks below. Our focus is not on providing an extensive list of adaptation signals, but rather on using practical ones for experimenting with RNA, as well as demonstrating the benefits of signals that are rooted in the known structure of the world and the task at hand. For example, geometric computer vision tasks naturally obey multi-view geometry constraints, making those constraints a good candidate for approximating the test-time error and, consequently, an informative adaptation signal.
Examples of employed test-time adaptation signals. We use a range of adaptation signals in our experiments. These are practical to obtain and yield better performance compared to other proxies. In the left plot, for depth and optical flow estimation, we use sparse depth and optical flow obtained via SFM. In the middle, for classification, we perform \(k\)-NN retrieval for each test image to get \(k\) training images. Each retrieved image has a one-hot label associated with it; combining them gives a coarse label that we use as our adaptation signal. Finally, for semantic segmentation, after performing \(k\)-NN as for classification, we get a pseudo-labelled segmentation mask for each retrieved image. The features of each patch in the test image and the retrieved images are matched, and the top matches are used as sparse supervision.
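The retrieval-based classification signal can be sketched as follows: average the one-hot labels of the \(k\) nearest training images in feature space to form a soft, coarse label. The feature extractor is assumed given; the data here is synthetic, for illustration only.

```python
import numpy as np

def knn_coarse_label(test_feat, train_feats, train_labels, k=3, n_classes=4):
    """Build a coarse adaptation signal by averaging the one-hot labels
    of the k nearest training images in feature space."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of k-NN
    one_hot = np.eye(n_classes)[train_labels[nearest]]
    return one_hot.mean(axis=0)                     # soft coarse label

# Synthetic "training set" features and labels.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(100, 8))
train_labels = rng.integers(0, 4, size=100)

# Query near a known training point; its label should dominate the signal.
coarse = knn_coarse_label(train_feats[0] + 0.01, train_feats, train_labels)
```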
Here are some qualitative results. Zoom in to see the fine-grained details. See the paper for full details.
Key takeaways: RNA adapts successfully across:
• different distribution shifts (common corruptions, 3D common corruptions, cross datasets),
• tasks (depth, optical flow, dense 3D reconstruction, semantic segmentation, image classification),
• and datasets (Taskonomy, Replica, ImageNet, COCO, ScanNet, Hypersim). See the following section for results.
Here is a summary of our observations from adapting with RNA vs TTO. TTO represents the approach of closed-loop adaptation using the adaptation signal but without benefiting from any amortization (the adaptation process is fixed to be standard SGD). These observations hold across different tasks.
The video below demonstrates the efficiency of RNA. It shows an image corrupted with Gaussian noise. The test-time signal is noisy sparse depth from SFM, overlaid on the input image. The predictions at iteration 0 are identical for all methods as this is before any adaptation. Note that RNA attains an improved prediction after a single forward pass. The top-right plot shows how the \(\ell_1\) error changes with iterations; RNA, shown in green, significantly reduces the error.
The video below demonstrates the performance of RNA with increasing supervision. It shows an image corrupted with Gaussian noise. The test-time signal is click annotations, overlaid on the image (2nd row, 1st col). RNA attains improved predictions with only a few annotations.
We now show evaluations for various target tasks and adaptation signals.
Let's first look at qualitative results of RNA vs. the baselines for semantic segmentation on random query images from COCO-CC (left) and depth on images from ScanNet, Taskonomy-3DCC, and Replica-CC (right). The predictions using the adaptation signals described above are shown in the last two rows; they are noticeably more accurate than the baselines. Comparing TTO and RNA, RNA's predictions are more accurate for segmentation and sharper than TTO's for depth (see the ellipses), while being significantly faster.
Adaptation results for semantic segmentation and depth. For semantic segmentation, we use 15 pixel annotations per class. For Taskonomy-3DCC, we use sparse depth with 0.05% valid pixels (30 pixels per image). For ScanNet and Replica-CC, the adaptation signal is sparse depth measurements from SFM with similar sparsity ratios to Taskonomy-3DCC.
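The sparse-depth signal amounts to an \(\ell_1\) loss evaluated only at the few pixels where a measurement exists. A minimal sketch, with synthetic depth and a random validity mask at roughly the sparsity used above (the data and shapes are illustrative, not from the experiments):

```python
import numpy as np

def sparse_depth_loss(pred, sparse_depth, mask):
    """L1 adaptation loss against sparse depth (e.g., from SFM),
    evaluated only at pixels where a measurement exists."""
    return np.abs(pred - sparse_depth)[mask].mean()

H, W = 64, 64
rng = np.random.default_rng(1)
gt = rng.uniform(0.5, 5.0, size=(H, W))           # synthetic "true" depth

# ~30 valid pixels per image, mirroring the sparsity ratio above.
mask = np.zeros((H, W), dtype=bool)
mask.flat[rng.choice(H * W, size=30, replace=False)] = True

loss_perfect = sparse_depth_loss(gt, gt, mask)        # 0: prediction matches
loss_biased = sparse_depth_loss(gt + 1.0, gt, mask)   # constant 1 m offset
```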
We also demonstrate the effectiveness of RNA on dense 3D reconstruction. The goal is to reconstruct a 3D pointcloud of an apartment given a sequence of corrupted images of it. The depth predictions from the pre-adaptation baseline (2nd column) are poor and result in a pointcloud with large artifacts and frequent discontinuities in the scene geometry. To perform adaptation, we compute noisy sparse depth from SFM and use it to adapt the depth model. The predictions from the adapted models are then backprojected to obtain a 3D pointcloud. Both RNA and TTO can significantly correct such errors and recover a 3D-consistent pointcloud; RNA achieves this orders of magnitude faster than TTO.
Adaptation results on 3D reconstruction. Camera poses and 3D keypoints are first obtained from SFM. They are then used to adapt monocular depth predictions for each image, which are then backprojected into a 3D pointcloud.
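The backprojection step is standard pinhole geometry: each pixel is lifted to a ray via the inverse intrinsics and scaled by its predicted depth. A minimal sketch with hypothetical intrinsics (in practice, the SFM camera poses would then map these camera-frame points into a common world frame for fusion across views):

```python
import numpy as np

def backproject(depth, K):
    """Backproject a depth map into a 3D point cloud (camera coordinates)
    using pinhole intrinsics K."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    rays = np.linalg.inv(K) @ pix            # unit-depth rays per pixel
    return (rays * depth.reshape(-1)).T      # HW x 3 points

# Hypothetical intrinsics: focal length 50 px, principal point at (32, 32).
K = np.array([[50.0, 0.0, 32.0],
              [0.0, 50.0, 32.0],
              [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 2.0)               # a flat wall 2 m away
pts = backproject(depth, K)
```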
We also have supportive results on ImageNet classification.
The table on the right shows the results of using 45 coarse labels on ImageNet-{C,3DCC,V2}. This corresponds to 22x coarser supervision compared to the 1000 classes that we evaluate on.
See the paper Section 4.1 for how these coarse labels are computed.
TENT seems to have notable improvements in performance under corruptions for classification, unlike for semantic segmentation and depth.
Using coarse supervision results in even better performance, a further roughly 5 pp reduction in error. Furthermore, on uncorrupted data (i.e., clean) and ImageNet-V2, RNA gives roughly a 10 pp improvement in performance compared to TTO. Thus, coarse supervision provides a useful signal for adaptation while requiring much less effort than full annotation.
We also have results on adaptation using coarse labels computed using DINO pre-trained features, see the paper Table 3 for results.
Adaptation Signal | Method | Clean | IN-C | IN-3DCC | IN-V2 | Rel. Runtime
---|---|---|---|---|---|---
– | Pre-adaptation Baseline | 23.9 | 61.7 | 55.0 | 37.2 | 1.0
Entropy | TENT | 24.7 | 46.2 | 47.1 | 37.1 | 5.5
Coarse labels (WordNet) | Densification | 95.5 | 95.5 | 95.5 | 95.5 | –
| TTO (Online) | 24.7 | 40.6 | 42.9 | 36.8 | 5.7
| RNA (frozen \(f\)) | 16.7 | 41.2 | 40.4 | 25.5 | 1.4

Quantitative adaptation results on the ImageNet (IN) classification task. We report average error (%) for the 1000-way classification task over all corruptions and severities.
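Supervising a 1000-way classifier with a coarse label can be done with a cross-entropy where each coarse class's probability is the sum of the softmax probabilities of the fine classes it contains. A toy sketch with hypothetical groups (6 fine classes, 3 coarse groups; see the paper Section 4.1 for how the actual WordNet groups are built):

```python
import numpy as np

def coarse_label_loss(logits, coarse_groups, coarse_target):
    """Cross-entropy against a coarse label: the probability of a coarse
    class is the sum of the softmax probabilities of its fine classes."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                                   # softmax over fine classes
    coarse_p = np.array([p[g].sum() for g in coarse_groups])
    return -np.log(coarse_p[coarse_target])

# Toy grouping: 6 fine classes mapped to 3 coarse classes.
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

# Mass on fine classes inside the correct coarse group -> low loss;
# mass outside it -> high loss.
good = coarse_label_loss(np.array([5.0, 5.0, 0.0, 0.0, 0.0, 0.0]), groups, 0)
bad = coarse_label_loss(np.array([0.0, 0.0, 5.0, 5.0, 0.0, 0.0]), groups, 0)
```

Note the loss is indifferent to how probability is distributed among fine classes within the correct group, which is exactly why it is much cheaper to annotate.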
RNA is not specific to the choice of architecture of \(f\).
In the table on the right, we show the results for RNA applied to the Dense Prediction Transformer (DPT) (Ranftl et al.) for depth estimation on the Taskonomy dataset (left), and ConvNext (Liu et al.) for ImageNet classification (right).
In both cases RNA achieves better performance and runtime than TTO.
Task (Arch.) | Depth (DPT) | | | Classification (ConvNext) | | |
---|---|---|---|---|---|---|
Shift | Clean | CC | Rel. Runtime | Clean | IN-C | Rel. Runtime |
Pre-adaptation Baseline | 2.2 | 3.8 | 1.0 | 18.1 | 43.0 | 1.0 |
TTO (Online) | 1.8 | 2.6 | 13.9 | 17.8 | 41.4 | 11.0 |
RNA (frozen \(f\)) | 1.1 | 1.6 | 1.0 | 14.3 | 38.0 | 1.1 |
RNA works across different architectures. Lower is better. \(\ell_1\) errors for depth estimation are multiplied by 100 for readability.
We also have additional results showing the following: