A Modular System for Enhanced Robustness of Multimedia Understanding Networks via Deep Parametric Estimation

Barbato, F.; Yucel, M. K.; Zanuttigh, P.; Ozay, M.; Michieli, U.

doi:10.1145/3625468.3647623

Performance degradation caused by corrupted multimedia samples is a critical challenge for machine learning models. Three groups of approaches have previously been proposed to tackle this issue: i) enhancer and denoiser modules that improve the quality of the noisy data, ii) data augmentation approaches, and iii) domain adaptation strategies. All have drawbacks that limit their applicability: the first requires paired clean-corrupted data for training and has a high computational cost, while the others can only be used on the same task they were trained on. In this paper, we propose SyMPIE to address these shortcomings, designing a small, modular, and efficient system that enhances input data for robust downstream multimedia understanding with minimal computational cost. SyMPIE is pre-trained on an upstream task/network that need not match the downstream ones, and it does not need paired clean-corrupted samples. Our key insight is that most input corruptions found in real-world tasks can be modeled through global operations on color channels of images or spatial filters with small kernels. We validate our approach on multiple datasets and tasks, such as image classification (on ImageNetC, ImageNetC-Bar, VizWiz, and a newly proposed mixed corruption benchmark named ImageNetC-mixed) and semantic segmentation (on Cityscapes, ACDC, and DarkZurich), with consistent improvements of about 5% relative accuracy gain across the board.
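
To make the key insight concrete, the following is a minimal NumPy sketch of the two families of operations the abstract names: a global operation on color channels and a spatial filter with a small kernel. It is an illustration only, not the SyMPIE implementation. The function names (`apply_color_transform`, `apply_small_kernel`, `enhance`), the choice of a 3x3 color matrix plus bias, and the edge padding are all assumptions, and the parameters are supplied by hand rather than estimated by a network.

```python
import numpy as np


def apply_color_transform(image, matrix, bias):
    """Global color operation (assumed form): the same 3x3 matrix and
    bias are applied to the RGB vector of every pixel."""
    # image: (H, W, 3) float array; matrix: (3, 3); bias: (3,)
    return image @ matrix.T + bias


def apply_small_kernel(image, kernel):
    """Spatial filter: correlate each channel with a small k x k kernel.
    Edge padding keeps the output the same size as the input."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros(image.shape, dtype=np.float64)
    # Accumulate one shifted, weighted copy of the image per kernel tap.
    for dy in range(k):
        for dx in range(k):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w, :]
    return out


def enhance(image, matrix, bias, kernel):
    """Hypothetical pipeline: color transform, then spatial filter,
    then clipping to the valid [0, 1] intensity range."""
    recolored = apply_color_transform(image, matrix, bias)
    filtered = apply_small_kernel(recolored, kernel)
    return np.clip(filtered, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8, 3))          # stand-in for a corrupted input
    gain = np.diag([1.1, 1.0, 0.9])        # hand-picked per-channel gains
    box_blur = np.full((3, 3), 1.0 / 9.0)  # hand-picked 3x3 smoothing kernel
    enhanced = enhance(image, gain, np.zeros(3), box_blur)
    print(enhanced.shape)
```

With an identity color matrix, zero bias, and a kernel whose only nonzero entry is a 1 at the center, `enhance` returns its input unchanged, which is a convenient sanity check for this kind of parametric formulation.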