Dissecting multimodal learning via regularized masking of multimodal features