Publications | Dominick Reilly

2024

Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads

Ali Khaleghi Rahimian, Manish Kumar Govind, Subhajit Maity, Dominick Reilly, and 3 more authors

In arXiv, 2024

Visual perception tasks are predominantly solved by Vision Transformer (ViT) architectures, which, despite their effectiveness, encounter a computational bottleneck due to the quadratic complexity of computing self-attention. This inefficiency is largely due to the self-attention heads capturing redundant token interactions, reflecting inherent redundancy within visual data. Many works have aimed to reduce the computational complexity of self-attention in ViTs, leading to the development of efficient and sparse transformer architectures. In this paper, viewing through the efficiency lens, we realized that introducing any sparse self-attention strategy in ViTs can keep the computational overhead low. However, these strategies are sub-optimal as they often fail to capture fine-grained visual details. This observation leads us to propose a general, efficient, sparse architecture, named Fibottention, for approximating self-attention with superlinear complexity that is built upon Fibonacci sequences. The key strategies in Fibottention include: it excludes proximate tokens to reduce redundancy, employs structured sparsity by design to decrease computational demands, and incorporates inception-like diversity across attention heads. This diversity ensures the capture of complementary information through non-overlapping token interactions, optimizing both performance and resource utilization in ViTs for visual representation learning. We embed our Fibottention mechanism into multiple state-of-the-art transformer architectures dedicated to visual tasks. Leveraging only 2-6% of the elements in the self-attention heads, Fibottention, in conjunction with ViT and its variants, consistently achieves significant performance boosts compared to standard ViTs in nine datasets across three domains: image classification, video understanding, and robot learning tasks.

Just Add π! Pose Induced Video Transformers for Understanding Activities of Daily Living

Dominick Reilly and Srijan Das

In CVPR, 2024

Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or π-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of π-ViT are two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton Induction Module, that are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows π-ViT to discard the modules during inference. Notably, π-ViT achieves the state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.

Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders

Srijan Das, Tanmay Jain, Dominick Reilly, Pranav Balaji, and 5 more authors

In WACV, 2024

Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite their success, ViTs lack inductive biases, which can make it difficult to train them with limited data. To address this challenge, prior studies suggest training ViTs with self-supervised learning (SSL) and fine-tuning sequentially. However, we observe that jointly optimizing ViTs for the primary task and a Self-Supervised Auxiliary Task (SSAT) is surprisingly beneficial when the amount of training data is limited. We explore the appropriate SSL tasks that can be optimized alongside the primary task, the training schemes for these tasks, and the data scale at which they can be most effective. Our findings reveal that SSAT is a powerful technique that enables ViTs to leverage the unique characteristics of both the self-supervised and primary tasks, achieving better performance than typical ViTs pre-training with SSL and fine-tuning sequentially. Our experiments, conducted on 10 datasets, demonstrate that SSAT significantly improves ViT performance while reducing carbon footprint. We also confirm the effectiveness of SSAT in the video domain for deepfake detection, showcasing its generalizability.

2022

DP-Shield: Face Obfuscation with Differential Privacy

Muhammad Usama Saleem, Dominick Reilly, and Liyue Fan

In International Conference on Extending Database Technology (EDBT), 2022

An immense amount of image data is captured and shared nowadays, e.g., social media and surveillance databases. Such image data may contain sensitive information, such as faces, which can be misused if in the hands of an adversary. Widely used image privacy solutions obfuscate faces, e.g., via pixelization and blurring, before sharing with untrusted parties. However, they do not provide quantifiable privacy guarantees and are prone to inference attacks. In this demo, we present DP-Shield, an interactive framework for face image obfuscation under the rigorous notion of differential privacy. DP-Shield showcases our recently proposed obfuscation methods, namely DP-Pix and DP-SVD, and also includes two alternative methods for comparison. The audience will be able to learn about existing DP methods by interacting with them using real-world face image datasets. Furthermore, DP-Shield integrates widely used image quality measures and practical privacy risk measures (i.e., face recognition) to illustrate the efficacy of our methods.

2021

A Comparative Evaluation of Differentially Private Image Obfuscation

Dominick Reilly and Liyue Fan

In IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), 2021

Image data may contain sensitive information, such as face and iris, which can be misused if in the hands of an adversary. As image data is continuously being collected and shared, it is imperative to ensure the privacy of image data. Widely used image obfuscation methods apply blurring or pixelization to those sensitive regions. However, they are prone to inference attacks, and do not provide quantifiable privacy guarantees. Recently, several obfuscation approaches have been proposed to satisfy the rigorous notion of differential privacy. The goal of this work is to provide a comparative evaluation of those previously proposed approaches in the context of obfuscating face and iris images. We synthesize existing differentially private obfuscation methods and analyze their privacy guarantees. Furthermore, we conduct an extensive empirical evaluation regarding practical utility and privacy protection, with real-world face and iris image datasets. We find that DP-SVD outperforms other methods on several privacy and utility measures. Moreover, we provide an in-depth discussion of our results and point to several considerations when applying those differentially private image obfuscation methods.