Nandan Kumar Jha
Sisyphus: A Cautionary Tale of Using Low-Degree Polynomial Activations in Privacy-Preserving Deep Learning
Privacy concerns in client-server machine learning have given rise to private inference (PI), where neural inference occurs directly on …
Karthik Garimella, Nandan Kumar Jha, Brandon Reagen
PDF · Cite · Code · Poster · PPML Proceeding
CryptoNite: Revealing the Pitfalls of End-to-End Private Inference at Scale
In this paper, we demonstrate how the current trend in private inference myopically optimizes performance only for a zero arrival rate; in particular, prior work develops mechanisms to mitigate the bottleneck caused by nonlinearities in neural networks. However, in real-world scenarios where inference requests arrive at even a moderate rate, homomorphic encryption becomes the main bottleneck, since it can no longer be pre-processed in an offline computation phase.
Karthik Garimella, Nandan Kumar Jha, Zahra Ghodsi, Siddharth Garg, Brandon Reagen
PDF · Cite
ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. …
Nandan Kumar Jha, Brandon Reagen
PDF · Cite · Code · Poster
AERO: Softmax-Only LLMs for Efficient Private Inference
In this work, we present a comprehensive analysis of the role of nonlinearities in transformer-based, decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines existing LLM architectures for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing FLOP counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs, tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to 4.23× communication and 1.94× latency reductions.
Nandan Kumar Jha, Brandon Reagen
PDF · Cite
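To make the AERO idea concrete, here is a minimal, hypothetical sketch of a Softmax-only decoder block with an entropy term on the attention distribution. It is not the paper's released code: the class name `SoftmaxOnlyBlock`, the specific entropy penalty form, and any weighting (e.g. an `entropy_weight` hyperparameter) are illustrative assumptions; the only fidelity to the abstract is that LayerNorm and the GELU feed-forward path are removed, leaving softmax as the sole nonlinearity.

```python
# Hypothetical sketch of a Softmax-only transformer block (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxOnlyBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # No LayerNorm and no GELU feed-forward network:
        # softmax is the only nonlinearity in the block.

    def forward(self, x: torch.Tensor):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into (B, heads, T, head_dim).
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        # Causal mask for decoder-only modeling.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        # Per-row attention entropy; penalizing it during training is one
        # plausible reading of "entropy regularization" (assumed, not the paper's).
        entropy = -(attn * torch.log(attn + 1e-9)).sum(dim=-1).mean()
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return x + self.proj(out), entropy

# A training loop would combine the task loss with the entropy term, e.g.
#   loss = ce_loss + entropy_weight * entropy   # entropy_weight is assumed
```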
Entropy-Guided Attention for Private LLMs
We introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a …
Nandan Kumar Jha, Brandon Reagen
PDF · Cite · Code · Poster