Why We Care
Large language models (LLMs) such as GPT-4, BERT, and LLaMA have demonstrated remarkable advances in artificial intelligence, reshaping how individuals and businesses handle everyday tasks. From text generation and machine translation to question answering, LLMs provide versatile assistance by producing human-like responses based on vast amounts of training data. However, as these AI models become integrated into our daily lives, privacy concerns emerge. LLM-powered applications often process sensitive data, such as personally identifiable information (PII), financial information, or healthcare records. According to a report by Business Insider,¹ workers across industries are using tools like ChatGPT to boost productivity, but major companies like Apple, Amazon, and Verizon have imposed restrictions due to fears of data leaks, highlighting the growing tension between AI adoption and data privacy. The potential for unauthorized access to sensitive information, amplified by legal challenges against OpenAI,² underscores the importance of safeguarding personal and proprietary data.
A compelling example of MPC’s potential lies in secure vulnerability detection. Company 1 owns a private large language model specialized in identifying security vulnerabilities, while Company 2 wants to test its proprietary codebase on the first company’s private model. Through MPC, both parties can collaborate to run privacy-preserving LLM inference — ensuring that the LLM identifies potential vulnerabilities in the code without either company revealing their respective assets (i.e., the model and the codebase). First, Company 1 splits the model into random numbers called secret shares, keeps one share, and sends the other to Company 2. In turn, Company 2 secret shares the proprietary codebase, keeping one share and sending the other to Company 1. Both companies now hold shares of the model and shares of the code, which do not reveal anything about the original model and code, respectively. Now both companies can use their shares to perform inference. Lastly, the company with the proprietary code receives the final result which indicates the identified vulnerabilities (or their absence), while both entities maintain complete confidentiality over their inputs. Let’s see this example visually:
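The additive secret sharing used in the example above can be sketched in a few lines of Python. This is an illustrative toy, not Curl's actual protocol: the modulus, the sample values 42 and 1337, and the function names are all made up for the example.

```python
import random

PRIME = 2**61 - 1  # all arithmetic happens modulo a large number

def share(secret):
    """Split a secret into two additive shares that sum to it mod PRIME."""
    s1 = random.randrange(PRIME)  # each share alone is uniformly random
    s2 = (secret - s1) % PRIME
    return s1, s2

def reconstruct(s1, s2):
    """Only both shares together reveal the secret."""
    return (s1 + s2) % PRIME

# Company 1 shares a (toy) model weight; Company 2 shares a code token id
w1, w2 = share(42)
t1, t2 = share(1337)

# Linear operations work share-wise: the parties obtain shares of
# (weight + token) without ever seeing each other's cleartext value
sum1, sum2 = (w1 + t1) % PRIME, (w2 + t2) % PRIME
```

Reconstructing `sum1` and `sum2` yields 42 + 1337 = 1379, even though neither "party" ever saw both inputs in the clear.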
Unfortunately, while MPC addresses the privacy concerns around the companies’ confidential inputs, existing approaches face multiple challenges. More specifically, to handle complex non-linear operations, which are a crucial component of LLMs, techniques such as polynomial approximations or look-up tables are often employed. However, the efficiency and scalability of these techniques often fall short of unlocking the full potential of privacy-preserving machine learning (PPML) in real-world applications.
Previous Work on CrypTen: A Research Tool for Secure Machine Learning in PyTorch
CrypTen is a privacy-preserving machine learning framework using secure multiparty computation.³ Developed by Meta Research, it integrates seamlessly with the widely-used PyTorch framework, offering a familiar tensor-based interface that makes it easy for machine learning practitioners to work with encrypted data. CrypTen supports a range of operations necessary for training and inference on encrypted models, including approximations of non-linear functions, enabling secure execution of complex machine learning tasks. Its goal is to promote the adoption of privacy techniques in machine learning by providing a flexible framework that makes these advanced privacy-preserving methods accessible to researchers and developers, even those without a cryptography background.
As a result, CrypTen has been widely adopted by both the machine learning and cryptography communities. Despite its extensive capabilities, the recent advancements in machine learning and LLMs have introduced some significant challenges:
- Activation functions: Although it already supports a variety of activation functions such as sigmoid, hyperbolic tangent, and error function (erf), it lacks support for others that are crucial for LLMs like GeLU and SiLU.
- Polynomial approximations: CrypTen relies on polynomial approximations for non-linear functions, which results in accuracy loss – for example, its error function, crucial for implementing GeLU, becomes unreliable due to approximation errors.
- Embedding lookups: A core part of LLMs is mapping input sequences to encodings that the model understands. This process, called embedding, is not directly supported in CrypTen and requires workarounds that render it impractical.
- Truncation accuracy: Lastly, transforming machine learning models to their privacy-preserving equivalents requires using fixed point arithmetic in MPC (i.e., performing operations over numbers like 13.92723), which involves a subprocess called truncation. Unfortunately, CrypTen’s truncation protocol introduces errors with very low probability during inference. While these errors may not be noticeable in smaller models due to their rarity, they can accumulate considerably in larger models, like GPT-2, which requires thousands of truncations.
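To illustrate why truncation is needed at all, here is a toy fixed-point sketch in plain Python. The 16-bit scale and the sample values are arbitrary choices for the example, not CrypTen's actual parameters:

```python
SCALE = 2 ** 16  # fixed-point precision: 16 fractional bits

def encode(x):
    """Represent a real number as a scaled integer."""
    return round(x * SCALE)

def decode(n):
    return n / SCALE

# Multiplying two fixed-point numbers doubles the scale factor...
a, b = encode(13.92723), encode(2.5)
product = a * b  # scale is now SCALE**2

# ...so the result must be truncated back down. In MPC this division
# happens on secret shares, and CrypTen's protocol for it can introduce
# a small error with low probability.
truncated = product // SCALE
result = decode(truncated)  # close to 13.92723 * 2.5
```

In plaintext the truncation is a trivial integer division; doing the same division on shares without leaking anything is what makes the MPC version error-prone.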
In Curl,⁴ we have introduced novel research ideas and have resolved all the aforementioned issues. Before we delve into the internals of how Curl works, let’s take a quick look at some core components of LLMs.
Intro to Large Language Models (LLMs)
Let’s look at the basic components of LLMs and how we can convert them to MPC equivalents. Certain parts consist only of arithmetic operations such as matrix multiplications and additions, which are straightforward to do in MPC, while other parts involve non-linear functions (such as softmax and GeLU) that require specialized protocols. Before we delve into these protocols, we’ll first need to understand the LLM components for which the protocols will be designed.
Word Embeddings
Words are input into the model, but since the model operates mathematically, they need to be represented in numerical form to be processed. Each word is mapped to a unique numerical identifier, called a token. The model has been pre-trained on these tokens, allowing it to derive semantic information from these numerical representations.
However, since these tokens will be private in the MPC context, these lookups would need to be performed over secrets to attain the weights associated with each token.
Additionally, the position of each word within a sentence is encoded separately, as the positional context is critical for comprehension. Since the position of the token is public, this lookup can be done easily in plaintext. The model then calculates a weighted sum of these two vectors – the token representation and the positional encoding – using pre-trained weights. This combined representation enables the model to capture the contextual meaning of each word within the sentence.
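The two lookups and their combination can be sketched as follows. This is a toy: the sizes are tiny, the weights are random rather than pre-trained, and a plain sum stands in for the learned combination (GPT-2, for instance, simply adds the two vectors):

```python
import random
random.seed(0)  # deterministic toy weights

VOCAB, DIM, MAX_LEN = 10, 4, 8  # toy sizes; real models use thousands

# In a real model these tables hold pre-trained weights
token_table = [[random.random() for _ in range(DIM)] for _ in range(VOCAB)]
pos_table = [[random.random() for _ in range(DIM)] for _ in range(MAX_LEN)]

def embed(token_ids):
    """Look up each token's vector and add its positional encoding.
    Under MPC the token ids are private, so the token lookup must be
    performed over secrets; the position lookup can stay in plaintext."""
    return [
        [t + p for t, p in zip(token_table[tok], pos_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

embeddings = embed([3, 1, 4])  # one DIM-sized vector per input token
```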
Layer Normalization
The resulting values are passed through repeated transformer blocks. Each consecutive block learns a different, deeper representation of the input text. The input to each block, however, should be standardized so that the layers in the block neither blow the values up to be extremely large nor shrink them to be vanishingly small. This is where layer normalization becomes an important first step.
This step, however, requires computing the mean and standard deviation, then normalizing each value by subtracting the mean and dividing by the standard deviation. The tricky part in MPC arises during the division by the standard deviation.
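In the clear, layer normalization is only a few lines; the division by the standard deviation (an inverse square root) is the step that needs a specialized MPC protocol:

```python
import math

def layer_norm(xs, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    # Dividing by the standard deviation is the MPC-hard step:
    # it requires a secure inverse square root.
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

out = layer_norm([1.0, 2.0, 3.0, 4.0])  # zero mean, unit variance
```

(The learned scale and bias that real transformer layers apply afterwards are omitted; they are plain multiplications and additions, which are easy in MPC.)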
Self Attention
The heart of the transformer block is the self-attention mechanism. This mechanism enables the model to focus on relevant parts of the input sequence, allowing it to capture complex relationships and dependencies within the data. This is how the model goes from understanding smaller bits of information to understanding bigger chunks of values, such as from words to sentences.
QKV CALCULATION
First, each token’s embedding vector is transformed into three vectors: Query (Q), Key (K), and Value (V). These three vectors mimic a dictionary lookup: the query describes what we are looking for, we match it against the keys, and we return the corresponding values.
MASKED SELF ATTENTION
Next, Masked Self-Attention allows the model to generate sequences by focusing on relevant input parts. It uses dot products, masking, and finally, a softmax. The dot products map queries against keys, while masking allows it to be causal by only focusing on past information. The softmax normalizes the inputs, putting more attention on the important values.
As the softmax requires exponential and division non-linearities, this poses yet another challenge to evaluate under MPC.
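A plaintext sketch of causal (masked) self-attention for a single head makes the hard steps visible: the exponentiation and division live inside the softmax. The 2-token inputs are toy values, and the learned Q/K/V projection matrices are omitted:

```python
import math

def softmax(xs):
    # exponentiation and division: both are expensive non-linearities in MPC
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(Q, K, V):
    """Causal attention: each position attends only to itself and the past."""
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # dot products of the query against past keys only (the mask)
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # weighted sum of the corresponding values
        out.append([sum(w * V[j][c] for j, w in enumerate(weights))
                    for c in range(len(V[0]))])
    return out

attn = masked_attention([[1.0, 0.0], [0.0, 1.0]],
                        [[1.0, 0.0], [0.0, 1.0]],
                        [[1.0, 2.0], [3.0, 4.0]])
```

The first token can only attend to itself, so its output is exactly its own value vector; later tokens mix past values according to the softmax weights.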
MULTIPLE HEADS
This whole process is repeated for multiple heads, each extracting different information out of the connections of the inputs. These are then combined and fed to a fully connected neural network.
Multilayer Perceptron (MLP)
To generate the output, the model first uses a Multilayer Perceptron (MLP) layer to project the self-attention representations into higher dimensions, enhancing the model’s representational capacity. The MLP consists of two linear transformations with an activation function, such as GeLU, in between.
GeLU is another non-linear function that cannot be evaluated directly under MPC.
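In plaintext the MLP block is simple, and only the GeLU in the middle needs a specialized protocol. A minimal sketch with toy weights (the values of W1 and W2 are arbitrary, chosen just for the example):

```python
import math

def gelu(x):
    # Exact GeLU via the Gaussian error function (erf)
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2)))

def mlp(x, W1, b1, W2, b2):
    """Two linear transformations with a GeLU activation in between.
    Weight matrices are given as lists of columns."""
    hidden = [gelu(sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(W1, b1)]  # project up + non-linearity
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(W2, b2)]    # project back down

y = mlp([1.0, -1.0],
        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # W1: 2 -> 3
        [0.0, 0.0, 0.0],
        [[1.0, 1.0, 1.0]],                      # W2: 3 -> 1
        [0.0])
```

Note that `gelu(-1.0)` evaluates to about -0.1587, a value that will come up again in the look-up table example below.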
Transformer Head
After the input is passed through all the transformer blocks, the output is finally passed through the final linear layer to prepare it for its final task such as token prediction or classification.
Privacy-Preserving LLMs with Curl 
At this point, we have covered the problem we want to solve (i.e., privacy-preserving LLM inference while keeping the inputs, the outputs, and the model private), how LLMs work, and which parts are challenging to compute under MPC (i.e., look-ups like embeddings and non-linear functions like layer normalization, softmax, GeLU, etc.). Now, let’s take a deep dive into how MPC can evaluate the non-linear layers.
Look-Up Table Evaluation for Non-Linear Layers
Any function can be encoded as a look-up table; that is, a table consisting of input–output value pairs. Take for example the square function f(x) = x², defined for x in {1, 2, 3, 4}. We can create a look-up table for this function as:
[table id=1 /]
This way, instead of evaluating the function, we can “look up” the output based on the input.
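In code, a look-up table is simply a precomputed input-to-output mapping (the helper name is made up for the example):

```python
def build_lut(f, inputs):
    """Precompute f on a fixed set of inputs."""
    return {x: f(x) for x in inputs}

# Look-up table for f(x) = x^2 on {1, 2, 3, 4}
square_lut = build_lut(lambda x: x * x, [1, 2, 3, 4])

# Instead of evaluating the function, we look the answer up
value = square_lut[3]  # same as 3 ** 2
```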
Transitioning to our two-party setting from before, suppose the two companies want to evaluate the GeLU function under MPC on a private input held by Company 1, whose private value x equals -1. Company 1 picks a random number, -2, sends it to Company 2, and keeps the difference, 1 (since -1 – (-2) = 1). Now the two companies hold additive secret shares of the input x. Notice that Company 2, viewing only its own share, learns nothing about x = -1. Next, the two companies encode the GeLU function they want to evaluate as a look-up table:
[table id=2 /]
The goal of the protocol is to evaluate the GeLU look-up table on the private input that corresponds to -1 and acquire GeLU(-1) = -0.17 privately, meaning that neither company learns the final result. To help them, the two companies bring in an additional party, called the Dealer. The Dealer generates a random number r = 2 and secret shares it with the two companies (Company 1 gets [r]₁ = 2 and Company 2 gets [r]₂ = 0). Notice here that neither company learns anything about r from the number it received (2 or 0), as neither knows the other’s share. The Dealer also encodes the random number r as a vector of all zeros with a single one at the position of r; namely {0, 0, 0, 0, 0, 0, 1, 0}, with the one in the second position (starting from the right). Finally, the Dealer also secret shares this one-hot vector with the two companies, such that Company 1 gets [-1, 0, -2, -1, 1, 3, 0, 1]₁ and Company 2 gets [1, 0, 2, 1, -1, -3, 1, -1]₂.
Both companies add the random numbers [r]₁ and [r]₂ to their private shares of x as:
- Company 1 computes [x – r]₁ = [x]₁ – [r]₁ = 1 – 2 = -1,
- Company 2 computes [x – r]₂ = [x]₂ – [r]₂ = -2 – 0 = -2,
and exchange them to compute: x-r = [x – r]₁ + [x – r]₂ = -1 + (-2) = -3. They use the result to rotate the public lookup table:
[table id=3 /]
Next, both companies multiply their one-hot vector shares entry-wise with the rotated public look-up table. Each company ends up with a new vector that holds shares of 0 everywhere except the second-to-last position, which holds shares of -0.17. More specifically:
- Company 1 ends up with [-0.841, 0, 5.992, 0.001, -0.004, -0.477, 0, 0]₁,
- Company 2 ends up with [0.841, 0, -5.992, -0.001, 0.004, 0.477, -0.17, 0]₂.
Finally, each company takes the sum of all the entries on their vector:
- Company 1 computes the sum 4.671,
- Company 2 computes the sum -4.841.
And voilà! The two companies ended up with secret shares of -0.17, which was the intended value for GeLU(-1). Neither of the companies learned anything about that value; Company 1 has a random-looking number (4.671) and Company 2 has another random-looking number (-4.841). But together, they hold secret shares of -0.17 (= 4.671 – 4.841).
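The whole protocol can be replayed in plain Python. This is a plaintext simulation for intuition only: both “parties” live in one process, the Dealer is simulated locally, and the 8-entry table over the inputs -4..3 is a toy choice (the real protocol uses much larger tables and fixed-point arithmetic):

```python
import math
import random

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2)))

N = 8                                   # table size
xs = [i - 4 for i in range(N)]          # inputs -4..3
table = [gelu(v) for v in xs]           # public look-up table for GeLU

# Secret share the private index of x = -1 (position 3 in the table)
x_idx = xs.index(-1)
x1 = random.randrange(N)
x2 = (x_idx - x1) % N                   # additive shares mod N

# Dealer: a random r, its shares, and a secret-shared one-hot vector for r
r = random.randrange(N)
r1 = random.randrange(N)
r2 = (r - r1) % N
onehot = [1.0 if i == r else 0.0 for i in range(N)]
oh1 = [random.uniform(-1, 1) for _ in range(N)]
oh2 = [onehot[i] - oh1[i] for i in range(N)]

# Both parties open x - r; this reveals nothing about x since r is random
shift = ((x1 - r1) + (x2 - r2)) % N
rotated = [table[(i + shift) % N] for i in range(N)]  # rotate public table

# Each party locally dot-products its one-hot share with the rotated table
y1 = sum(a * t for a, t in zip(oh1, rotated))
y2 = sum(a * t for a, t in zip(oh2, rotated))

# y1 and y2 are secret shares of GeLU(-1): rotated[r] = table[x_idx]
result = y1 + y2
```

The key identity is that rotating the table by x − r and then selecting position r lands exactly on position x, while each party only ever sees random-looking numbers.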
The aforementioned protocol is also depicted in the following animation:
Are we done yet?? Ugh, not quite. Although we computed a non-linear function under MPC in the previous example, in reality, we abstracted away many challenges. For instance, GeLU(-1) is not quite -0.17, but -0.1587. You might think this is a teeny tiny difference, but in an LLM this small error can propagate through the layers and cause a wrong inference result. This inaccuracy is directly related to the size of the look-up table and the precision used – i.e., how many entries we used to encode a certain range of inputs and how many bits we used to do that. Although adding more entries to the look-up table improves accuracy, it also significantly increases computation and communication costs, which slow down the whole process. We’ve got a solution for that, and it’s called Discrete Wavelet Transforms – or DWT for short.
Discrete Wavelet Transforms (DWT)
A discrete wavelet transform – or DWT – is a technique from the signal-processing domain that splits a signal into approximation and detail coefficients. The approximation coefficients contain most of the information of the original signal, while the detail coefficients capture the small differences. Given both sets of coefficients, we can reconstruct the original signal exactly. What is more interesting is that we can apply the DWT multiple times and get exponentially smaller signals, until we end up with a very compressed version of our signal. Interestingly, for smooth functions, this compressed signal is still a very good approximation of the original. This technique can be visualized as:
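Concretely, one level of the Haar DWT (the simplest wavelet) computes pairwise averages (approximation) and pairwise half-differences (detail). The sample signal below is arbitrary:

```python
def haar_dwt(signal):
    """One level of the Haar DWT: split into approximation and detail."""
    approx = [(signal[i] + signal[i + 1]) / 2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2
              for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse transform: reconstruct the original signal exactly."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_dwt(signal)       # 8 samples -> 4 + 4 coefficients

# Applying the transform again halves the signal once more; for smooth
# signals the repeated approximation alone remains a good summary
approx2, detail2 = haar_dwt(approx)     # 4 -> 2 + 2 coefficients
```

Keeping only the approximation coefficients at each level is what lets Curl shrink a look-up table while staying close to the original function.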
Putting All the Pieces Together 
Now we have all the tools needed to describe Curl.⁴ Our key observation is that compressing look-up tables with the DWT and evaluating them under MPC results in significant latency improvements, communication reduction, and better approximations! This effectively reduces the size of the original look-up tables while preserving the accuracy of non-linear functions, thus surpassing traditional piecewise polynomial approximation methods (like those in CrypTen). By minimizing communication costs, Curl significantly enhances end-to-end runtime performance and supports a wide range of activation functions including GeLU, SiLU, the Gaussian error function, sigmoid, and hyperbolic tangent. Let’s see an end-to-end example of evaluating an LLM with Curl:
Finally, let’s see how Curl improves upon CrypTen for different non-linear functions and how Curl performs for different LLM architectures. The overall runtime of an MPC protocol is influenced by three key factors:
- Latency: local computation time,
- Rounds: the number of times the parties have to communicate,
- Communication: the volume of data exchanged between the parties.
We measured these factors for several popular functions and found that the secure DWT-LUT approach implemented within CrypTen (a.k.a. Curl) improves all three areas while maintaining similar or higher accuracy. The following plots illustrate the results for four operations: inverse square root (invsqrt), reciprocal, GeLU, and sigmoid.
The tools provided by Curl enable secure evaluation of various LLM models, including BERT and GPT models. Below, we demonstrate how Curl performs across the three key factors that impact the overall runtime of an MPC protocol: latency, number of rounds, and communication.
From Research to Practice
Our mission is to make cutting-edge research innovations accessible to everyday users. To achieve this, our next step is integrating Curl optimizations into a product called AIVM, a platform designed to simplify access to privacy-preserving AI models.
AIVM is built on a client-server architecture, where data owners (clients) interact with a Curl-powered cluster of data processors (servers). We envision AIVM as a Machine Learning as a Service (MLaaS) platform, offering two core features:
- Privacy-preserving inference on existing models: Clients secret share their data and send these shares to different cluster nodes, which are managed by different stakeholders. The data processor nodes collaborate to produce an inference result without accessing the raw data at any point during the execution.
- Privacy-preserving model upload: Model owners can offer their models in a privacy-preserving manner by uploading them to AIVM. As with privacy-preserving inference, models are secret shared and distributed across the cluster so that no single node has access to the cleartext weights. In this way, inference samples remain private, models are stored securely, and owners retain their intellectual property.
Although AIVM is still in the early development stages, our system already supports high-performance models like LeNet and BERT Tiny, capable of handling tasks in both image classification and language processing. For image classification, we provide examples such as cat vs. dog classification and handwritten digit recognition, while in language processing, users can explore examples like spam detection and sentiment analysis. Additionally, users can easily retrain models for other tasks using our provided training scripts, making AIVM a versatile and scalable platform for a wide range of applications. AIVM is documented at https://docs.nillion.com/aivm.
References
¹ Business Insider. “Companies that have Issued Bans or Restrictions on ChatGPT.” July 2023. https://www.businessinsider.com/chatgpt-companies-issued-bans-restrictions-openai-ai-amazon-apple-2023-7.
² The Washington Post on legal challenges against OpenAI. April 2024. https://www.washingtonpost.com/technology/2024/04/09/openai-lawsuit-regulation-lawyers.
³ Brian Knott, Shobha Venkataraman, Awni Hannun, Shubho Sengupta, Mark Ibrahim, and Laurens van der Maaten. “CrypTen: Secure Multi-Party Computation Meets Machine Learning.” In Advances in Neural Information Processing Systems (NeurIPS), 2021. PDF: https://arxiv.org/pdf/2109.00984. Code: https://github.com/facebookresearch/CrypTen.
⁴ Manuel B. Santos, Dimitris Mouris, Mehmet Ugurbil, Stanislaw Jarecki, José Reis, Shubho Sengupta, and Miguel de Vega. “Curl: Private LLMs through Wavelet-Encoded Look-Up Tables.” In Conference on Applied Machine Learning for Information Security (CAMLIS), 2024. PDF: https://eprint.iacr.org/2024/1127.pdf. Code: https://github.com/jimouris/curl.