Contrastive Learning of Sentence Embeddings from Scratch

“You can have data without information, but you cannot have information without data.” – Daniel Keys Moran.

When it comes to natural language processing (NLP), the same principle applies: you can have a sentence, but without understanding its meaning, that sentence is just a string of words. That’s where sentence embeddings come into play. They help turn those words into a format that machines can actually understand—by converting sentences into numerical vectors. This transformation is a big deal because, in today’s world of semantic search, machine translation, and sentiment analysis, machines need to grasp the meaning behind the words, not just the words themselves.

You might be wondering, why are we talking about contrastive learning? Well, here’s the deal: traditional methods of learning sentence embeddings (think Word2Vec or GloVe) don’t always capture the nuances of language. They often fall short, especially in understanding context or subtle relationships between sentences. That’s where contrastive learning shines. It’s like giving your model a sense of contrast—teaching it to understand not only what’s similar but also what’s different.

Objective:

So, in this blog, I’m going to walk you through exactly how you can build sentence embeddings from scratch using contrastive learning techniques. No fluff, no vague explanations—just actionable steps and a solid foundation. Whether you’re a beginner to this space or looking to enhance your understanding of embeddings, by the end of this post, you’ll be equipped with everything you need to get started.

Overview:

Here’s what we’ll cover:

  • What are sentence embeddings? We’ll talk about what they are and why they’re crucial in NLP.
  • What is contrastive learning? We’ll dig into how it works, specifically in the context of sentence embeddings.
  • How can you implement it? I’ll show you step-by-step how to build your own sentence embeddings using contrastive learning, even if you’re starting from scratch.
  • Use cases and applications. Finally, we’ll explore where these embeddings shine and how they’re used in real-world scenarios.

Buckle up, because this is going to be an exciting ride into the world of sentence embeddings and contrastive learning!

Introduction to Contrastive Learning

Basic Concept:

Let’s start with a fundamental question: How does a machine know if two sentences mean the same thing? This is where contrastive learning steps in.

In simple terms, contrastive learning teaches a model by comparison—it shows the model pairs of examples and asks it to learn what’s similar and what’s different. It’s kind of like giving your brain a crash course in spotting the difference between “apple” and “apple pie.” The idea is to create positive pairs (similar sentences) and negative pairs (dissimilar sentences), then pull the positive pairs closer together in the vector space and push the negative pairs farther apart.

Imagine you’re teaching a child to distinguish between different dog breeds. You show them pictures of two Golden Retrievers (positive pair) and ask them to notice the similarities. Then, you show them a Golden Retriever and a Bulldog (negative pair) to emphasize the differences. That’s contrastive learning. The model learns what makes things similar and, just as importantly, what sets them apart.


Why Contrastive Learning for Sentence Embeddings?

Now you might be wondering: Why use contrastive learning specifically for sentence embeddings? Here’s the deal: traditional methods like word embeddings (e.g., Word2Vec) only capture meanings at a word level. Sure, they’re good at understanding words in isolation, but they miss the bigger picture—the context that emerges when you string words into sentences.

Contrastive learning excels in sentence-level understanding because it focuses on subtle differences between sentences, not just words. Think of two sentences like “The cat sat on the mat” and “The cat lay on the mat.” Traditional methods might think these sentences are almost identical, but with contrastive learning, the model picks up on those small variations in meaning (e.g., “sat” vs. “lay”). It’s not just recognizing words; it’s understanding the intent and context behind them.

In short, contrastive learning doesn’t just help machines understand language—it helps them discriminate between what’s alike and what’s not, giving them a more nuanced understanding of meaning.


Popular Contrastive Learning Techniques:

If you’re thinking, Okay, but how do I actually implement this? — great question! There are a few popular models out there that use contrastive learning specifically for sentence embeddings:

  1. SimCSE (Simple Contrastive Learning of Sentence Embeddings): A popular choice because of its simplicity. It generates positive pairs by passing the same sentence through the encoder twice with different dropout masks, producing two slightly different views of the same sentence (see the sketch after this list).
  2. MoCo (Momentum Contrast): Originally developed for visual representation learning, MoCo keeps a queue of negative examples encoded by a slowly updated momentum encoder, which makes it efficient when you need a large pool of negatives.
  3. InfoNCE (Info Noise-Contrastive Estimation): Strictly a loss function rather than a model. It frames learning as identifying the positive pair among a set of negative samples, and it has been widely adopted in both NLP and computer vision.

These techniques allow you to train your model to distinguish between similar and different sentences efficiently, giving your embeddings a deeper level of understanding.
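
To make the SimCSE idea concrete, here is a minimal sketch of its dropout trick. The model choice and the use of the [CLS] vector are my own assumptions for illustration; SimCSE itself adds a contrastive objective on top of these two views.

import torch
from transformers import AutoModel, AutoTokenizer

# Minimal SimCSE-style sketch: encode the same sentence twice with dropout
# active, so the two passes give slightly different embeddings (a positive pair).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout enabled so the two forward passes differ

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

with torch.no_grad():
    view1 = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] embedding, pass 1
    view2 = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] embedding, pass 2

# view1 and view2 form a positive pair; other sentences in a batch would act as negatives
print(torch.nn.functional.cosine_similarity(view1, view2).item())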


Key Components of Contrastive Learning in NLP

Positive and Negative Pairs:

At the heart of contrastive learning are positive and negative pairs. Positive pairs can be as simple as two slightly different versions of the same sentence (for example, after applying some data augmentation), while negative pairs are sentences that differ semantically.

You might be thinking: How do I generate positive pairs for my data? Great question! One common approach is using data augmentation techniques such as:

  • Paraphrasing: You create different versions of the same sentence by changing its structure without altering its meaning. For example, “The dog is running fast” becomes “The dog is sprinting quickly.”
  • Back translation: Translate a sentence into another language (like French) and back to English. This often creates small variations in sentence structure, which can be used as positive pairs.
  • Token-level perturbations: This involves small changes like dropping or swapping words in a sentence, creating enough variation for positive pairs.

The negative pairs are simply sentences that are unrelated or have different meanings. For instance, “The cat is sleeping on the couch” and “The dog is running in the yard” form a negative pair.
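
To make this concrete, here is a small, purely illustrative helper (the function names are hypothetical) that turns a list of sentences into (anchor, positive, negative) triples: a token-dropped copy of each sentence serves as the positive, and a randomly drawn different sentence serves as a weak negative.

import random

def token_dropout(sentence, drop_prob=0.1):
    # Randomly drop tokens to create a slightly perturbed positive view
    tokens = sentence.split()
    kept = [t for t in tokens if random.random() > drop_prob]
    return " ".join(kept) if kept else sentence

def make_triples(sentences):
    # Yield (anchor, positive, negative) triples from a list of sentences
    for i, anchor in enumerate(sentences):
        positive = token_dropout(anchor)
        negative = random.choice([s for j, s in enumerate(sentences) if j != i])
        yield anchor, positive, negative

corpus = [
    "The cat is sleeping on the couch.",
    "The dog is running in the yard.",
    "She bought fresh bread this morning.",
]
for anchor, positive, negative in make_triples(corpus):
    print(anchor, "|", positive, "|", negative)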


Contrastive Loss Function:

Now, here’s the math-heavy part, but don’t worry, I’ll keep it simple. The contrastive loss function is what ensures that the model brings similar sentences closer together and pushes dissimilar ones apart. It’s like setting up a goalpost and training the model to score consistently.

Most commonly, you’ll see cosine similarity used as the similarity measure, plugged into either a cross-entropy-style objective (as in InfoNCE) or a margin-based contrastive or triplet loss. Either way, the objective is to minimize the distance between positive pairs and maximize the distance between negative pairs.

Imagine two magnets—positive pairs attract (similar sentences), and negative pairs repel (dissimilar sentences). The loss function helps guide the model in balancing this attraction-repulsion force.
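
To show what this looks like in code, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, assuming you already have one embedding per sentence. Cosine similarities (dot products of L2-normalized vectors) are scaled by a temperature and passed to cross-entropy, which pulls each anchor toward its positive and away from the other sentences in the batch.

import torch
import torch.nn.functional as F

def info_nce_loss(anchors, positives, temperature=0.05):
    # anchors, positives: (batch_size, embedding_dim); positives[i] matches anchors[i]
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)

    # Cosine similarity of every anchor with every positive in the batch
    sim = anchors @ positives.t() / temperature  # shape: (batch, batch)

    # Diagonal entries are the true (positive) pairs; everything else is a negative
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)

# Toy usage with random embeddings: positives sit close to their anchors
a = torch.randn(8, 768)
p = a + 0.01 * torch.randn(8, 768)
print(info_nce_loss(a, p).item())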


Data Augmentation for Contrastive Learning:

Data augmentation is crucial for generating those positive pairs I mentioned earlier. Here are a few strategies that work well for sentences:

  • Synonym Replacement: Swap out words in a sentence with their synonyms. For example, “happy” becomes “joyful.” This creates a positive pair while keeping the meaning intact.
  • Dropout Masks / Word Masking: Randomly drop or mask certain words in a sentence (SimCSE pushes this idea further and relies only on the encoder’s internal dropout). This forces the model to understand the underlying structure even when part of the sentence is missing.
  • Sentence Reordering: Slightly change the order of clauses within a sentence. For example, “After eating lunch, I went to the park” becomes “I went to the park after eating lunch.”

These techniques ensure that your model has a rich variety of positive samples to learn from, making it more robust in distinguishing sentence meaning.
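
As one concrete example, a rough synonym-replacement augmenter can be sketched with NLTK’s WordNet. This assumes you have run nltk.download('wordnet') beforehand, and it is deliberately naive about word sense and part of speech; treat it as a starting point rather than a polished augmenter.

import random
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def synonym_replace(sentence, replace_prob=0.3):
    # Swap some words for a WordNet synonym to create a positive variant
    new_tokens = []
    for token in sentence.split():
        lemmas = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(token)
            for lemma in synset.lemmas()
        }
        lemmas.discard(token)
        if lemmas and random.random() < replace_prob:
            new_tokens.append(random.choice(sorted(lemmas)))
        else:
            new_tokens.append(token)
    return " ".join(new_tokens)

print(synonym_replace("The happy dog is running fast"))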

How to Build Sentence Embeddings from Scratch with Contrastive Learning

Prerequisites:

Before you jump into coding, ensure you have the necessary tools:

  • Libraries: Install PyTorch, Hugging Face’s Transformers, and other essentials.
pip install torch transformers datasets

  • Pre-trained Models: I recommend using pre-trained models like BERT or RoBERTa for fine-tuning instead of starting completely from scratch. Hugging Face’s Transformers makes this easy.

Step-by-Step Guide:

Now let’s get into the nitty-gritty. I’ll walk you through how to build sentence embeddings with code snippets along the way.


Step 1: Data Preparation

First, we need data, specifically positive and negative sentence pairs. Let’s assume you’re working with the SNLI dataset, which provides labeled sentence pairs for supervised learning.

You can load it using Hugging Face’s datasets:

from datasets import load_dataset

dataset = load_dataset("snli")
train_data = dataset['train']
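
In SNLI, each premise/hypothesis pair carries a label (0 = entailment, 1 = neutral, 2 = contradiction, -1 = no gold label), so one simple recipe (a sketch, not the only option) is to treat entailment pairs as positives and contradiction pairs as negatives:

# Entailment pairs make natural positives, contradiction pairs natural negatives;
# examples with label -1 have no gold annotation and are skipped implicitly.
positive_pairs = [
    (example['premise'], example['hypothesis'])
    for example in train_data
    if example['label'] == 0
]
negative_pairs = [
    (example['premise'], example['hypothesis'])
    for example in train_data
    if example['label'] == 2
]

print(len(positive_pairs), len(negative_pairs))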

For unsupervised learning, you can create positive pairs by augmenting sentences. Here’s a basic example of back translation using MarianMT translation models from Hugging Face to generate positive pairs:

from transformers import MarianMTModel, MarianTokenizer

# Define the translation models
src_model_name = 'Helsinki-NLP/opus-mt-en-de'
tgt_model_name = 'Helsinki-NLP/opus-mt-de-en'

src_tokenizer = MarianTokenizer.from_pretrained(src_model_name)
src_model = MarianMTModel.from_pretrained(src_model_name)

tgt_tokenizer = MarianTokenizer.from_pretrained(tgt_model_name)
tgt_model = MarianMTModel.from_pretrained(tgt_model_name)

# Translate a sentence with the given MarianMT model and tokenizer
def translate(sentence, model, tokenizer):
    tokenized_text = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    translation = model.generate(**tokenized_text)
    translated_sentence = tokenizer.decode(translation[0], skip_special_tokens=True)
    return translated_sentence

sentence = "The cat sat on the mat."
translated_sentence = translate(sentence, src_model, src_tokenizer)
back_translated_sentence = translate(translated_sentence, tgt_model, tgt_tokenizer)

# These two sentences are now positive pairs
print(sentence, back_translated_sentence)

Step 2: Model Architecture

For contrastive learning, we can use Siamese Networks to compare two inputs—perfect for tasks like sentence similarity. By using pre-trained models like BERT or RoBERTa as encoders, you can save time and improve performance without building the entire model from scratch.

Here’s an example using Siamese BERT to create sentence embeddings:

import torch
from transformers import BertModel, BertTokenizer

# Load pre-trained BERT model and tokenizer
bert_model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define the Siamese architecture
class SiameseBERT(torch.nn.Module):
    def __init__(self, model):
        super(SiameseBERT, self).__init__()
        self.bert = model
    
    def forward(self, input_ids, attention_mask):
        # Encode both input sentences using BERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use BERT's pooled [CLS] representation as the sentence embedding
        cls_output = outputs.pooler_output
        return cls_output

# Instantiate SiameseBERT
model = SiameseBERT(bert_model)

# Tokenizing sentences (Positive Pair Example)
sentence1 = "The cat is on the mat."
sentence2 = "The feline is sitting on the mat."

inputs1 = tokenizer(sentence1, return_tensors='pt', padding=True, truncation=True)
inputs2 = tokenizer(sentence2, return_tensors='pt', padding=True, truncation=True)

# Forward pass to get embeddings
embedding1 = model(inputs1['input_ids'], inputs1['attention_mask'])
embedding2 = model(inputs2['input_ids'], inputs2['attention_mask'])

# Cosine similarity between the two embeddings
cos_sim = torch.nn.functional.cosine_similarity(embedding1, embedding2)
print(f"Cosine Similarity: {cos_sim.item()}")

What’s happening here?

  1. Pre-trained BERT Model: We load the BERT model using the Hugging Face library. This model acts as the encoder that transforms sentences into embeddings.
  2. Siamese Network: We define a custom neural network class, SiameseBERT, that uses BERT as its base model. This architecture processes two sentences in parallel (hence the term Siamese) and outputs sentence embeddings.
  3. Forward Pass: The forward pass runs both input sentences through BERT, and we use BERT’s pooled [CLS] output as the sentence embedding (mean pooling over the token embeddings is a common alternative).
  4. Cosine Similarity: We then compute the cosine similarity between the two sentence embeddings. The higher the cosine similarity score, the more semantically similar the sentences are. A sketch of a full contrastive training step follows below.
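
To tie these pieces together, here is a hedged sketch of one contrastive training step. It reuses the model and tokenizer defined above, assumes you feed it batches of (sentence, paraphrase) pairs (for example, from back translation), and uses in-batch negatives with an InfoNCE-style loss; the batch contents and hyperparameters are purely illustrative.

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(sentences, positives, temperature=0.05):
    # One contrastive update over a batch of (sentence, positive) pairs
    inputs_a = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
    inputs_p = tokenizer(positives, return_tensors='pt', padding=True, truncation=True)

    emb_a = F.normalize(model(inputs_a['input_ids'], inputs_a['attention_mask']), dim=-1)
    emb_p = F.normalize(model(inputs_p['input_ids'], inputs_p['attention_mask']), dim=-1)

    # In-batch negatives: every other positive in the batch acts as a negative
    sim = emb_a @ emb_p.t() / temperature
    labels = torch.arange(sim.size(0))
    loss = F.cross_entropy(sim, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch of positive pairs
loss = training_step(
    ["The cat sat on the mat.", "A man is playing guitar."],
    ["The cat was sitting on the mat.", "A man plays the guitar."],
)
print(f"Contrastive loss: {loss:.4f}")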

Using RoBERTa Instead of BERT:

If you want to use RoBERTa instead, you just need to swap out the model and tokenizer:

from transformers import RobertaModel, RobertaTokenizer

# Load RoBERTa model and tokenizer
roberta_model = RobertaModel.from_pretrained('roberta-base')
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Instantiate SiameseBERT with RoBERTa
model = SiameseBERT(roberta_model)

Conclusion

We’ve taken a deep dive into building sentence embeddings from scratch using contrastive learning, one of the most powerful techniques for capturing semantic relationships between sentences. From understanding the basic concept of contrastive learning and setting up the right architecture to coding your own Siamese BERT model, you now have the blueprint to start building high-quality sentence embeddings.

Here’s the deal: while traditional methods for sentence embeddings like Word2Vec or even BERT without contrastive learning do a decent job, they sometimes miss the subtleties that really define relationships between sentences. Contrastive learning, on the other hand, focuses on fine-grained differences and similarities, making your sentence embeddings not just good, but great.

By using positive and negative pairs, you’re teaching your model to understand both the context and the nuances of sentence relationships. This method helps capture more contextual meaning and creates embeddings that can be used in real-world applications like semantic search, question answering, or even paraphrase detection.

What’s Next?

You’ve learned the fundamentals of building sentence embeddings using contrastive learning, but this is just the beginning. As you continue to experiment, you can:

  • Fine-tune pre-trained models like BERT or RoBERTa for your specific use cases.
  • Explore different contrastive loss functions or data augmentation techniques to improve your embeddings.
  • Apply these embeddings to downstream NLP tasks to see just how powerful they can be.

Remember, at the end of the day, the key is in the details—how you fine-tune your model, the quality of your data, and the techniques you apply. By following the steps outlined here, you’re well on your way to building sentence embeddings that are not only accurate but also highly performant.

Now it’s your turn—take what you’ve learned and start building!
