
Transfer Learning with BERT

Transfer learning has revolutionized how we approach machine learning tasks, particularly in Natural Language Processing (NLP). It allows us to leverage pre-trained models, like BERT, and adapt them to specific tasks, reducing the need for extensive data and computational resources. In this blog, we will explore how transfer learning works and apply it step by step using BERT, with practical examples taken directly from my Jupyter notebook, which you can find in my GitHub repository.

What is Transfer Learning?

Transfer learning is a machine learning technique in which a model pre-trained on one task is reused and fine-tuned on a different but related task: general features learned during pre-training are adapted to a specific problem. This approach not only reduces training time but also improves model performance, especially when data is scarce.

In NLP, BERT (Bidirectional Encoder Representations from Transformers) is a popular model used for transfer learning. It has been pre-trained on Wikipedia and BooksCorpus and can be fine-tuned to perform tasks such as text classification, sentiment analysis, and question-answering.

Why BERT?

  1. Leverage Pre-trained Knowledge: BERT comes pre-trained with a robust understanding of language, which we can reuse when adapting the model to new tasks.

  2. Efficient Fine-tuning: Instead of training a model from scratch, transfer learning lets us fine-tune a pre-trained BERT model, saving time and cost (computing resources, storage, and so on).

  3. Improved Accuracy: Transfer learning leads to better generalization (the ability of a model to perform well on unseen data), especially on smaller datasets and complex tasks.

  4. Feature Reuse: Pre-trained models like BERT learn deep, contextualized representations of text that can be adapted to a variety of downstream tasks. Because the general linguistic knowledge is already in place, the model fine-tunes quickly even with limited labeled examples (as in our toy dataset below) and generalizes better to unseen data, reducing the risk of overfitting. A short sketch of what feature reuse looks like in code follows this list.
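
To make feature reuse concrete, here is a minimal sketch (separate from the walkthrough below) that keeps the pre-trained BERT encoder frozen and trains only the newly added classification head. The attribute names follow the Hugging Face transformers library; the frozen-encoder setup itself is just an illustration, not the approach used later in this post.

from transformers import BertForSequenceClassification

frozen_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Freeze the pre-trained encoder so only the classification head is trained,
# reusing the general language features exactly as they were learned.
for param in frozen_model.bert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in frozen_model.parameters() if p.requires_grad)
print(f"Trainable parameters (classification head only): {trainable}")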

Applying Transfer Learning

Pre-processing

The first step is to create a small text dataset with labels: ‘1’ stands for positive sentiment and ‘0’ for negative. In the real world, this would typically come from a pandas DataFrame or some other data source; a sketch of that follows the example below.

texts = ["I am a cool guy",
         "AI will change this world",
         "I hate phone scams",
         "I love cheerful people",
         "I love friendly people",
         "I hate rude people"]
labels = [1, 1, 0, 1, 1, 0]
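
For a real dataset, the same two lists would typically be built from a pandas DataFrame. A minimal sketch, assuming a CSV file with text and label columns (the file name and column names here are hypothetical):

import pandas as pd

# Assumed layout: a CSV with a 'text' column and a 0/1 'label' column.
df = pd.read_csv("reviews.csv")
texts = df["text"].tolist()
labels = df["label"].tolist()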

The next step is to tokenize the input text using the pre-trained BERT tokenizer.

from transformers import BertTokenizer

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
max_len = 128
encodings = tokenizer(texts, padding=True, truncation=True, 
                      max_length=max_len, return_tensors='pt')

Tokenization converts raw texts into the format required by the BERT model, including input IDs and attention masks.
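
To see what that looks like, here is a small check (not in the original notebook) that prints the pieces the tokenizer produced:

# Inspect the tokenizer output: tensor shape, the tokens behind the IDs,
# and the attention mask (1 = real token, 0 = padding).
print(encodings['input_ids'].shape)
print(tokenizer.convert_ids_to_tokens(encodings['input_ids'][0].tolist()))
print(encodings['attention_mask'][0])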

Fine-tuning the model


from transformers import BertForSequenceClassification

# num_labels=2 adds a new, randomly initialized classification head
# on top of the pre-trained BERT encoder.
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Next, we split the data into training and validation sets before encoding them.


from sklearn.model_selection import train_test_split

train_texts, validation_texts, train_labels, validation_labels = train_test_split(
    texts, labels, test_size=0.2)

train_encodings = tokenizer(train_texts, padding=True, truncation=True, max_length=max_len, return_tensors='pt')

validation_encodings = tokenizer(validation_texts, padding=True, truncation=True, max_length=max_len, return_tensors='pt')

Choose return_tensors appropriately for your framework: pt stands for PyTorch and tf for TensorFlow. We use PyTorch here.

Training tensors

import torch

train_input_ids = train_encodings['input_ids']
train_attention_masks = train_encodings['attention_mask']
train_labels_tensor = torch.tensor(train_labels)

Validation tensors

validation_input_ids = validation_encodings['input_ids']
validation_attention_masks = validation_encodings['attention_mask']
validation_labels_tensor = torch.tensor(validation_labels)

Now, two things are required before the training loop: the loss function and the optimizer. Once these are set up, we can start training.
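
The loss function is defined at the top of the training loop below; the optimizer and device still need to be set up. Here is a minimal sketch, assuming the AdamW optimizer with a typical BERT fine-tuning learning rate (both the optimizer choice and the learning rate are assumptions, not taken from the original notebook):

# Run on a GPU if one is available, otherwise on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# AdamW with a small learning rate is a common starting point for fine-tuning BERT.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)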

Training loop

model.train()  # Set the model to training mode
epochs = 13  # Define the number of epochs
loss_fn = torch.nn.CrossEntropyLoss()  # Define the loss function

for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")

    optimizer.zero_grad()  # Clear previous gradients

    # Forward pass
    logits = model(input_ids=train_input_ids.to(device),
                   attention_mask=train_attention_masks.to(device)).logits

    # Calculate loss
    loss = loss_fn(logits, train_labels_tensor.to(device))

    # Backward pass
    loss.backward()   # Calculate gradients
    optimizer.step()  # Update weights

    print(f"Loss: {loss.item()}")

Saving the model

We would now save the fine-tuned model.

model.save_pretrained("./fine_tuned_bert")
tokenizer.save_pretrained("./fine_tuned_bert")
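
As a quick sanity check (not part of the original flow), the saved directory can be loaded back the same way any pre-trained checkpoint is:

# Reload the fine-tuned model and tokenizer from the saved directory.
from transformers import BertForSequenceClassification, BertTokenizer

reloaded_model = BertForSequenceClassification.from_pretrained("./fine_tuned_bert")
reloaded_tokenizer = BertTokenizer.from_pretrained("./fine_tuned_bert")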

Generating Sentence Embeddings

We are almost done; let’s see if we can generate sentence embeddings from our fine-tuned model.

from langchain_huggingface import HuggingFaceEmbeddings

hf_embedding = HuggingFaceEmbeddings(model_name="./fine_tuned_bert")

sentence = "I love using BERT for transfer learning." 
# or any sentence you can think of

embedding = hf_embedding.embed_query(sentence)
print("Sentence Embedding:", embedding)

You should get something like this, which is nothing but a vector: a long list of floating-point values.

Sentence Embedding: [0.34589746594429016, 0.36306294798851013, 0.0735308825969696, 
0.17232845723628998, 0.25641950964927673, -0.6112189292907715, 0.18540552258491516, 
0.6184269785881042, 0.028799623250961304, -0.3970279097557068, 0.06879080832004547, 
-0.38036468625068665, 0.2705947756767273, 0.4087296426296234, -0.11404649913311005, ... 
]

This indicates that we have successfully fine-tuned BERT and used it to generate embeddings for our use.

How Does This Example Demonstrate Transfer Learning?

Pre-training: We started with a BERT model pre-trained on data sources like Wikipedia. The model already understands linguistic structures and semantics.

Fine-tuning: We adapted this model by fine-tuning it on our sentiment classification task. The general language knowledge is transferred to a more specific task with a smaller dataset.

Embeddings: After fine-tuning, we generated sentence embeddings that reflect the nuances of the task, such as sentiment. These embeddings can now be used in various downstream applications, such as query matching, search engines, and similarity detection.
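
For example, here is a minimal similarity-detection sketch using the embeddings from above (the sentences and the cosine-similarity measure are illustrative choices, not from the original notebook):

import numpy as np

query = "I love helpful people"
candidate = "I hate phone scams"

# Embed both sentences with the fine-tuned model and compare them
# with cosine similarity.
q_vec = np.array(hf_embedding.embed_query(query))
c_vec = np.array(hf_embedding.embed_query(candidate))

similarity = np.dot(q_vec, c_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(c_vec))
print(f"Cosine similarity: {similarity:.4f}")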

Conclusion

Transfer learning, particularly with models like BERT, is an incredibly powerful tool for solving NLP tasks efficiently and effectively. By leveraging pre-trained models and fine-tuning them for specific tasks, we can drastically reduce training time and improve performance, especially in situations where we have limited data.

The full code for this example is available on GitHub as a Jupyter notebook; feel free to use it as you please.