Comparing Pre-trained vs. Learned Embeddings for NLP Tasks
Our project investigates a fundamental trade-off in Natural Language Processing (NLP): when is it better to use general-purpose, pre-trained word embeddings, and when is it better to train them from scratch for a specific task?
We explored this question by comparing three different approaches on a spam classification task using the Enron-Spam dataset:
- Learned Embeddings: A model where word vectors start as random values and are learned specifically for our spam detection task.
- Pre-trained Embeddings (Fixed): A model that uses GloVe embeddings (vectors trained on a massive, general dataset) but keeps them “frozen” and does not update them during training.
- Pre-trained Embeddings (Fine-Tuned): A model that starts with GloVe embeddings but allows them to be further trained and adjusted (“fine-tuned”) for our specific spam detection task (a minimal code sketch of these three setups follows below).
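As a rough illustration of how the three setups differ, the PyTorch sketch below builds one embedding layer per approach. The vocabulary size, embedding dimension, and the randomly generated `glove_weights` placeholder are assumptions for the example, not the project's actual values; in practice the GloVe vectors would be loaded from a GloVe file for the Enron-Spam vocabulary.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only; the real vocabulary size and
# GloVe dimensionality depend on the Enron-Spam preprocessing pipeline.
VOCAB_SIZE = 20_000
EMBED_DIM = 100

# Placeholder for a (VOCAB_SIZE, EMBED_DIM) matrix of GloVe vectors for the
# task vocabulary; in the real project this would be loaded from a GloVe
# file rather than sampled randomly.
glove_weights = torch.randn(VOCAB_SIZE, EMBED_DIM)

# 1. Learned embeddings: random initialization, updated by the optimizer.
learned_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

# 2. Pre-trained, fixed: GloVe vectors, excluded from gradient updates.
fixed_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)

# 3. Pre-trained, fine-tuned: GloVe vectors as the starting point, but still
#    trainable on the spam detection objective.
finetuned_emb = nn.Embedding.from_pretrained(glove_weights, freeze=False)
```

The only difference between the three models is which of these embedding layers feeds the classifier; everything else in the architecture and training loop stays the same, so any gap in accuracy or training time can be attributed to the embedding choice.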
Our goal was to measure the impact of these choices on two key metrics: model accuracy and training time.
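A minimal way to collect both metrics is sketched below, assuming a binary spam classifier built on one of the embedding layers above and standard PyTorch `DataLoader`s for the train and test splits. The helper name, epoch count, and threshold are illustrative choices for this sketch, not the project's actual code.

```python
import time

import torch
import torch.nn as nn


def train_and_evaluate(model, train_loader, test_loader, epochs=5):
    """Train a binary spam classifier; return test accuracy and wall-clock training time."""
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.BCEWithLogitsLoss()

    # Training time: wall-clock seconds spent in the optimization loop.
    start = time.perf_counter()
    model.train()
    for _ in range(epochs):
        for tokens, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(tokens).squeeze(-1), labels.float())
            loss.backward()
            optimizer.step()
    train_time = time.perf_counter() - start

    # Accuracy: fraction of held-out messages classified correctly.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for tokens, labels in test_loader:
            preds = (torch.sigmoid(model(tokens).squeeze(-1)) > 0.5).long()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total, train_time
```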
We found a clear trade-off among the three methods:
- Fixed Pre-trained Embeddings train quickly but give the lowest accuracy of the three.
- Learned Embeddings reach high accuracy but take longer to train.
- Fine-Tuned Pre-trained Embeddings offer the best of both worlds, combining high accuracy with modest training time.