Comparing Pre-trained vs. Learned Embeddings for NLP Tasks
Our project investigates a fundamental trade-off in Natural Language Processing (NLP): when is it better to use general-purpose, pre-trained word embeddings, and when is it better to train them from scratch for a specific task?
We explored this question by comparing three different approaches on a spam classification task using the Enron-Spam dataset:
- Learned Embeddings: A model where word vectors start as random values and are learned specifically for our spam detection task.
- Pre-trained Embeddings (Fixed): A model that uses GloVe embeddings (vectors trained on a massive, general dataset) but keeps them “frozen” and does not update them during training.
- Pre-trained Embeddings (Fine-Tuned): A model that starts with GloVe embeddings but allows them to be further trained and adjusted (“fine-tuned”) for our specific spam detection task (a minimal code sketch of these three setups follows below).
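As a rough illustration of how the three setups differ, the PyTorch sketch below builds one embedding layer per approach. The vocabulary size, embedding dimension, and the randomly generated `glove_weights` placeholder are assumptions for the example, not the project's actual values; in practice the GloVe vectors would be loaded from a GloVe file for the Enron-Spam vocabulary.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only; the real vocabulary size and
# GloVe dimensionality depend on the Enron-Spam preprocessing pipeline.
VOCAB_SIZE = 20_000
EMBED_DIM = 100

# Placeholder for a (VOCAB_SIZE, EMBED_DIM) matrix of GloVe vectors for the
# task vocabulary; in the real project this would be loaded from a GloVe
# file rather than sampled randomly.
glove_weights = torch.randn(VOCAB_SIZE, EMBED_DIM)

# 1. Learned embeddings: random initialization, updated by the optimizer.
learned_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

# 2. Pre-trained, fixed: GloVe vectors, excluded from gradient updates.
fixed_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)

# 3. Pre-trained, fine-tuned: GloVe vectors as the starting point, but still
#    trainable on the spam detection objective.
finetuned_emb = nn.Embedding.from_pretrained(glove_weights, freeze=False)
```

The only difference between the three models is which of these embedding layers feeds the classifier; everything else in the architecture and training loop stays the same, so any gap in accuracy or training time can be attributed to the embedding choice.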
Our goal was to measure the impact of these choices on two key metrics: model accuracy and training time.
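A minimal way to collect both metrics is sketched below, assuming a binary spam classifier built on one of the embedding layers above and standard PyTorch `DataLoader`s for the train and test splits. The helper name, epoch count, and threshold are illustrative choices for this sketch, not the project's actual code.

```python
import time

import torch
import torch.nn as nn


def train_and_evaluate(model, train_loader, test_loader, epochs=5):
    """Train a binary spam classifier; return test accuracy and wall-clock training time."""
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.BCEWithLogitsLoss()

    # Training time: wall-clock seconds spent in the optimization loop.
    start = time.perf_counter()
    model.train()
    for _ in range(epochs):
        for tokens, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(tokens).squeeze(-1), labels.float())
            loss.backward()
            optimizer.step()
    train_time = time.perf_counter() - start

    # Accuracy: fraction of held-out messages classified correctly.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for tokens, labels in test_loader:
            preds = (torch.sigmoid(model(tokens).squeeze(-1)) > 0.5).long()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total, train_time
```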
We found a clear trade-off among the three methods:
- Fixed Pre-trained Embeddings train quickly but give the lowest accuracy of the three.
- Learned Embeddings reach high accuracy but take longer to train.
- Fine-Tuned Pre-trained Embeddings offer the best of both worlds, combining high accuracy with modest training time.