# 15 - Feature extraction using BERT
After the poor results of Zero Shot Classification on Kaggle’s Feedback Prize competition, I decided to try feature extraction using BERT.
This choice is motivated by the limited compute available to me on Kaggle: fine-tuning BERT would probably lead to better results, but it requires far more resources than I have at my disposal.
I learned that the most effective method of feature extraction is to:
- Use a pre-trained model like BERT to create word embeddings from the text, and
- Extract some subset or combination of the transformer’s internal state.
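As a concrete starting point, here is a minimal sketch of pulling those internal states out with the Hugging Face transformers library. The model name, example sentence, and sequence length here are my own placeholders, not the exact setup from my competition notebook.

```python
# Minimal sketch: extract BERT's hidden states with Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# Placeholder text standing in for one discussion element from the competition data.
text = "In conclusion, school uniforms improve student focus."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors,
# each of shape (batch, seq_len, hidden_size).
hidden_states = outputs.hidden_states
print(len(hidden_states), hidden_states[-1].shape)
```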
The authors of the original BERT paper report Named Entity Recognition results for several different combinations of extracted features, for example summing the last four layers or taking just the second-to-last layer. Hugging Face’s feature-extraction pipeline appears to extract only the final layer.
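To make those layer combinations concrete, here is a rough sketch of two of them, assuming the `hidden_states` tuple from the snippet above. The mean-pooling over tokens is my own choice for getting a single vector per text, not something prescribed by the paper.

```python
# Sketch of two feature combinations from the BERT paper's NER comparison.
import torch

# Sum of the last four encoder layers, mean-pooled over tokens.
last_four = torch.stack(hidden_states[-4:]).sum(dim=0)  # (batch, seq_len, hidden)
sum_last_four = last_four.mean(dim=1)                   # (batch, hidden)

# Final layer only, mean-pooled: roughly what the feature-extraction pipeline
# gives you (it returns per-token vectors; pooling is up to you).
final_layer = hidden_states[-1].mean(dim=1)             # (batch, hidden)
```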
Using this final layer as the extracted features, I trained a simple Random Forest model to predict the text’s effectiveness and was able to double my accuracy over random chance, from 1/3 to about 2/3. However, the model seems to over-select the moderate class “Adequate”, so more work will need to be done on the modeling side.
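For reference, that Random Forest step might look roughly like the sketch below. The feature and label file paths are hypothetical placeholders, and the hyperparameters are defaults rather than the values from my notebook; the per-class report is what makes the over-prediction of “Adequate” visible.

```python
# Hedged sketch: Random Forest on pooled BERT features with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X = np.load("bert_features.npy")         # hypothetical path: (n_samples, 768)
y = np.load("effectiveness_labels.npy")  # hypothetical path: "Ineffective"/"Adequate"/"Effective"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
print(accuracy_score(y_test, preds))
# Per-class precision/recall shows how heavily "Adequate" is being predicted.
print(classification_report(y_test, preds))
```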
I think the next step for me is to see how far I can take the supervised learning side of this before I need to revisit the feature extraction. I plan to try XGBoost, AdaBoost, an SVM, and a simple feed-forward neural network, and see which one shows promise.
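A rough sketch of how that comparison could be wired up with cross-validation is below. The model list and hyperparameters are placeholders rather than tuned choices, and it reuses `X_train` and `y_train` from the Random Forest sketch above.

```python
# Sketch of the planned model comparison via 5-fold cross-validation.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from xgboost import XGBClassifier

# XGBoost expects integer class labels, so encode the string labels first.
y_enc = LabelEncoder().fit_transform(y_train)

models = {
    "xgboost": XGBClassifier(n_estimators=300, eval_metric="mlogloss"),
    "adaboost": AdaBoostClassifier(n_estimators=300),
    "svm": SVC(kernel="rbf", C=1.0),
    "mlp": MLPClassifier(hidden_layer_sizes=(256,), max_iter=500),
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_enc, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```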
# Sources
- A great paper for visualizing BERT, which also goes over how to use BERT for feature extraction.
- A link to my code for today.