16 - Scraping by

Last updated Aug 9, 2022 Edit Source

Daily log

When I was working on Kaggle today, I came across a warning that I was using the pipeline function incorrectly by calling it each time I iterated through the data. Interested in speeding up the training and learning something new, I set about building my own pipeline for feature extraction.

Unfortunately, Hugging Face’s feature extraction pipeline crashed my machine because it tried to allocate too much RAM. The feature-extraction pipeline returns the state of the final transformer.

Using DistilBERT, this is a tensor of size $(n, 768)$ where $n$ is the number of tokens in the input.

So I decided to iterate through my data, only taking the embeddings for the [CLS] token which some developers have used and reported good results on feature extraction tasks. Running this iteration I ran into the error described above, and figured that I could speed up the feature extraction by implementing a pipeline using Hugging Face’s pipeline object.

However, I got stuck because I couldn’t import the Pipeline base object, and something else came up.

# Lack of productivity

I did some research this evening about more productivity hacks for working at home. The past couple of days I’ve noticed the traffic noise more, and it’s becoming more and more difficult to stay focused. I’ve also suffered from poor sleep, and am having trouble staying off of sites like YouTube and listening to podcasts.

I’m implementing some changes tomorrow, like getting dressed for work and adding a Freedom blocklist for my work time.

I'm a freelance software developer located in Denver, Colorado. If you're interested in working together or would just like to say hi you can reach me at me@ this domain.

🐙 Evan Azevedo

16 - Scraping by

# Lack of productivity

Backlinks

Interactive Graph