With all the progress achieved in machine learning in recent years, text classification tasks have received considerable attention and exciting solutions have been developed.
Here at Aboutgoods Company we are currently facing a classical text classification problem. Classical, isn’t it?
The task is to assign an exhaustive category to a line item found on a receipt, mainly from supermarkets.
pineapple juice 1L belongs to the
juice category, and
pineapple extra import to
fresh fruit. Or, if you are looking for a little more of a challenge, what category would you assign to the label
DBL GLOUC? (It would be double gloucester cheese 🧀.)
Sounds a lot like Twitter sentiment analysis. It does, but let’s have a closer look.
- First, we are dealing with over 70 categories, so it is really a multi-class text classification problem.
- Second, the vocabulary of the retail universe is very particular and cannot be fully described by ready-to-use language models.
- And third, the eternal data science obstacle: a very limited and unbalanced dataset. Indeed, in our dataset we are working with 100–4000 labels per category/class. If you think about it, there are not that many ways to describe eggs on a receipt…
Here we will focus on describing our current working solution, although a variety of state-of-the-art approaches were tested.
The architecture of the current model comprises:
- Custom Word2vec model trained from scratch
- CNN (Convolutional Neural Network) trained using Keras framework
Word2vec: why and how.
As I mentioned, the language of retail receipts is very specific. First of all, there are no verbs, very few articles and few adverbs. The retail universe is all about nouns and adjectives. That is why using a pre-trained word embedding model (like FastText) did not work out very well in our case: there are just too many words we do not need.
If you are not familiar with Word2vec concept you can find some great explanations here:
- https://ronxin.github.io/wevi/ A must to understand how w2vec works
In short: Word2vec is a shallow neural network for learning word embeddings from raw text. Its input is a text corpus and its output is a set of vectors: word embeddings. It turns text into a numerical form that a deep neural network can understand and be trained on.
The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space.
Now our interest in using Word2vec is clear: before training a deep neural network on categories, we want a pre-trained vector space of words (represented by word embeddings) that are already connected semantically.
So before attacking our own Word2vec model, we need a clean, varied dataset of labels in order to generate tokens and vector representations that actually make sense.
At Aboutgoods we worked hard to build this dataset, as in Europe there is no official database of labels and products. Don't expect much from Open Food Facts either: the initiative works mainly in French, so it is not very applicable to our multilingual system.
Pre-processing the dataset
Pre-processing of the dataset is tremendously important (as you may already know).
Classical transformations were applied on labels in our dataset:
- Lowercase, unicode transformation
- Stop-words removal (you will never see them on a receipt; every letter counts in a receipt line)
- Replacement of all mass (kg, g, etc.), volume (l, cl, etc.), packaging and percentage information by generic keywords
- Removal of common words that are not specific to any category, like the retailer brand
tesco, for example.
- Tokens (i.e. separate words) are created by splitting the label on the space separator.
“KELLOGG’S BAR.CEREAL 400G” -> [“kellogg’s”, “bar”, “cereal”, “<UNIT_SOLID>”]
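The pipeline above can be sketched roughly as follows. This is an illustrative sketch only: the unit regexes, the stop-word set and the brand list are assumptions for the example, not our production rules.

```python
import re
import unicodedata

STOP_WORDS = {"de", "la", "le"}   # assumed, language-dependent
BRANDS = {"tesco", "carrefour"}   # assumed retailer names

def preprocess(label: str) -> list[str]:
    # Lowercase and strip accents (unicode transformation)
    label = unicodedata.normalize("NFKD", label.lower())
    label = "".join(c for c in label if not unicodedata.combining(c))
    # Replace mass / volume quantities with generic keywords
    label = re.sub(r"\b\d+([.,]\d+)?\s*(kg|g)\b", "<UNIT_SOLID>", label)
    label = re.sub(r"\b\d+([.,]\d+)?\s*(l|cl|ml)\b", "<UNIT_LIQUID>", label)
    # Split on whitespace and dots, then drop stop-words and brand names
    tokens = [t for t in re.split(r"[\s.]+", label) if t]
    return [t for t in tokens if t not in STOP_WORDS and t not in BRANDS]

print(preprocess("KELLOGG'S BAR.CEREAL 400G"))
# -> ["kellogg's", 'bar', 'cereal', '<UNIT_SOLID>']
```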
Next step: training the Word2vec model
Using the gensim library, we trained a skip-gram Word2vec model on over 70k labels. On average we obtain a vocabulary of 12k words.
With the Tensorboard projector we can explore the vector space and play with it, which is really cool. Here are some neighbours of the word
espresso ☕️, for example, in the French Word2vec model.
CNN: training on categories.
Once we have obtained the Word2vec representation of the labels, we can train a neural network (NN) that predicts the category of a given input label. We were inspired by this paper and used the architecture proposed there: https://arxiv.org/pdf/1408.5882v2.pdf
Neural networks are capricious: they accept only numeric input of fixed size. But we have labels that are strings, and we have their word embeddings from the Word2vec model. Luckily for us, the
Keras library has everything necessary to combine all this. It offers an
Embedding layer that can be used in neural networks trained on text data. It can either be initialized with random weights OR
use word embeddings learned elsewhere (sounds like our case 🙏).
We encode labels using the
Tokenizer and, with the
Embedding layer, feed the NN the word embeddings from the Word2vec model. Moreover, we allow the NN to modify the embeddings during training.
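Putting these pieces together, a minimal sketch could look like this. The dimensions, filter size, number of classes and toy labels are assumptions for the example, and a random matrix stands in for the real Word2vec lookups; only the overall wiring (Tokenizer, pre-loaded trainable Embedding, Conv1D as in the paper above) follows the approach described.

```python
import numpy as np
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Conv1D, Dense, Embedding, GlobalMaxPooling1D
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

labels = ["pineapple juice <UNIT_LIQUID>", "kellogg's bar cereal <UNIT_SOLID>"]
max_len = 6         # assumed maximum label length in tokens
embedding_dim = 100
num_classes = 71

# filters="" keeps the generic <UNIT_*> tokens intact
tokenizer = Tokenizer(filters="")
tokenizer.fit_on_texts(labels)
sequences = pad_sequences(tokenizer.texts_to_sequences(labels), maxlen=max_len)
vocab_size = len(tokenizer.word_index) + 1

# Embedding matrix: row i holds the Word2vec vector of word i.
# Here random vectors stand in for w2v_model.wv[word] lookups.
embedding_matrix = np.random.rand(vocab_size, embedding_dim)

model = Sequential([
    Embedding(
        vocab_size,
        embedding_dim,
        embeddings_initializer=initializers.Constant(embedding_matrix),
        trainable=True,  # let the NN fine-tune the embeddings
    ),
    Conv1D(128, 3, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# One probability per category for each label
print(model.predict(sequences).shape)  # (2, 71)
```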
So far we have trained models in 3 languages: French, Spanish and Italian. The average accuracy obtained on the validation dataset is 90%. Moreover, the models show extremely good specificity and sensitivity.
Reminder: the specificity of a model measures the proportion of actual not-this-class cases that are predicted correctly. In our case we can see it as how well a category is distinguished from the others. The average specificity of a class (out of our 71 classes) is 99.5%!
Sensitivity measures the proportion of actual this-class cases that are predicted correctly. If we consider only results with a prediction score above 50%, we achieve an average sensitivity of 85%.
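For the curious, both metrics fall straight out of a confusion matrix. The 3-class matrix below is a made-up toy example, not our model's results:

```python
import numpy as np

# Toy confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [50,  2,  3],
    [ 4, 40,  1],
    [ 2,  3, 45],
])

sensitivity, specificity = {}, {}
for c in range(cm.shape[0]):
    tp = cm[c, c]                 # this-class items predicted as c
    fn = cm[c].sum() - tp         # this-class items predicted elsewhere
    fp = cm[:, c].sum() - tp      # other-class items predicted as c
    tn = cm.sum() - tp - fn - fp  # other-class items predicted elsewhere
    sensitivity[c] = tp / (tp + fn)  # recall for class c
    specificity[c] = tn / (tn + fp)  # how well c is kept apart from the rest
    print(f"class {c}: sensitivity={sensitivity[c]:.2f}, "
          f"specificity={specificity[c]:.2f}")
```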
Some classification examples for French labels:
If you have difficulties in understanding these labels, don’t worry! Sometimes we have them too 😎.
Thanks for reading, and share with us your experience with short-text classification tasks!