Abstract: Understanding the content of an image is a reasonably easy task, at least for us humans. For computers, however, it’s a very different story. At Booking.com we have a huge collection (>150M) of hotel and user-generated images in our database. As images are a very important source of information when selecting a hotel, we would like to understand the content of the images to help our customers make better-informed decisions. We can do this by serving the images in the right context for each individual user, helping create a personalised website and a unique overall experience. But it’s no mean feat.
Tagging images (image classification) has long attracted interest from the research community. Conventional techniques have focused on creating densely populated, hand-crafted, low-level (pixel-based) image descriptors, followed by middle- or higher-level decisions made with different classification techniques. The last couple of years have seen impactful changes in the area: deep convolutional neural networks (CNNs) are now used effectively in many image-related tasks, where they outperform the conventional techniques. We’ve therefore taken that approach to create an automated image tagging solution at Booking.com. There are some commercial solutions for the task that could automatically analyse and generate tags for each image in our database. However, we have a unique corpus with special needs that require domain expertise to explore the value of specific image tags, so we needed to create our own solution with a long list of photo tags for our internal image classification.
2. Convolutional Neural Networks (CNN)
Neural networks have been widely studied, mostly as a supervised learning technique in which a network of artificial neurons is taught to adjust its weights according to given ground-truth data. The adjustment is enforced by showing the network true examples of the task and forcing the model to give accurate predictions for these known examples. Given enough examples, the network adjusts its weights in such a manner that it can generalise to new unlabelled samples and produce the required output with the expected accuracy.
Convolutional neural networks were introduced in the 1990s. Despite some remarkable early results in the optical character recognition task (LeCun et al., 1990), wider adoption was not possible at the time, mostly because of their compute-intensive requirements and because the results did not extend to other visual tasks. Through recent improvements in computer hardware, and especially graphical processing units (GPUs), it’s now possible to use larger networks and analyse bigger datasets under realistic time constraints (days to weeks). This has brought drastic improvements on many vision tasks and still attracts attention from both academia and industry. For a detailed explanation of CNNs and how they work, this tutorial is worth watching.
2.1. CNN Architecture
There are various deep network architectures proposed in the literature, with different network depths, widths and sizes of convolutional kernels. More complex architectures have brought some additional performance improvements for vision tasks. In our work, we compared different architectures with respect to the performance-computation tradeoff. GoogLeNet (Szegedy et al., 2015) and its successor Inception-v3 (Szegedy et al., 2016) have been the two main architectures we’ve conducted our experiments on. After many training runs and rounds of hyperparameter tuning, Inception-v3 has consistently shown a slight (~1-2%) improvement in top-5 accuracy at the cost of an increased computational load (~4 times).
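To make the top-5 accuracy metric concrete, here is a minimal sketch of top-k accuracy in plain Python. The scores, class indices and the `top_k_accuracy` name are made up for illustration and are not taken from our evaluation code.

```python
def top_k_accuracy(scores, true_labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, true_label in zip(scores, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Three hypothetical images scored over four hypothetical classes.
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],  # true class 1 is ranked second
    [0.70, 0.10, 0.15, 0.05],  # true class 3 is ranked last
]
labels = [1, 1, 3]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

In this toy example one of three images is a top-1 hit and two of three are top-2 hits, which shows why top-k numbers are always at least as high as top-1 numbers.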
2.2. Transfer Learning
Transfer learning is a machine learning concept where a model is trained with supervision on domain A, and what it has learned is then applied to a different (but similar) domain B. This is a common and very useful approach in cases where we have a good amount of labelled samples on domain A, but not as many on domain B. In our context, the labelled data came from two different sources: the first was a third-party company, and the other was hoteliers tagging their own photos. From both sources we were able to get more than 10M tagged images with different levels of noise in the tags. Therefore, we decided to train a network from scratch and, at the same time, compare its performance with the transfer learning approach, where we used a network pretrained on a different dataset (IMAGENET) for a similar image classification task.
IMAGENET is a crowdsourced dataset, collected over many years from many different locations around the world (see Krizhevsky et al., 2012). There are over 10M images tagged with 1,000 image labels. Though the task is similar, the nature of the images is different from our hotel corpus. The images in the IMAGENET dataset mostly contain a centrally located object or an animal, and only a single tag per image is supplied. By contrast, our hotel images contain randomly located objects and could easily carry several tags per image (sea-view, bed, TV, balcony, seating area, etc.).
During transfer learning, the weights of the pre-trained network are used to initialize the network (as compared to random initialisation during scratch training) and further tuned with our internal images and with the new tags that we created.
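The difference between the two initialisation strategies can be sketched as follows. This is a toy illustration in plain Python rather than our TensorFlow code: the layer names, the layer sizes, the figure of 40 new tags and the assumption that only the final classifier layer is re-initialised for the new tag set are all hypothetical.

```python
import random


def init_layer(n_in, n_out, rng):
    """Randomly initialised weight matrix (n_in x n_out), as in training from scratch."""
    return [[rng.gauss(0.0, 0.01) for _ in range(n_out)] for _ in range(n_in)]


def build_for_fine_tuning(pretrained, n_new_classes, rng):
    """Copy all pretrained layers, but re-initialise the classifier for the new tags."""
    model = {}
    for name, weights in pretrained.items():
        if name == "classifier":
            # The old output layer predicts the old label set, so it is replaced.
            model[name] = init_layer(len(weights), n_new_classes, rng)
        else:
            # Feature layers start from the pretrained weights (copied, not shared).
            model[name] = [row[:] for row in weights]
    return model


rng = random.Random(0)
# Stand-in for a network pretrained on 1,000 labels.
pretrained = {
    "conv_features": init_layer(8, 16, rng),
    "classifier": init_layer(16, 1000, rng),
}
model = build_for_fine_tuning(pretrained, n_new_classes=40, rng=rng)
```

After this step, all layers of `model` would be trained further on the new images, so the copied weights act as a starting point rather than as frozen features.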
2.3. Performance Evaluation
Performance evaluation is an important part of any machine learning task. For this task, we created a dataset that the network had not seen before. The ground truth labels of the dataset were generated by our internal employees: each photo was shown to two different people and was accepted only if they agreed on the tag. The collected dataset was used to compare model performance. In total, we trained more than 50 models (through hyperparameter search), evaluating them using precision and recall metrics. What we expect from the image tags:
- Be accurate - If we tag one image as A, it should be correct (Precision)
- Bring traffic - If there is an image with tag A, we should not miss it (Recall)
Since we can increase one metric at the cost of harming the other, we selected a sweet spot per category, favoring either recall (more traffic for experiments) or precision (we certainly don’t want to confuse a bedroom with a swimming pool). We also provided a confidence score for every tag, which indicates how confident the network is in the predicted tag. This way, we could select different levels of confidence for an experiment.
Looking at the results per tag, we can see that the final model is very good on some classes but performs poorly on others. Classes like Floor plan, Bathroom, Bed, Swimming pool, etc. seem to be very accurate, whereas some classes are confused with each other a lot. Some examples of the confused classes: Sea view -> Balcony/Terrace, Lake view -> Sea view, River view -> Sea view, Breakfast -> Lunch/Dinner.
Even if it sounds trivial, it’s actually very tricky to assign a label to an image. Think about the garden tag: where should you start calling it a garden? Is some grass on the ground enough? Or do you need some plants? If so, how many? We saw a lot of ambiguity (disagreement) in assigning some of the classes, and that was a hard challenge to address. The confidence scores supplied for each tag are supposed to help with those confusions and should be used in experiment hypothesis creation.
In order to make this computationally intensive task feasible, we used GPUs; though even with high-performing GPUs it still takes up to 12 hours to fine-tune a network. When we tried to train from scratch, it took around 48 hours for the training loss to converge. All the experiments have been done with TensorFlow, and we used TensorBoard for visualizing loss and accuracy during training.
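The trade-off between the two metrics can be illustrated with a small sketch that computes precision and recall for a single tag at different confidence thresholds. The confidence values and ground-truth flags below are invented, and `precision_recall` is a hypothetical helper, not part of our pipeline.

```python
def precision_recall(predictions, threshold):
    """predictions: (confidence, truly_has_tag) pairs for one tag.

    An image is tagged when its confidence reaches the threshold.
    """
    tp = sum(1 for conf, truth in predictions if conf >= threshold and truth)
    fp = sum(1 for conf, truth in predictions if conf >= threshold and not truth)
    fn = sum(1 for conf, truth in predictions if conf < threshold and truth)
    # By convention, report 1.0 when there is nothing to be wrong about.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical network confidences for the tag "swimming pool".
preds = [(0.95, True), (0.90, True), (0.80, False),
         (0.60, True), (0.40, False), (0.30, True)]

strict = precision_recall(preds, threshold=0.85)  # favours precision
loose = precision_recall(preds, threshold=0.50)   # favours recall
```

With these made-up numbers the strict threshold gives a precision of 1.0 with a recall of 0.5, while the loose threshold gives 0.75 for both, which is the kind of sweet-spot choice described above.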
3. Challenges & Learnings
The main objective of this project is to create an automated image tagging pipeline with high accuracy, and to implement this solution at scale so that it covers all existing and incoming images. In order to improve the accuracy, we have done extensive experiments with some of the well-known techniques.
3.1. Data augmentation
Deep networks are known to be data-hungry. They keep benefiting from more and more data, which means that we need a lot of labelled data as our training dataset. However, getting high volumes of high-quality data is always costly and sometimes not even possible. Therefore, data augmentation is used to obtain a higher number of labelled data samples. Data augmentation refers to slightly distorting an image and using it multiple times with the same label during the training process; this also lets us get the most value from a single image. Some examples of commonly used distortions are mirroring, random cropping, affine transformation, aspect ratio distortion, color manipulation, contrast enhancement, etc. In the data augmentation phase, we randomly selected one or more of the listed distortions to increase the labelled data by 10 times. The distortions are applied in a preprocessing pipeline, on the fly during training, so no additional image is saved during the process.
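A minimal sketch of such an on-the-fly pipeline is shown below, using plain Python lists as stand-ins for grayscale images. The particular distortions, their probabilities, the contrast range and the crop size are assumptions chosen for illustration; they are not the settings we used.

```python
import random


def mirror(image):
    """Flip the image horizontally."""
    return [row[::-1] for row in image]


def random_crop(image, rng, crop_h, crop_w):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def adjust_contrast(image, factor):
    """Scale pixel distances from the mean intensity, clipped to 0..255."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[min(255, max(0, round(mean + factor * (p - mean)))) for p in row]
            for row in image]


def augment(image, rng):
    """Apply a random subset of distortions, then always take a random crop."""
    out = image
    if rng.random() < 0.5:
        out = mirror(out)
    if rng.random() < 0.5:
        out = adjust_contrast(out, rng.uniform(0.8, 1.2))
    return random_crop(out, rng, crop_h=3, crop_w=3)


def training_batches(dataset, epochs, seed=0):
    """Yield a freshly distorted copy of every image in every epoch.

    Nothing is written to disk; the label is reused unchanged.
    """
    rng = random.Random(seed)
    for _ in range(epochs):
        for image, label in dataset:
            yield augment(image, rng), label


# One hypothetical 4x4 grayscale image, seen 10 times with different distortions.
image = [[10 * (4 * r + c) for c in range(4)] for r in range(4)]
batches = list(training_batches([(image, "bedroom")], epochs=10))
```

Running the generator for 10 epochs yields 10 distorted variants of the single source image, all carrying the original label.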
3.2. Image label hierarchy
Labelling images is a hard task, but even before that, deciding which labels to collect is even harder. We approached this issue with two things in mind. First, the business value: is there a known or potential use case for detecting an object in an image? This is a hard question, and we discussed it with many experienced stakeholders to get an answer. Second, we tried to pick labels that are realistic to detect with an accuracy reliable enough to be used on the website. While doing that, we also constructed a visual hierarchy in the image labels. This can eventually give us more confidence in the precision of the final labels. The motivation is to be able to say confidently that an image belongs to the water view class even if we are not very sure whether it is a river view or a lake view. Similarly, if we are not sure about the exact type of the room (lobby or restaurant), we can at least say that it is an interior image.
The literature on image label hierarchies features two main approaches. In the first, the hierarchy is imposed as a post-processing step during inference, trading off specificity (how deep in the hierarchy the predicted label sits) against accuracy to maximise the information gain, as described in Deng et al. (2012). This technique can be applied without retraining the model and is hence easy to implement. Alternatively, it is possible to incorporate this information during model training by using a multinomial logistic loss and a novel temporal sparsity penalty, as described in Yan et al. (2014). This technique has the advantage of learning the hierarchy during training. A third technique, which imposes a graph structure to explore the flexible relations between image labels, is also worth checking (Deng et al., 2014).
Because it is simpler to implement, we tried the first approach and observed an average improvement of 1% in top-1 accuracy for both GoogLeNet and Inception-v3.
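The idea of falling back to a parent label at inference time can be sketched as below. This is a deliberately simplified stand-in, not the algorithm from the cited paper: it assumes a two-level hierarchy, mutually exclusive leaf probabilities that can be summed per parent, and a single hypothetical confidence threshold.

```python
# Hypothetical two-level hierarchy: leaf label -> parent label.
PARENT = {
    "river view": "water view",
    "lake view": "water view",
    "sea view": "water view",
    "lobby": "interior",
    "restaurant": "interior",
}


def predict_with_backoff(leaf_probs, parent, min_confidence):
    """Return the most specific label that reaches min_confidence, else None."""
    best_leaf = max(leaf_probs, key=leaf_probs.get)
    if leaf_probs[best_leaf] >= min_confidence:
        return best_leaf
    # Leaf probabilities are assumed mutually exclusive, so they add up per parent.
    parent_probs = {}
    for leaf, p in leaf_probs.items():
        parent_probs[parent[leaf]] = parent_probs.get(parent[leaf], 0.0) + p
    best_parent = max(parent_probs, key=parent_probs.get)
    if parent_probs[best_parent] >= min_confidence:
        return best_parent
    return None


unsure_which_water = {"river view": 0.40, "lake view": 0.35, "sea view": 0.15,
                      "lobby": 0.06, "restaurant": 0.04}
clearly_river = {"river view": 0.90, "lake view": 0.05, "sea view": 0.03,
                 "lobby": 0.01, "restaurant": 0.01}

coarse = predict_with_backoff(unsure_which_water, PARENT, min_confidence=0.8)
fine = predict_with_backoff(clearly_river, PARENT, min_confidence=0.8)
```

Here the first image is labelled "water view" because no single leaf is confident enough, while the second keeps the more specific "river view" label. No retraining is involved, which is what makes this family of approaches easy to try.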
3.3. Multi labelled data
The images in our dataset naturally belong to multiple classes. A room photo could easily belong to classes like interior, bedroom, bed, city-view, TV, etc. In order to account for this, we tried two main approaches.
The first approach is to enable soft labels for the images, where we use floating point labels instead of binary ones. This idea was introduced by Hinton et al. (2014) with the aim of propagating more information into the network. The motivation is to let the network know about the similarities of different labels. For example, confusing a bird with a chicken is less wrong than confusing the bird with a car. We accounted for this by imposing a non-zero fixed prior probability on the other classes during training.
The second approach is to resample the labels using co-occurrence statistics. This can be seen as a data augmentation step where the non-zero priors are obtained by observing the co-occurrences of the classes in the validation set.
Neither approach brought any additional increase in the final top-1 and top-5 accuracy. Our hypothesis is that the co-occurrence labels on the validation set were too noisy to be effective.
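As a rough illustration of a fixed non-zero prior, the sketch below builds a soft target vector and compares the cross-entropy loss against a hard (one-hot) target. Spreading the prior uniformly over the other classes and the value 0.1 are assumptions made for the example.

```python
import math


def soft_label(true_index, n_classes, prior=0.1):
    """Put 1 - prior on the true class and spread prior evenly over the rest."""
    off_value = prior / (n_classes - 1)
    return [1.0 - prior if i == true_index else off_value
            for i in range(n_classes)]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


hard = [0.0, 1.0, 0.0, 0.0, 0.0]
soft = soft_label(true_index=1, n_classes=5, prior=0.1)

# Hypothetical network output for one image.
predicted = [0.10, 0.70, 0.10, 0.05, 0.05]
hard_loss = cross_entropy(hard, predicted)
soft_loss = cross_entropy(soft, predicted)
```

With a hard target only the probability of the true class contributes to the loss; with the soft target every class contributes a little, so the network also receives a training signal for the non-true classes.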
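One plausible reading of this resampling step is sketched below: conditional co-occurrence frequencies are estimated from a multi-tag set, and extra labels are then drawn for a single-tag training image. The tag sets are invented, and the exact resampling rule is an assumption, since it is only described at a high level here.

```python
import random
from collections import Counter
from itertools import permutations


def cooccurrence_priors(tag_sets):
    """Estimate P(tag b present | tag a present) from multi-tag annotations."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in tag_sets:
        tag_counts.update(tags)
        # Ordered pairs (a, b) with a != b, so both directions are counted.
        pair_counts.update(permutations(sorted(tags), 2))
    return {(a, b): n / tag_counts[a] for (a, b), n in pair_counts.items()}


def resample_labels(primary_tag, priors, rng):
    """Keep the primary tag and add each co-occurring tag with its prior probability."""
    labels = {primary_tag}
    for (a, b), p in priors.items():
        if a == primary_tag and rng.random() < p:
            labels.add(b)
    return labels


# Hypothetical multi-tag annotations from a validation set.
validation_tags = [
    {"bedroom", "bed"},
    {"bedroom", "bed", "TV"},
    {"bedroom", "city view"},
    {"bathroom"},
]
priors = cooccurrence_priors(validation_tags)
sampled = resample_labels("bed", priors, random.Random(0))
```

In this toy data every image tagged "bed" is also tagged "bedroom", so "bedroom" is always added when resampling a "bed" image, while "TV" is added only some of the time.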
3.4. Class imbalance
Another challenge we faced during the project was the highly imbalanced classes. Hotel images mostly contain bedroom, bathroom and lobby photos. Moreover, hoteliers like to post photos of the facilities they provide. Sauna, fitness, kids’ facilities, table tennis and golf are some of the facilities a hotel can have. Naturally, the numbers of photos we have for bedrooms and for table tennis are highly out of proportion. This is a challenge for the network, since it picks up the prior distribution of the classes as a bias and hence cannot be “fair” to different classes of photos. In order to account for that, we over-sampled the under-represented classes, which brought a ~0.5% increase in top-1 and top-5 accuracy.
3.5. Stochastic optimization
Gradient descent is a technique to minimize a cost function by updating the network parameters in the direction opposite to the gradient. Different optimization techniques have been proposed in the literature, and most of them are already implemented in the common deep learning toolboxes. In order to understand and benefit from different optimization techniques, we experimented with the momentum SGD (Qian, 1999), Adam and RMSProp (Hinton et al., 2012) optimizers. Given enough iterations, there was no significant difference in the final loss among the three techniques. As a side note, decreasing the learning rate every 10K iterations proved valuable, increasing the final top-1 accuracy by 0.7%.
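Over-sampling can be sketched as repeating randomly chosen samples from the rare classes until the class sizes match. Balancing every class up to the size of the largest one is an assumption made for this example; the sample names are placeholders.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, rng):
    """Repeat samples of rare classes until every class matches the largest one."""
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append((item, label))
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced training list: 6 bedroom photos, 2 table tennis photos.
samples = [(f"bedroom_{i}.jpg", "bedroom") for i in range(6)]
samples += [(f"table_tennis_{i}.jpg", "table tennis") for i in range(2)]

balanced = oversample(samples, random.Random(0))
counts = Counter(label for _, label in balanced)
```

After over-sampling, both classes appear six times, so the network no longer sees bedrooms three times as often as table tennis tables.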
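The sketch below combines a momentum SGD update with a step-wise learning rate decay on a one-dimensional toy cost function. The decay factor of 0.5, the base learning rate and the momentum value are illustrative assumptions; only the 10K-iteration schedule comes from the text above, and the toy loop uses a much shorter step so that it finishes quickly.

```python
def step_decay(base_lr, iteration, step=10_000, factor=0.5):
    """Multiply the learning rate by `factor` after every `step` iterations."""
    return base_lr * factor ** (iteration // step)


def momentum_sgd_step(w, velocity, grad, lr, momentum=0.9):
    """One momentum SGD update: move against the gradient, keeping past velocity."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, v = 0.0, 0.0
for it in range(200):
    grad = 2.0 * (w - 3.0)
    lr = step_decay(0.1, it, step=50, factor=0.5)  # short step for the toy loop
    w, v = momentum_sgd_step(w, v, grad, lr)
```

After the loop, `w` ends up close to the minimiser 3. The larger early learning rate makes fast initial progress, and the smaller later rates let the parameters settle.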
4. Application Examples
4.1. Personalization with images
The whole aim of creating image tags is to be able to personalise the website and give our customers a unique experience throughout the funnel. We hypothesise that by serving the right image in the right context for each individual user, we can create this personalised experience, helping our customers make even more informed decisions. To personalise the experience, we need to understand the intent of the customer. We can do this using customer behaviour: if filters are used (a hotel with breakfast included, a beachfront hotel, etc.), we can show the relevant photo for that search. As an extended goal, we are investigating ways to obtain the optimal photo ranking per user within the given context.
4.2. Food World Map
We encourage our guests to upload their holiday photos after their stay. This way we can help our customers with real images from the previous guests. In this hackathon project we accumulated the photos that were tagged as food, and pinned them on the map for a nice visualisation. This way, we can identify spatial point clusters and suggest areas where travellers can experience different types of cuisine during their stay.
4.3. Unsupervised clustering of image embeddings
Deep neural networks have proven to be successful for supervised classification tasks. The improved accuracies are possible because different layers in the network hierarchy respond to specific levels of detail. Lower layers learn to respond to low-level characteristics like edges and corners, while the higher layers learn to respond to object parts; finally, the highest layer responds when we show a sample image containing one of the classes in the training set. The final decision layer has been used in our project to assign tags to the images. But when we remove the final layer, we end up with the raw activations of the highest-layer neurons, and these activations are a very good representation (embedding) of an image. When we perform unsupervised clustering on this high-dimensional embedding and project it onto a 2D surface, we get the kind of visualisations seen below.
Images with similar semantic representations are grouped together, such as floor plan, bathroom or food photos. This is achieved using the well-known t-SNE dimensionality reduction technique (van der Maaten and Hinton, 2008).
In this project we aimed to automate and scale the image tagging pipeline and make it consistent and accurate. The current system has been implemented in the Booking.com infrastructure, and all existing and incoming images are analysed. As a future project, we intend to collect feedback on the generated image tags and use it to retrain the network and improve accuracy.
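The sketch below does not reproduce t-SNE; it only illustrates the underlying property that embeddings of semantically similar images lie close together, using cosine similarity for a nearest-neighbour lookup. The four-dimensional vectors and image names are invented stand-ins for real, much higher-dimensional activations.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(query_name, embeddings):
    """Name of the image whose embedding is most similar to the query image."""
    query = embeddings[query_name]
    others = (name for name in embeddings if name != query_name)
    return max(others, key=lambda name: cosine_similarity(query, embeddings[name]))


# Hypothetical activations taken from the layer below the decision layer.
embeddings = {
    "bathroom_1": [0.9, 0.1, 0.0, 0.2],
    "bathroom_2": [0.8, 0.2, 0.1, 0.1],
    "food_1": [0.1, 0.9, 0.3, 0.0],
    "food_2": [0.0, 0.8, 0.4, 0.1],
    "floor_plan_1": [0.1, 0.0, 0.1, 0.9],
}

bathroom_match = nearest("bathroom_1", embeddings)
food_match = nearest("food_2", embeddings)
```

In this toy data each bathroom photo retrieves the other bathroom photo and each food photo retrieves the other food photo, which is the same grouping behaviour that shows up as clusters in a 2D projection.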
 LeCun, Yann, et al. "Handwritten digit recognition with a back-propagation network." Advances in neural information processing systems. 1990.
 Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
 Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
 Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
 Deng, Jia, et al. "Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition." CVPR 2012.
 Yan, Zhicheng, et al. "HD-CNN: Hierarchical deep convolutional neural network for large scale visual recognition." CVPR 2014.
 Deng, Jia, et al. "Large-scale object classification using label relation graphs." European Conference on Computer Vision. Springer, Cham, 2014.
 Hinton, Geoffrey, et al. "Distilling the knowledge in a neural network." NIPS 2014 workshop.
 Qian, Ning. "On the momentum term in gradient descent learning algorithms." Neural Networks: The Official Journal of the International Neural Network Society. 1999.
 Hinton, Geoffrey, Nitish Srivastava, and Kevin Swersky. "Lecture 6a overview of mini-batch gradient descent." Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, Online (2012).
 Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.Nov (2008): 2579-2605.