Monday, July 11, 2016

Self coloring books

Machine learning is eating software. Here at comma.ai we want to build the best machine learning. This makes us all work really hard and sometimes need some stress relief. Our art therapist suggested us to try adult coloring books to relax. They worked so well for us that we decided to share the love with the world and built commacoloring, comma.ai adult coloring books .

commacoloring was really well received and made it to the front page of Product Hunt. We got a lot of feedback from our users (we love users!). A feature was requested to automatically color the easy parts of the image, letting the user focus in the details. We used our self-driving car engineering skills to build a self-coloring book.

We call this new feature Suggestions. You can try right now by clicking the "suggest" button!

The engineering

Note: you can skip that section without affecting your coloring experience, but if you are familiar with deep learning jargon, please read along.

To automate the coloring process we trained a deep neural network for pixel level semantic parsing, i.e a network that will classify (color) each pixel using information of its surroundings. Given the state of the art, we knew the right approach would be a fully convolutional neural network. We started by trying an encoder-decoder like architecture with 4 convolutions down and 4 deconvolutions up [1], with one output channel per class. This was taking too long to converge though.

We later noticed that [2] claims that retraining the encoder network is not really necessary. They used a pre-trained VGG for dense classification in low resolution and bilinear interpolation followed by Conditional Random Fields for upscaling the image back to its desired size. Also [3] stated that the job of the decoder/deconvolution network is to mainly upscale and smooth the segmented output image and it can be a smaller network. Reddit brought our attention to ReSeg [4] that uses only the convolutional layers of VGG as the encoder.

Our final solution combined ideas from [3] and [4] and used fixed VGG convolutional layers as the encoder and trained a simple deconvolutional network as the decoder. Each layer of our decoder used only 16 filters of 5x5 pixels with upscaling stride of 2. We tried faster upscaling with stride 4 but the results didn't look sharp enough.

In one of our experiments we reinitilized the VGG weights to random values and were still able to learn a successful decoder. We called this architecture Extreme Segmentation Network, since it resembles Extreme Learning Machines. Unfortunately, we were aware that the acronym would compete with Echo-State Networks' and we decided to use the original VGG filters in production. Our final network is called Suggestions Network (SugNet). Some results are shown in Figure 1 and 2.


Figure 1. Input image and self colored Suggestions example.


Figure 2. Sample outputs of the segmentation network after 400 training epochs compared to human colored images.

All our method was implemented with Keras using Tensorflow backend. The VGG image preprocessing used Theano backend. At test time, using Tensorflow only the results didn't match and we doubted our engineering skills for a while before remembering that Theano implements correlation instead of convolution. Here is how to convert convolutional wieghts from Theano to Tensorflow. Keras didn't have a proper deconvolution layer, but we started working on a PR for that.

References:  
[1] Vijay Badrinarayanan, Ankur Handa and Roberto Cipolla "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling". arXiv:1505.07293   
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs". arXiv:1412.7062  
[3] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation". arXiv:1606.02147.
[4] Francesco Visin, Marco Ciccone, Adriana Romero, Kyle Kastner, Kyunghyun Cho, Yoshua Bengio, Matteo Matteucci, Aaron Courville "ReSeg: A Recurrent Neural Network-based Model for Semantic Segmentation". arXiv:1511.07053.

We hope that Suggestions will inspire you to build even more fun apps with the open source commacoloring product. Let us know about all the amazing things you build with it.

Tuesday, July 5, 2016