Street view house numbers are widely used in everyday applications such as GPS
and navigation systems. Predicting house numbers from images captured in natural
scenes is not easy, because the images can be noisy or blurry.
Researchers have applied a variety of techniques to classify data from image
files. Some of these experiments achieved acceptable results, while others had
limitations. In this paper, I explore a model built with a convolutional
neural network (CNN), a popular modern technique, for street view house number
recognition. Convolutional networks are useful for extracting features from
images and text, as well as for image and text classification. The experiment
shows that this model gives accurate results by extracting features from the
raw files and recognizing them. I focused on accuracy and loss, and tried to
reach the highest accuracy by adding features and changing different values.
The final model reached 92% accuracy with a small loss.

Recognizing characters and digits in documents such as photographs captured
at street level is an important factor in developing digital maps. For
example, Google Street View includes millions of geo-located images. By
recognizing the house numbers in these images, we can build a more precise map,
which can improve navigation services.
Although ordinary character classification is already solved in computer
vision, recognizing digits or characters in natural scenes such as photographs
is still a complex problem.
The reasons behind this difficulty include non-contrasting backgrounds, low
resolution, blurred images, font variation, and lighting. The traditional
approach to this task was a two-step process: first slice the image to isolate
each character, then perform recognition on each extracted image. This used to
be done with multiple hand-crafted features and template matching [1]. The main
purpose of this project is to recognize street view house numbers using a deep
convolutional neural network. For this work, I used the digit classification
dataset of house numbers extracted from street-level images [5]. This dataset
is similar in flavor to the MNIST dataset but has more labeled data: more than
600,000 digit images that contain color information and various natural
backgrounds [5]. To achieve the goal, I developed an application that detects
the numbers in images.
A convolutional neural network model with multiple layers is trained on the
dataset and used to detect the house number digits. I used a traditional
convolutional architecture with different pooling methods and multistage
features, and finally reached almost 92% accuracy.

Street view number detection is a natural scene text recognition problem, which
is quite different from printed character or handwriting recognition. Research
in this field started in the 1990s, but it is still considered an unsolved
problem. As mentioned earlier, the difficulties arise from variation in fonts,
scale, and rotation, from low light, and so on. In earlier years, natural scene
text identification was handled sequentially: character classification by
sliding window or connected components was mainly used first [4], and word
prediction was then done by applying the character classifier from left to
right. More recently, segmentation methods guided by a supervised classifier
have been used, in which words are recognized through a sequential beam search
[4]. However, none of these approaches solves the street view recognition
problem. In recent work, convolutional neural networks have proved more
accurate at object recognition tasks [4], and some research has applied CNNs to
scene text recognition [4]. Studies on CNNs show a strong capability to
represent all types of character variation in natural scenes, and they continue
to handle this high variability. Work on convolutional neural networks started
in the early 1980s, and they were successfully applied to handwritten digit
recognition in the 1990s [4]. With recent developments in computing resources,
training sets, advanced algorithms, and dropout training, deep convolutional
neural networks have become more efficient at recognizing natural scene digits
and characters [3]. Previously, CNNs were used mainly to detect a single object
in an input image, and it was quite difficult to isolate each character in a
single image and identify it. Goodfellow et al. solved this problem by applying
a large deep CNN directly to the whole image, with a simple graphical model as
the top inference layer [4].

The rest of the paper is organized as follows: section III describes the
convolutional neural network architecture, section IV presents the experiment,
results, and discussion, and section V covers future work and the conclusion.

A convolutional neural network (CNN) is a multilayer network designed to handle
complex, high-dimensional data; its overall architecture is the same as that of
a typical neural network [8]. Each layer contains neurons that carry weights and
biases. The network takes images as input and passes them forward through its
layers, an arrangement that reduces the number of parameters in the network
[7]. The first layer is a convolutional layer.
Here the input is convolved with a set of filters to extract features from the
input. The size of the feature maps depends on three parameters: the number of
filters, the stride, and the padding. Each convolutional layer is followed by a
non-linear operation, ReLU, which sets all negative values to zero. Next comes
a pooling (sub-sampling) layer, which reduces the size of the feature maps.
Pooling can be of different types, such as max, average, or sum, but max
pooling is generally used. Down-sampling also helps control overfitting. The
output of the pooling layers forms the feature extractor, which retrieves
selected features from the input images. These features are then passed to the
fully connected layers (FCL) and the output layer. In a CNN, the output of each
layer is the input of the next. The CNN architecture differs depending on the
type of problem.

The main objective of this project is to detect and identify house-number signs
in street view images. The dataset used for this project is the Street View
House Numbers (SVHN) dataset, taken from [5], which has similarities with the
MNIST dataset. The SVHN dataset has more than 600,000 labeled characters, and
the images are in .png format. After extracting the dataset, I resized all
images to 32×32 pixels with three color channels. There are 10 classes, one for
each digit: digit ‘1’ is labeled 1, ‘9’ is labeled 9, and ‘0’ is labeled 10
[5]. The dataset is divided into three subsets: a train set, a test set, and an
extra set. The extra set is the largest, with 531,131 images; the train set has
73,257 images and the test set has 26,032. Figure 3 shows an example of the
original, variable-resolution, color house-number images, in which each digit
is marked by a bounding box. The bounding box information is stored in the
digitStruct.mat file rather than drawn directly on the images in the dataset.
The digitStruct.mat file contains a struct called digitStruct with the same
length as the number of original images. Each element of digitStruct has the
following fields: “name”, a string containing the filename of the corresponding
image, and “bbox”, a struct array that contains the position, size, and label
of each digit bounding box in the image. For example,
digitStruct(100).bbox(1).height is the height of the 1st digit bounding box in
the 100th image [5]. Figure 3 makes clear that most house-number signs in the
SVHN dataset are printed signs and are easy to read [2].
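As a rough illustration of the annotation layout described above, the sketch
below mirrors one digitStruct element as a plain Python dictionary and maps the
label 10 back to the digit 0. The bounding-box key names (`height`, `left`,
`top`, `width`, `label`), the file name, the numeric values, and the helper
`labels_to_digits` are assumptions made for this example; nothing here is read
from digitStruct.mat.

```python
# Hypothetical in-memory copy of one digitStruct element (values are made up).
record = {
    "name": "100.png",
    "bbox": [
        {"height": 30, "left": 12, "top": 5, "width": 14, "label": 2},
        {"height": 31, "left": 27, "top": 6, "width": 15, "label": 10},
    ],
}


def labels_to_digits(bboxes):
    """Map SVHN labels to digits: labels 1-9 stay as they are, 10 becomes 0."""
    return [box["label"] % 10 for box in bboxes]


# Python counterpart of digitStruct(100).bbox(1).height (MATLAB indexes from 1).
first_height = record["bbox"][0]["height"]

# Join the per-digit labels into the house number shown on the sign.
house_number = "".join(str(d) for d in labels_to_digits(record["bbox"]))
print(first_height, house_number)
```

Remapping 10 to 0 is optional; it is convenient when the class index should
equal the digit value.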
The large variation in font, size, and color nevertheless makes detection very
difficult. The variation in resolution is also large (median: 28 pixels, max:
403 pixels, min: 9 pixels) [2]. The graph below shows the large variation in
character heights, measured by the height of the bounding box, in the original
street view dataset. This means that character size, placement, and resolution
are not evenly distributed across the dataset. Because the data are not
uniformly distributed, correct house number detection is difficult.

In my experiment, I train a multilayer CNN for street view house number
recognition and check its accuracy on the test data. The code is written in
Python using TensorFlow, a powerful library for implementing and training deep
neural networks. The central unit of data in TensorFlow is the tensor. A tensor
consists of a set of primitive values shaped into an array of any number of
dimensions, and a tensor’s rank is its number of dimensions [9]. Along with
TensorFlow, I used other libraries such as NumPy, Matplotlib, and SciPy. Because
of limited technical resources, I performed the analysis using only the train
and test sets and omitted the extra set, which is almost 2.7 GB. To simplify the
analysis, I deleted all data points with more than 5 digits. Preprocessing the
original SVHN dataset produces a pickle file, which is used in my experiment.
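The filtering and pickling step might look like the following minimal sketch.
It assumes each sample is a pair of raw image bytes and a list of digit labels;
the function names, the dictionary keys, and the file name `svhn_subset.pickle`
are placeholders, not the ones used in the experiment.

```python
import os
import pickle
import tempfile

MAX_DIGITS = 5  # house numbers with more digits are dropped


def filter_by_length(samples, max_digits=MAX_DIGITS):
    """Keep only (image, digits) pairs with at most max_digits digits."""
    return [(img, digits) for img, digits in samples if len(digits) <= max_digits]


def save_pickle(samples, path):
    """Store the images and their label lists together in one pickle file."""
    data = {
        "images": [img for img, _ in samples],
        "labels": [digits for _, digits in samples],
    }
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


# Toy data: three blank 32x32 RGB images with 2, 6, and 5 digits.
blank = bytes(32 * 32 * 3)
samples = [(blank, [1, 9]), (blank, [1, 2, 3, 4, 5, 6]), (blank, [4, 10, 7, 7, 3])]
kept = filter_by_length(samples)

path = os.path.join(tempfile.gettempdir(), "svhn_subset.pickle")
save_pickle(kept, path)

# Reload the file the way a training script would.
with open(path, "rb") as f:
    restored = pickle.load(f)
print(len(kept), restored["labels"])
```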
For the implementation, I randomly shuffle the valid dataset, then use the
pickle file to train a 7-layer convolutional neural network. The first
convolution layer has 16 feature maps with 5×5 filters and produces a 28x28x16
output. A ReLU layer is added after each convolution layer to add
non-linearity to the decision-making process. After the first sub-sampling, the
output size decreases to 14x14x16. The second convolution has 512 feature maps
with 5×5 filters and produces a 10x10x32 output. Applying sub-sampling a second
time gives an output size of 5x5x32. Finally, the third convolution has 2048
feature maps with the same filter size. The stride is 1 in my experiment, with
a padding of zero.
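The spatial sizes quoted above follow from the usual output-size rule for
convolution and pooling, out = floor((in + 2·padding - filter) / stride) + 1.
The sketch below traces them for a 32×32 input. It assumes 2×2 pooling with
stride 2, which the text does not state but which matches the halving from 28
to 14 and from 10 to 5; the function name `out_size` and the layer names are
chosen for this example.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling window along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (layer name, window size, stride); padding is 0 everywhere.
layers = [
    ("conv1", 5, 1),  # 5x5 filters, stride 1
    ("pool1", 2, 2),  # assumed 2x2 pooling, stride 2
    ("conv2", 5, 1),
    ("pool2", 2, 2),
    ("conv3", 5, 1),
]

size = 32  # input images are 32x32
sizes = {}
for name, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    sizes[name] = size
    print(f"{name}: {size}x{size}")
```

Under these assumptions the third convolution reduces each feature map to 1×1
before the fully connected layer.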
During the experiment, I use the dropout technique to reduce overfitting.
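Dropout itself can be sketched in a few lines. The version below is the common
"inverted" formulation, in which the kept activations are rescaled during
training so that nothing needs to change at test time. The function name, the
keep probability of 0.5, and the use of plain Python lists are assumptions for
illustration, not details taken from the experiment.

```python
import random


def dropout(activations, keep_prob, training=True, rng=random):
    """Inverted dropout: zero each value with probability 1 - keep_prob and
    divide the surviving values by keep_prob."""
    if not training or keep_prob >= 1.0:
        return list(activations)
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]

train_out = dropout(hidden, keep_prob=0.5, rng=rng)        # some values zeroed
test_out = dropout(hidden, keep_prob=0.5, training=False)  # passed through unchanged
print(train_out, test_out)
```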
Finally, a softmax regression layer produces the final output. Weights are
initialized randomly using Xavier initialization, which keeps the weights in a
suitable range by automatically scaling the initialization based on the number
of input and output neurons. After building the model, I train the network and
log the accuracy, loss, and validation accuracy every 500 steps. Once training
is done, the test set accuracy is computed. The Adagrad optimizer is used to
minimize the loss. After reaching a suitable accuracy level, I stop training
and save the model parameters in a checkpoint file. When detection is needed,
the program loads the checkpoint file without training the model again.

Initially, the model produced an accuracy of 89% with just 3000 steps. This is a
good starting point, and after a few more rounds of training the accuracy
should reach 90%. However, I added some features to increase the accuracy
further. First, I added a dropout layer between the third convolution layer
and the fully connected layer, which makes the network more robust and helps
prevent overfitting. Second, I introduced exponential decay of the learning
rate, with an initial rate of 0.05 that decays every 10,000 steps with a base
of 0.95. This lets the network take bigger steps at first so that it learns
fast, and smaller steps over time as it moves closer to the minimum. With these
changes, the model produces an accuracy of 91.9% on the test set. Since the
training and test sets are large, there is a chance of further improvement if
the model is trained for longer. During my analysis, I reached an accuracy of
almost 92%: after the first training run the accuracy was 89%, and after
several more runs it reached 92%. As mentioned earlier, the saved checkpoint
file can be restored later to continue training or to detect new images. The
use of dropout helps confirm that the model is suitable and can predict most
images.
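The learning-rate schedule described above can be written as
lr = 0.05 · 0.95^(step / 10000). The sketch below evaluates it in plain Python.
Whether the decay was applied continuously or in discrete jumps (a "staircase")
is not stated in the text, so the `staircase` flag and the function name are
assumptions.

```python
def decayed_learning_rate(step, initial_rate=0.05, decay_steps=10_000,
                          decay_rate=0.95, staircase=False):
    """Exponentially decayed learning rate after a given number of training steps."""
    # With staircase=True the rate only changes once per full decay interval.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent


for step in (0, 5_000, 10_000, 50_000):
    smooth = decayed_learning_rate(step)
    stepped = decayed_learning_rate(step, staircase=True)
    print(step, smooth, stepped)
```

With a base of 0.95 per 10,000 steps the rate falls slowly, so short training
runs see a learning rate close to the initial 0.05.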
The model is tested over a wide range of inputs from the test dataset, and it
can recognize the house numbers in most of the images. Figure 5 shows that it
correctly recognizes seven out of ten house numbers. However, the model still
gives incorrect output when the images are blurry or contain other noise.
Because of limited resources, I trained the model only a few times, as each run
takes a long time. I believe there is a strong possibility of increasing the
accuracy if the whole dataset is used. Better hardware and a GPU would also run
the model faster.

In this experiment, I proposed a multi-layer deep convolutional neural network
to recognize street view house numbers. Drawing on the SVHN dataset of more
than 600,000 images, of which only the train and test sets were used, the model
achieved almost 92% accuracy. The analysis makes it evident that the model
produces correct output for most images.
However, detection may fail if the image is blurry or contains noise. The most
interesting part of the project was observing the effect of applied techniques
such as dropout and exponential learning rate decay on real data. Because many
variations of CNN architecture can be implemented, it is very difficult to know
which architecture will work best for a specific dataset, and determining the
most appropriate CNN architecture was a challenging aspect of this experiment.
The model implemented in this project is relatively simple but does the job
well and is quite robust. However, some work remains to optimize the accuracy.
As future work, I will extend the experiment using other CNN architectures
along with hybrid techniques and algorithms, to find out which one gives better
accuracy with minimum cost and lower loss. I will also try to incorporate the
whole dataset in the next experiment.