
The NB classifier~\cite{paper1,paper9} works very effectively for document classification. This was verified by comparing its classification accuracy against that of Icsiboost-bigram, an implementation of the AdaBoost algorithm, and the Expectation-Maximization (EM) algorithm, which is an iterative method. Two assumptions were made to simplify the calculations: the position of the words in the document does not matter, and the features are conditionally independent given a category. Laplace smoothing was required so that documents with features absent from the training set could be considered in the classification process. One of the problems encountered was running out of the floating-point range available in the Java \texttt{double} data type. To avoid this, logarithms of the probabilities were calculated, and additions on these logarithms were performed instead of multiplications on probabilities. This is a supervised classification, and the data set is divided into 80\% training and 20\% testing data sets.

To further improve accuracy and overcome the major drawbacks of the traditional NB classifier, a hybrid model comprising an NB vectorizer and an SVM classifier was developed~\cite{paper10}. The model reduces the dimensionality of the features from thousands of words, as is the case with a TF-IDF vectorizer, to fewer than 20, corresponding to the number of categories. This reduces the training and classification time while maintaining the acceptable generalization afforded by SVM and the accuracy provided by the NB vectorizer.

The graph-based algorithm Graph-Partially Supervised Learning (Graph-PSL)~\cite{paper2} solves the problem of document classification with positive and unlabeled documents. The documents are classified by a Transductive Support Vector Machine (TSVM), which can exploit the information contained in unlabeled data.
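As an aside on the NB classifier discussed above, the log-probability workaround can be illustrated with a minimal Python sketch. It is not the implementation from the cited work: the function names \texttt{train\_nb} and \texttt{classify}, the toy data in the usage note, and the choice to skip words never seen in training are all assumptions made for illustration. Smoothing is the usual add-one (Laplace) estimate over the training vocabulary.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods.

    docs is a list of token lists, labels the matching category names.
    (Hypothetical helper, written for illustration only.)
    """
    vocab = {word for doc in docs for word in doc}
    word_counts = defaultdict(Counter)  # category -> word -> count
    class_counts = Counter(labels)      # category -> number of documents
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Add-alpha smoothing: every vocabulary word gets a nonzero probability.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood


def classify(doc, log_prior, log_likelihood):
    """Return the best category and all log scores for a token list."""
    scores = {}
    for c, prior in log_prior.items():
        table = log_likelihood[c]
        # Sum of logs replaces the product of probabilities.
        # Assumption: words outside the training vocabulary are skipped.
        scores[c] = prior + sum(table[w] for w in doc if w in table)
    best = max(scores, key=scores.get)
    return best, scores
```

With a toy corpus of two sport and two computing documents, a 500-word test document still receives finite log scores, whereas the direct product of 500 small probabilities underflows to zero. Python floats are IEEE 754 doubles, so the range limit is the same as for the Java \texttt{double} type.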
Graph-PSL identifies negative documents accurately and classifies the unlabeled ones more effectively and more robustly than Bayesian-based algorithms.

Text classification can also be done by combining sparse representation classifiers and support vector machines~\cite{paper5} using frequency-based kernels. A term-frequency (TF) representation of a text document is considered. The sparse representation of an example is obtained by using an overcomplete dictionary made up of the TF vectors of all the training documents. It is observed that examples misclassified by one classifier are correctly classified by another. To take advantage of this, the classifiers are combined to improve text classification performance.

SVM cannot process data in text format and needs a vectorizer in the front end to convert it into numerical values. Thus the training documents train the NB vectorizer, and then the vectorized training data and the test document are used as inputs to the SVM classifier~\cite{paper10}. Each training document is vectorized by the trained vectorizer, which calculates the posterior probability of the document for each existing category using Bayes' formula. In this way, all the training documents are vectorized by their probability distribution in the feature space, in the format of numerical multidimensional arrays. SVM considers the probabilities of all categories in this classification task, whereas traditional NB uses only the highest-probability category. When there are a lot of common keywords between categories, NB is better.

The training data set needs to be pre-processed before it can be used further. Pre-processing steps normally include language-specific stop-word elimination, tokenization, lemmatization, selection of words that have high mutual information with the target concept, and stemming, in which words are reduced to their root forms.
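Three of the pre-processing steps listed above (tokenization, stop-word elimination, and stemming) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, the suffix rules are deliberately naive and are not an implementation of any published stemmer, and lemmatization and mutual-information-based selection are left out.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Naive suffix rules for illustration only, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
```

For example, \texttt{preprocess("The players are passing the ball in games")} returns \texttt{["player", "pass", "ball", "game"]}. A rule set this small over-strips many words, which is why established stemmers use longer, ordered rule lists.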
Porter's stemmer is a popular stemmer for the English language that systematically strips suffixes from an English word. When the documents are web pages, additional pre-processing is required. The next step is feature selection using a TF-IDF vectorizer or a Bag of Words model, which helps identify important words in the document that are indicative of the category~\cite{paper8, paper12}. This step converts the text document into a document vector ready for further processing by a machine learning categorization algorithm such as the NB classifier.

Word embeddings are distributed representations of word features. Topical Paragraph Vector (TPV)~\cite{paper4}, whose training method is similar to that of word embeddings, is used to overcome the inability of single-vector word embeddings to represent homonymy and polysemy. Selection of attributes plays a vital role in improving the quality of clustering~\cite{paper6}. The suggested framework is primarily concerned with determining and selecting key distributional noun attributes, which are nominated by ranking the original noun attributes according to their importance measure scores, without class information. A novel text categorization method based on modified Support Vector Clustering~\cite{paper13} has been proposed. Text documents can also be represented unconventionally by making use of symbolic data analysis concepts. A new method of representing documents based on clustering of term-frequency vectors has been proposed, in which the Single Linkage, Complete Linkage, Average Linkage, K-Means, and Fuzzy C-Means clustering algorithms are used to cluster the term-frequency vectors~\cite{paper3}.

A number of different classification algorithms, such as NB, K-NN, and the Rocchio algorithm, can be combined to form a single classifier to improve classification accuracy at the expense of increased complexity. The Rocchio classifier is based on the relevance feedback algorithm and uses TF-IDF word weights.
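The TF-IDF word weights mentioned above can be computed as in the following minimal Python sketch. Several TF-IDF variants exist. The one assumed here uses the term count divided by the document length as the term frequency and $\log(N/\mathrm{df})$ as the inverse document frequency, where $N$ is the number of documents and $\mathrm{df}$ is the number of documents containing the term. The function name \texttt{tf\_idf} is hypothetical, and the cited works may use a different weighting.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter()  # term -> number of documents containing it
    for doc in docs:
        df.update(set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {
                term: (count / len(doc)) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return vectors
```

Under this variant, a term that occurs in every document receives a weight of zero, so only terms that help separate documents contribute to the document vector.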
Voting algorithms take the predicted classes of different classifiers as input and select the class that has been predicted by the most classifiers. Weighted majority voting, decision templates, and a class of aggregation operators such as Ordered Weighted Averaging (OWA) operators~\cite{paper11, paper7} are other techniques for combining the results of different classifiers.

\begin{comment}
A variation of recurrent neural networks called Long Short-Term Memory (LSTM) networks is capable of learning long-term dependencies. The error which is backpropagated may vanish, and LSTMs can help preserve such errors by maintaining a more constant error. A neural network approach to document classification is observed to be more reliable and efficient. It is a supervised classification approach, so the neural network is trained with the known labels through the feature space extracted from the training data to predict the unknown labels in the testing data. LSTMs can achieve a classification accuracy rate of 93\% and can handle the complexities, which makes them an appropriate machine learning algorithm for document classification.
\end{comment}

\section{Analysis of data set}
This article makes use of the 20 Newsgroups data set, a collection of nearly 20,000 documents distributed across 20 different topics. These are text documents featuring email conversations that include distinguishing features, such as keywords, that link each document to its category or topic. The data set was originally collected by Ken Lang, as mentioned on its home page. This collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and clustering. Some of the topics are very closely related to each other (e.g., rec.sport.baseball and rec.sport.hockey), while others are highly unrelated (e.g., talk.politics.mideast and comp.sys.ibm.pc.hardware). This relationship among the 20 categories can be seen in Table