- Data Preprocessing (NLP)
- URL removal: All URLs were removed from the tweets, because links carry no real sentiment value and contribute nothing useful for training or classification.
- Elongated words: Any run of three or more identical letters was reduced to two (for example, “cooool” becomes “cool”). This makes the vocabulary generated by the feature extractor more accurate and leaves far fewer redundant features.
- Positive emoticon faces: All the positive emoticon faces were replaced by the text “happy”. By doing this, the sentiment classifier would be better able to train itself and correctly classify tweets as positive.
- Negative emoticon faces: All the negative emoticon faces were replaced by the text “sad”. By doing this, the sentiment classifier would be better able to train itself and correctly classify tweets as negative.
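The four preprocessing steps above can be sketched as a single function. This is a minimal illustration, not the code used in the exercise: the emoticon lists, the URL pattern, and the name `preprocess` are assumptions, since the source does not say which emoticons or which URL regex were used.

```python
import re

# Hypothetical emoticon lists; the source does not say which emoticons were matched.
POSITIVE_EMOTICONS = [":)", ":-)", ":D", ";)", "=)"]
NEGATIVE_EMOTICONS = [":(", ":-(", ":'(", "=("]

# Assumed URL pattern: anything starting with http://, https:// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# A letter followed by two or more repeats of itself (three or more in a row).
ELONGATED_RE = re.compile(r"([A-Za-z])\1{2,}")


def _emoticon_regex(emoticons):
    # Longest alternatives first so ":-)" is tried before shorter emoticons.
    ordered = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


POSITIVE_RE = _emoticon_regex(POSITIVE_EMOTICONS)
NEGATIVE_RE = _emoticon_regex(NEGATIVE_EMOTICONS)


def preprocess(tweet):
    """Apply URL removal, emoticon replacement and elongation trimming."""
    text = URL_RE.sub(" ", tweet)            # remove URLs first
    text = POSITIVE_RE.sub(" happy ", text)  # positive faces -> "happy"
    text = NEGATIVE_RE.sub(" sad ", text)    # negative faces -> "sad"
    text = ELONGATED_RE.sub(r"\1\1", text)   # "goooood" -> "good"
    return " ".join(text.split())            # collapse leftover whitespace
```

URLs are removed before emoticons are replaced so that character sequences inside a link are never mistaken for a face.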
Classifier 1: sklearn - CountVectorizer()
This is a feature extractor that converts a collection of text documents to a matrix of token counts. It takes various arguments that control which features are extracted. The model uses the bag-of-features method and is based on the occurrence of words. The figure below shows the parameters used for this exercise.
Parameters used for CountVectorizer()
- ngram_range: The lower and upper boundary of the range of n-values for different n-grams to be extracted.
- token_pattern: Regular expression denoting what constitutes a “token”. A regular expression (regex) pattern was used to accurately identify each token.
- min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. The value of 8 was chosen as it returned the best accuracy when tested against the different test sets.
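- stop_words: If ‘english’, a built-in stop word list for English is used. Stop words are very common words (such as “the” or “and”) that carry little sentiment, so the extractor removes them from the tokens before building the vocabulary.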
Once the extractor was configured, the fit_transform() function was used to learn the vocabulary dictionary and return the document-term matrix. This became the set of training features.
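The configuration and the fit_transform() call described above might look like the sketch below. Only `min_df=8` and `stop_words="english"` come from the text; the `ngram_range` and `token_pattern` values, the helper name `build_vectorizer`, and the toy tweets are assumptions for illustration. The toy run lowers `min_df` to 2, because a threshold of 8 would discard every term in a four-tweet corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer


def build_vectorizer(min_df=8):
    """Return a CountVectorizer configured along the lines described above."""
    return CountVectorizer(
        ngram_range=(1, 2),            # assumed: unigrams and bigrams
        token_pattern=r"(?u)\b\w+\b",  # assumed stand-in for the regex used
        min_df=min_df,                 # 8 in the exercise
        stop_words="english",          # built-in English stop word list
    )


# Toy, already preprocessed tweets (hypothetical data).
train_tweets = [
    "happy with the great phone",
    "the great phone makes me happy",
    "sad about the terrible phone",
    "the terrible service makes me sad",
]

vectorizer = build_vectorizer(min_df=2)  # lowered only for the toy corpus
train_features = vectorizer.fit_transform(train_tweets)

# One row per tweet, one column per retained term.
n_tweets, n_terms = train_features.shape
```

The result is a sparse matrix, and `vectorizer.vocabulary_` maps each retained term to its column index.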
Two other classifiers were also used for the above process.
- Tweet Classifier
- Based on the sentiment classifier used on the tweets, a corresponding tweet classifier was used that optimized the prediction capability for the particular sentiment classifier.
- After this, each tweet classifier’s fit() method was used to fit the training features to the corresponding sentiment values. The predict() method was then used to predict the sentiment of the corresponding test features. Finally, a dictionary was created mapping each tweet ID in the Twitter test data to its predicted sentiment.
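The fit, predict, and dictionary steps can be sketched as follows. The source does not name the tweet classifiers, so `MultinomialNB` is used here purely as an assumed stand-in; the tweets, labels, and tweet IDs are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: preprocessed tweets and their sentiment values.
train_tweets = [
    "happy great phone",
    "great phone happy",
    "sad terrible phone",
    "terrible service sad",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Hypothetical test data: tweet IDs and preprocessed tweet text.
test_ids = ["1001", "1002"]
test_tweets = ["happy happy great", "sad terrible"]

vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train_tweets)
# Reuse the learned vocabulary for the test set; do not refit on it.
test_features = vectorizer.transform(test_tweets)

# Assumed classifier; any estimator with fit() and predict() fits this pattern.
classifier = MultinomialNB()
classifier.fit(train_features, train_labels)
predicted = classifier.predict(test_features)

# Map each tweet ID to its predicted sentiment.
predictions = {
    tweet_id: str(label) for tweet_id, label in zip(test_ids, predicted)
}
```

Calling transform() rather than fit_transform() on the test tweets keeps the test feature columns aligned with the columns the classifier was trained on.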