Sentiment Analysis Model Using NLP: Part II -- Out-of-Core Classification


Building a Model for Sentiment Analysis Using Natural Language Processing Part II

In a previous post, I demonstrated how to build a sentiment analysis model using text analysis and natural language processing (NLP). In that post, memory limitations meant that I could train on only 20,000 tweets out of a corpus of 1.6 million. In this analysis, I explore out-of-core techniques for classifying text documents, which train a model incrementally on batches so that the full feature matrix never has to be held in memory at once.

A note on the data

The training data used for this analysis is the Stanford Twitter corpus of 1.6 million English-language tweets, automatically annotated for negative and positive sentiment using emoticons. The test data is also from the Stanford group and consists of manually annotated tweets, 177 reflecting negative sentiment and 182 reflecting positive sentiment. The data sets can be found here:

In [1]:
import numpy as np
import pandas as pd
import nltk
import re #for regex
import cPickle as pickle

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.util import ngrams
from sklearn.naive_bayes import MultinomialNB,BernoulliNB

Get the raw data and do some data processing

In [2]:
def loadData ():
    """Load the CSV files with the training and test data"""
    header = ['polarity', 'tweet_id', 'date','query', 'user', 'tweet']
    df_train = pd.read_csv('/home/concinte/Code/SentimentAnalysis/Twitter/\
    df_test = pd.read_csv('/home/concinte/Code/SentimentAnalysis/Twitter/\

    df_train.columns = header
    df_test.columns  = header

    #Shuffle the rows so that you get a mix of pos, neutral, and neg sentiments
    df_train = df_train.sample(frac=1).reset_index(drop=True)
    df_test  = df_test.sample(frac=1).reset_index(drop=True)

    #Drop unnecessary columns
    df_train.drop(['tweet_id','date','query','user'], axis=1, inplace=True)
    df_test.drop(['tweet_id','date','query','user'], axis=1, inplace=True)

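    #Pickle the data frames (paths match those read back in preProcessData)
    df_train.to_pickle('/Data/df_training.pkl')
    df_test.to_pickle('/Data/df_test.pkl')
    print("Finished loading and pickling data")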
In [3]:
def preProcessTweet(tweet):
    """Function to pre-process the tweet"""
    #Replace all words preceded by '@' with 'USER_NAME'
    tweet = re.sub(r'@[^\s]+', 'USER_NAME', tweet)
    #Replace all URLs with 'URL' (no spaces inside the pattern, since they would be matched literally)
    tweet = re.sub(r'www\.[^\s]+|https?://[^\s]+', ' URL ', tweet)
    #Replace each hashtag with the word itself
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #Collapse any character repeated three or more times to a single occurrence
    tweet = re.sub(r'(.)\1{2,}', r'\1', tweet)
    #Remove any extra white space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    return tweet
In [4]:
def preProcessData():
    """
    Obtain the pickled data and pre-process it.
    The pre-processed data is then pickled
    """
    df_train = pd.read_pickle('/Data/df_training.pkl')
    df_test = pd.read_pickle('/Data/df_test.pkl')
    #Pre-process the data
    df_train['tweet'] = df_train['tweet'].apply(preProcessTweet)
    df_test['tweet'] = df_test['tweet'].apply(preProcessTweet)
    #Pickle pre-processed data frames (paths match those loaded in the next section)
    df_train.to_pickle('/Data/df_training_preprocessed.pkl')
    df_test.to_pickle('/Data/df_test_preprocessed.pkl')

    print("Training and test data is now pre-processed")

Implementation of the out-of-core technique

Get the pre-processed pickled data

In [5]:
df = pd.read_pickle('/Data/df_training_preprocessed.pkl')
df = df[:800000]
In [6]:
df_test = pd.read_pickle('/Data/df_test_preprocessed.pkl')
df_test = df_test[df_test.polarity !=2]
In [7]:
def tweetGenerator(df):
    """
    This generator walks through the data frame row by row
    and yields the text of each tweet along with its sentiment label
    """
    for row in df.itertuples():
        label = row[1]
        tweet = row[2]
        yield tweet, label
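
The generator above iterates over a data frame that is already loaded in memory. As a rough sketch of a stricter out-of-core variant, the rows could instead be streamed straight from the CSV file. The name stream_tweets and the two-row sample are hypothetical; the column order follows the header defined in loadData, and the code assumes Python 3.

```python
import csv
import io

def stream_tweets(handle):
    """Yield (tweet, label) pairs one row at a time from an open CSV handle.

    Assumes the column order polarity, tweet_id, date, query, user, tweet.
    """
    for row in csv.reader(handle):
        if len(row) < 6:
            continue  # skip malformed rows
        yield row[5], int(row[0])

# Hypothetical two-row sample standing in for the real file
sample = io.StringIO(
    '"0","1","Mon Apr 06","NO_QUERY","alice","so sad today"\n'
    '"4","2","Mon Apr 06","NO_QUERY","bob","what a great day"\n'
)
pairs = list(stream_tweets(sample))
```

Because only one row is parsed at a time, memory use stays flat regardless of the size of the file. The trade-off is that the rows can no longer be shuffled in memory first.
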
In [37]:
def getFeatureVector(tweet):
    """
    The function takes a tweet and does some processing
    to remove stopwords, remove punctuation, lemmatize,
    and reject any words that are non-alpha. Depending on the
    flag selected, it will return unigrams, bigrams, or a
    mix of the two. It returns a list with the filtered n-grams
    """
    flag = 3 #1 for unigram; 2 for bigram; 3 for mix
    #Tokenize the tweet and convert each token to lower case
    tokens = [token.lower() for token in word_tokenize(tweet)]

    punctuations = ["'", ":", ",", "-", ".", "!", "(", ")", "?", '"', ";"]
    stopWords = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    #Remove stopwords, punctuation, 'url', and 'user_name'
    filteredTokens = []
    for token in tokens:
        if token in punctuations or token in stopWords:
            continue
        elif token == 'url' or token == 'user_name':
            continue
        elif not token.isalpha(): #Reject non-alpha tokens
            continue
        #Lemmatize the tokens that survive the filters
        filteredTokens.append(lemmatizer.lemmatize(token))

    #Build the feature vector for the tweet
    if flag == 1:
        featureVector = filteredTokens
    elif flag == 2:
        #Convert each bigram tuple to a string
        featureVector = [' '.join(bigram) for bigram in nltk.bigrams(filteredTokens)]
    else:
        #everygrams yields tuples of length 1 and 2; join each into a string
        featureVector = [' '.join(gram) for gram in nltk.everygrams(filteredTokens, max_len=2)]
    return featureVector
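
To make the "mix" option concrete, here is a minimal sketch of what a combined unigram and bigram feature list looks like for an already filtered token list. The helper name unigrams_and_bigrams is hypothetical and it uses plain Python rather than NLTK, so the ordering of the n-grams may differ from what nltk.everygrams produces. For a hashed bag of features the ordering does not matter.

```python
def unigrams_and_bigrams(tokens):
    """Return the unigrams followed by the space-joined bigrams."""
    unigrams = list(tokens)
    # Pair each token with its successor to form the bigrams
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return unigrams + bigrams

features = unigrams_and_bigrams(['love', 'sunny', 'day'])
```

Here features holds three unigrams and two bigrams: 'love', 'sunny', 'day', 'love sunny', and 'sunny day'.
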
In [9]:
def getBatch(tweet_gen, size):
    """
    This function takes as arguments
    the tweet generator tweet_gen and the batch size
    desired. It returns two lists for the tweets and labels
    whose length is the batch size
    """
    tweets, labels = [], []
    for _ in range(size):
        tweet, label = next(tweet_gen)
        tweets.append(tweet)
        labels.append(label)
    return tweets, labels
In [10]:
def getBatchTest(tweet_gen_test, size):
    """
    This function is used to generate the batches
    for the test data
    """
    tweets, labels = [], []
    for _ in range(size):
        tweet, label = next(tweet_gen_test)
        tweets.append(tweet)
        labels.append(label)
    return tweets, labels
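
The batch functions above call next() a fixed number of times, which raises StopIteration if the generator holds fewer items than requested. As a hedged alternative, the sketch below uses itertools.islice to return a shorter (possibly empty) final batch instead. The name get_batch and the toy generator are illustrative only, and the code assumes Python 3.

```python
from itertools import islice

def get_batch(pair_gen, size):
    """Pull up to `size` (tweet, label) pairs from a generator.

    The returned lists are shorter than `size` (possibly empty)
    once the generator runs out, instead of raising StopIteration.
    """
    batch = list(islice(pair_gen, size))
    tweets = [tweet for tweet, _ in batch]
    labels = [label for _, label in batch]
    return tweets, labels

gen = iter([('a', 0), ('b', 4), ('c', 0)])
first = get_batch(gen, 2)   # full batch
second = get_batch(gen, 2)  # partial batch
third = get_batch(gen, 2)   # generator exhausted
```

An empty batch can then serve as the stop signal for a training loop, so the number of iterations does not need to be computed in advance.
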

Define a HashingVectorizer which takes the tokenizing function as one of its arguments

In [11]:
from sklearn.feature_extraction.text import HashingVectorizer
vector = HashingVectorizer(decode_error = 'ignore',
                          n_features = 2**21,
                          preprocessor = None,
                          non_negative = True, #later scikit-learn releases use alternate_sign=False instead
                          tokenizer = getFeatureVector)

Note: It is important that non_negative is set to True, because MultinomialNB expects non-negative inputs. From the scikit-learn documentation:

Since the hash function might cause collisions between (unrelated) features, a signed hash function is used and the sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature’s value is zero. If non_negative=True is passed to the constructor, the absolute value is taken. This undoes some of the collision handling, but allows the output to be passed to estimators like sklearn.naive_bayes.MultinomialNB or sklearn.feature_selection.chi2 feature selectors that expect non-negative inputs.
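
The behaviour described in the quote can be illustrated with a small, self-contained sketch of the hashing trick. This is not scikit-learn's implementation: the function hash_features is hypothetical, and MD5 stands in for the hash function the library actually uses. The code assumes Python 3.

```python
import hashlib

def hash_features(tokens, n_features=16, non_negative=False):
    """Map tokens to a fixed-length vector using a signed hash."""
    vec = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        h = int.from_bytes(digest[:8], 'big')
        index = h % n_features               # column chosen by the hash
        sign = -1 if (h >> 63) & 1 else 1    # a high bit decides the sign
        vec[index] += sign
    if non_negative:
        # Take absolute values so estimators such as MultinomialNB accept it
        vec = [abs(v) for v in vec]
    return vec

tokens = ['love', 'sunny', 'day', 'love sunny', 'sunny day']
signed = hash_features(tokens)
unsigned = hash_features(tokens, non_negative=True)
```

No vocabulary is stored anywhere, which is what makes the approach suitable for out-of-core learning: each batch can be vectorized independently of the others.
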

Here are the different classifiers

In [12]:
from sklearn.naive_bayes import MultinomialNB
MNBclassifier = MultinomialNB(alpha = 0.01)
tweet_gen = tweetGenerator(df)
In [13]:
from sklearn.naive_bayes import BernoulliNB
BernoulliNBclassifier = BernoulliNB(alpha = 0.01)
tweet_gen = tweetGenerator(df)
In [14]:
from sklearn.linear_model import SGDClassifier
SGDclassifier = SGDClassifier(loss='log', random_state=1, n_iter=1)
tweet_gen = tweetGenerator(df)
In [15]:
from sklearn.linear_model import PassiveAggressiveClassifier
PassiveAggressiveclassifier = PassiveAggressiveClassifier(C=1.0, fit_intercept=True, n_iter=5, shuffle=True)
tweet_gen = tweetGenerator(df)

PyPrind is a Python package that provides a progress bar for iterative processes. We will train every classifier incrementally on each batch of the training data and monitor the progress.

In [16]:
import pyprind
batchSize = 20000
totalTweets = len(df)
#totalTweets = 15000
iterations = totalTweets // batchSize #integer division so that range() receives an int
progressBar = pyprind.ProgBar(iterations)

classes = np.array([0, 4])
for i in range(iterations):
    X_train, y_train = getBatch(tweet_gen, size=batchSize)
    X_train = vector.transform(X_train)
    MNBclassifier.partial_fit(X_train, y_train, classes=classes)
    BernoulliNBclassifier.partial_fit(X_train, y_train, classes=classes)
    SGDclassifier.partial_fit(X_train, y_train, classes=classes)
    PassiveAggressiveclassifier.partial_fit(X_train, y_train, classes=classes)
    progressBar.update() #advance the progress bar after each batch
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:18:52
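
To show why partial_fit makes out-of-core training possible, here is a toy incremental Naive Bayes classifier that only accumulates counts from each batch. It is a simplified sketch, not scikit-learn's MultinomialNB: the class TinyIncrementalNB and the sample batches are hypothetical, and the code assumes Python 3.

```python
import math
from collections import defaultdict

class TinyIncrementalNB(object):
    """Toy multinomial Naive Bayes that learns from one batch at a time."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.classes = None
        self.class_counts = defaultdict(int)   # documents seen per class
        self.token_counts = defaultdict(dict)  # token counts per class
        self.total_tokens = defaultdict(int)   # tokens seen per class
        self.vocabulary = set()

    def partial_fit(self, docs, labels, classes=None):
        # All classes are declared on the first call, because a single
        # batch is not guaranteed to contain every label.
        if self.classes is None:
            self.classes = list(classes)
        for tokens, label in zip(docs, labels):
            self.class_counts[label] += 1
            counts = self.token_counts[label]
            for token in tokens:
                counts[token] = counts.get(token, 0) + 1
                self.total_tokens[label] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        n_classes = len(self.classes)
        best, best_score = None, -float('inf')
        for c in self.classes:
            # Smoothed log prior plus smoothed log likelihood of each token
            score = math.log((self.class_counts[c] + self.alpha)
                             / (n_docs + self.alpha * n_classes))
            denom = self.total_tokens[c] + self.alpha * len(self.vocabulary)
            for token in tokens:
                count = self.token_counts[c].get(token, 0)
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best, best_score = c, score
        return best

model = TinyIncrementalNB()
batches = [
    ([['bad', 'sad'], ['great', 'happy']], [0, 4]),
    ([['sad', 'awful'], ['happy', 'love']], [0, 4]),
]
for docs, labels in batches:
    model.partial_fit(docs, labels, classes=[0, 4])

pred_neg = model.predict(['sad'])
pred_pos = model.predict(['happy', 'love'])
```

Each call to partial_fit only updates running counts, so a batch can be discarded as soon as it has been processed.
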

Test and score each classifier using the test data

In [17]:
tweet_gen_test = tweetGenerator(df_test)
X_test, y_test = getBatchTest(tweet_gen_test, size=df_test.shape[0])
X_test = vector.transform(X_test)
In [18]:
print('Accuracy: %.3f' % MNBclassifier.score(X_test, y_test))
Accuracy: 0.751
In [19]:
print('Accuracy: %.3f' % BernoulliNBclassifier.score(X_test, y_test))
Accuracy: 0.707
In [20]:
print('Accuracy: %.3f' % SGDclassifier.score(X_test, y_test))
Accuracy: 0.788
In [21]:
print('Accuracy: %.3f' % PassiveAggressiveclassifier.score(X_test, y_test))
Accuracy: 0.796

Pickle the tokenizer, HashingVectorizer, and classifier

In [40]:
import dill

dill.dump(vector,open('./Classifiers/HashingVectorizer.pkl', 'wb'))
In [39]:
num = len(df) // 1000 #integer division keeps the file name free of a decimal point
dill.dump(MNBclassifier, open('./Classifiers/MNBclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

dill.dump(BernoulliNBclassifier, open('./Classifiers/\
BernoulliNBclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

dill.dump(SGDclassifier, open('./Classifiers/\
SGDclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

dill.dump(PassiveAggressiveclassifier, open('./Classifiers/\
PassiveAggressiveclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))
In [38]:
from getFeatureVector import getFeatureVector
dill.dump(getFeatureVector, open('./Classifiers/tokenizer.pkl', 'wb'))
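
Saved objects are only useful if they can be restored later. Below is a minimal sketch of the dump and load round trip, using the standard library's pickle module and a temporary directory so that it runs anywhere. A plain dictionary stands in for a trained classifier, and the file name model.pkl is hypothetical. dill.dump and dill.load follow the same call pattern.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder for a trained classifier
stand_in_model = {'classes': [0, 4], 'alpha': 0.01}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'model.pkl')

# Context managers close the file handles even if an error occurs
with open(path, 'wb') as handle:
    pickle.dump(stand_in_model, handle)

with open(path, 'rb') as handle:
    restored = pickle.load(handle)
```

Opening the files in a with block, rather than passing open(...) directly to dump, ensures that the data is flushed and the handle released as soon as the block ends.
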

Note on improving performance

The performance of the models can be improved by combining the capabilities of NLTK with those of the scikit-learn library. We can run a grid search with cross-validation on each algorithm to select the best set of hyperparameters, and then apply cross-validation again when training the models with those hyperparameters. I will demonstrate this in a later post.
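
As a rough illustration of the procedure, the sketch below implements a grid search with k-fold cross-validation in plain Python. All names here (k_fold_indices, grid_search_cv, toy_fit_and_score) are hypothetical, and the toy scorer merely stands in for training and scoring a real model. In practice, scikit-learn's GridSearchCV provides this functionality.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size

def grid_search_cv(fit_and_score, param_grid, n_samples, k=5):
    """Return (best_params, best_mean_score) over all parameter combinations.

    fit_and_score(params, train_idx, valid_idx) is a user-supplied callable
    that trains on the training indices and returns a validation score.
    """
    names = sorted(param_grid)
    best_params, best_score = None, -float('inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, train, valid)
                  for train, valid in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy scorer standing in for real training: the score peaks at alpha = 0.1
def toy_fit_and_score(params, train_idx, valid_idx):
    return 1.0 - abs(params['alpha'] - 0.1)

best_params, best_score = grid_search_cv(
    toy_fit_and_score, {'alpha': [0.01, 0.1, 1.0]}, n_samples=10, k=5)
```

Averaging the score over the folds reduces the chance of picking a hyperparameter value that happens to suit one particular split of the data.
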