Sentiment Analysis Model Using NLP: Part II -- Out-of-Core Classification

In a previous post, I demonstrated how to build a model for sentiment analysis using text analysis and natural language processing (NLP). In that post, memory limitations meant I could only train on 20,000 of the 1.6 million tweets in the corpus. In this analysis, I explore out-of-core techniques for classifying text documents, which train a classifier incrementally on small batches so the full corpus never has to fit in memory at once.

A note on the data

The training data for this analysis is the Stanford (Sentiment140) corpus of 1.6 million English-language tweets, automatically annotated as negative or positive based on emoticons. The test data is also from the Stanford group and consists of manually annotated tweets, 177 of which reflect negative sentiment and 182 positive sentiment. The data sets can be found here: http://help.sentiment140.com/for-students

In [1]:
import numpy as np
import pandas as pd
import nltk
import re #for regex
import cPickle as pickle

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.util import ngrams
from sklearn.naive_bayes import MultinomialNB,BernoulliNB


Get the raw data and do some data processing

In [2]:
def loadData():
    """
    Load the csv files with the training and test data
    """
    header = ['polarity', 'tweet_id', 'date', 'query', 'user', 'tweet']

    df_train = pd.read_csv('/home/concinte/Code/SentimentAnalysis/Twitter/'
                           'Data/training.1600000.processed.noemoticon.csv',
                           header=None)
    df_test = pd.read_csv('/home/concinte/Code/SentimentAnalysis/Twitter/'
                          'Data/testdata.manual.2009.06.14.csv',
                          header=None)

    df_train.columns = header
    df_test.columns  = header

    #Shuffle the rows so that you get a mix of pos, neutral, and neg sentiments
    df_train = df_train.sample(frac=1).reset_index(drop=True)
    df_test  = df_test.sample(frac=1).reset_index(drop=True)

    #Drop unnecessary columns
    df_train.drop(['tweet_id', 'date', 'query', 'user'], axis=1, inplace=True)
    df_test.drop(['tweet_id', 'date', 'query', 'user'], axis=1, inplace=True)

    #Pickle the data frames
    df_train.to_pickle('/Data/df_training.pkl')
    df_test.to_pickle('/Data/df_test.pkl')
    
    print "Finished loading and pickling data"
In [3]:
def preProcessTweet(tweet):
    """
    Function to pre-process the tweet
    """
    tweet = str(tweet)
    
    #Replace all words preceded by '@' with 'USER_NAME'
    tweet = re.sub(r'@[^\s]+', 'USER_NAME', tweet)
    
    #Replace all URLs with 'URL'
    tweet = re.sub(r'www\.[^\s]+|http[^\s]+', ' URL ', tweet)
    
    #Replace all hashtags with the bare word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    
    #Replace words with long runs of a repeated character with the shorter form
    tweet = re.sub(r'(.)\1{2,}', r'\1', tweet)
    
    #Collapse any extra whitespace
    tweet = re.sub(r'[\s]+', ' ', tweet)
    
    return tweet
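
To see what these rules do in practice, here is a made-up example tweet run through the function (the output shown in the comment is indicative, not an exact transcript):

example = "@alice I loooove this #movie http://t.co/abc123"
print(preProcessTweet(example))
#Prints something like: USER_NAME I love this movie URL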
In [4]:
def preProcessData():
    """
    Obtain the pickled data frames and pre-process them.
    The pre-processed data is then pickled again
    """
    df_train = pd.read_pickle('/Data/df_training.pkl')
    df_test = pd.read_pickle('/Data/df_test.pkl')
    
    #Pre-process the data
    df_train['tweet'] = df_train['tweet'].apply(preProcessTweet)
    df_test['tweet'] = df_test['tweet'].apply(preProcessTweet)
    
    #Pickle pre-processed data frames
    df_train.to_pickle('/Data/df_training_preprocessed.pkl')
    df_test.to_pickle('/Data/df_test_preprocessed.pkl')

    print "Training and test data is now pre-processed"

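The two functions above only need to be run once; after that, the rest of the notebook works from the pickled, pre-processed data frames. Running them and checking the class balance would look something like this (paths as defined above):

loadData()
preProcessData()

#Quick sanity check on the label balance (0 = negative, 4 = positive)
df_train = pd.read_pickle('/Data/df_training_preprocessed.pkl')
print(df_train['polarity'].value_counts())
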

Implementation of the out-of-core technique

Get the pre-processed pickled data

In [5]:
df = pd.read_pickle('/Data/df_training_preprocessed.pkl')
df = df[:800000] #Train on the first 800,000 of the shuffled tweets
In [6]:
df_test = pd.read_pickle('/Data/df_test_preprocessed.pkl')
df_test = df_test[df_test.polarity != 2] #Drop neutral tweets from the test set
In [7]:
def tweetGenerator(df):
    """
    This generator iterates over the rows of the data frame
    and yields the text of each tweet along with its sentiment label
    """
    for row in df.itertuples():
        label = row[1] #column 0 is the data frame index
        tweet = row[2]
        yield tweet, label
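
The generator yields one (tweet, label) pair at a time, which is what the batching functions below rely on. For example:

gen = tweetGenerator(df)
tweet, label = next(gen)
print('%s -> %d' % (tweet, label))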
In [37]:
%%writefile getFeatureVector.py
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

def getFeatureVector(tweet):
    """
    The function takes a tweet and does some processing
    to remove stopwords, remove punctuation, lemmatize
    and reject any tokens that are non-alpha. Depending on the
    flag selected, it will return unigrams, bigrams, or a
    mix of the two. It returns a list with the filtered n-grams
    """
    
    flag = 3 #1 for unigrams; 2 for bigrams; 3 for a mix of the two
    
    #Tokenize the tweet and convert each token to lower case
    tokens = [token.lower() for token in word_tokenize(tweet)]

    punctuations = ["'", ":", ",", "-", ".", "!", "(", ")", "?", '"', ";"]
    stopWords = stopwords.words('english')
    stopWords.append("#")
    stopWords.append("%")
    stopWords = set(stopWords)
    lemmatizer = WordNetLemmatizer()
    
    #Remove stopwords, punctuation, 'url', and 'user_name',
    #reject non-alpha tokens, and lemmatize what remains
    filteredTokens = []
    for token in tokens:
        if token in punctuations or token in stopWords:
            continue
        elif token == 'url' or token == 'user_name':
            continue
        elif not token.isalpha():
            continue
        else:
            filteredTokens.append(lemmatizer.lemmatize(token))

    #Build the feature vector for the tweet from the filtered tokens
    if flag == 1:
        #unigrams
        featureVector = filteredTokens
    elif flag == 2:
        #bigrams
        featureVector = list(nltk.bigrams(filteredTokens))
        #Convert each bigram tuple to a string
        featureVector = [' '.join(bigram) for bigram in featureVector]
    else:
        #mix of unigrams and bigrams
        featureVector = list(nltk.everygrams(filteredTokens, max_len=2))
        #Convert any tuple of n-grams to a string
        featureVector = [' '.join(gram) if type(gram) == tuple else gram
                         for gram in featureVector]

    return featureVector
            
Overwriting getFeatureVector.py
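
After importing the function from the file it was just written to, a quick check on a made-up (already pre-processed) tweet shows the mix of unigrams and bigrams it produces:

from getFeatureVector import getFeatureVector
print(getFeatureVector('USER_NAME I love this movie URL'))
#Expected output along the lines of ['love', 'movie', 'love movie']
#(the exact ordering may differ between NLTK versions; it does not
# matter for a bag-of-words vectorizer)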
In [9]:
def getBatch(tweet_gen, size):
    """
    This function takes as arguments
    the tweet generator tweet_gen and the batch size
    desired. It returns two lists for the tweets and labels
    whose length is the batch size
    """
    tweets, labels = [], []
    for _ in range(size):
        tweet, label = next(tweet_gen)
        tweets.append(tweet)
        labels.append(label)
    return tweets, labels
In [10]:
def getBatchTest(tweet_gen_test, size):
    """
    This function is used to generate the batches
    for the test data
    """
    tweets, labels = [], []
    for _ in range(size):
        tweet, label = next(tweet_gen_test)
        tweets.append(tweet)
        labels.append(label)
    return tweets, labels

Define a HashingVectorizer which takes the tokenizing function as one of its arguments

In [11]:
from sklearn.feature_extraction.text import HashingVectorizer
vector = HashingVectorizer(decode_error = 'ignore',
                          n_features = 2**21,
                          preprocessor = None,
                          non_negative=True,
                          encoding='utf-8',
                          tokenizer = getFeatureVector)

Note: It is important that non_negative is set to true. From the scikit-learn homepage:

Since the hash function might cause collisions between (unrelated) features, a signed hash function is used and the sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature’s value is zero. If non_negative=True is passed to the constructor, the absolute value is taken. This undoes some of the collision handling, but allows the output to be passed to estimators like sklearn.naive_bayes.MultinomialNB or sklearn.feature_selection.chi2 feature selectors that expect non-negative inputs.
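
Note that non_negative was deprecated and later removed in newer scikit-learn releases; on a recent version the roughly equivalent construction (an assumption, not what was run here) would use alternate_sign=False instead:

vector = HashingVectorizer(decode_error='ignore',
                           n_features=2**21,
                           preprocessor=None,
                           alternate_sign=False, #replaces non_negative=True
                           encoding='utf-8',
                           tokenizer=getFeatureVector)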

Here are the different classifiers

In [12]:
from sklearn.naive_bayes import MultinomialNB
MNBclassifier = MultinomialNB(alpha = 0.01)
tweet_gen = tweetGenerator(df)
In [13]:
from sklearn.naive_bayes import BernoulliNB
BernoulliNBclassifier = BernoulliNB(alpha = 0.01)
tweet_gen = tweetGenerator(df)
In [14]:
from sklearn.linear_model import SGDClassifier
SGDclassifier = SGDClassifier(loss='log', random_state=1, n_iter=1)
tweet_gen = tweetGenerator(df)
In [15]:
from sklearn.linear_model import PassiveAggressiveClassifier
PassiveAggressiveclassifier = PassiveAggressiveClassifier(C=1.0, fit_intercept=True, n_iter=5, shuffle=True)
tweet_gen = tweetGenerator(df)

PyPrind is a Python library that provides a progress bar for iterative processes. We will call partial_fit on each classifier for every batch of the training data and monitor the progress.

In [16]:
import pyprind
batchSize = 20000
totalTweets = len(df)
#totalTweets = 15000
iterations = totalTweets/batchSize
progressBar = pyprind.ProgBar(iterations)

classes = np.array([0, 4]) #0 = negative, 4 = positive
for i in range(iterations):
    X_train, y_train = getBatch(tweet_gen, size=batchSize)
    X_train = vector.transform(X_train)
    MNBclassifier.partial_fit(X_train, y_train, classes=classes)
    BernoulliNBclassifier.partial_fit(X_train, y_train, classes=classes)
    SGDclassifier.partial_fit(X_train, y_train, classes=classes)
    PassiveAggressiveclassifier.partial_fit(X_train, y_train, classes=classes)
    progressBar.update()
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:18:52

Test and score each classifier using the test data

In [17]:
tweet_gen_test = tweetGenerator(df_test)
X_test, y_test = getBatchTest(tweet_gen_test, size=df_test.shape[0])
X_test = vector.transform(X_test)
In [18]:
print('Accuracy: %.3f' % MNBclassifier.score(X_test, y_test))
Accuracy: 0.751
In [19]:
print('Accuracy: %.3f' % BernoulliNBclassifier.score(X_test, y_test))
Accuracy: 0.707
In [20]:
print('Accuracy: %.3f' % SGDclassifier.score(X_test, y_test))
Accuracy: 0.788
In [21]:
print('Accuracy: %.3f' % PassiveAggressiveclassifier.score(X_test, y_test))
Accuracy: 0.796

Pickle the tokenizer, HashingVectorizer, and classifiers

In [40]:
import dill


dill.dump(vector,open('./Classifiers/HashingVectorizer.pkl', 'wb'))
In [39]:
num = len(df)/1000
dill.dump(MNBclassifier, open('./Classifiers/MNBclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

dill.dump(BernoulliNBclassifier, open('./Classifiers/\
BernoulliNBclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

dill.dump(SGDclassifier, open('./Classifiers/\
SGDclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

dill.dump(PassiveAggressiveclassifier, open('./Classifiers/\
PassiveAggressiveclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))
In [38]:
from getFeatureVector import getFeatureVector
dill.dump(getFeatureVector, open('./Classifiers/tokenizer.pkl', 'wb'))
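
To reuse the persisted artifacts later, for example in a separate scoring script, they can be loaded back with dill and applied to new text. A minimal sketch, assuming an 800k run as above and that NLTK and its corpora are available in the loading environment:

import dill

vector = dill.load(open('./Classifiers/HashingVectorizer.pkl', 'rb'))
clf = dill.load(open('./Classifiers/PassiveAggressiveclassifier_OutofCore_800k.pkl', 'rb'))

newTweets = ['I love this movie', 'worst customer service ever']
X_new = vector.transform(newTweets)
print(clf.predict(X_new)) #0 = negative, 4 = positive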


Note on improving performance

The performance of the models can be improved by pairing the capabilities of the NLTK suite with those of the scikit-learn library. We can run a grid search with cross-validation on each algorithm to select the optimal set of hyperparameters, and then use those hyperparameters when training the final models, as sketched below. A full demonstration will follow in a later post.
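
As a rough sketch of what that tuning could look like, the grid search below runs on a 100,000-tweet subsample so that it fits in memory (the subsample size and parameter grid are illustrative assumptions, and scikit-learn 0.18+ is assumed for sklearn.model_selection):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#Vectorize a subsample that fits comfortably in memory
tweets, labels = getBatch(tweetGenerator(df), size=100000)
X_sub = vector.transform(tweets)

paramGrid = {'alpha': [1e-6, 1e-5, 1e-4, 1e-3],
             'penalty': ['l2', 'elasticnet']}
search = GridSearchCV(SGDClassifier(loss='log', random_state=1),
                      paramGrid, cv=5, scoring='accuracy')
search.fit(X_sub, labels)
print(search.best_params_, search.best_score_)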
