## Sentiment Analysis Model Using NLP: Part II -- Out-of-Core Classification

# Building a Model for Sentiment Analysis Using Natural Language Processing, Part II

In a previous post, I demonstrated how to build a model for sentiment analysis using text analysis and natural language processing (NLP). In that post, memory limitations meant I could train on only 20,000 tweets out of a corpus of 1.6 million. In this analysis, I explore out-of-core techniques for classifying text documents, which train on the data in batches so that the whole corpus never has to fit in memory.

### A note on the data

The training data used for this analysis is the Stanford Twitter corpus of 1.6 million English-language tweets, automatically annotated for negative and positive sentiment using emoticons. The test data is also from the Stanford group and consists of manually annotated tweets, 177 reflecting negative sentiment and 182 reflecting positive sentiment. The data sets can be found here: http://help.sentiment140.com/for-students

In [1]:
import numpy as np
import pandas as pd
import nltk
import re #for regex
import pickle #cPickle only exists on Python 2

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.util import ngrams
from sklearn.naive_bayes import MultinomialNB,BernoulliNB


### Get the raw data and do some data processing

In [2]:
def loadData():
    """
    Load the csv files with the training and test data
    """
    header = ['polarity', 'tweet_id', 'date', 'query', 'user', 'tweet']

    #The csv files have no header row, so supply the column names
    df_train = pd.read_csv('Data/training.1600000.processed.noemoticon.csv',
                           names=header)
    df_test = pd.read_csv('Data/testdata.manual.2009.06.14.csv',
                          names=header)

    #Shuffle the rows so that you get a mix of pos, neutral, and neg sentiments
    df_train = df_train.sample(frac=1).reset_index(drop=True)
    df_test = df_test.sample(frac=1).reset_index(drop=True)

    #Drop unnecessary columns
    df_train.drop(['tweet_id', 'date', 'query', 'user'], axis=1, inplace=True)
    df_test.drop(['tweet_id', 'date', 'query', 'user'], axis=1, inplace=True)

    #Pickle the data frames
    df_train.to_pickle('/Data/df_training.pkl')
    df_test.to_pickle('/Data/df_test.pkl')


In [3]:
def preProcessTweet(tweet):
    """
    Function to pre-process the tweet
    """
    #Make sure we are working with a string
    tweet = str(tweet)

    #Replace all words preceded by '@' with 'USER_NAME'
    tweet = re.sub(r'@[^\s]+', 'USER_NAME', tweet)

    #Replace all url's with 'URL'
    tweet = re.sub(r'(?:www\.|https?://)[^\s]+', ' URL ', tweet)

    #Replace all hashtags with the word
    tweet = re.sub(r'#(\w+)', r'\1', tweet)

    #Replace words with long repeated char with the shorter form
    tweet = re.sub(r'(.)\1{2,}', r'\1', tweet)

    #Remove any extra white space
    tweet = re.sub(r'[\s]+', ' ', tweet).strip()

    return tweet
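
As a quick check of the cleaning steps described above, here is a minimal, self-contained sketch. The function name `clean_tweet` and the sample tweet are hypothetical, and the regular expressions are one reasonable reading of the comments in the cell (replace user names, replace URLs, unwrap hashtags, collapse long character runs), not necessarily the exact patterns used to produce the results in this post.

```python
import re

def clean_tweet(tweet):
    """Hypothetical stand-in for the pre-processing step."""
    tweet = re.sub(r'@\S+', 'USER_NAME', tweet)                # user mentions
    tweet = re.sub(r'(?:www\.|https?://)\S+', 'URL', tweet)    # links
    tweet = re.sub(r'#(\w+)', r'\1', tweet)                    # '#word' -> 'word'
    tweet = re.sub(r'(.)\1{2,}', r'\1', tweet)                 # 'looooove' -> 'love'
    tweet = re.sub(r'\s+', ' ', tweet)                         # extra white space
    return tweet.strip()

sample = "@bob I looooove #python   see http://example.com"
cleaned = clean_tweet(sample)
print(cleaned)  # USER_NAME I love python see URL
```

Note that collapsing a run of three or more identical characters down to a single character also shortens words that legitimately contain a double letter inside a longer run (for example, "coool" becomes "col"), which is a trade-off of this simple rule.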

In [4]:
def preProcessData():
    """
    Load the pickled data and pre-process it.
    The pre-processed data is then pickled
    """
    #Load the pickled data frames written by loadData
    df_train = pd.read_pickle('/Data/df_training.pkl')
    df_test = pd.read_pickle('/Data/df_test.pkl')

    #Pre-process the data
    df_train['tweet'] = df_train['tweet'].apply(preProcessTweet)
    df_test['tweet'] = df_test['tweet'].apply(preProcessTweet)

    #Pickle pre-processed data frames
    df_train.to_pickle('/Data/df_training_preprocessed.pkl')
    df_test.to_pickle('/Data/df_test_preprocessed.pkl')

    print("Training and test data is now pre-processed")


## Implementation of the out-of-core technique

### Get the pre-processed pickled data

In [5]:
df = pd.read_pickle('/Data/df_training_preprocessed.pkl')
df = df[:800000]

In [6]:
df_test = pd.read_pickle('/Data/df_test_preprocessed.pkl')
df_test = df_test[df_test.polarity !=2]

In [7]:
def tweetGenerator(df):
    """
    This function takes a tweet from the data frame
    and returns the text of the tweet as well as the sentiment label
    """
    for row in df.itertuples():
        label = row[1] #polarity column (row[0] is the index)
        tweet = row[2] #tweet text
        yield tweet, label

In [37]:
%%writefile getFeatureVector.py
#The file is imported as a module later, so it needs its own imports
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def getFeatureVector(tweet):
    """
    The function takes a tweet and does some processing
    to remove stopwords, remove punctuation, lemmatize
    and reject any words that are non-alpha. Depending on the
    flag selected, it will return a unigram, bigram, or a
    mix of the two. It returns a list with the filtered n-grams
    """

    flag = 3 #1 for unigram; 2 for bigram; 3 for mix

    #Tokenize the tweet and convert each token to lower case
    tokens = [token.lower() for token in word_tokenize(tweet)]

    punctuations = ["'", ":", ",", "-", ".", "!", "(", ")", "?", '"', ";"]
    stopWords = stopwords.words('english')
    stopWords.append("#")
    stopWords.append("%")
    stopWords = set(stopWords)
    lemmatizer = WordNetLemmatizer()

    #Remove stopwords, punctuation, 'url', and 'user_name'
    filteredTokens = []
    featureVector = []
    for token in tokens:
        if (token in punctuations or token in stopWords):
            continue
        elif (token == 'url' or token == 'user_name'):
            continue
        elif not token.isalpha(): #Reject non-alpha tokens
            continue
        else:
            #Lemmatize the tokens
            token = lemmatizer.lemmatize(token)
            filteredTokens.append(token)

    #This is the feature vector for each tweet
    if flag == 1:
        #unigrams
        featureVector = filteredTokens
    elif flag == 2:
        #bigrams: convert each tuple to a string
        featureVector = [' '.join(bigram)
                         for bigram in nltk.bigrams(filteredTokens)]
    else:
        #mixgrams: everygrams yields tuples of length 1 and 2
        featureVector = [' '.join(everygram)
                         for everygram in nltk.everygrams(filteredTokens, max_len=2)]

    return featureVector


Overwriting getFeatureVector.py
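
To make the "mix" setting concrete, the sketch below builds unigrams and bigrams from an already-filtered token list using plain Python. The helper `mixed_ngrams` and the sample tokens are hypothetical; the tokenizer above relies on `nltk.everygrams` for this, which may emit the n-grams in a different order. The order does not matter once the features are hashed.

```python
def mixed_ngrams(tokens, max_len=2):
    """Return all n-grams of length 1..max_len as space-joined strings."""
    grams = []
    for n in range(1, max_len + 1):
        # Slide a window of width n over the token list
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

features = mixed_ngrams(["love", "sunny", "day"])
print(features)
# ['love', 'sunny', 'day', 'love sunny', 'sunny day']
```

Adding bigrams lets the classifier see short phrases such as negations, at the cost of roughly doubling the number of features per tweet.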

In [9]:
def getBatch(tweet_gen, size):
    """
    This function takes as arguments
    the tweet generator tweet_gen and the batch size
    desired. It returns two lists for the tweets and labels
    whose length is the batch size
    """
    tweets, labels = [], []
    for _ in range(size):
        tweet, label = next(tweet_gen)
        tweets.append(tweet)
        labels.append(label)
    return tweets, labels

In [10]:
def getBatchTest(tweet_gen_test, size):
    """
    This function is used to generate the batches
    for the test data
    """
    tweets, labels = [], []
    for _ in range(size):
        tweet, label = next(tweet_gen_test)
        tweets.append(tweet)
        labels.append(label)
    return tweets, labels
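
The two batching helpers above assume the generator always has at least `size` items left; otherwise `next` raises `StopIteration`. A minimal alternative sketch, assuming it is acceptable for the last batch to be shorter, uses `itertools.islice`. The names `tweet_stream` and `get_batch` and the toy rows are hypothetical and stand in for the data frame used in this post.

```python
from itertools import islice

def tweet_stream(rows):
    """Yield (tweet, label) pairs from an iterable of (label, tweet) rows."""
    for label, tweet in rows:
        yield tweet, label

def get_batch(stream, size):
    """Take up to `size` items from the stream; returns two lists."""
    batch = list(islice(stream, size))
    if not batch:
        return [], []          # stream exhausted
    tweets, labels = zip(*batch)
    return list(tweets), list(labels)

rows = [(0, "bad day"), (4, "great day"), (4, "love it"), (0, "so sad"), (4, "yay")]
stream = tweet_stream(rows)
first = get_batch(stream, 2)
second = get_batch(stream, 2)
third = get_batch(stream, 2)   # only one item is left
fourth = get_batch(stream, 2)  # nothing is left
```

With this variant the training loop can simply stop when an empty batch comes back, so the total number of tweets does not need to be a multiple of the batch size.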


### Define a HashingVectorizer which takes the tokenizing function as one of its arguments

In [11]:
from sklearn.feature_extraction.text import HashingVectorizer
vector = HashingVectorizer(decode_error = 'ignore',
                           n_features = 2**21,
                           preprocessor = None,
                           non_negative=True,
                           encoding='utf-8',
                           tokenizer = getFeatureVector)


Note: It is important that non_negative is set to True. From the scikit-learn documentation:

Since the hash function might cause collisions between (unrelated) features, a signed hash function is used and the sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature’s value is zero. If non_negative=True is passed to the constructor, the absolute value is taken. This undoes some of the collision handling, but allows the output to be passed to estimators like sklearn.naive_bayes.MultinomialNB or sklearn.feature_selection.chi2 feature selectors that expect non-negative inputs.
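
The idea in the quoted passage can be illustrated with a small pure-Python sketch of the hashing trick. This is not scikit-learn's implementation: CRC32 is used here as a stand-in for the MurmurHash3 function that scikit-learn uses, and the helper name `hashed_counts`, the tiny `n_features`, and the sample tokens are all assumptions made for illustration. In scikit-learn 0.21 and later the `non_negative` argument no longer exists, and `alternate_sign=False` plays a similar role.

```python
import zlib

def hashed_counts(tokens, n_features=16, signed=True):
    """Map tokens to a fixed-length count vector via hashing."""
    vec = [0] * n_features
    for token in tokens:
        h = zlib.crc32(token.encode('utf-8')) & 0xffffffff
        index = h % n_features                 # which column the token lands in
        # Use the top bit of the hash as the sign when signed hashing is on
        sign = -1 if (signed and (h >> 31) & 1) else 1
        vec[index] += sign
    return vec

tokens = ["good", "bad", "good", "day"]
signed_vec = hashed_counts(tokens, signed=True)
abs_vec = [abs(v) for v in signed_vec]          # what taking absolute values does
unsigned_vec = hashed_counts(tokens, signed=False)
```

No vocabulary is stored anywhere, which is what makes the vectorizer suitable for out-of-core learning: each batch can be transformed independently with the same fixed number of columns.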

### Here are the different classifiers

In [12]:
from sklearn.naive_bayes import MultinomialNB
MNBclassifier = MultinomialNB(alpha = 0.01)
tweet_gen = tweetGenerator(df)

In [13]:
from sklearn.naive_bayes import BernoulliNB
BernoulliNBclassifier = BernoulliNB(alpha = 0.01)
tweet_gen = tweetGenerator(df)

In [14]:
from sklearn.linear_model import SGDClassifier
SGDclassifier = SGDClassifier(loss='log', random_state=1, n_iter=1)
tweet_gen = tweetGenerator(df)

In [15]:
from sklearn.linear_model import PassiveAggressiveClassifier
PassiveAggressiveclassifier = PassiveAggressiveClassifier(C=1.0, fit_intercept=True, n_iter=5, shuffle=True)
tweet_gen = tweetGenerator(df)


### Pyprind is a Python package that provides a progress bar for iterative processes. We will iteratively call each classifier for each batch of the training data and monitor the progress

In [16]:
import pyprind
batchSize = 20000
totalTweets = len(df)
#totalTweets = 15000
iterations = totalTweets // batchSize #integer number of full batches
progressBar = pyprind.ProgBar(iterations)

classes = np.array([0, 4])
for i in range(iterations):
    #Every classifier sees the same batch of hashed features
    X_train, y_train = getBatch(tweet_gen, size=batchSize)
    X_train = vector.transform(X_train)
    MNBclassifier.partial_fit(X_train, y_train, classes=classes)
    BernoulliNBclassifier.partial_fit(X_train, y_train, classes=classes)
    SGDclassifier.partial_fit(X_train, y_train, classes=classes)
    PassiveAggressiveclassifier.partial_fit(X_train, y_train, classes=classes)
    progressBar.update()

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:18:52
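
For readers who want to try the batch-wise training pattern on something small, here is a minimal sketch on toy data. It assumes a recent scikit-learn release (0.21 or later), so it passes `alternate_sign=False` where the vectorizer in this post uses `non_negative=True`, and it uses the default word tokenizer instead of `getFeatureVector`. The toy tweets, labels, and batch size are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy (tweet, label) pairs; 0 = negative, 4 = positive as in the corpus
toy = [("i love this", 4), ("great day", 4), ("i hate this", 0),
       ("awful day", 0), ("love it", 4), ("hate it", 0)]

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB(alpha=0.01)
classes = np.array([0, 4])   # all labels must be declared up front

batch_size = 2
for start in range(0, len(toy), batch_size):
    batch = toy[start:start + batch_size]
    texts = [text for text, _ in batch]
    labels = [label for _, label in batch]
    X = vectorizer.transform(texts)              # stateless, no fit needed
    clf.partial_fit(X, labels, classes=classes)  # update the model in place

pred = clf.predict(vectorizer.transform(["love this day"]))
```

The `classes` argument matters because any single batch may not contain every label, so `partial_fit` has to be told the full label set in advance.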


### Test and score each classifier using the test data

In [17]:
tweet_gen_test = tweetGenerator(df_test)
X_test, y_test = getBatchTest(tweet_gen_test, size=df_test.shape[0])
X_test = vector.transform(X_test)

In [18]:
print('Accuracy: %.3f' % MNBclassifier.score(X_test, y_test))

Accuracy: 0.751

In [19]:
print('Accuracy: %.3f' % BernoulliNBclassifier.score(X_test, y_test))

Accuracy: 0.707

In [20]:
print('Accuracy: %.3f' % SGDclassifier.score(X_test, y_test))

Accuracy: 0.788

In [21]:
print('Accuracy: %.3f' % PassiveAggressiveclassifier.score(X_test, y_test))

Accuracy: 0.796


### Pickle the tokenizer, HashingVectorizer, and classifier

In [40]:
import dill

dill.dump(vector,open('./Classifiers/HashingVectorizer.pkl', 'wb'))

In [39]:
num = len(df) // 1000 #integer division keeps the file name free of a decimal point
dill.dump(MNBclassifier, open('./Classifiers/MNBclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

dill.dump(BernoulliNBclassifier, open('./Classifiers/\
BernoulliNBclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

dill.dump(SGDclassifier, open('./Classifiers/\
SGDclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

dill.dump(PassiveAggressiveclassifier, open('./Classifiers/\
PassiveAggressiveclassifier_OutofCore_{0}k.pkl'.format(num), 'wb'))

In [38]:
from getFeatureVector import getFeatureVector
dill.dump(getFeatureVector, open('./Classifiers/tokenizer.pkl', 'wb'))
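
Saved models are only useful if they can be loaded again, so here is a minimal round-trip sketch. It uses the standard-library `pickle` module and a temporary directory in place of `dill` and the `./Classifiers/` folder, a toy classifier in place of the trained ones, and assumes scikit-learn 0.21 or later; the file name `MNBclassifier_demo.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train a toy classifier so there is something to save
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB(alpha=0.01)
clf.partial_fit(vectorizer.transform(["love it", "hate it"]), [4, 0],
                classes=[0, 4])

# Write the classifier to disk; 'with' closes the file handle for us
out_dir = tempfile.mkdtemp()
clf_path = os.path.join(out_dir, "MNBclassifier_demo.pkl")
with open(clf_path, "wb") as handle:
    pickle.dump(clf, handle)

# Load it back and confirm it predicts the same label
with open(clf_path, "rb") as handle:
    restored = pickle.load(handle)

sample = vectorizer.transform(["love it"])
same = bool((restored.predict(sample) == clf.predict(sample)).all())
```

A pickled model should be loaded with the same library versions that produced it, since the on-disk format is not guaranteed to be stable across releases.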


### Note on improving performance

The performance of the models can be improved by combining the capabilities of NLTK with those of the scikit-learn library. We can run a grid search with cross-validation on each algorithm to select the optimum set of hyperparameters, and then use cross-validation when training the models with those hyperparameters. I will demonstrate this in a later post.
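
As a rough idea of what such a search could look like, here is a small sketch on toy data. It is not the tuning procedure behind the numbers in this post: the texts, the candidate `alpha` values, and the two-fold split are assumptions chosen only so the example runs quickly, and it assumes scikit-learn 0.21 or later.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["love it", "great day", "so happy", "best ever",
         "hate it", "awful day", "so sad", "worst ever"]
labels = [4, 4, 4, 4, 0, 0, 0, 0]

# Vectorizer and classifier are tuned together as one estimator
pipeline = Pipeline([
    ("vec", HashingVectorizer(n_features=2**18, alternate_sign=False)),
    ("clf", MultinomialNB()),
])

# Candidate smoothing values for the Naive Bayes step
param_grid = {"clf__alpha": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
best_alpha = search.best_params_["clf__alpha"]
```

Grid search as sketched here fits on data held in memory, so for the full corpus it would have to run on a sample of the tweets rather than on all 1.6 million.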
