## Sentiment Analysis Model Using NLP: Part I

# Building a Model for Sentiment Analysis Using Natural Language Processing: Part I

### Goal

In this post, I will build and evaluate a text classification model for sentiment analysis. In Part I, I learn from the data by loading all of it into the computer's main memory. In Part II, I will explore out-of-core learning techniques, which allow training on large data sets when memory is limited.

### A note on the data

The training data used for this analysis is the Stanford Twitter corpus: 1.6 million English-language tweets that have been automatically annotated for negative and positive sentiment using emoticons. The test data is also from the Stanford group and consists of manually annotated tweets, 177 reflecting negative sentiment and 182 reflecting positive sentiment. The data sets can be found here: http://help.sentiment140.com/for-students

In [1]:
import numpy as np
import pandas as pd
import nltk
import re #for regex
import cPickle as pickle

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.util import ngrams
from nltk.classify.scikitlearn import SklearnClassifier

from nltk import NaiveBayesClassifier


In the loadData function, we load the data from CSV into Pandas dataframes, shuffle the rows so as to get a mix of all sentiments, and remove unnecessary columns. We then pickle the dataframes to disk to allow for faster loading.

In [2]:
def loadData():
    """
    Load the CSV files with the training and test data
    """
    header = ['polarity', 'tweet_id', 'date', 'query', 'user', 'tweet']

    #Read both files, supplying the column names.
    #NOTE: names=header is an assumption, and the paths are as far as
    #they are known; adjust them to where the files live on your machine
    df_train = pd.read_csv('Data/training.1600000.processed.noemoticon.csv',
                           names=header)
    df_test = pd.read_csv('Data/testdata.manual.2009.06.14.csv',
                          names=header)

    #Shuffle the rows so that you get a mix of pos, neutral, and neg sentiments
    df_train = df_train.sample(frac=1).reset_index(drop=True)
    df_test = df_test.sample(frac=1).reset_index(drop=True)

    #Drop unnecessary columns
    df_train.drop(['tweet_id', 'date', 'query', 'user'], axis=1, inplace=True)
    df_test.drop(['tweet_id', 'date', 'query', 'user'], axis=1, inplace=True)

    #Pickle the data frames
    df_train.to_pickle('/Data/df_training.pkl')
    df_test.to_pickle('/Data/df_test.pkl')
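
To make the load, trim, and shuffle steps concrete without needing the real files, here is a minimal sketch that uses only the standard library (`csv` and `random` in place of pandas) and is written for Python 3. The two rows are made up for illustration, and the names `raw`, `rows`, and `kept` are mine, not part of the notebook.

```python
import csv
import io
import random

# Two made-up rows in the six-column layout used by loadData.
raw = io.StringIO(
    '"0","1","Mon Apr 06","NO_QUERY","alice","feeling sad today"\n'
    '"4","2","Mon Apr 06","NO_QUERY","bob","what a great morning"\n'
)
header = ['polarity', 'tweet_id', 'date', 'query', 'user', 'tweet']
rows = [dict(zip(header, fields)) for fields in csv.reader(raw)]

# Keep only the two columns the classifier needs; polarity becomes an int.
kept = [(int(r['polarity']), r['tweet']) for r in rows]

random.seed(0)        # fixed seed so the shuffle is repeatable
random.shuffle(kept)  # mix the classes before slicing off a subset
print(kept)
```

The shuffle matters later on, because the training subset is taken as the first `num` rows of the dataframe.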



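The preProcessTweet function normalizes a single tweet: it replaces user mentions and URLs with placeholder tokens, removes the '#' from hashtags, shortens runs of repeated characters, and removes extra white space.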

In [3]:
def preProcessTweet(tweet):
    """
    Function to pre-process the tweet
    """
    #Make sure we are working with a string
    tweet = str(tweet)

    #Replace all words preceded by '@' with 'USER_NAME'
    tweet = re.sub(r'@[^\s]+', 'USER_NAME', tweet)

    #Replace all url's with 'URL'
    tweet = re.sub(r'www\.[^\s]+|http[^\s]+', ' URL ', tweet)

    #Replace all hashtags with the word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

    #Replace words with long repeated characters with the shorter form
    tweet = re.sub(r'(.)\1{2,}', r'\1', tweet)

    #Remove any extra white space
    tweet = re.sub(r'[\s]+', ' ', tweet)

    return tweet
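
As a quick illustration of what these substitutions do to a concrete tweet, here is a small standalone sketch written for Python 3. The function name `normalize_tweet` and the sample tweet are made up for this example; the patterns mirror the steps described above, with a final `strip()` added for tidiness.

```python
import re

def normalize_tweet(text):
    """Illustrative tweet normalizer; a sketch, not the notebook's function."""
    text = re.sub(r'@\S+', 'USER_NAME', text)          # mentions -> placeholder
    text = re.sub(r'www\.\S+|http\S+', ' URL ', text)  # links -> placeholder
    text = re.sub(r'#(\S+)', r'\1', text)              # "#word" -> "word"
    text = re.sub(r'(.)\1{2,}', r'\1', text)           # 3+ repeats -> one char
    return re.sub(r'\s+', ' ', text).strip()           # squeeze white space

sample = "@bob I looooove #turtles!!! see http://example.com/t   now"
print(normalize_tweet(sample))
# prints: USER_NAME I love turtles! see URL now
```

Note that the repeated-character rule is applied after the URL rule, so a leading "www" is replaced before it could be collapsed.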


In preProcessData, we take the previously pickled dataframes, pre-process each tweet in them, and pickle the pre-processed dataframes for faster loading.

In [4]:
def preProcessData():
    """
    Obtain the pickled data and pre-process it.
    The pre-processed data is then pickled
    """
    #Load the data frames pickled by loadData
    df_train = pd.read_pickle('/Data/df_training.pkl')
    df_test = pd.read_pickle('/Data/df_test.pkl')

    #Pre-process the data
    df_train['tweet'] = df_train['tweet'].apply(preProcessTweet)
    df_test['tweet'] = df_test['tweet'].apply(preProcessTweet)

    #Pickle pre-processed data frames
    df_train.to_pickle('/Data/df_training_preprocessed.pkl')
    df_test.to_pickle('/Data/df_test_preprocessed.pkl')

    print("Training and test data is now pre-processed")


The feature_Extractor function takes a tweet (as a list of n-grams) as its argument and, for every entry in the global featureList, records whether that entry occurs in the tweet.

In [5]:
def feature_Extractor(tweet):
    """
    Takes a tweet and extracts its features
    """
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
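
To show the shape of the dictionary this produces, here is a tiny self-contained sketch written for Python 3. Unlike the function above, it takes the vocabulary as an explicit argument rather than reading a global; the name `extract_features` and the toy vocabulary are assumptions made for the example.

```python
def extract_features(tokens, vocabulary):
    """Map a list of n-grams to {'contains(w)': bool} over a fixed vocabulary."""
    present = set(tokens)  # set membership is O(1) per lookup
    return {'contains(%s)' % w: (w in present) for w in vocabulary}

vocabulary = ['love', 'sad', 'love turtle']  # toy feature list
tokens = ['love', 'turtle', 'love turtle']   # n-grams from one tweet
features = extract_features(tokens, vocabulary)
print(features)
# prints: {'contains(love)': True, 'contains(sad)': False, 'contains(love turtle)': True}
```

Every tweet gets one entry per vocabulary item, so the dictionaries grow with the size of the feature list, which is part of why memory becomes a constraint.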


getFeatureVector takes a tweet as its argument, tokenizes it, filters and lemmatizes the tokens, and returns the resulting list of n-grams.

In [6]:
def getFeatureVector(tweet):
    """
    The function takes a tweet and does some processing
    to remove stopwords, remove punctuation, lemmatize/stem
    and reject any words that are non-alpha. Depending on the
    flag selected, it will return a unigram, bigram, or a
    mix of the two. It returns a list with the filtered n-grams
    """

    flag = 3 #1 for unigram; 2 for bigram; 3 for mix

    #tokenize the tweet and convert each token to lower case
    #tokens = [token.lower() for token in word_tokenize(tweet)]
    tokens = [token.lower() for token in word_tokenize(tweet.decode('latin-1'))]

    punctuations = ["'", ":", ",", "-", ".", "!", "(", ")", "?", '"', ";"]
    stopWords = stopwords.words('english')
    stopWords.append("#")
    stopWords.append("%")
    stopWords = set(stopWords)
    lemmatizer = WordNetLemmatizer()

    #Remove stopwords, punctuation, 'url', and 'user_name'
    filteredTokens = []
    featureVector = []
    for token in tokens:
        if (token in punctuations or token in stopWords):
            continue
        elif (token == 'url' or token == 'user_name'):
            continue
        elif not token.isalpha(): #reject non-alpha tokens
            continue
        else:
            #Normalize the tokens, either by stemming or lemmatization
            #I might also have to tag the tokens with Parts of Speech
            #<lemmatize words>
            token = lemmatizer.lemmatize(token)

        #This is the feature vector for each tweet
        filteredTokens.append(token)

    if flag == 1:
        #unigrams
        featureVector = filteredTokens
    elif flag == 2:
        #bigrams
        featureVector = list(nltk.bigrams(filteredTokens))
        if featureVector != []: #ensure it is not an empty list
            #Convert the tuple of bigrams to a string
            featureVector = [' '.join(bigram) for bigram in featureVector]
    else:
        #mixgrams
        featureVector = list(nltk.everygrams(filteredTokens, max_len=2))
        if featureVector != []:
            #Convert any tuple of n-grams to a string
            temp = []
            for everygram in featureVector:
                if type(everygram) == tuple:
                    everygram = ' '.join(everygram)
                temp.append(everygram)
            featureVector = temp

    return featureVector
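
The "mix" mode is easiest to understand on a small input. The sketch below, written for Python 3, builds unigrams and bigrams by hand with `zip` instead of calling NLTK, so the ordering of the result may differ from what `nltk.everygrams` returns. The function name `unigrams_and_bigrams` is made up for the example.

```python
def unigrams_and_bigrams(tokens):
    """Return every single token plus every adjacent pair joined by a space."""
    unigrams = list(tokens)
    # zip pairs each token with its right-hand neighbour.
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return unigrams + bigrams

print(unigrams_and_bigrams(['wish', 'could', 'sleep']))
# prints: ['wish', 'could', 'sleep', 'wish could', 'could sleep']
```

Because stopwords are removed before the n-grams are built, a bigram such as "wish could" can join words that were not adjacent in the original tweet.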


In [7]:
def getFeatures(df):
    """
    This function obtains features from a data set using a Bag of Words or Bag of n-grams approach
    """
    tweets = []
    allWords = []

    #The choice of unigram (Bag of words), bigram, or mixed n-grams
    #is made by the flag inside getFeatureVector

    for row in df.itertuples():
        #row[0] is the index; the remaining columns are polarity and tweet
        polarity = row[1]
        tweet = row[2]

        #Obtain the feature vector for each tweet
        featureVector = getFeatureVector(tweet)

        #tweets is a list containing tuples of filtered n-grams
        #and their respective sentiments
        tweets.append((featureVector, polarity))

        #Get list of all words/n-grams from all the tweets
        allWords.extend(featureVector)

    #Return dict with the frequency distribution of each word/n-gram
    wordDist = nltk.FreqDist(allWords)

    #Get a list of the features with each word/n-gram in the dist as a feature
    featureList = list(wordDist.keys())

    return featureList, tweets

In [8]:
def generateTrainFeatureList(num):

    #Load the pre-processed training data and keep the first num tweets
    df = pd.read_pickle('/Data/df_training_preprocessed.pkl')
    df = df[:num]

    #Extract the data set
    featureList, tweets = getFeatures(df)

    #Pickle the feature list and tweets
    num = num // 1000
    pickle.dump(featureList, open('featureList_train_{0}k.pkl'.format(num), 'wb'))
    pickle.dump(tweets, open('tweets_train_{0}k.pkl'.format(num), 'wb'))

    print("Pickle of train feature list and {0}k tweets successful".format(num))

In [9]:
def generateTestFeatureList():

    #Load the pre-processed test data
    df = pd.read_pickle('/Data/df_test_preprocessed.pkl')

    #Drop neutral polarity rows in test data (I don't have a neutral class in my training set)
    df = df[df.polarity != 2]

    #Extract the data set
    featureList, tweets = getFeatures(df)

    #Pickle the feature list and tweets
    pickle.dump(featureList, open('featureList_test.pkl', 'wb'))
    pickle.dump(tweets, open('tweets_test.pkl', 'wb'))

    print("Pickle of test feature list and tweets successful")
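
Both functions follow the same pattern: do the slow work once, pickle the result, and load it from disk afterwards. One way to package that pattern is sketched below for Python 3. The helper `load_or_build`, the stand-in `build_features`, and the temporary file name are all hypothetical and not part of the notebook.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the pickled object at `path`, building and saving it on first use."""
    if os.path.exists(path):
        with open(path, 'rb') as fh:
            return pickle.load(fh)
    result = build()
    with open(path, 'wb') as fh:
        pickle.dump(result, fh)
    return result

calls = []

def build_features():
    calls.append(1)  # record that the slow path ran
    return ['sad', 'love', 'wish could']

cache_path = os.path.join(tempfile.mkdtemp(), 'featureList_demo.pkl')
first = load_or_build(cache_path, build_features)   # builds and pickles
second = load_or_build(cache_path, build_features)  # served from disk
print(first == second, len(calls))
# prints: True 1
```

Using `with` blocks also closes the files promptly, which the bare `open(...)` calls in the cells above leave to the garbage collector.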


#### Pre-process the data

In [ ]:
preProcessData()


#### Generate the train feature list and tweets list. generateTrainFeatureList takes the number of tweets to train on as its argument

In [10]:
generateTrainFeatureList(10000)

Pickle of train feature list and 10k tweets successful


#### Generate the test feature list and tweets list

In [11]:
generateTestFeatureList()

Pickle of test feature list and tweets successful


## Classify

### Naive Bayes Classifier

#### Obtain the training set

In [12]:
featureList = pickle.load(open('featureList_train_10k.pkl', 'rb'))
tweets = pickle.load(open('tweets_train_10k.pkl', 'rb'))
training_set = nltk.classify.apply_features(feature_Extractor, tweets)


#### Train and pickle the Naive Bayes classifier

In [13]:
NBclassifier = nltk.NaiveBayesClassifier.train(training_set)
pickle.dump(NBclassifier, open('NBclassifier_10K.pkl', 'wb'))


#### Here is a quick test of the Naive Bayes classifier

In [14]:
testtweet = "I love turtles"
processedTweet = preProcessTweet(testtweet)
feature_vec = getFeatureVector(processedTweet)
features =  feature_Extractor(feature_vec)

In [15]:
#Polarity 0 is negative; 4 is positive
if NBclassifier.classify(features) == 0:
    print("Negative")
else:
    print("Positive")

Positive


#### Obtain the test set

In [16]:
featureList = pickle.load(open('featureList_test.pkl', 'rb'))
tweets = pickle.load(open('tweets_test.pkl', 'rb'))
testing_set = nltk.classify.apply_features(feature_Extractor, tweets)


#### Load and evaluate the pickled Naive Bayes classifier

In [17]:
f = open('NBclassifier_10K.pkl', 'rb')
NBclassifier = pickle.load(f)
f.close()

In [18]:
accuracy = nltk.classify.accuracy(NBclassifier, testing_set )*100
print("Classification accuracy is %.2f %%:" % accuracy)

Classification accuracy is 74.02 %:


#### Show the most valuable words

In [19]:
NBclassifier.show_most_informative_features(20)

Most Informative Features
contains(sad) = True                0 : 4      =     24.8 : 1.0
contains(died) = True                0 : 4      =     17.1 : 1.0
contains(welcome) = True                4 : 0      =     15.5 : 1.0
contains(anymore) = True                0 : 4      =     13.5 : 1.0
contains(poor) = True                0 : 4      =     13.4 : 1.0
contains(sick) = True                0 : 4      =     13.2 : 1.0
contains(ca find) = True                0 : 4      =     13.1 : 1.0
contains(hurt) = True                0 : 4      =     12.1 : 1.0
contains(shame) = True                0 : 4      =     11.7 : 1.0
contains(headache) = True                0 : 4      =     11.6 : 1.0
contains(upset) = True                0 : 4      =     10.4 : 1.0
contains(lonely) = True                0 : 4      =     10.4 : 1.0
contains(wish could) = True                0 : 4      =     10.2 : 1.0
contains(ugh) = True                0 : 4      =      9.8 : 1.0
contains(throat) = True                0 : 4      =      9.7 : 1.0
contains(bo) = True                0 : 4      =      9.7 : 1.0
contains(sorry hear) = True                0 : 4      =      9.7 : 1.0
contains(wtf) = True                0 : 4      =      9.7 : 1.0
contains(first time) = True                4 : 0      =      9.6 : 1.0
contains(go away) = True                0 : 4      =      9.1 : 1.0


Note: the label 0 stands for negative and 4 for positive, and the last column is a likelihood ratio, i.e. how much more likely a feature is to appear in one class than in the other. So, for example, "sad" is 24.8 times more likely to appear in a negative tweet than in a positive one.
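
The ratio itself is simple arithmetic, as the Python 3 sketch below shows. The counts are made up purely so that the result comes out at 24.8, and `likelihood_ratio` is a hypothetical helper. As far as I know, NLTK smooths its probability estimates, so the figures it reports will not match raw count ratios exactly.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(feature | class a) / P(feature | class b) from raw counts."""
    p_a = count_a / total_a  # fraction of class-a tweets with the feature
    p_b = count_b / total_b  # fraction of class-b tweets with the feature
    return p_a / p_b

# Made-up counts: "sad" in 124 of 5000 negative and 5 of 5000 positive tweets.
ratio = likelihood_ratio(124, 5000, 5, 5000)
print(round(ratio, 1))
# prints: 24.8
```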

### Final words on this analysis

In this analysis, I found that the size of the dataset I could use for training was limited by the need to load the full dataset into memory. In the next part, I will explore out-of-core techniques for text classification.