Long-term analysis of emotion, sentiment, and aging using photos and text.
The goal of this project is to quantitatively monitor the emotional and physical changes of an individual over time. My thesis is that if you can accurately measure emotional or physical change over time, you can objectively pinpoint how an environmental change (a career change, moving to a new city, starting or ending a relationship, or starting a new habit like going to the gym) affected your physical and emotional health. This can lead to important insights both at the individual level and for a population as a whole.
We start with two types of datasets. The first is a CSV of emotion, age, and ethnicity as output by a photographic emotion detection library by Affectiva. The second type is a set of CSVs of sentiment analysis data from iMessage, Facebook Messenger, 750 Words, and Day One. This data all lives in the /data directory and is further organized into a directory per user - for example, /data/a refers to user a's data. I've used letters to label users to protect their privacy. You can read more about the tools I've used to generate these CSVs here.
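To make the assumed layout concrete, here's a rough sketch of the directory structure this notebook expects. The file names are inferred from the read_csv calls later in the notebook; the exact set of files per user may vary:

# Hypothetical layout, inferred from how the CSVs are read below:
#
#   data/
#     a/
#       lifeslice.csv   <- Affectiva emotion, age, and ethnicity data
#       imessage.csv    <- sentiment.comparative scores
#       facebook.csv
#       dayone.csv
#       750words.csv
#     b/
#       ...
#
# Each CSV is expected to have 'date' and 'time' columns, which are parsed
# into a single date_time index when the file is loaded.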
First, I'll import a few libraries. I'll use Pandas to clean up the data and then Plotly to visualize it.
import os
import datetime
import numpy as np
import pandas as pd
import plotly
import plotly.offline as py
Next I tell Plotly to use offline mode so that charts render directly in the notebook.
py.init_notebook_mode(connected=True)
I start by setting a few variables for later.
Our data is stored in the data directory:
data_directory = '../data'
I'm going to do this analysis on the first participant, a:
participant = 'a'
I'm going to use a 30-day moving average to show how emotion and sentiment changes over time:
rolling_mean_window = 30
I'm going to show data from April 2016 to October 2018, since that's the range of time that all of my datasets are available for:
timeframe = ('2016-04-01', '2018-10-13')
Next I'm going to define a function that removes outliers from the dataset. An outlier is defined as any value that is more than 1.5 times the interquartile range above the third quartile, or more than 1.5 times the interquartile range below the first quartile. I'll use this on each of the datasets that we import.
def remove_outliers(series):
    # Standard IQR rule: drop anything more than 1.5 * IQR outside the quartiles.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outliers = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return series[~outliers]
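As a quick sanity check (a toy example, not part of the analysis), an obvious outlier should get dropped while the rest of the values survive:

# 100 falls well outside q3 + 1.5 * IQR (which works out to 7 here), so it's removed.
remove_outliers(pd.Series([1, 2, 3, 4, 100]))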
I'll also define a function to normalize the data such that all values are scaled to a range of -1 to 1.
def normalize(series):
    lowest, highest = series.min(), series.max()
    return ((series - lowest) / (highest - lowest) - 0.5) * 2
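And a similar toy check for normalize: the minimum should map to -1, the midpoint to 0, and the maximum to 1:

normalize(pd.Series([0.0, 5.0, 10.0]))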
Finally I'll create an empty dataframe to store the cleaned-up, normalized data.
data = pd.DataFrame()
Now that everything is set up for the analysis, I'll read the Lifeslice CSV into a dataframe and output the last few rows to see what kind of data I'm working with:
lifeslice = pd.read_csv(data_directory + '/' + participant + '/lifeslice.csv', parse_dates=[['date', 'time']], index_col=['date_time']).dropna()
lifeslice.tail()
Now for the fun part. I'm going to remove outliers from each of the datasets, normalize them all to a scale of -1 to 1, and then put the columns of interest into a single dataframe.
series = lifeslice['emotions.valence']
series = remove_outliers(series)
series = normalize(series)
data = data.merge(series.to_frame('lifeslice'), how='outer', left_index=True, right_index=True)
for dataset in ['imessage', 'dayone', 'facebook', '750words']:
    csv = data_directory + '/' + participant + '/' + dataset + '.csv'
    # Not every participant has every dataset, so skip any missing files.
    if not os.path.exists(csv):
        continue
    df = pd.read_csv(csv, parse_dates=[['date', 'time']], index_col=['date_time']).dropna()
    series = df['sentiment.comparative']
    series = remove_outliers(series)
    series = normalize(series)
    data = data.merge(series.to_frame(dataset), how='outer', left_index=True, right_index=True)
And I'll show the last few rows to see what that result looks like:
data.tail()
Most of the rows have values for only one column, but this makes sense: unless the participant sent an iMessage at exactly the same moment that Lifeslice took a photo of them, there will only be a single value for that timestamp. Later, I'm going to resample the data into evenly-spaced time buckets so that each day has a value for each dataset.
For now, I'll just take a slice of the dataframe with only the timeframe that I want to analyze.
start, end = (data.index.searchsorted(datetime.datetime.strptime(i, '%Y-%m-%d')) for i in timeframe)
data = data[start:end]
And then I'll show a histogram of the data to see how it's distributed:
fig = plotly.tools.make_subplots(rows=len(data.columns), cols=1)
for index, column in enumerate(data.columns):
    trace = plotly.graph_objs.Histogram(
        name = column,
        x = data[column],
    )
    fig.append_trace(trace, index + 1, 1)
fig['layout'].update(height=len(data.columns) * 250)
plot_url = py.iplot(fig)
It looks like there is a disproportionate number of normalized sentiment scores at -1 and Lifeslice valence scores at 1, which is skewing the data. I'll remove those rows and try again:
for column in data.columns:
    if column == 'lifeslice':
        data = data[data[column] != 1]
        continue
    data = data[data[column] != -1]
Now I'll create a set of histograms again and see if it looks any better:
fig = plotly.tools.make_subplots(rows=len(data.columns), cols=1)
for index, column in enumerate(data.columns):
    trace = plotly.graph_objs.Histogram(
        name = column,
        x = data[column],
    )
    fig.append_trace(trace, index + 1, 1)
fig['layout'].update(height=len(data.columns) * 250)
plot_url = py.iplot(fig)
Sweet, that looks way better. The distribution looks much more reasonable now.
Now it's time to resample. Basically I'm going to squeeze all of these events that occur at arbitrary moments in time into 1-day buckets by averaging all of the values within each day to create a single value for that day. This is called resampling, and luckily Pandas makes it very easy with the resample method. I'll also fill any days with no data using the column's overall mean, and then smooth everything with the 30-day moving average. Then I'll output the first few rows to see how the result looks:
rule = '1d'
resampled = data.resample(rule).mean().fillna(data.mean()).rolling(rolling_mean_window, center=True).mean()
resampled.head()
The first and last 15 days or so of the timeframe are all NaN because of the way the centered rolling mean works: it only calculates a value for a day once it has a full 30-day window centered on that day, which leaves roughly the first and last 15 days out. I'll drop these rows to make things cleaner.
resampled.dropna(inplace=True)
resampled.head()
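To make the windowing behavior concrete, here's a toy example (separate from the analysis): with a centered window of 5, the first two and last two values have no complete window around them and come out as NaN, which is the same thing that happens at the edges of the 30-day window above.

# A centered 5-value window can't be computed at the very start or end,
# so the first two and last two entries of the result are NaN.
pd.Series(range(11)).rolling(5, center=True).mean()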
I'm almost ready to chart the data, but I want to chart not only the moving averages, but also the upper and lower quartiles so that I can get a sense of how much variation there is in the data.
lower = data.resample(rule).apply(lambda x: x.quantile(q=0.25)).fillna(data.mean()).rolling(rolling_mean_window, center=True).mean().dropna()
upper = data.resample(rule).apply(lambda x: x.quantile(q=0.75)).fillna(data.mean()).rolling(rolling_mean_window, center=True).mean().dropna()
Cool. Now it's finally time to plot the data. I'll use Plotly to create four "lines" for each dataset, which Plotly more accurately calls traces. The first trace will show all of the actual data points, which will help me see how the moving averages relate to the underlying data. The second trace will be the moving average itself, and the third and fourth traces will together form the lower and upper quartile band.
colors = [
'#50514F',
'#F25F5C',
'#FFE066',
'#247BA0',
'#70C1B3',
]
datasets = [[
# Scatterplot
plotly.graph_objs.Scatter(
name = column,
x = data.index,
y = data[column],
mode = 'markers',
marker = {
'size': 1,
'color': colors[index],
},
),
# Moving average
plotly.graph_objs.Scatter(
name = column + ' ma',
x = resampled.index,
y = resampled[column],
mode = 'lines',
fill = 'tonexty',
fillcolor = 'rgba(68, 68, 68, 0.3)',
line = {
'color': colors[index],
},
),
# Lower quartile
plotly.graph_objs.Scatter(
name = column,
x = resampled.index,
y = lower[column],
line = dict(width = 0),
showlegend = False,
mode = 'lines',
),
# Upper quartile
plotly.graph_objs.Scatter(
name = column,
x = resampled.index,
y = upper[column],
fill='tonexty',
fillcolor='rgba(68, 68, 68, 0.3)',
marker=dict(color = '444'),
line=dict(width = 0),
showlegend = False,
mode='lines',
)
] for index, column in enumerate(data.columns)]
I'll start by showing the Lifeslice emotional analysis scores and the iMessage sentiment scores on the same plot so that I can visually compare them.
to_plot = [trace for dataset in datasets[0:2] for trace in dataset]
py.iplot(to_plot)
And now I'll show each of the datasets independently:
fig = plotly.tools.make_subplots(rows=len(datasets), cols=1)
for index, dataset in enumerate(datasets):
    for trace in dataset:
        fig.append_trace(trace, index + 1, 1)
fig['layout'].update(title='Sentiment Comparisons', height=len(datasets) * 250)
plot_url = py.iplot(fig, filename='stacked-subplots')
That looks kind of neat. It makes it easy to visually get a sense of how participant a's emotion and sentiment scores changed over time, but it's not very clear whether there are any correlations. I'll check directly:
resampled.corr()
It looks like there is a slight negative correlation between iMessage and Lifeslice. I'll plot this to investigate:
trace = plotly.graph_objs.Scatter(
name = 'comparison',
x = resampled['lifeslice'],
y = resampled['imessage'],
mode = 'markers',
marker = {
'size': 5,
'color': colors[0],
},
)
py.iplot([trace])
Indeed, it is slight.
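If I wanted to put a number and a significance level on "slight", one option would be SciPy's pearsonr. This is just a sketch; it assumes SciPy is installed, which isn't otherwise used in this notebook:

from scipy import stats

# Pearson correlation coefficient and two-sided p-value for the two
# resampled, smoothed series. A small |r| and a large p-value would
# confirm that the relationship really is weak.
r, p = stats.pearsonr(resampled['lifeslice'], resampled['imessage'])
print(r, p)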