Chronist

Long-term analysis of emotion, sentiment, and aging using photos and text.

The goal of this project is to quantitatively monitor the emotional and physical changes of an individual over periods of time. My thesis is that if you can accurately show emotional or physical change over time, you can objectively pinpoint how an environmental change such as a career change, moving to a new city, starting or ending a relationship, or starting a new habit like going to the gym affected your physical and emotional health. This can lead to important insights on an individual level and for a population as a whole.

We start with two types of datasets. The first is a CSV of emotion, age, and ethnity as output by a photographic emotion detection library by Affectiva. The second type are CSVs of sentiment analysis data from iMessage, Facebook Messenger, 750 Words, and Day One. This data is all in the /data directory and further organized into directories based on each user - for example, /user/a refers to user A's data. I've used letters to label users to protect their privacy. You can read more about the tools I've used to generate these CSVs here.

First, I'll import a few libraries. I'll use Pandas to clean up the data and then Plotly to visualize it.

In [1]:
import os
import datetime
import numpy as np
import pandas as pd
import plotly
import plotly.offline as py

Next I tell Plotly and Python to use offline mode.

In [2]:
py.init_notebook_mode(connected=True)

I start by setting a few variables for later.

Our data is stored in the data directory:

In [3]:
data_directory = '../data'

I'm going to do this analysis on the first participant, a:

In [4]:
participant = 'a'

I'm going to use a 30-day moving average to show how emotion and sentiment changes over time:

In [5]:
rolling_mean_window = 30

I'm going to show data from April 2016 to February 2017, since that's the range of time that all of my datasets are available for:

In [6]:
timeframe = ('2016-04-01', '2018-10-13')

Next I'm going to define a function that removes outliers from the dataset. An outlier is defined as any value that is more than 1.5 times greater than the upper bound of the interquartile range or 1.5 times less than the lower bound of the interquartile range. I'll use this on each of the datasets that we import.

In [7]:
def remove_outliers(series):
    iqr = (series.quantile(0.25) * 1.5, series.quantile(0.75) * 1.5)
    outliers = (series < iqr[0]) | (series > iqr[1])
    return series[~outliers]

I'll also define a function to normalize the data such that all values are scaled to a range of -1 to 1.

In [8]:
def normalize(series):
    min = series.min()
    max = series.max()
    return ((series - min) / (max - min) - 0.5) * 2

Finally I'll create an empty dataframe to store the cleaned-up, normalized data.

In [9]:
data = pd.DataFrame()

Now that everything is set up for the analysis, I'll read the CSV into a dataframe and output the first few rows to see what kind of data I'm working with:

In [10]:
lifeslice = pd.read_csv(data_directory + '/' + participant + '/lifeslice.csv', parse_dates=[['date', 'time']], index_col=['date_time']).dropna()

lifeslice.tail()
Out[10]:
path faces appearance.age appearance.ethnicity appearance.gender appearance.glasses emojis.disappointed emojis.dominantEmoji emojis.flushed emojis.kissing ... featurePoints.7.x featurePoints.7.y featurePoints.8.x featurePoints.8.y featurePoints.9.x featurePoints.9.y measurements.interocularDistance measurements.orientation.pitch measurements.orientation.roll measurements.orientation.yaw
date_time
2018-10-13 22:05:00 face_2018-10-13T22-05-00Z-0400.jpg 1 18 - 24 Caucasian Male Yes 0.001830 😐 0.001829 0.001834 ... 1634.665649 761.034546 1725.642822 763.478821 1803.114746 734.430054 296.085815 -32.105598 14.906713 -0.892394
2018-10-13 22:30:00 face_2018-10-13T22-30-00Z-0400.jpg 1 18 - 24 East Asian Male Yes 0.001829 😐 0.001829 0.001840 ... 1659.128662 700.185059 1721.991455 694.979004 1790.222900 662.822266 252.723190 -29.726332 -3.520424 -6.532529
2018-10-13 22:35:00 face_2018-10-13T22-35-00Z-0400.jpg 1 25 - 34 Caucasian Male Yes 0.001829 😐 0.001829 0.001847 ... 1066.928345 587.672607 1132.069580 585.294373 1173.240723 565.153687 181.387253 -27.444805 -0.328129 -5.242582
2018-10-13 22:40:00 face_2018-10-13T22-40-00Z-0400.jpg 1 18 - 24 Caucasian Male Yes 0.001829 😐 0.001829 0.002140 ... 923.367676 556.024658 984.398804 554.013855 1029.220215 534.051636 187.884155 -20.872135 1.971123 -1.556815
2018-10-13 22:50:00 face_2018-10-13T22-50-00Z-0400.jpg 1 Under 18 Hispanic Male Yes 0.001829 😐 0.001829 0.004443 ... 1346.051758 708.090942 1399.336914 702.162048 1464.945801 676.023560 222.556351 -19.568733 0.243604 1.102069

5 rows × 121 columns

Now for the fun part. I'm going to remove outliers from each of the datasets, normalize them all to a scale of -1 to 1, and then put the columns of interest into a single dataframe.

In [11]:
series = lifeslice['emotions.valence']
series = remove_outliers(series)
series = normalize(series)
data = data.merge(series.to_frame('lifeslice'), how='outer', left_index=True, right_index=True)

for dataset in ['imessage', 'dayone', 'facebook', '750words']:
    csv = data_directory + '/' + participant + '/' + dataset + '.csv'
    if (not os.path.exists(csv)):
        continue
    df = pd.read_csv(csv, parse_dates=[['date', 'time']], index_col=['date_time']).dropna()
    series = df['sentiment.comparative']
    series = remove_outliers(series)
    series = normalize(series)
    data = data.merge(series.to_frame(dataset), how='outer', left_index=True, right_index=True)

And I'll show the first few rows to see what that result looks like:

In [12]:
data.tail()
Out[12]:
lifeslice imessage dayone
date_time
2018-10-14 10:28:39 NaN -1.000000 NaN
2018-10-14 10:44:41 NaN -0.583333 NaN
2018-10-14 10:45:44 NaN -1.000000 NaN
2018-10-14 11:59:23 NaN -1.000000 NaN
2018-10-14 12:04:17 NaN -0.166667 NaN

Most of the rows have values for only one column, but this makes sense. Unless the participant had sent an iMessage at exactly the same moment that Lifeslice took a photo of them, there would only be a single value for that time. I'm going to resample the data later to re-distribute the data into evenly-spaced time buckets so that each time has a value.

For now, I'll just take a slice of the dataframe with only the timeframe that I want to analyze.

In [13]:
start, end = (data.index.searchsorted(datetime.datetime.strptime(i, '%Y-%m-%d')) for i in timeframe)
data = data[start:end]

And then I'll show a histogram of the data to see how it's distributed:

In [14]:
fig = plotly.tools.make_subplots(rows=len(data.columns), cols=1)

for index, column in enumerate(data.columns):
    trace = plotly.graph_objs.Histogram(
        name = column,
        x = data[column],
    )
    fig.append_trace(trace, index + 1, 1)

fig['layout'].update(height=len(data.columns) * 250)
plot_url = py.iplot(fig)
This is the format of your plot grid:
[ (1,1) x1,y1 ]
[ (2,1) x2,y2 ]
[ (3,1) x3,y3 ]

It looks like there a dramatically unproportionate amount of normalized sentiment scores that are -1 and Lifeslice valence scores that are 1 which is skewing the data. I'll remove those rows and try again:

In [15]:
for column in data.columns:
    if column == 'lifeslice':
        data = data[data[column] != 1]
        continue
    data = data[data[column] != -1]

Now I'll create a set of histograms again and see if it looks any better:

In [16]:
fig = plotly.tools.make_subplots(rows=len(data.columns), cols=1)

for index, column in enumerate(data.columns):
    trace = plotly.graph_objs.Histogram(
        name = column,
        x = data[column],
    )
    fig.append_trace(trace, index + 1, 1)

fig['layout'].update(height=len(data.columns) * 250)
plot_url = py.iplot(fig)
This is the format of your plot grid:
[ (1,1) x1,y1 ]
[ (2,1) x2,y2 ]
[ (3,1) x3,y3 ]

Sweet, that looks way better. The distribution looks much more reasonable now.

Now it's time to resample. Basically I'm going to squeeze all of these random events that occur at a specific moment in time into 1-day time buckets by averaging all of the seconds for each day together to create an average for that day. This is called resampling. Luckily Pandas makes this very easy with the resample method. Then I'll output the first few rows to see how the result looks:

In [17]:
rule = '1d'
resampled = data.resample('1d').mean().fillna(data.mean()).rolling(rolling_mean_window, center=True).mean()
resampled.head()

start, end = tuple(datetime.datetime.strptime(i, '%Y-%m-%d') for i in timeframe)
data = data[data.index.searchsorted(start):data.index.searchsorted(end)]

The first and last 15 days of the timeframe are all NaN because of the way resampling works. It will only calculate a mean for a day if it has a full 30 days of values before and after the day, which means it leaves the first and least 30 days out. I'll drop these rows to make things cleaner.

In [18]:
resampled.dropna(inplace=True)
resampled.head()
Out[18]:
lifeslice imessage dayone
date_time
2016-04-16 0.662552 0.112765 0.054272
2016-04-17 0.673985 0.114836 0.042427
2016-04-18 0.679121 0.115126 0.042427
2016-04-19 0.682729 0.138171 0.042427
2016-04-20 0.685517 0.149115 0.019012

I'm almost ready to chart the data, but I want to chart not only the moving averages, but also the upper and lower quartiles so that I can get a sense of how much variation there is in the data.

In [19]:
lower = data.resample(rule).apply(lambda x: x.quantile(q=0.25)).fillna(data.mean()).rolling(rolling_mean_window, center=True).mean().dropna()
upper = data.resample(rule).apply(lambda x: x.quantile(q=0.75)).fillna(data.mean()).rolling(rolling_mean_window, center=True).mean().dropna()

Cool. Now it's finally time to plot the data. I'll use Plotly to create four "lines", which Plotly more accurately calls traces. The first trace will be to show all of the actual data points. This will help me see how the moving averages relate to the actual data. The second trace will be the moving average itself, and the third and fourth trace will together create the lower and upper bounds.

In [20]:
colors = [
    '#50514F',
    '#F25F5C',
    '#FFE066',
    '#247BA0',
    '#70C1B3',
]

datasets = [[
        
    # Scatterplot
    plotly.graph_objs.Scatter(
        name = column,
        x = data.index,
        y = data[column],
        mode = 'markers',
        marker = {
            'size': 1,
            'color': colors[index],
        },
    ),
        
    # Moving average
    plotly.graph_objs.Scatter(
        name = column + ' ma',
        x = resampled.index,
        y = resampled[column],
        mode = 'lines',
        fill = 'tonexty',
        fillcolor = 'rgba(68, 68, 68, 0.3)',
        line = {
            'color': colors[index],
        },
    ),
        
    # Lower quartile
    plotly.graph_objs.Scatter(
        name = column,
        x = resampled.index,
        y = lower[column],
        line = dict(width = 0),
        showlegend = False,
        mode = 'lines',
    ),
        
    # Upper quartile
    plotly.graph_objs.Scatter(
        name = column,
        x = resampled.index,
        y = upper[column],
        fill='tonexty',
        fillcolor='rgba(68, 68, 68, 0.3)',
        marker=dict(color = '444'),
        line=dict(width = 0),
        showlegend = False,
        mode='lines',
    )
        
] for index, column in enumerate(data.columns)]

I'll start by showing the Lifeslice emotional analysis scores and the iMessage sentiment scores on the same plot so that I can visually compare them.

In [21]:
to_plot = [trace for dataset in datasets[0:2] for trace in dataset]
py.iplot(to_plot)

And now I'll show each of the datasets independently:

In [22]:
fig = plotly.tools.make_subplots(rows=len(datasets), cols=1)

for index, dataset in enumerate(datasets):
    for trace in dataset:
        fig.append_trace(trace, index + 1, 1)

fig['layout'].update(title='Sentiment Comparisons', height=len(datasets) * 250)
plot_url = py.iplot(fig, filename='stacked-subplots')
This is the format of your plot grid:
[ (1,1) x1,y1 ]
[ (2,1) x2,y2 ]
[ (3,1) x3,y3 ]

That looks kind of neat. It makes it easy to visually get a sense of how participant a's emotion and sentiment scores changed over time, but it's not very clear if there are any correlations. I'll check directly:

In [23]:
resampled.corr()
Out[23]:
lifeslice imessage dayone
lifeslice 1.000000 -0.085262 -0.012834
imessage -0.085262 1.000000 0.123364
dayone -0.012834 0.123364 1.000000

It looks like there is a slight negative correlation between iMessage and Lifeslice. I'll plot this to investigate:

In [24]:
trace = plotly.graph_objs.Scatter(
    name = 'comparison',
    x = resampled['lifeslice'],
    y = resampled['imessage'],
    mode = 'markers',
    marker = {
        'size': 5,
        'color': colors[0],
    },
)
py.iplot([trace])

Indeed, it is slight.

In [ ]:
 
In [ ]: