Exploring My Spotify Data

Nov 9, 2020

Every year (since 2015) Spotify has had its Wrapped campaign where each December they release information regarding your listening habits over the last year. It started as just a simple site showing users their top songs and genres and over the years they have been adding to the data they show users. In 2019 they also added a review of the last decade. This has resulted in a huge increase in Spotify usage. More than 60 million users interacted with Spotify Wrapped last year alone. You can read more here.

Recently, I discovered that Spotify allows you to download a year’s worth of your own data. This data includes your streaming history, playlists, search queries, library, artists you follow, etc.

Now, even though it’s November and Spotify will be out with my year in review in about a month, I’m a data scientist now, so I decided to do some basic data exploration on this data set to see if I could gain any insight into my own listening habits.

I present my results using the functions I eventually wrote so that you can easily use them for analyzing your own data. You can read how to download your own data here.

Import Libraries & Read in Data

## standard imports 
import pandas as pd 
import numpy as np

## visualizations
import matplotlib.pyplot as plt
import seaborn as sns

import json

First I need to read in my Spotify streaming data. This was supplied in 3 separate JSON files so I’ll need to combine them into a single data frame for analysis.

### open streaming data JSONs and create single dataframe
def read_streaming_data(num_files=3):
    dfs = []
    ### read each JSON file and make dataframe
    for i in range(num_files):
        with open(f'./data/StreamingHistory{i}.json') as f:
            dfs.append(pd.DataFrame(json.load(f)))
    ### create single dataframe
    streams = pd.concat(dfs)
    ### convert endTime to datetime and make index of dataframe
    streams['endTime'] = pd.to_datetime(streams['endTime'])
    streams.set_index('endTime', inplace=True)
    return streams

streams = read_streaming_data(3)
streams.head()
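If you don’t know how many files your export contains, one option is to discover them with a glob pattern instead of passing a count. The sketch below is only an illustration: the helper name `read_all_streaming_files` and its `data_dir` default are my own assumptions, and it assumes the files follow the same `StreamingHistory{i}.json` naming used above.

```python
import glob
import json

import pandas as pd


def read_all_streaming_files(data_dir="./data"):
    """Load every StreamingHistory*.json file in data_dir into one dataframe.

    Hypothetical helper: the name and the data_dir default are assumptions
    made for illustration only.
    """
    paths = sorted(glob.glob(f"{data_dir}/StreamingHistory*.json"))
    if not paths:
        raise FileNotFoundError(f"No StreamingHistory*.json files in {data_dir}")

    frames = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            frames.append(pd.DataFrame(json.load(f)))

    streams = pd.concat(frames, ignore_index=True)
    streams["endTime"] = pd.to_datetime(streams["endTime"])
    # Sort chronologically so the order does not depend on file order.
    return streams.set_index("endTime").sort_index()
```

Sorting the index at the end means the result is chronological regardless of the order in which the files were found.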

The streaming data gives you the time you ended listening to the track, track artist, name of the track, and milliseconds you played the track for. We’ll be repeating some analysis here so I’ve included a little library of functions that I’ll use to perform some of the data exploration.

def stream_period(df, start_date, end_date):
    ### create dataframe for the desired time period
    data = df.loc[start_date : end_date]
    return data

def get_top_10_artists(df, plot=False):
    ### get top 10 artists for any subset of streaming data
    top_10 = df['artistName'].value_counts().head(10)
    
    if plot:
        top_10.iloc[::-1].plot(kind='barh')
        plt.xlabel('Track Count')
        plt.title('My top 10 Artists', fontweight='bold');
        
    return top_10

def get_top_10_tracks(df, plot=False):
    ### get top 10 tracks for any subset of streaming data
    top_10 = df['trackName'].value_counts().head(10)
    
    if plot:
        top_10.iloc[::-1].plot(kind='barh')
        plt.xlabel('Track Count')
        plt.title('My Top 10 Tracks', fontweight='bold');
    
    return top_10

def get_top_10_artist_tracks(df):
    ### for each of the top 10 artists get the most played track
    artist_tracks = []
    top_10_artists = get_top_10_artists(df)
    
    for artist in top_10_artists.index:
        ### count plays per track for this artist and keep the top one
        track = df.loc[df['artistName'] == artist, 'trackName'].value_counts().head(1)
        artist_tracks.append({'artist' : artist, 'track': track.index[0], 'play_count': track.values[0]})
    
    top_10_artist_tracks = pd.DataFrame(artist_tracks)
    return top_10_artist_tracks

def plot_top_10_period(df, start_date, end_date):
    ### create barplots for top artists and tracks for given period of time.
    data = stream_period(df, start_date, end_date)
    ### get the top 10s
    top_10_artists = get_top_10_artists(data)
    top_10_tracks = get_top_10_tracks(data)
    ## plot the data
    fig, ax = plt.subplots(1,2, figsize=(15,5))
    plt.subplots_adjust(wspace=1)
    top_10_artists.iloc[::-1].plot(ax=ax[0], kind='barh', title='Top Artists')
    top_10_tracks.iloc[::-1].plot(ax=ax[1], kind='barh', title='Top Tracks')
    plt.suptitle(f'{start_date} to {end_date}')

def compare_barplots(df1, df2, title1, title2):
    ### create side-by-side barplots to compare two top 10 rankings.
    ## plot the data
    fig, ax = plt.subplots(1,2, figsize=(15,5))
    plt.subplots_adjust(wspace=1)
    df1.iloc[::-1].plot(ax=ax[0], kind='barh', title=title1, legend=False)
    df2.iloc[::-1].plot(ax=ax[1], kind='barh', title=title2, legend=False)
    ax[0].set_ylabel('')

def convert_ms(ms):
    ms = int(ms)
    secs = (ms/1000)%60
    secs = int(secs)
    mins = (ms/(1000*60))%60
    mins = int(mins)
    hrs = (ms/(1000*60*60))%24
    hrs = int(hrs)
    days = (ms/(1000*60*60*24))
    days = int(days)
    
    ### create dictionary to report results
    time = {
        'days' : days,
        'hours': hrs,
        'minutes': mins,
        'seconds': secs
    }
    return time

Let’s see how many tracks I listened to this last year.

print('You have streamed', streams.shape[0], 'songs this year.')

You have streamed 27559 songs this year.

Wow, 27,559 tracks! I wonder how much time that is?

listen_time = streams['msPlayed'].sum()
listen_time

4640540504

This is in milliseconds. Not a super insightful unit of time. Let’s convert this to more conventional units of time. The convert_ms function defined above does this for us, since we’ll want to see time spent listening to subsets of the data. This function was written based on a post on Stack Overflow and results were verified here.

listen_time = convert_ms(streams['msPlayed'].sum())
print(f"You have listened to {listen_time['days']} days {listen_time['hours']} hours {listen_time['minutes']} minutes and {listen_time['seconds']} seconds worth of music.")

You have listened to 53 days 17 hours 2 minutes and 20 seconds worth of music.

I knew I listened to a lot of music, but over 53 days worth?! 🤯
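As a cross-check on that conversion, here is a minimal sketch of the same breakdown using integer arithmetic with `divmod`. The name `split_ms` is hypothetical and only for illustration. It is an alternative way to arrive at the same numbers, not the function used in this post.

```python
def split_ms(ms):
    """Break a millisecond count into whole days, hours, minutes and seconds."""
    total_seconds = int(ms) // 1000
    minutes, seconds = divmod(total_seconds, 60)  # peel off leftover seconds
    hours, minutes = divmod(minutes, 60)          # peel off leftover minutes
    days, hours = divmod(hours, 24)               # peel off leftover hours
    return {"days": days, "hours": hours, "minutes": minutes, "seconds": seconds}


print(split_ms(4640540504))
# {'days': 53, 'hours': 17, 'minutes': 2, 'seconds': 20}
```

Working in whole numbers avoids the float divisions and repeated `int()` casts, and it reproduces the 53 days, 17 hours, 2 minutes and 20 seconds reported above.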

Top 10s for the Year

Now let’s dive into some specifics. What were the top 10 artists I listened to this last year?

top_10_artists = get_top_10_artists(streams, plot=True)

By track count, the artist I’ve listened to most is Falling in Reverse. This is interesting since this is a band I discovered at the beginning of 2020. I wonder exactly when and why I started listening to them? When do they first appear in my streaming history?

streams[streams['artistName'] == 'Falling In Reverse'].head(1)

Ah, I remember now! This song appeared in a playlist I follow called “Rock This”. I loved this single and wanted to hear more from the artist. I wonder what the top tracks I listened to from each of these artists were?

get_top_10_artist_tracks(streams)

I wonder how different these tracks are from my top tracks overall?

get_top_10_tracks(streams, plot=True)
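One caveat with counting on `trackName` alone is that two different artists can have tracks with the same title, and those plays would be lumped together. A rough sketch of one way around this is to count (artist, track) pairs instead. The toy data and the helper name `top_tracks_by_artist` are made up for illustration.

```python
import pandas as pd

# Toy data: "Intro" exists for two different (made-up) artists.
toy = pd.DataFrame(
    {
        "artistName": ["Artist X", "Artist Y", "Artist X", "Artist Z", "Artist Z"],
        "trackName": ["Intro", "Intro", "Intro", "Closer", "Closer"],
    }
)


def top_tracks_by_artist(df, n=10):
    """Count plays per (artist, track) pair, most played first."""
    counts = df.groupby(["artistName", "trackName"]).size()
    return counts.sort_values(ascending=False).head(n)


print(top_tracks_by_artist(toy, n=2))
```

In this toy example, counting by title alone would credit "Intro" with three plays, while the pair-wise count splits them into two plays for one artist and one for the other.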

I wonder what my top artists look like in terms of time and not just number of tracks I’ve listened to?

top_10_by_time = streams.groupby(['artistName'])[['msPlayed']].sum().sort_values(by='msPlayed', ascending=False).head(10)
top_10_by_time

Honestly, not that much of a difference here. A few artists pop up here instead simply because of certain albums I have listened to over and over again. I wonder how this list of artists compares to my top artists by track count?

compare_barplots(top_10_by_time, get_top_10_artists(streams), 'Top Artists by Time', 'Top Artist by Track Count')
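If you would rather compare the two rankings in a single table than in side-by-side plots, something like the following could work. This is a sketch under assumptions: the toy data, the helper name `rank_artists`, and the column names `plays`, `hours`, `rank_by_plays` and `rank_by_hours` are all made up for illustration.

```python
import pandas as pd

# Toy streaming history standing in for the real `streams` dataframe.
toy = pd.DataFrame(
    {
        "artistName": ["A", "A", "A", "B", "B", "C"],
        "msPlayed": [60_000, 60_000, 60_000, 400_000, 400_000, 100_000],
    }
)


def rank_artists(df, n=10):
    """Return play count, hours played, and both ranks for the top artists."""
    summary = df.groupby("artistName")["msPlayed"].agg(plays="count", ms="sum")
    summary["hours"] = summary["ms"] / (1000 * 60 * 60)
    summary["rank_by_plays"] = (
        summary["plays"].rank(ascending=False, method="min").astype(int)
    )
    summary["rank_by_hours"] = (
        summary["hours"].rank(ascending=False, method="min").astype(int)
    )
    # Keep any artist that makes the top n on either measure.
    in_top = (summary["rank_by_plays"] <= n) | (summary["rank_by_hours"] <= n)
    return summary[in_top].drop(columns="ms").sort_values("rank_by_hours")


print(rank_artists(toy, n=2))
```

An artist whose two ranks differ a lot is one I either play in short bursts or leave on for long stretches, which is the same story the two bar charts tell visually.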

Top 10s for Specified Periods

Next, I wanted to get more detailed information about my listening habits for certain periods of time, since 2020 has been such a weird year (both in terms of what is going on in the world and in my personal life). Let’s see what I was listening to most at the beginning of 2020.

plot_top_10_period(streams, start_date = '2020-01-01', end_date = '2020-02-15')

Interesting. Looks like I was listening to a lot of blues at the beginning of the year in terms of tracks. It looks like Eminem dominated my listening thanks to his album “Music to be Murdered By”. Let’s see how I was doing during the first 6 weeks of the pandemic.

plot_top_10_period(streams, start_date = '2020-02-29', end_date = '2020-04-15')

Interesting. A lot more pop punk in the mix. Usually I listen to this type of music to pull me out of feeling down so I suppose that makes sense. My track list is a little more interesting. Here I see a bunch of songs that have a lot more to do with love than I normally listen to. I wonder what was going on with me around that time? (No more comment on that observation here 😉)
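A note on the date slicing behind these period plots: selecting a date range with `.loc` on a datetime index is most dependable when the index is sorted, and pandas may raise an error or return unexpected rows otherwise. Since the history comes from several concatenated files, sorting first is a cheap safeguard. Here is a minimal sketch on made-up data, where `slice_period` is a hypothetical stand-in rather than the function used above.

```python
import pandas as pd

# Toy history with rows deliberately out of chronological order.
toy = pd.DataFrame(
    {
        "endTime": pd.to_datetime(
            ["2020-03-10 08:00", "2020-01-05 21:30", "2020-02-20 12:15", "2020-04-01 07:45"]
        ),
        "trackName": ["Song C", "Song A", "Song B", "Song D"],
    }
).set_index("endTime")


def slice_period(df, start_date, end_date):
    """Return rows whose endTime falls between start_date and end_date, inclusive."""
    ordered = df.sort_index()  # sort first so the label slice is well defined
    return ordered.loc[start_date:end_date]


early_2020 = slice_period(toy, "2020-01-01", "2020-02-29")
print(early_2020["trackName"].tolist())
# ['Song A', 'Song B']
```

Note that slicing with date strings includes the whole end date, so a stream that finished late on the last day still counts toward the period.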

Conclusions

There is so much more to explore in this data. If you’re like me, examining your listening habits can give interesting insight into what your year was like at any given point. I have yet to dive into analyzing my playlist data or my search queries, so there will be more to come soon!

Written by Regene DePiero