Tuesday 2 April 2013

Handling dates and carrying out linear regression to model number of members of a Google Plus community in Python

I've blogged about rugby before (using some game theory to disagree with a rugby commentator) but here's another one.

I'm a member of the Rugby Union community on Google Plus, and it's a great place to chat about rugby and share stuff. As I said on this post about G+, I don't post publicly about rugby, as I assume that most people who circle me won't be too interested in my opinions on it, so it's nice to have a place to go. Anyway, I really recommend the community: it's a nice place if you like the game.

+Davide Coppola, who's the owner of the community, has been posting an announcement every time we go up by 100 members. We're currently at about 960 people and we've been wondering when we'll go past the 1000 member mark.

I thought I'd write a very simple bit of Python to make an educated guess, and also to see what member numbers have looked like.

Here's the data:


First of all I needed to import all the data (I've posted a short screencast about handling csv data in python and here's an old blog post with the code):

import csv

# Read every row of the CSV file (the first row is the header)
with open("Data.csv", newline="") as infile:
    data = [row for row in csv.reader(infile)]

Once that is done I use Python's datetime module to convert the dates (which are read in as strings) to actual dates, and I also create a list containing the member numbers:

import datetime

dates = [datetime.datetime.strptime(e[0], '%Y-%m-%d') for e in data[1:]]
numbers = [int(e[1]) for e in data[1:]]
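
To see the parsing step in isolation, here's a minimal sketch that reads a couple of made-up rows from an in-memory string rather than from Data.csv. The column names and values are assumptions for illustration only; the only things taken from the code above are the header row and the '%Y-%m-%d' date format.

```python
import csv
import datetime
import io

# Hypothetical file contents: the column names and numbers are made up.
sample = io.StringIO("Date,Members\n2013-01-01,100\n2013-01-15,212\n")
rows = list(csv.reader(sample))

# Skip the header row; column 0 holds the date, column 1 the member count.
dates = [datetime.datetime.strptime(r[0], "%Y-%m-%d") for r in rows[1:]]
numbers = [int(r[1]) for r in rows[1:]]

print(dates)
print(numbers)
```

Using int rather than eval assumes the second column only ever holds whole numbers, which seems reasonable for a member count.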

I will use the dates list later on to plot things nicely but for now I need to consider numeric data (to fit a linear model). I convert the dates list to a set of numbers counting the number of days having passed from the first day of the community:

x = [(e - min(dates)).days for e in dates] 
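
Here's a quick sketch of what that conversion gives, using three made-up dates (not the real data). Subtracting two datetimes gives a timedelta, and its days attribute is the whole number of days between them:

```python
import datetime

# Hypothetical dates, chosen only to make the offsets easy to check.
dates = [
    datetime.datetime(2013, 1, 1),
    datetime.datetime(2013, 1, 15),
    datetime.datetime(2013, 2, 1),
]

start = min(dates)  # first day on record
x = [(d - start).days for d in dates]
print(x)  # [0, 14, 31]
```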

Once I've done that I use the stats sub package from the scipy library to carry out a simple linear regression:

from scipy import stats
gradient, intercept, r_value, p_value, std_err = stats.linregress(x, numbers)

(I in fact only need the gradient and intercept from the above but it's all there in case I wanted it.)
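
For anyone curious about what the gradient and intercept actually are, here's a rough sketch of the ordinary least-squares formulas written out by hand in Python 3. The function name least_squares and the data points are made up for illustration; on the same inputs stats.linregress should return the same gradient and intercept.

```python
def least_squares(x, y):
    """Return (gradient, intercept) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations of x and y, and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


# Hypothetical points lying exactly on y = 8x + 100
x = [0, 10, 20, 30]
y = [100, 180, 260, 340]
gradient, intercept = least_squares(x, y)
print(gradient, intercept)  # 8.0 100.0
```
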
To find out when we can expect to go past 1000 members (assuming a linear model of growth):

projected_date = min(dates) + datetime.timedelta(days=(1000 - intercept) / gradient)
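
As a worked example of that line, here's a small sketch with made-up fit values (not the ones from the real data). Solving 1000 = gradient * days + intercept for days gives a fractional number of days, and because timedelta accepts fractions the projected date comes with a time of day attached:

```python
import datetime

# Hypothetical values, chosen only to make the arithmetic easy to follow.
start = datetime.datetime(2013, 1, 1)  # first day on record
gradient = 8.0      # members joining per day
intercept = 100.0   # fitted member count on day 0
target = 1000

days_needed = (target - intercept) / gradient  # 112.5 days
projected_date = start + datetime.timedelta(days=days_needed)
print(projected_date)  # 2013-04-23 12:00:00
```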

To project the linear fit to see what number of members we could hope to have after a year I do the following:

extra_date = min(dates) + datetime.timedelta(days=365)
projection = gradient * (extra_date - min(dates)).days + intercept

Finally to plot all of the above I use pyplot:

import matplotlib.pyplot as plt
plt.figure(1, figsize=(15, 6))
plt.plot_date(dates, numbers, label="Data")
plt.plot_date(dates + [extra_date], numbers + [projection], '-', label="Linear fit (%.2f join per day)" % gradient)
plt.legend(loc="upper left")
plt.title("Rugby Union Google Plus Community Member Numbers")

The output is given here (there's something not great with the graph: I don't know why a line is being plotted between the data points but I haven't been able to fix that easily):
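
A likely cause of that line: the second plot_date call is given the raw member numbers with the projection tacked on the end, so the '-' style joins up the observations themselves rather than drawing the fitted line. Here's a sketch of one way around it, plotting the fit from just two fitted values (day 0 and day 365). The data and fit values are made up for illustration, and I've used plot with an "o" marker in place of plot_date, which is a swap on my part rather than what the code above does.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical data and fit values, not the community's real figures.
start = datetime.datetime(2013, 1, 1)
dates = [start + datetime.timedelta(days=d) for d in (0, 10, 20, 30)]
numbers = [105, 176, 262, 338]
gradient, intercept = 8.0, 100.0

# Two points are enough to draw a straight line: the fit at day 0 and at day 365.
extra_date = start + datetime.timedelta(days=365)
projection = gradient * (extra_date - start).days + intercept
fit_dates = [start, extra_date]
fit_values = [intercept, projection]

fig, ax = plt.subplots(figsize=(15, 6))
ax.plot(dates, numbers, "o", label="Data")  # markers only, no joining line
ax.plot(fit_dates, fit_values, "-", label="Linear fit (%.2f join per day)" % gradient)
ax.legend(loc="upper left")
ax.set_title("Rugby Union Google Plus Community Member Numbers")
```

Calling plt.show() (with an interactive backend) or fig.savefig(...) afterwards displays or saves the figure.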

We should have about 3000 members by the end of 2013 and, most importantly, we can expect to go past the 1000 member mark on the 9th of April 2013 at 21:38 (+Davide Coppola and +Andrew Byrne: set your alarm clocks).

Note that you would in fact want to fit a much more complex model than a simple linear fit to actually try and forecast any of what I've done above. I was in fact surprised at how linear the fit was...

It's been a quick bit of fun and could perhaps prove useful to some as an example of how to handle dates (and do a simple linear regression) in Python.

I've actually got a couple of screencasts that show how to handle dates in R and SAS which I'll put here in case they're of use to anyone.
