Wednesday, 23 October 2013

Pigeon holes, Markov chains and Sagemath.

On the 16/10/2013 I posted the following picture on G+:

Here's what I wrote on that post:
For a while now there's been a 'game' going on with our pigeon holes where people would put random objects in other people's pigeon holes (like the water bottle you see in the picture). These objects would then follow a random walk around the pigeon holes as each individual would find an object in their pigeon hole and absent-mindedly move it to someone else's pigeon hole.
As such each pigeon hole could be thought of as being a transient state in a Markov chain (http://en.wikipedia.org/wiki/Markov_chain). What is really awesome is that one of the PhD students here didn't seem to care when these random objects appeared in her pigeon hole. Her pigeon hole was in fact an absorbing state. This has now resulted in more or less all random objects (including a wedding photo that no one really knows the origin of) to be in her pigeon hole.
I thought I'd have a go at modelling this as an actual Markov chain. Here's a good video by a research student of mine (+Jason Young) describing the very basics of a Markov chain:



To model the movement of an object as a Markov chain we first of all need to describe the states. In our case this is pretty easy and we simply number our pigeon holes and refer to them as states. In my example there I've decided to model a situation with 12 pigeon holes.


What we now need is a set of transition probabilities which model the random behaviour of people finding an object in their pigeon hole and absent-mindedly moving it to another pigeon hole.

This will be in the form of a matrix $P$. Where $P_{ij}$ denotes the probability of going from state $i$ to state $j$.

I could sit in our photocopier room (that's where our pigeon holes are) and take notes as to where the individual who owns pigeon hole $i$ places the various objects that appear in their pigeon hole...
That would take a lot of time and sadly I don't have any time. So instead I'm going to use +Sage Mathematical Software System. The following code gives a random matrix:

N = 12
P = random_matrix(QQ, N, N)

This is just a random matrix over $\mathbb{Q}$ so we need to do tiny bit of work to make it a stochastic matrix:

P = [[abs(k) for k in row] for row in P]  # This ensures all our numbers are positive
P = matrix([[k / sum(row) for k in row] for row in P]) # This ensures that our rows all sum to 1

The definition of a stochastic matrix is any matrix $P$ such that:
  • $P$ is square
  • $P_{ij}\geq 0$ (all probabilities are non negative)
  • $\sum_{j}P_{ij}=1\;\forall\;i$ (when leaving state $i$ the probabilities of going to all other states must sum to 1)
Recall that our matrix is pretty big (12 by 12) so we the easiest way to visualise it is through a heat map:

P.plot(cmap='hsv',colorbar=True)

Here's what a plot of our matrix looks like (I created a bunch of random matrix gifs here):



We can find the steady state probability of a given object being in any given state using a very neat result (which is not actually that hard to prove). This probability vector $\pi$ (where $\pi_i$ denotes the probability of being in state $i$) will be a solution of the matrix equation:

$$\pi P = \pi$$

To solve this equation it can be shown that we simply need to find the eigenvector of $P$ corresponding to the unit eigenvalue:

eigen = P.eigenvectors_left()  # This finds the eigenvalues and eigenvectors

To normalise our eigenvector we can do this:

pi = [k[1][0] for k in eigen if k[0] == 1][0]  # Find eigenvector corresponding to unit eigenvalue
pi = [k / sum(pi) for k in pi]  # normalise eigenvector

Here's a bar plot of out probability vector:

bar_chart(pi)



We can read the probabilities from this chart and see the probability of finding any given object in a particular pigeon hole. The bar_chart function in Sage still needs a bit of work and at the moment can only print a single list of data so it automatically has the axis indexed from 0 onwards (not from 1 to 12 as we would want). We can easily fix this using some matplotlib code (Sage is just wrapping matplotlib anyway):

import matplotlib.pyplot as plt

plt.figure()
plt.bar(range(1, N + 1), pi)
plt.savefig("betterbarplot.png")

Here's the plot:


We could of course pass a lot more options to the matplotlib plot to make it just as we want (and I'll in fact do this in a bit). The ability to use base python within Sage is really awesome.

One final thing we can do is run a little simulation of our objects going through the chain. To do this we're going to sample a sequence of states (pigeon holes $i$). For every $i$ we sample a random number $0\ r\leq 1$ and find $j$ such that $\sum_{j'=1}^{j}P_{ij'}. This is a random sampling technique called inverse random sampling.

import random

def nextstate(i, P):
    """
    A function that takes a transition matrix P, a current state i (assumingstarting at 0) and returns the next state j
    """
    r = random.random()
    cumulativerow = [P[i][0]]
    for k in P[i][1:]:  # Iterate through elements of the transition matrix
        cumulativerow.append(cumulativerow[-1] + k)  # Obtain the cumulative distribution
    for j in range(len(cumulativerow)):
        if cumulativerow[j] >= r:  # Find the next state using inverse sampling
            return j
    return j

states = [0]
numberofiterations = 1000
for k in range(numberofiterations):
    states.append(nextstate(states[-1],P))
We can now compare our simulation to our theoretical result:
import matplotlib.pyplot as plt

plt.figure()
plt.bar(range(1, N + 1), pi, label='Theory')  # Plots the theoretical results
plt.hist([k + 1 for k in states], color='red', bins=range(1, N + 2), alpha=0.5, normed=True, histtype='bar', label='Sim')  # Plots the simulation result in a transparent red
plt.legend()  # Tells matplotlib to place the legend
plt.xlim(1, N)  # Changes the limit of the x axis
plt.xlabel('State')  # Include a label for the x axis
plt.ylabel('Probability')  # Include a label for the y axis
plt.title("After %s steps" % numberofiterations)  # Write the title to the plot
plt.savefig("comparingsimulation.png")

We see the plot here:


A bit more flexing of muscles allows us to get the following animated gif in which we can see the simulation confirming the theoretical result:



This post assumes that all our states are transitive (although our random selection of $P$ could give us a non transitive state) but the motivation of my post is the fact that one of our students' pigeon holes was in fact absorbing. I'll write another post soon looking at that (in particular seeing which pigeon hole is most likely to move the object to the absorbing state).

Saturday, 19 October 2013

Just over two years with linux and the terminal: some thoughts and a plugin-less vimrc

So two years ago I decided to find out what linux was. Here's my first post on +Google+ 'announcing' that I was going all in and asking for some tips:


At the time I had some comments about using vim to modify my .bashrc. I politely thanked people saying that I'd look in to it and having absolutely no idea what was going on.

After a couple of weeks I had worked really hard to get a system that gave me everything that my work Windows machine did (office working, LaTeX gui setup etc). At that point I remember speaking to a friend of mine that was a hardened linux user saying: "ok, I've not lost anything since moving from Windows but what's the point? What have I gained?". My friend didn't really give me a satisfactory answer but I quite enjoy being out of my comfort zone so I kept with it.

After a couple more weeks I began to see that a big  point is the community. For anything I wanted help with I could find some people who had done it before and timidly open up this voodoo-black-magic thing called the "terminal" and paste in some code that would fix whatever needed fixing.

So at that point I thought the point was:
  • It's free;
  • Don't lose anything;
  • Gain a community
After a couple of months of scarily copying things in to the terminal and hearing people talk about 'scripting' and various other things I thought: "Right let's give the terminal a go". So I learnt vi(m). This is old but still amuses me:


That took me a while and I was ridiculously slow at first, basically staying away from anything but insert mode and slowly learning a few new commands every now and then.

I also began to understand the point of scripting and what the "terminal" is. I now script more or less everything I do, find myself typing ':wq' in my gmail window all the time, love git and pretty much sigh every time there's a particular thing that I need to do using my mouse and a gui.

The efficiency with which I can do stuff in the terminal is completely incomparable to how I worked before. I'm still very much an amateur but I really do love the terminal (I only learnt the other day that you can middle click to paste anything that is highlighted: that blew my mind).

I realise now that the point with linux is that I don't think I was really using a computer to it's full potential before, I was just using some stuff that people had put in place for me to do some stuff (there's nothing really wrong with that, it's just a bit constraining)... I'm again in no way an expert and there's so much I still can't wait to learnt (I ticked symbolic links of the list a couple of days ago). 

The other point I think with the terminal this time (and more generally with being comfortable outside of a GUI) is that there's so much amazing software out there that does ridiculous stuff but that does not have a GUI (one reason being how long it takes to make the d**n things).

Another great thing with forcing myself to get to know linux is that I can also use all the relevant skills on a Mac. My workflow is basically a browser for my gmail and a terminal for vim so I am really happy on either machine (I prefer my linux box for the ease of getting things exactly the way I like them, while my work machines have some commercial software that I occasionally need when I get particular types of email attachments and my Imac is quite possibly the prettiest thing I've ever seen). I hear that with powershell and cygwin and things like that you can almost get a Windows box in the same shape but I can't say I see myself wanting to try that.

Using any machine becomes extremely simple. Ssh'ing in to a server is a very comfortable thing to do as all I really need is vim. To make that a bit more comfortable I've put my vimrc up on github so I can just clone that and basically be at home anywhere. Here's a bit more about my vimrc:

A basic vimrc

I really fell in love with vim about a year ago after watching +Martin Brochhaus's talk showing how to turn vim into a +Python IDE:


After watching the talk I immediately rushed off to get pathogen setup and got the various plugins Martin mentioned working. It was awesome: my vim experience with Python was ridiculously awesome. I rushed to find a bunch of LaTeX plugins and I was pretty much complete.

I've decided now though that I want to understand those things a bit better so I'm going to start from a much more basic vimrc (just with some aliases and what not) and slowly pick and choose plugins bit by bit making sure I completely understand what they do.

That vimrc is here (it's based on a bunch of stuff from the basic parts of Martin's vimrc which you can find here).

I've put this up on github in a repository I plan on growing. I've also written a simple python script that creates a symbolic link in the home directory so that I just need to keep the repositories synced on all my machines and it'll all just work. In future that python scipt will check if pathogen is setup and take care of the plugins (I think).

Sunday, 13 October 2013

Setting up Zotero to use Dropbox for your attachments

So I've used +Mendeley in the past but recently started using Zotero to handle my references. There are certain aspects that I find +Mendeley a bit more user friendly for but having changed my workflow slightly I'm now a huge fan of Zotero (the scrapping tool is far superior to +Mendeley's ).

I don't use the firefox version but the standalone app which runs great on my linux box and my Mac.

I have a bunch of Dropbox space and the fact that Zotero doesn't play super nice with +Dropbox was a bit of a pain (you just had to point Mendeley at your +Dropbox folder and you were done). For a while I've just been using Zotero's base online storage but today I've just set it up to work on Dropbox so I thought I'd record how I've done it.

I asked a while back about this on +Google+ and had a bunch of people telling me: oh just use symbolic links it's super easy. Symbolic links have been one of those things I've been meaning to understand for a while but at that moment in time I just nodded and smiled. This morning I've just taken the time to figure it out and in particular see how to set it up to work with Zotero. It is super easy and so I thought I'd write a post in case it helps anyone and also to make sure I remember how to do it. I also didn't seem to be able to find anywhere online that explains the second half of what you need to do so that's here too.

First of all, if you check forums and stuff, there's a big safety warning with using Dropbox with Zotero. I understand that this is mainly because Zotero has multiple things going on: a database (that knows what is what and who wrote what etc) but also a storage folder (which contains the pdfs). This post is about division of labour: we're going to setup Zotero to take care of the database and +Dropbox to take care of the pdfs. There are basically 2 steps on 1 computer and 2 on every other.

If you use the preferences and change the data directory to dropbox, then if you're running on more than one box you will most likely corrupt the database (which is a bad thing: 'run you fools!').



So basically: don't mess with your data directory.

The 'solution' (I think) is to use 'symbolic links' to trick your computer in to thinking that your storage folder is also on your +Dropbox ( +Dropbox are the 'idiots' here and won't know the difference and just sync it all).

So to do that, choose computer 1. On computer 1 you setup a symbolic link from your Zotero storage file to your +Dropbox folder. This is what I did:

Step A1. Go to your preferences in Zotero and click on 'Show Data Directory'. This will open up your zotero folder (which contains the storage folder that we're looking for). Remember the path for this (click on info, or properties or something).

Step A2. Now for the magic trick: we create a symbolic link:


ln -s paththatitoldyoutoremember/storage ~/Dropbox/Literature/ZoteroStorage/


This tells computer 1 (and +Dropbox) that there's a folder in /Dropbox/Literature/ that contains folders with all your pdfs (it actually contains directories for each file):


(There is however no such folder, just a symbolic link that tricks everyone involved in to thinking that there is such a folder.)

That is however not everything. You now need to go to your other machines and tell Zotero on there that the storage file isn't exacly what it thinks it is (we trick it).

Step B1. So on computer 2 (and any other computer) go to the Zotero folder (remember just click on 'Show Data Directory' to find the path which you want to remember ie copy to your clipboard) and delete the storage folder (I think: be careful, don't sue me...):


rm -r otherpaththatitoldyoutoremember/storage


(make sure it deletes, when I did this on one machine I had to get rid of the folder again for some reason)

Step B2. Once we've done that we need to tell zotero not to worry and create a symbolic link of the storage folder (which it thinks is an actual folder) that is now in our +Dropbox:


ln -s ~/Dropbox/Literature/ZoteroStorage/ otherpaththatitoldyoutoremember/storage


That's basically it. If you now take a look at that folder/symlink on computer 2 you'll see all the folders (containing the pdfs) from your other machine (I'm blurring a bit of the path in case that somehow tells you my credit card number):

Now if you add a new file to Zotero and any given computer +Dropbox will first of all copy over the pdfs to all the right places (the symbolic links take care of the pdfs) and then when you sync zotero it'll also have the correct data (Zotero takes care of the database).

(Finally you can go to your preferences on Zotero on all your machines and turn off attachment syncing as +Dropbox is now taking care of that).

Friday, 11 October 2013

Revisiting the relationship between word counts and code word counts in LaTeX documents

In this previous post I posted some python code that would recursively search though all directories in a directory and find all .tex files. Using texcount and wc the code the script would return a scatter plot of the number of words against the number of code words with a regression line fitted.

Here's the plot from all the .tex files on my machine:



That post got quite a few views and +Robert Jacobson was kind enough to not only fix and run the script on his machine but also sent over his data. I subsequently tweaked the code slightly so that it also returns a histogram. So here's some more graphs:

  • Robert's teaching tex files:




  • Robert's research files:



It looks like my .6 ratio between code words and words isn't quite the same for Robert...

BUT if we combine all our files together we get:



So I'm still sticking to the rule of thumb for words in a LaTeX file: multiply your number of code words by .65 to get in the right ball park. (But more data would be cool so please do run the script on your files :)).

The tweaked code (including Robert's debugging) can be found in this github repo: https://github.com/drvinceknight/LaTeXFilesWordCount

Sunday, 6 October 2013

Bloom's taxonomy drawn in Tikz

I'm in the middle of writing about various pedagogic theories for PCUTL (a higher education certification process) and I needed a picture of Bloom's taxonomy:



I needed to be able to play around with it a bit (so as to add annotations and colours like in the above picture) so I wanted something in Tikz. I found this helpful stack-exchange post for hierarchical pyramids and stole modified the code from there to get Bloom's taxonomy in Tikz. Here's the stripped down version:



The code is here (I modified the following slightly to give the above standalone image using the tikz standalone document class):


\documentclass{article}

\usepackage{tikz}
\usetikzlibrary{intersections}

\title{Bloom's taxonomy}


\begin{document}
\begin{center}

\begin{tikzpicture}
\coordinate (A) at (-6,1) {};
\coordinate (B) at ( 6,1) {};
\coordinate (C) at (0,7.5) {};
\draw[name path=AC] (A) -- (C);
\draw[name path=BC] (B) -- (C);
\foreach \y/\A in {1/Knowledge,2/Comprehension,3/Application,4/Analysis,5/Synthesis,6/Evaluation} {
\path[name path=horiz] (A|-0,\y) -- (B|-0,\y);
\draw[name intersections={of=AC and horiz,by=P},
name intersections={of=BC and horiz,by=Q}] (P) -- (Q)
node[midway,above] {\A};
}
\end{tikzpicture}
\end{center}

\end{document}


I put it up on +writeLaTeX as well: here.


Saturday, 5 October 2013

Almost a 2 to 1 ratio of total code words to words in my LaTeX files...

In my previous post I posted a small python script that will recursively go through all directories in a directory and return the word count distribution using texcount (a utility that strips away LaTeX code to count words in documents). In this one I'm going to try and find a way of finding out how many words are in my LaTeX files without counting them (kind of).

On G+ +Dima Pasechnik suggested the use of wc as a proxy but wc gives the count of all words (include code words). I thought I'd see how far off wc would be. So I modified the python script from my last post so that it not only runs texcount but also wc and carries out a simple linear regression (using the stats package from scipy). The script is at the bottom of this blog post.

Here's the scatter plot and linear fit for all the LaTeX files on my system:




We see that the line $y=.68x-27.01$ can be accepted as a predictor for the number of words in a LaTeX document as a function of the total number of code words.

As in my previous post I obviously have an outlier there so here's the scatter plot and linear fit when I remove that one larger file:

The coefficient is again very similar  $y=.64x+26$.

So based on this I'd say that multiplying the number of codewords in a .tex file by .6 is going to give me a good indication of how many words I have in total.

Here's a csv file with my data, I'd love to know if other people have similar fits.

Here's the code (a +Dropbox link is here):

#!/usr/bin/env python import fnmatch import os import subprocess import argparse import pickle import csv from matplotlib import pyplot as plt from scipy import stats parser = argparse.ArgumentParser(description="A simple script to find word counts of all tex files in all subdirectories of a target directory.") parser.add_argument("directory", help="the directory you would like to search") parser.add_argument("-t", "--trim", help="trim data percentage", default=0) args = parser.parse_args() directory = args.directory p = float(args.trim) matches = [] for root, dirnames, filenames in os.walk(directory): for filename in fnmatch.filter(filenames, '*.tex'): matches.append(os.path.join(root, filename)) wordcounts = {} codewordcounts = {} fails = {} for f in matches: print "-" * 30 print f process = subprocess.Popen(['texcount', '-1', f],stdout=subprocess.PIPE) out, err = process.communicate() try: wordcounts[f] = eval(out.split()[0]) print "\t has %s words." % wordcounts[f] except: print "\t Couldn't count..." fails[f] = err process = subprocess.Popen(['wc', '-w', f],stdout=subprocess.PIPE) out, err = process.communicate() try: codewordcounts[f] = eval(out.split()[0]) print "\t has %s code words." % codewordcounts[f] except: print "\t Couldn't count..." fails[f] = err pickle.dump(wordcounts, open('latexwordcountin%s.pickle' % directory.replace("/", "-"), "w")) pickle.dump(codewordcounts, open('latexcodewordcountin%s.pickle' % directory.replace("/", "-"), "w")) x = [codewordcounts[e] for e in wordcounts] y = [wordcounts[e] for e in wordcounts] slope, intercept, r_value, p_value, std_err = stats.linregress(x,y) plt.figure() plt.scatter(x, y, color='black') plt.plot(x, [slope * i + intercept for i in x], lw=2, label='$y = %.2fx + %.2f$ ($p=%.2f$)' % (slope, intercept, p_value)) plt.xlabel("Code words") plt.ylabel("Words") plt.xlim([0, plt.xlim()[1]]) plt.ylim([0, plt.ylim()[1]]) plt.legend() plt.savefig('wordsvcodewords.png') data = zip(x,y) f = open('wordsvcodewords.csv', 'w') wrtr = csv.writer(f) for row in data: wrtr.writerow(row) f.close()