Saturday, 28 September 2013

Counting words in all my LaTeX files with Python

So today I found out about texcount, which gives a nice detailed word count for LaTeX files. It's really easy to use and apparently comes with most LaTeX distributions (it was included with my TeX Live distribution on both Mac OS and Linux).

Running it is as simple as:

$ texcount file.tex

It outputs a nice detailed word count (broken down by section, etc.). I don't pretend to be an expert in anything, but I'm genuinely surprised I had never seen this before. I was about to write a (most probably terrible) Python script to count words in a given file and, just before starting, I thought "WAIT A MINUTE: someone must have done this"...

Anyway, I've gotten to the point of not being able to watch TV without a laptop at my fingertips doing some kind of work, so while keeping an eye on South Africa playing ridiculously well in their win over Australia in the Rugby Championship, I thought I'd see if I could have a bit of fun with texcount.

Here's a very simple Python script that will recursively search through all subdirectories of a given directory and count the words in every LaTeX file it finds:

#!/usr/bin/env python
import fnmatch
import os
import subprocess
import argparse
import pickle


def trim(t, p=0.01):
    """Trims the largest and smallest elements of t.

    Args:
        t: sequence of numbers
        p: fraction of values to trim off each end

    Returns:
        sorted list of values with both tails removed
    """
    t = sorted(t)
    n = int(p * len(t))
    if n > 0:  # t[0:-0] would be empty, so only slice when there is something to trim
        t = t[n:-n]
    return t


parser = argparse.ArgumentParser(description="A simple script to find word counts of all tex files in all subdirectories of a target directory.")
parser.add_argument("directory", help="the directory you would like to search")
parser.add_argument("-t", "--trim", type=float, default=0, help="fraction of the data to trim off each end")
args = parser.parse_args()
directory = args.directory
p = args.trim

# Walk the directory tree and collect every .tex file.
matches = []
for root, dirnames, filenames in os.walk(directory):
    for filename in fnmatch.filter(filenames, '*.tex'):
        matches.append(os.path.join(root, filename))

wordcounts = {}
fails = {}
for f in matches:
    print("-" * 30)
    print(f)
    # The '-1' switch makes texcount print just the totals on one line,
    # e.g. '123+45+6 ...', so sum the parts instead of eval'ing them.
    process = subprocess.Popen(['texcount', '-1', f],
                               stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                               universal_newlines=True)
    out, err = process.communicate()
    try:
        wordcounts[f] = sum(int(x) for x in out.split()[0].split("+"))
        print("\t has %s words." % wordcounts[f])
    except (IndexError, ValueError):
        print("\t Couldn't count...")
        fails[f] = err

# Save the raw counts so they can be reused without re-running texcount.
with open('latexwordcountin%s.pickle' % directory.replace("/", "-"), "wb") as dump_file:
    pickle.dump(wordcounts, dump_file)


try:
    # Imported here so the counting still works if matplotlib is missing.
    import matplotlib.pyplot as plt
    data = list(wordcounts.values())
    if p != 0:
        data = trim(data, p)
    plt.figure()
    plt.hist(data, bins=20)
    plt.xlabel("Words")
    plt.ylabel("Frequency")
    plt.title("Distribution of word counts in all my LaTeX documents\n ($N=%s$, mean=$%.0f$, max=$%s$)" % (len(data), sum(data) / len(data), max(data)))
    plt.savefig('latexwordcountin%s.svg' % directory.replace("/", "-"))
except ImportError:
    print("Graph not produced, perhaps you don't have matplotlib installed...")

(Please forgive the lack of comments throughout the code...)
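In case the slicing in trim looks odd, here's what it does on a toy data set (a standalone copy for illustration, with a guard so that a zero-length trim doesn't empty the list):

```python
def trim(t, p=0.01):
    """Return a sorted copy of t with the top and bottom fraction p removed."""
    t = sorted(t)
    n = int(p * len(t))  # number of elements to drop from each end
    if n > 0:  # t[0:-0] would be an empty list
        t = t[n:-n]
    return t

data = [3, 1, 100, 5, 2, 4, 6, 7, 9, 8]
print(trim(data, 0.1))  # → [2, 3, 4, 5, 6, 7, 8, 9]
```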

Here it is in a GitHub repo as well, in case anyone cares enough to want to improve it.

Here is the call that runs it on my entire Dropbox folder:

$ ./searchfiles.py ~/Dropbox

This will run through my entire Dropbox and count the words in every *.tex file (it threw up errors on some of my files, so there is some error handling in there). It outputs a dictionary of filename: word count pairs to a pickle file (so you can do whatever you want with that), and if you have matplotlib installed it should also produce the following histogram:

[Histogram: distribution of word counts in all my LaTeX files]

As you can see from that, it looks like I've got some files quite a lot bigger than the others (I'm guessing texcount counts the individual chapters as well as the thesis.tex files I have in there that include them...). So I've added an option to trim the data set before plotting:

$ ./searchfiles.py ~/Dropbox -t .05

This takes 5% of the data off each end of the data set and gives:

[Histogram: trimmed distribution of word counts]

Looking at that, I have a lot of very short LaTeX files (including some standalone images I've drawn to do stuff like this). If I had time I'd see how well a negative exponential fits that distribution, as it does indeed look kind of random. I'd love to see what other people's word count distributions look like...
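For what it's worth, checking that exponential hunch wouldn't take much: the maximum-likelihood fit of an exponential distribution has rate = 1 / sample mean, so one could compare the empirical tail against the fitted one at a few points. A rough sketch (the word_counts list here is made-up data standing in for the real counts):

```python
import math

# Hypothetical word counts standing in for the real data.
word_counts = [120, 45, 300, 80, 15, 600, 230, 90, 40, 160]

# For an exponential distribution, the maximum-likelihood estimate
# of the rate is simply the reciprocal of the sample mean.
mean = sum(word_counts) / len(word_counts)
rate = 1 / mean

def predicted_tail(x):
    """Survival function of the fitted exponential: P(X > x) = exp(-rate * x)."""
    return math.exp(-rate * x)

# Compare the empirical tail with the fitted one at a few cut-offs.
for x in (50, 200, 500):
    empirical = sum(c > x for c in word_counts) / len(word_counts)
    print(x, round(empirical, 2), round(predicted_tail(x), 2))
```

If the two columns track each other reasonably well, the exponential guess isn't crazy.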

Now, I can say that if I ever produce more than 600 words then I'm doing above average work...
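One last thing: because the counts end up in a pickle, they can be reloaded later without re-running texcount. The round trip looks like this (with a made-up dictionary and file name standing in for the real ones):

```python
import pickle

# A made-up stand-in for the dictionary the script produces.
wordcounts = {'thesis.tex': 12000, 'chapter1.tex': 4500, 'note.tex': 600}

# The script writes its results like this (binary mode matters for pickle)...
with open('latexwordcount-example.pickle', 'wb') as f:
    pickle.dump(wordcounts, f)

# ...so they can be reloaded later for further analysis:
with open('latexwordcount-example.pickle', 'rb') as f:
    reloaded = pickle.load(f)

# For example, the longest files first:
for path, count in sorted(reloaded.items(), key=lambda kv: kv[1], reverse=True):
    print(count, path)
```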