Friday, 11 October 2013

Revisiting the relationship between word counts and code word counts in LaTeX documents

In this previous post I posted some Python code that recursively searches through all the directories within a directory and finds all .tex files. Using texcount and wc, the script returns a scatter plot of the number of words against the number of code words, with a regression line fitted.
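For anyone who hasn't seen the earlier post, here is a rough sketch of the counting step. It is not the actual script (that one is in the repo linked at the end of this post), and the function names are made up for illustration. The texcount flags (-1 and -sum) are my assumption about a convenient invocation, and the pure Python token count only approximates what wc -w reports.

```python
import os
import re
import subprocess


def find_tex_files(root):
    """Recursively collect the paths of all .tex files below root."""
    tex_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".tex"):
                tex_files.append(os.path.join(dirpath, name))
    return sorted(tex_files)


def count_code_words(path):
    """Count whitespace-separated tokens in the raw file (roughly `wc -w`)."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return len(f.read().split())


def count_words(path):
    """Ask texcount for the number of words in the typeset text.

    Assumes texcount is installed and on the PATH; the flags used here
    are an assumption, not necessarily what the original script passes.
    """
    result = subprocess.run(
        ["texcount", "-1", "-sum", path],
        capture_output=True,
        text=True,
        check=True,
    )
    # Take the first integer in the output, in case extra text is printed.
    match = re.search(r"\d+", result.stdout)
    return int(match.group()) if match else 0
```

Calling find_tex_files on a directory and then the two counting functions on each path gives the pairs of numbers that go into the scatter plot.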

Here's the plot from all the .tex files on my machine:



That post got quite a few views, and +Robert Jacobson was kind enough not only to fix and run the script on his machine but also to send over his data. I subsequently tweaked the code slightly so that it also returns a histogram. So here are some more graphs:

  • Robert's teaching tex files:




  • Robert's research files:



It looks like my .6 ratio between code words and words isn't quite the same for Robert...

BUT if we combine all our files together we get:



So I'm still sticking to the rule of thumb for words in a LaTeX file: multiply your number of code words by .65 to get in the right ballpark. (But more data would be cool, so please do run the script on your files :)).
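If you want to check the rule of thumb against your own counts, something like the following would do. It's a minimal sketch in plain Python, not the code from the repo: fit_line is an ordinary least-squares fit that I've written out by hand, and the numbers in the usage example are invented purely to show the calls.

```python
def fit_line(code_words, words):
    """Ordinary least-squares fit of words = slope * code_words + intercept."""
    n = len(code_words)
    mean_x = sum(code_words) / n
    mean_y = sum(words) / n
    sxx = sum((x - mean_x) ** 2 for x in code_words)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(code_words, words))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_words(code_words, ratio=0.65):
    """Rule-of-thumb estimate: words is roughly ratio times code words."""
    return ratio * code_words


# Invented counts, for illustration only (not real data from either machine).
example_code_words = [100, 200, 300]
example_words = [65, 130, 195]
slope, intercept = fit_line(example_code_words, example_words)
```

With real data the fitted slope plays the role of the ratio above, so comparing it to .65 is a quick way to see how well the rule holds for your files.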

The tweaked code (including Robert's debugging) can be found in this GitHub repo: https://github.com/drvinceknight/LaTeXFilesWordCount