On G+ +Dima Pasechnik suggested the use of
wc
as a proxy but wc
gives the count of all words (include code words). I thought I'd see how far off wc
would be. So I modified the python script from my last post so that it not only runs texcount but also wc
and carries out a simple linear regression (using the stats package from scipy). The script is at the bottom of this blog post.Here's the scatter plot and linear fit for all the LaTeX files on my system:
We see that the line $y=.68x-27.01$ can be accepted as a predictor for the number of words in a LaTeX document as a function of the total number of code words.
As in my previous post I obviously have an outlier there so here's the scatter plot and linear fit when I remove that one larger file:
The coefficient is again very similar $y=.64x+26$.
So based on this I'd say that multiplying the number of codewords in a .tex file by .6 is going to give me a good indication of how many words I have in total.
Here's a csv file with my data, I'd love to know if other people have similar fits.
Here's the code (a +Dropbox link is here):
#!/usr/bin/env python import fnmatch import os import subprocess import argparse import pickle import csv from matplotlib import pyplot as plt from scipy import stats parser = argparse.ArgumentParser(description="A simple script to find word counts of all tex files in all subdirectories of a target directory.") parser.add_argument("directory", help="the directory you would like to search") parser.add_argument("-t", "--trim", help="trim data percentage", default=0) args = parser.parse_args() directory = args.directory p = float(args.trim) matches = [] for root, dirnames, filenames in os.walk(directory): for filename in fnmatch.filter(filenames, '*.tex'): matches.append(os.path.join(root, filename)) wordcounts = {} codewordcounts = {} fails = {} for f in matches: print "-" * 30 print f process = subprocess.Popen(['texcount', '-1', f],stdout=subprocess.PIPE) out, err = process.communicate() try: wordcounts[f] = eval(out.split()[0]) print "\t has %s words." % wordcounts[f] except: print "\t Couldn't count..." fails[f] = err process = subprocess.Popen(['wc', '-w', f],stdout=subprocess.PIPE) out, err = process.communicate() try: codewordcounts[f] = eval(out.split()[0]) print "\t has %s code words." % codewordcounts[f] except: print "\t Couldn't count..." fails[f] = err pickle.dump(wordcounts, open('latexwordcountin%s.pickle' % directory.replace("/", "-"), "w")) pickle.dump(codewordcounts, open('latexcodewordcountin%s.pickle' % directory.replace("/", "-"), "w")) x = [codewordcounts[e] for e in wordcounts] y = [wordcounts[e] for e in wordcounts] slope, intercept, r_value, p_value, std_err = stats.linregress(x,y) plt.figure() plt.scatter(x, y, color='black') plt.plot(x, [slope * i + intercept for i in x], lw=2, label='$y = %.2fx + %.2f$ ($p=%.2f$)' % (slope, intercept, p_value)) plt.xlabel("Code words") plt.ylabel("Words") plt.xlim([0, plt.xlim()[1]]) plt.ylim([0, plt.ylim()[1]]) plt.legend() plt.savefig('wordsvcodewords.png') data = zip(x,y) f = open('wordsvcodewords.csv', 'w') wrtr = csv.writer(f) for row in data: wrtr.writerow(row) f.close()
No comments:
Post a Comment