Saturday 30 March 2013

Concatenating and removing duplicates from two files

I'm posting this mainly to remind myself how to do this, as I keep forgetting (anyone with good *nix-fu won't learn anything here).

I've been running code for a long time, gathering data for a paper I will perhaps one day have time to write. I analyse the csv file routinely thanks to a sleep command, and everything is synced in a Dropbox folder, so if I'm bored I can take a look at this kind of graph every now and then:




(You can see that a particular measure for whatever I'm working on has arrived at steady state.)

Anyway! That's not the point.

The point is that every now and then Dropbox will create conflicted copies:


Concatenating (use cat)

First of all I need to gather those two csv files together:

cat Output_file_with_permute.csv Output_file_with_permute\ \(Vince\ Knight\'s\ conflicted\ copy\ 2013-03-06\).csv > fixed.csv

We can check that we do indeed have all the rows together using grep -c . to count the number of non-empty rows in each file:

cat Output_file_with_permute.csv | grep -c .
cat Output_file_with_permute\ \(Vince\ Knight\'s\ conflicted\ copy\ 2013-03-06\).csv | grep -c .
cat fixed.csv | grep -c .

The output is shown (31499 = 9702 + 21797):
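The arithmetic can also be left to the shell. This is only a minimal sketch: a.csv and b.csv are made-up stand-ins for the two real files, created in a scratch directory so nothing real is touched.

```shell
# Scratch directory with two tiny, hypothetical csv files
tmp=$(mktemp -d)
printf '1,2\n3,4\n' > "$tmp/a.csv"
printf '3,4\n5,6\n7,8\n' > "$tmp/b.csv"
cat "$tmp/a.csv" "$tmp/b.csv" > "$tmp/fixed.csv"

# grep -c . counts the non-empty lines in a file
rows_a=$(grep -c . "$tmp/a.csv")
rows_b=$(grep -c . "$tmp/b.csv")
rows_all=$(grep -c . "$tmp/fixed.csv")

# The concatenated file should hold exactly the sum of the two
if [ "$rows_all" -eq $((rows_a + rows_b)) ]; then
    echo "OK: $rows_all = $rows_a + $rows_b"
else
    echo "Row counts do not add up"
fi
```

One thing to watch for with real files: if the first file does not end with a newline, cat will glue its last row onto the first row of the second file and the total will come out one short.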




Now we need to make sure we don't have any duplicates in fixed.csv.

Removing duplicates

This is really simple using the sort and uniq commands:

sort fixed.csv | uniq > fixed_Output_file.csv

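This sorts the file so that identical rows end up next to each other, and then the uniq command outputs each distinct row just once (uniq only collapses repeats that are adjacent, which is why the sort comes first).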
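To see why the sort matters, here is a small sketch on a made-up file (rows.csv is hypothetical). It also shows sort -u, which for plain data like this should give the same result in one step.

```shell
# Scratch file whose duplicate rows are NOT next to each other
tmp=$(mktemp -d)
printf 'b\na\nb\na\n' > "$tmp/rows.csv"

# uniq alone only collapses adjacent repeats, so nothing is removed here
uniq_only=$(uniq "$tmp/rows.csv" | grep -c .)

# Sorting first brings the duplicates together so uniq can drop them
sorted_uniq=$(sort "$tmp/rows.csv" | uniq | grep -c .)

# sort -u sorts and drops duplicates in a single command
sort_u=$(sort -u "$tmp/rows.csv" | grep -c .)

echo "$uniq_only $sorted_uniq $sort_u"
```

The three counts come out as 4 2 2: uniq on its own leaves all four rows in place, while both sorted versions keep only the two distinct ones.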

If I count how many rows are in the new file:

cat fixed_Output_file.csv | grep -c .

I get 21797 rows so it looks like the conflicted file didn't have any rows that the main file was missing.
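Rather than inferring this from the counts, the rows that appear only in the conflicted copy could be listed directly with comm. This is a sketch on made-up data (main.csv and conflicted.csv are hypothetical stand-ins), not something I ran on the real files.

```shell
# Scratch directory with two tiny, hypothetical csv files
tmp=$(mktemp -d)
printf '1,2\n3,4\n5,6\n' > "$tmp/main.csv"
printf '3,4\n9,9\n' > "$tmp/conflicted.csv"

# comm expects sorted input, so sort (and deduplicate) both files first
sort -u "$tmp/main.csv" > "$tmp/main.sorted"
sort -u "$tmp/conflicted.csv" > "$tmp/conflicted.sorted"

# -1 hides rows only in the first file, -3 hides rows found in both,
# which leaves the rows that exist only in the conflicted copy
only_in_conflicted=$(comm -13 "$tmp/main.sorted" "$tmp/conflicted.sorted")
echo "$only_in_conflicted"
```

Here the only row printed is 9,9. Empty output would mean the conflicted copy adds nothing new.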

I've used all this before when I had code running on multiple machines, which created a bunch of conflicted copies (because of how Dropbox does things) with relevant data all over the place.
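With several conflicted copies, a glob saves typing out every escaped file name. A rough sketch, assuming all the copies share a common prefix; the file names below are invented for illustration.

```shell
# Scratch directory with a main file and two made-up conflicted copies
tmp=$(mktemp -d)
printf '1,2\n3,4\n' > "$tmp/Output_file.csv"
printf '3,4\n5,6\n' > "$tmp/Output_file (machine A's conflicted copy).csv"
printf '5,6\n7,8\n' > "$tmp/Output_file (machine B's conflicted copy).csv"

# The shell expands the * to every matching file, spaces and apostrophes
# included, so no backslash escaping is needed. The result is written to
# a name that does not match the glob, so it is never read as an input.
cat "$tmp"/Output_file*.csv | sort -u > "$tmp/merged.tmp"

merged_rows=$(grep -c . "$tmp/merged.tmp")
echo "$merged_rows"
```

The three files hold six rows between them, four of which are distinct, so the merged file ends up with 4 rows.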

The final step is to simply clean all this up by removing the unwanted files:

mv fixed_Output_file.csv Output_file_with_permute.csv
rm fixed.csv
rm Output_file_with_permute\ \(Vince\ Knight\'s\ conflicted\ copy\ 2013-03-06\).csv

As I said above, the main reason I've written this post is to try and make sure I remember how to do this (I've had to google it every time I need to do it)...

2 comments:

  1. Instead of using grep -c . to count the number of lines, you can also use
    wc -l
    (wc -> word count tool)

    to count the number of lines.
