Removing duplicates in unsorted lists

In a recent project I was working on in my spare time, I came across the need to remove duplicate entries from an unsorted list of data.

When the data can be sorted, removing duplicates is fairly easy: simply pass it through the “sort” command, followed by the “uniq” command. For example, if data.txt contains the following unsorted list, duplicates can be removed in this manner:

data.txt
--------
12
7
6
9
6
7
12
13

cat data.txt | sort | uniq

output:

6
7
9
12
13

Easy enough, right?

For this particular project, however, I did not want to sort the data first: each entry had been added in chronological order along with a timestamp, but in this case the timestamp had to be stripped out ahead of time, so there was nothing left to sort on without destroying the chronological order I wanted to keep. Furthermore, the “uniq” command only removes duplicate entries that are adjacent to one another (which is why “sort” has to be invoked before passing the data to “uniq”), as the example below shows.
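Running “uniq” by itself on the unsorted file removes nothing at all, because no two duplicate entries happen to sit next to each other:

cat data.txt | uniq

output:

12
7
6
9
6
7
12
13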

The solution: a particularly nasty sed one-liner, which is probably not worth deciphering in full (a rough breakdown follows the output below). Suffice it to say that it keeps a buffer of previously seen entries in memory and compares each new entry against that buffer: if the entry already exists in the buffer, it is discarded; if it does not, it is added to the buffer and printed. The result is the original list, in its original order, with every duplicate after the first occurrence removed.

Here it is:

cat data.txt | /bin/sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'

output:

12
7
6
9
13
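
For the curious, here is my rough reading of what each command in the one-liner does, with sed's hold space serving as the buffer of previously seen entries:

G                          append the hold space (the buffer) to the current entry
s/\n/&&/                   double the newline separating the new entry from the buffer, so the check below can match even against the most recent buffer entry
/^\([ -~]*\n\).*\n\1/d     if the new entry already appears as a line in the buffer, delete it (nothing is printed, the buffer is left alone)
s/\n//                     otherwise remove the extra newline again
h                          save the updated buffer, now including the new entry, back to the hold space
P                          print up to the first newline, i.e. the new entry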

Because this keeps every unique entry in memory, it could be problematic when used on very large data sets, so keep that in mind. But, these days, I don’t imagine that would ever be much of a problem.
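
As an aside, if awk is available, the same order-preserving de-duplication can be written far more readably; it keeps the seen entries in an in-memory array, so the same memory caveat applies:

cat data.txt | awk '!seen[$0]++'

Each line is printed only the first time it is encountered, which produces the same output as the sed version.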
