The client was running Windows, and the DOS sort command did not seem to
have a -u option as the Linux one does.
Since the client already had Perl installed (for the other
script), I decided to write a sort-by-unique script in Perl. I
was quite surprised by the results. The file in question was
326MB* (a pipe-delimited scrape of the business listings on Yellow Pages).
Running time cat listings.csv | sort -u >> test.csv took
approximately 6 minutes and 30 seconds. Sorting the same file
with my Perl script took approximately 10 seconds.
The Linux sort command is written in C. I find this interesting
because C is generally much faster than Perl (although design is far more
important for optimization than the speed of the language). Since
my little script obviously isn't the result of some ingenious design,
I think what this best illustrates is that certain languages are
best for certain jobs because of their built-in data structures, such as
Perl's hash. Some data structures are simply better suited for certain jobs
and allow for simpler algorithms.
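#!/usr/bin/perl
use strict;
use warnings;

# Takes the input and output file names as command-line arguments.
my ($input, $output) = @ARGV;
my %uniques = ();

open(my $in, '<', $input) or die "Cannot open input file $input...\n";
# Append to the output file, as the shell command does with >>.
open(my $out, '>>', $output) or die "Cannot open output file $output...\n";

# Each line becomes a hash key, so duplicate lines collapse into one entry.
while (my $line = <$in>)
{ $uniques{$line} = $line; }
close($in);

# Only the unique lines are sorted and written out.
foreach my $key (sort(keys %uniques))
{ print $out $key; }
close($out);

# Unused: a descending comparator that is never passed to sort above.
sub ksort
{ $uniques{$b} cmp $uniques{$a}; }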
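
The core of the script above is that hash keys are unique, so storing each line as a key removes duplicates before anything is sorted. Here is a minimal sketch of that idea on a small in-memory list. The sample data and the sort_unique name are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read from a file.
my @lines = ("pear|3\n", "apple|1\n", "pear|3\n", "apple|1\n", "fig|2\n");

# Return the distinct elements of a list in sorted order.
# Hash keys are unique, so storing each line as a key drops duplicates.
sub sort_unique {
    my %seen;
    $seen{$_} = 1 for @_;
    return sort keys %seen;
}

my @unique = sort_unique(@lines);
print @unique;
printf "%d lines in, %d unique lines out\n", scalar(@lines), scalar(@unique);
```

With this approach the sort only ever sees the distinct lines. That may help explain the timing on a file as redundant as the one described in the footnote, though I have not profiled either program.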
* This file contained a large amount of redundancy: the resulting
output was only 1MB (about 0.003 of the original size).