Perl one-liners for the bioinformatician

In the Perl spirit of "Programming is fun", here are some one-liners that might be actually useful. Please mail me yours, the best one-liner writer wins a Perl magnetic poetry kit. Contest closes July 31st, 2000. Please note that it is me personally running this competition, not NRCC, CBR or IMB. "Best" is subjective, and will be determined by an open vote.


Take a multiple sequence FASTA sequence file and print the non-redundant subset with the description lines for identical sequence concatenated with ;'s. Not a small one-liner, but close enough.

perl -ne 'BEGIN{$/=">";$"=";"}($d,$_)=/(.*?)\n(.+?)>?$/s;push @{$h{lc()}},$d if$_;END{for(keys%h){print">@{$h{$_}}$_"}}' filename

Split a multi-sequence FastA file into individual files named after their description lines.

perl -ne 'BEGIN{$/=">"}if(/^\s*(\S+)/){open(F,">$1")||warn"$1 write failed:$!\n";chomp;print F ">", $_}'

Take a blast output and print all of the gi's matched, one per line.

perl -pe 'next unless ($_) = /^>gi\|(\d+)/;$_.="\n"' filename

Filter all repeats of length 4 or greater from a FASTA input file. This one is thanks to Lincoln Stein and Gustavo Glusman's discussions on the bio-perl mailing list.

perl -pe 'BEGIN{$_=<>;print;undef$/}s/((.+?)\2{3,})/"N"x length$1/eg' filename


Paul Gordon
Last modified: Sun Jul 25 20:51:21 AST 1999