BioInfo: Renaming sequences in a FASTA file

I ran into the problem of duplicate sequence names after downloading a collection of genes. Some alignment programs, such as MAFFT, do not mind this, but Clustal (ClustalX) and GeneDoc complained about non-unique sequence names.
The easiest way around this problem is to number the sequences in the multi-sequence FASTA file. The instructions I found on the web simply appended the number to the end of each name. This works only as long as the names are not too long, because some programs do not handle full-length sequence names, so a number at the end can get lost. Seqkit version 0.10.0 also had problems with this. I therefore wanted the numbers at the beginning of the sequence names and came up with the following AWK instruction:

gawk '/^>/ { printf(">%d_%s\n", i++, substr($0, 2)); next } { print }' infile.fasta > outfile.fasta
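To see what the renaming does, here is a small sketch that runs the one-liner on a tiny test file and then checks that no duplicate header lines are left. The file names demo.fasta and demo_numbered.fasta and the gene names are made up for illustration, and I call plain awk here on the assumption that any POSIX awk is installed; gawk can be substituted directly.

```shell
# Build a small hypothetical test file with a duplicated sequence name.
printf '>geneA sample 1\nACGTACGT\n>geneA sample 1\nTTGACCAA\n' > demo.fasta

# Number the header lines; sequence lines pass through unchanged.
# The counter i is uninitialized, so the numbering starts at 0.
awk '/^>/ { printf(">%d_%s\n", i++, substr($0, 2)); next } { print }' demo.fasta > demo_numbered.fasta

# List header lines that still occur more than once
# (prints nothing when all names are unique).
grep '^>' demo_numbered.fasta | sort | uniq -d

# Show the result.
cat demo_numbered.fasta
```

The two headers come out as ">0_geneA sample 1" and ">1_geneA sample 1". If the numbering should start at 1, ++i can be used in place of i++.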

Inside a Windows batch script the percent signs of the format string have to be doubled, so the instruction looks like this:

gawk '/^>/ { printf(">%%d_%%s\n", i++, substr($0, 2)); next } { print }' %infile% > %outfile%

These instructions use GAWK and its built-in printf function. GAWK is a GNU package of its own (it is not part of the GNU coreutils) and can be found in the GnuWin32 collection or installed through Cygwin.

Friday, 22 February 2019
