[Inparanoid] InParanoid v4.1 does not work on more than 10000 sequences

Jean-Nicolas Audet jean.nicolas.audet at gmail.com
Thu Mar 27 13:41:29 CET 2014


Hello,

I've been trying to run InParanoid on de novo transcriptomes of two
non-model species (birds) that I assembled with Trinity. They were translated
to protein sequences using TransDecoder. The resulting FASTA file I'm trying
to use in InParanoid looks like this (~30 000 sequences):

>comp100291_c0_seq1
LPKKILLPIQQVLGHLLLALSYRGKVMQVKALKSKHEHNGPETLDAFLSSKLVVVKQPRE
QAGFPLSIVFIPGEGRQERFLLHGEYNQSFCKEPVMELPRQ
>comp102162_c0_seq1
PNMTLHFLKSSPGSWRLSGLVLIPYVTETISGSCETLTRLQMPAHIQQSRWKAKHGPRIL
LLGLLQNLRSLFPLKVLPPGANSQLKRNCSFTSVCLIGTFYVESS
>comp102206_c0_seq1
CQEQKWQKGNREEKGWAGVTVWGAYFPYLLIRCPNHQTSTPLSIHSQQHFMLCIIICPFS
WLKPPVKTTQMFKGFFFKSGLKKFLALFLISWAAFATDRPLLGKQQSR

I tried the example FASTA files supplied with the program (called SC and EC)
and they work, but when I use my own files, the run gets stuck at the first
step and creates no files (and shows no disk usage) even after days. Here is
what I get with my FASTA files:

Loading module bio/ncbi-blast-2.2.22.
Formatting BLAST databases
Done formatting
Starting BLAST searches...

Starting first BLAST pass for bf - bf on [blastall] WARNING: the -C 3 argument is currently experimental

It then stays like this indefinitely (I waited up to 6 days with 24 CPUs and
256 GB of memory).

I also tried supplying BLAST results (inter-sample) that I generated myself
and parsed with the parser supplied with InParanoid, but the run still hangs
indefinitely at the following step, again without generating any file:

Done BLAST searches. Starting ortholog detection...

I tried with and without bootstrapping, with and without multithreading (the
-a16 option), and, as I said, with and without supplied BLAST results. I also
cleaned my FASTA files of any unusual characters (removed annotations, all
' * ', spaces, empty lines and dots). Now I'm running out of ideas... I'm
using a Unix cluster. I ran these jobs using 4 to 24 CPUs with 8 to 256 GB of
memory.
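
A minimal sketch of the kind of FASTA cleanup described above, in Python. It
is not the script actually used for these files; the function name
clean_fasta is hypothetical, and it assumes each header starts with '>'
immediately followed by the identifier, with any annotation after the first
whitespace.

```python
def clean_fasta(lines):
    """Yield cleaned FASTA lines from an iterable of raw lines.

    Hypothetical helper: drops empty lines, trims headers to their first
    whitespace-delimited token, and removes '*', '.', spaces and tabs from
    sequence lines.
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop empty lines
        if line.startswith(">"):
            # keep only the identifier, dropping any trailing annotation
            yield line.split()[0]
        else:
            # remove stop-codon asterisks, dots and internal whitespace
            cleaned = "".join(ch for ch in line if ch not in "*. \t")
            if cleaned:
                yield cleaned


if __name__ == "__main__":
    import sys

    # Usage (assumed): python clean_fasta.py < raw.fasta > clean.fasta
    for out_line in clean_fasta(sys.stdin):
        print(out_line)
```

Reading from standard input and writing to standard output keeps the original
file untouched, so the cleaned copy can be compared against it before use.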

Finally, since it was working with SC and EC, I tried a small subset of
my transcriptomes (a few thousand sequences) and it worked. So it seems
to me that the problem could be that InParanoid cannot handle more than, say,
10000 sequences. I could split my transcriptomes into several smaller files,
each covering a specified range of sequence lengths, but then orthologs that
do not have exactly the same length could be missed if they end up in two
different split files.
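
The splitting idea above can be sketched as follows, in Python. This only
illustrates the approach and its drawback; the names read_fasta and
split_by_length and the bin_width parameter are hypothetical and are not part
of InParanoid.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, bin_width):
    """Group records into bins keyed by len(sequence) // bin_width."""
    bins = defaultdict(list)
    for header, seq in records:
        bins[len(seq) // bin_width].append((header, seq))
    return dict(bins)


if __name__ == "__main__":
    example = [">a", "MKV", "LL", ">b", "MK", ">c", "MKVLLA"]
    for key, recs in sorted(split_by_length(read_fasta(example), 5).items()):
        print(key, [h for h, _ in recs])
```

With a bin_width of 100, two orthologs of 99 and 101 residues would land in
bins 0 and 1 respectively and would never be compared, which is exactly the
concern raised above.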

Thanks in advance for your help,

Jean-Nicolas
