perl - Large reference file (800,000 lines). Store in array or use filehandle -


I have a large tab-delimited file of reference information.

    chr9    refflat exon    136333685   136335910   .   +   .   gene_id "cacfd1"; transcript_id "nm_001242370"; exon_number "5"; exon_id "nm_001242370.5"; gene_name "cacfd1";
    chrx    refflat exon    51804923    51805135    .   -   .   gene_id "maged4b"; transcript_id "nm_001242362"; exon_number "14"; exon_id "nm_001242362.1"; gene_name "maged4b";

I also have a file of coordinates to search for (1800 lines):

    chr11   62105438
    chr11   85195064
    chr17   33478139
    chr21   9827089

I have a nested loop in which each line of the coordinate file is searched against the reference file.

    #!/usr/bin/perl -w
    use strict;

    # @coord and @ref hold the lines of the coordinate and reference files
    my %results;
    foreach (@coord) {
        my @query = split(/\t/, $_);
        chomp @query; #clean
        foreach (@ref) {
            my @ref_line = split(/\t/, $_);
            chomp @ref_line; #clean
            if (($query[1] >= $ref_line[3]) && ($query[1] <= $ref_line[4])) {
                if ($query[0] eq $ref_line[0]) {
                    my @sub_ref_line = split(";", $ref_line[8]);
                    $results{"$query[0],$query[1]"} = "$sub_ref_line[4]";
                    next; # no-op here; `last` would stop at the first match
                }
            }
        }
    }

For the sake of speed and memory, would it be better for me to read the reference file through a filehandle instead of storing it in an array?

You want to read the reference file into a hash first, so that it looks like this:

    my %ref = (
        'chr9' => [
            'chr9    refflat exon    136333685   136335910   .   +   .   gene_id "cacfd1"',
            # other lines for chr9
        ],
        'chrx' => [
            ...
        ],
        ...
    );
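One way to build such a hash is sketched below. This is a minimal sketch, not the only way to do it: the helper name `load_reference` is hypothetical, and the demo reads from an in-memory filehandle holding two sample lines, where a real script would open the reference file itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: group reference lines by their first (chromosome) field.
sub load_reference {
    my ($fh) = @_;
    my %ref;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;           # skip blank lines
        my ($chr) = split /\t/, $line, 2;   # only the first field is needed here
        push @{ $ref{$chr} }, $line;        # autovivifies the array for a new key
    }
    return \%ref;
}

# Demo data: two sample lines joined with real tabs (assumed layout: 9 columns).
my $sample = join "\n",
    join("\t", 'chr9', 'refflat', 'exon', 136333685, 136335910, '.', '+', '.',
         'gene_id "cacfd1"; gene_name "cacfd1";'),
    join("\t", 'chrx', 'refflat', 'exon', 51804923, 51805135, '.', '-', '.',
         'gene_id "maged4b"; gene_name "maged4b";'),
    '';

# In-memory filehandle; replace with open(my $fh, '<', $path) for a real file.
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $ref = load_reference($fh);
close $fh;

printf "%s: %d line(s)\n", $_, scalar @{ $ref->{$_} } for sort keys %$ref;
```

The reference file is read exactly once, and each chromosome's lines end up in their own array.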

Then, in the inner loop, you can loop over only the reference file lines that have a matching first field:

    foreach ( @{ $ref{ $query[0] } } ) { 

You will use more memory, but if the average chr# appears 20,000 times, you will enter the inner loop 36 million times (1800 × 20,000) instead of 1.44 billion times (1800 × 800,000).
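Putting the pieces together, the whole lookup might look like the sketch below. The helper name `find_gene` is hypothetical, the query position 136334000 is made up for illustration, and the sketch pulls out `gene_name` with a regex rather than taking the fifth `;`-separated field as the question's code does. It also returns at the first matching exon.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the gene_name of the first exon on $chr
# that contains $pos, or nothing if there is no match.
sub find_gene {
    my ($ref, $chr, $pos) = @_;
    foreach my $line (@{ $ref->{$chr} || [] }) {    # only lines for this chromosome
        my @f = split /\t/, $line;
        next unless $pos >= $f[3] && $pos <= $f[4];
        return $1 if $f[8] =~ /gene_name "([^"]*)"/;
    }
    return;
}

# Demo reference hash in the shape shown above (one line per chromosome).
my %ref = (
    chr9 => [
        join("\t", 'chr9', 'refflat', 'exon', 136333685, 136335910, '.', '+', '.',
             'gene_id "cacfd1"; transcript_id "nm_001242370"; exon_number "5"; exon_id "nm_001242370.5"; gene_name "cacfd1";'),
    ],
    chrx => [
        join("\t", 'chrx', 'refflat', 'exon', 51804923, 51805135, '.', '-', '.',
             'gene_id "maged4b"; transcript_id "nm_001242362"; exon_number "14"; exon_id "nm_001242362.1"; gene_name "maged4b";'),
    ],
);

# Demo queries: the first falls inside the chr9 exon, the second has no chr11 data.
my @coord = ("chr9\t136334000", "chr11\t62105438");

my %results;
foreach my $query (@coord) {
    my ($chr, $pos) = split /\t/, $query;
    my $gene = find_gene(\%ref, $chr, $pos);
    $results{"$chr,$pos"} = $gene if defined $gene;
}

print "$_ => $results{$_}\n" for sort keys %results;
```

If the per-line `split` still dominates the run time, splitting each reference line once at load time and storing the start and end fields would be a natural next step.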

To answer your actual question: reading the file in the inner loop, instead of keeping the reference data in memory, will take less memory but will be slower.
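For comparison, the filehandle variant could be sketched as follows. The helper name `scan_reference` is hypothetical, and the demo again uses an in-memory filehandle in place of the real reference file. Only one line is held in memory at a time, but the file is rewound with `seek` and rescanned for every query, which is where the time goes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rescan the reference from the top for a single query
# and return the attribute column of the first containing exon.
sub scan_reference {
    my ($fh, $chr, $pos) = @_;
    seek $fh, 0, 0 or die "Cannot rewind reference: $!";   # back to the start
    while (my $line = <$fh>) {
        chomp $line;
        my @f = split /\t/, $line;
        next unless @f >= 9 && $f[0] eq $chr;
        return $f[8] if $pos >= $f[3] && $pos <= $f[4];
    }
    return;
}

# Demo data standing in for the reference file (assumed layout: 9 columns).
my $sample = join "\n",
    join("\t", 'chr9', 'refflat', 'exon', 136333685, 136335910, '.', '+', '.',
         'gene_id "cacfd1"; gene_name "cacfd1";'),
    join("\t", 'chrx', 'refflat', 'exon', 51804923, 51805135, '.', '-', '.',
         'gene_id "maged4b"; gene_name "maged4b";'),
    '';

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";

# Two illustrative queries; each call rereads the data from the beginning.
my $hit_x = scan_reference($fh, 'chrx', 51805000);
my $hit_9 = scan_reference($fh, 'chr9', 136334000);
close $fh;

print "chrx hit: $hit_x\n" if defined $hit_x;
print "chr9 hit: $hit_9\n" if defined $hit_9;
```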

