perl - Large reference file (800,000 lines): store in an array or read through a filehandle?
I have a large tab-delimited reference file:
chr9 refflat exon 136333685 136335910 . + . gene_id "cacfd1"; transcript_id "nm_001242370"; exon_number "5"; exon_id "nm_001242370.5"; gene_name "cacfd1";
chrx refflat exon 51804923 51805135 . - . gene_id "maged4b"; transcript_id "nm_001242362"; exon_number "14"; exon_id "nm_001242362.1"; gene_name "maged4b";
I also have a file of coordinates to search for (1,800 lines):
chr11 62105438
chr11 85195064
chr17 33478139
chr21 9827089
I use a nested loop: each line of the coordinate file is searched against every line of the reference file.
#!/usr/bin/perl -w
use strict;

# (the code that reads the two files into @coord and @ref,
#  and declares %results, is omitted here)
foreach (@coord) {
    my @query = split /\t/, $_;
    chomp @query;                                  # clean
    foreach (@ref) {
        my @ref_line = split /\t/, $_;
        chomp @ref_line;                           # clean
        if ($query[1] >= $ref_line[3] && $query[1] <= $ref_line[4]) {
            if ($query[0] eq $ref_line[0]) {
                my @sub_ref_line = split ";", $ref_line[8];
                $results{"$query[0],$query[1]"} = $sub_ref_line[4];
                next;
            }
        }
    }
}
For the sake of speed and memory, would it be better to read the reference file through a filehandle instead of storing it in an array?
You want to read the reference file into a hash first, so it looks like this:
my %ref = (
    'chr9' => [
        'chr9 refflat exon 136333685 136335910 . + . gene_id "cacfd1"',
        # other chr9 lines
    ],
    'chrx' => [ ... ],
    ...
);
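A minimal sketch of building that hash of arrays, with two in-memory sample lines standing in for the 800,000-line file (the @sample array is illustrative; in practice you would read the lines from an open filehandle instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two sample reference lines; in the real script these come from the file.
my @sample = (
    join("\t", qw(chr9 refflat exon 136333685 136335910 . + .),
         'gene_id "cacfd1"; transcript_id "nm_001242370"; exon_number "5"; exon_id "nm_001242370.5"; gene_name "cacfd1";'),
    join("\t", qw(chrx refflat exon 51804923 51805135 . - .),
         'gene_id "maged4b"; transcript_id "nm_001242362"; exon_number "14"; exon_id "nm_001242362.1"; gene_name "maged4b";'),
);

my %ref;
for my $line (@sample) {
    my ($chr) = split /\t/, $line, 2;   # only the key field is needed here
    push @{ $ref{$chr} }, $line;        # autovivification creates the array
}

print scalar keys %ref, "\n";           # prints "2": chr9 and chrx
```

The hash costs roughly the same memory as the flat array, since each line is still stored once; it just groups the lines by chromosome.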
Then in the inner loop you only iterate over the reference lines whose first field matches:
foreach ( @{ $ref{ $query[0] } } ) {
You use little extra memory, and if the average chromosome appears 20,000 times, you enter the inner loop 1,800 × 20,000 = 36 million times instead of 1,800 × 800,000 = 1.44 billion times.
To answer the actual question: reading the file in the inner loop instead of keeping the reference data in memory would take less memory but be far slower, since you would re-read the 800,000-line file once per coordinate.
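Putting it together, a sketch of the whole lookup under these assumptions: %ref has been built as above (the "demo" gene line and the @coord values here are made up for illustration), and one match per coordinate is enough, so `last` replaces the original `next` after a hit:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative %ref: chromosome => array ref of its reference lines.
my %ref = (
    chr11 => [ join("\t", qw(chr11 refflat exon 62105000 62106000 . + .),
               'gene_id "demo"; transcript_id "nm_000000"; exon_number "1"; exon_id "nm_000000.1"; gene_name "demo";') ],
);

my @coord = ("chr11\t62105438", "chr17\t33478139");

my %results;
for my $c (@coord) {
    chomp $c;
    my ($chr, $pos) = split /\t/, $c;
    next unless exists $ref{$chr};              # skip unknown chromosomes
    for my $line (@{ $ref{$chr} }) {
        my @f = split /\t/, $line;
        if ($pos >= $f[3] && $pos <= $f[4]) {
            my @attr = split /;/, $f[8];
            $results{"$chr,$pos"} = $attr[4];   # the gene_name attribute
            last;                               # one hit per coordinate
        }
    }
}

print $results{"chr11,62105438"}, "\n";         # prints the stored attribute
```

The `exists` guard matters under `use strict`: dereferencing a missing hash entry as an array would die, and it also skips coordinates on chromosomes absent from the reference, as chr17 is here.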