Placement Of Reads
For each pair of end sequences
- an in silico insert size was calculated as the distance between (and including) end sequence positions on the human reference genome.
- To quantify the distribution of the fosmid library insert sizes, we calculated a mean of 176.820 kb +/- 33.5 kb (standard deviation) from the full set of 66206 fosmids (see Size Distribution).
- Based on this distribution, we chose a concordant insert size range of 76.5-277 kb (within 3 standard deviations of the mean), so that clones with insert sizes outside this range are unlikely to be chance occurrences and more likely to reflect true rearrangements.
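The insert size calculation and the concordance check can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the size is assumed to span both end-sequence alignments. Note that mean +/- 3 standard deviations with the values above gives roughly 76.3-277.3 kb, so the sketch uses the stated 76.5-277 kb bounds as defaults.

```python
# Hypothetical helper names; coordinates assumed 1-based and inclusive.

def insert_size(start_a, end_a, start_b, end_b):
    """In silico insert size in bp: the span covering both end-sequence
    alignments, counting the outermost positions themselves."""
    lowest = min(start_a, end_a, start_b, end_b)
    highest = max(start_a, end_a, start_b, end_b)
    return highest - lowest + 1


def concordant_range(mean_kb, sd_kb, n_sd=3.0):
    """Interval mean +/- n_sd standard deviations, in kb."""
    return mean_kb - n_sd * sd_kb, mean_kb + n_sd * sd_kb


def is_size_concordant(size_bp, low_kb=76.5, high_kb=277.0):
    """True if the insert size lies within the concordant range
    (defaults are the bounds stated in the text)."""
    return low_kb * 1000 <= size_bp <= high_kb * 1000


if __name__ == "__main__":
    size = insert_size(1000, 1499, 180000, 180499)
    print(size, is_size_concordant(size))
    print(concordant_range(176.82, 33.5))
```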
For each fosmid
All paired-end alignment combinations were scored using a 13-point in-house ordinal scale for placement:
- +1 per end for longest alignment (2 points)
- +1 per end for most identical (2 points)
- +1 per end if >30bp >30 (2 points)
- +2 per end with allelic levels of identity (if >90%)
- +2 per pair for proper size (76.5kb-277kb)
- +1 per pair for orientation
In addition to identifying the longest and most identical alignments of allelic proportions, the placement score strongly favored concordant positions over discordant ones. This helped to avoid false-positive rearrangements caused by recent segmental duplications or gene conversion events between non-allelic sites within the genome.
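The scoring scale can be illustrated with a short sketch. It is an assumption-laden reading of the list above rather than the in-house implementation: the class and function names are hypothetical, the third per-end criterion is represented only as a boolean flag because its exact definition is not clear from the text, and the identity cutoff for the allelic bonus is exposed as a parameter that defaults to 90% as an assumption.

```python
from dataclasses import dataclass


@dataclass
class EndCall:
    """Per-end properties of one candidate alignment (hypothetical structure)."""
    is_longest: bool          # longest alignment among candidates for this end
    is_most_identical: bool   # highest identity among candidates for this end
    passes_third_check: bool  # third +1 criterion; definition assumed external
    identity_pct: float       # percent identity of the alignment


def end_points(end, allelic_identity_pct=90.0):
    """Points contributed by a single end (maximum 5)."""
    points = int(end.is_longest) + int(end.is_most_identical)
    points += int(end.passes_third_check)
    if end.identity_pct > allelic_identity_pct:  # assumed cutoff
        points += 2
    return points


def placement_score(end_a, end_b, insert_size_bp, properly_oriented,
                    size_range_bp=(76_500, 277_000)):
    """Score one paired-end alignment combination (maximum 13)."""
    score = end_points(end_a) + end_points(end_b)
    if size_range_bp[0] <= insert_size_bp <= size_range_bp[1]:
        score += 2  # proper size, awarded once per pair
    if properly_oriented:
        score += 1  # proper orientation, awarded once per pair
    return score


if __name__ == "__main__":
    best = EndCall(True, True, True, 99.5)
    print(placement_score(best, best, 180_000, True))
    print(placement_score(best, best, 500_000, False))
```

Under this reading, a pair with two ideal ends still loses 3 of 13 points when it is placed at a size-discordant, mis-oriented position, which is how the scale favors concordant placements.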
To add stringency to the detection of putative rearrangements, we required discordant alignments (insert size <76.5 kb or >277 kb and/or mis-oriented ends) to have
- >=90% identity
- >=400 bp in length
- >=150 bp of unique sequence (RepeatMasker-detected genomic elements with a sequence divergence less than 2% from consensus)
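The discordant filter above can be expressed as a simple predicate. In this sketch the function names are hypothetical, the three thresholds are assumed to apply to each end alignment of a discordant pair, and the count of unique (non-repeat) bases is assumed to have been computed beforehand from RepeatMasker output.

```python
def is_discordant(insert_size_bp, properly_oriented,
                  low_bp=76_500, high_bp=277_000):
    """A pair is discordant if its insert size is out of range
    and/or its ends are mis-oriented."""
    size_ok = low_bp <= insert_size_bp <= high_bp
    return (not size_ok) or (not properly_oriented)


def end_passes_filter(identity_pct, aligned_bp, unique_bp):
    """Stringency thresholds from the text, applied to one end alignment."""
    return identity_pct >= 90.0 and aligned_bp >= 400 and unique_bp >= 150


def is_putative_rearrangement(insert_size_bp, properly_oriented, ends):
    """ends: iterable of (identity_pct, aligned_bp, unique_bp) tuples,
    one per end alignment (assumed per-end application of the filter)."""
    if not is_discordant(insert_size_bp, properly_oriented):
        return False
    return all(end_passes_filter(*end) for end in ends)


if __name__ == "__main__":
    ends = [(99.1, 620, 300), (98.7, 540, 210)]
    print(is_putative_rearrangement(350_000, True, ends))
```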