ZebraFinch WSSD On ZebraFinch assembly

Introduction

A WGS library of 11,683,735 zebrafinch reads was used for Whole Genome Shotgun Sequence Detection (WSSD) to detect duplications in the zebrafinch assembly.

Repeatmasking, Megablast & Quality Rescore

Used UCSC taeGut1 database rmsk tables as input to mask the genome for repeats with divergence <=10%
(UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata')

megablast-2.2.16 -d zebrafinch_0209 -D 3 -p 93 -F m -U T -s 220 -R T

quality rescore: eeek_blast_quality_rescorer353a_gg.pl -in STDIN -minquality 30 -alignments 12:13 -qual2db zebrafinch_0209.qual -inputtype megablastD3 -pattern2 \"lcl\|(\S+)\" -pattern1 \"([A-Z0-9]+\.\d+)\" -noalignments -globalfast

Reads Pick & Sliding Window

Recruit reads if:
high-quality identity >94%, alignment length >300bp, unique bases >200bp, high-quality bases >200bp, aligned bases/total length >40%. 13,257,914 reads/alignments were collected.
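
The recruitment criteria above can be written as a single predicate. The following is a minimal sketch, not the pipeline's actual code: the Alignment record and its field names are hypothetical, and "total length" is assumed to mean the full read length.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    # All field names are hypothetical; they mirror the criteria listed above.
    hq_identity_pct: float  # percent identity over high-quality bases
    aligned_len: int        # alignment length in bp
    unique_bases: int       # aligned bases outside masked repeats
    hq_bases: int           # aligned bases passing the quality cutoff
    read_len: int           # total read length (assumed meaning of "total length")


def recruit(a: Alignment) -> bool:
    """Return True if the alignment passes every recruitment cutoff."""
    return (
        a.hq_identity_pct > 94.0
        and a.aligned_len > 300
        and a.unique_bases > 200
        and a.hq_bases > 200
        and a.aligned_len / a.read_len > 0.40
    )
```

All cutoffs are strict inequalities, so an alignment with exactly 94% identity is rejected.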
Use 5K non-gap, repeatmasking-free bases for sliding window.
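
One way to build windows that each contain a fixed number of usable bases is sketched below. It assumes a soft-masked sequence in which gaps are N and masked repeats are lowercase, so only uppercase A, C, G and T count toward the 5K. The function name and the step parameter are illustrative; the notes do not state how far each window slides.

```python
def unmasked_windows(seq, size=5000, step=5000):
    """Return (start, end) half-open intervals that each hold `size` usable bases.

    Assumption: usable bases are uppercase A/C/G/T; N (gap) and lowercase
    (repeat-masked) positions are skipped, so windows vary in genomic span.
    """
    usable = [i for i, base in enumerate(seq) if base in "ACGT"]
    windows = []
    for k in range(0, len(usable) - size + 1, step):
        start = usable[k]
        end = usable[k + size - 1] + 1
        windows.append((start, end))
    return windows
```

With a step smaller than the size the windows overlap, which gives a true sliding window; with step equal to size they tile the usable bases end to end.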

Thresholds

Megablast 42 supposedly unique zebrafinch BACs (BACs used), repeat-masked with the same parameters as the zebrafinch assembly, against each WGS library, quality rescore all the alignments and pick the high quality bases. Then calculate the statistics of depth coverage and divergence over 5Kb windows that contain no gaps and no common repeats with divergence <=10%:


phred >=30
Total windows: 5537
Minimum: 2
Maximum: 136
Average: 59.409969
Median: 61
Standard Deviation: 17.026657
Average + 3 x Standard Deviation: 110.489940
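
The last figure in the table is the average plus three standard deviations, which can be checked directly from the reported values. In this sketch the function names are illustrative, and the use of the sample standard deviation is an assumption, since the notes do not say which estimator was used.

```python
import statistics


def threshold_from_summary(mean, stddev, k=3.0):
    """Depth cutoff computed from summary statistics: mean + k * stddev."""
    return mean + k * stddev


def threshold_from_depths(depths, k=3.0):
    """Same cutoff computed from raw per-window depths.

    Assumption: sample standard deviation (statistics.stdev).
    """
    return threshold_from_summary(statistics.mean(depths), statistics.stdev(depths), k)


# Values reported above for the unique BAC windows.
cutoff = threshold_from_summary(59.409969, 17.026657)
```

Here 59.409969 + 3 x 17.026657 = 110.489940, matching the table.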

WSSD Picking


By Depth Coverage:
Pick a region if at least 6 out of 7 5Kb windows have depth coverage >110.489940.
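
The 6-of-7 rule can be sketched as a scan over per-window depths. This is an illustration under stated assumptions rather than the pipeline's own code: the 7 windows are taken to be consecutive, every window of a qualifying run is flagged (including the one that may fall below the cutoff), and overlapping runs are merged. The function name is hypothetical.

```python
def pick_wssd_regions(depths, threshold=110.489940, need=6, span=7):
    """Return half-open (first, last+1) window-index intervals called duplicated.

    A run of `span` consecutive windows qualifies when at least `need`
    of them have depth strictly above `threshold`.
    """
    flagged = [False] * len(depths)
    for i in range(len(depths) - span + 1):
        above = sum(d > threshold for d in depths[i:i + span])
        if above >= need:
            for j in range(i, i + span):
                flagged[j] = True

    # Merge adjacent flagged windows into intervals.
    regions = []
    start = None
    for i, is_flagged in enumerate(flagged):
        if is_flagged and start is None:
            start = i
        elif not is_flagged and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(flagged)))
    return regions
```

Allowing one window in seven to fall below the cutoff keeps a single low-coverage window from splitting an otherwise continuous call.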