Segmental Duplication DB

This page contains data on recent segmental duplications in the human genome. It focuses on genomic duplications >1 kb in length with >90% sequence identity.

Updated Analysis

WSSD updated on the April 2002 sequence freeze.

WGAC analysis for genome build30 (June 2002) sequence freeze.

Interactive Data

UW Duplication Browser
This browser provides an integrated view of the duplication data from two independent in silico strategies: the whole genome assembly comparison (WGAC), based on the August 2001 UCSC assembly, and the whole genome shotgun sequence detection (WSSD) of duplications, mapped onto that assembly. Both are displayed as additional tracks in the UCSC Genome Browser. For WSSD, the depth of coverage and average percent identity are shown; for WGAC, the alignment length and percent sequence identity are shown.


Data Downloads

(See also April 2002 Updated WSSD)

The Whole Genome Shotgun Sequence Detection (WSSD) Database
The database consists of 8,595 regions from 2,972 clones, representing 130.4 Mb of segmental duplications. Regions were extracted where a significant increase in whole genome shotgun (WGS) read depth was observed. The data have been filtered for duplications and recently transposed common repeats such as L1P and HERV elements. Because of the complex nature and interrelationships of the duplications, we did not attempt to create consensus sequences.

Fasta files containing the duplicated sequence detected by WSSD
         (Unix gzip format)  (PC zip format)

Mapping of sequence coordinates to the UCSC Public Assembly (August 2001)
    Redundant:           (Unix gzip format)  (PC zip format)
    Non-redundant:     (Unix gzip format)  (PC zip format)
    (Unique regions >90% and >500 bp were mapped.)
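
The read-depth idea behind WSSD can be illustrated with a minimal sketch. The window size, the three-standard-deviation cutoff, and the function names below are assumptions chosen for illustration; this page does not state the actual thresholds used. The sketch flags windows whose read depth is well above a baseline taken from presumed unique sequence, then merges adjacent flagged windows into regions.

```python
from statistics import mean, pstdev


def flag_high_depth_windows(depths, baseline, n_sd=3.0):
    """Return indices of windows whose depth exceeds mean + n_sd * sd of the baseline.

    depths   -- read depth per fixed-size window along a clone
    baseline -- read depths from windows assumed to be unique (non-duplicated)
    n_sd     -- assumed cutoff in standard deviations (hypothetical value)
    """
    cutoff = mean(baseline) + n_sd * pstdev(baseline)
    return [i for i, depth in enumerate(depths) if depth > cutoff]


def merge_runs(indices, window_size):
    """Merge consecutive window indices into half-open (start, end) base-pair regions."""
    regions = []
    for i in indices:
        start, end = i * window_size, (i + 1) * window_size
        if regions and regions[-1][1] == start:
            # This window directly follows the previous region, so extend it.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


baseline = [10, 12, 11, 9, 10, 11, 10, 9]
depths = [10, 11, 25, 27, 24, 10, 30, 9]
flagged = flag_high_depth_windows(depths, baseline)
regions = merge_runs(flagged, window_size=5000)
```

With these toy numbers, windows 2 to 4 and window 6 stand out from the baseline and are reported as two separate regions.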


The Whole Genome Assembly Comparison (WGAC) for UCSC August 2001
An all-by-all comparison of duplications (>90% identity, >500 bp in length) present in the assembly, using a previously described method (Bailey et al., 2001).

Unfiltered Set (covering ~15% of assembly)
    Pairwise Alignments  (Unix gzip format)  (PC zip format)

Filtered Set (covering 5.2% of assembly) 
(Alignments with >98% identity and insufficient WSSD evidence were removed.)
Pairwise Alignments    (Unix gzip format)  (PC zip format)
    Fasta Sequence of underlying Assembly  
        (Unix gzip format)  (PC zip format)
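
As a rough sketch of how the two sets relate, the filtering described above could be expressed as follows. The record fields (`identity`, `length`) and the `has_wssd_support` predicate are hypothetical names; this page does not define what counts as sufficient WSSD evidence, so that decision is left to a caller-supplied function.

```python
def passes_wgac_thresholds(aln, min_identity=0.90, min_length=500):
    """True if an alignment meets the stated WGAC cutoffs (>90% identity, >500 bp)."""
    return aln["identity"] > min_identity and aln["length"] > min_length


def filter_by_wssd(alignments, has_wssd_support, high_identity=0.98):
    """Drop alignments above high_identity that lack WSSD support.

    has_wssd_support -- hypothetical predicate taking an alignment record;
                        the real evidence criterion is not given on this page.
    """
    kept = []
    for aln in alignments:
        if aln["identity"] > high_identity and not has_wssd_support(aln):
            continue  # highly similar but unsupported: removed from the filtered set
        kept.append(aln)
    return kept


alignments = [
    {"name": "a", "identity": 0.995, "length": 20000, "wssd": False},
    {"name": "b", "identity": 0.992, "length": 15000, "wssd": True},
    {"name": "c", "identity": 0.930, "length": 1200, "wssd": False},
    {"name": "d", "identity": 0.890, "length": 3000, "wssd": False},
]
unfiltered = [a for a in alignments if passes_wgac_thresholds(a)]
filtered = filter_by_wssd(unfiltered, lambda a: a["wssd"])
```

Alignment "c" is kept in this sketch even without WSSD support, because the removal rule applies only to alignments above 98% identity.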

Putative Coordinates for Genomic Disorders (from Figure 3)
Regions of the human genome (>50 kb and <10 Mb in length) that are flanked by duplications validated by both methods (>95% sequence identity and >10 kb in length).
    Coordinates within the August assembly.  
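
The selection criteria above can be sketched in code, under several assumptions: the input holds only intrachromosomal duplication pairs already validated by both WSSD and WGAC, coordinates are half-open, and the field names are hypothetical. It is an illustration of the stated thresholds, not the procedure used to produce the coordinate file.

```python
def candidate_regions(pairs, min_identity=0.95, min_dup_len=10_000,
                      min_span=50_000, max_span=10_000_000):
    """Return (chrom, start, end) for regions flanked by a qualifying duplication pair."""
    regions = []
    for p in pairs:
        if p["chrom1"] != p["chrom2"]:
            continue  # flanking copies must lie on the same chromosome
        if p["identity"] <= min_identity or p["length"] <= min_dup_len:
            continue
        # Order the two copies by position so the span runs left to right.
        (s1, e1), (s2, e2) = sorted([(p["start1"], p["end1"]),
                                     (p["start2"], p["end2"])])
        span = s2 - e1  # sequence between the two duplicated copies
        if min_span < span < max_span:
            regions.append((p["chrom1"], e1, s2))
    return regions


pairs = [
    # Qualifies: 96% identity, 20 kb copies, 1 Mb apart.
    {"chrom1": "chr1", "start1": 1_000_000, "end1": 1_020_000,
     "chrom2": "chr1", "start2": 2_020_000, "end2": 2_040_000,
     "identity": 0.96, "length": 20_000},
    # Rejected: copies only 30 kb apart.
    {"chrom1": "chr2", "start1": 500_000, "end1": 520_000,
     "chrom2": "chr2", "start2": 550_000, "end2": 570_000,
     "identity": 0.97, "length": 20_000},
    # Rejected: interchromosomal pair.
    {"chrom1": "chr3", "start1": 100_000, "end1": 120_000,
     "chrom2": "chr4", "start2": 900_000, "end2": 920_000,
     "identity": 0.99, "length": 20_000},
]
hits = candidate_regions(pairs)
```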

Gene Duplication Analysis
A complete list of all human genes within RefSeq (n=13,351) and their duplication status. The number of bases and the number of exons categorized as duplicated (within duplicated space) are given.
    Gene duplication status (table format).  
         RefSeq Gene List (Unix gzip format) (PC zip format)
         Duplicated Space (Unix gzip format) (PC zip format)
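
One way to picture the per-gene tally is the sketch below, which counts exonic bases and exons that overlap duplicated space. It assumes half-open coordinates on a single chromosome and duplicated intervals that have already been merged so they do not overlap one another. Whether the published table counts exonic bases or all gene bases is not stated here, so treat the choice as an assumption.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def gene_duplication_status(exons, dup_intervals):
    """Return (duplicated_bases, duplicated_exons) for one gene.

    exons         -- list of (start, end) exon intervals
    dup_intervals -- merged, non-overlapping (start, end) duplicated intervals
    """
    dup_bases = 0
    dup_exons = 0
    for ex_start, ex_end in exons:
        covered = sum(overlap(ex_start, ex_end, s, e) for s, e in dup_intervals)
        dup_bases += covered
        if covered > 0:
            dup_exons += 1  # an exon counts if any of its bases are duplicated
    return dup_bases, dup_exons


exons = [(100, 200), (300, 400), (500, 600)]
duplicated_space = [(150, 350)]
status = gene_duplication_status(exons, duplicated_space)
```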

   WSSD/WGAC Header Descriptions



Initial Read Depth Across Celera Multiple Alignments of Public Clones
Access to graphical representations of all 39,298 public clones screened for segmental duplications. (Includes April 2002 Update)

Second Pass with Consensus Across Putatively Duplicated Clones
Access to graphical representations of clones with putative duplications for which consensus sequences were generated. This includes average % sequence identity calculated over the consensus. (Includes April 2002 Update)

Chromosomal Views of Duplications with Gaps Emphasized (August 2001)
Views of chromosomes emphasizing gaps (green). The WSSD duplication regions (top track) are black. The assembly (WGAC) duplications (red for interchromosomal, blue for intrachromosomal) are split into >98% similar (top) and <98% similar (bottom).

WSSD Duplications not detected by WGAC (August 2001)
These are potentially under-represented regions of the genome requiring reassembly or further sequencing.
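
Conceptually, this set is an interval difference: WSSD regions minus whatever WGAC also covers. A minimal sketch follows, assuming half-open intervals on one chromosome; the function name is hypothetical and the actual comparison used for this dataset may differ.

```python
def subtract_intervals(a, b):
    """Return the parts of intervals in a that are not covered by any interval in b."""
    b = sorted(b)
    result = []
    for start, end in sorted(a):
        cur = start  # leftmost position in this interval not yet accounted for
        for b_start, b_end in b:
            if b_end <= cur:
                continue  # lies entirely to the left of the remaining piece
            if b_start >= end:
                break  # sorted input: nothing further can overlap
            if b_start > cur:
                result.append((cur, b_start))
            cur = max(cur, b_end)
            if cur >= end:
                break
        if cur < end:
            result.append((cur, end))
    return result


wssd = [(0, 100), (200, 300)]
wgac = [(50, 120), (250, 260)]
wssd_only = subtract_intervals(wssd, wgac)
```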