Main

We have constructed a map of contiguous DNA fragments cloned in Escherichia coli and covering more than 99% of human chromosome 14. Unlike the ‘map first and sequence second’ approach used for previous chromosome sequencing efforts3,4, the construction of this map was mainly based on a sequence tag connector (STC) strategy1,2 in which the map progresses in parallel with the sequencing project. In this iterative approach, fully sequenced bacterial artificial chromosomes (BACs) are searched against a database of clone end sequences to identify minimally overlapping clones and select the next BACs to enter the sequencing pipeline. We combined this ‘map as you go’ strategy with a dense high-resolution radiation hybrid map. This conjunction conferred a high degree of flexibility on the project: it allowed us to increase the number of relatively evenly distributed seed points and to tailor our efforts in specific chromosomal regions when needed. The fingerprint-based mapping strategy reported by McPherson5 also successfully faced challenging schedules and integrated alternative sources of map and sequence data; but we feel that the STC-based approach required less manual editing and resulted in a more accurate and complete set of overlapping clones.

The establishment of ordered sets of clones (contigs) relied on a public database of BAC end sequences2 (ftp://ftp.tigr.org/pub/data/h_sapiens/bac_end_sequences/) augmented with end sequences generated in house. Auxiliary mapping resources included a high-resolution chromosome 14 radiation hybrid map and a chromosome 14 enriched library of about 10,000 BACs.

Initially, we chose around 50 seed markers, ordered at high odds and distributed as evenly as possible on the chromosome, using radiation hybrid data from the Genebridge4 panel6. We used probes corresponding to the seeds to isolate clones by hybridization against high-density filters of the chromosome 14 BAC library. To cope with the increasing sequencing throughput, the number of seeds was subsequently tripled by choosing additional BACs that could be mapped unambiguously. We assessed the identity and integrity of selected clones by analysing their restriction fingerprints using the FPC software7. Walking could then proceed iteratively and bidirectionally from each fully sequenced seed clone by querying the BAC end sequence database. The walking process entailed systematic checks consisting of re-sequencing of candidate clone ends as well as comparison of their fingerprints.
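
A single walking step can be sketched as follows. This is a minimal illustration and not the pipeline used in the project: the `EndHit` record, the `pick_extension` function, the clone names and the `min_overlap` threshold are all hypothetical. It assumes that each BAC end-sequence match against the sequenced seed clone has already been reduced to a coordinate and an orientation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class EndHit:
    clone: str          # hypothetical BAC name
    position: int       # coordinate in the seed clone where the candidate insert ends
    points_right: bool  # True if the candidate insert extends toward higher coordinates


def pick_extension(
    hits: Iterable[EndHit],
    seed_length: int,
    direction: str,
    min_overlap: int = 2000,  # assumed safety margin, not a value from the text
) -> Optional[Tuple[EndHit, int]]:
    """Return the hit with the smallest acceptable overlap and that overlap, or None."""
    best: Optional[Tuple[EndHit, int]] = None
    for hit in hits:
        if direction == "right" and hit.points_right:
            # Approximate overlap: from the candidate's end to the right end of the seed.
            overlap = seed_length - hit.position
        elif direction == "left" and not hit.points_right:
            # Approximate overlap: from the left end of the seed to the candidate's end.
            overlap = hit.position
        else:
            continue  # the clone extends in the other direction
        if overlap < min_overlap:
            continue  # too little shared sequence to confirm the join
        if best is None or overlap < best[1]:
            best = (hit, overlap)
    return best


hits = [
    EndHit("bacA", 150_000, True),
    EndHit("bacB", 120_000, True),
    EndHit("bacC", 30_000, False),
    EndHit("bacD", 179_500, True),
]
print(pick_extension(hits, 180_000, "right"))
print(pick_extension(hits, 180_000, "left"))
```

In practice a candidate chosen this way would still be subject to the checks described above (re-sequencing of its ends and comparison of its fingerprint) before entering the sequencing pipeline.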

We monitored the progress of chromosome coverage on a radiation hybrid map built with the TNG high-resolution panel (http://www-shgc.stanford.edu/Mapping/rh/RH_poster/). Roughly 2,350 markers were analysed on this panel, including 640 markers derived from the ends of sequenced clones and 616 entries downloaded from the RHdb database (http://www.ebi.ac.uk/RHdb). Marker ordering was treated as being analogous to the well studied ‘travelling salesman problem’ for which powerful heuristic tools are available8 (http://www.caam.rice.edu/keck/concorde.html). We then estimated the distances between markers adjacent in the resulting orders using standard maximum likelihood techniques9. Information from lower resolution maps was used to improve long-range continuity. The TNG map enabled us to order and orient clone contigs, select additional candidate seed points and determine the gap sizes in the late stages of the project.
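
The analogy with the travelling salesman problem can be made concrete with a toy example. The sketch below is not the Concorde solver and not the procedure used for the TNG map. It assumes that each marker is scored as retained (1) or not retained (0) in every hybrid line, takes the number of lines in which two markers differ as the 'distance' between them (a count of obligate breaks, one common simplification), and improves a starting order with a plain 2-opt local search. Marker names and retention vectors are invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[int, ...]


def obligate_breaks(u: Vector, v: Vector) -> int:
    """Number of hybrid lines in which two markers have different retention."""
    return sum(1 for x, y in zip(u, v) if x != y)


def path_cost(order: Sequence[str], vectors: Dict[str, Vector]) -> int:
    """Total number of obligate breaks between adjacent markers in an order."""
    return sum(
        obligate_breaks(vectors[a], vectors[b]) for a, b in zip(order, order[1:])
    )


def two_opt(order: Sequence[str], vectors: Dict[str, Vector]) -> List[str]:
    """Reverse segments of the order for as long as a reversal lowers the cost."""
    current = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(current) - 1):
            for j in range(i + 1, len(current)):
                candidate = current[:i] + current[i:j + 1][::-1] + current[j + 1:]
                if path_cost(candidate, vectors) < path_cost(current, vectors):
                    current = candidate
                    improved = True
    return current


# Toy retention data: the true order is a-b-c-d (or its reverse).
vectors = {
    "a": (0, 0, 0),
    "b": (1, 0, 0),
    "c": (1, 1, 0),
    "d": (1, 1, 1),
}
best = two_opt(["c", "a", "d", "b"], vectors)
print(best, path_cost(best, vectors))
```

A local search of this kind only guarantees that the returned order is no worse than the starting one. A map order and its reverse have the same cost, so orientation has to come from other information, such as the lower-resolution maps mentioned above.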

Contigs established earlier and covering around 15% of the chromosome were also included in the BAC map. Overall, we selected 162 seed clones (0.30 chromosome 14 equivalents). About 650 clones were assembled in three clone contigs covering more than 99% of chromosome 14 and totalling around 88 megabases (Mb). The largest contig (85 Mb) physically connects the most proximal genetic marker (D14S261) to the second most telomeric cluster of markers (D14S292), encompassing almost the entire genetic map of the chromosome. We estimate that fewer than five BACs remain to be incorporated to reach full chromosomal coverage.

Publicly available fingerprint data (ref. 5 and http://genome.wustl.edu/gsc/human/Mapping) provided a consistency check for our clone map, which was built independently. Fingerprinted contigs were anchored on our map in several ways. First, fully sequenced clones were digested ‘in silico’ and the resulting theoretical restriction pattern was aligned against the experimental patterns of the fingerprint clone map5. Second, anchoring of fingerprinted contigs was based on the map data associated with markers (including end sequences) contained in the digested clones. These map comparisons indicated that the fingerprint contigs and our clone scaffold were consistent at a coarse level (some contig junctions were predicted). However, the present clone scaffold was notably more accurate at the finer resolution of local clone ordering and clone overlaps.
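
The ‘in silico’ digestion step can be illustrated with a short sketch. The recognition site, cut offset, matching tolerance and function names below are assumptions made for the example and are not taken from the fingerprinting protocol. The sequence is treated as linear, and the cloning vector is ignored.

```python
from typing import List, Sequence


def digest(sequence: str, site: str = "AAGCTT", cut_offset: int = 1) -> List[int]:
    """Return fragment lengths (largest first) after cutting at every site occurrence."""
    seq = sequence.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)  # cut position inside the recognition site
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


def matched_bands(
    theoretical: Sequence[int], observed: Sequence[int], tolerance: float = 0.02
) -> int:
    """Count theoretical fragments that have an unused observed band of similar size."""
    remaining = sorted(observed)
    matched = 0
    for size in sorted(theoretical):
        for k, band in enumerate(remaining):
            if abs(band - size) <= tolerance * size:
                del remaining[k]  # each observed band may be used only once
                matched += 1
                break
    return matched


toy = "GG" + "AAGCTT" + "CCCC" + "AAGCTT" + "T"
print(digest(toy))
print(matched_bands([10000, 6000, 3000], [9900, 6500, 3010]))
```

Comparing the number of matched bands with the number of theoretical fragments gives a crude score for deciding whether a sequenced clone belongs to a given fingerprinted contig.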

The overlap between consecutive BACs resulting from the walking process was estimated to average 20 kilobases (kb) per walking step, and the overlap resulting from random gap closure was estimated to average 52 kb per contig junction. The fraction of redundant sequence was estimated to be 22%, approximately half of which was attributable to suboptimal gap closures.
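
The arithmetic behind a redundancy estimate can be sketched as follows. The definition used here is an assumption, because the text does not state the formula: redundancy is taken as the share of all sequenced clone bases that lie in overlaps, that is, one minus the ratio of the unique span to the summed clone lengths. Clone coordinates are hypothetical, half-open intervals in kb.

```python
from typing import Iterable, Tuple


def redundancy(clones: Iterable[Tuple[int, int]]) -> float:
    """Fraction of sequenced bases that duplicate bases already covered."""
    intervals = sorted(clones)
    total = sum(end - start for start, end in intervals)
    covered = 0
    current_start = current_end = None
    for start, end in intervals:
        if current_end is None or start > current_end:
            # Start of a new contig: bank the span of the previous one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Clone overlaps the current contig: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return (total - covered) / total if total else 0.0


# Three hypothetical clones with overlaps of 20 kb and 10 kb.
print(redundancy([(0, 100), (80, 200), (190, 300)]))
```

Under a different convention, for example dividing by the unique span instead of by the summed clone lengths, the same tiling path would give a slightly higher figure.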

Note that: (1) most of the redundancy in the clone tiling path is a consequence of the high number of seeds that were required to complete the map in a short time frame10; (2) this amount of redundancy is still significantly smaller than that computed for typical draft chromosomes (http://genome.ucsc.edu/); and (3) the fact that fewer than 1% of the selected clones originated from chromosomes other than 14 shows that the strategy is robust with respect to false links of various origins. This contrasts with the fingerprint-based selection of clones, which led to incorrect chromosomal assignment more frequently.

The distance from the telomere was estimated to be about 5 kb on the basis of fragment restriction data involving a yeast artificial chromosome clone containing the 14q telomere (F.M., unpublished data). The distance to the alphoid centromeric repeats is unknown. However, the current most centromeric BAC already extends 1,200 kb beyond the most proximal marker of previously reported maps11,12,13. Interestingly, this clone contains two markers from our TNG map that exhibit extremely high retention rates in the hybrid lines, indicating that they may be close to the centromere.

The clone coverage of chromosome 14 that has been achieved using essentially an STC strategy is very satisfactory, and compares favourably with the coverage obtained for the human chromosomes that have been completely sequenced3,4. The two remaining gaps are located in the subtelomeric part of 14q; subtelomeric regions of many chromosomes are under-represented in most genomic libraries and hence often contain cloning gaps. The larger gap, estimated to be around 600 kb, was subsequently divided into two smaller gaps following the identification of clones through library screening with probes mapped within the gap. The second and more distal gap (20 kb) occurs in the immunoglobulin heavy-chain constant gene region, which contains a number of nearly identical genes and pseudogenes.

Considerations of the optimum design of an STC strategy should include theoretical aspects which correlate the effective depth of the clone end library and the number of seed points to the level of sequence redundancy10,14, as well as practical aspects such as the sequencing capacity and costs15, the time schedule for the project and the resolution of the available mapping data. However, a centralized repository of BAC end sequences is the only prerequisite for the construction of a tiling path based on the STC approach. Other mapping resources used in this project were auxiliary and provided useful information for seed selection and validation of map extension. Such a strategy is therefore generally portable to any large-scale sequencing project and is readily compatible with partitioning of the project.