The Human Genome Project
Chapter 7
[W]hen the full map of the human genome is known … we shall have passed through a phase of human civilisation as significant as, if not more significant than, that which distinguished the age of Galileo from that of Copernicus, or that of Einstein from that of Newton…. We have crossed a boundary of unprecedented importance…. There is no going back…. We are walking hopefully in the scientific foothills of a gigantic mountain range.
—Ian Lloyd, 1990
In 1953 James D. Watson and Francis H. C. Crick described the double helical structure of deoxyribonucleic acid (DNA). Their molecular DNA structure was published in Nature in April 1953, in an article that was little more than one page. Their article ushered in a new age of discovery in genetics and laid the foundation for the sequencing of the human genome.
The word genome was derived from two words: gene and chromosome. Today, genome is widely understood to be the entire complement of genetic material in the cell of an organism. A genome is composed of a series of four nitrogenous DNA bases: adenine (A), guanine (G), thymine (T), and cytosine (C). In each organism these bases are arranged in a specific order, or sequence, and this order constitutes the genetic code of the organism. In humans the genome is composed of approximately three billion bases. In 2001 a first draft sequence of the entire human genome was completed and made available to the public for study and research. The Human Genome Project (HGP) of the National Human Genome Research Institute (NHGRI), which is one of the National Institutes of Health (NIH), completed the full human genome sequence in April 2003.
LAYING THE GROUNDWORK FOR THE SEQUENCING OF THE HUMAN GENOME
During the 1960s and 1970s the techniques that would enable the study of molecular genetics were developed. In 1964 the American virologist Howard Temin worked with ribonucleic acid (RNA) viruses and discovered that Crick's central dogma (that DNA makes RNA, and RNA makes protein) did not always hold true. In 1965 Temin described the process of reverse transcription, in which genetic information in the form of RNA is copied into DNA. The enzyme responsible, called reverse transcriptase, uses RNA as a template for the synthesis of a complementary DNA strand. Throughout the 1960s the American biochemists Robert William Holley, Har Gobind Khorana, and Marshall Nirenberg, along with the American geneticist Philip Leder, all contributed to deciphering the genetic code by determining the nucleotide triplets that specify each of the twenty most common amino acids. Holley, Khorana, and Nirenberg were awarded the 1968 Nobel Prize in Physiology or Medicine.
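The copying of RNA into complementary DNA described above can be illustrated with a short sketch. The function name and the example sequence below are assumptions chosen for illustration; the code models base pairing only and says nothing about how the enzyme itself works.

```python
# Base placed in the new DNA strand opposite each base of the RNA template.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def reverse_transcribe(rna):
    """Return the complementary DNA (cDNA) strand, written 5' to 3'.

    The two strands run in opposite directions, so the complement is
    reversed to give the conventional 5'-to-3' notation.
    """
    complement = [RNA_TO_DNA[base] for base in rna.upper()]  # KeyError on an invalid base
    return "".join(reversed(complement))

example_rna = "AUGGCU"  # illustrative sequence, not taken from any real gene
print(reverse_transcribe(example_rna))  # AGCCAT
```

Reading the result in reverse (TACCGA) shows each DNA base sitting opposite its RNA partner: T under A, A under U, C under G, and so on.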
The American biochemist Paul Berg created the first recombinant DNA in 1972, and his work paved the way for isolating and cloning genes. Recombinant DNA is formed by combining segments of DNA, frequently from different organisms. In 1975 the English molecular biologist Sir Edwin Southern developed a method to isolate and analyze fragments of DNA that remains in use today. Known as the Southern blot analysis, it is a technique for separating DNA fragments by electrophoresis (a technique that separates molecules based on their size and charge) and identifying a specific fragment using a DNA probe. Figure 7.1 shows how the Southern blot analysis is performed. It is used in genetic research, forensic examinations of DNA evidence in legal proceedings, and clinical medical practice.
In 1977 English biochemist Frederick Sanger, whose many accomplishments have been acknowledged by two Nobel Prizes, and his colleagues developed techniques to determine the nucleic acid base sequence for long sections of DNA. In 1978 American biologists Hamilton O. Smith and Daniel Nathans and the Swiss molecular biologist Werner Arber were awarded the Nobel Prize for an array of discoveries made during the 1960s, including the use of restriction enzymes, which ignited the biotechnology field. Restriction enzymes recognize and cut specific DNA sequences. The same year restriction fragment length polymorphisms (DNA sequence variants) were discovered. Figure 7.2 shows a single nucleotide polymorphism, a single base change between homologous DNA fragments.
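A minimal sketch can show how a restriction enzyme's site recognition leads to fragment length differences. The example assumes the well-known enzyme EcoRI, which recognizes GAATTC and cuts after the first G; the function name, the single-strand simplification, and the two short "alleles" are illustrative assumptions, not data from the studies described here.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut a linear DNA string at every occurrence of a recognition site.

    The defaults model the EcoRI pattern G^AATTC, where the cut falls one
    base into the six-base site. Only one strand is modeled.
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])  # remainder after the last cut
    return fragments

# Two made-up alleles that differ by one base inside the second site.
allele_1 = "TTGAATTCAAAAGAATTCGG"
allele_2 = "TTGAATTCAAAAGAGTTCGG"

print([len(f) for f in digest(allele_1)])  # [3, 10, 7]
print([len(f) for f in digest(allele_2)])  # [3, 17]
```

The single base change removes one recognition site, so the second allele yields a 17-base fragment in place of the 10-base and 7-base fragments. Differences of this kind are what a restriction fragment length polymorphism reveals.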
Using these new genetic techniques, several genes for serious human disorders were identified during the 1980s. In 1982 the American molecular biologist James F. Gusella and his colleagues at Harvard University began studying patients with Huntington's disease and determined that the gene for this degenerative, neuropsychiatric disorder was located on the short arm of chromosome 4. That same year a gene for neurofibromatosis type I was found on the long arm of chromosome 17. Neurofibromatoses are a group of genetic disorders that cause tumors to grow along various types of nerves and can affect the development of nonnervous tissues such as bones and skin. The disorder may also result in developmental abnormalities such as learning disabilities.
In 1985 the American biochemist Kary B. Mullis and his colleagues at the Cetus Corporation in California pioneered the polymerase chain reaction, a fast, inexpensive technique that amplifies small fragments of DNA to make sufficient quantities available for DNA sequence analysis—that is, determining the exact order of the base pairs in a segment of DNA. Because it enabled researchers to make an unlimited number of copies of any piece of DNA, it was dubbed "molecular photocopying," and in 1993 Mullis was awarded the Nobel Prize for this tremendous breakthrough in gene analysis. By 1987 automated sequencers were developed, enabling even more rapid sequencing and analysis on large segments of DNA. Figure 7.3 shows the steps involved in a polymerase chain reaction.
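The "molecular photocopying" image rests on simple arithmetic: if every template is copied once per cycle, the number of copies doubles each cycle. The sketch below is an idealized model under that assumption; the function name and the `efficiency` parameter are hypothetical, and real reactions fall short of perfect doubling and eventually plateau.

```python
def pcr_copies(initial_templates, cycles, efficiency=1.0):
    """Expected copy number after a given number of PCR cycles.

    efficiency is the fraction of templates copied in each cycle;
    1.0 means ideal doubling, which is an assumption of this sketch.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_templates * (1.0 + efficiency) ** cycles

print(f"{pcr_copies(1, 30):,.0f}")  # 1,073,741,824
```

Under ideal doubling, a single template therefore yields about a billion copies after thirty cycles, which is why a small fragment can be amplified to a quantity sufficient for sequence analysis.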
In 1985 the Canadian molecular geneticist Lap-Chee Tsui and his research team mapped the gene responsible for cystic fibrosis, the most common inherited fatal disease of children and young adults in the United States, to the long arm of chromosome 7. The gene for cystic fibrosis was discovered in 1989, and it was determined that three missing nucleic acid bases occurred in the altered gene of 70% of patients with cystic fibrosis.
The mutations associated with Duchenne muscular dystrophy were identified in 1987. The gene involved is located close to the gene for chronic granulomatous disease (an X-linked recessive disorder that, if left untreated, is fatal in childhood) on the short arm of the X chromosome. In 1990 the American geneticist Mary-Claire King found the first evidence that a gene on chromosome 17 (now known as BRCA1) could potentially be associated with an inherited predisposition to breast and ovarian cancer.
The discoveries and technological advances made by researchers during the 1970s and 1980s gave rise to modern clinical molecular genetics. The study of chromosome structure and function, called cytogenetics, produced methods to view distinct bands on each chromosome. Figure 7.4 is a cytogenetic map of human chromosomes. Cytogenetic studies are applied in three broad areas of medicine: congenital (from birth) disorders, prenatal diagnosis, and neoplastic diseases (cancer).
BIRTH OF THE HUMAN GENOME PROJECT
The first meetings to discuss the feasibility of sequencing the human genome were organized by Robert Louis Sinsheimer, a molecular biologist and chancellor of the University of California (UC) at Santa Cruz, and were held on campus in 1985. The idea of sequencing the human genome generated excitement among the many well-known researchers in attendance—they considered the undertaking to be the "Holy Grail" of molecular biology. The following year the U.S. Congress began to consider the feasibility of human genome research. Congress did not, however, decide to fund the project until 1988, after it concluded that the establishment of administrative centers accountable to Congress could effectively manage key aspects of the project, such as databases, sharing of research findings and materials, and cultivation of new technologies. The initial funding from Congress was $17.3 million to the NIH and $11.8 million to the U.S. Department of Energy's (DOE) Office of Health and Environmental Science, with progressive increases over the next few years.
TABLE 7.1
U.S. Human Genome Project funding, fiscal years 1988–2003
[$ millions]
Fiscal year | Department of Energy | National Institutes of Health | U.S. total |
1988 | 10.7 | 17.2 | 27.9 |
1989 | 18.5 | 28.2 | 46.7 |
1990 | 27.2 | 59.5 | 86.7 |
1991 | 47.4 | 87.4 | 134.8 |
1992 | 59.4 | 104.8 | 164.2 |
1993 | 63.0 | 106.1 | 169.1 |
1994 | 63.3 | 127.0 | 190.3 |
1995 | 68.7 | 153.8 | 222.5 |
1996 | 73.9 | 169.3 | 243.2 |
1997 | 77.9 | 188.9 | 266.8 |
1998 | 85.5 | 218.3 | 303.8 |
1999 | 89.9 | 225.7 | 315.6 |
2000 | 88.9 | 271.7 | 360.6 |
2001 | 86.4 | 308.4 | 394.8 |
2002 | 90.1 | 346.7 | 434.3 |
2003 | 64.2 | 372.8 | 437.0 |
Note: These numbers do not include construction funds, which are a very small part of the budget.
Source: "Human Genome Project Budget," in Human Genome Project Information, U.S. Department of Energy Office of Science, Office of Biological and Environmental Research, Human Genome Project, http://www.ornl.gov/sci/techresources/Human_Genome/project/budget.shtml (accessed October 27, 2006)
The allocation of these funds incited an impassioned debate. Opponents argued that the financial and human resources devoted to the "big science" of the human genome project would divert research funds from vital scientific and biomedical research and that most of the sequence was of little biological interest and no medical utility. Other detractors warned that the sheer size of the human genome would impede completion of the project within a reasonable time frame without the creation of entirely new research methods and technologies. The project was launched despite considerable opposition, and most of these concerns were dispelled during the project's early years. Table 7.1 shows the HGP budget from fiscal year 1988 through its completion in fiscal year 2003.
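Readers who want to work with the figures in Table 7.1 can check each row's arithmetic with a few lines of code. The sketch below copies the table's values as printed; the function name and the tolerance are assumptions made for illustration. It flags any year in which the two agency figures do not add up to the stated U.S. total, without indicating which of the three printed numbers is at fault.

```python
# (fiscal year, DOE, NIH, stated U.S. total) in $ millions, as printed in Table 7.1
BUDGET = [
    (1988, 10.7, 17.2, 27.9), (1989, 18.5, 28.2, 46.7),
    (1990, 27.2, 59.5, 86.7), (1991, 47.4, 87.4, 134.8),
    (1992, 59.4, 104.8, 164.2), (1993, 63.0, 106.1, 169.1),
    (1994, 63.3, 127.0, 190.3), (1995, 68.7, 153.8, 222.5),
    (1996, 73.9, 169.3, 243.2), (1997, 77.9, 188.9, 266.8),
    (1998, 85.5, 218.3, 303.8), (1999, 89.9, 225.7, 315.6),
    (2000, 88.9, 271.7, 360.6), (2001, 86.4, 308.4, 394.8),
    (2002, 90.1, 346.7, 434.3), (2003, 64.2, 372.8, 437.0),
]

def mismatched_rows(rows, tolerance=0.05):
    """Return (year, DOE + NIH, stated total) where the sum and total disagree."""
    return [
        (year, round(doe + nih, 1), total)
        for year, doe, nih, total in rows
        if abs(doe + nih - total) > tolerance
    ]

print(mismatched_rows(BUDGET))  # [(2002, 436.8, 434.3)]
```

Only fiscal year 2002 is flagged: the agency figures sum to 436.8, whereas the table states 434.3. Every other row is internally consistent.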
In 1988 Congress provided funding to the NIH and the DOE to "coordinate research and technical activities related to the human genome." The NIH also established the Office of Human Genome Research in September 1988. The following year the office was renamed the National Center for Human Genome Research (NCHGR). James Watson served as its enthusiastic champion and director until April 1992. Following his appointment, the NIH and the DOE committed 3% to 5% of the project's budget to address ethical, legal, and social issues that arose from the study of the human genome (February 5, 2003, http://genome.rtc.riken.go.jp/hgmis/elsi/elsi.html). This ambitious undertaking constituted the largest bioethics program, in terms of funding and human resources, in the world.
The Human Genome Project Information Web site (December 7, 2005, http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml) describes the ambitious goals of the HGP when it began in 1990. The overarching HGP goals were to:
- Identify all the approximately 30,000 genes in human DNA
- Determine the sequences of the three billion chemical base pairs in human DNA
- Store HGP findings and other information in databases
- Improve tools for data analysis
- Transfer related technologies to the private sector
- Effectively address the ethical, legal, and social issues that might arise from the project
International genomic research was also under way in England, France, Germany, Japan, and other countries. In 1987 the Italian National Research Council launched a genome research project; the United Kingdom began its project in February 1989. In 1989 an international group of geneticists founded the Human Genome Organization (HUGO) in Switzerland. Many international collaborations had already been forged as individual scientists exchanged information in their quests for genetic links to disease. HUGO developed an international framework to coordinate research projects and prevent wasted resources through duplication, creating a culture of sharing data. In 1990 the European Commission initiated a two-year human genome project. Russia funded its genome research project in the same year.
In April 1990 the initial planning stage of the U.S. HGP was completed with the publication of the joint research plan Understanding Our Genetic Inheritance: The Human Genome Project—The First Five Years, Fiscal Years 1991–1995 (http://www.ornl.gov/sci/techresources/Human_Genome/project/5yrplan/summary.shtml). Just two years into the five-year plan, Watson resigned from his leadership position with the NCHGR because he vehemently disagreed with NIH decisions about the commercialization, propriety, and legality of patenting human gene sequences. Watson maintained that data from the HGP should be in the public domain and freely available to all scientists as well as to the public. In April 1993 the American geneticist and physician Francis S. Collins was named the director.
Many prominent researchers sided with Watson against the patenting and commercialization of HGP data. In 1996 scientists at leading research institutions throughout the world agreed to submit their findings and genome sequences to GenBank, a genome database maintained by the NIH. In a resounding and unanimous move, they required the publication of any submitted sequence data on the Internet within twenty-four hours of its receipt by GenBank. This action ensured that gene sequences were in the public domain and could not be patented.
TABLE 7.2
Model organisms sequenced
Date sequenced (a) | Species | Total bases (b) |
7/28/1995 | Haemophilus influenzae (bacterium) | 1,830,138 |
10/30/1995 | Mycoplasma genitalium (bacterium) | 580,073 |
5/29/1997 | Saccharomyces cerevisiae (yeast) | 12,069,247 |
9/5/1997 | Escherichia coli (bacterium) | 4,639,221 |
11/20/1997 | Bacillus subtilis (bacterium) | 4,214,814 |
12/31/1998 | Caenorhabditis elegans (round worm) | 97,283,371; 99,167,964 (c) |
3/24/2000 | Drosophila melanogaster (fruit fly) | ∼137,000,000 |
12/14/2000 | Arabidopsis thaliana (mustard plant) | ∼115,400,000 |
1/26/2001 | Oryza sativa (rice) | ∼430,000,000 |
2/15/2001 | Homo sapiens (human) | ∼3,200,000,000 |
(a) First publication date.
(b) Data excludes organelles or plasmids. These numbers should not be taken as absolute. Scientists are confirming the sequences; several laboratories were involved in the sequencing of a particular organism and have slightly different numbers; and there are some strain variations.
(c) The first number was originally published, and the second is a correction as of June 2000.
Source: Richard Robinson, ed., "Model Organisms Sequenced," in Genetics, Vol. 2, E-I, Macmillan Reference USA, 2002
Worming Away
Although sequencing the human genome was the principal objective, the HGP also sought to sequence the genomes of other organisms. These other organisms served as models, enabling researchers to test and refine new methods and technologies that helped identify corresponding genes in the human genome. Table 7.2 is a list of some of the model organisms, including the roundworm, sequenced during the course of the HGP, along with the dates the sequences were published and the number of bases in each.
At England's Cambridge University, the geneticist and molecular biologist Sydney Brenner was studying the nematode worm Caenorhabditis elegans. By 1989 Brenner and his colleagues had successfully produced a map of the entire Caenorhabditis elegans genome. The map consisted of multiple overlapping fragments of DNA, arranged in the correct order, and Brenner's research team printed the worm's genome on postcard-sized pieces of paper.
Watson believed that the genomes of smaller organisms would not only help to refine research methods and the technology but also provide valuable sources of comparison once the human genome project was under way. The worm map convinced Watson that Caenorhabditis elegans should be the first multicellular organism to have its complete genome accurately sequenced. When the worm-sequencing project began in 1990, the first automatic sequencing machines had just become available from Applied Biosystems, Inc. The sequencing machines enabled the worm pilot project to meet its objective of sequencing three million bases in three years. Equally important, the worm project demonstrated that the technology could scale up—that is, more machines and more technologists could produce more sequences faster.
A DECADE OF ACCOMPLISHMENTS: 1993 TO 2003
When the HGP began in September 1990, its projected completion date was 2005, at a cost of $3 billion for just the U.S. portion of the research. Ever-improving research techniques accelerated the progress of the project; these included the use of restriction fragment length polymorphisms, which are described in detail in Figure 7.5, as well as the polymerase chain reaction, bacterial and yeast artificial chromosomes, and pulsed-field gel electrophoresis. The HGP researchers finished the mapping two years earlier than scheduled, and U.S. scientists spent just $2.7 billion.
Although the HGP was truly a collaborative, international effort, most of the sequencing work was performed at the Whitehead Institute for Medical Research in Massachusetts, the Baylor College of Medicine in Texas, the University of Washington, the Joint Genome Institute in California, and the Sanger Centre in the United Kingdom. Along with U.S. government funding, the HGP was supported by the Wellcome Trust, a charitable foundation in the United Kingdom.
Public and Private Initiatives Compete
In 1993 the NCHGR established a Division of Intramural Research, charged with developing genome technology research of specific diseases. By 1996 eight NIH institutes and centers had also collaborated to create the Center for Inherited Disease Research to study the genetics of complex diseases. In 1997 the NCHGR gained full institute status at the NIH, becoming the National Human Genome Research Institute (NHGRI). A third five-year plan was announced in 1998 in Science.
The mid-1990s also saw the birth of a privately funded genomics effort led by the American geneticist J. Craig Venter. Venter had been working in a laboratory at the NIH when he decided to concentrate his sequencing efforts not on the genome itself but on the gene products, that is, the messenger RNAs produced by each cell. Besides the genes themselves, the genome is composed of a much larger amount of DNA whose function is as yet unknown. This portion of the genome is often referred to as "junk DNA," although it is possible that its true value has not yet been determined. Venter eventually left the NIH to establish The Institute for Genomic Research (TIGR), a private, nonprofit organization aimed at collecting and interpreting vast numbers of expressed sequence tags. (Expressed sequence tags are random fragments of complementary DNAs derived from the information in the RNAs, which contain all the information that is actually expressed in a given cell type.) In July 1995 TIGR published the first sequenced genome of the bacterium Haemophilus influenzae. (See Table 7.2.)
In May 1998 Venter announced that he was allying with the Perkin-Elmer Corporation to form a new company that would compete directly with the public effort to sequence the entire human genome by 2001. The new company would become Celera Genomics. Venter planned to take a different approach from the one used by the HGP. Instead of the map-based, gene-by-gene approach taken by the HGP, his firm planned to break the genome into random lengths, and then sequence and reassemble it. This method saved time by eliminating the mapping phase, but it required robust computing capabilities to reassemble the human genome, which includes many repeated sequences. Venter's approach relied on the use of a supercomputer and 300 high-speed automatic sequencers manufactured by the Perkin-Elmer Corporation; it was the precursor to the large-scale genomics studies that have become standard.
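The idea of breaking a sequence into random pieces and reassembling it from their overlaps can be shown with a toy example. The sketch below uses a simple greedy strategy on three made-up reads; it is not the method Celera used, whose assembler had to cope with sequencing errors and the many repeated sequences mentioned above. All names and sequences are illustrative assumptions.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two fragments that share the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

# Made-up reads covering the made-up sequence ATGCGTACGTTAGC, in shuffled order.
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

Even in this small case every fragment must be compared with every other, which hints at why assembling billions of bases without a prior map demanded so much computing power. A greedy merge of this kind can also join the wrong fragments when the same stretch of sequence occurs in more than one place.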
The competition between public and private initiatives began in earnest. Within one week of the launch of the private initiative, the Wellcome Trust increased its funding to the Sanger Centre to step up the production of raw sequence. In response to this increased support, the Sanger Centre revised its objective by aiming to sequence a full one-third of the entire genome rather than just one-sixth. The race between the public and private projects was on, and the milestones of the sequencing project came ever more rapidly.
In the United States Venter's firm energized the publicly funded project and inspired intensified efforts. HGP investigators feared that if their efforts were perceived as slow and inefficient, the HGP would lose congressional support and funding. The threat of genomic information ending up solely in the hands of a private firm was simply unacceptable to the researchers. When Celera entered the race to sequence the human genome, it changed the landscape of the field. Celera resolved to make its data available only to paying customers and planned to patent some sequences before releasing them.
The publicly funded HGP released sequence information quickly both to provide the scientific community with timely data that were immediately usable and to place identified sequences beyond the reach of commercial companies wishing to patent them or charge for access to the data. The leadership of the HGP echoed the sentiments that had prompted Watson's resignation. They contended that patenting the human sequence was unethical and delayed the timely application of genomic information to medical disorders.
The international partners in the genome project met in Bermuda in February 1996 at a strategy meeting sponsored by the Wellcome Trust. There they created the "Bermuda Principles," a set of conditions that govern access to data, including the standard that sequence information be released into public databases within twenty-four hours. To adhere to this agreement, participating scientists were to deposit base sequences into one of three databases within twenty-four hours of sequencing completion. The data contained in the three databases were exchanged daily. Because these were public databases, access to the stored sequences was free and unrestricted. The agreement was extended to data on other organisms at a meeting later that same year.
FIRST DRAFT OF THE COMPLETED HUMAN GENOME
In April 2000 Celera announced that it was prepared to present the first draft of the human genome. Not unexpectedly, scientists and the public eagerly anticipated this "first look" at the human genome. Although geneticists and other scientists could better comprehend the mechanics and the future implications of this endeavor than the general public, the significance of this achievement was evident to professionals and laypeople alike. The professional literature and the mass media had successfully communicated the importance of this achievement, and it was understood that knowledge of the human genome held the key to the singularity of the human species. Furthermore, it was widely assumed that this information would be the basis for unprecedented advances in medicine and biomedical technology.
In February 2001 the first working draft of the human genome, covering 90% of the sequence of the genome's three billion bases, was published in special issues of the journals Nature (February 15, 2001) and Science (February 16, 2001). Nature presented the initial analysis of the sequence generated by the public HGP, and Science contained the draft sequence reported by the private project conducted by Celera.
One of several surprises from the first draft was that previous estimates of gene number appeared to have been wildly inaccurate. Most pregenome project estimates predicted that humans had as many as 60,000 to 150,000 genes. The first draft of the complete genome sequence indicated that the true number of genes required to make a human being was fewer than 40,000. By comparison, yeast have about 6,000 genes, fruit flies have 14,000, roundworms have 19,000, and the mustard weed plant has 26,000. Another surprise was the observation that any two humans share 99.9% of the nucleotide sequence of the genome. Notably, human diversity at the genetic level is encoded by a variation of less than 0.1% in DNA.
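A quick calculation puts the 0.1% figure in perspective. The sketch below uses the chapter's round figure of three billion bases and treats the 99.9% shared fraction as exact; both are approximations, and the function name is an assumption made for illustration.

```python
GENOME_BASES = 3_000_000_000  # approximate genome size used in this chapter
SHARED_FRACTION = 0.999       # 99.9% of the sequence is shared

def differing_positions(genome_bases, shared_fraction):
    """Approximate number of bases at which two genomes differ."""
    return round(genome_bases * (1 - shared_fraction))

print(f"{differing_positions(GENOME_BASES, SHARED_FRACTION):,}")  # 3,000,000
```

A difference of 0.1% therefore still corresponds to roughly three million positions, which is the upper bound implied by the figures quoted here.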
First Draft Is Headline News
A White House press release (June 25, 2000, http://www.ornl.gov/sci/techresources/Human_Genome/project/clinton1.shtml) predicted some of the anticipated medical outcomes of the project. These included the ability to:
- Alert patients that they are at risk for certain diseases. Once scientists discover which DNA sequence changes in a gene can cause disease, healthy people can be tested to see whether they risk developing conditions such as diabetes or prostate cancer later in life. In many cases, this advance warning can be a cue to start a vigilant screening program, to take preventive medicines, or to make diet or lifestyle changes that may prevent the disease.
- Reliably predict the course of disease. Diagnosing ailments more precisely will lead to more reliable predictions about the course of a disease. For example, a genetic fingerprint will allow doctors treating prostate cancer to predict how aggressive a tumor will be. New genetic information will help patients and doctors weigh the risks and benefits of different treatments.
- Precisely diagnose disease and ensure that the most effective treatment is used. Genetic analysis allows us to classify diseases, such as colon cancer and skin cancer, into more defined categories. These improved classifications will eventually allow scientists to tailor drugs for patients whose individual response can be predicted by genetic fingerprinting. For example, cancer patients facing chemotherapy could receive a genetic fingerprint of their tumor that would predict which chemotherapy choices are most likely to be effective, leading to fewer side effects from the treatment and improved prognoses.
- Develop new treatments at the molecular level. Drug design guided by an understanding of how genes work and knowledge of exactly what happens at the molecular level to cause disease will lead to more effective therapies. In many cases, rather than trying to replace a gene, it may be more effective and simpler to replace a defective gene's protein product. Alternatively, it may be possible to administer a small molecule that would interact with the protein to change its behavior. This is the strategy behind a drug in development for chronic myelogenous leukemia, which targets the genetic flaw causing the disease. It attaches to the abnormal protein caused by the genetic flaw and blocks its activity. In preliminary tests, blood counts returned to normal in all patients treated with the drug.
On June 26, 2000, BBC News compiled assessments of the achievement by the world's premier scientists and politicians (http://news.bbc.co.uk/1/hi/sci/tech/807126.stm). Venter spoke for many researchers when he said, "I think we will view this period as a very historic time, a new starting point." Michael Dexter of the Wellcome Trust echoed Venter's sentiments when he ventured, "This is the outstanding achievement not only of our lifetime, but in terms of human history. I say this, because the Human Genome Project does have the potential to impact on the life of every person on this planet." Randy Scott, the president of Incyte, another private firm involved in genomics research, predicted that "[t]he availability of genome sequence is just the beginning. Scientists now want to understand the genes and the role they play in the prevention, diagnosis and treatment of disease." Mike Stratton of the Cancer Genome Project was equally optimistic when he said that "[i]t would surprise me enormously if in twenty years the treatment of cancer had not been transformed." The Nobel Prize-winning English biochemist and professor Frederick Sanger, the pioneer of DNA sequencing, expressed the collective awe of the scientific community when the HGP was completed earlier than anticipated. Sanger admitted, "I never thought it would be done as quickly as this."
The U.S. media celebrated the achievement with a flood of press releases and features. Efforts were also made to explain this monumental accomplishment to the public and to educate students. The DOE Human Genome Program provided a wealth of information about the HGP and its findings on the Internet. Figure 7.6 is an example of the information the DOE made available to the public. Table 7.3 presents some of the potential environmental benefits of the HGP that may be realized in the future.
TABLE 7.3
Projected national benefits of genomics research, 2020, 2040, and 2050
Within a decade | Long term |
2020
Develop knowledge base for cost-effective cleanup strategies | Save billions of dollars in toxic waste cleanup and disposal |
2040
Understand earth's natural carbon cycle and design strategies for enhanced carbon capture | Help stabilize atmospheric carbon dioxide to counter global warming |
2050
Increase biological sources of fuels and electricity | Contribute to U.S. energy security; biohydrogen-based industry in place |
Source: "Payoffs for the Nation," in "Image Gallery," in Human Genome Project Information, U.S. Department of Energy Office of Science, Office of Biological and Environmental Research, Human Genome Project, http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/payoffs.jpg (accessed October 27, 2006)
PUFFER FISH AND MOUSE GENOMES ARE SEQUENCED
In July 2002 the DOE Joint Genome Institute (JGI), operated by the Lawrence Berkeley National Laboratory, the Lawrence Livermore National Laboratory, and the Los Alamos National Laboratory, announced the draft sequencing, assembly, and analysis of the genome of the Japanese puffer fish Fugu rubripes. The Fugu Genome Project was initiated in 1989 in Cambridge, England, and in November 2000 the International Fugu Genome Consortium was formed, headed by the JGI. During 2001 the puffer fish genome was sequenced and assembled using the whole genome method pioneered by Celera. The puffer fish was the first vertebrate genome to be publicly sequenced and assembled in this manner and the first vertebrate genome published after the human genome. According to the JGI (2005, http://genome.jgi-psf.org/Takru4/Takru4.home.html), puffer fish have the smallest known genomes among vertebrates (animals with bony backbones or cartilaginous spinal columns—fish, reptiles, birds, and mammals, including humans). The puffer fish sequence has about the same number of genes as the considerably larger human genome, but is more compact because it contains relatively little of the junk DNA present in the human genome sequence.
Comparison of the human and puffer fish genomes enabled investigators to predict the existence of nearly 1,000 previously unidentified human genes. Although the function of these additional genes is as yet unknown, they contribute to the complete catalog of human genes. Ascertaining the existence and location of genes helped scientists begin to describe how they are regulated and function in the human body. More than 30,000 puffer fish genes were identified, and the vast majority of human genes have counterparts in the puffer fish; the most significant differences are in genes of the immune system, metabolic regulation, and other physiological systems that are not alike in fish and mammals.
On December 5, 2002, the first draft of the sequence of the mouse genome was published in Nature. The mouse genome findings were deemed among the most important in terms of their comparability with humans. Mice and humans have about the same number of genes—approximately 20,000—and DNA base pairs—mice have 2.5 billion and humans have 2.9 billion. More important, about 90% of genes associated with medical disorders in humans have counterparts in mice. This finding means that mice are especially well suited for studying diseases that afflict humans and for testing therapeutic treatments for disease.
HUMAN GENOME PROJECT IS COMPLETED
After the publication of a first-draft human genome in 2001, work continued to fill in the blanks and produce a complete and accurate sequence. On January 10, 2003, another milestone in the human genome sequencing effort was reported: chromosome 14 had become the fourth human chromosome to be fully sequenced and, with 87 million base pairs, the largest one completed to date. The researchers Jean Weissenbach and Roland Heilig of Genoscope, France's National Sequencing Center in Paris, published their findings online in Nature. They found two genes that are vital for immune responses on chromosome 14 and about sixty genes that, when defective, contribute to disorders such as spastic paraplegia and Alzheimer's disease.
In a remarkable coincidence that made the crowning achievement of the HGP even more poignant, the completion of the sequencing of the human genome occurred during the same year slated for celebrations of the fiftieth anniversary of the discovery of the DNA double helix. On April 14, 2003, the International Human Genome Sequencing Consortium, directed in the United States by the DOE and the NHGRI, announced the successful completion of the HGP more than two years earlier than had been anticipated (http://www.genome.gov/11006929). The International Human Genome Sequencing Consortium included scientists at twenty sequencing centers in China, France, Germany, Great Britain, Japan, and the United States.
Nature, the same journal that had published the groundbreaking discoveries of Watson and Crick fifty years earlier, hailed the era of the genome in a special edition dated April 24, 2003. As had been the practice since the inception of the HGP, the entirety of sequence data generated by the HGP was immediately entered into public databases and made freely available to the scientific community throughout the world, with no restrictions on its use or redistribution. The data are used by researchers in academic settings and industry, as well as by commercial firms that provide information services to biotechnologists. Figure 7.7 shows some of the landmarks in the sequenced human genome.
In the April 14, 2003, press release "All Goals Achieved: New Vision for Genome Research Unveiled" (http://www.genome.gov/11006929), the NHGRI described the international effort to sequence the three billion DNA base pairs in the human genome as "one of the most ambitious scientific undertakings of all time," comparing it to feats such as splitting the atom or traveling to the moon. NHGRI director Francis S. Collins proudly declared that "the Human Genome Project has been an amazing adventure into ourselves, to understand our own DNA instruction book, the shared inheritance of all humankind. All of the project's goals have been completed successfully—well in advance of the original deadline and for a cost substantially less than the original estimates." In the same press release Eric Lander, the director of the Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, predicted the postgenomic era when he asserted, "The Human Genome Project represents one of the remarkable achievements in the history of science. Its culmination this month signals the beginning of a new era in biomedical research. Biology is being transformed into an information science, able to take comprehensive global views of biological systems. With knowledge of all the components of the cells, we will be able to tackle biological problems at their most fundamental level."
Collins urged the scientific community not to rest on its laurels in the wake of this triumph, saying, "With this foundation of knowledge firmly in place, the medical advances promised from the project can now be significantly accelerated." The April 24, 2003, issue of Nature detailed the challenges researchers would face in the postgenomic era as they sought to employ the HGP data to treat disease and improve public health. Recommendations included collaborative efforts to produce:
- New tools to allow discovery in the not-too-distant future of the genetic contributions to frequently occurring diseases, including diabetes, heart disease, and mental illnesses such as schizophrenia
- Improved methods for detecting disease early, so that treatment can begin when it is most likely to be effective
- New technologies able to sequence the entire genome of any person affordably, ideally for less than $1,000
- Wider access to tools and technologies of "chemical genomics" to enhance understanding of biological pathways and accelerate pharmaceutical and other treatment research
Along with the special commemorative issue of Nature, the April 11, 2003, edition of Science ran articles that described the HGP and detailed the multidisciplinary DOE plan dubbed "Genomes to Life," which aimed to use HGP data to understand the ways in which microbes can provide opportunities to develop clean energy, reduce climate change, and clean the environment.
HGP REVISES ITS ESTIMATE
In October 2004 the NHGRI lowered its estimate of the number of human genes from a range of 30,000 to 35,000 to a range of 20,000 to 25,000 (http://www.genome.gov/12513430). The refined human genome sequence, published in the October 21, 2004, issue of Nature, was the most complete version to date. It covered 99% of the gene-containing parts of the human genome, identified nearly all known genes, and was 99.9% accurate, according to the HGP scientists.
DAWN OF THE POSTGENOMIC ERA
The Human Genome Project Information Web site (October 9, 2006, http://www.ornl.gov/sci/techresources/Human_Genome/project/benefits.shtml) enumerates many of the potential benefits of HGP research. Besides its role in the practice of molecular medicine, other uses of HGP data and applications of human and other genomic research include:
- Microbial genomics—the use of bacteria to create new energy sources such as biofuels and safe, efficient toxic waste cleanup; enhanced understanding of how microbes cause disease; and protection from threats of biological and chemical terrorism and warfare (See Figure 7.8.)
- Risk assessment—measuring the risks and health problems caused by exposure to radiation, carcinogens (cancer-causing agents), and mutagenic chemicals; and reduction of the probability of heritable mutations
- Archaeology, anthropology, evolution, and human migration—comparing the genomes of humans and other organisms such as mice, which has already identified similar genes associated with diseases and traits; improving the understanding of germline (cells that give rise to eggs or sperm) mutations; studying migration based on female genetic inheritance; examining mutations on the Y chromosome to trace lineage and migration of males; and comparing the DNA sequences of entire genomes of different microbes to enhance the understanding of the relationships among the three domains of life: archaea and bacteria (both single-celled organisms whose cells lack nuclei) and eukaryotes (organisms whose cells contain nuclei)
- DNA forensics—identify crime victims, potential suspects, and catastrophe victims through examination of DNA; confirm paternity and other family relationships; clear people wrongly accused of crimes; identify and protect endangered species; detect bacterial and other environmental pollutants; match organ donors and recipients for transplant programs; and determine pedigrees for animals and plants
- Agriculture and livestock breeding—develop healthier, stronger crops and farm animals able to resist insects, disease, and drought; create safer pesticides; grow more nutritious produce; incorporate vaccines into food products; and redeploy plants such as tobacco for use in environmental cleanup programs
Molecular Medicine
The HGP and the technological advances it has produced have moved the field of molecular medicine forward with extraordinary speed. In "Genomes, Transcriptomes, and Proteomes: Molecular Medicine and Its Impact on Medical Practice" (Archives of Internal Medicine, January 27, 2003), Ivan Gerling, Solomon S. Solomon, and Michael Bryer-Ash assert that the HGP will not only influence the way science is conducted but also will advance the clinical practice of medicine. Gerling, Solomon, and Bryer-Ash credit the HGP for the technological advances that enable preclinical detection—recognition of disease before its earliest biochemical or visible expression. They foresee increasing accuracy and ease of preclinical detection, as well as the ability to predict disease based on three fundamental levels of biologic determination:
- The genomic DNA constitution of the individual (the genome), which is unchanged from the moment of conception, except for some isolated, local mutations
- The transcribed messenger RNA complement (the transcriptome)
- The full range of translated proteins (the proteome)
Gerling, Solomon, and Bryer-Ash posit that the environment influences gene expression and modifies gene products in ways that initiate, accelerate, or slow progress of disease-causing processes. This does not change the genome, but it does change the transcriptome and the proteome. Recent technological breakthroughs have provided the tools to perform the comprehensive molecular analyses needed to examine not only the genome but also the transcriptome and proteome. Using new technologies will dramatically increase understanding at the molecular level of the mechanisms of disease development.
Along with molecular diagnosis of diseases even before they are clinically apparent, Gerling, Solomon, and Bryer-Ash predict increasingly effective therapies as genetic information enables physicians to individualize treatment in response to the availability of comprehensive genetic and molecular profiles. Although there are many promises and potential benefits of molecular medicine—improved diagnostic ease, speed, and accuracy; earlier detection of genetic predisposition or susceptibility to disease; gene therapy; and pharmaceutical drug development, specifically pharmacogenetics to produce "customized drugs"—Gerling, Solomon, and Bryer-Ash emphasize that the increased knowledge must be used responsibly. They caution that society must take steps to ensure that this improved understanding of genetics is not deployed to exclude people from obtaining insurance or employment.
Haplotype Mapping Project
In October 2002 an international effort to develop a haplotype map of the human genome was launched. A haplotype is a set of alleles or markers on one of a pair of homologous chromosomes, and a haplotype map shows patterns of human genetic variation. The premise of the International HapMap Project was that, within a given chromosomal region, certain combinations of genetic variants (haplotypes) occur together far more frequently than others. Because it is based on these common haplotype patterns (combinations of DNA sequence variants that are usually found together), the haplotype map simplifies the search for medically important DNA sequence variations and offers new understanding of human population structure and history.
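The idea that only a few of the possible variant combinations are actually common can be illustrated with a short Python sketch. The eight three-letter "chromosomes" below are invented for illustration and are not HapMap data, and the function name haplotype_frequencies is hypothetical. Each string lists the allele found at three neighboring variable sites on one chromosome.

```python
from collections import Counter
from itertools import product

# Hypothetical alleles at three variable sites on eight chromosomes
# (illustrative values only, not real HapMap data).
chromosomes = ["ACG", "ACG", "ACG", "GTA", "GTA", "GTA", "ACG", "GCA"]


def haplotype_frequencies(haplotypes):
    """Return {haplotype: fraction of chromosomes carrying it}, most common first."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {hap: n / total for hap, n in counts.most_common()}


freqs = haplotype_frequencies(chromosomes)

# Alleles observed at each site, used to count every combination that
# could exist if the sites varied independently of one another.
alleles_per_site = [sorted(set(site)) for site in zip(*chromosomes)]
possible = len(list(product(*alleles_per_site)))

print(freqs)
print(f"{len(freqs)} of {possible} possible combinations observed")
```

In this toy sample two haplotypes account for most chromosomes even though eight combinations are possible, which is the kind of pattern that lets a small number of marker variants stand in for a whole region.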
Given that any two people are 99.9% identical genetically, understanding the 0.1% difference is important because it helps explain why one person may be more susceptible to a certain disease than another. Researchers can use the HapMap to compare the genetic variation patterns of a group of people known to have a specific disease with a group of people without the disease. Finding a certain pattern more often in people with the disease identifies a genomic region that may contain genes that contribute to the condition. Researchers hope that identifying single nucleotide polymorphisms (SNPs), which are specific positions in the genome sequence that are occupied by one nucleotide in some copies and by a different nucleotide in others, will enable them to identify the alleles (particular forms of genes) that are associated with increased or decreased susceptibility to common diseases, such as asthma, heart disease, or psychiatric illness. Figure 7.9 shows that most SNP variation—about 85%—occurs within all populations.
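The definition of a SNP given above can be made concrete with a minimal Python sketch. It assumes the copies of a genomic region have already been aligned and have equal length, which is a simplification of real sequencing work. The sequences and the function name find_snps are invented for illustration.

```python
from collections import Counter


def find_snps(aligned_sequences):
    """Return {position: Counter of nucleotides} for positions that vary.

    Assumes every sequence is already aligned and of equal length.
    """
    if not aligned_sequences:
        return {}
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must have equal length")

    snps = {}
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_sequences)
        if len(counts) > 1:  # more than one nucleotide seen at this position
            snps[pos] = counts
    return snps


# Hypothetical copies of the same short genomic region (illustrative only).
copies = [
    "ATGCCGTA",
    "ATGCCGTA",
    "ATGTCGTA",
    "ATGCCGAA",
]

snps = find_snps(copies)
for pos, counts in snps.items():
    print(pos, dict(counts))
```

Positions are counted from zero. In this toy data two positions differ among the four copies, and each would be a candidate SNP; studies on real data also apply a minimum frequency threshold before calling a variant common.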
Because investigators hypothesize that differences between haplotypes may be associated with varying susceptibility to disease, mapping the haplotype structure of the human genome may be the key to identifying the genetic basis of many common disorders. The HapMap project serves as a resource for studying the genetic factors that contribute to variation in response to environmental factors, in susceptibility to infection, and in the identification of genetic variants associated with the effectiveness of, and adverse responses to, drugs and vaccines.
To create a haplotype map, researchers must have enough SNPs to be sure that regions containing disease alleles have been found and that regions not containing disease alleles can be excluded from further consideration. The HapMap enables researchers to study the genetic risk factors underlying a wide range of disorders. For any given disease, researchers may perform an association study by using the HapMap tag SNPs to compare the haplotype patterns of a group of people known to have the disease to a group of people without the disease. If the association study finds a specific haplotype more frequently in those with the disease, researchers scrutinize the precise genomic region in their search for the specific genetic variant.
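The comparison at the heart of an association study can be sketched as a simple statistical test. The Python example below applies a Pearson chi-square test to a two-by-two table of haplotype counts in people with and without a disease. The counts are invented, the function name haplotype_association is hypothetical, and the chi-square test is only one common choice; the source does not specify which statistic HapMap researchers use. Real studies must also correct for testing many genomic regions at once and for differences in ancestry between the groups.

```python
import math


def haplotype_association(cases_with, cases_without, controls_with, controls_without):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table.

    Arguments are counts of chromosomes that do or do not carry the
    haplotype of interest. Returns (chi_square, p_value).
    """
    a, b, c, d = cases_with, cases_without, controls_with, controls_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi_square = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical counts: the haplotype appears on 60 of 100 case chromosomes
# and on 40 of 100 control chromosomes (illustrative only).
chi_square, p_value = haplotype_association(60, 40, 40, 60)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that the haplotype is more frequent among people with the disease than chance alone would explain, which flags the surrounding genomic region for closer scrutiny.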
In the news release "Map of Human Genetic Variation Will Speed Search for Disease Genes" (February 7, 2005, http://www.nih.gov/news/pr/feb2005/nhgri-07.htm), the International HapMap Consortium announced plans to create an even more powerful map of human genetic variation than the group had initially intended. The project was originally intended to complete the map of haplotypes by September 2005, but by mid-2005 a draft of the HapMap, consisting of one million markers of genetic variation, was released. The first draft of the HapMap has enabled researchers to analyze the human genome in ways that were not possible with the human DNA sequence alone. Data from the second phase of the project were released in July 2006 and provide a denser map that enables scientists to narrow gene discovery more precisely to specific regions of the genome.
First Map of Common Human Genetic Variations
In February 2005 scientists working at Perlegen Sciences, Inc., in California produced the first map of common human genetic variations—differences in DNA that may assist in predicting disease risk and optimal disease treatment. To create the map, Perlegen investigators collaborated with researchers at the California Institute for Telecommunications and Information Technology at UC San Diego and the International Computer Science Institute at UC Berkeley. The map was unveiled at a meeting of the American Association for the Advancement of Science and described in the February 18, 2005, issue of Science.
Perlegen scientists looked at the DNA of seventy-one Americans of European, African, and Chinese ancestry and identified nearly 1.6 million SNPs (single-letter genetic differences), most of them shared across the three populations. Although the 1.6 million SNPs represent only about 16% of the 10 million SNPs believed to exist, they appear to be among the most common. The Perlegen map does not pinpoint which SNPs are linked to disease risk, but future research will focus on identifying the SNP variations that trigger some people to develop diseases and others to resist or combat them.
Future of Genomic Research
In "A Vision for the Future of Genomics Research" (Nature, April 24, 2003), the NHGRI describes some of the research challenges of the postgenomic era. Besides the HapMap project and the DOE's "Genomes to Life," the article details other initiatives, including genome technology development.
Directed by the NHGRI, the Encyclopedia of DNA Elements (ENCODE) Project (December 2006, http://www.genome.gov/10005107) aims to develop efficient ways to identify and locate all the protein-coding genes, nonprotein-coding genes, and other sequence-based, functional elements contained in the human DNA sequence. This ambitious undertaking will produce an enormous resource for researchers seeking to use and apply the human sequence to predict disease risk and to develop new approaches to prevent and treat disease.
The ENCODE Project entails three phases: a three-year pilot project phase; a second technology development phase that parallels phase 1; and a planned production phase. In the October 22, 2004, issue of Science, the ENCODE investigators described their plans to build a "parts list" of all sequence-based functional elements in the human DNA sequence. The researchers hope to identify as-yet-unrecognized functional elements. During the pilot phase they are developing and testing high-throughput ways to efficiently identify functional elements. They are focusing on forty-four DNA targets, which together cover about 1% of the human genome, or about 30 million base pairs. The target regions were strategically selected to provide a representative cross section of the entire human genome sequence. In the second phase other researchers will work to develop new technologies to apply to the ENCODE Project. The results of the first two phases will determine how to begin the production phase and advance the ENCODE Project to analyze the remaining 99% of the human genome.
Another NHGRI initiative is the creation of publicly available libraries of organic chemical compounds for scientists engaged in charting biological pathways. These chemical compounds have many promising applications in genomic research. For example, their ability to enter cells readily makes them natural vehicles for pharmaceutical drug development and drug delivery system design. An endeavor of this size and scope requires significant financial and human resources, and the NHGRI is planning to use technologies such as robotic-enabled, high-output screening to create large libraries containing up to a million chemical compounds.
In 2004 the NIH announced the establishment of a Chemical Genomics Center based in the NHGRI Division of Intramural Research (http://www.ncgc.nih.gov/news/2004_06_09.html). An associated database, PubChem (http://pubchem.ncbi.nlm.nih.gov/), holds information on the biological activities of innovative chemical "tools" for use in biological research and drug development, and the initiative includes a repository to acquire, maintain, and distribute a collection of up to one million chemical compounds. Like the HGP data, the data from the chemical genomics network are freely available to the entire scientific community.
In addition, in April 2003 the United Kingdom's Wellcome Trust, along with Canadian funding organizations and the global pharmaceutical company GlaxoSmithKline, established a charitable organization, the Structural Genomics Consortium, to round out international efforts in structural genomics. Structural genomics is the systematic, high-volume determination of the three-dimensional structures of proteins. The goal of examining the structural genomics of any organism is the complete structural description of all proteins encoded by the genome of that organism. These descriptions are important for drug design, diagnosis, and treatment of disease. Like the HGP and Chemical Genomics Center, the Structural Genomics Consortium is placing all the protein structures in public databases where scientists throughout the world may access them.