Nature, vol. 392, 3/97, Russell F. Doolittle, page 339
The complete genomes of a dozen bacteria, and of a yeast, provide a wealth of information for tracing evolutionary networks. Here is a brief guide to what has been, and can be, learned.
It is less than three years ago that the first complete sequence of a bacterial genome was reported. Now we have a dozen, and that of a eukaryote (the budding yeast, Saccharomyces cerevisiae), with a certainty of many more to follow.
One of the most exciting aspects of this cavalcade of genomes is the immediate public access to the data. The microbial genome database Web site at The Institute for Genomic Research (TIGR, sited in Maryland) currently lists almost 50 projects under way around the world. Half of these are being conducted by TIGR, either alone or in collaboration with others. Assorted universities around the world have staked out particularly interesting organisms; some of the projects are being undertaken as consortia, international or national (for example, a Brazilian consortium is sequencing the plant pathogen Xylella fastidiosa). The Web site posting for the Sanger Centre, Cambridge, UK, currently shows five microbial genomes in progress; and - reminiscent of aircraft departure monitors in large airports - the entry for Mycobacterium tuberculosis continuously flashes Completed.
But what is so sensational about having a complete sequence instead of only the sequences of the interesting genes, many of which were reported long ago? For one thing, what is absent is sometimes as interesting as what is present; until an entire genome is in hand, one can't be certain that some other functionally or otherwise related gene isn't lurking behind the scenes. Beyond that, there are some fundamental questions that these studies might be expected to answer, none of which has yet been answered definitively, mostly because identifying all of the genes on the basis of sequence alone has proved more difficult than expected. Additionally, there is increasing confusion about the relationship of the major divisions of life.
The tree of life
Until the late 1970s, living organisms were divided into two major groups: the prokaryotes (no nucleus) and the eukaryotes (defined nucleus). The advent of RNA sequencing led to a three-kingdom world, the prokaryotes being divided into two groups: the true bacteria (Eubacteria, also known as Bacteria) and the motley group composed of organisms from diverse and extreme habitats (the Archaebacteria, a.k.a. Archaea). Three of the first dozen bacterial genomes to be completed belong to the Archaea, allowing for a thorough comparison of the two groups.
Although a phylogenetic tree based on ribosomal RNA from the 13 organisms with fully sequenced genomes nicely bears out the three-kingdom world, not all of their gene products are so accommodating. Remarkably, in spite of many expectations to the contrary, the vast majority of gene products from the Archaea most resemble counterparts among the Eubacteria and not eukaryotes. Even so, the finding that a significant minority of archaeal proteins, especially those having to do with the transcription and translation of other genes, more closely resemble eukaryotic forms has reinforced notions that eukaryotes were the result of a chimaeric merging of a eubacterium and an archaebacterium. The most recent report of this kind proposes a symbiosis between a eubacterium that consumed organic matter and excreted H2 and CO2, and a methane-generating archaebacterium that used these by-products for food. Whatever the case, it is almost certain that the last common ancestor of Bacteria and Archaea was also the last common ancestor of all extant life, although the origin of the Eukarya is far from settled.
The Human Genome Project
The microbial genome projects, like those for a number of multicellular creatures, have roots that reach back to the Human Genome Project. Only a decade ago the biological community was debating the feasibility of sequencing the human genome, a bothersome question being how one would recognize genes on the basis of DNA sequence alone. Setting aside the non-coding sequences dubbed introns, what is it that would allow the identification of 50,000-100,000 genes in the human genome when, at the time, fewer than a thousand gene products for humans had been sequenced? In the end, sceptics were persuaded by the realization that all living things are the result of an enormous genetic expansion. New genes (and the proteins they encode) come from old genes by way of duplication and modification, and, for the most part, these modifications occur at a slow enough pace that related genes are easily recognized on the basis of sequence alone.
Briefly put, the dogma was that all creatures on Earth are descended from a common ancestor and share a fundamental set of genes, the translated sequences of many of which are easily recognized even between prokaryotes and eukaryotes. For example, the amino-acid sequences of the enzyme triose phosphate isomerase from humans and the bacterium E. coli were already known to be 45% identical. As a result, for many proteins one doesn't need to have the human sequence already in hand; it will be easily identified on the basis of sequences that have already been found in other creatures.
There was also concern that the human genome has many more genes than many of the organisms that molecular biologists had traditionally concentrated on, such as E. coli and yeast, or even the nematode worm Caenorhabditis elegans and the fruitfly Drosophila melanogaster. The fear was that vast numbers of genes would be found that had no counterparts in the data banks. These concerns gradually diminished as appreciation of gene duplications grew, and gene families became a part of the molecular biologist's lexicon.
A boost from ESTs
Another concern about identifying human genes on the basis of sequence alone had to do with the fragmentation of genes in `higher' eukaryotes, whereby coding information is interrupted by the non-coding intron sequences. Some critics felt that it was unwise to sequence genomic DNA, only about 5% of which is coding and actually expressed as proteins. Why not stick to complementary DNA, they argued, where identifications will be easier? (cDNA is the copy of messenger RNA without introns, and directly encodes the amino-acid sequences of proteins.)
The point was forcefully made in 1991 when a group from TIGR reported the initial results of a massive sequencing effort on a large random cDNA library from human brain cells. Instead of fully characterizing every cDNA clone, they simply used an array of automatic DNA sequencers to examine individual but randomly selected clones, subjecting each to a single sequencer run, usually averaging about 450 bases, the equivalent of 150 amino acids. Because cDNA clones are prepared from messenger RNA, and thereby represent genes that are being expressed, these sequences were called expressed sequence tags, or ESTs. The TIGR group searched the databases with these partial sequences, finding related sequences for tens of thousands of them, and thereby allowing good estimates of the expected gene count for humans.
Indeed, it was the power of having several dozen machines accurately churning out a combined equivalent of 500 kilobases a day that set the stage for TIGR to undertake the sequencing of a complete bacterial genome. This brute force approach was coupled with solid Poisson-based expectations of how many sequences of 400-500 base pairs in length would be needed for complete coverage of a bacterial genome by shotgun sequencing. Thus, the first complete bacterial genome to be reported, the 1.83-megabase sequence of H. influenzae, was determined by randomly sequencing fewer than 20,000 clones, a sixfold redundancy, on the average, for every base, and one that had the advantage of simultaneously providing an accuracy and reliability check (if one of the six gave a different base than the others, it could be checked).
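The Poisson reasoning behind shotgun coverage can be sketched numerically. What follows is a minimal illustration, assuming the textbook model in which reads land uniformly at random on the genome, so that with mean coverage c = N x L / G the expected fraction of bases never sampled is about e^(-c). The function names are hypothetical, and the 450-base read length is an illustrative round figure, not a value taken from the H. influenzae project.

```python
import math

def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced (c = N * L / G)."""
    return num_reads * read_length / genome_size

def fraction_uncovered(coverage: float) -> float:
    """Poisson probability that a given base is hit by zero reads."""
    return math.exp(-coverage)

def reads_needed(coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(coverage * genome_size / read_length)

GENOME_SIZE = 1_830_000   # 1.83 megabases, as quoted in the text
READ_LENGTH = 450         # assumed average read length (illustrative)

reads = reads_needed(6.0, READ_LENGTH, GENOME_SIZE)
missed = fraction_uncovered(6.0) * GENOME_SIZE
print(f"reads for sixfold coverage: {reads}")
print(f"expected bases left unsampled: {missed:.0f}")
```

Under these assumptions, sixfold coverage leaves only about a quarter of one per cent of bases unsampled, which is why a modest redundancy suffices for a small genome. The sketch counts sequencing reads rather than clones; a single clone can yield more than one read, so the two numbers are not directly comparable.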
Once a genome's sequence is determined, the analysis begins in earnest. Although there is interesting information at every turn, the main priority is to identify the genes. This process is easier in bacteria than in eukaryotes because their genes lack those troublesome introns. Simple computer programs translate the DNA sequence directly into putative protein sequences. Long runs of coding triplets without stop codons are called open reading frames or ORFs, and presumably correspond to the genes for proteins.
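To make the idea of an open reading frame concrete, here is a minimal sketch of an ORF scan. It assumes the simplest possible definition (an ATG followed in the same frame by a run of codons ending at the first stop codon) and checks all three frames on both strands. The function names and the minimum-length cutoff are illustrative assumptions; real gene-finding programs also handle alternative start codons and other signals that are ignored here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def orfs_in_strand(dna: str, min_codons: int) -> list:
    """Find ATG...stop runs in the three frames of one strand.

    Returns (start, end, frame) tuples, where end is the index just
    past the stop codon and the length test counts codons before it.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    found.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return found

def find_orfs(dna: str, min_codons: int = 100) -> dict:
    """Scan both strands; coordinates on '-' refer to the reverse complement."""
    dna = dna.upper()
    return {
        "+": orfs_in_strand(dna, min_codons),
        "-": orfs_in_strand(reverse_complement(dna), min_codons),
    }

# Toy sequence with one short ORF (ATG AAA TTT GGG TAA) in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

The length cutoff matters: short runs without stop codons arise by chance, which is one reason the shortest candidate ORFs are the least trustworthy.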
The ORF sequences are then searched against large databases of all known sequences from all kinds of organisms to see if they resemble any known protein. Surprisingly, a substantial number of the ORFs uncovered in these first complete genomes have not looked like any reported protein, which makes them URFs (unidentified reading frames), as well as ORFs (Fig. 2).
As a rule of thumb, any two protein sequences that are more than 25% identical are likely to be homologous - that is, descended from a common ancestor. If the resemblance is less, one must be more cautious; the region between 15% and 25% identity is often referred to as the twilight zone. But proteins whose sequences are even less similar can be shown to be related, either by pattern analysis or by comparing their three-dimensional structures.
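The rule of thumb can be written down as a small calculation. This is a sketch only, assuming the two sequences have already been aligned (gaps shown as '-') and that identity is computed over the aligned, ungapped positions; other denominators are in use, so the numbers produced by different tools need not agree. The function names and the wording of the categories are hypothetical, while the 25% and 15% thresholds are the ones quoted in the text.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over positions where neither has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue                  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared

def classify(identity: float) -> str:
    """Apply the article's rule-of-thumb thresholds."""
    if identity > 25.0:
        return "likely homologous"
    if identity >= 15.0:
        return "twilight zone"
    return "below twilight zone: needs pattern or structural evidence"

# Toy alignment: six ungapped columns, four of them identical.
print(percent_identity("MKT-AYI", "MKSQAYL"))
print(classify(45.0))   # the triose phosphate isomerase example quoted earlier
```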
The large fraction of unidentified genes is a major stumbling block to answering basic questions about evolution. Some of these URFs may encode genes common to all organisms, but their sequences alone may not be sufficient to make their identity known; or they may be the genes that impart uniqueness to the organism. And not all of these URFs, especially the shortest among them, necessarily get translated into functional proteins.
Is a particular ORF unidentifiable simply because it encodes a very fast-changing protein, the sequence of which has been blurred by numerous amino-acid substitutions? Or does it represent a gene whose function has not yet been found by biochemists and molecular biologists? If it is the latter, then it appears that almost half of cell biology and biochemistry has not yet been encountered, in spite of a century of exploration.
With so many complete genome sequences now on hand, the two possibilities ought to be distinguishable by direct genome comparison. Thus, if the same URF is recognizable in the genomes of distantly related bacteria, it is not likely to be a case of non-identification due to rapid change. On the other hand, if an ORF is unique to an organism, it could be that either it has changed so much that it can't be recognized, or it might really be a uniqueness gene. In this case, the matter can be sorted out if sequences are available from two or more closely related organisms where rates of change can be compared with other genes.
Little and large microbes
The prokaryotic genomes determined so far range in size from about 0.6 to 4.7 megabases (Table 1). Inherent in this distribution is the evolutionary potential to expand or shrink. Expansion is made possible by gene duplications, and shrinkage by deletions. It is of considerable interest that different clusters of duplicated genes are showing up in the various genomes. For example, in B. subtilis more than a quarter of the genes show evidence of recent duplication; a hundred of them are present in paralogous sets of five (a paralogous gene or protein is one that has homology to another as a result of gene duplication). In E. coli, there is a family of 80 paralogous transport proteins. Tallying up all the recent duplications in these genomes may result in some general rules about frequency of occurrence.
Similarly, the absence of many genes in smaller genomes can provide hints about the ease of deletion and removal of genes that are no longer needed. Among the bacteria that have had their genomes completely sequenced, for example, are two closely related mycoplasmas with the smallest genomes known; this is a reflection of their parasitic existence, because many of their needs are provided by their animal hosts. Mycoplasma genitalium has only 470 ORFs and M. pneumoniae 679. All 470 ORFs from the smaller bacterium are found in the larger relative. Their protein sequences are about 67% identical, on average, a reflection of their close relationship.
The two mycoplasmas also share numerous URFs that are not recognized in any of the other fully sequenced genomes. One of these is a `hypothetical protein' that is more than 90% identical in the two organisms; clearly, this isn't a case of a fast-changing protein. So, where did it come from? Similarly, there are URFs found only in the larger bacterium (M. pneumoniae) that have been observed, and remain unidentified, in other distantly related bacteria such as E. coli and B. subtilis. Again, it isn't a matter of rapid change that is keeping their identities secret.
As for what makes mycoplasma unique, 350 of the 470 ORFs found in M. genitalium have corresponding relatives in B. subtilis, leaving only about a hundred candidates for uniqueness genes. Interestingly, the next smallest of the fully sequenced genomes, that of B. burgdorferi, is also an animal parasite that lacks many of the same enzymes and metabolic pathways that are absent in the mycoplasmas, even though it is distantly related (Fig. 1).
For example, both of these organisms lack biosynthetic capabilities for amino acids, fatty acids, nucleotides and enzyme cofactors, and none of the enzymes of the tricarboxylic acid cycle is found in either of them, leading to the notion of convergent evolutionary gene loss. Viewed in retrospect, this isn't so surprising. If an organism is provided with an abundant supply of nutrients, and has the wherewithal to absorb them, then it seems predictable that the machinery responsible for their synthesis would decay. It is an old rule of evolutionary biology and natural selection: Use it or lose it.
Minimal gene content
As soon as two bacterial genomes became available, estimates were made of `minimal gene content'. Comparison of the genomes of H. influenzae and M. genitalium led to the claim that a set of 256 genes is "close to the minimal necessary and sufficient to sustain the existence of a modern-type cell". That both of these organisms are parasitic on hosts which provide much of their sustenance confounds the significance of such a conclusion, even though both bacteria can be cultured in the absence of their hosts. Nevertheless, this is a provocative suggestion that can be tested experimentally. It would be notable indeed if molecular biologists deleted all but the 256 genes from M. genitalium and found that the organism remained viable.
The last common ancestor
I worry that many readers will not appreciate the distinction between the alleged minimal gene content and the quite different matters of the gene content of the common ancestor, on the one hand, or the number of genes in the earliest cells, on the other. Haemophilus influenzae and M. genitalium are quite distantly related, the former being a Gram-negative bacterium and the latter Gram-positive (this is one of the fundamental divides in the bacterial world). At the most recent, they last shared a common ancestor two billion years ago.
Clearly, the list of 256 genes they share in no way reflects the common ancestor, as can be shown by considering all other gene products shared by other similarly related bacteria. For example, B. subtilis and E. coli - also Gram-positive and Gram-negative, respectively - share large numbers of genes not on the minimal list of 256, and the last common ancestor of H. influenzae and M. genitalium must also have had these genes. Although a comprehensive count of all such occurrences has yet to be made, the last common ancestor of Gram-positive and Gram-negative bacteria probably had a thousand or more genes.
Such an analysis also differs from looking at what genes are shared by all the known genomes. In the first nine bacterial genomes compared rigorously, only 34 genes are certain to be common to all of them. The difference between the numbers shared by any pair and those shared by the whole group reflects differential loss along different lineages. It is the total number shared by all members of the archaebacteria with all eubacteria that needs to be used for estimating the gene composition of the last common ancestor of all life, barring one other confounding circumstance: the awkward matter of horizontal gene transfers.
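The distinction between genes shared by a pair of genomes and genes shared by every genome can be illustrated with set operations. The sketch below uses three made-up genomes whose gene-family labels (g1 to g6) are purely hypothetical; it is meant only to show how differential loss along lineages shrinks the universally shared set well below any pairwise overlap, not to reproduce the counts quoted in the text.

```python
from itertools import combinations

# Hypothetical gene-family content of three genomes (toy data).
genomes = {
    "A": {"g1", "g2", "g3", "g4", "g5"},
    "B": {"g1", "g2", "g3", "g6"},          # has lost g4 and g5
    "C": {"g1", "g4", "g5", "g6"},          # has lost g2 and g3
}

def shared_by_all(genome_sets: dict) -> set:
    """Gene families present in every genome."""
    return set.intersection(*genome_sets.values())

def shared_by_pairs(genome_sets: dict) -> dict:
    """Gene families shared by each pair of genomes."""
    return {
        (a, b): genome_sets[a] & genome_sets[b]
        for a, b in combinations(sorted(genome_sets), 2)
    }

core = shared_by_all(genomes)
pairs = shared_by_pairs(genomes)
print("shared by all:", sorted(core))
for pair, shared in pairs.items():
    print(pair, sorted(shared))
```

In this toy case every pair shares two or three families, yet only one family survives in all three genomes. The others were presumably present in the common ancestor and were lost independently in different lineages.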
Horizontal gene transfers
The underlying rationale for phylogenetic analysis with macromolecular sequences depends on the common ancestor having a defined sequence of some significant length, and that the sequence changes gradually and differently along descendant limbs of a divergence as the result of single base substitutions and small insertions and deletions. As a result, the corresponding sequences of descendants differ according to their relatedness: the more different the sequences of two extant organisms, the more distant their evolutionary relationship. Still, occasionally phylogenies are generated that are completely out of keeping with other phylogenies, and often as a last resort `horizontal gene transfer' is invoked.
The evidence for such transfer between distantly related bacteria is strong, however. It falls into two realms. First, there are the sequence comparisons themselves. The sequence of the enzyme dihydrolipoamide dehydrogenase from the archaebacterium Halobacterium halobium is much more like those of Gram-positive eubacteria (50% identical) than it is like any sequence from the three fully sequenced Archaea (25% identical); the simplest explanation is a gene transfer.
Another example is the adenyl sulphate reductase from the archaeon, A. fulgidus, the sequence of which is very similar to an orthologue found in sulphite-reducing proteobacteria, but absent in all other Archaea. (Orthologous proteins in different organisms are related by common ancestry and have the same function. The differences between them result from speciation.) Moreover, a set of three sulphite reductase genes makes up an operon, or gene cluster, in A. fulgidus that exactly parallels those found in the sulphate-reducing proteobacteria. All of these observations imply that horizontal transfer has taken place.
There are also ancillary considerations that support the existence of horizontal transfers. The guanine + cytosine content of a gene may differ considerably from its surroundings, or the tracks of insertion sequences may be present, or there may be evidence of prophages - forms of bacterial virus that have been integrated into the host bacterium's DNA. In H. pylori, for example, the pathogenicity island, which is one of the mainstays of the organism's virulence, is delineated by 31-base-pair direct repeats, and in B. subtilis there are ten prophages or their remnants, evidence of past horizontal transfers. Also, the scattered distribution of intervening amino-acid sequences that are spliced out of immature proteins (inteins) in Archaea is testimony to horizontal transfers; M. jannaschii has 18 inteins, M. thermoautotrophicum only one, and A. fulgidus none at all.
The order in which the genes occur on a bacterial chromosome is not entirely random. Indeed, genes that are involved in a particular function often occur next to one another and are coordinately regulated as operons - clusters of genes associated with the same physiological function and transcribed on a single messenger RNA. Operons are common in the genomic sequences of both Bacteria and Archaea, but they may occur in widely different locations in different organisms. The implication is that, although the bacterial genome is experiencing constant reshuffling, a consequence of the persistent breaking and rejoining of DNA, there is a natural advantage to keeping some genes near each other.
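The guanine + cytosine clue lends itself to a simple calculation. The following sketch assumes a plain sliding-window scan that flags windows whose G+C fraction departs from the genome-wide value by more than a chosen threshold. The function names, window size and threshold are arbitrary illustrative assumptions, and an atypical window is at best a hint of foreign origin, not proof of a transfer.

```python
def gc_fraction(dna: str) -> float:
    """Fraction of bases that are G or C."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty sequence")
    return (dna.count("G") + dna.count("C")) / len(dna)

def gc_windows(dna: str, window: int, step: int) -> list:
    """(start, G+C fraction) for each full window along the sequence."""
    return [
        (start, gc_fraction(dna[start:start + window]))
        for start in range(0, len(dna) - window + 1, step)
    ]

def atypical_windows(dna: str, window: int, step: int, threshold: float) -> list:
    """Windows whose G+C fraction differs from the overall value by more than threshold."""
    overall = gc_fraction(dna)
    return [
        (start, gc)
        for start, gc in gc_windows(dna, window, step)
        if abs(gc - overall) > threshold
    ]

# Toy sequence: an AT-rich background with a GC-rich insert in the middle.
toy = "AT" * 10 + "GC" * 5 + "AT" * 10
print(atypical_windows(toy, window=10, step=10, threshold=0.5))
```

On a real genome the window would span hundreds or thousands of bases, and the threshold would need to be judged against the natural variation among that organism's own genes.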
There are, however, differing interpretations of what that advantage might be. The classical view has been that it is primarily a matter of regulating gene expression. A couple of years ago, that view was challenged with the publication of the hypothesis of the `selfish operon'. This view is based on the notion that horizontally transferred genes will fare better if they are accompanied by other genes with whose products they interact. The contention is that coordinate gene regulation need not require genes to be in proximity to one another.
On the other hand, an analysis of those operons that have been most preserved in the fully sequenced genomes has shown that they are predominantly of the kind regulated by mechanisms at the RNA level. It has been suggested that these operons reflect a kind of gene regulation that may pre-date the invention of DNA and that dates back to an ancestor with an RNA genome.
In almost all of the fully sequenced genomes there are cases of `missing' ORFs - that is, biochemical activities have been observed, but the genes responsible have not been found. By far the most interesting case has to do with a missing lysyl transfer RNA synthetase in archaebacteria. This is the enzyme responsible for incorporating the amino acid lysine into proteins. Thus, when the complete genome of the archaebacterium M. jannaschii was determined, it was found that four of the 20 amino-acyl-tRNA synthetases could not be identified. There was good reason to think that the enzymes for glutamine and asparagine would not be present, as previous work had shown that in archaebacteria and many eubacteria these amino acids are incorporated as transamidated derivatives of glutamate and aspartate, respectively. That the enzyme for cysteine was not found was puzzling, but the chemical relatedness of cysteine and serine allowed the possibility that a trans-sulphuration reaction might occur on a charged serine-tRNA. That the lysine enzyme could not be identified was more baffling, however, because lysine cannot be readily generated from any other amino acid.
Late last year, the tRNA synthetase for lysine was identified in the M. jannaschii genomic sequence. The first surprise was that it looks not at all like other lysine-tRNA synthetases, and belongs, so far as can be told, to a different class. The initial reaction was that somehow this enzyme had changed very rapidly in evolution, to the extent that it was no longer recognizable. But, even before a careful comparison of evolutionary rates could be made, a second surprise occurred: computer searching of the newly determined B. burgdorferi genome revealed the same kind of lysine-tRNA synthetase. The lysine-tRNA synthetase sequences from the distantly related Methanococcus and Borrelia (Fig. 1) are about 30% identical, which is just about the same as for their other tRNA synthetases.
Rapid rate of change is not the explanation. Instead, what has most likely occurred is a gene displacement, several examples of which have been observed in the past. For example, bacterial ornithine decarboxylases are much more like bacterial lysine decarboxylases than eukaryote ornithine decarboxylases, and bacterial tyrosine transaminases are more like bacterial aspartate transaminases than eukaryote tyrosine transaminases. In each case it appears that gene duplication has led to a paralogue displacing an orthologue.
Sometimes the displacement is not an obvious paralogue, the new gene product having virtually no resemblance at all to the displaced agent. To cover both kinds of change, Koonin et al. introduced the somewhat redundant term `non-orthologous displacement' (if I quibble with this terminology, it's because I can't envisage an orthologous displacement). Nomenclature aside, these authors have systematically identified a number of such displacements, including the possible displacement of a missing nucleoside phosphokinase by another kind of kinase. More recently, several proteins have been identified in the B. subtilis genome that have the same function as completely different proteins in E. coli.
Of all the missing genes, the ones I find most perplexing are those of the eukaryotic cytoskeleton - the framework of protein filaments that give a cell its shape and ability to move. Where are the precursors of tubulin and actin and other cytoskeletal proteins? So although much has been made recently about the resemblance of the prokaryotic cell-division protein ftsZ to tubulin from eukaryotes, phylogenetic trees based on these proteins cluster the Archaea and Bacteria firmly together and far from the tubulin-based position of eukaryotes. It is even more disappointing that another prokaryotic cell-division protein thought to be related to actin, ftsA, isn't in any of the archaeal genomes (Fig. 1). The absence of sequences closely related to the cytoskeleton remains unsettling, and the origin of the cytoskeleton cannot easily be accounted for by simple chimaeric mergers of a eubacterium and an archaebacterium.
Rather, as the evolutionary trees based on ribosomal RNA have always implied, there must have been a third party involved in the origin of complex eukaryotes. Isn't it possible, for example, that the suggested merger of a eubacterium that excreted H2 and CO2 with an archaebacterium that consumed those by-products actually took place within a third cell that already had a cytoskeleton? Were the forebears of that ancient third lineage, unlike the relatives of the swallowed Bacteria and Archaea, completely overrun by the trimeric entity?
There may be important clues in the genomes of Eukarya more primitive than yeast. The TIGR Web site reveals that eight different groups are at work on chromosomes from the protist Plasmodium falciparum. Quite apart from the medical implications of having the sequence of the organism that causes malaria, perhaps some questions about the origin of the eukaryotic cell will be answered as well.