3.3 Total gene number is known for several organisms

Key terms defined in this section
Orthologs are corresponding proteins in two species as defined by sequence homologies.
Proteome is the total number of proteins produced by an organism.



Figure 3.5 Genome sizes and gene numbers are known from complete sequences for several organisms (Arabidopsis, Drosophila, and man are estimated from partial data). Lethal loci are estimated from genetic data.

Large-scale efforts have now led to the sequencing of several genomes. A range is summarized in Figure 3.5.


The sequences of the genomes of bacteria and archaea show that virtually all of the DNA (typically 85–88%) codes for RNA or protein. The bacterium with the smallest known genome, M. genitalium, has ~470 genes of ~1040 bp each. This identifies the minimum number of functions required to construct a cell. All classes of genes are reduced in number compared with bacteria with larger genomes, but the most significant reduction is in loci coding for enzymes concerned with metabolic functions and with regulation of gene expression. This makes M. genitalium more dependent on the provision of small molecules by its host.


The bacterium Rickettsia prowazekii (an intracellular parasite) has a complexity about twice that of the Mycoplasma (Andersson et al., 1998). A "typical" gram-negative bacterium, H. influenzae, has 1,743 genes each of ~900 bp. About 60% of the genes can be identified on the basis of homology with known genes in other species, and these genes fall approximately equally into classes whose products are concerned with metabolism, cell structure or transport of components, and gene expression and its regulation. E. coli has a larger genome, with 4,288 genes, average size ~950 bp, and an average separation between genes of 118 bp (Blattner et al., 1997).


The archaea have properties that are intermediate between the prokaryotes and eukaryotes. M. jannaschii is a methane-producing species that lives under high pressure and temperature. Its total gene number is similar to that of H. influenzae, but fewer of them can be identified on the basis of comparison with genes known in other organisms. Its apparatus for gene expression resembles eukaryotes more than prokaryotes, but its apparatus for cell division better resembles prokaryotes.


The most extensive data for a lower eukaryote are available from the sequence of the genome of S. cerevisiae. The density of genes is high: the average open reading frame is ~1.4 kb, and the average separation between genes is ~600 bp, so that ~70% of the genome is occupied by the total of ~6000 genes. About half of the genes identified by sequence were either known previously or are related to known genes. The remainder are new, which gives some indication of the number of new types of genes that may be discovered (Oliver et al., 1992; Dujon et al., 1994; Johnston et al., 1994).
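The density figures quoted so far can be checked with a little arithmetic. The sketch below assumes a deliberately simple model that is not stated in the text: every gene is followed by one intergenic spacer of average length. The function names are illustrative only.

```python
def coding_fraction(mean_gene_bp: float, mean_spacer_bp: float) -> float:
    """Fraction of the genome occupied by genes, assuming each gene is
    followed by exactly one spacer of average length (a simplification)."""
    return mean_gene_bp / (mean_gene_bp + mean_spacer_bp)


def implied_genome_bp(n_genes: int, mean_gene_bp: float, mean_spacer_bp: float) -> float:
    """Genome size implied by gene count, average gene size, and spacing."""
    return n_genes * (mean_gene_bp + mean_spacer_bp)


# Approximate figures quoted in the text.
ecoli = dict(n_genes=4288, mean_gene_bp=950, mean_spacer_bp=118)
yeast = dict(n_genes=6000, mean_gene_bp=1400, mean_spacer_bp=600)

ecoli_fraction = coding_fraction(ecoli["mean_gene_bp"], ecoli["mean_spacer_bp"])
yeast_fraction = coding_fraction(yeast["mean_gene_bp"], yeast["mean_spacer_bp"])
yeast_genome_mb = implied_genome_bp(**yeast) / 1e6

print(f"E. coli: {ecoli_fraction:.0%} of the genome lies within genes")
print(f"Yeast:   {yeast_fraction:.0%} of the genome lies within genes")
print(f"Yeast:   implied genome size {yeast_genome_mb:.1f} Mb")
```

Under this model the yeast figures reproduce the ~70% occupancy given above, and the E. coli figures give a value close to the upper end of the range quoted for bacteria.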


The identification of large genes on the basis of sequence is quite accurate. However, there are also ~600 potential small genes (with ORFs coding for <100 amino acids) that cannot be identified solely by sequence because of the high occurrence of false positives. Analysis of gene expression suggests that ~300 of these ORFs are likely to be genuine genes.


The genome of C. elegans varies between regions rich in genes and regions in which genes are more sparsely organized. The total sequence contains ~18,500 genes. Only ~42% of the genes have putative counterparts outside the Nematoda (Wilson et al., 1994; C. elegans sequencing consortium, 1998).


Although the fly genome is larger than the worm genome, there are fewer genes (13,600) in D. melanogaster (Adams et al., 2000). The number of different transcripts is slightly larger (14,100) as the result of alternative splicing. We do not understand why the fly, a much more complex organism, has only ~70% of the number of genes in the worm. This emphasizes forcefully the lack of an exact relationship between gene number and complexity of the organism.
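As a quick check on the comparison above, the ratios can be computed directly from the gene and transcript counts quoted in this section. This is only arithmetic on the published approximations; the variable names are illustrative.

```python
# Approximate counts quoted in the text.
worm_genes = 18_500
fly_genes = 13_600
fly_transcripts = 14_100

# Fly gene number as a fraction of worm gene number.
gene_ratio = fly_genes / worm_genes

# Average number of distinct transcripts per fly gene.
transcripts_per_gene = fly_transcripts / fly_genes

print(f"fly/worm gene ratio: {gene_ratio:.1%}")
print(f"transcripts per fly gene: {transcripts_per_gene:.3f}")
```

The ratio comes out slightly above 70%, and the transcript count exceeds the gene count by only a few percent, which fits the description of the difference as slight.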




Figure 3.6 ~20% of Drosophila genes code for proteins concerned with maintaining or expressing genes, ~20% for enzymes, <10% for proteins concerned with the cell cycle or signal transduction. Half of the genes of Drosophila code for products of unknown function.

From the fly genome, we can form an impression of how many genes are devoted to each type of function. Figure 3.6 breaks down the functions into different categories. Among the genes that are identified, we find ~2500 enzymes, ~750 transcription factors, ~700 transporters and ion channels, and ~700 proteins involved with signal transduction. But just over half of the genes code for products of unknown function. About 20% of the proteins reside in membranes.
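To relate these counts to the percentages in Figure 3.6, each category can be expressed as a share of the fly gene total. The sketch below assumes the total of 13,600 genes given earlier in this section is the appropriate denominator; the text does not say so explicitly.

```python
# Total fly gene number (assumed denominator, taken from earlier in the section).
fly_genes = 13_600

# Approximate category counts quoted in the text.
identified = {
    "enzymes": 2500,
    "transcription factors": 750,
    "transporters and ion channels": 700,
    "signal transduction": 700,
}

# Share of all fly genes falling in each category.
shares = {name: count / fly_genes for name, count in identified.items()}

for name, share in shares.items():
    print(f"{name:32s} {share:.1%}")
```

On these numbers the enzymes come out a little under 20% of the genes, and each of the other listed categories is well under 10%.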


Protein size increases from prokaryotes to eukaryotes. The archaeon M. jannaschii and the bacterium E. coli have average protein lengths of 287 and 317 amino acids, respectively, whereas S. cerevisiae and C. elegans have average lengths of 484 and 442 amino acids, respectively. Large proteins (>500 amino acids) are rare in bacteria, but comprise a significant part (~1/3) in eukaryotes. The increase in length is due to the addition of extra domains, a typical domain constituting 100–300 amino acids. But the increase in protein size is responsible for only a very small part of the increase in genome size.


Another insight into gene number is obtained by counting the number of expressed genes. If we rely upon the estimates of the number of different mRNA species that can be counted in a cell, we would conclude that the average vertebrate cell expresses ~10,000–20,000 genes. The existence of significant overlaps between the messenger populations in different cell types would suggest that the total expressed gene number for the organism should be within a few fold of this, in the range (say) of 50,000–100,000. The plant Arabidopsis thaliana has a genome size intermediate between the worm and the fly, but has a larger gene number than either. This shows the lack of a clear relationship and also emphasizes the special quality of plants, which may have more genes (due to ancestral duplications) than animal cells.


Eukaryotic genes are transcribed individually, each gene producing a monocistronic messenger. There is only one general exception to this rule; in the genome of C. elegans, ~25% of the genes are organized into polycistronic units (which is associated with the use of trans-splicing to allow expression of the downstream genes in these units; see 22 Nuclear splicing and RNA processing).




Figure 3.7 Because many genes are duplicated, the number of different gene families is much less than the total number of genes.

Because some genes are present in more than one copy or are related to one another, the number of different types of genes is less than the total number of genes. We can divide the total number of genes into sets that have related members, as defined by comparing their exons. (A family of related genes arises by duplication of an ancestral gene followed by accumulation of changes in sequence between the copies). Figure 3.7 compares the total number of genes with the number of distinct families in each of four genomes (Rubin et al., 2000). In bacteria, most genes are unique, so the number of distinct families is close to the total gene number, but as we reach the higher eukaryotes, the number of distinct families is of the order of 50% of the total gene number.


If every gene is expressed, the total number of genes will correspond to the total number of proteins required to make the organism. This is sometimes called the proteome. However, because genes are duplicated, some of them code for the same protein (although it may be expressed in a different time or place) and others may code for related proteins that again play the same role in different times or places. What is the core proteome, that is, the basic number of the different types of proteins in the organism? A minimum estimate is given by the number of gene families, ranging from 1400 in the bacterium to >4000 in the yeast, and 9500 and 8000 in the worm and fly, respectively.
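The family counts above can be set against the total gene numbers quoted earlier in this section to see how far duplication reduces the number of distinct gene types. This is a rough sketch: the yeast family count is a lower bound (>4000), and the bacterium is left out because the text does not say which bacterial genome the figure of 1400 refers to.

```python
# (total genes, gene families), approximate values from this section.
genomes = {
    "yeast": (6000, 4000),   # family count is a lower bound (>4000)
    "worm": (18_500, 9500),
    "fly": (13_600, 8000),
}

# Families as a fraction of total gene number for each genome.
family_share = {
    name: families / genes for name, (genes, families) in genomes.items()
}

for name, share in family_share.items():
    print(f"{name:6s} families / genes = {share:.0%}")
```

For the worm and the fly the result is in the region of 50–60%, in line with the statement that the number of distinct families in higher eukaryotes is of the order of half the total gene number.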




Figure 3.8 The fly genome can be divided into genes that are (probably) present in all eukaryotes, additional genes that are (probably) present in all multicellular eukaryotes, and genes that are more specific to subgroups of species that include flies.

How many genes are common to all organisms (or to groups such as bacteria or higher eukaryotes) and how many are specific for the individual type of organism? Figure 3.8 summarizes the comparison between yeast, worm, and fly (Rubin et al., 2000). Genes that code for corresponding proteins in different organisms are called orthologs. Operationally, we usually reckon that two genes in different organisms can be considered to provide corresponding functions if their sequences are similar over >80% of the length. By this criterion, ~20% of the fly genes have orthologs in both yeast and the worm. These genes are probably required by all eukaryotes. The proportion increases to 30% when fly and worm are compared, probably representing the addition of gene functions that are common to multicellular eukaryotes. This still leaves the major proportion of genes as coding for proteins that are required specifically by either flies or worms, respectively.
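The operational criterion can be written as a simple coverage test. The sketch below is only an illustration: the function name is hypothetical, the protein lengths in the usage lines are invented, and measuring coverage against the longer of the two proteins is an assumption, since the text does not say which length is meant.

```python
def similar_over_length(aligned_residues: int, len_a: int, len_b: int,
                        min_fraction: float = 0.80) -> bool:
    """Return True if the aligned region covers more than `min_fraction`
    of the longer protein (assumed reading of '>80% of the length')."""
    if len_a <= 0 or len_b <= 0:
        raise ValueError("protein lengths must be positive")
    coverage = aligned_residues / max(len_a, len_b)
    return coverage > min_fraction


# Hypothetical examples: alignment lengths and protein lengths are made up.
full_length_match = similar_over_length(450, len_a=484, len_b=500)
single_domain_match = similar_over_length(200, len_a=484, len_b=500)

print(full_length_match, single_domain_match)
```

A test of this kind excludes pairs that share only one domain, which is why a match restricted to a small part of the protein would not count as an ortholog under the criterion.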


Once we know the total number of proteins, we can ask how they interact. By definition, proteins in structural multiprotein assemblies must form stable interactions with one another. Proteins in signalling pathways interact with one another transiently. In both cases, such interactions can be detected in test systems where essentially a readout system magnifies the effect of the interaction. One popular such system is the two-hybrid assay discussed in 20.12 Independent domains bind DNA and activate transcription. Such assays cannot detect all interactions: for example, if one enzyme in a metabolic pathway releases a soluble metabolite that then interacts with the next enzyme, the two proteins may not interact directly.


As a practical matter, assays of pairwise interactions can give us an indication of the minimum number of independent structures or pathways. An analysis of the ability of all 6000 (predicted) yeast proteins to interact in pairwise combinations shows that ~1000 proteins can bind to at least one other protein (Uetz et al., 2000). (The results of this analysis can be examined directly at the YeastPathCalling home page.) This is the beginning of an analysis that will lead to definition of the number of functional assemblies or pathways.
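One way to turn a list of pairwise interactions into a count of independent groups is to treat the proteins as nodes of a graph and find its connected components. The sketch below uses a union-find structure for this; it is an illustration of the idea and not the method of the cited analysis, and the protein names are hypothetical.

```python
from collections import defaultdict


def interaction_groups(pairs):
    """Group proteins into connected components of the interaction graph."""
    parent = {}

    def find(x):
        # Locate the representative of x's group, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every interacting pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each group under its representative.
    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    return list(groups.values())


# Hypothetical pairwise interactions.
pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
groups = interaction_groups(pairs)
print(len(groups), "independent groups")
```

Each component may still contain more than one assembly or pathway linked by a shared protein, so the number of components is best read as a lower bound, consistent with the "minimum number" in the text.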


In addition to functional genes, there are also copies of genes that have become nonfunctional (identified as such by interruptions in their protein-coding sequences). These are called pseudogenes (see 4 Clusters and Repeats). The number of pseudogenes can be large. The sequence of human chromosome 22 shows ~679 genes and 134 pseudogenes. If this ratio is maintained throughout the genome, ~20% of the total number of gene sequences could be nonfunctional. Extrapolating from the total gene number of chromosome 22 to the whole genome would suggest a smaller figure for the total human gene number than previously seemed likely, perhaps <100,000 (Dunham et al., 1999).
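The pseudogene proportion can be computed in two ways from the chromosome 22 counts, depending on whether the 679 sequences are taken to include the pseudogenes. The sketch below shows both readings; which one is intended is not stated in the text, so both are labeled as assumptions.

```python
# Counts quoted in the text for human chromosome 22.
gene_count = 679
pseudogene_count = 134

# Reading 1 (assumption): 679 is the total of all gene sequences,
# pseudogenes included.
share_if_included = pseudogene_count / gene_count

# Reading 2 (assumption): 679 counts functional genes only, so the total
# number of gene sequences is 679 + 134.
share_if_separate = pseudogene_count / (gene_count + pseudogene_count)

print(f"pseudogenes / 679:         {share_if_included:.1%}")
print(f"pseudogenes / (679 + 134): {share_if_separate:.1%}")
```

The first reading gives a value very close to the ~20% quoted above; the second gives a somewhat lower share, nearer one sixth.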


Besides needing to know the density of genes to estimate the total gene number, we must also ask: is it important in itself? Are there structural constraints that make it necessary for genes to have a certain spacing, and does this contribute to the large size of eukaryotic genomes?


This section updated 4-12-2000



Research
Adams, M. D. et al. (2000). The genome sequence of D. melanogaster. Science 287, 2185-2195.
Andersson, S. G. E. et al. (1998). The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396, 133-140.
Blattner, F. et al. (1997). The complete genome sequence of E. coli K12. Science 277, 1453-1462.
C. elegans sequencing consortium. (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012-2022.
Dujon, B. et al. (1994). Complete DNA sequence of yeast chromosome XI. Nature 369, 371-378.
Dunham, I. et al. (1999). The DNA sequence of human chromosome 22. Nature 402, 489-496.
Johnston, M. et al. (1994). Complete nucleotide sequence of S. cerevisiae chromosome VIII. Science 265, 2077-2082.
Oliver, S. G. et al. (1992). The complete DNA sequence of yeast chromosome III. Nature 357, 38-46.
Rubin, G. M. et al. (2000). Comparative genomics of the eukaryotes. Science 287, 2204-2215.
Uetz, P. et al. (2000). A comprehensive analysis of protein-protein interactions in S. cerevisiae. Nature 403, 623-630.
Wilson, R. et al. (1994). 22 Mb of contiguous nucleotide sequence from chromosome III of C. elegans. Nature 368, 32-38.



Genes VII
ISBN: B000R0CSVM
Year: 2005
Pages: 382
