Cogprints

Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences

Gorban, A.N. and Popova, T.G. and Zinovyev, A.Yu. (2004) Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. [Preprint]

Abstract

Coding information is the main source of heterogeneity (non-randomness) in the sequences of bacterial genomes. This information can be naturally modeled by analysing cluster structures in the "in-phase" triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in GenBank in August 2004, and explained its properties. The 7-cluster structure is responsible for the main part of sequence heterogeneity in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrate that four basic "pure" types of this model are observed in nature: "parallel triangles", "perpendicular triangles", the degenerate case, and the flower-like type. We also show that the codon usage of bacterial genomes is described with high accuracy by a multi-linear function of their genomic G+C content (more precisely, by two similar functions, one for eubacterial genomes and the other for archaea). Animated 3D scatter plots of the cluster structure for all 143 genomes are collected in a database available on our web-site (http://www.ihes.fr/~zinovyev/7clusters). These findings can be readily incorporated into software for gene prediction, sequence alignment, or classification of bacterial genomes.
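
As an illustration of the kind of data the abstract refers to, the following minimal sketch computes a 64-dimensional triplet frequency vector for each fragment of a sliding window, reading non-overlapping triplets from the start of the fragment. It is not the authors' implementation: the function names, the triplet ordering, the default step, and the handling of ambiguous symbols are all assumptions made for this example, and the clustering step itself is not shown. The default width of 300 is simply one value inside the 200-400 bp range quoted above.

```python
from itertools import product

# All 64 triplets in a fixed order; this ordering is an arbitrary choice.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment):
    """Frequencies of non-overlapping triplets read from the fragment start."""
    counts = [0] * 64
    total = 0
    for i in range(0, len(fragment) - 2, 3):
        idx = INDEX.get(fragment[i:i + 3])
        if idx is not None:  # skip triplets with ambiguous symbols such as N
            counts[idx] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * 64


def sliding_fragments(sequence, width=300, step=12):
    """Yield (start, frequency vector) for each window of the given width.

    width and step are hypothetical defaults chosen for illustration.
    """
    sequence = sequence.upper()
    for start in range(0, len(sequence) - width + 1, step):
        yield start, triplet_frequencies(sequence[start:start + width])


# Toy usage: a perfectly periodic sequence gives a different vector
# for each of the three possible reading phases of the window start.
toy = "ATG" * 200
vectors = dict(sliding_fragments(toy, width=300, step=1))
```

On the toy sequence, windows starting at positions 0, 1 and 2 see only the triplets ATG, TGA and GAT respectively, which shows how the phase of the window start relative to a periodic signal changes the frequency vector. Each vector can then be treated as a point in 64-dimensional space for any clustering or visualization method.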

Item Type: Preprint
Keywords: codon usage, cluster structure, mean field, frequency dictionary
Subjects: Biology > Theoretical Biology
ID Code: 3915
Deposited By: Gorban, Prof Alexander N.
Deposited On: 06 Nov 2004
Last Modified: 11 Mar 2011 08:55

