--- abstract: | Coding information is the main source of heterogeneity (non-randomness) in the sequences of bacterial genomes. This information can be naturally modeled by analysing cluster structures in the ``in-phase'' triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in GenBank in August 2004, and explained its properties. The 7-cluster structure is responsible for the main part of sequence heterogeneity in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrated that four basic ``pure'' types of this model are observed in nature: ``parallel triangles'', ``perpendicular triangles'', the degenerate case and the flower-like type. We show that the codon usage of bacterial genomes is, with high accuracy, a multi-linear function of their genomic G+C content (more precisely, it is described by two similar functions, one for eubacterial genomes and the other for archaea). Animated 3D scatter plots of the cluster structure for all 143 genomes are collected in a database available on our web site ( http://www.ihes.fr/~zinovyev/7clusters ). The findings can be readily introduced into any software for gene prediction, sequence alignment or classification of bacterial genomes. altloc: - http://mathcircle.org/gorban/ - http://www.ihes.fr/~zinovyev/ chapter: ~ commentary: ~ commref: ~ confdates: ~ conference: ~ confloc: ~ contact_email: ~ creators_id: [] creators_name: - family: Gorban given: A.N. honourific: '' lineage: '' - family: Popova given: T.G. honourific: '' lineage: '' - family: Zinovyev given: A.Yu. 
honourific: '' lineage: '' date: 2004-10 date_type: published datestamp: 2004-11-06 department: ~ dir: disk0/00/00/39/15 edit_lock_since: ~ edit_lock_until: ~ edit_lock_user: ~ editors_id: [] editors_name: [] eprint_status: archive eprintid: 3915 fileinfo: /style/images/fileicons/application_pdf.png;/3915/1/7clustersCog.pdf full_text_status: public importid: ~ institution: ~ isbn: ~ ispublished: ~ issn: ~ item_issues_comment: [] item_issues_count: 0 item_issues_description: [] item_issues_id: [] item_issues_reported_by: [] item_issues_resolved_by: [] item_issues_status: [] item_issues_timestamp: [] item_issues_type: [] keywords: 'codon usage, cluster structure, mean field, frequency dictionary' lastmod: 2011-03-11 08:55:43 latitude: ~ longitude: ~ metadata_visibility: show note: ~ number: ~ pagerange: ~ pubdom: FALSE publication: ~ publisher: ~ refereed: FALSE referencetext: | Audic S, Claverie JM. Self-identification of protein-coding regions in microbial genomes. (1998) {\it Proc Natl Acad Sci USA}. {\bf 95(17)}:10026-31. Baldi P. On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. (2000) {\it Bioinformatics}. {\bf 16}(4):367-71. Bernaola-Galvan, P., Grosse, I., Carpena, P., Oliver, J.L., Roman-Roldan, R., Stanley, H.E. (2000). Finding borders between coding and noncoding DNA regions by an entropic segmentation method. \textit{Physical Review Letters}\textbf{85}(6): 1342-1345. BioJava open-source project. http://www.biojava.org Borodovsky, M., McIninch, J. (1993). GENMARK: parallel gene recognition for both DNA strands. {\it Comp.Chem} {\bf 17}, 123-133. Carbone A., Zinovyev A., Kepes F. Codon Adaptation Index as a measure of dominating codon bias. (2003) {\it Bioinformatics}. {\bf 19}, 13, p.2005-2015. Cluster structures in genomic word frequency distributions. Web-site with supplementary materials. {\it http://www.ihes.fr/$\sim$zinovyev/7clusters/index.htm } Gorban AN, Mirkes EM, Popova TG, Sadovsky MG. 
A new approach to the investigations of statistical properties of genetic texts. (1993) {\it Biofizika} {\bf 38} (5): 762-767. Gorban AN, Bugaenko NN, Sadovskii MG. Maximum entropy method in analysis of genetic text and measurement of its information content. (1998) {\it Open Systems and Information Dynamics}. {\bf 5}, pp.265-278. Gorban AN, Popova TG, Sadovsky MG. Classification of symbol sequences over their frequency dictionaries: towards the connection between structure and natural taxonomy. (2000) {\it Open Systems and Information Dynamics}, {\bf 7}:1-17. Gorban A.N., Zinovyev A.Y., Wunsch D.C. Application of The Method of Elastic Maps In Analysis of Genetic Texts. (2003) In {\it Proceedings of International Joint Conference on Neural Networks (IJCNN)}, Portland, Oregon, July 20-24. Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. (2003) {\it In Silico Biology}. {\bf V.3}, 0039. (e-print: http://arxiv.org/abs/cond-mat/0305681 and http://cogprints.ecs.soton.ac.uk/archive/00003077/ ) Gorban A.N., Zinovyev A.Yu., Popova T.G. Statistical approaches to the automated gene identification without teacher // Institut des Hautes Etudes Scientifiques. - IHES Preprint, France. 2001. - M/01/34. Available at {\it http://www.ihes.fr} web-site. (See also e-print: http://arxiv.org/abs/physics/0108016 ) Gorban A.N., Zinovyev A.Yu. Visualization of data by method of elastic maps and its applications in genomics, economics and sociology // Institut des Hautes Etudes Scientifiques. - IHES Preprint, France. 2001. - M/01/36. Available at {\it http://www.ihes.fr} web-site. Karlin S. (1998) Global dinucleotide signatures and analysis of genomic heterogeneity. {\it Current opinion in microbiology} {\bf 1}(5): 598-610. Lobry JR, Chessel D. (2003) Internal correspondence analysis of codon and amino-acid usage in thermophilic bacteria. {\it J.Appl.Genet.} 44(2):235-61. Mathe C., Sagot M.F., Schiex T., Rouze P. 
Current methods of gene prediction, their strengths and weaknesses (2002) {\it Nucleic Acids Res}. {\bf 30}(19):4103-4117. Nicolas P, Bize L, Muri F, Hoebeke M, Rodolphe F, Ehrlich SD, Prum B, Bessieres P. Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. (2002) {\it Nucleic Acids Res.} {\bf 30}(6):1418-26. Ou HY, Guo FB, Zhang CT. Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. (2003) \textit{FEBS Lett.} Apr 10;\textbf{540}(1-3):188-94. Salzberg S.L., Delcher A.L., Kasif S., White O. Microbial gene identification using interpolated Markov Models. (1998) {\it Nucleic Acids Res.} {\bf 26}(2): 544-548. Trifonov, E.N. Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. (1987) {\it J.Mol.Biol.} {\bf 194}, 643-652. Zhang, C.T. and Zhang, R. Analysis of distribution of bases in the coding sequences by a diagrammatic technique. (1991) {\it Nucleic Acids Res.} {\bf 19}, 6313-6317. Zhang, C.T. and Chou, K.C. A graphic approach to analyzing codon usage in 1562 Escherichia coli protein coding sequences. (1994) {\it J.Mol.Biol.} {\bf 238}, 1-8. Zinovyev A. Visualizing the spatial structure of triplet distributions in genetic texts. - IHES Preprint, France. 2002. - M/02/28. Available at {\it http://www.ihes.fr} web-site. Zinovyev A., Gorban A., Popova T. Self-Organizing Approach for Automated Gene Identification. (2003). {\it Open Systems and Information Dynamics} {\bf 10}(4). p.321-333. relation_type: [] relation_uri: [] reportno: ~ rev_number: 12 series: ~ source: ~ status_changed: 2007-09-12 16:54:16 subjects: - bio-theory succeeds: ~ suggestions: ~ sword_depositor: ~ sword_slug: ~ thesistype: ~ title: |- Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences type: preprint userid: 4198 volume: ~