Ensemble Clustering for Biological Datasets

ŞEKER Ş. E. , Pirim H.

in: Bioinformatics, Horacio Pérez-Sánchez, Editor, Intech, Rijeka, pp.287-299, 2012

  • Publication Type: Book Chapter / Chapter Research Book
  • Publication Date: 2012
  • Publisher: Intech
  • City: Rijeka
  • Page Numbers: pp.287-299
  • Editors: Horacio Pérez-Sánchez, Editor


Recent technologies and tools have generated vast amounts of data in the bioinformatics domain. For example, microarrays measure the expression levels of tens of thousands of genes simultaneously on a single chip. The measurements yield relative expression values for each gene, obtained through an image processing task.

Biological data require both low- and high-level analysis to reveal significant information that sheds light on biological questions, such as disease prediction and the annotation of gene function, and that guides new experiments. In that sense, researchers seek the effect of a treatment or of a change over a time course. For example, they may design a microarray experiment in which a biological organism is treated with a chemical substance, and then compare the observed gene expression values with those measured before treatment. Such a treatment or change leads researchers to focus on groups of genes, or other biological molecules, that have significant relationships with each other under similar conditions. Gene class labels are usually unknown, since little information is available about the data. Hence, data analysis using an unsupervised learning technique is required. Clustering is an unsupervised learning technique used in diverse domains, including bioinformatics. Clustering assigns objects to the same cluster based on a cluster definition, or criterion, which is the similarity between the objects. The idea is to find the most important cliques among the many present in the data. Therefore, clustering is widely used to obtain biologically meaningful partitions. However, there is no single best clustering approach for the problem at hand, and clustering algorithms are biased towards certain criteria. In other words, each clustering approach has its own objective and its own assumptions about the data.

The diversity of clustering algorithms can be exploited by merging the partitions they generate individually. Ensemble clustering provides a framework to merge individual partitions from different clustering algorithms, and it may generate more accurate clusters than the individual clustering approaches. Here, an ensemble clustering framework is implemented as described in [10] to aggregate results from the K-means, hierarchical clustering and C-means algorithms. We employ C-means instead of the spectral clustering used in [10]. We also use different datasets. Two different biological datasets are used for each algorithm. A comparison of the results is presented. In order to evaluate the performance of the ensemble clustering approach, one internal and one external cluster validation index are used. Silhouette (S) [31] is the internal validation index and C-rand [23] is the external one. The chapter reviews some clustering algorithms and ensemble clustering methods, and includes implementation and conclusion sections.
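To make the idea of merging partitions concrete, the following is a minimal sketch in Python. It is not the framework of [10], whose details are not given here. Instead it uses a co-association approach, one common way to build a consensus: objects that share a cluster in most of the individual partitions are grouped together. It also scores a partition with the corrected-for-chance (adjusted) Rand index, on the assumption that this is what C-rand denotes. The function names, the 0.5 threshold, and the toy label vectors are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb


def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n, m = len(partitions[0]), len(partitions)
    co = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            together = sum(1 for labels in partitions if labels[i] == labels[j])
            co[i][j] = together / m
    return co


def consensus_clusters(co, threshold=0.5):
    """Link pairs whose co-association exceeds `threshold` (an assumed cutoff)
    and return the connected components as consensus labels."""
    n = len(co)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    roots = {}  # component root -> label, numbered in order of first appearance
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def adjusted_rand(labels_a, labels_b):
    """Corrected-for-chance Rand index between two labelings of the same objects."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings have a single cluster
    return (pairs_both - expected) / (maximum - expected)


# Hypothetical label vectors standing in for the output of three algorithms
# (e.g. K-means, hierarchical clustering, C-means) on six objects.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [2, 2, 2, 0, 0, 1],
]
reference = [0, 0, 0, 1, 1, 1]  # hypothetical known classes for external validation

consensus = consensus_clusters(co_association(partitions))
print(consensus)  # [0, 0, 0, 1, 1, 1]
print(adjusted_rand(consensus, reference))
print([adjusted_rand(p, reference) for p in partitions])
```

In this toy case the second and third partitions each disagree with the reference on one object, yet the consensus recovers the reference grouping because every pair is decided by majority vote. Label names do not need to match across partitions, since only co-membership is compared.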