Classification with Two-Stage Correlation-Based Attribute Selection on Big Data Platform

Başarslan M. S. , Kayaalp F.

11th International Statistics Congress, Antalya, Turkey, 4-8 October 2019, pp. 755-761

  • Publication Type: Conference Paper / Full Text
  • City: Antalya
  • Country: Turkey
  • Page Numbers: pp.755-761


The amount of data generated has grown substantially in recent years. As a result, data mining is used across many disciplines to extract information from these datasets. Attribute selection, which identifies the attributes of a dataset that contribute most to the result, is an important stage of the data mining process. The platform on which data mining is run also affects task performance. This study aims to evaluate the performance of a new attribute selection method, Two-Stage Correlation-Based Attribute Selection (TSCBAS), proposed in our previous work. To this end, the SVM and Random Forest classification algorithms were applied to the Bank Marketing dataset from the UCI Machine Learning Repository on two data mining platforms, Spark and R. The dataset was split into training and test data using 5-fold cross-validation. According to the results, SVM showed better classification performance than Random Forest on both the raw dataset and the dataset produced with TSCBAS. In addition, Spark achieved better runtimes than R. The results also confirm the importance of the attribute selection process.
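The abstract does not describe the internals of TSCBAS, so the following is only a minimal sketch of what a generic two-stage correlation filter followed by a 5-fold split might look like. It should not be read as the authors' algorithm. The function names (`pearson`, `two_stage_select`, `k_fold_indices`), the two thresholds, and the choice of Pearson correlation are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def two_stage_select(columns, target, relevance_min=0.1, redundancy_max=0.9):
    """Hypothetical two-stage correlation filter (NOT the published TSCBAS).

    columns: dict mapping attribute name -> list of numeric values.
    Stage 1 keeps attributes whose |correlation with target| >= relevance_min.
    Stage 2 walks the survivors from most to least relevant and drops any
    attribute whose |correlation| with an already kept one exceeds redundancy_max.
    Both thresholds are illustrative assumptions.
    """
    relevance = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    stage1 = [name for name in columns if relevance[name] >= relevance_min]
    stage1.sort(key=lambda name: relevance[name], reverse=True)
    kept = []
    for name in stage1:
        if all(abs(pearson(columns[name], columns[k])) <= redundancy_max for k in kept):
            kept.append(name)
    return kept


def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds, without shuffling."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size
```

In this sketch, the selected attribute names would be used to subset the dataset before each of the five train/test splits is passed to a classifier such as SVM or Random Forest. In practice the rows would normally be shuffled or stratified before splitting, which is omitted here for brevity.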