A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical and experimental research and development in all areas of automation
Volume 6 Issue 3
May  2019

IEEE/CAA Journal of Automatica Sinica

Haoyue Liu, MengChu Zhou and Qing Liu, "An Embedded Feature Selection Method for Imbalanced Data Classification," IEEE/CAA J. Autom. Sinica, vol. 6, no. 3, pp. 703-715, May 2019. doi: 10.1109/JAS.2019.1911447

An Embedded Feature Selection Method for Imbalanced Data Classification

doi: 10.1109/JAS.2019.1911447
Funds:  This work was supported in part by the National Science Foundation of USA (CMMI-1162482)
  • Imbalanced data is a type of dataset frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy of identifying the minority class is critically important. Feature selection is one way to address this issue. An effective feature selection method chooses a subset of features that favors accurate determination of the minority class. A decision tree is a classifier that can be built by using different splitting criteria. Its advantage is the ease of detecting which feature is used as a splitting node. Thus, a decision tree splitting criterion can be used as a feature selection method. In this paper, an embedded feature selection method using our proposed weighted Gini index (WGI) is presented. Comparison with the Chi2, F-statistic and Gini index feature selection methods shows that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, our proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC can be high even if only a few features are selected and used, and changes only slightly as more features are selected. However, F-measure reaches excellent performance only if 20% or more of the features are chosen. The results can help practitioners select a proper feature selection method when facing a practical problem.
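
As a rough illustration of the idea that a decision-tree splitting criterion can double as a feature scorer, the following minimal Python sketch ranks features by the best single-split decrease in Gini impurity. The abstract does not state the WGI formula, so the `class_weight` argument below is only an assumed, generic way of up-weighting the minority class and should not be read as the authors' definition. All function names are hypothetical.

```python
from collections import Counter


def weighted_mass(labels, class_weight=None):
    """Total sample weight; each sample counts class_weight[label] (default 1)."""
    cw = class_weight or {}
    return sum(cw.get(y, 1.0) for y in labels)


def gini_impurity(labels, class_weight=None):
    """Gini impurity 1 - sum_k p_k^2 over (optionally class-weighted) proportions.

    NOTE: class_weight is an illustrative assumption, not the paper's WGI.
    """
    cw = class_weight or {}
    mass = {c: n * cw.get(c, 1.0) for c, n in Counter(labels).items()}
    total = sum(mass.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


def best_split_gain(values, labels, class_weight=None):
    """Largest impurity decrease over all thresholds (x <= t) on one feature."""
    total = weighted_mass(labels, class_weight)
    if total == 0:
        return 0.0
    parent = gini_impurity(labels, class_weight)
    best = 0.0
    for t in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        w_left = weighted_mass(left, class_weight) / total
        w_right = weighted_mass(right, class_weight) / total
        gain = (parent
                - w_left * gini_impurity(left, class_weight)
                - w_right * gini_impurity(right, class_weight))
        best = max(best, gain)
    return best


def rank_features(X, y, class_weight=None):
    """Return (feature indices by decreasing score, per-feature scores)."""
    n_features = len(X[0])
    scores = [best_split_gain([row[j] for row in X], y, class_weight)
              for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy imbalanced data: feature 0 separates the classes, feature 1 does not.
X = [[0, 5], [0, 1], [0, 3], [0, 2], [0, 4], [0, 6], [1, 2], [1, 5]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
order, scores = rank_features(X, y)
order_w, scores_w = rank_features(X, y, class_weight={1: 3.0})
```

On this toy data both calls rank feature 0 first. Up-weighting class 1 raises the parent impurity, and therefore the gain of a pure split, which is one simple way a weighted criterion can pay more attention to the minority class.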

     


