A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical and experimental research and development in all areas of automation
Volume 6 Issue 3
May  2019

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 6.171, Top 11% (SCI Q1)
  • CiteScore: 11.2, Top 5% (Q1)
  • Google Scholar h5-index: 51, Top 8
Haoyue Liu, MengChu Zhou and Qing Liu, "An Embedded Feature Selection Method for Imbalanced Data Classification," IEEE/CAA J. Autom. Sinica, vol. 6, no. 3, pp. 703-715, May 2019. doi: 10.1109/JAS.2019.1911447

An Embedded Feature Selection Method for Imbalanced Data Classification

doi: 10.1109/JAS.2019.1911447
Funds: This work was supported in part by the National Science Foundation of the USA (CMMI-1162482)
  • Imbalanced datasets are frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy of identifying the minority class is a critically important issue, and feature selection is one way to address it. An effective feature selection method chooses a subset of features that favors the accurate determination of the minority class. A decision tree is a classifier that can be built with different splitting criteria, and its advantage is the ease of detecting which feature is used as a splitting node. It is thus possible to use a decision tree splitting criterion for feature selection. This paper presents an embedded feature selection method based on our proposed weighted Gini index (WGI). Comparisons with the Chi2, F-statistic, and Gini index feature selection methods show that F-statistic and Chi2 achieve the best performance when only a few features are selected, whereas as the number of selected features increases, the proposed method has the highest probability of achieving the best performance. The area under the receiver operating characteristic curve (ROC AUC) and the F-measure are used as evaluation criteria. Experimental results on two datasets show that ROC AUC can be high even when only a few features are selected and used, and changes only slightly as more features are added; the F-measure, however, becomes excellent only when 20% or more of the features are chosen. These results help practitioners select a proper feature selection method when facing a practical problem.
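The paper's exact WGI definition is not reproduced on this page; the following is a minimal illustrative sketch of the general idea the abstract describes — scoring each feature by the impurity decrease of its best split under a class-weighted Gini criterion, then ranking features by that score. The weighting scheme here (a user-supplied per-class weight dictionary that up-weights the minority class) is an assumption for illustration, not the authors' formula.

```python
from collections import Counter

def weighted_gini(labels, class_weight):
    """Class-weighted Gini impurity: 1 - sum of squared weighted class shares."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    total = sum(class_weight[c] * n for c, n in counts.items())
    return 1.0 - sum((class_weight[c] * n / total) ** 2 for c, n in counts.items())

def feature_score(values, labels, class_weight):
    """Best impurity decrease over all threshold splits of a single feature."""
    parent = weighted_gini(labels, class_weight)
    pairs = sorted(zip(values, labels))
    n, best = len(pairs), 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        child = (i / n) * weighted_gini(left, class_weight) + \
                ((n - i) / n) * weighted_gini(right, class_weight)
        best = max(best, parent - child)
    return best

def rank_features(X, y, class_weight):
    """Rank feature indices by their best weighted-Gini impurity decrease."""
    scores = [feature_score([row[j] for row in X], y, class_weight)
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# Toy imbalanced data: feature 0 separates the rare class, feature 1 is noise.
X = [[0.1, 5], [0.2, 3], [0.15, 9], [0.3, 1], [0.9, 4], [0.95, 6]]
y = [0, 0, 0, 0, 1, 1]
w = {0: 1.0, 1: 2.0}  # hypothetical minority-class up-weighting
print(rank_features(X, y, w))  # → [0, 1]
```

Keeping the scorer separate from the ranker mirrors the embedded approach: the same splitting criterion that would grow the tree is reused directly as the feature selection signal.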





    Figures (4) / Tables (17)

    Article Metrics: article views (1357), PDF downloads (84)

