An Embedded Feature Selection Method for Imbalanced Data Classification

Haoyue Liu; MengChu Zhou; Qing Liu

doi:10.1109/JAS.2019.1911447

Volume 6 Issue 3

May 2019

IEEE/CAA Journal of Automatica Sinica



Haoyue Liu, MengChu Zhou and Qing Liu, "An Embedded Feature Selection Method for Imbalanced Data Classification," IEEE/CAA J. Autom. Sinica, vol. 6, no. 3, pp. 703-715, May 2019. doi: 10.1109/JAS.2019.1911447


Funds: This work was supported in part by the National Science Foundation of USA (CMMI-1162482)


Author Bio:
Haoyue Liu (S’17) received the B.S. degree from Kunming University of Science and Technology, Kunming, China, in 2014, and the M.S. degree from the New Jersey Institute of Technology, Newark, NJ, USA in 2016, where she is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering. Her current research interests include machine learning, natural language processing, sentiment analysis, and big data analytics.

MengChu Zhou (S’88−M’90−SM’93−F’03) received the B.S. degree in control engineering from Nanjing University of Science and Technology, Nanjing, China in 1983, M.S. degree in automatic control from Beijing Institute of Technology, Beijing, China in 1986, and Ph.D. degree in computer and systems engineering from Rensselaer Polytechnic Institute, Troy, NY in 1990. He joined New Jersey Institute of Technology (NJIT), Newark, NJ in 1990, and is now a Distinguished Professor of electrical and computer engineering. His research interests include Petri nets, intelligent automation, internet of things, big data, web services, and intelligent transportation. He has over 800 publications including 12 books, 460+ journal papers (360+ in IEEE transactions), 12 patents and 29 book-chapters. He was invited to lecture in Australia, Canada, China, France, Germany, Italy, Japan, Korea, Mexico, Qatar, Saudi Arabia, Singapore, and US and served as a plenary/keynote speaker for many conferences. He is the founding Editor of IEEE Press Book Series on Systems Science and Engineering and Editor-in-Chief of IEEE/CAA Journal of Automatica Sinica. He served as Associate Editor of IEEE Transactions on Robotics and Automation, IEEE Transactions on Automation Science and Engineering, IEEE Transactions on Systems, Man and Cybernetics: Systems, and IEEE Transactions on Industrial Informatics, and Editor of IEEE Transactions on Automation Science and Engineering. He served as a Guest-Editor for many journals including IEEE Internet of Things Journal, IEEE Transactions on Industrial Electronics, and IEEE Transactions on Semiconductor Manufacturing. He is also Associate Editor of IEEE Transactions on Intelligent Transportation Systems, IEEE Internet of Things Journal, and Frontiers of Information Technology & Electronic Engineering. He was General Chair of IEEE Conf. on Automation Science and Engineering, Washington D.C., August 23–26, 2008, General Co-Chair of 2003 IEEE International Conference on System, Man and Cybernetics (SMC), Washington DC, October 5–8, 2003, Founding General Co-Chair of 2004 IEEE Int. Conf. on Networking, Sensing and Control, Taipei, March 21–23, 2004, and General Chair of 2006 IEEE Int. Conf. on Networking, Sensing and Control, Ft. Lauderdale, Florida, USA, April 23–25, 2006. He was Program Chair of 2010 IEEE International Conference on Mechatronics and Automation, August 4–7, 2010, Xi’an, China, 1998 and 2001 IEEE International Conference on SMC and 1997 IEEE International Conference on Emerging Technologies and Factory Automation. He organized and chaired over 100 technical sessions and served on program committees for many conferences. Dr. Zhou has led or participated in over 50 research and education projects with total budget over $12M, funded by National Science Foundation, Department of Defense, National Institute of Standards and Technology (NIST), New Jersey Science and Technology Commission, and Industry, USA. He is a recipient of Humboldt Research Award for US Senior Scientists from Alexander von Humboldt Foundation, Franklin V. Taylor Memorial Award and the Norbert Wiener Award from IEEE Systems, Man and Cybernetics Society for which he serves as VP for Conferences and Meetings. He is a life member of Chinese Association for Science and Technology-USA and served as its President in 1999. He is a Fellow of International Federation of Automatic Control (IFAC), American Association for the Advancement of Science (AAAS) and Chinese Association of Automation (CAA).

Qing Liu is an Assistant Professor in the Department of Electrical and Computer Engineering at New Jersey Institute of Technology. Prior to that, he was a staff scientist at Science Data Group, Oak Ridge National Laboratory for 7 years. He received the Ph.D. degree in computer engineering from the University of New Mexico in 2008, M.S. and B.S. degrees from Nanjing University of Posts and Telecom, China, in 2004 and 2001, respectively. He is a member of Association for Computing Machinery (ACM). His research interests include high-performance computing, data management, and computer systems.

Corresponding author: H. Y. Liu and Q. Liu are with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA (e-mail: hl394@njit.edu; qing.liu@njit.edu)
Received Date: 2018-09-20
Revised Date: 2018-12-31
Accepted Date: 2019-02-21

Available Online: 2019-04-24

Abstract


Imbalanced datasets are frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy of identifying the minority class is critically important. Feature selection is one way to address this issue. An effective feature selection method chooses a subset of features that favors accurate determination of the minority class. A decision tree is a classifier that can be built by using different splitting criteria. Its advantage is the ease of detecting which feature is used as a splitting node. Thus, a decision tree splitting criterion can be used as a feature selection method. In this paper, an embedded feature selection method using our weighted Gini index (WGI) is proposed. Comparison with Chi2, F-statistic, and Gini index feature selection methods shows that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, the proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC performance can be high even if only a few features are selected and used, and changes only slightly as more features are selected. However, F-measure reaches excellent performance only if 20% or more of the features are chosen. The results can help practitioners select a proper feature selection method when facing a practical problem.

Keywords:
- classification and regression tree
- feature selection
- imbalanced data
- weighted Gini index (WGI)
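
The abstract's idea of using a decision tree splitting criterion as a feature selector can be illustrated with a small sketch. The abstract does not give the WGI formula, so the code below is not the authors' method: it ranks features by the best Gini impurity decrease of a single threshold split, and the optional `class_weight` argument is a hypothetical stand-in that simply rescales class counts so the minority class counts for more. All function names and the toy data are assumptions made for illustration.

```python
from collections import Counter


def class_mass(labels, class_weight):
    """Per-class (optionally reweighted) counts for a list of labels."""
    counts = Counter(labels)
    return {c: n * class_weight.get(c, 1.0) for c, n in counts.items()}


def gini(labels, class_weight):
    """Gini impurity computed on reweighted class counts."""
    mass = class_mass(labels, class_weight)
    total = sum(mass.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


def best_split_gain(values, labels, class_weight=None):
    """Largest impurity decrease over all thresholds on one feature."""
    cw = class_weight or {}
    pairs = sorted(zip(values, labels))
    sorted_labels = [label for _, label in pairs]
    parent_mass = sum(class_mass(sorted_labels, cw).values())
    parent_impurity = gini(sorted_labels, cw)
    best = 0.0
    for i in range(1, len(pairs)):
        # A threshold can only sit between two distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left, right = sorted_labels[:i], sorted_labels[i:]
        left_mass = sum(class_mass(left, cw).values())
        right_mass = sum(class_mass(right, cw).values())
        children = (left_mass * gini(left, cw)
                    + right_mass * gini(right, cw)) / parent_mass
        best = max(best, parent_impurity - children)
    return best


def rank_features(X, y, class_weight=None):
    """Return feature indices sorted by decreasing split gain, plus the gains."""
    n_features = len(X[0])
    scores = [
        best_split_gain([row[j] for row in X], y, class_weight)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy imbalanced data: 8 majority samples (0), 2 minority samples (1).
# Feature 0 separates the classes; feature 1 does not.
X = [[0.1, 5], [0.2, 3], [0.3, 9], [0.4, 1], [0.5, 7],
     [0.6, 2], [0.7, 8], [0.8, 4], [2.0, 6], [2.1, 5]]
y = [0] * 8 + [1] * 2

order, scores = rank_features(X, y)
weighted_order, weighted_scores = rank_features(X, y, {0: 1.0, 1: 4.0})
print(order, scores)
print(weighted_order, weighted_scores)
```

Keeping the top-k indices of `order` gives the selected subset. The weighting scheme shown here is only one plausible way to make a Gini-style score sensitive to class imbalance.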


