Class imbalance, where the number of samples differs widely across classes, is a ubiquitous and much-studied problem in machine learning. Yet the behavior of information gain (IG), an algorithm widely used for feature selection, has rarely been examined on such imbalanced problems. Building on a discussion of IG's performance on datasets with varying degrees of class balance, this paper proposes Im-IG (imbalanced information gain), a new feature selection algorithm for imbalanced problems. By increasing the weight of the minority-class distribution in the information entropy calculation, Im-IG preferentially selects features that help separate the minority class correctly, aiming to raise minority-class accuracy while also improving overall classification performance. Experimental results on several imbalanced datasets show that Im-IG overcomes IG's poor fit to imbalanced problems and is an effective feature selection algorithm for imbalanced data.
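The core idea described above, reweighting the class distribution inside the entropy calculation so that the minority class carries more mass, can be sketched as follows. Note that the abstract does not give Im-IG's exact weighting formula; the inverse-class-frequency weights used here are an illustrative assumption, not necessarily the paper's scheme.

```python
import math
from collections import Counter

def weighted_entropy(labels, weights):
    """Entropy of a class distribution in which each sample's
    contribution is scaled by its class weight, so minority
    classes dominate the impurity measure."""
    total = sum(weights[y] for y in labels)
    probs = {}
    for y in labels:
        probs[y] = probs.get(y, 0.0) + weights[y] / total
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def im_ig_score(feature_values, labels):
    """Weighted information gain of a discrete feature.
    Class weights are inverse class frequencies (assumed
    weighting, for illustration only)."""
    counts = Counter(labels)
    n = len(labels)
    weights = {c: n / cnt for c, cnt in counts.items()}
    base = weighted_entropy(labels, weights)
    cond = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(subset) / n * weighted_entropy(subset, weights)
    return base - cond
```

With these weights, a feature that cleanly splits off the minority class scores higher than under plain IG, because the reweighted base entropy treats the classes as if they were balanced.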