To address the problem that Web page classification methods can generally handle only small-scale data, this paper proposes a large-scale Chinese Web page classification method based on core-subset selection training. The method converts the optimization problem of the support vector machine into an equivalent approximate minimum enclosing ball problem, so that only a core subset of the data set needs to take part in classifier training. In addition, an improved part-of-speech-based mutual information feature selection model is used in the feature selection stage, which effectively improves the ability of Web page classification to handle large-scale data. Experiments on the large-scale Web page data set provided by Sogou Labs show that the method reaches accuracy comparable to that of a support vector machine while greatly reducing training time. Tests on data with imbalanced classes show that the method also performs well on Web page classification when class sizes are imbalanced.
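The core-subset idea can be illustrated with a minimal sketch of an approximate minimum enclosing ball computation. This is not the paper's implementation: it uses the simple Bădoiu-Clarkson iteration in ordinary Euclidean space, whereas an SVM reformulation would work in the kernel-induced feature space. The function name `approx_meb` and the parameter `eps` are hypothetical names chosen for illustration.

```python
import math


def approx_meb(points, eps=0.1):
    """Approximate minimum enclosing ball via the Badoiu-Clarkson iteration.

    Illustrative sketch only (plain Euclidean space, no kernel).
    Returns (center, radius, core_indices), where core_indices are the
    points that were ever selected as "farthest from the current center".
    """
    center = list(points[0])          # start at an arbitrary point
    core_indices = {0}
    steps = math.ceil(1.0 / eps ** 2)  # enough steps for a (1 + eps) ball
    for i in range(1, steps + 1):
        # Pick the point farthest from the current center.
        far = max(range(len(points)),
                  key=lambda j: math.dist(center, points[j]))
        core_indices.add(far)
        # Move the center a shrinking fraction of the way toward it.
        center = [c + (p - c) / (i + 1)
                  for c, p in zip(center, points[far])]
    # Radius that actually covers every point from the final center.
    radius = max(math.dist(center, p) for p in points)
    return center, radius, sorted(core_indices)


# Four corners of a square plus two interior points.
data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0),
        (1.0, 1.0), (0.5, 1.5)]
center, radius, core = approx_meb(data, eps=0.1)
```

In this toy example only corner points ever enter the core set; the interior points never influence the ball, which is the property that lets a classifier train on a small subset of a large data set.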