Instance reduction for supervised learning using input-output clustering method

Source: Journal of Central South University
A method is proposed that uses input-output clustering to reduce the number of samples in large data sets. The method clusters the output data into groups and then clusters the input data according to those output groups. A set of prototypes is selected from the clustered input data, and the inessential data can then be discarded from the data set. Because only the prototypes are used, the method can reduce the effect of outliers. The method is applied to data-set reduction in regression problems. Two standard synthetic data sets and three standard real-world data sets are used for evaluation. Root-mean-square errors are compared between support vector regression models trained on the original data sets and on the corresponding instance-reduced data sets. In the experiments, the proposed method gives good results for both the reduction and the reconstruction of the synthetic and real-world data sets. The numbers of instances in the synthetic data sets are decreased by 25%-69%. The reduction rates for the real-world data sets of automobile miles per gallon and the 1990 census in California (CA) are 46% and 57%, respectively. The reduction rate for the electrocardiogram (ECG) data set is 96%, which reflects the redundant and periodic nature of ECG signals. For all of the data sets, the regression results are similar to those obtained from the corresponding original data sets. The regression performance of the proposed method is therefore good, while only a fraction of the data is needed in the training process.
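The procedure described in the abstract (cluster the outputs, cluster the inputs within each output group, keep one prototype per input cluster) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not name the clustering algorithm or the prototype selection rule, so plain k-means and "the real sample nearest each centroid" are assumptions, and the names `kmeans`, `reduce_instances`, `reduction_rate`, `k_out`, and `k_in` are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples (assumed algorithm)."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def reduce_instances(X, y, k_out=3, k_in=2, seed=0):
    """Return prototype (x, y) pairs chosen by input-output clustering."""
    # Step 1: cluster the scalar outputs into k_out groups.
    _, out_labels = kmeans([(v,) for v in y], k_out, seed=seed)
    prototypes = []
    for group in sorted(set(out_labels)):
        idx = [i for i, lab in enumerate(out_labels) if lab == group]
        # Step 2: cluster the inputs that belong to this output group.
        centroids, _ = kmeans([X[i] for i in idx], k_in, seed=seed)
        # Step 3 (assumed rule): keep the real sample nearest each centroid.
        chosen = {min(idx, key=lambda i: math.dist(X[i], c)) for c in centroids}
        prototypes.extend((X[i], y[i]) for i in sorted(chosen))
    return prototypes


def reduction_rate(n_original, n_reduced):
    """Fraction of instances discarded."""
    return 1.0 - n_reduced / n_original


# Toy data (made up for illustration): 40 samples of y = sin(x).
X = [(i / 10.0,) for i in range(40)]
y = [math.sin(x[0]) for x in X]
protos = reduce_instances(X, y, k_out=3, k_in=2)
print(len(protos), reduction_rate(len(X), len(protos)))
```

In this sketch the number of prototypes is bounded by `k_out * k_in`, so the cluster counts directly control the reduction rate. To reproduce the kind of comparison the abstract reports, one regression model would be trained on `protos` and another on the full `(X, y)` set, and their root-mean-square errors compared on held-out data.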