This paper describes the traditional Single-Pass algorithm in detail and analyzes its characteristics and shortcomings. To address the algorithm's sensitivity to input order, an improved method is proposed: microblog texts that are rich in topic information are clustered first to obtain the initial topic clusters, and the remaining microblog texts are then clustered, which improves clustering precision. The topic detection pipeline (text preprocessing, vector model construction, Single-Pass clustering, and agglomerative hierarchical clustering) is described in detail. Experimental results show that the method outperforms the traditional approach on recall, precision, and F-measure.
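The two-stage idea can be illustrated with a minimal sketch. It is not the paper's implementation, and several details are assumptions made only for illustration because the abstract does not specify them: "topic-information richness" is approximated by the number of distinct tokens in a post, the parameters `threshold` and `seed_fraction` are hypothetical, cluster centroids are kept as summed term counts, and the agglomerative hierarchical clustering step is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold, clusters=None):
    """Traditional Single-Pass: scan docs once, in the order given.

    Each doc (a list of tokens) joins the most similar existing cluster
    if the similarity reaches `threshold`; otherwise it opens a new one.
    """
    clusters = [] if clusters is None else clusters
    for tokens in docs:
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(tokens)
            # Summed counts serve as the centroid; cosine ignores scale.
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [tokens]})
    return clusters


def improved_single_pass(docs, threshold, seed_fraction=0.3):
    """Cluster the richest posts first, then the remaining posts.

    Assumption: richness is the count of distinct tokens in a post.
    """
    ranked = sorted(docs, key=lambda d: len(set(d)), reverse=True)
    n_seed = max(1, int(len(ranked) * seed_fraction))
    seed_clusters = single_pass(ranked[:n_seed], threshold)
    return single_pass(ranked[n_seed:], threshold, seed_clusters)
```

Calling `improved_single_pass(posts, threshold=0.3)` on a list of tokenized posts returns a list of clusters, each holding its members and centroid. Because short, low-information posts are processed only after the seed clusters exist, they are less likely to open spurious clusters of their own, which is the order-sensitivity problem the improved method targets.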