Exploiting the semi-structured nature of scientific papers, this paper proposes a multi-level classification model that uses paper metadata, where the metadata comprises the title, the keyword set, the abstract, and similar information. Experiments show that using metadata alone achieves classification accuracy close to that of traditional methods based on full-text information. If, following a taxonomy derived from domain knowledge, papers are first coarsely classified using metadata and then classified using the full text, the resulting accuracy is higher than that of the best known algorithm. Because the metadata is far smaller than the full text, and the number of papers in each category after coarse classification is far smaller than the total number of papers, classification time can be greatly reduced when the number of categories is large and the papers are distributed fairly evenly across them.
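The coarse-then-fine scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not say which classifier is used at either level, so the multinomial naive Bayes model below is an assumption, as are the names `NaiveBayes`, `TwoStageClassifier`, the dictionary keys, and the toy papers and category labels. It is meant only to show the control flow: one classifier trained on metadata picks a coarse category, and a separate full-text classifier trained only on that category's papers picks the final class.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real preprocessing."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an assumed classifier)."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.totals = {lab: sum(c.values()) for lab, c in self.word_counts.items()}
        self.n_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_docs):
            score = math.log(self.class_docs[label] / self.n_docs)
            denom = self.totals[label] + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoStageClassifier:
    """Coarse classification on metadata, then fine classification on full text."""

    def fit(self, papers):
        # Stage 1: one classifier over all papers, using only the small metadata.
        self.coarse = NaiveBayes().fit(
            [p["metadata"] for p in papers], [p["coarse"] for p in papers]
        )
        # Stage 2: one full-text classifier per coarse category, each trained
        # only on the papers that belong to that category.
        groups = defaultdict(list)
        for p in papers:
            groups[p["coarse"]].append(p)
        self.fine = {
            coarse: NaiveBayes().fit(
                [p["fulltext"] for p in group], [p["fine"] for p in group]
            )
            for coarse, group in groups.items()
        }
        return self

    def predict(self, metadata, fulltext):
        coarse = self.coarse.predict(metadata)
        return coarse, self.fine[coarse].predict(fulltext)


# Toy corpus; the categories and texts are invented for illustration only.
PAPERS = [
    {"metadata": "neural network image recognition",
     "fulltext": "convolution layers classify images pixels",
     "coarse": "cs", "fine": "cs.vision"},
    {"metadata": "parsing grammar language model",
     "fulltext": "syntax trees parse sentences tokens",
     "coarse": "cs", "fine": "cs.nlp"},
    {"metadata": "protein folding cell",
     "fulltext": "amino acid chains fold structure",
     "coarse": "bio", "fine": "bio.protein"},
    {"metadata": "gene expression cell genome",
     "fulltext": "dna sequencing reads genome expression",
     "coarse": "bio", "fine": "bio.genomics"},
]

model = TwoStageClassifier().fit(PAPERS)
print(model.predict("image recognition network", "convolution layers over pixels"))
```

In this sketch the full-text stage only ever compares the classes inside one coarse category, which is the source of the time saving the abstract attributes to the approach; how large that saving is in practice depends on the classifier and the corpus, neither of which is specified here.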