A New Retrieval Model Based on TextTiling for Document Similarity Search

Source: Journal of Computer Science and Technology (English edition)
Document similarity search finds documents similar to a given query document and returns them to the user as a ranked list. It is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and Smart's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because it measures the similarity between two documents well. In this paper, the performance of these models is compared quantitatively through experiments. Because the Cosine measure cannot reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed. The proposed model takes the subtopic structures of documents into account. It first splits each document into text segments with TextTiling, then calculates the similarities for the different pairs of text segments across the two documents. Finally, it returns the overall similarity between the documents by combining the similarities of the segment pairs with an optimal matching method. Experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and Smart's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed model based on TextTiling is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are validated as appropriate.
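The three components named in the abstract (segmentation, segment-pair similarity, and optimal matching) can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: blank-line splitting stands in for TextTiling, raw term frequencies stand in for whatever term weighting the paper uses, the optimal matching is found by brute force over permutations (practical only for a handful of segments), and the matched total is normalized by the larger segment count. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_segments(doc):
    """ASSUMPTION: blank-line splitting stands in for TextTiling."""
    return [s.strip() for s in doc.split("\n\n") if s.strip()]


def optimal_matching_similarity(doc_a, doc_b):
    """Combine segment-pair cosine similarities via a one-to-one matching.

    Brute-force search over permutations; only suitable for a few segments.
    ASSUMPTION: the matched total is divided by the larger segment count,
    so unmatched segments lower the score.
    """
    segs_a = [term_vector(s) for s in split_segments(doc_a)]
    segs_b = [term_vector(s) for s in split_segments(doc_b)]
    if not segs_a or not segs_b:
        return 0.0
    # Make segs_a the shorter list so each of its segments gets a partner.
    if len(segs_a) > len(segs_b):
        segs_a, segs_b = segs_b, segs_a
    sims = [[cosine(x, y) for y in segs_b] for x in segs_a]
    best_total = max(
        sum(sims[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(segs_b)), len(segs_a))
    )
    return best_total / len(segs_b)
```

Under these assumptions, two documents that contain the same segments in a different order score close to 1, because the matching pairs each segment with its counterpart regardless of position. For longer documents, the brute-force search would normally be replaced by a polynomial-time assignment algorithm such as the Hungarian method.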