To address the two-dimensional structure and content characteristics of Web pages, this paper proposes a tree-structured hierarchical conditional random field (TH-CRFs) model for Web object extraction. First, an improved multi-feature vector space model is used to represent Web page features in terms of both page structure and content. Second, a Boolean model and multi-rule attributes are introduced to better represent the structural and semantic features of Web objects. Third, TH-CRFs are used to extract Web object information, so as to identify the relevant recruitment information and improve the efficiency of model training. Experiments comparing the approach with existing Web information extraction models show that TH-CRFs-based extraction improves accuracy while also reducing the time complexity of extraction.
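As a rough illustration of the Boolean-model step mentioned in the abstract, the sketch below encodes a page node as a 0/1 feature vector that mixes structural rules (tag type) with content rules (text patterns). The `Node` type, the rule names, and the keyword list are hypothetical and chosen only for this example; the abstract does not specify the actual rules, and the TH-CRFs model that would consume such features is not shown.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    """A simplified page node (hypothetical): tag name, tree depth, visible text."""
    tag: str
    depth: int
    text: str


# Hypothetical Boolean rules; each maps a node to True or False.
# The first two look at structure, the last two at content.
RULES: Dict[str, Callable[[Node], bool]] = {
    "is_heading": lambda n: n.tag in {"h1", "h2", "h3"},
    "is_table_cell": lambda n: n.tag in {"td", "th"},
    "has_number_range": lambda n: re.search(r"\d+\s*-\s*\d+", n.text) is not None,
    "has_job_keyword": lambda n: any(
        k in n.text.lower() for k in ("engineer", "salary", "hiring")
    ),
}


def boolean_features(node: Node) -> List[int]:
    """Return one 0/1 entry per rule, in the insertion order of RULES."""
    return [int(rule(node)) for rule in RULES.values()]


cell = Node(tag="td", depth=4, text="Salary: 8000-12000")
title = Node(tag="h2", depth=2, text="Software Engineer")
print(boolean_features(cell))   # [0, 1, 1, 1]
print(boolean_features(title))  # [1, 0, 0, 1]
```

Vectors of this kind could serve as per-node observations for a sequence or tree labeling model; how the paper combines them with the multi-feature vector space representation is not stated in the abstract.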