Design and Development of a Corpus Phrase Sequence Extraction System

Source: 外语电化教学
Extracting phrase sequences from corpora has long been a key technical step in phraseology research. Because of the computational and operational complexity involved, previous studies have mostly relied on a single statistical measure to identify and extract phrase sequences, which leaves a large amount of noise in the extracted data. Using current big-data processing and computing techniques, this article implements a phrase sequence extraction method that combines several statistical measures, including frequency, mutual information, and boundary entropy, and develops a corresponding system. Experimental results show that the system can process large corpora on the order of ten million words on an ordinary computer and can significantly improve the quality of the extracted phrase sequences.
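To make the three measures named in the abstract concrete, the following is a minimal sketch of how frequency, mutual information, and boundary entropy can be combined to filter candidate n-grams. It is not the authors' system: the function names (`entropy`, `extract_phrases`), the thresholds, the use of the minimum pointwise mutual information over all binary split points for n-grams longer than two words, and the toy sentence are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(neighbours):
    """Shannon entropy (in bits) of a Counter of neighbouring words."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


def extract_phrases(tokens, max_n=3, min_freq=2, min_pmi=1.0, min_entropy=0.5):
    """Return (ngram, freq, pmi, boundary_entropy) for n-grams passing all filters.

    All thresholds are illustrative assumptions, not values from the article.
    """
    total = len(tokens)
    counts = Counter()
    left = defaultdict(Counter)   # words seen immediately before each n-gram
    right = defaultdict(Counter)  # words seen immediately after each n-gram

    for n in range(1, max_n + 1):
        for i in range(total - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            if i > 0:
                left[gram][tokens[i - 1]] += 1
            if i + n < total:
                right[gram][tokens[i + n]] += 1

    def prob(gram):
        # Relative frequency among all n-grams of the same length.
        return counts[gram] / (total - len(gram) + 1)

    results = []
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < min_freq:
            continue
        # Internal cohesion: the weakest binary split of the n-gram.
        pmi = min(
            math.log2(prob(gram) / (prob(gram[:s]) * prob(gram[s:])))
            for s in range(1, len(gram))
        )
        # Boundary freedom: the less varied of the two contexts.
        boundary = min(entropy(left[gram]), entropy(right[gram]))
        if pmi >= min_pmi and boundary >= min_entropy:
            results.append((gram, freq, pmi, boundary))
    return sorted(results, key=lambda r: -r[1])


tokens = "in terms of cost and in terms of time but in terms of scale".split()
phrases = extract_phrases(tokens)
for gram, freq, pmi, boundary in phrases:
    print(" ".join(gram), freq, round(pmi, 2), round(boundary, 2))
```

On this toy input, the fragments "in terms" and "terms of" are frequent and cohesive but are always followed or preceded by the same word, so their boundary entropy is zero and they are dropped. Only the complete sequence "in terms of" passes all three filters, which illustrates how a frequency-only approach would admit noise that the combined measures remove.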