Design and Development of a Corpus Phrase Sequence Extraction System

Extracting phrase sequences from corpora has long been a key technical step in phraseology research. Because of the computational and operational complexity involved, previous studies have mostly relied on a single statistical measure to identify and extract phrase sequences, which leaves a large amount of noise in the extracted data. Using current big data processing and computing techniques, this paper implements a phrase sequence extraction method that combines several statistical measures, including frequency, mutual information, and boundary entropy, and develops a corresponding system. Experimental results show that the system can process large corpora on the scale of tens of millions of words on an ordinary computer and significantly improves the quality of the extracted phrase sequences.
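To make the three measures named in the abstract concrete, the following is a minimal sketch of how frequency, pointwise mutual information, and left/right boundary entropy might be computed for two-word sequences. It is an illustration under stated assumptions, not the system described in the paper: the function name `score_bigrams`, the restriction to bigrams, the base-2 logarithm, and the maximum-likelihood probability estimates are all choices made here for the example.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (base 2) of a Counter of neighbouring words."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_bigrams(tokens):
    """Hypothetical helper: map each bigram to
    (frequency, PMI, left boundary entropy, right boundary entropy)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_bigrams = max(n - 1, 1)

    # Collect the words seen immediately left and right of each bigram.
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for i in range(n - 1):
        bg = (tokens[i], tokens[i + 1])
        if i > 0:
            left[bg][tokens[i - 1]] += 1
        if i + 2 < n:
            right[bg][tokens[i + 2]] += 1

    scores = {}
    for (w1, w2), freq in bigrams.items():
        # Maximum-likelihood estimates (an assumption of this sketch).
        p_xy = freq / total_bigrams
        p_x = unigrams[w1] / n
        p_y = unigrams[w2] / n
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[(w1, w2)] = (freq, pmi, entropy(left[(w1, w2)]), entropy(right[(w1, w2)]))
    return scores


if __name__ == "__main__":
    sample = "a b c a b d a b e".split()
    for bigram, values in sorted(score_bigrams(sample).items()):
        print(bigram, values)
```

A candidate that scores well on all three measures at once (frequent, strongly associated internally, and followed or preceded by varied contexts) is more likely to be a genuine phrase than one selected by frequency alone, which is the intuition behind combining the measures to reduce noise. The thresholds for each measure are left open here because the abstract does not state them.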
