Most Web Spiders (web crawlers) that gather information on the Internet run on a single machine, which makes it difficult to complete large-scale information collection quickly. To address this, the paper proposes a cluster-based Spider system in which many Spiders run on different hosts and cooperate on the same task. Each Spider is responsible for one part of the task, and this division can be adjusted dynamically, so information collection is greatly accelerated. The paper describes the architecture and model of such a system and presents one implementation of it, ChinaWebWizard, which not only works in cluster mode but can also dynamically discover new sites. The system provides underlying support for search engines and serves as a reference for website builders and developers.
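The abstract does not say how the work is divided among Spiders or how that division is adjusted. As a rough illustration only, the sketch below assumes one possible scheme: each site (host name) is assigned to a Spider by rendezvous hashing, so that adding or removing a Spider moves only the sites that must move. The function names `owner` and `partition` and the Spider and host names are hypothetical and are not taken from ChinaWebWizard.

```python
import hashlib
from typing import Dict, List, Sequence


def owner(host: str, spiders: Sequence[str]) -> str:
    """Return the spider responsible for `host`.

    Assumption: rendezvous (highest-random-weight) hashing. Each
    (spider, host) pair gets a pseudo-random weight, and the spider
    with the largest weight owns the host.
    """
    if not spiders:
        raise ValueError("at least one spider is required")

    def weight(spider: str) -> int:
        digest = hashlib.sha256(f"{spider}|{host}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(spiders, key=weight)


def partition(hosts: Sequence[str], spiders: Sequence[str]) -> Dict[str, List[str]]:
    """Split `hosts` among `spiders`; every host gets exactly one owner."""
    assignment: Dict[str, List[str]] = {s: [] for s in spiders}
    for h in hosts:
        assignment[owner(h, spiders)].append(h)
    return assignment


if __name__ == "__main__":
    hosts = [f"site{i}.example.com" for i in range(10)]  # hypothetical sites
    print(partition(hosts, ["spider-a", "spider-b", "spider-c"]))
    # A spider joins the cluster: only hosts it now owns change hands.
    print(partition(hosts, ["spider-a", "spider-b", "spider-c", "spider-d"]))
```

With this choice, a Spider joining the cluster takes over only the sites it now wins, and a Spider leaving hands its sites to the remaining ones without disturbing any other assignment. A real system would also need to weigh load, politeness limits, and newly discovered sites, none of which this sketch covers.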