As big data technology is adopted more widely and its benefits are demonstrated in practice, enterprises are paying increasing attention to mining the value of their data, in particular to analyzing and identifying customer behavior and preferences by combining internal and external data. For some enterprises, e-commerce data is undoubtedly a valuable external data resource. Acquiring it, however, is constrained by the anti-crawling techniques that e-commerce websites deploy, which make collection increasingly difficult. To address data collection in the e-commerce domain and the problems encountered when crawling e-commerce websites, such as large data volumes, slow crawl speed, access verification, and IP access restrictions, this paper studies and proposes, in light of practical requirements, a Nutch-based distributed e-commerce data collection scheme.