Become.com has built a web crawler with Java™ technology that may be the most sophisticated and massively scaled Java technology application to date, according to a Sun Developer Network article by Janice J. Heiss. The crawler reportedly gathers information on more than three billion web pages and writes more than eight terabytes of data across 30 fully distributed servers in seven days.
Yeogirl Yun, CTO, chairman, and cofounder of Become.com, chose Java for the company's web crawler after poor earlier experiences with memory management and threading in C++.
Yun explains, "We needed to do it faster this time. So we made the radical decision to implement a crawler using Java technology. No one believed it was possible, but we were able to build the prototype crawler in three months using two developers, which was a major achievement. The built-in network library, multithreading framework, and RMI (remote method invocation) saved a lot of development time. The performance is pretty good. My experience with Wisenut made it clear that managing uncertain data on the web is a big challenge. There can be memory leaks and threading issues. We're very pleased with the performance of the Java platform."
Become.com's first crawler was written entirely in the Java programming language. For the second crawler, the fetcher was written in the Java language and the controller in C++. The fetcher performs I/O and gathers, parses, and analyzes the content of web pages. It extracts links and sends data to the controller, which manages the data structures and writes data to disk. Fetchers communicate with controllers but not with each other.
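The division of labor described above can be pictured with a minimal sketch. It is illustrative only and is not Become.com's code: the class names, the `PageResult` message, and the regex-based link extraction are all hypothetical, network fetching and disk writes are left out, and an in-memory queue stands in for whatever transport the real system uses (the article mentions RMI but this summary does not say where it is applied).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlerSketch {

    /** Hypothetical message a fetcher sends to its controller. */
    static final class PageResult {
        final String url;
        final List<String> links;

        PageResult(String url, List<String> links) {
            this.url = url;
            this.links = links;
        }
    }

    /** Fetcher role: parse page content and extract links (network I/O omitted). */
    static final class Fetcher {
        // Naive href matcher for illustration; real pages need a tolerant HTML parser.
        private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");
        private final Queue<PageResult> toController;

        Fetcher(Queue<PageResult> toController) {
            this.toController = toController;
        }

        static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }

        /** Sends results to the controller only; fetchers never talk to each other. */
        void process(String url, String html) {
            toController.add(new PageResult(url, extractLinks(html)));
        }
    }

    /** Controller role: owns the crawl's data structures (disk writes omitted). */
    static final class Controller {
        private final Queue<PageResult> inbox = new ConcurrentLinkedQueue<>();
        private final Set<String> seen = new LinkedHashSet<>();

        Queue<PageResult> inbox() {
            return inbox;
        }

        /** Folds every queued result into the set of known URLs. */
        void drain() {
            PageResult r;
            while ((r = inbox.poll()) != null) {
                seen.add(r.url);
                seen.addAll(r.links);
            }
        }

        Set<String> seen() {
            return seen;
        }
    }

    public static void main(String[] args) {
        Controller controller = new Controller();
        Fetcher a = new Fetcher(controller.inbox());
        Fetcher b = new Fetcher(controller.inbox());
        a.process("http://example.com/", "<a href=\"http://example.com/a\">A</a>");
        b.process("http://example.com/a", "<a href=\"http://example.com/\">home</a>");
        controller.drain();
        System.out.println(controller.seen());
    }
}
```

Because each fetcher holds a reference only to the controller's inbox, the "fetchers talk to controllers, not to each other" rule is enforced by construction in this sketch.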
"Each crawler uses 200 Mbps lines and writes more than 8 terabytes of data. One run takes roughly a week. Crawler A was written first, starting in June 2004. Crawler B was begun in November 2004," Heiss explains. "Both crawlers are pure Java software, with no Java Native Interface (JNI). Crawlers share some packages for content analysis, and all content-analysis software used during the crawl is written entirely in the Java language."
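To put the reported figures in perspective, a back-of-the-envelope calculation follows. It is a rough sketch under stated assumptions: it treats "over three billion pages" and "over eight terabytes" as exact lower bounds, uses decimal terabytes, assumes a run of exactly seven days, and assumes the load is spread evenly over 30 servers. The class and method names are made up for illustration.

```java
public class CrawlThroughput {
    // Figures quoted in the article, treated here as lower bounds.
    static final long PAGES = 3_000_000_000L;
    static final long BYTES_WRITTEN = 8_000_000_000_000L; // assumes 1 TB = 10^12 bytes
    static final int SERVERS = 30;
    static final long SECONDS = 7L * 24 * 60 * 60;        // assumes exactly seven days

    /** Average pages processed per second across the whole cluster. */
    static double pagesPerSecond() {
        return (double) PAGES / SECONDS;
    }

    /** Average pages per second per server, assuming an even split. */
    static double pagesPerSecondPerServer() {
        return pagesPerSecond() / SERVERS;
    }

    /** Average bytes written to disk per crawled page. */
    static double bytesWrittenPerPage() {
        return (double) BYTES_WRITTEN / PAGES;
    }

    /** Average sustained write rate in megabits per second. */
    static double writeMegabitsPerSecond() {
        return BYTES_WRITTEN * 8.0 / SECONDS / 1e6;
    }

    public static void main(String[] args) {
        System.out.printf("pages/s overall:    %.0f%n", pagesPerSecond());
        System.out.printf("pages/s per server: %.0f%n", pagesPerSecondPerServer());
        System.out.printf("bytes written/page: %.0f%n", bytesWrittenPerPage());
        System.out.printf("avg write rate:     %.1f Mbps%n", writeMegabitsPerSecond());
    }
}
```

Under these assumptions the crawl averages roughly 5,000 pages per second overall, about 165 per server, with around 2.7 KB written per page. The data written is not the same as the data downloaded, so the average write rate of about 106 Mbps should not be compared directly with the 200 Mbps line capacity.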
In the article, Heiss walks through how the crawler was developed, the challenges the team faced, and how the crawler has helped Become.com build its web index.