The Crawling Module and Web Pages


With the first approach, a collection may contain several copies of each web page, grouped according to the crawl in which they were found. With the second, only the most recent copy of each page is kept; this requires maintaining records of when each page changed and how frequently it changes. The second technique is more efficient than the first, but it requires an indexing module to run alongside the crawling module. The authors conclude that an incremental crawler can deliver new copies of web pages more quickly and keep the repository fresher than a periodic crawler.

III. CRAWLING TERMINOLOGY

A web crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs, which may be supplied by a user or by another program. Each crawling loop involves selecting the next URL from the frontier, fetching the web page corresponding to that URL, parsing the retrieved page to extract new URLs and application-specific information, and finally adding the unvisited URLs to the frontier (a short sketch of this loop is given below). The crawling process may terminate once a specified number of web pages have been crawled.

The WWW can be viewed as a huge graph with web pages as its nodes and hyperlinks as its edges. A crawler starts at a few of these nodes and then follows the edges to reach other nodes. Fetching a web page and extracting the links within it is analogous to expanding a node in graph search. A topical crawler tries to follow only those edges that are expected to lead to portions of the graph related to a particular topic.

Frontier: The crawling method is initialized with a seed URL; the links extracted from it are added to a list of unvisited URLs. This list of unvisited URLs is known as the frontier. The frontier is basi...

... middle of paper ...

...ntier) until the whole web site has been navigated. After this list of URLs has been created, the second part of our application retrieves the HTML text of each link in the list and saves it as a new record in the database. There is only one central database for storing all web pages.

The figure below is a snapshot of the user interface of the Web Crawler application, which is implemented as a VB.NET Windows application. To crawl a web site or any web application with this crawler, an Internet connection is required, and the input URL must be given in the format shown in the figure. At every crawling step, the program selects the top URL from the frontier and passes that web site's information to a unit that downloads pages from the web site. This implementation uses multithreading to parallelize the crawling process so that many web pages can be downloaded in parallel (a second sketch below illustrates this pattern).
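The crawling loop described above can be made concrete with a minimal sketch. The following Python code is an illustration only: the original application is written in VB.NET, and the names crawl and LinkExtractor, the seed URL, and the page limit are assumptions introduced here rather than parts of the original implementation. The sketch keeps a frontier of unvisited URLs, fetches each page, extracts its links, and adds the unvisited ones back to the frontier until a page limit is reached.

# Minimal sketch of the crawling loop (frontier, fetch, parse, add unvisited
# links). Uses only the Python standard library; the seed URL, page limit, and
# helper names are illustrative assumptions.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50):
    frontier = deque([seed_url])   # list of unvisited URLs (the frontier)
    visited = set()                # URLs already fetched
    pages = {}                     # URL -> HTML text

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()   # select the next URL to crawl
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue               # skip pages that cannot be fetched
        pages[url] = html

        # Parse the retrieved page, extract its links, and add unvisited URLs.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)

    return pages


if __name__ == "__main__":
    results = crawl("http://example.com/", max_pages=10)
    print(f"Crawled {len(results)} pages")

In this sketch the frontier is a simple FIFO queue, so pages are expanded in breadth-first order; a topical crawler would instead order the frontier by an estimate of how relevant each link is to the topic of interest.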
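The second stage, downloading the HTML of each URL in the list with multiple threads and saving every page as a record in one central database, can be sketched the same way. The code below is again a hedged Python illustration rather than the VB.NET implementation; the database file name crawler.db, the pages table, and the thread count of 8 are assumed values. A thread pool performs the downloads in parallel while the main thread writes each result into a single SQLite database.

# Sketch of the second stage: multithreaded downloading plus one central
# database holding all downloaded pages. Table name, column names, and thread
# count are assumptions for illustration.
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download one page; return (url, html) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=10) as response:
            return url, response.read().decode("utf-8", errors="replace")
    except Exception:
        return url, None


def download_all(urls, db_path="crawler.db", workers=8):
    # One central database for all downloaded web pages.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

    # Worker threads perform the network I/O in parallel; the single main
    # thread writes the results into the database as they arrive.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, html in pool.map(fetch, urls):
            if html is not None:
                conn.execute(
                    "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)",
                    (url, html),
                )
    conn.commit()
    conn.close()


if __name__ == "__main__":
    download_all(["http://example.com/", "http://example.org/"])

Keeping all database writes on one thread sidesteps concurrent-write issues while still letting the slow part of the work, the network downloads, run in parallel, which mirrors the parallel downloading described for the VB.NET application.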
