Web crawlers are programs that traverse pages by exploiting the graph structure of the Web. The key motivation behind designing Web crawlers is to create a localized repository of web pages. In its simplest form, a crawler initiates a crawl from a so-called seed page and follows all the external links defined in that page. This process repeats with all subsequent pages until some predefined condition is satisfied~\cite{crawlingthe}.

Due to the dynamic nature of the Web, crawling is an important facet of any search engine: it keeps the search listing current as pages are modified, created or removed.

General purpose search engines (GPSEs) attempt to offer as wide a coverage as possible, and attempt to offset the cost of crawling through the queries they receive. GPSEs employ exhaustive crawlers that simply crawl any link they find; their goal is to be as comprehensive as possible~\cite{crawlingthe}.

However, crawlers may also be selective in the content that they fetch. These are known as preferential or heuristic-based crawlers~\cite{Chakrabarti99focusedcrawling} and are used in the construction of focused repositories and in automated resource discovery. Preferential or heuristic crawlers that are designed to crawl pages within a specific topic are known as focused or topical crawlers.

\section{Basic structure of a Web crawler}

A crawler will generally create and maintain a list of unvisited URLs known as the frontier, which is seeded with URLs provided by either the user or another program. The crawler then loops through the frontier: each URL is visited and parsed, and new topic-specific URLs are retrieved and added to the frontier.

Crawling may be ter... ...imes that each URL was crawled by the crawler. It in essence creates a map from the seed URL to the current URLs in the frontier. This history provides an efficient way to look up whether a page has already been crawled.
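The frontier loop described above can be sketched as follows. This is a minimal illustration, not the implementation discussed in the text: the function and variable names (\texttt{crawl}, \texttt{LinkParser}, \texttt{toy\_web}) are hypothetical, and the network fetch is abstracted into a \texttt{fetch} callable so the loop itself stays visible.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href targets of anchor tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed_urls, fetch, max_pages=100):
    """Basic frontier loop: pop a URL, fetch the page, extract its links,
    and enqueue any link not seen before.  `visited` is the crawl history
    that prevents refetching; `pages` is the local repository."""
    frontier = deque(seed_urls)
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:          # fetch failed or URL unknown
            continue
        pages[url] = html
        parser = LinkParser(url)
        parser.feed(html)
        frontier.extend(u for u in parser.links if u not in visited)
    return pages


# Usage against a tiny in-memory "web" standing in for real HTTP fetches.
toy_web = {
    "http://example.com/": '<a href="/a">a</a> <a href="/b">b</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": "no outgoing links",
}
pages = crawl(["http://example.com/"], toy_web.get)
# pages now holds all three documents reachable from the seed
```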
\subsection{Fetching}

When fetching a page, the HTTP client needs to ensure that a reasonable amount of time is spent on each request, so that slow servers do not cause the crawler to stall. This can be remedied through either the use of a timeout or, alternatively, only downloading a small portion of a page~\cite{Chakrabarti99focusedcrawling}. One should also take into account a site's robots.txt file, which is stored in the root directory of the web server. This file dictates to crawlers the server administrator's policies with regard to content that may or may not be crawled~\cite{crawlingthe}.

\subsection{Parsing}

After the page has been
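The two fetching safeguards mentioned above, a per-request timeout with a size cap and a robots.txt check, can be sketched with the standard library. The function names (\texttt{fetch\_page}, \texttt{make\_policy}) and the parameter values are illustrative assumptions, not prescribed by the text.

```python
import urllib.request
import urllib.robotparser


def fetch_page(url, timeout=5.0, max_bytes=512 * 1024):
    """Fetch at most max_bytes of a page; the timeout stops a slow
    server from stalling the crawler.  Returns None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(max_bytes)
    except OSError:
        return None


def make_policy(robots_txt, robots_url="http://example.com/robots.txt"):
    """Parse a robots.txt body into a policy object that answers
    whether a given user agent may crawl a given URL."""
    rp = urllib.robotparser.RobotFileParser(robots_url)
    rp.parse(robots_txt.splitlines())
    return rp


# Usage: consult the policy before calling fetch_page on a URL.
rules = "User-agent: *\nDisallow: /private/\n"
policy = make_policy(rules)
allowed = policy.can_fetch("mybot", "http://example.com/private/x")  # False
```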

