Specifying Web Source General Settings

Access the General Settings page by selecting PeopleTools, Search Framework, Search Designer Activity Guide, Search Definition and selecting a Source Type of Web Source.

Use the General Settings page to specify the location of the files to be indexed as well as the crawler settings.

Field or Control	Description
Description	Add a brief description to help identify the purpose of the search definition.
Source Type	Displays the type of search definition, such as Query, Web, or File.
Starting URLs	Contains the URL that the search engine uses as the entry point for crawling a website. Important! Only HTTP URLs are supported. The starting URLs must be accessible without any user credentials. The search engine crawler ignores websites that require a login.
Index All	Indexes all URLs that the search engine crawler is allowed to access. Crawling is not limited to a specific domain or host. As the number of URLs to index increases, the time required to complete indexing also increases.
Stay in Host	Limits indexing to the specified host. For example, if you are indexing www.oracle.com and you select this option, you can index documents on www.oracle.com, but not on www.1.oracle.com. Important! If neither Index All nor Stay in Host is selected, the system switches to Stay in Domain mode. In this mode, indexing is limited to a single domain. For example, if you are indexing www.oracle.com, the crawler considers all URLs within this domain, including www.1.oracle.com, but URLs from a different domain, such as www.yahoo.com, are not indexed.
Crawler Timeout	Indicates the maximum allowed time to retrieve a file for crawling.
Crawl Depth	The number of nested links the crawler follows, with the initial URL, or home page, residing at a depth of 0. With a crawl depth of 1, the crawler also fetches any document linked to from the starting URL. With a crawl depth of 2, the crawler fetches any document linked to from the starting URL (depth of 0), and also fetches any document linked to from the documents at a depth of 1, and so on. If you enter a value for Crawl Depth, the system uses that value to enforce the crawling limit. If you leave Crawl Depth blank, the system considers the crawl depth to be unlimited. As you increase the crawl depth, the content to be indexed can increase exponentially, which results in longer crawling durations.
Max Document Size	The maximum document size in megabytes that the system will crawl. Larger documents are not crawled.
Honor Robots Exclusion	Robot exclusion policies are set at the web server level and at the web page level. The Honor Robots Exclusion setting controls whether the search engine recognizes or ignores the robot exclusion settings. Yes: The crawler traverses the pages based on the access policy specified in the web server robots.txt file. The crawler also respects the page-level robot exclusion specified in HTML meta tags. No: The crawler ignores any specified robot policy defined on the web server.
Index Dynamic Pages	Controls whether the search engine crawls and indexes dynamic pages. Typically, database applications serve dynamic pages, and the pages have a URL containing a question mark (?). The search engine considers URLs containing question marks to be dynamic pages. Yes: The search engine crawls dynamic pages. No: The search engine does not crawl dynamic pages.
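
The scope, depth, and dynamic page settings described in the table can be easier to reason about with a small model. The following Python sketch is illustrative only and is not the search engine's implementation. The function names, the mode strings ("index_all", "stay_in_host", "stay_in_domain"), and the rule that a domain is the last two labels of the host name are all assumptions made for this example.

```python
from typing import Optional
from urllib.parse import urlsplit


def naive_domain(host: str) -> str:
    """Return the last two labels of a host name.

    Assumption for illustration only: real crawlers may determine
    the domain differently (for example, for hosts under co.uk).
    """
    return ".".join(host.lower().split(".")[-2:])


def in_scope(start_url: str, candidate_url: str, mode: str) -> bool:
    """Decide whether candidate_url falls inside the crawl scope.

    mode uses hypothetical names that mirror the page options:
    "index_all", "stay_in_host", or "stay_in_domain" (the mode the
    page describes when neither option is selected).
    """
    start_host = (urlsplit(start_url).hostname or "").lower()
    cand_host = (urlsplit(candidate_url).hostname or "").lower()
    if mode == "index_all":
        return True
    if mode == "stay_in_host":
        return cand_host == start_host
    if mode == "stay_in_domain":
        return naive_domain(cand_host) == naive_domain(start_host)
    raise ValueError(f"unknown mode: {mode}")


def within_depth(link_depth: int, crawl_depth: Optional[int]) -> bool:
    """The starting URL is at depth 0; None models a blank Crawl Depth (unlimited)."""
    return crawl_depth is None or link_depth <= crawl_depth


def is_dynamic(url: str) -> bool:
    """Treat any URL that contains a question mark as a dynamic page."""
    return "?" in url


if __name__ == "__main__":
    start = "http://www.oracle.com/"
    for mode in ("index_all", "stay_in_host", "stay_in_domain"):
        print(mode, in_scope(start, "http://www.1.oracle.com/docs", mode))
```

Under these assumptions, www.1.oracle.com is rejected in Stay in Host mode but accepted in Stay in Domain mode, which matches the examples in the table.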