Running the Content Categorization Spider

This chapter provides an overview of the spider process and discusses how to:

Run the content categorization spider.
Debug the categorization spider.
View folders.

Understanding the Spider Process

You can import content in bulk automatically. For example, by running a spider utility that searches file servers and web sites for collections of documents, you can automatically update your portal with the information that you consider pertinent. You control the spider, specifying what type of content to retrieve, where to look, and how much information to retrieve. By using PeopleSoft Process Scheduler to refresh folders, you can keep your information current by the day or even by the hour.

An administrator creates the mapping information needed to categorize documents. For instance, if an intranet server were being crawled, then all documents for the Human Resources department could be mapped to the appropriate folder in the departmental hierarchy.

The features of automated categorization include:

A document can appear in multiple folders and hierarchies.

This means users can find documents via whichever navigation scheme is most intuitive to them.
You can map documents from different sources or folders to the same folder.

Thus, you can aggregate related documents from a variety of sources.
Folders are mapped individually, so a hierarchy does not need to reflect the exact structure of the source server.

Thus, you can organize documents in the most logical fashion.

To invoke the spider, you schedule an Application Engine program. The Application Engine program passes parameters to the spider, such as the type of crawl (file system versus HTTP), the starting directory or URL, the number of files to crawl, and so on.

As the spider finds files and directories, it returns them to the content categorization system. The primary data passed back into the content categorization system is a URL and its directory path.

PeopleCode then maps the file to the appropriate folders. If the administrator has enabled and selected autocategorization, then the file is passed to the autocategorization engine.

The file is analyzed, and the directory path is replaced with a folder path or paths. If the resulting paths match a folder, the URL or file is mapped or placed within it. The spider can do this for hundreds or thousands of documents, thus filling in a hierarchy.

Files and auto-expanded sub-folders that are created by the spider can be added, modified, or removed only by the spider. This restriction prevents manually added entries from being orphaned if the spider no longer finds a directory path on a subsequent pass.

Running the Content Categorization Spider

This section discusses how to:

Create spider run control entries.
Associate hierarchies with spider run control entries.
Run the categorization spider.

Pages Used to Populate Hierarchies Automatically

Page Name	Object Name	Navigation	Usage
Run Categorization Spider	EPPCM_SPIDR_RUN	Content Management, Administration, , Run Categorization Spider	Establish a content source.
Content Source	EPPCM_CATG_SPIDR	Content Management, Content Categorization, Browse Folders, Browse Categorized Content. Click a category title on the Browse Categorized Content page to navigate through the folder levels. Click the Add Folder button at the appropriate folder level on the Browse Categorized Content page. Click the Content Source tab on the Folder Properties page.	Select folder data sources.

Creating Spider Run Control Entries

Access the Run Categorization Spider page.

Run Control ID

Give the run control a name that is similar to the source name.

Source Type

Select the type of crawl that will be performed. Choose from:

Auto Categorized File Server	Categorization is not based upon a file's directory path. Instead, the file is sent to an autocategorization engine, which analyzes and categorizes it and returns the folder path or paths.

See Performing Content Autocategorization.

File Server	The run control specifies a starting directory relative to the application server. The directory and any subdirectories are searched for files of the specified MIME type. Path names for the files that are found are converted into URLs, using the specified information, so that the documents can be accessed from links in the portal.

The categorization path or hierarchy for a file is its full directory path, minus any drive designation on the left and the filename on the far right. For example, given the location F:\resources\docs\support\answer_books\WhereDoIStart.doc, the categorization path is \resources\docs\support\answer_books\.

Note. The system automatically converts backward slashes (Microsoft Windows) to forward slashes (UNIX), and you need not include the right-most slash. As a result, the following is an acceptable string: /resources/docs/support/answer_books

HTTP Crawler	Starting at the specified URL, a file is opened and analyzed. Rather than searching for files in subdirectories, an HTTP crawl follows links to find more files. The links in the HTML document, not the directory structure, determine which files are found.

The categorization path or hierarchy for a file is the subset of the URL that remains after the server is removed from the left and the filename is removed from the right. For example, given the URL http://www.peoplesoft.com/corp/en/products/line/appconnect/portal/features.asp, the categorization path is /corp/en/products/line/appconnect/portal/.
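
The two path derivations described for the File Server and HTTP Crawler source types can be sketched in Python. This is a minimal illustration under stated assumptions, not the spider's actual code: the function names are hypothetical, and returning / for a file at the root is an assumption.

```python
import ntpath
import posixpath
from urllib.parse import urlsplit


def file_categorization_path(location: str) -> str:
    """Directory path of a file, minus drive designation and filename."""
    # ntpath parses drive letters and both slash styles on any platform.
    _drive, rest = ntpath.splitdrive(location)
    directory = ntpath.dirname(rest)
    # Convert backward slashes to forward slashes; drop the right-most slash.
    return directory.replace("\\", "/").rstrip("/") or "/"


def url_categorization_path(url: str) -> str:
    """Directory path of a URL, minus server and filename."""
    directory = posixpath.dirname(urlsplit(url).path)
    return directory.rstrip("/") or "/"


print(file_categorization_path(
    r"F:\resources\docs\support\answer_books\WhereDoIStart.doc"))
# /resources/docs/support/answer_books
print(url_categorization_path(
    "http://www.peoplesoft.com/corp/en/products/line/appconnect/portal/features.asp"))
# /corp/en/products/line/appconnect/portal
```

Both functions omit the right-most slash, which the note on slash conversion describes as acceptable.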

Source Name

Enter the source name, such as a folder or hierarchy name. This is an arbitrary name that you use to identify the source.

Index URL

Enter the source location as it relates to the batch server. For example, for a file server crawl, this should be a file protocol-based URL that is accessible from the machine on which the spider runs. This URL represents the starting point from which the spider begins its search for files, such as c:/mydocuments, \\myserver\my_shared_folder\News Articles, c:\mydocs, or /usr/local/public_docs. For an HTTP crawl, this can be any valid and accessible HTTP-based URL.

Remap Path to This URL

This field is relevant only for jobs that are based on file protocol-based URLs. Because all documents must be accessible by HTTP, this parameter specifies the HTTP URL equivalent of the file-based URL specified in the Index URL field. It is used to construct valid URLs for other directories and files that are encountered during a crawl. For example, if the Index URL is c:/news and the HTTP alias is http://myserver/news, then when a file server crawl encounters the file c:/news/article.html, the alias indicates that article.html is accessible at http://myserver/news/article.html. Typically, this requires that you incorporate content file systems into an accessible web server’s document tree using symbolic links, virtual directories, and so on.
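
The remapping can be pictured as a prefix substitution. The following Python sketch assumes that the file-based starting location is simply replaced by the HTTP alias; the function name is hypothetical, and the spider itself may handle details such as character escaping differently.

```python
def remap_to_http(index_url: str, remap_url: str, found: str) -> str:
    """Replace the file-based starting location with its HTTP alias."""

    def normalize(path: str) -> str:
        # Treat backward and forward slashes alike; ignore a trailing slash.
        return path.replace("\\", "/").rstrip("/")

    base = normalize(index_url)
    target = normalize(found)
    if target != base and not target.startswith(base + "/"):
        raise ValueError(f"{found!r} is not located under {index_url!r}")
    # Keep the part of the path below the starting location.
    return remap_url.rstrip("/") + target[len(base):]


print(remap_to_http("c:/news", "http://myserver/news", "c:/news/article.html"))
# http://myserver/news/article.html
```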

Index only in this hostname

Available only when the source type is HTTP Crawler.

Select to prevent the spider from searching other web sites to which the source is linked.

Warning! If you perform a crawl with this option (host affinity) cleared, you can potentially mix up data from multiple hosts.

Link Depth

Enter the number of levels of links that the job should process.

Maximum Document Count

Enter the maximum number of documents and directories that the job should process before terminating. This establishes reasonable limits for extremely large or open-ended crawls.
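
As a rough illustration of how the hostname restriction, Link Depth, and Maximum Document Count bound an HTTP crawl, consider the following Python sketch. It is not the spider's implementation: the function and parameter names are hypothetical, a dictionary of links stands in for fetching and parsing pages, and counting the starting page as depth 0 is an assumption.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(start_url, links, link_depth, max_documents, same_host_only=True):
    """Breadth-first crawl over `links` (URL -> list of linked URLs)."""
    start_host = urlsplit(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 0)])
    processed = []
    while queue and len(processed) < max_documents:
        url, depth = queue.popleft()
        processed.append(url)
        if depth >= link_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target in seen:
                continue
            if same_host_only and urlsplit(target).hostname != start_host:
                continue  # stay on the starting host
            seen.add(target)
            queue.append((target, depth + 1))
    return processed


sample_links = {
    "http://myserver/index.html": ["http://myserver/a.html", "http://other/x.html"],
    "http://myserver/a.html": ["http://myserver/b.html"],
}
print(crawl("http://myserver/index.html", sample_links, link_depth=1, max_documents=10))
# ['http://myserver/index.html', 'http://myserver/a.html']
```

With the hostname restriction cleared, the same call would also return http://other/x.html, which shows how content from a second host can enter the results.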

Create Log File

Select to create a log file. The log file appears in the process scheduler's files subdirectory: $PS_HOME/appserv/prcs/<server domain name>/files.

This file provides details of the processed URLs and their associated folder path directories.

Mime Types

Enter the file types to include in the search. Be sure that your browser can view these file types. Autocategorization search engines might have additional file type limitations. For file server crawls, this should be a comma-delimited list of file extensions, such as .txt, .xml, or .html. For HTTP protocol-based crawls, this should be a comma-delimited list of Multipurpose Internet Mail Extensions (MIME) types, such as text/html, application/pdf.
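
The two forms of the list can be compared in a short Python sketch. The helper names are hypothetical, and case-insensitive matching is an assumption rather than documented spider behavior.

```python
import os


def parse_type_list(value: str) -> set[str]:
    """Split a comma-delimited list into a set of trimmed, lowercase entries."""
    return {item.strip().lower() for item in value.split(",") if item.strip()}


def file_is_included(filename: str, extensions: set[str]) -> bool:
    """File server crawl: compare the file extension with the list."""
    return os.path.splitext(filename)[1].lower() in extensions


def response_is_included(content_type: str, mime_types: set[str]) -> bool:
    """HTTP crawl: compare the MIME type, ignoring parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower() in mime_types


extensions = parse_type_list(".txt, .xml, .html")
mime_types = parse_type_list("text/html, application/pdf")
print(file_is_included("holidays.doc", extensions))                   # False
print(response_is_included("text/html; charset=utf-8", mime_types))   # True
```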

Autocategorization

The Autocategorization group box appears only when the source type is Auto Categorized File Server.

Autocat Name (autocategorization name)	Select the name of the autocategorization engine from the list of source names entered when defining vocabularies.
Vocabulary Name	Select a defined vocabulary name that is used by the selected autocategorization engine for a hierarchy into which it can classify documents.
Max. Categories (maximum categories)	Enter the maximum number of categories (folders) into which the autocategorization engine autoclassifies each document. Some engines may not support this feature.
Autocategorization Threshold	Some autocategorization engines rank categorizations with scores that indicate how confident the engine is about its recommendation. Not all autocategorization engines support rankings, and the range of allowable values is engine-dependent. Enter a value here to override the threshold on the Content Source page.
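
The combined effect of Max. Categories and Autocategorization Threshold can be sketched as follows. This Python example assumes that the engine returns numeric scores and that a higher score means greater confidence; both are assumptions, because score ranges are engine-dependent, and the function name is hypothetical.

```python
def select_categories(scored_paths, max_categories, threshold):
    """Keep the best-scoring folder paths that meet the threshold.

    scored_paths: list of (folder_path, score) pairs from the engine.
    """
    qualifying = [(path, score) for path, score in scored_paths if score >= threshold]
    # Highest score first (assumes higher means more confident).
    qualifying.sort(key=lambda item: item[1], reverse=True)
    return [path for path, _score in qualifying[:max_categories]]


recommendations = [
    ("/technology/computers/hardware", 0.91),
    ("/technology/computers/software", 0.55),
    ("/business/finance", 0.20),
]
print(select_categories(recommendations, max_categories=2, threshold=0.5))
# ['/technology/computers/hardware', '/technology/computers/software']
```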

Associating Folders With Spider Run Controls

Access the Content Source page.

This page is used to specify the crawled spider source that generates the child content and child folders in this parent folder.

Note. You must set up your content sources using the Run Categorization Spider page before entering information on this page.

Source Type	Select the type of crawl that will be performed. Choose from: Auto Categorized File Server, File Server, or HTTP Crawler.
Source Name	The choice of names is restricted to those that you defined on the Run Categorization Spider page with the selected source type.
Source Path	Specify the content location within the source. For file servers and HTTP sources, this is the directory path, such as /docs/departments/hr. For HTTP crawls, this is the directory path within the URL. For example, if the URL is http://myserver/docs/departments/hr/holidays.doc, then the directory path is /docs/departments/hr. For autocategorization sources, this is the folder or classification path, such as /technology/computers/hardware.
Auto Expand Folder	Select to have the spider process sub-folders of the source detail. For example, if the path /docs is specified as the source detail of a file-server-based folder source, all the sub-folders and content below the /docs directory are processed recursively during a crawling job. Every sub-folder becomes a new folder within the hierarchy if it doesn’t already exist. If you don’t select this check box, only the content that resides directly within the /docs directory is processed.
Score	Enter the rank that a piece of content must have to be retrieved from an autocategorization engine. Some autocategorization engines rank categorizations with scores that indicate how confident the engine is about its recommendation. Stipulating a minimum allowable score establishes the degree of relevancy that content must have before being allowed into a folder. Not all autocategorization engines support rankings, and the range of allowable values is engine-dependent. See Performing Content Autocategorization.
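
How Source Path and Auto Expand Folder determine where a crawled document lands can be sketched in Python. The class, the function, and the use of slash-separated folder paths are all hypothetical; the sketch only mirrors the matching rules described on this page.

```python
from dataclasses import dataclass


@dataclass
class ContentSource:
    folder: str           # hierarchy folder that owns this content source
    source_path: str      # value of the Source Path field
    auto_expand: bool = False


def normalize(path: str) -> str:
    """Use forward slashes, one leading slash, and no trailing slash."""
    parts = [p for p in path.replace("\\", "/").split("/") if p]
    return "/" + "/".join(parts)


def target_folders(categorization_path: str, sources: list) -> list:
    """Return the hierarchy folders that a crawled path maps into."""
    doc = normalize(categorization_path)
    targets = []
    for source in sources:
        src = normalize(source.source_path)
        prefix = src if src.endswith("/") else src + "/"
        if doc == src:
            targets.append(source.folder)
        elif source.auto_expand and doc.startswith(prefix):
            # Each directory below the source path becomes a sub-folder.
            targets.append(source.folder + "/" + doc[len(prefix):])
    return targets


sources = [
    ContentSource("Products", "/ps/images/portal_pa/website/products", auto_expand=True),
    ContentSource("HR", "/docs/departments/hr"),
]
print(target_folders("/ps/images/portal_pa/website/products/crm/articles", sources))
# ['Products/crm/articles']
print(target_folders("/docs/departments/hr/benefits", sources))
# []
```

The second call returns nothing because the HR source does not auto-expand, so only content that resides directly in /docs/departments/hr is mapped.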

Running the Categorization Spider

The Content Categorization Spider is an application engine job (EPPCM_SPIDER), which is a batch process. The spider job must be run as a batch process because the actual time needed to complete the task can vary widely depending on the scope and depth of the content to be crawled.

The process is triggered by clicking the Run button on the Run Categorization Spider page. Use the Process Monitor link to monitor the status of the process.

Debugging the Categorization Spider

All content that is crawled is automatically published and immediately appears to the appropriate users. If you discover a document that is not appropriate for viewing, go to the Unpublish Content page to unpublish the content item, or click Save/Unpublish on the individual content item.

If the spider runs successfully but no content is mapped into your hierarchy, there are several possible causes. Most likely, you have not mapped a folder to a valid directory or folder path returned by the spider. Content source paths for file server crawls should be the directory path minus any drive indicator and the filename. Content source paths for HTTP crawls should be the directory path of the URL, so remove the domain and filename.

On the Content Source page for a folder, you should ensure that the source details are set to a text string or path that is the same as found by the spider. In the log file, look for log entries containing the phrase, "Got path from url."
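
To pull those entries out of a large log file, a small script can help. The following Python sketch assumes only that the entries contain the phrase quoted above; the rest of the log layout is not assumed, and the function names are hypothetical.

```python
PHRASE = "Got path from url"


def path_entries(lines, phrase=PHRASE):
    """Return the log lines that contain the phrase, ignoring case."""
    needle = phrase.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def print_path_entries(log_path):
    """Print the matching entries from the log file at log_path."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for entry in path_entries(log):
            print(entry)
```

Call print_path_entries with the location of the spider log file, then compare the paths in the output with the source details entered on the Content Source page.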

Viewing Folders

In this section we discuss how to review delivered folder sample data.

To view available folders, go to Content Management, Categorized Content, Browse Folders. This displays the top-level list of folders that you have security access to view. Clicking further down the hierarchy enables you to navigate to sub-folders and related documents. Clicking the link for a document displays the document contents.

Reviewing Delivered Folder Sample Data

The sample data delivered with the PeopleSoft Enterprise Portal includes one folder called PeopleSoft Documents. It shows some representative documents from http://www.peoplesoft.com, such as product information, white papers, and case studies. It is designed to show a combination of manually added folders, documents and crawled content.

The first-level folders, About PeopleSoft, Portal Implementation, Products, and Support, are divided into two groups of user access. The two general information folders, About PeopleSoft and Products, are visible to all users (PAPP_USER, PAPP_CUSTOMER, PAPP_EMPLOYEE, and PAPP_SUPPLIER). The other two folders, Portal Implementation and Support, contain documents relevant to customers, and therefore have restricted access (PAPP_CUSTOMER and PAPP_EMPLOYEE).

Note the difference in how content was added by clicking the Properties link and reviewing the property details. Some documents were added manually; others were added via the categorization spider. The latter collection is from a directory structure that simulates data from the PeopleSoft corporate Web site. Once the PeopleSoft Enterprise Portal is installed, you will find these files on your Web server under one of the following directories:

For WebLogic:

<weblogic install directory>\bea\wlserver6.1\config\peoplesoft\applications\PORTAL\ps\images\portal_pa\website

For WebSphere:

<websphere install directory>\WebSphere\Appserver\installedApps\peoplesoft\PORTAL\ps\images\portal_pa\website

Products Folder

The Products folder is an example of an auto-expanded folder. About PeopleSoft and Support are also auto-expanded folders. Look at the directory structure under Products. It is divided into three types of product lines, with many products under each. Each product directory typically has subdirectories containing articles, case studies, or both.

The auto-expand feature enables an administrator to define just the desired top folder. Sub-folders underneath are then automatically created as they are encountered. Look at the definition for the Products folder.

Creating an entry on the Content Source page means that content found by the spider can be mapped into this folder. The entry first identifies the content source, a file server crawl called PeopleSoft Documents. During that spider run, if the spider encounters the specified directory, it maps all documents in that directory to this Content Categorization folder. In this case, the source detail is the subdirectory:

/ps/images/portal_pa/website/products

Activate the auto-expand feature by selecting the Expand check box. The 45 sub-folders under Products are then created as sub-folders in the hierarchy, and the 148 documents contained in them are mapped to the appropriate sub-folders. If publish and remove dates are specified, then all 148 documents found under Products inherit these dates.

Access the Folder Security page for the Products folder.

There are four roles assigned for viewing, since this folder contains documents applicable to almost all users. The sub-level folders created with the auto-expand feature also inherit these viewer and administrator roles. Click the Return to Browse Categorized Content link, then click the Products folder and view its contents.

There are no documents contained within this folder, since there were no files in the directory C:\bea\wlserver6.1\config\peoplesoft\applications\PORTAL\ps\images\portal_pa\website\products (only sub-folders). Note that there are no Delete buttons associated with the three sub-level folders, and property information is read-only. Any folder or content that is automatically created by the spider cannot be edited or removed by manual administration. This is because it may change or disappear the next time the spider is invoked. All auto-expanded folders and content can only be added, modified or removed with the spider.

Note. Activate the Expand option at a strategic level: high enough in the taxonomy to minimize manual administration, but not so high that you lose the flexibility to add folders and content manually.

Portal Implementation Folder

The Portal Implementation folder is an example that combines manual entries with auto-expand. First, note that its folder definition does not activate the expand option. Thus, folders and content can be manually added below it.

Navigating into the Portal Implementation folder shows it contains both manual and automated entries.

The three sub-level folders at this level have all been defined manually. This way, they can contain both manual and automated entries (see the Technology folder). Note that the Expand option is activated in each case, so that any sub-folders are automatically created. The two documents, Portal Solutions and Portal Solutions Overview, were manually added and reference approved documents in the Content Management system.

The business_persons_guide.pdf document has been added by the spider. Go to the Folder Properties page and see that most of the attributes of the auto-created document entry cannot be modified. However, an administrator can manually change the title to something more meaningful for users. Also, an administrator can unpublish the document, in case it is deemed inappropriate for any reason.
