Running the Content Categorization Process

This section discusses how to:

Identify data source folders and files.
Define a content source by creating a search definition and a search category, and then deploying them.
Associate a content source with a folder.
Run the content categorization process.

Pages Used to Run the Content Categorization Process

Page Name	Definition Name	Navigation	Usage
Categorized Content Source	EPPCM_CATG_SOURCE	Content Management > Categorized Content > Browse Folders. Click a folder title link on the Browse Categorized Content page to navigate through the folder levels. Click a folder Properties link or click the Add Folder button at the appropriate folder level on the Browse Categorized Content page. Select the Categorized Content Source tab.	Associate an already defined content data source with the selected folder in categorized content.
Categorize Crawled Content	EPPCM_CATG_RUN	Content Management > Categorized Content > Categorize Crawled Content	Read the index and build categorized content.
Process Scheduler Request	PRCSRQSTDLG	Content Management > Categorized Content > Categorize Crawled Content. Click Run.	Run the content categorization Application Engine process.

Identifying Data Source Folders and Files

PeopleSoft Interaction Hub is integrated with the Oracle search engine, and search definitions specify exactly what is crawled.

The Oracle search engine can crawl files or directories located on the server where the search engine is installed, or on network file paths accessible by that server. The search engine can also crawl a web server.

Before running the Application Engine process, you should become familiar with the folder hierarchy and documents available on the source system. You will want to examine the source system to determine:

The root folder in which to begin the crawl.
The depth to which you want the crawler to crawl.
The type of documents you want to retrieve.
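
One way to examine a file source before defining the crawl is to inventory it with a short script. The following Python sketch is an illustration only and is not part of PeopleSoft; the function name `inventory`, the `max_depth` parameter, and the example path are hypothetical. It counts files by extension under a root folder, down to a chosen depth, which may help you decide on the root folder, the depth, and the document types.

```python
import os
from collections import Counter
from pathlib import Path


def inventory(root, max_depth=0):
    """Count files by lowercase extension under root.

    max_depth=0 inspects only the root folder itself; each increment
    descends one more level of subfolders.
    """
    root = Path(root)
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)
        if depth >= max_depth:
            # Clearing dirnames in place stops os.walk from descending further.
            dirnames[:] = []
        for name in filenames:
            counts[Path(name).suffix.lower() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical path; replace it with the folder you plan to crawl.
    for ext, n in sorted(inventory("/data/shared/docs", max_depth=1).items()):
        print(f"{ext}\t{n}")
```

Comparing the counts at different depths shows how many documents sit in subfolders rather than in the root folder itself.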

See the product documentation for PeopleTools: Search Technology, “Creating File Source Search Definitions,” Specifying File Source General Settings.

See the product documentation for PeopleTools: Search Technology, “Creating Web Source Search Definitions,” Specifying Web Source General Settings.

The Oracle default document types for crawling are:

PDF
HTML
TXT (plain text)
Microsoft Word
Microsoft Excel
Microsoft PowerPoint

See the product documentation for PeopleTools: Search Technology, “Creating File Source Search Definitions,” Specifying Document Types.

See the product documentation for PeopleTools: Search Technology, “Creating Web Source Search Definitions,” Specifying Document Types.

Defining a Content Source

As part of defining a content source, you must complete the following tasks:

1. Create a search definition for each file source or web source to be crawled.
   The search definition determines among other things:
   - The starting URL (or multiple URLs).
   - The maximum file size (which could exclude some files from the results).
   - The valid file types (which could exclude other files from the results if not set appropriately).
   See Creating Custom Search Definitions and Categories.
2. Create a corresponding search category for each search definition.
   See Creating Custom Search Definitions and Categories.
3. Deploy the newly created search definition. The corresponding search category is automatically deployed when you deploy the search definition.
   See Deploying Search Definitions and Categories.
4. Build the search index.
   See Building Search Indexes.
5. Create a folder and assign it to a crawl source.
   See Categorized Content Source Page.
6. Run the Application Engine (EPPCM_CATG) process to read the index and build categorized content.
   See Categorize Crawled Content Page.
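
To see how the maximum file size and valid file type settings of a search definition can exclude files from the results, consider the following Python sketch. It is a simplified model, not the search engine's actual logic; the function name `would_be_indexed`, its parameters, and the sample limits are assumptions made for illustration.

```python
from pathlib import PurePath


def would_be_indexed(filename, size_bytes, max_size_bytes, valid_extensions):
    """Return (included, reason) for one file under the given limits."""
    ext = PurePath(filename).suffix.lower()
    if ext not in valid_extensions:
        return False, f"file type {ext or '(none)'} is not in the valid file types"
    if size_bytes > max_size_bytes:
        return False, "file exceeds the maximum file size"
    return True, "included"


if __name__ == "__main__":
    # Hypothetical limits: 10 MB maximum, PDF and Word documents only.
    limits = dict(max_size_bytes=10 * 1024 * 1024,
                  valid_extensions={".pdf", ".docx"})
    samples = [("policy.pdf", 2_000_000),
               ("archive.zip", 500),
               ("manual.pdf", 50_000_000)]
    for name, size in samples:
        print(name, would_be_indexed(name, size, **limits))
```

Running a check like this against a file listing before the crawl can explain why an expected document is missing from the results.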

Categorized Content Source Page

Use the Categorized Content Source page (EPPCM_CATG_SOURCE) to associate an already defined content data source with the selected folder in the content management system.

Navigation:

Content Management > Categorized Content > Browse Folders
Click a folder title link on the Browse Categorized Content page to navigate through the folder levels.
Click a folder Properties link or click the Add Folder button at the appropriate folder level on the Browse Categorized Content page.
Select the Categorized Content Source tab.

This page is used to specify the crawled source that generates the child content and child folders in this parent folder.

Note: You must set up your content sources using custom search definitions and categories before you enter information on this page.

Field or Control	Description
Source Type	Specify the same source type that was specified when creating the custom search definition. Available options are File Source and Web Source.
Search Definition	Select a search definition from the list of available names. The list is restricted to search definitions that you created for crawling content with the selected source type.
Spider Source Values	When you select a value for the search definition using the lookup prompt, the system automatically inserts the correctly formatted string into the Spider Source Values field based on the value defined for the Starting URL field on the Search Definition page.

Categorize Crawled Content Page

Use the Categorize Crawled Content page (EPPCM_CATG_RUN) to read the index and build categorized content.

Navigation:

Content Management > Categorized Content > Categorize Crawled Content

The Categorize Crawled Content page is used to run the content categorization Application Engine (EPPCM_CATG) process.

The content categorization Application Engine process reads the search index through PeopleSoft Search Framework (PTSF) APIs and extracts details about the content, such as the title and description, which are used to create the content. The created content is of type U, with a URL value pointing to either a web page or the location of a file on the server.

When the Application Engine process is re-run, new content is added to the content management system and existing content is ignored. Because existing content is ignored, revised content is not updated; you must manually remove it from Published status or delete it from the PeopleSoft Interaction Hub system before re-running the Application Engine process. Similarly, when a file is renamed, the content title in the content management system is not synchronized; in that case, delete the content from the folder and re-run the process.
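
The re-run behavior described above amounts to an add-only merge. The following Python sketch models it under stated assumptions: the folder is represented as a dictionary keyed by URL, and the `ContentItem` class and `categorize` function are hypothetical names, not PeopleSoft APIs. How the actual process identifies existing content is not specified here, so the URL key is only an assumption.

```python
from dataclasses import dataclass


@dataclass
class ContentItem:
    url: str                  # web page address or file location on the server
    title: str
    description: str = ""
    content_type: str = "U"   # the process creates content of type U


def categorize(folder, indexed_docs):
    """Add items for URLs not yet in the folder; leave existing items untouched.

    folder: dict mapping URL -> ContentItem (assumed key, for illustration)
    indexed_docs: iterable of (url, title, description) tuples from the index
    Returns the list of URLs that were added.
    """
    added = []
    for url, title, description in indexed_docs:
        if url in folder:
            continue  # existing content is ignored, even if the source changed
        folder[url] = ContentItem(url, title, description)
        added.append(url)
    return added


if __name__ == "__main__":
    folder = {}
    categorize(folder, [("file:///docs/a.pdf", "Guide v1", "")])
    # Re-run after a.pdf was revised and b.pdf was added.
    added = categorize(folder, [("file:///docs/a.pdf", "Guide v2", ""),
                                ("file:///docs/b.pdf", "Notes", "")])
    print(added)                               # only b.pdf is new
    print(folder["file:///docs/a.pdf"].title)  # still the first-run title
    # To pick up the revision, delete the item and run the process again.
    del folder["file:///docs/a.pdf"]
    categorize(folder, [("file:///docs/a.pdf", "Guide v2", "")])
    print(folder["file:///docs/a.pdf"].title)
```

In this model, a revised document is picked up only after its existing item is deleted, which mirrors the manual cleanup step described above.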

Note: The Application Engine process cannot process child folders, so only documents from the specified root folder are crawled.

Note: The search engine crawl is for non-secured content. Folder level security does not apply to the crawled content because the content is external to the PeopleSoft Interaction Hub database. If you require security at the folder level, you need to use Managed Content instead, and attach each file in managed content using the standard Managed Content type of file attachment.

Field or Control	Description
Source Type	Specify the same source type that was specified when creating the search definition for crawling. Available options are File Source and Web Source.
Search Definition	Select a search definition from the list of available names. The list is restricted to search definitions that you created for crawling content with the selected source type.
Maximum Document Count	Enter the maximum number of documents and directories that the job should categorize before terminating. Note: The content categorization process can categorize a maximum of 999,999 items.
Create Log File	Select to create a log file. The log file appears in the files subdirectory of the Process Scheduler: PS_CFG_HOME/appserv/prcs/DOMAIN_NAME/files. This file provides details about the processed URLs and their associated folder paths.
Purge Only (Do Not Categorize)	Select to delete the contents of the folder without categorizing.
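
The log file location given for the Create Log File field can be assembled in a script, for example to list the files after a run. This Python sketch assumes that PS_CFG_HOME is available as an environment variable and that you know your Process Scheduler domain name; the function name `log_directory` and the domain name `PRCSDOM` are hypothetical. The name of the log file is not stated here, so the sketch only lists the directory.

```python
import os
from pathlib import Path


def log_directory(ps_cfg_home, domain_name):
    """Build PS_CFG_HOME/appserv/prcs/DOMAIN_NAME/files as a Path."""
    return Path(ps_cfg_home) / "appserv" / "prcs" / domain_name / "files"


if __name__ == "__main__":
    home = os.environ.get("PS_CFG_HOME")
    if home is None:
        print("PS_CFG_HOME is not set")
    else:
        directory = log_directory(home, "PRCSDOM")  # hypothetical domain name
        if directory.is_dir():
            for entry in sorted(directory.iterdir()):
                print(entry.name)
        else:
            print(f"{directory} does not exist")
```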

To read the index and create links to the crawled content:

Select Content Management > Categorized Content > Categorize Crawled Content.
Enter details in the fields.
Click Run.

This example illustrates the Process Scheduler Request page.

After you have run the process using the Process Scheduler Request page, use the Process Monitor page to monitor the status of your process request and verify that the process has completed successfully.

On successful completion of the process, the folder associated with the content source will be populated with content in Published status.
