This chapter provides an overview of PeopleSoft search indexes and discusses how to:
Work with indexes.
Build record-based indexes.
Build file system (spider) indexes.
Build HTTP spider indexes.
Administer search indexes.
Modify the VdkVgwKey key.
This section provides an overview of search indexes and discusses:
Types of indexes.
Components of the search architecture.
Index building.
Search index limitations.
User search strategies.
A search index is a collection of files that is used during a search to quickly find documents of interest. The process of creating the search index is also called building the search index. The set of files that make up the index is a collection. This collection contains a list of words in the indexed documents, an internal documents table containing document field information, and logical pointers to the actual document files.
Fields contain metadata about a document. For example, Author and Title might be fields in an index. VdkVgwKey is a special field that identifies each document; its value is unique across all of the documents in the collection.
The document table is a relational table with one row for each document and columns of fields. Every index can be modified by defining a set of fields for it.
In PeopleSoft search implementations, every search index has a home location where all of the files pertaining to that index are located. This directory is the home directory of the index and is typically located at PS_HOME/data/search/INDEXNAME. You can change this location through application server and process scheduler configuration files. Under this directory is another directory named for the database to which the application server or the process scheduler is connected. The actual collection files reside in this database directory.
Every search index can be modified by changing the configuration files that are associated with the index. These configuration files are known as style files and reside in the style directory under the database directory. A typical set of style files defines the fields for a particular index.
PeopleSoft software supports three types of search indexes:
Record-based indexes.
HTTP spider indexes.
File system indexes.
Record-based indexes are used to create indexes of data in PeopleSoft tables. For example, if the PeopleSoft application has a catalog record that has two fields (Description and PartID), you can create a record-based index to index the contents of the Description and PartID fields. Once the index is created, you can use the PeopleCode search application programming interface (API) to search this index.
HTTP spider indexes index a web repository by accessing the documents from a web server. You typically specify the starting uniform resource locator (URL). Then the indexer walks through all documents by following the document links and indexes the documents in that repository. You can control to what depth the indexer should traverse.
File system indexes are similar to HTTP spider indexes, except that the repository that is indexed is a file system. You typically specify the path to the folder or directory. Then the indexer indexes all documents within that folder. HTTP spider indexes and file system indexes are sometimes collectively referred to as spider indexes. The indexer recognizes a wide variety of document formats, such as Word or Excel documents. The indexer skips any document that is in an unknown format.
PeopleSoft search architecture uses two main technologies: that provided by the PeopleSoft Portal and that provided by Verity. They are connected by the PeopleSoft search API.
PeopleSoft Portal Technologies
The PeopleSoft Portal search technology contains the following components:
Search input field.
Captures a query string that is entered by users in the portal header.
Search API.
Passes the query string that is captured in the search input field to the Verity search engine.
Portal Registry API.
Applies security to filter the search results.
Portal registry.
Contains a repository of content references that can be searched.
Search results page.
Formats and displays search results for the user.
Search options.
Enables users to personalize search behavior and results.
Note. By default, the PeopleSoft search performs case-insensitive searches.
The basic items of the Verity architecture that are incorporated in the PeopleSoft Portal search architecture are:
Verity collection.
This is the set of files forming a search index. When a user performs a search, the search is conducted against the Verity collection. You can create and maintain your own collections with the Search Design and Search Administration PeopleTools.
BIF file.
This is an intermediate file that is created in the process of building a Verity collection. The BIF file is a text file that is used to specify the documents to be submitted to a collection. It contains a unique key, the document size (in bytes), field names and values, and the document location in the file system.
XML file.
This is another intermediate file that is created in the process of building a Verity collection. The XML file is a text file named indexname.xml that contains all of the information from the documents that are searchable but not returned in the results list. This information is stored in zones. Zones are specific regions of a document to which searches can be limited.
Style files.
These files describe a set of configuration options that are used to create the indexes that are associated with a collection.
mkvdk.
This Verity command-line tool is used to:
Index a collection.
Insert new documents into a collection.
Perform simple maintenance tasks, like purging and deleting a collection.
Control indexing behavior and performance.
To create and administer search indexes for use with PeopleSoft software, use the PeopleTools utilities under PeopleTools, Search Engine. The utilities enable you to administer indexes and to create file system, HTTP spider, and record-based indexes.
For both HTTP spider and file system indexes, options are available to include or exclude certain documents based on file types and Multipurpose Internet Mail Extensions (MIME) types. The index building procedure is different for record-based indexes and the spider indexes. Typically, the index building procedure is carried out from an Application Engine job that is scheduled by using the process scheduler.
The steps for building record-based indexes are:
The data from the application tables is read and two files called indexname.xml and indexname.bif are created.
indexname.xml contains one XML record for each document that needs to be indexed. The XML record contains all of the data that needs to be indexed. indexname.bif contains field information, the VdkVgwKey of each document, and offsets that denote the start and end of each document in the XML file.
The XML and the bulk insert file (BIF) files are typically generated through PeopleCode and reside in the home location of the index. The Verity utility, mkvdk, is called, passing in the BIF file as the argument to build the index.
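The relationship between the two intermediate files can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `<row>` element layout, the dictionary used for each bulk entry, and the function name `build_bulk_files` are hypothetical and do not reproduce the actual Verity BIF syntax or the PeopleCode that generates these files. The sketch shows one XML record per document and, for each document, a key plus the byte offsets of its record.

```python
from xml.sax.saxutils import escape


def build_bulk_files(rows, key_field):
    """Return (xml_text, entries) for a list of row dictionaries.

    Hypothetical layout: each row becomes one <row> record in the XML text,
    and each entry holds the document key plus the byte offsets of that
    record. This is NOT the real Verity BIF format.
    """
    xml_parts = []
    entries = []
    offset = 0
    for row in rows:
        body = "".join(
            f"<{name}>{escape(str(value))}</{name}>" for name, value in row.items()
        )
        record = f"<row>{body}</row>\n"
        size = len(record.encode("utf-8"))  # offsets are counted in bytes
        entries.append(
            {"key": str(row[key_field]), "start": offset, "end": offset + size}
        )
        xml_parts.append(record)
        offset += size
    return "".join(xml_parts), entries


rows = [
    {"PARTID": "A100", "DESCR": "Left & right bracket"},
    {"PARTID": "B200", "DESCR": "Hinge"},
]
xml_text, entries = build_bulk_files(rows, "PARTID")
```

Slicing the encoded XML text with the start and end offsets of an entry yields exactly the record for that document, which is the role the offsets play in the description above.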
The steps for building spider indexes are:
The Verity utility, vspider, is called.
The vspider utility takes a number of arguments, but the most important ones are the starting URL or directory to spider and the number of links to follow.
The vspider utility walks through all of the documents in the repository and builds the index.
Following are the PeopleSoft search index limitations:
Verity collections must reside on the PeopleSoft application server or be accessible from it through a shared drive.
Satisfying this requirement can take several forms, depending on the application server's operating system. On Microsoft Windows, this could be a network drive. On UNIX, this could be an NFS-mounted drive.
Verity collections are most efficient if you index large groups of data, rather than indexing one or two documents at a time.
Small updates degrade the index and require that you run the Verity cleanup utility.
Style files are located in the style subdirectory of the index.
To make style changes, apply them to the files in this directory.
You can have only one language per collection.
Additionally, a number of Verity search index features are limited to certain maximum values, as follows:
Feature | Limitation
Wildcards | Wildcard auto-expansion is limited to 16,000 matches.
Number of collections | The maximum number of physical collections that can be searched at one time is 128.
Documents per collection | The maximum number of documents allowed per collection is 16 million, subject to disk space availability.
Fields per collection | The maximum number of fields allowed per collection is 250.
Field length | The maximum length of any field is 32 kilobytes. Note. The equivalent number of characters depends on the character set being used.
Field value length in bulk files | The maximum length of a field value in a bulk file is 32 kilobytes. Note. The equivalent number of characters depends on the character set being used.
Zones per document | The number of zones allowed per document is unlimited.
Characters in path | The maximum path size allowed is 256 characters.
Maximum documents with sort specification | The maximum number of documents that are returned when a sort specification is applied is 16,000.
Sort fields per search | The maximum number of fields that can be included in a sort specification is 16.
Refer to the Verity documentation for details about these features.
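A few of these limits lend themselves to a pre-build sanity check. The following Python sketch is one possible way to express such a check; the function name `check_limits` and its parameters are hypothetical, and only the numeric limits come from the table above.

```python
# Numeric limits taken from the table above; the names are illustrative.
VERITY_LIMITS = {
    "collections per search": 128,
    "fields per collection": 250,
    "characters in path": 256,
    "sort fields per search": 16,
}


def check_limits(collection_count, field_count, index_path, sort_field_count):
    """Return a list of messages, one for each limit that is exceeded."""
    measured = {
        "collections per search": collection_count,
        "fields per collection": field_count,
        "characters in path": len(index_path),
        "sort fields per search": sort_field_count,
    }
    return [
        f"{name}: {value} exceeds {VERITY_LIMITS[name]}"
        for name, value in measured.items()
        if value > VERITY_LIMITS[name]
    ]


problems = check_limits(2, 300, "/data/search/INDEX_A", 17)
```

An empty list means that none of the checked limits is exceeded. The sketch does not cover the limits that depend on the character set, such as field length.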
A user submits a search request by entering a search string into the search input form field in the portal header. The “<form action=...>” element in the portal header is generated at runtime to link to a PeopleSoft Internet Architecture page, and JavaScript submits the form. The query string is passed to the Search API as a parameter named PortalSearchQuery to find matching results. Those results are filtered for security through PeopleCode by the Portal Registry API. The search results page echoes the original query string and displays a list of content references that match the request. If the user clicks the Go button but does not enter a search query, the search results page displays without any results.
The search results page performs the following steps:
Changes the case of the entered text to all uppercase characters.
By default, the Verity search engine searches for all mixed-case variations when a query string is entered in all lowercase or in all uppercase. However, search queries that are entered in mixed case automatically become case-sensitive. (For example, a query on Apple is treated as a case-sensitive request and finds only the precise string Apple, while a query on apple finds APPLE, Apple, and apple.) The portal makes one important change: it converts the query string to all uppercase, which prevents users from executing truly case-sensitive searches. This avoids situations in which mixed-case searches would otherwise return no results. On the search results page, however, the original case is echoed back to the user.
Formats the query string to pass to the Search API.
This includes filtering out expired and hidden content references, and content references that are not valid yet.
Calls the Search API.
This returns the query results.
Calls the Portal Registry API.
This is done to apply security filtering to the results. Security is applied in PeopleCode by checking the Authorized property.
Formats and displays search results.
This completes the user's search request.
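The steps above can be sketched in Python as follows. This is a simplified model, not the portal's PeopleCode: `portal_search`, `toy_search_api`, `toy_is_authorized`, and the sample content references are all hypothetical stand-ins, and the step that filters out expired, hidden, and not-yet-valid content references is omitted.

```python
def portal_search(query, search_api, is_authorized):
    """Model the results page: uppercase, search, filter, echo the original."""
    if not query.strip():
        # No query entered: the page displays without any results.
        return {"echo": query, "results": []}
    engine_query = query.upper()  # avoids accidental case-sensitive searches
    hits = search_api(engine_query)
    visible = [hit for hit in hits if is_authorized(hit)]
    return {"echo": query, "results": visible}


# Toy stand-ins for the Search API and the Portal Registry security check.
CONTENT_REFS = {"APPLE ORDERS": True, "APPLE PAYROLL": False, "PEAR ORDERS": True}


def toy_search_api(engine_query):
    return [name for name in CONTENT_REFS if engine_query in name]


def toy_is_authorized(name):
    return CONTENT_REFS[name]


page = portal_search("Apple", toy_search_api, toy_is_authorized)
```

In this model the query is echoed back in its original case, while the search itself runs on the uppercase form and unauthorized hits are dropped before display.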
This section provides overviews of common controls and supported MIME types, and discusses how to:
Open existing collections.
Create new collections.
The following controls appear on the pages that are used for designing record-based, file system, or HTTP spider indexes.
Index |
Shows the name of the index that you opened or the name that you gave the index on the Add New Value page. |
Build Index |
Invokes the collection build program. Before clicking this button, select all of the appropriate options for the collection. |
Test Index |
After building an index, click to test that the build program assembled the index properly. The Test Index page contains a single text field with a query button. Enter text to search for in the collection and click the [?] button to submit the query. The results return a list of the keys that are stored by Verity in the collection. |
Show Logs |
View the log files that are produced by the collection build program during execution. This is used mainly for troubleshooting. |
Append to Verity Command Line |
This control is for PeopleSoft internal use only. |
The following list contains the supported document MIME types. Any document that is not one of these types is ignored during the indexing process.
application/msword
application/wordperfect5.1
application/x-ms-excel
application/x-ms-powerpoint
application/x-ms-works
application/postscript
application/rtf
application/x-lotus-amipro
application/x-lotus-123
application/x-ms-wordpc
application/x-corel-wordperfect
application/x-wordprocessor
application/x-spreadsheet
application/x-presentation
application/x-graphics
application/x-keyview
application/x-ms-write
application/pdf
application/x-executable
message/rfc822
message/news
text/html
text/sgml
text/xml
text/ascii
text/enriched
text/richtext
text/tab-separated-values
text/plain
text/x-empty
image/gif
application/x-verity
To open an existing collection:
Select PeopleTools, Search Engine.
From the available menus, select the type of collection that you want to open, as in record-based indexes, file system indexes, or HTTP spider indexes.
On the Find an Existing Value tab, use the Search for drop-down list box to select the appropriate criteria (begins with or contains).
In the edit box to the right, enter the character string that reflects the appropriate begins with or contains criteria.
Click Search.
To create a new collection:
Select PeopleTools, Search Engine.
From the available menus, select the type of collection that you want to create, as in record-based indexes, file system indexes, or HTTP spider indexes.
Select the Add a New Value page.
Enter a name for the collection.
Click Add.
Specify the appropriate attributes for the collection as described in the following sections.
Save your work.
Note. You cannot create indexes of the same name even if they are of different types; for example, record, HTTP, or file.
Build the index.
The record-based index extracts data from database tables and inserts the data into BIF and XML files, which are then indexed by Verity. The individual creating the index chooses the records (tables) to be indexed.
Note. The record-based index supports only data that is stored in PeopleSoft databases.
This section discusses how to:
Modify record-based index properties.
Add subrecords to search indexes.
Select PeopleTools, Search Engine, Record-Based Indexes to access the Design a Search Index page.
Parent Data Record
Record (Table Name) |
Enter a table, a view, or a PeopleSoft view that contains data. To combine the data from multiple PeopleSoft tables, create a view on those tables and specify the name of that view here. |
WHERE clause to append |
Fine-tune the data that you receive by entering a Structured Query Language (SQL) WHERE clause. |
Key returned in search results |
Use to synthesize the VdkVgwKey, which supports an XML-like syntax enabling you to modify the tag that is returned by Verity. You have the following options:
|
Edit Key |
Click to access the page where you can change the results that are returned by the Key returned in search results functionality. |
Fields
How to Zone the Index |
One Zone: Select to put all of the data into one zone. With this option, the collection builds more quickly but the application can't restrict searches to the portions of the index that come from a particular field. Field Zones: Select to create one zone for each PeopleSoft field on the record. Applications can specify that they want to access that particular zone in their searches. |
Field Name |
After you specify a record name, the fields in that record appear in this grid. Select the following options for each field in the record: Verity Field, Word Index, or Has Attachment (each option is explained in the following sections). |
Verity Field |
Select if the PeopleSoft field should be indexed as a Verity field. In general, PeopleSoft fields that contain a lot of descriptive text, such as description fields, should be indexed as word indexes (See the following definition) and PeopleSoft fields that contain metadata about what is being indexed (such as ProductID) should be indexed as Verity fields. |
Word Index |
Select if this PeopleSoft field should be indexed as a word index. See the preceding Verity Field definition for guidelines on defining a PeopleSoft field as a Verity field versus defining it as a word index. |
Has attachment |
Enables you to index attachments that are referenced in the field as uniform resource identifiers (URIs). Refer to the PeopleCode Developer's Guide for a description of file attachments. If this field contains the URL to an attachment, select this check box. The indexer downloads the attachment and indexes it as part of the document. This item is enabled only if the corresponding PeopleSoft field contains character data, because numeric fields cannot contain URLs. To use this field, you need a record that is designed with this feature in mind. In the record, each row has a text field that contains a URI or an empty string. The text must be a valid File Transfer Protocol (FTP) URI (including the login and password string) of the following form:
The third form references an entry in the URL table (Utilities, Administration, URLs). If the URL ID that is named in the name attribute is valid, the entire URI is rewritten with the part in brackets replaced by the actual URI. For example, if A_URLID is equal to ftp://anonymous:[email protected], the entire string in the previous example becomes ftp://anonymous:[email protected]/path/to/file.doc and is treated like any other FTP URI. Rows of data with empty strings in the URI field are ignored with no error. If the string is one of these three valid URI forms and a document can be retrieved at that URI, the document is indexed with the same key as the rest of the row of data and is searchable. |
To add subrecords to the index, select the Subrecords tab, and insert the child records that you want to include in the index.
Select PeopleTools, Search Engine, Record-Based Indexes, Subrecords.
To index more than one record as a single document, the records must be hierarchically related. For example, the record that is specified on the previous page must be a parent of all the others. Formally, this means that the keys of each subrecord named must be a superset of the keys of the parent record. The parent record is the one that you specify in the Record (Table Name) field on the Primary Record page.
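The key requirement can be expressed as a simple set comparison. The following Python sketch is illustrative only; the function name and the sample key names are hypothetical.

```python
def can_be_subrecord(parent_keys, child_keys):
    """Return True if the child's keys are a superset of the parent's keys."""
    return set(parent_keys) <= set(child_keys)


# Hypothetical key lists for a parent record and two candidate children.
parent = ["SETID", "PARTID"]
line_item = ["SETID", "PARTID", "LINE_NBR"]
unrelated = ["SETID", "VENDOR_ID"]

eligible = can_be_subrecord(parent, line_item)
```

Here `line_item` qualifies because it carries every parent key plus one of its own, while `unrelated` does not because it lacks PARTID.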
To add subrecords to an index:
Create and save the index definition.
Select PeopleTools, Search Engine, Record-Based Indexes, Subrecords.
Click the Add a new row button to insert the names of the records that are children of the parent record that is defined on the Primary Record page.
On the Primary Record page, the fields of the child record are added to the Fields grid. When you build the index, data from the child records whose keys match the row in the parent record is included as part of the parent record. When an end user searches for data that is found in the child record, the system returns a reference (VdkVgwKey) for the parent record.
You can index file systems that are local to the application server. This refers to any file system on the physical server on which your application server domain runs, and it also refers to any drives that are accessible from the application server machine. File systems might include file servers, report repositories, and so on.
The index is compiled by using vspider. The program descends into the directory structure recursively and indexes the file types that you've selected to be indexed. It indexes only files that Verity supports for collections.
This section discusses how to:
Set file system options.
Define what to index.
Select PeopleTools, Search Engine, Filesystem Indexes to access the Filesystem Options page.
List local filesystem paths to spider |
Specify the network file system path that contains the documents to index. Ensure that the local application server has the proper access to the file systems that you include in the list. For Microsoft Windows, this means the drive mappings must be set up from the application server. For UNIX, this means the correct network file system (NFS) mappings must be set on the application server. To add a system path to the list, click the plus button. To remove a file system, click the minus button. |
Remap Path to This URL |
Do not use. |
Select PeopleTools, Search Engine, Filesystem Indexes, What to Index to access the What to Index page.
Index all Mime-types |
Select to index all MIME types on a website. |
Index only these Mime-types |
Select to index only a certain MIME type, and specify the file type in the MIME/Types Allowed list box. Separate multiple MIME types with a space. |
Exclude these Mime-types |
Select to exclude a set of MIME types, and specify the MIME types to exclude. Separate multiple MIME types with a space. |
MIME/Types Allowed |
Add a list of MIME types, separated by spaces, if you selected Index only these Mime-types or Exclude these Mime-types. |
Filenames
Index all filenames |
Select to index all file types. |
Index only these filenames |
Select to index only a certain file type, and specify the file type in the Pathname Globs List list box. |
Exclude these filenames |
Select to exclude a set of file types, such as temporary files, but to index all others. Also specify the file types to exclude. |
Pathname Globs List |
Add the files that you want to incorporate into your index. Separate the entries with spaces. You can use wildcard characters (*) to denote a string and “?” to denote a single character. For example, the string '*.doc 19??.excel' means select all files that end with the “.doc” suffix and Microsoft Excel files that start with 19, followed by 2 characters. |
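The behavior of the three filename options can be approximated with Python's `fnmatch` module, which supports the same `*` and `?` wildcards. This is a sketch under assumptions: the function names and mode strings are hypothetical, and matching against the bare file name (rather than the full path) is an assumption, not documented vspider behavior.

```python
from fnmatch import fnmatchcase


def matches_globs(filename, globs_list):
    """Return True if the file name matches any space-separated pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in globs_list.split())


def select_files(filenames, mode, globs_list=""):
    """Apply one of the three filename options to a list of file names."""
    if mode == "all":
        return list(filenames)
    if mode == "only":
        return [name for name in filenames if matches_globs(name, globs_list)]
    if mode == "exclude":
        return [name for name in filenames if not matches_globs(name, globs_list)]
    raise ValueError(f"unknown mode: {mode}")


files = ["plan.doc", "1998.excel", "198.excel", "notes.tmp"]
only = select_files(files, "only", "*.doc 19??.excel")
```

With the example string from the description above, `plan.doc` and `1998.excel` are selected, while `198.excel` is not because only one character follows 19.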
HTTP spider indexes are similar to the indexes that the spider functionality compiles for the file system index. When using the spider index on a website, vspider starts at the home page of the site and then follows each link on that page to the next level of the site. For each page at the next level, vspider follows each link on each page. After following a link, vspider indexes all of the data on the target page.
You can specify as many websites as you want, and you can configure the depth, or number of layers of links, that vspider follows into a website and indexes.
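The effect of the depth setting can be illustrated with a breadth-first traversal over a small in-memory link graph. This Python sketch is not how vspider is implemented; the function name, the graph, and the choice to include the starting page in the result are assumptions made for illustration.

```python
from collections import deque


def pages_within_depth(links, start, depth):
    """Return the pages reached from start by following at most depth links."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, level = queue.popleft()
        order.append(page)
        if level == depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order


# Hypothetical site: each key links to the pages in its list.
site = {"home": ["a", "b"], "a": ["c"], "b": ["c", "home"], "c": ["d"]}
depth_one = pages_within_depth(site, "home", 1)
depth_two = pages_within_depth(site, "home", 2)
```

Each additional level can add every page linked from the previous level, which is why the number of pages visited can grow quickly as the depth increases.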
This section discusses how to:
Define HTTP gateway settings.
Define what to index.
Select PeopleTools, Search Engine, HTTP Spider Indexes to access the HTTP Gateway page.
Depth of Links to Follow |
Set the level of detail that you want to index within a certain site. If you enter 1, vspider starts at the homepage and follows each link on that page and indexes all of the data on the target pages. Then it stops. If you enter 2, vspider follows the links on the previous pages and indexes one more level into the website. As you increase the number, the number of links that vspider follows increases geometrically. Do not set this value too high, because it can impact performance negatively. You should not need to set this value higher than 10. |
List http://URLs to spider |
Click the plus button to add multiple URLs to spider. Click the minus button to remove a URL from the list. If you forget to include the http:// (scheme) portion of the URL, the system automatically includes it. URLs should contain only the alphanumeric characters as specified in RFC 1738. Any special character must be encoded. For example, encode a space character as %20, and encode a < as %3c. Additional examples are available. |
Stay in Domain |
Select to limit spidering to a single domain. For example, suppose that you are spidering www.peoplesoft.com and you select this option. If a link points to a site outside the PeopleSoft domain (as in yahoo.com), the collection ignores the link. |
Stay in Host |
Select to further limit spidering within a single server. If you select this option, the collection contains references to content only on the current web server or host. Links to content on other web servers within the domain are ignored. For example, if you are spidering www.peoplesoft.com and you select this option, you can index documents on www.peoplesoft.com, but not on www1.peoplesoft.com. |
Proxy Hostname and Proxy Port |
Enter a host and port for vspider to use. Enter the same settings that you would use in your web browser if you need a proxy to access the internet. |
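The URL handling and the two scope options described above can be sketched in Python with the standard `urllib.parse` module. The function names are hypothetical, only the path portion of the URL is encoded, and treating the last two host labels as the domain is a simplification that does not hold for every top-level domain.

```python
from urllib.parse import quote, urlsplit


def normalize_url(url):
    """Add a missing http:// scheme and percent-encode special path characters."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    # safe="/%" leaves path separators and existing escapes untouched.
    return f"{parts.scheme}://{parts.netloc}{quote(parts.path, safe='/%')}"


def domain_of(host):
    """Simplified domain: the last two labels of the host name."""
    return ".".join(host.split(".")[-2:])


def in_scope(link, start, stay_in_domain=False, stay_in_host=False):
    """Decide whether a link may be followed under the selected options."""
    link_host = urlsplit(link).hostname
    start_host = urlsplit(start).hostname
    if stay_in_host and link_host != start_host:
        return False
    if stay_in_domain and domain_of(link_host) != domain_of(start_host):
        return False
    return True


start = normalize_url("www.peoplesoft.com/my docs/a<b.html")
```

Note that `quote` emits uppercase hexadecimal digits (for example, %3C), which is equivalent to the lowercase form shown above.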
Select PeopleTools, Search Engine, HTTP Spider Indexes, What to Index. The fields on this page are documented in a previous section.
After you design and build your search indexes, the Search Administration interface enables you to schedule when and how frequently the indexes must be rebuilt. An important aspect of maintaining the collections involves scheduling PeopleSoft Process Scheduler jobs that, on a regular basis, rebuild the collection completely or incrementally update the index. Search index administration also includes deleting old indexes and building indexes to support additional languages.
This section discusses how to:
Specify the index location.
Administer the search index.
Edit properties.
Schedule administration.
Share indexes between application servers and PeopleSoft Process Scheduler.
By default, the files for an index are located in PS_HOME/data/search/indexname/db_name/language_code. However, you can change this location by specifying the search index location property in the application server and process scheduler configuration files.
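As a small illustration, the default location can be assembled from its parts as in the following Python sketch. The function name and the sample values (installation directory, index name, database name, and language code) are hypothetical.

```python
from pathlib import PurePosixPath


def default_index_dir(ps_home, index_name, db_name, language_code):
    """Build PS_HOME/data/search/indexname/db_name/language_code."""
    return (
        PurePosixPath(ps_home)
        / "data"
        / "search"
        / index_name
        / db_name
        / language_code
    )


location = default_index_dir("/opt/pshome", "INDEX_A", "HRDEV", "ENG")
```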
Set the search index location at the application server level in the application server configuration file, PSAPPSRV.CFG. This enables you to specify alternate search index locations for an application server, if necessary. You also need to set this property in the process scheduler configuration file, PSPRCS.CFG, to point to the same location as specified in the application server configuration file.
Note. You must manually edit the file to include the locations. You do not add search index locations by using PSADMIN.
To add a search index location on the application server:
Open the PSAPPSRV.CFG file for the appropriate application server domain.
Locate the Search Indexes configuration section.
For example:
[Search Indexes]
;=========================================================================
; Search index settings
;=========================================================================
; Search indexes can be given alternate locations if there is an entry here.
; Entries look like: IndexName=fs location (ie EMPLOYEE=c:\temp)
Add an entry for each search index location that you want to specify for an application server by using the following syntax:
index_name=location
For example, to specify the location for search INDEX_A and INDEX_B, your entries would look similar to the following:
[Search Indexes]
;=========================================================================
; Search index settings
;=========================================================================
; Search indexes can be given alternate locations if there is an entry here.
; Entries look like: IndexName=fs location (ie EMPLOYEE=c:\temp)
INDEX_A=c:\temp
INDEX_B=n:\search
Note. Make sure that your entries are not commented out with a semicolon (;) appearing before them.
Save the PSAPPSRV.CFG file.
Note. The previous procedure assumes that you've already used the Search Index Designer to define, build, and store the search indexes that you specify in the PSAPPSRV.CFG file.
Repeat the process with PSPRCS.CFG for PeopleSoft Process Scheduler.
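Because the same entries must appear in both configuration files, it can be useful to read them back and compare them. The following Python sketch uses the standard `configparser` module on a small fragment; it is an illustration only. The function name and the sample text are hypothetical, and a complete PSAPPSRV.CFG or PSPRCS.CFG file may contain constructs that `configparser` does not accept.

```python
import configparser

# Hypothetical fragment; INDEX_B is commented out and is therefore ignored.
SAMPLE_CFG = r"""
[Search Indexes]
; Entries look like: IndexName=fs location
INDEX_A=c:\temp
;INDEX_B=n:\search
"""


def read_index_locations(cfg_text):
    """Return the active index_name=location entries as a dictionary."""
    parser = configparser.ConfigParser(
        interpolation=None, delimiters=("=",), comment_prefixes=(";",)
    )
    parser.optionxform = str  # keep index names in their original case
    parser.read_string(cfg_text)
    if not parser.has_section("Search Indexes"):
        return {}
    return dict(parser["Search Indexes"])


locations = read_index_locations(SAMPLE_CFG)
```

Running the same function on the text of both files and comparing the two dictionaries is one way to confirm that the application server and the process scheduler point to the same locations.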
Select PeopleTools, Search Engine, Administration to access the Search Index Admin page.
Index |
Displays the name of the index so that you can identify specific indexes. To select an index, select the check box to the left of the index name. |
Index Location |
Displays the current location of the index. |
Edit Properties |
Click to access the interface for changing the index location and to build indexes to support additional languages. |
Schedule |
Click to access the interface for scheduling the program that maintains your collection. |
Delete checked Indexes |
If you have selected indexes to be deleted, click this button to remove them from the system. The deletion process deletes the index definition and the collections that are stored in the file system. |
Note. If you attempt to delete a scheduled index, you may see SQL errors on IBM DB2 UDB or Sybase database platforms.
Select PeopleTools, Search Engine, Administration, Edit Properties.
Index Location |
Displays the current location of the index. |
Language Code |
Select the language for which you want to build an index. |
Language to Map |
Currently disabled. |
Build |
After you add the additional indexes, click to create the indexes. |
Note. Style files are located in the style subdirectory of the index. To make style changes, apply them to the files in this directory.
Select PeopleTools, Search Engine, Administration, Schedule.
Add a new Recurrence Definition |
In PeopleSoft Process Scheduler, you define run recurrence definitions that enable you to schedule jobs to run at regular intervals, such as monthly, weekly, daily, and so on. The more current you keep the collections, the more accurate your search results will be. |
Type of Build |
Rebuild: Select to drop the existing collection and rebuild a new collection. This applies to all types of collections. Increment: Use only for the spider indexes. For record-based indexes, only the Rebuild option is available. |
Run Recurrence Name |
Select the appropriate run recurrence definition for the collection maintenance requirements. |
Server Name |
Specify the PeopleSoft Process Scheduler server on which you want the build program to run. The PeopleSoft Process Scheduler system must be installed and configured before you can schedule the collection build program to run as a job. |
The index files reside on a file system at the home location and must be accessible to all application servers and process schedulers that will manipulate the index. An application server uses the index for searching, while the process scheduler invokes an Application Engine program that builds the indexes. Therefore, if you are running a process scheduler on a different machine than the application server, ensure that the index files are accessible to both. You can do this in one of three ways:
Make a Microsoft Windows shared drive or NFS file system available for the index.
Specify the index location in both the application server and the process scheduler to point to the shared directory.
Run an instance of the process scheduler on the application server host and schedule only the building of indexes on this process scheduler.
Because the process scheduler and the application servers are running on the same host, they create and read files from the same location.
Use an external program such as FTP or Secure Copy (SCP) to copy all of the files and directories in the index home location from the process scheduler host (after the index has been built) to the application server host so that they are available for searching.
To make the VdkVgwKey more readable and easier to parse, use the following XML-like syntax:
<field fieldname='MYFIELD'/>
<row/>
<pairs/>
<sql stmt="SELECT 'Y' FROM PS_INSTALLATION"/>
Fieldname and the SQL statement support single and double quotes, as well as no quotes at all (in which case only the first word is considered part of the option).
Using double quotes for the SQL statement is recommended.
The SQL statement must return only one column.
Multiple rows are ignored. Trying to return more than one column results in a collection-build-time error.
Currently, the only tag style that is supported is <tag/> with the slash (/) at the end.
The VdkVgwKey can include any amount of literal text interspersed with the tags.
This text is copied into the VdkVgwKey that goes into the BIF file, unmodified.
Field names are automatically set in uppercase.
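The quoting and uppercase rules for the field tag can be illustrated with a short Python sketch. It handles only the field tag and literal text; the row, pairs, and sql tags are not modeled. The function name, the sample row, and the assumption that a field tag is replaced by the value of the named field are all hypothetical and are not taken from the PeopleSoft implementation.

```python
import re

# Matches <field fieldname=.../> with single quotes, double quotes, or no quotes.
FIELD_TAG = re.compile(
    r"""<field\s+fieldname=(?:'([^']*)'|"([^"]*)"|([^\s/>]+))\s*/>"""
)


def expand_key(template, row):
    """Replace each field tag with the row value; copy literal text unchanged."""

    def substitute(match):
        name = next(group for group in match.groups() if group is not None)
        return str(row[name.upper()])  # field names are treated as uppercase

    return FIELD_TAG.sub(substitute, template)


row = {"PARTID": "A100", "DESCR": "Hinge"}
key = expand_key("PART:<field fieldname='partid'/>:<field fieldname=descr/>", row)
```

The literal text PART: and the colon between the tags pass through unmodified, which mirrors the rule that literal text is copied into the key as is.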