This chapter provides an overview of the extract, transform, and load (ETL) process within EPM Foundation and discusses:
Prerequisites.
ETL load strategy.
Special loading requirements.
Data validation.
ETL setups.
Before you read this chapter you should be familiar with Ascential DataStage and the following documentation:
PeopleSoft Enterprise Performance Management 8.9 Supplemental Installation Documentation.
EPM 89 ETL Implementation Steps posted on Customer Connection.
This section discusses:
ETL and the EPM Architecture.
Ascential DataStage.
Key DataStage Components.
The PeopleSoft delivered ETL process enables you to extract data from disparate source transaction systems, integrate the data in a single EPM database, transform and enrich the data, and load it into specific EPM data models that are optimized for analysis and reporting. This process is facilitated by the best-in-class data integration platform Ascential DataStage and PeopleSoft delivered ETL jobs.
The ETL process migrates data across all layers of EPM warehouse structures and consists of two load types:
Stage I Load: Consists of all ETL jobs that extract data from your source transaction system and load it into Operational Warehouse - Staging (OWS) tables. Also included in this type of load (but less common) are ETL jobs that extract data from your source transaction system and load it directly into Multidimensional Warehouse (MDW) tables.
Stage II Load: Consists of all ETL jobs that extract data from the OWS tables and load it into the Operational Warehouse - Enriched (OWE) or the Multidimensional Warehouse (MDW) tables. Also included in this type of load (but less common) are ETL jobs that extract data from the OWE and load it into the MDW.
The following diagram depicts the ETL process through each layer of the EPM architecture.
ETL in EPM
After your data is extracted from the OWS it is loaded into specialized data models (target warehouse tables designed to aggregate or enrich your data), which are used by the analytical applications and functional warehouses for reporting and analysis.
See Also
Understanding PeopleSoft Enterprise Performance Management (EPM) Foundation
PeopleSoft has an original equipment manufacturer (OEM) agreement with Ascential for its DataStage tool and bundles this offering with the PeopleSoft Performance Management Warehouse. The Ascential DataStage tool uses ETL jobs to target specific data from a source database and migrate it to the OWS, OWE, and MDW tables.
For details on the tool itself and how to use it, refer to the Ascential documentation listed below. You can install PDF versions of the Ascential books on a Windows system as part of the Ascential tools install.
The following table lists the Ascential documentation and the information provided.
Ascential Book | Description
DataStage Designer Guide | Provides information about the DataStage Designer, and gives a general description of how to create, design, and develop a DataStage application.
DataStage Manager Guide | Provides information about the DataStage Manager and describes how to use and maintain the DataStage Repository.
DataStage Server: Server Job Developer Guide | Provides information about the tools that are used in building a server job and supplies programmer's reference information.
DataStage Enterprise Edition: Parallel Job Developer Guide | Provides information about the tools that are used in building a parallel job and supplies programmer's reference information.
DataStage Enterprise MVS Edition: Mainframe Job Developer Guide | Provides information about the tools that are used in building a mainframe job and supplies programmer's reference information.
DataStage Director Guide | Provides information about the DataStage Director and how to validate, schedule, run, and monitor DataStage server jobs.
DataStage Administrator Guide | Provides information about DataStage setup, routine housekeeping, and administration.
DataStage Install and Upgrade Guide | Provides instructions for installing DataStage on Windows and UNIX platforms, and for upgrading existing installations of DataStage.
DataStage NLS Guide | Provides information about using the NLS features that are available in DataStage when NLS is installed.
Ascential DataStage contains many different components that support the ETL process. Some of these components include stages, jobs, and parameters. Complete details on all of the DataStage components can be found in the Ascential Designer Guide documentation. Only the following key DataStage components are discussed in this section:
DSX Files
Jobs
Hash Files
Environmental Parameters
Shared Containers
Routines
PeopleSoft delivers a *.dsx file for each functional area within EPM. As part of your installation and configuration process you import the *.dsx file into a project that has been defined in your development environment. Included in the *.dsx file are various DataStage objects that define your project. The *.dsx files are organized by functional area and contain related ETL jobs.
To see a list of the PeopleSoft-delivered *.dsx files, refer to the file “DSX Files Import Description.xls” located in the following install CD directory path: <PSHOME>\SRC\ETL.
Each delivered *.dsx file contains the DataStage objects described in the following sections.
PeopleSoft delivers predefined ETL jobs for use with Ascential DataStage. ETL jobs are collections of linked stages, data elements, and transformations that define how to extract, transform, and load data into a target database. Stages are used to transform or aggregate data and to look up information. More simply, ETL jobs extract data from source tables, process it, then write the data to target warehouse tables.
PeopleSoft delivers five types of jobs that perform different functions depending on the data being processed and the warehouse layer in which it is processed:
Load Stage | Type | Description
I | Source to OWS | Jobs in this category extract data from your transaction system and populate target warehouse tables in the OWS layer of the warehouse. Source to OWS jobs assign a source system ID (SRC_SYS_ID) for the transaction system from which you are extracting data and populate the target OWS tables with that ID.
I | Source to MDW | Jobs in this category extract data from your transaction system and populate target dimension and fact tables in the MDW layer of the warehouse. The Online Marketing data mart is the only product to use this type of job.
II | OWS to OWE | Jobs in this category extract data from the OWS tables and populate target D00, F00, and base tables in the OWE layer of the warehouse. OWS to OWE jobs perform lookup validations for the target OWE tables to ensure there are no information gaps and to maintain referential integrity. Many of the jobs aggregate your transaction data for the target F00 tables.
II | OWS to MDW | Jobs in this category extract data from the OWS tables and populate target DIM and FACT tables in the MDW layer of the warehouse. OWS to MDW jobs generate a surrogate key that facilitates dimension key resolution; the surrogate key value is used as the primary key in the target DIM table and as the foreign key in the FACT table (a sketch of this key resolution follows the table). The jobs also perform lookup validations for the target DIM and FACT tables to ensure there are no information gaps and to maintain referential integrity.
II | OWE to MDW | Jobs in this category extract data from the OWE tables and populate target DIM and FACT tables in the MDW layer of the warehouse. Properties of this job type mirror those of the OWS to MDW job: OWE to MDW jobs generate a surrogate key that is used as the primary key in the target DIM table and as the foreign key in the FACT table, and they perform lookup validations for the target DIM and FACT tables to ensure there are no information gaps and to maintain referential integrity.
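The following Python sketch illustrates the surrogate key handling described for the OWS to MDW and OWE to MDW job types. It is a simplified illustration only, not the delivered DataStage job logic; the table structures, business keys, and function names are assumptions.

```python
from itertools import count

# Hypothetical in-memory stand-ins for a target DIM table and its key map.
dim_rows = {}                # surrogate_sid -> dimension row
dim_key_to_sid = {}          # business key -> surrogate_sid
next_sid = count(start=1)    # simple surrogate-key generator

def load_dimension_row(business_key, attributes):
    """Assign a surrogate key (primary key of the DIM table) to a dimension row."""
    if business_key not in dim_key_to_sid:
        dim_key_to_sid[business_key] = next(next_sid)
    sid = dim_key_to_sid[business_key]
    dim_rows[sid] = {"SID": sid, "KEY": business_key, **attributes}
    return sid

def load_fact_row(fact_table, business_key, measures):
    """Resolve the dimension's surrogate key and store it as the fact foreign key."""
    sid = dim_key_to_sid.get(business_key)
    if sid is None:
        # A missing dimension row corresponds to a lookup validation failure.
        raise LookupError(f"No dimension row for key {business_key}")
    fact_table.append({"DIM_SID": sid, **measures})

facts = []
load_dimension_row(("SHARE", "US001"), {"DESCR": "US Operations"})
load_fact_row(facts, ("SHARE", "US001"), {"AMOUNT": 150.0})
```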
All job types identified in the table are incremental load jobs. Incremental load jobs identify and extract only new or changed source records and bring them into the target warehouse tables.
See ETL Load Strategies.
PeopleSoft uses standard naming conventions for all ETL jobs, which ensures consistency across projects. The following table provides the naming conventions for PeopleSoft-delivered ETL jobs.
Object | Naming Convention | Example
Staging Server Job | J_Stage_[Staging Table Name] | J_Stage_PS_BU_ITEMS_INV
Sequencer Job | SEQ_[Server Job Name] | SEQ_J_Stage_PS_BU_ITEMS_INV
CRC Initial Load Job | J_Hash_PS_[Table Name] | J_Hash_PS_S_BD_ASSET_ITEM
Common Lookup Load Job | J_Hash_PS_[Table Name] | J_Hash_PS_D_LRNG_ENV
Dimension Job | J_Dim_PS_[Dimension Table Name] | J_Dim_PS_D_DEPT
Fact Job | J_Fact_PS_[Fact Table Name] | J_Fact_PS_F_ENRLMT
Enterprise D00 Job | J_D00_PS_[D00 Table Name without D00 Suffix] | J_D00_PS_ACCOMP_D00
Enterprise F00 Job | J_F00_PS_[F00 Table Name without F00 Suffix] | J_F00_PS_JOB_F00
OWE BASE Job | J_BASE_PS_[Base OWE Table Name] | J_BASE_PS_XYZ
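As an illustration of how these conventions compose, the following Python sketch builds job names from table or job names. The helper and its mapping are hypothetical; they are not part of the delivered ETL objects.

```python
# Hypothetical helper illustrating the delivered job-naming conventions.
JOB_PREFIXES = {
    "staging": "J_Stage_",    # staging server jobs
    "sequencer": "SEQ_",      # sequencer jobs wrap a server job name
    "hash": "J_Hash_",        # CRC initial load and common lookup load jobs
    "dimension": "J_Dim_",    # MDW dimension jobs
    "fact": "J_Fact_",        # MDW fact jobs
}

def job_name(job_type: str, base_name: str) -> str:
    """Build a job name following the naming conventions in the table above."""
    return JOB_PREFIXES[job_type] + base_name

# Examples matching rows of the table:
assert job_name("staging", "PS_BU_ITEMS_INV") == "J_Stage_PS_BU_ITEMS_INV"
assert job_name("sequencer", "J_Stage_PS_BU_ITEMS_INV") == "SEQ_J_Stage_PS_BU_ITEMS_INV"
assert job_name("dimension", "PS_D_DEPT") == "J_Dim_PS_D_DEPT"
```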
Hash files are views of specific EPM warehouse tables and contain only a subset of the data available in the warehouse tables. These streamlined versions of warehouse tables are used to perform data validation (lookups) within an ETL job and select specific data from lookup tables (such as sourceID fields in dimensions).
In the validation (lookup) process the smaller hash file is accessed, rather than the base warehouse table, improving performance. The following diagram provides an example of a hash file lookup in a job.
Lookup process using hash file
Because hash files are vital to the lookup process, jobs cannot function properly until all hash files are created and populated with data. Before you run any job that requires a hash file, you must first run all jobs that create and load the hash files—also called initial hash file load jobs.
After hash files are created and populated by the initial hash file load jobs, they are updated on a regular basis by the delivered sequencer jobs. A hash file is updated in the same sequencer job that updates its related target warehouse table; the successful load of the target warehouse table in the job triggers the load of the related hash file. The following diagram provides an example of this process.
Hash file update process
You can see from the diagram that both the target warehouse table and the hash file are updated in the same sequencer job. This ensures that the data in the hash file always matches the data in the related warehouse table.
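The following Python sketch illustrates the hash file concept: a streamlined key-value view is built from a target warehouse table, lookups run against that view rather than the base table, and the view is refreshed in the same flow that loads the target table. The structures and column names are illustrative assumptions, not the delivered DataStage hash files.

```python
# Minimal sketch of the hash-file lookup concept, assuming a hash file is a
# key-value view holding only the columns needed for validation.

def build_hash_file(warehouse_rows, key_col, keep_cols):
    """Create a streamlined lookup view (the 'hash file') from a warehouse table."""
    return {row[key_col]: {c: row[c] for c in keep_cols} for row in warehouse_rows}

def lookup(hash_file, key):
    """Validate a key against the hash file instead of scanning the base table."""
    return hash_file.get(key)  # None means the lookup (validation) failed

# A sequencer-style flow: load the target table, then refresh its hash file
# so the lookup view always matches the warehouse table.
target_table = [{"DEPTID": "1000", "DESCR": "Finance", "LAST_UPD_DTTM": "2005-01-01"}]
dept_hash = build_hash_file(target_table, "DEPTID", ["DEPTID", "DESCR"])
assert lookup(dept_hash, "1000") is not None
```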
Environmental parameters are user-defined values that represent processing variables in your ETL jobs. Because environmental parameters are reusable, you can define a processing variable once and use it in several jobs. They also help standardize your jobs.
Though environmental parameters are reusable, PeopleSoft delivers specific environmental parameters for jobs related to each phase of data movement (such as OWS to MDW jobs). Therefore, an individual parameter is not used across all ETL jobs. For more information about environmental parameters refer to the file “List of Environment Parameters.xls” located in the following install CD directory path: <PSHOME>\SRC\ETL.
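As a simple illustration of the concept, the following Python sketch defines a set of processing variables once and references them from several jobs. The parameter names shown are assumptions, not the delivered parameters; refer to “List of Environment Parameters.xls” for the actual list.

```python
# Illustrative only: environment parameters defined once and shared by jobs.
ENV_PARAMS = {
    "SourceDSN": "FIN_PROD",        # source database connection (assumed name)
    "TargetDSN": "EPM_WH",          # EPM warehouse connection (assumed name)
    "HashFilePath": "/data/hash",   # location of hash files (assumed name)
}

def run_job(job_name, params=ENV_PARAMS):
    """Each job reads the shared processing variables instead of hard-coding them."""
    print(f"{job_name}: reading {params['SourceDSN']}, writing {params['TargetDSN']}")

run_job("J_Stage_PS_BU_ITEMS_INV")
run_job("J_Dim_PS_D_DEPT")
```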
Shared containers are reusable job elements. A shared container typically comprises groups of stages and links and is stored in the DataStage repository. You can use shared containers to make common job components available throughout your project. Because shared containers are reusable, you can define them once and use them in any number of your ETL jobs. PeopleSoft delivers the following shared containers:
StoreLangStagingList
StoreMaxLastUpdDttm
StoreMaxLastUpdDttmE1
Routines are sets of instructions, or logic, that perform a task within a job. For example, the GetDecimalShift routine can determine the decimal point adjustment for currency and non-currency data items. Because routines are reusable, you can use them in any number of your ETL jobs. For more information about PeopleSoft-delivered routines, refer to the file “EPM89_Routines_Details.xls” located in the following install CD directory path: <PSHOME>\SRC\ETL.
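As an illustration, the following Python sketch shows what a decimal-shift routine such as GetDecimalShift might do. The signature, the source of the shift count, and the shift logic are assumptions, not the delivered routine.

```python
from decimal import Decimal

def get_decimal_shift(raw_amount: int, decimal_positions: int) -> Decimal:
    """Shift the decimal point of an amount stored without an explicit decimal.

    raw_amount: the value as stored in the source (e.g. 123456)
    decimal_positions: how many places to shift (e.g. 2 for a currency item)
    """
    return Decimal(raw_amount) / (Decimal(10) ** decimal_positions)

assert get_decimal_shift(123456, 2) == Decimal("1234.56")   # currency item
assert get_decimal_shift(123456, 0) == Decimal("123456")    # non-currency item
```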
This section provides an overview of PeopleSoft ETL load strategy and discusses:
Incremental loading and the date time stamp.
Incremental loading and update, else insert logic.
PeopleSoft delivers ETL jobs that extract data from your source transaction system and load it into target OWE and MDW dimension and fact tables. Most of the delivered ETL jobs are incremental load jobs. Incremental load jobs use built-in logic to identify and extract only new or changed source records. When an incremental load job reads data from a source, it uses the date time stamp to identify new or changed records. When an incremental load job writes data to a target, it updates records using update, else insert logic. The benefit of the incremental load process is increased efficiency and faster processing during the extract and load process.
Note. If this is the first time you are populating your target warehouse tables with data, the incremental jobs recognize that you have no existing data in your tables and perform a complete extract of your source records. Subsequent runs of the incremental jobs will extract only new or changed records.
To ensure that only new or changed records are extracted into EPM Foundation, a date time stamp is associated with each record in the EPM tables. Target warehouse tables have a DTTM column that holds the current date time value for each record, while a separate hash file stores the last update date time stamp.
When an incremental load job reads a table, it uses the built-in filter condition, [DTTM_Column] > [%DateTimeIn('#LastModifiedDateTime#')], to determine whether any records in the table are new or changed since the last load. The last update date time is fetched from the related hash file using the GetLastUpdDateTime routine. If the retrieved date time is less than the current value in the DTTM column, the record is updated in the EPM table. This process is fast because the DTTM column is the only value processed for each record.
Each time a new or updated record is loaded, the present date time stamp is recorded for the last update time stamp and is used as a basis for comparison the next time the incremental load job is run.
Note. If the last update time field is null for a record, the record is processed each time the job is executed.
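The following Python sketch mirrors the incremental filter described above, with a simple lookup structure standing in for the hash file read by the GetLastUpdDateTime routine. Table, column, and function names are illustrative assumptions, not the delivered job logic.

```python
from datetime import datetime

# Stand-in for the hash file that stores each table's last update date time stamp.
last_update_hash = {"PS_BU_ITEMS_INV": datetime(2005, 6, 30, 18, 0, 0)}

def get_last_upd_datetime(table_name):
    """Stand-in for the hash-file lookup of the last load's date time stamp."""
    return last_update_hash.get(table_name)

def is_new_or_changed(row, dttm_column, table_name):
    """Mirror of the filter [DTTM_Column] > LastModifiedDateTime."""
    row_dttm = row.get(dttm_column)
    if row_dttm is None:               # null last-update field: processed every run
        return True
    last_modified = get_last_upd_datetime(table_name)
    if last_modified is None:          # first load: extract all source records
        return True
    return row_dttm > last_modified

row = {"BUSINESS_UNIT": "US001", "LAST_UPD_DTTM": datetime(2005, 7, 1, 9, 30)}
assert is_new_or_changed(row, "LAST_UPD_DTTM", "PS_BU_ITEMS_INV")
```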
Incremental Loading without the Date Time Stamp
Sometimes source table records are not associated with a date time stamp. When records lack a date time stamp, a cyclical redundancy check (CRC) must be performed to determine which records are new or changed. Unlike the traditional date time lookup process, which examines only the DTTM column, the CRC process reads the entire contents of each record in the table. The CRC process is used in some source to OWS jobs.
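The following Python sketch illustrates a CRC-based change check, assuming the CRC computed during the last load is available from a lookup structure (the role played by the CRC hash files loaded by the J_Hash_* jobs). The record layout and key are illustrative.

```python
import zlib

previous_crc = {}   # record key -> CRC from the last load (stand-in for the CRC hash file)

def record_crc(row: dict) -> int:
    """Compute a checksum over the entire record, not just a date time column."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return zlib.crc32(payload.encode("utf-8"))

def is_changed(key, row) -> bool:
    """A record is new or changed if its CRC differs from the stored value."""
    crc = record_crc(row)
    changed = previous_crc.get(key) != crc
    previous_crc[key] = crc
    return changed

assert is_changed("A100", {"ASSET_ID": "A100", "DESCR": "Forklift"})        # new record
assert not is_changed("A100", {"ASSET_ID": "A100", "DESCR": "Forklift"})    # unchanged
```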
Incremental Loading and Enterprise One Sources
Enterprise One source records store the date time in the Julian date format (YYDDD or YYYDDD) and use the fields UPMJ (date last updated) and UPMT (time last updated). However, EPM uses the Gregorian date format in the incremental load process as well as the LAST_UPD_DTTM field.
To compensate for the incompatible date formats, special parameters exist in the incremental load jobs that read from Enterprise One sources. Each such job has the parameter LastModifiedDateTime of type integer, and when an incremental load job reads a table it uses the built-in filter condition, [DRUPMJ * 1000000 + DRUPMT] > [#LastModifiedDateTime#], to determine whether any records in the table are new or changed since the last load. The last update date time is fetched from the related hash file using the GetLastUpDateTimeE1 routine.
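The following Python sketch shows how the [DRUPMJ * 1000000 + DRUPMT] comparison value can be formed. The Gregorian conversion shown assumes the common JD Edwards convention (year offset from 1900 plus day of year); it is not taken from the delivered jobs, so verify it against your source data.

```python
from datetime import datetime, timedelta

def e1_change_value(upmj: int, upmt: int) -> int:
    """Combine date-last-updated (UPMJ) and time-last-updated (UPMT) into one integer."""
    return upmj * 1000000 + upmt

def e1_julian_to_gregorian(upmj: int) -> datetime:
    """Convert a Julian-format date such as 105032 (assumed 1900 + 105 = 2005, day 32)."""
    year = 1900 + upmj // 1000
    day_of_year = upmj % 1000
    return datetime(year, 1, 1) + timedelta(days=day_of_year - 1)

last_modified = 105032083000            # assumed value fetched from the hash file
row_value = e1_change_value(105045, 91500)
assert row_value > last_modified        # record is new or changed since the last load
assert e1_julian_to_gregorian(105032) == datetime(2005, 2, 1)
```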
If a record must be updated in an EPM table, the incremental load job uses update, else insert logic to determine how to insert the new data. The update, else insert process compares specific key values in the source table against the same key values in the target table. If the same keys exist in the EPM target table, the associated record is overwritten with the new values coming from the source. For OWE D00 and MDW dimension tables, only type 1 slowly changing dimensions are supported; history is not maintained. If the key-comparison process finds that the keys do not exist in the EPM target table, the record is new and is inserted into the target table.
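The following Python sketch illustrates update, else insert (type 1) logic using an in-memory target keyed by business keys. Table and column names are illustrative assumptions, not the delivered job design.

```python
def upsert(target: dict, key_cols: tuple, row: dict):
    """Overwrite the target row if its keys already exist; otherwise insert it."""
    key = tuple(row[col] for col in key_cols)
    if key in target:
        target[key].update(row)       # type 1: old values are overwritten, no history kept
    else:
        target[key] = dict(row)       # keys not found: insert a new record

dim_dept = {}
upsert(dim_dept, ("SETID", "DEPTID"), {"SETID": "SHARE", "DEPTID": "1000", "DESCR": "Finance"})
upsert(dim_dept, ("SETID", "DEPTID"), {"SETID": "SHARE", "DEPTID": "1000", "DESCR": "Global Finance"})
assert dim_dept[("SHARE", "1000")]["DESCR"] == "Global Finance"
```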
The complex process behind integrating and enriching disparate source data can create some special load requirements in EPM. For example, subrecords are used extensively in EPM target tables to provide additional depth and breadth of processing.
Passing Default Values to EPM Target Tables
Due to data aggregation, data enhancement, and other processing requirements, EPM target tables may contain columns that do not exist in your source transaction tables. Because there are no source values to populate these columns, default values must be used instead. PeopleSoft delivers several routines to pass default values to the EPM columns. For example, the routine GetNumDefault is used to pass numeric default values to a target warehouse table.
Note. A separate routine is delivered for each data type (such as varchar and numeric).
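The following Python sketch illustrates the idea of one default routine per data type. The routine names other than GetNumDefault, and the default values themselves, are assumptions rather than the delivered behavior.

```python
def get_num_default():
    """Numeric default for target columns with no corresponding source column."""
    return 0

def get_char_default():
    """Character/varchar default for target columns with no corresponding source column."""
    return "-"

source_value = None                       # no source column feeds this target column
target_row = {
    "AMOUNT": source_value if source_value is not None else get_num_default(),
    "CATEGORY_CD": get_char_default(),    # column exists only in the EPM target table
}
assert target_row == {"AMOUNT": 0, "CATEGORY_CD": "-"}
```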
Target Table Subrecords
Subrecords are a collection of specific columns that repeat across multiple EPM target tables. Subrecords can perform a variety of functions, including tracking data to its original source and facilitating customizations that enable type 2 slowly changing dimensions. For example, the subrecord LOAD_OWS_SBR contains columns such as CREATED_EW_DTTM, LAST_UPD_DTTM, and BATCH_SID which help track target warehouse table load history.
Because it is important to populate subrecords with the appropriate data, you should thoroughly familiarize yourself with the PeopleSoft-delivered subrecords and their associated columns.
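The following Python sketch illustrates populating the LOAD_OWS_SBR columns named above when a row is written to a target table. The surrounding row, batch value, and helper function are illustrative assumptions.

```python
from datetime import datetime

def with_load_subrecord(row: dict, batch_sid: int) -> dict:
    """Attach load-tracking subrecord columns to a target warehouse row."""
    now = datetime.now()
    return {
        **row,
        "CREATED_EW_DTTM": row.get("CREATED_EW_DTTM", now),  # first load into EPM
        "LAST_UPD_DTTM": now,                                 # refreshed on every load
        "BATCH_SID": batch_sid,                               # which ETL batch loaded the row
    }

staged_row = with_load_subrecord({"BUSINESS_UNIT": "US001", "ITEM_ID": "10001"}, batch_sid=42)
```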
The reporting and analytical functionality provided by EPM is wholly dependent on the data stored in EPM Foundation. If incomplete or incorrect data resides in EPM Foundation, reports you create and analysis you perform will be flawed and essentially worthless. Given the considerable dependence on warehouse data, all source data entering EPM Foundation must be validated.
Typically, source transaction systems have their own mechanisms to enforce data integrity, including database schema declarative referential integrity, runtime checks (such as database triggers), or application-based checks. PeopleSoft Enterprise and Enterprise One source systems implement data consistency checks in the application layer. More specifically, data consistency is enforced when data is entered using PeopleSoft Internet Architecture (PIA) pages. However, even with these source validations in place, data integrity cannot be assumed. For example, it is possible to enter data into a PeopleSoft Enterprise source system using a back-end mechanism, bypassing standard data validation checks. Hence, data must be re-validated in EPM Foundation.
PeopleSoft provides data validation that ensures:
Each row in a fact table (*_F00 or F_*) has corresponding dimension and fact values.
Source business unit and setID are properly mapped to EPM values.
Source codes are properly mapped to EPM values.
Data validations are performed using PeopleSoft-delivered ETL jobs. Certain jobs contain embedded logic that automatically performs the validations when you run the job. More specifically, data validation is performed in the job using lookup validations. Like other lookups, the validation process uses hash files to look up data and verify the integrity and completeness of the data. Embedded lookup validations can perform dimension key validation (for example, verifying that a customer ID fact value has a corresponding customer ID dimension value) and general key validation (for example, verifying that a pre-fact customer ID in the OWS table has a corresponding customer ID in the OWE or MDW table).
Because we want to ensure that complete, accurate data resides in the Operational Warehouse - Enriched (OWE) and Multidimensional Warehouse (MDW) layers, data validations are embedded in the jobs that load data from the OWS to the OWE and MDW. Therefore, data that passes the validation process is loaded into OWE and MDW target tables, while data that fails the validation process is redirected to separate error tables in the OWS. This ensures that flawed data never finds its way into the target OWE and MDW tables. The following graphic demonstrates the lookup process and represents a typical data validation process.
ETL data validation process
Error tables log the source values that fail validation to aid correction of the data in the source system. There is an error table for each OWS driver table; OWS driver tables are those tables that contain the primary information for the target entity (for example, customer ID). After flawed data is moved to the error table, you must review this information and correct the problem in the source system. Information about the error tables and their structure can be found in the chapter “Understanding EPM”.
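The following Python sketch illustrates the validation flow described above: rows that pass the hash file lookup continue to the OWE/MDW target, while rows that fail are redirected to an OWS error table. The structures, column names, and error reason are illustrative assumptions, not the delivered error table layout.

```python
# Stand-in hash file of valid customer IDs and their surrogate keys.
customer_dim_hash = {"C100": {"CUST_SID": 1}}

target_rows, error_rows = [], []

def validate_and_load(ows_row: dict):
    """Redirect rows that fail dimension key validation to the error table."""
    sid_row = customer_dim_hash.get(ows_row["CUST_ID"])
    if sid_row is None:
        error_rows.append({**ows_row, "ERROR_REASON": "Customer ID not found"})
        return
    target_rows.append({**ows_row, "CUST_SID": sid_row["CUST_SID"]})

validate_and_load({"CUST_ID": "C100", "AMOUNT": 250.0})   # passes validation
validate_and_load({"CUST_ID": "C999", "AMOUNT": 75.0})    # fails, goes to the error table
assert len(target_rows) == 1 and len(error_rows) == 1
```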
Note. PeopleSoft does not provide an error correction mechanism to fix data failing validation. Failed data should be corrected in the source system.
See Also
ETL setup information for all warehouse layers can be found in PeopleSoft Enterprise Performance Management 8.9 Supplemental Installation Documentation and EPM 89 ETL Implementation Steps, posted on Customer Connection.