Friday, January 7, 2011

DS - PROCESSING STAGES

Aggregator groups rows of the incoming data stream and calculates summaries (sum, count, min, max, variance, etc.) for each group. The data can be grouped using two methods: a hash table or pre-sorted input.
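The hash-table grouping method can be sketched in a few lines of Python: each incoming row is appended to a bucket keyed by the grouping column, and summaries are computed per bucket. The column names ("region", "amount") are hypothetical.

```python
from collections import defaultdict

rows = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 5},
    {"region": "EU", "amount": 7},
]

# Hash-table grouping: bucket each row's value by its group key.
groups = defaultdict(list)
for row in rows:
    groups[row["region"]].append(row["amount"])

# One summary row per group, as the Aggregator stage would emit.
summary = {key: {"sum": sum(vals), "count": len(vals),
                 "min": min(vals), "max": max(vals)}
           for key, vals in groups.items()}
```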
Copy - copies input data (a single stream) to one or more output data flows

FTP stage uses FTP protocol to transfer data to a remote machine

Filter filters out records that do not meet specified requirements.

Funnel combines multiple streams into one.

JOIN combines two or more inputs according to the values of one or more key columns. Similar in concept to a relational DBMS SQL join (it can perform inner, left, right and full outer joins). It can have one left input and multiple right inputs (all must be sorted) and produces a single output stream (no reject link).
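Because both inputs arrive sorted on the key, the join can be done with a single merge walk, which is roughly what happens under the hood. A minimal inner-join sketch, assuming unique keys in each input:

```python
# Two inputs, each sorted by key: (key, payload) tuples.
left  = [(1, "a"), (2, "b"), (4, "d")]
right = [(2, "B"), (3, "C"), (4, "D")]

def inner_join(left, right):
    """Merge-walk inner join over two key-sorted inputs."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1                      # left key has no match yet
        elif lk > rk:
            j += 1                      # right key has no match yet
        else:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out
```

The same walk generalizes to left/right/full outer joins by also emitting the unmatched side with nulls instead of skipping it.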

LOOKUP combines two or more inputs according to the values of one or more key columns. The Lookup stage can have one source and multiple lookup tables. Records do not need to be sorted; the stage produces a single output stream and a reject link.
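The lookup pattern amounts to loading the reference data into a hash table and probing it row by row, with misses going to the reject link. A sketch with hypothetical field names:

```python
# Reference data loaded into memory as a hash table.
lookup_table = {"US": "United States", "PL": "Poland"}

source = [{"id": 1, "cc": "US"}, {"id": 2, "cc": "XX"}]

output, rejects = [], []
for row in source:
    name = lookup_table.get(row["cc"])
    if name is None:
        rejects.append(row)                      # unmatched -> reject link
    else:
        output.append({**row, "country": name})  # enriched -> output link
```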

MERGE combines one master input with multiple update inputs according to the values of one or more key columns. All inputs must be sorted, and unmatched secondary entries can be captured in multiple reject links.
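A simplified sketch of the merge pattern, which only shows matched update rows folding into their master row and unmatched updates landing on a reject link (field names are hypothetical):

```python
# Master rows keyed by id; one update input.
master  = {1: {"id": 1, "name": "Ann"}, 2: {"id": 2, "name": "Bob"}}
updates = [{"id": 2, "city": "Oslo"}, {"id": 9, "city": "Lima"}]

merged, rejected = [], []
for upd in updates:
    base = master.get(upd["id"])
    if base is None:
        rejected.append(upd)            # unmatched update -> reject link
    else:
        merged.append({**base, **upd})  # update columns folded into master
```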

Modify stage alters the record schema of its input data set. It is useful for renaming columns, non-default data type conversions and null handling.

Remove Duplicates stage takes a single sorted data set as input, removes all duplicate records according to a specification, and writes to a single output.
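Sorting is what makes duplicate removal a one-pass operation: equal keys are adjacent, so each row's key only has to be compared with the previous one. A sketch that retains the first row of each duplicate group:

```python
# Input already sorted by key (the first tuple element).
rows = [(1, "a"), (1, "b"), (2, "c"), (2, "c"), (3, "d")]

deduped, last_key = [], object()   # sentinel that matches no real key
for row in rows:
    if row[0] != last_key:
        deduped.append(row)        # first occurrence of this key kept
        last_key = row[0]
```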


Slowly Changing Dimension automates the process of updating dimension tables whose data changes over time. It supports SCD type 1 and SCD type 2.

SORT sorts the input rows by one or more key columns.


Transformer stage handles extracted data and performs data validation, conversions and lookups.


Change Capture - compares the before and after state of two input data sets and outputs a single data set whose records represent the changes made.
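The comparison can be sketched by keying both data sets and tagging every key as an insert, delete, or edit. The change-code labels below are illustrative, not DataStage's actual code values:

```python
# Before and after data sets keyed by id; values are the row payloads.
before = {1: "a", 2: "b", 3: "c"}
after  = {2: "b", 3: "C", 4: "d"}

changes = []
for key in sorted(before.keys() | after.keys()):
    if key not in after:
        changes.append((key, "delete"))   # row disappeared
    elif key not in before:
        changes.append((key, "insert"))   # row is new
    elif before[key] != after[key]:
        changes.append((key, "edit"))     # row changed; unchanged rows dropped
```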

Change Apply - applies the change operations to a before data set to compute an after data set. It takes the change data set produced by a Change Capture stage.
Difference stage performs a record-by-record comparison of two input data sets and outputs a single data set whose records represent the difference between them. Similar to the Change Capture stage.

Checksum - generates a checksum from the specified columns in a row and adds it to the stream. Used to determine if there are differences between records.
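The idea is to hash the selected columns and append the digest as an extra field, so two rows can later be compared by checksum alone. A sketch using MD5 purely as an example algorithm:

```python
import hashlib

def add_checksum(row, columns):
    """Append an MD5 checksum of the selected columns to the row."""
    raw = "|".join(str(row[c]) for c in columns)
    return {**row, "checksum": hashlib.md5(raw.encode()).hexdigest()}

a = add_checksum({"id": 1, "name": "Ann"}, ["id", "name"])
b = add_checksum({"id": 1, "name": "Ann"}, ["id", "name"])
c = add_checksum({"id": 1, "name": "Bob"}, ["id", "name"])
```

Identical column values yield identical checksums; any change in a hashed column changes the digest.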

Compare performs a column-by-column comparison of records in two presorted input data sets. It can have two input links and one output link.


Encode encodes data with an encoding command, such as gzip.


Decode decodes a data set previously encoded with the Encode stage.
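The Encode/Decode pair forms a round trip: any command that transforms a byte stream and can invert the transformation will do. A sketch using gzip (mentioned above) as the encoding command:

```python
import gzip

records = b"1,Ann\n2,Bob\n"

encoded = gzip.compress(records)    # what the Encode stage would emit
decoded = gzip.decompress(encoded)  # what the Decode stage recovers
```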


External Filter permits specifying an operating system command that acts as a filter on the processed data.


Generic stage allows users to call an OSH operator from within a DataStage stage, with options as required.
Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row to a single column in multiple output rows. Pivoting produces a data set with fewer columns but more rows.
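Horizontal pivoting can be sketched as turning one wide row into several narrow ones; the quarterly column names below are hypothetical:

```python
# One wide input row: three quarter columns.
row = {"product": "X", "q1": 10, "q2": 20, "q3": 30}

# Pivot: each quarter column becomes its own output row.
pivoted = [{"product": row["product"], "quarter": q, "sales": row[q]}
           for q in ("q1", "q2", "q3")]
```

One input row with four columns becomes three output rows with three columns each: fewer columns, more rows.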

Surrogate Key Generator generates surrogate key for a column and manages the key source.
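At its core a surrogate key generator is a monotonically increasing counter whose starting point comes from the key source (a state file or database sequence); here the seed is just an integer for illustration:

```python
import itertools

def key_generator(start=1):
    """Yield an endless sequence of surrogate keys from a starting seed."""
    return itertools.count(start)

# Seed hypothetically read from the key source.
gen = key_generator(100)
rows = [{"name": n, "sk": next(gen)} for n in ("a", "b", "c")]
```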

Switch stage assigns each input row to an output link based on the value of a selector field. It provides a concept similar to the switch statement in most programming languages.
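The routing can be sketched as a dictionary dispatch on the selector field, with unmatched values going to a reject list (the "type" field and link names are hypothetical):

```python
# One output list per link, plus a reject list for unmatched selectors.
links = {"A": [], "B": []}
rejects = []

for row in [{"type": "A", "v": 1}, {"type": "B", "v": 2}, {"type": "C", "v": 3}]:
    # Route by selector value; fall back to the reject list on a miss.
    links.get(row["type"], rejects).append(row)
```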


Compress - packs a data set using the gzip utility (or the compress command on LINUX/UNIX)


Expand converts a previously compressed data set back into its original sequence of records.
