Shyam S. Reddy: JOIN, LOOKUP, MERGE

JOIN, LOOKUP, MERGE :

The three stages differs with each other with respect to
1>Input Requirements
2>Treatment of unmatched records
3>Memory Usage

Join requires less memory usage
Lookup requires more memory usage and
Merge requires less memory usage

JOIN : if the reference data is very large then we wl go for join. bcoz it access the data directly from the disk. so the processing time wl be less when compared to lookup. but here in join we cant capture the rejected data. so we go for merge.

1. No reject.
2. Does need partitioned and sorted input.

MERGE: if we want to capture rejected data(when the join key is not matched) we use merge stage. for every detailed link there is a reject link to capture rejected data.\

1. Can reject rows if fail to find match.
2. Does need partitioned and sorted input.

Lookup stage: 1.Can only return multiple matching rows from one reference.
2. Can reject rows based on constraint.
3. Can set failure.
4. Does not need partitioned and sorted input.

Partitioning mechanism divides a portion of data into smaller segments, which is then processed independently by each node in parallel. It helps make a benefit of parallel architectures like SMP, MPP, Grid computing and Clusters.
Collecting is the opposite of partitioning and can be defined as a process of bringing back data partitions into a single sequential stream (one data partition).
Data partitioning methods

Datastage supports a few types of Data partitioning methods which can be implemented in parallel stages:
Auto - default. Datastage Enterprise Edition decides between using Same or Round Robin partitioning. Typically Same partitioning is used between two parallel stages and round robin is used between a sequential and an EE stage.
Same - existing partitioning remains unchanged. No data is moved between nodes.
Round robin - rows are alternated evenly accross partitions. This partitioning method guarantees an exact load balance (the same number of rows processed) between nodes and is very fast.
Hash - rows with same key column (or multiple columns) go to the same partition. Hash is very often used and sometimes improves performance, however it is important to have in mind that hash partitioning does not guarantee load balance and misuse may lead to skew data and poor performance.
Entire - all rows from a dataset are distributed to each partition. Duplicated rows are stored and the data volume is significantly increased.
Random - rows are randomly distributed accross partitions
Range - an expensive refinement to hash partitioning. It is imilar to hash but partition mapping is user-determined and partitions are ordered. Rows are distributed according to the values in one or more key fields, using a range map (the 'Write Range Map' stage needs to be used to create it). Range partitioning requires processing the data twice which makes it hard to find a reason for using it.
Modulus - data is partitioned on one specified numeric field by calculating modulus against number of partitions. Not used very often.

Data collecting methodsA collector combines partitions into a single sequential stream.
Datastage EE supports the following collecting algorithms:

Auto - the default algorithm reads rows from a partition as soon as they are ready. This may lead to producing different row orders in different runs with identical data. The execution is non-deterministic.
Round Robin - picks rows from input partition patiently, for instance: first row from partition 0, next from partition 1, even if other partitions can produce rows faster than partition 1.
Ordered - reads all rows from first partition, then second partition, then third and so on.
Sort Merge - produces a globally sorted sequential stream from within partition sorted rows. Sort Merge produces a non-deterministic on un-keyed columns sorted sequential stream using the following algorithm: always pick the partition that produces the row with the smallest key value.

***********************************************************

What is APT_CONFIG in datastage

APT_CONFIG is just an environment variable used to idetify the *.apt file. Dont confuse that with *.apt file that has the node's information and Configuration of SMP/MMP server.

Apt_configfile is used for to store the nodes information, and it contains the disk storage information, and scrach information. and datastage understands the architecture of system based on this Configfile. for parallel process normally two nodes are required its name like 10,20.

apt_config_file is the environmental variable for configuration file. This file contains details like node information, pool resources, resource disk(permanent memory space)and resource scratch disk(temporary memory)

************************************
What are the command line functions that import and export the DS jobs?

A. dsimport.exe- imports the DataStage components.
B. dsexport.exe- exports the DataStage components.

************************************

Shyam S. Reddy

Thursday, January 13, 2011

JOIN, LOOKUP, MERGE

No comments:

Post a Comment

Total Pageviews

Popular Posts