Friday, January 21, 2011

DS FAQs - II

********** JOIN or LOOKUP or MERGE **********

This question pops up many times in the minds of DataStage developers.
All of the above stages can be used for the same task: match one set of data (say, the primary) against another set of data (the reference) and see the results. DataStage normally uses a different execution plan for each, but since DataStage is not as obliging as Oracle about showing its execution plan, we need to fill in for the optimizer ourselves and analyze our requirements. Most importantly, it is the primary/reference volume ratio that needs to be considered, not the actual counts.
Primary Source Volume      Reference Volume            Preferred Method
Little (< 5 million)       Very huge (> 50 million)    Sparse Lookup
Little (< 5 million)       Little (< 5 million)        Normal Lookup
Huge (> 10 million)        Little (< 5 million)        Normal Lookup
Little (< 5 million)       Huge (> 10 million)         Normal Lookup
Huge (> 10 million)        Huge (> 10 million)         Join
Huge (> 10 million)        Huge (> 10 million)         Merge (if you want to handle rejects on the reference links)
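To see why the ratio (and not the raw counts) drives the choice, here is a minimal Python sketch of the two strategies. It is illustrative only, not DataStage internals, and the column names and sample rows are made up: a lookup builds the whole reference in memory, so it only pays off when the reference side is small, while a join sorts both sides and streams them together, which also works when both inputs are huge.

def normal_lookup(primary, reference, key):
    # Lookup: load the ENTIRE reference into an in-memory index,
    # then stream the primary past it. Cheap when the reference is
    # small; memory explodes when the reference is huge.
    ref_index = {row[key]: row for row in reference}
    for row in primary:
        match = ref_index.get(row[key])   # O(1) probe per primary row
        if match is not None:
            yield {**row, **match}

def merge_join(primary, reference, key):
    # Join/Merge: sort both sides on the key and stream them together.
    # No in-memory index is needed, so two huge inputs are fine.
    # (sorted() here stands in for the engine's on-disk sort; this
    # sketch assumes unique keys on the reference side.)
    p = sorted(primary, key=lambda r: r[key])
    r = sorted(reference, key=lambda r: r[key])
    i = j = 0
    while i < len(p) and j < len(r):
        if p[i][key] < r[j][key]:
            i += 1
        elif p[i][key] > r[j][key]:
            j += 1
        else:
            yield {**p[i], **r[j]}
            i += 1

primary   = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
reference = [{"id": 1, "name": "A"}, {"id": 3, "name": "C"}]
print(list(normal_lookup(primary, reference, "id")))
# [{'id': 1, 'amt': 10, 'name': 'A'}]
print(list(merge_join(primary, reference, "id")))
# [{'id': 1, 'amt': 10, 'name': 'A'}]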
------------------------------------
What is the difference between the hash and modulus partition methods?

Hash key: data is distributed in such a way that related data stays together; all rows whose key hashes to the same value are populated to one partition. E.g., all primary records for this month will go to one partition.
Modulus: the partitioned data itself carries some meaning, in the sense that, for example, all customers related to one store will be populated to one partition (the partition number is the numeric key mod the number of partitions).
  
    Modulus                          Hash
1. For numeric keys only         1. For numeric and character keys
2. Data-type specific            2. Not data-type specific
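A minimal Python sketch of the two ideas (the partition count and key values are made up for illustration): modulus divides a numeric key directly by the number of partitions, while hash first runs the key, numeric or character, through a hash function, which is why hash works for any data type.

NUM_PARTITIONS = 4  # assumed partition count, for illustration only

def modulus_partition(key: int) -> int:
    # Works only for numeric keys: partition = key mod number of partitions.
    return key % NUM_PARTITIONS

def hash_partition(key) -> int:
    # Works for numerics AND characters: hash the key first, then take the
    # mod, so all rows with the same key value land on the same partition
    # (Python's hash() is consistent within one run, which is what matters).
    return hash(key) % NUM_PARTITIONS

print(modulus_partition(1234))        # 1234 % 4 = 2
print(hash_partition("STORE_042"))    # same partition for every "STORE_042" row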
----------------------------------
RE: How to handle date conversions in DataStage? Convert a mm/dd/yyyy format to yyyy-dd-mm?
To convert mm/dd/yyyy format to yyyy-dd-mm, the expression is:
Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")
Here, Iconv(FieldName, "D/MDY[2,2,4]") first converts the given date into DataStage's internal date format;
Oconv(internal_date, "D-YDM[4,2,2]") then converts the internal date format to the required yyyy-dd-mm.

We can achieve this conversion using Transformer stage conversion functions. We use:
a) the "Iconv" function for internal conversion;
b) the "Oconv" function for external conversion.
(Note that Iconv/Oconv are server-job BASIC functions; in PX the Transformer equivalents are StringToDate() and DateToString().)

******************************
Types of Parallel Processing?

Hardware-wise, there are 3 types of parallel processing systems available:
1. SMP (Symmetric Multiprocessing): multiple CPUs, shared memory, a single OS.
2. MPP (Massively Parallel Processing): multiple CPUs, each with its own set of resources (memory, OS, etc.), but physically housed in the same machine.
3. Clusters: the same as MPP, but physically dispersed (not in the same box, connected via high-speed networks).
DS offers 2 types of parallelism to take advantage of the above hardware:
1. Pipeline parallelism: a downstream stage starts consuming rows as soon as the upstream stage produces them, so all stages run concurrently.
2. Partition parallelism: the data is split into partitions and the same stage logic runs on every partition simultaneously (see the sketch below).
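A minimal Python sketch of the two kinds of parallelism (illustrative only; the stage names are made up). The generator pipeline models the streaming hand-off of pipeline parallelism, and the process pool models running the same stage on separate chunks of data at once.

from multiprocessing import Pool

def extract():
    # Upstream stage: yields rows one at a time instead of materializing all.
    for i in range(10):
        yield i

def transform(row):
    return row * 2

if __name__ == "__main__":
    # Pipeline parallelism (conceptually): transform consumes each row as
    # soon as extract yields it, without waiting for extract to finish.
    pipeline = (transform(row) for row in extract())
    print(list(pipeline))

    # Partition parallelism: the same transform runs on partitions of the
    # data across 4 worker processes at once.
    with Pool(4) as pool:
        print(pool.map(transform, list(range(10))))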
*****************************************************************
Functionality of Link Partitioner and Link Collector?

Link Partitioner: it splits data into various partitions (data flows) using various partitioning methods.
Link Collector: it collects the data coming from the partitions, merges it into a single data flow, and loads it to the target.
Server jobs mainly execute in sequential fashion; the IPC stage, as well as the Link Partitioner and Link Collector, simulate parallel execution in server jobs, even on a single-CPU machine.

   ***OR***
Link Partitioner: it receives data on a single input link and diverts the data to a maximum of 64 output links; the data processed by the stage carries the same metadata on every link.
Link Collector: it collects the data from up to 64 input links, merges it into a single data flow, and loads it to the target. Both are active stages, and the design and mode of execution of server jobs has to be decided by the designer (a toy sketch of the split/collect idea follows).
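A toy Python sketch of the split/collect idea (the 64-link limit comes from the description above; the round-robin method and everything else here are illustrative assumptions):

MAX_OUTPUT_LINKS = 64

def link_partitioner(rows, n_links):
    # Split one input stream across up to 64 output links, round-robin.
    assert n_links <= MAX_OUTPUT_LINKS
    links = [[] for _ in range(n_links)]
    for i, row in enumerate(rows):
        links[i % n_links].append(row)
    return links

def link_collector(links):
    # Merge the partitioned streams back into a single data flow.
    for link in links:
        yield from link

links = link_partitioner(range(10), 4)
print(list(link_collector(links)))  # all 10 rows back in one flow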

*******************************************
How do you execute a DataStage job from the command-line prompt?

Use the "dsjob" command, as follows:
dsjob -run -jobstatus projectname jobname
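A hedged Python sketch that wraps this command with subprocess (the project and job names are placeholders). With -jobstatus, dsjob waits for the job to complete and its exit status reflects the job's finishing status; the code values in the comment are the commonly documented ones.

import subprocess

def run_job(project: str, job: str) -> int:
    # Launch the job and wait; with -jobstatus, dsjob's exit code reflects
    # the job's finishing status (e.g. 1 = finished OK, 2 = finished with
    # warnings).
    result = subprocess.run(["dsjob", "-run", "-jobstatus", project, job],
                            capture_output=True, text=True)
    print(result.stdout)
    return result.returncode

# Placeholder names, for illustration only:
# status = run_job("MyProject", "MyJob")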

*******************************************
Importance of Surrogate Key in Data warehousing?

1. A surrogate key is the primary key for a dimension table. Its main importance is that it is independent of the underlying database, i.e., the surrogate key is not affected by changes going on in the source database. If any updates are applied to the source database, they can be carried through to the data warehouse side via the surrogate key.
2. A surrogate key should be a system-generated number, and it should be a small integer. For each dimension table, depending on the SCD type and the total number of records expected over, say, a 4-year period, you may limit the maximum value. This improves indexing performance and query processing.
3. A primary key does not allow duplicates in the actual source data, so we cannot maintain the history of each record using the PK alone.
The SID (surrogate ID) acts as the primary key in the target warehouse systems; it allows the natural key to repeat and so maintains complete historical data along with the current data.
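A minimal Python sketch of points 2 and 3 (the table and column names are made up): the SID is just a small system-generated integer, so the same natural key can appear more than once, one row per historical version.

from itertools import count

sid_seq = count(1)   # system-generated small-integer sequence
customer_dim = []    # hypothetical dimension table

def add_version(natural_key, name):
    # Each change to a source record gets a NEW surrogate key (SID),
    # so history is kept even though the natural key repeats.
    customer_dim.append({"sid": next(sid_seq),
                         "cust_id": natural_key,   # natural/source key
                         "name": name})

add_version(1001, "John Smith")
add_version(1001, "John A. Smith")  # updated source record: new SID, same cust_id
print(customer_dim)
# [{'sid': 1, 'cust_id': 1001, 'name': 'John Smith'},
#  {'sid': 2, 'cust_id': 1001, 'name': 'John A. Smith'}]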

**********************************************************
