Monday, January 31, 2011

ETL - Informatica Part III

What are the various methods of getting incremental records or delta records from the source systems?
One foolproof method is to maintain a field called ‘Last Extraction Date’ and then impose a condition in the code saying ‘current_extraction_date > last_extraction_date’.
First method: if the source has a column that identifies when a record was inserted or last updated, it is easy to put a filter condition on that column in the source qualifier.

Second method: if the source has no such column, we need to do a lookup on the target based on the primary key to determine which records are new, and then insert only those.
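
For example, if the source has an audit column (here a hypothetical last_update_date) and the last extraction date is held in a mapping parameter such as $$LAST_EXTRACT_DATE fed from a parameter file, the source qualifier filter or SQL override might look like this sketch:

-- Incremental-extract filter (table, column and parameter names are illustrative)
SELECT *
FROM   src_orders
WHERE  last_update_date > TO_DATE('$$LAST_EXTRACT_DATE', 'YYYY-MM-DD HH24:MI:SS')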

Where do we use connected and unconnected lookups?
If only one return port is needed, we can go for an unconnected lookup; an unconnected lookup cannot return more than one port. If more than one return port is needed, go for a connected lookup.

What are the various test procedures used to check whether the data is loaded in the backend, performance of the mapping, and quality of the data loaded in INFORMATICA.
The best procedure is to take the help of the debugger, where we can monitor each process of the mapping and see how the data loads based on condition breakpoints.

Another check: in the Workflow Monitor, after the session shows the status Succeeded, right-click it and open the properties to see the number of source rows, successfully loaded target rows and rejected rows.

Let's suppose we have some 10,000-odd records in the source system and we load them into the target. How do we ensure that the 10,000 records loaded into the target don't contain any garbage values?
Write a SQL statement with the transformation (ETL) logic for all columns.

Create a temporary table with this SQL statement.
Use this table and do a MINUS against the target table to ensure the data is loaded properly.

If loaded as expected, the MINUS should not return any rows.
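
As a sketch, assuming the transformation logic has been materialised into a staging table (here called stg_expected) and the target is tgt_customer, the check could be:

-- Rows the ETL logic should have produced but that are missing or different in the target
SELECT * FROM stg_expected
MINUS
SELECT * FROM tgt_customer;

-- Rows in the target that the ETL logic did not produce
SELECT * FROM tgt_customer
MINUS
SELECT * FROM stg_expected;

Both queries returning zero rows indicates the target matches the expected output.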

What are parameter files? Where do we use them?
A parameter file defines the values for the parameters and variables used in a workflow, worklet or session.
Parameter files are used with mappings, sessions, workflows and worklets, so these values can be changed between runs without editing the objects themselves.
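
A minimal parameter file sketch (folder, workflow, session and parameter names are all illustrative):

[MyFolder.WF:wf_daily_load.ST:s_m_load_orders]
$$LAST_EXTRACT_DATE=2011-01-30 00:00:00
$DBConnection_Source=Oracle_SRC
$PMSessionLogFile=s_m_load_orders.log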

How to determine what records to extract?
When addressing a table, some key must reflect whether a record needs to be extracted. Mostly it will come from the time dimension (e.g. date >= 1st of the current month) or a transaction flag (e.g. an 'Order Invoiced' status). A foolproof approach is to add an archive flag to the record that gets reset whenever the record changes.

What is a mapping, session, worklet, workflow, mapplet?
A mapping represents dataflow from sources to targets.
A mapplet is a reusable object that contains a set of transformations and can be used in multiple mappings.

A workflow is a set of instructions that tell the Informatica server how to execute the tasks.

A worklet is an object that represents a set of tasks.

A session is a set of instructions that describe how and when to move data from sources to targets.

What is the difference between Joiner and Lookup?

Joiner is used to join two or more tables to retrieve data from them (just like joins in SQL).

Lookup is used to check and compare the source table and the target table (just like a correlated sub-query in SQL).
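
In SQL terms the two behave roughly as in this sketch (table and column names are illustrative):

-- Joiner: combines rows from two sources, like a SQL join
SELECT o.order_id, o.amount, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id;

-- Lookup: fetches a value for each source row, like a correlated sub-query
SELECT o.order_id,
       (SELECT t.customer_key
        FROM   tgt_customer_dim t
        WHERE  t.customer_id = o.customer_id) AS customer_key
FROM   orders o;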

What is a factless fact table? Where have you used it in your project?
A factless fact table contains only the keys; there are no measures in the fact.

In other words, a factless fact table contains only foreign keys, without any measures. Example: an attendance record of employees in a company contains no measures, only keys.
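
A sketch of such a table for the attendance example (names are illustrative):

-- Factless fact table: only foreign keys, no measure columns
CREATE TABLE fact_attendance (
    date_key     NUMBER NOT NULL,  -- FK to the date dimension
    employee_key NUMBER NOT NULL,  -- FK to the employee dimension
    CONSTRAINT pk_fact_attendance PRIMARY KEY (date_key, employee_key)
);

-- "How many employees attended in January?" is answered simply by counting rows
SELECT COUNT(*)
FROM   fact_attendance f
JOIN   dim_date d ON d.date_key = f.date_key
WHERE  d.month_name = 'January';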


ETL - Informatica Part II

What is ODS (Operational Data Store)?
ODS - Operational Data Store.

The ODS comes between the staging area and the data warehouse. The data in the ODS will be at a low level of granularity.

Once the data is populated in the ODS, aggregated data is loaded from the ODS into the EDW.

What is the ETL process? How many steps does ETL contain? Explain with an example.
ETL is the extract, transform, load process: you extract data from the source, apply the business rules to it, and then load it into the target.

The steps are:
1. Define the source (create the ODBC connection to the source DB).
2. Define the target (create the ODBC connection to the target DB).
3. Create the mapping (apply the business rules here by adding transformations, and define how the data will flow from the source to the target).
4. Create the session (a set of instructions that runs the mapping).
5. Create the workflow (instructions that run the session).

What is the metadata extension?
Informatica allows end users and partners to extend the metadata stored in the repository by associating information with individual objects in the repository. For example, when you create a mapping, you can store your contact information with the mapping. You associate information with repository metadata using metadata extensions.

Informatica Client applications can contain the following types of metadata extensions:

Vendor-defined. Third-party application vendors create vendor-defined metadata extensions. You can view and change the values of vendor-defined metadata extensions, but you cannot create, delete, or redefine them.
User-defined. You create user-defined metadata extensions using PowerCenter/PowerMart. You can create, edit, delete, and view user-defined metadata extensions. You can also change the values of user-defined extensions.

What is a staging area? Do we need it? What is the purpose of a staging area?
Data staging is actually a collection of processes used to prepare source system data for loading a data warehouse. Staging includes the following steps:

Source data extraction
Data transformation (restructuring)
Data transformation (data cleansing, value transformations)
Surrogate key assignments
(OR)
Staging is a temporary storage area where all data manipulation (cleansing, merging, scrubbing, ...) takes place before loading into the target DWH.

What is partitioning? What are the types of partitioning?
If you use PowerCenter, you can increase the number of partitions in a pipeline to improve session performance. Increasing the number of partitions allows the Informatica Server to create multiple connections to sources and process partitions of source data concurrently.

When you create a session, the Workflow Manager validates each pipeline in the mapping for partitioning. You can specify multiple partitions in a pipeline if the Informatica Server can maintain data consistency when it processes the partitioned data.

When you configure the partitioning information for a pipeline, you must specify a partition type at each partition point in the pipeline.
The partition type determines how the Informatica Server redistributes data across partition points.

The Workflow Manager allows you to specify the following partition types:

Round-robin partitioning. The Informatica Server distributes data evenly among all partitions. Use round-robin partitioning where you want each partition to process approximately the same number of rows.

For more information, see Round-Robin Partitioning.
Hash partitioning. The Informatica Server applies a hash function to a partition key to group data among partitions. If you select hash auto-keys, the Informatica Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key. Use hash partitioning where you want to ensure that the Informatica Server processes groups of rows with the same partition key in the same partition. For more information, see Hash Partitioning.

Key range partitioning. You specify one or more ports to form a compound partition key. The Informatica Server passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range. For more information, see Key Range Partitioning.
Pass-through partitioning. The Informatica Server passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

ETL - Informatica - I

1. How can we use mapping variables in Informatica? Where do we use them?
Yes, we can use mapping variables in Informatica.
The Informatica server saves the value of a mapping variable to the repository at the end of the session run and uses that value the next time we run the session.

2. Is there any way to read MS Excel data directly into Informatica? Is there any possibility to take an Excel file as a target?
We can't directly import an Excel file into Informatica.
We have to define the Microsoft Excel ODBC driver on our system and define names in the Excel sheet by defining ranges.
Then, in Informatica, open the folder and use Sources -> Import from Database -> select the Excel ODBC driver -> Connect -> select the Excel sheet name.

3.  Can Informatica load heterogeneous targets from heterogeneous sources?
Yes, you can use heterogeneous sources and targets in a single mapping. But to join data from heterogeneous sources you have to use a Joiner transformation.

4. Can we look up a table from a source qualifier transformation, i.e. an unconnected lookup?
You cannot look up from a source qualifier directly. However, you can override the SQL in the source qualifier to join with the lookup table to perform the lookup.
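
For example, the source qualifier's SQL override could perform the lookup as a join, as in this sketch (table and column names are illustrative):

-- SQL override in the source qualifier doing the "lookup" as a join
SELECT s.order_id,
       s.customer_id,
       s.amount,
       c.customer_name            -- value that would otherwise come from a lookup
FROM   src_orders s
LEFT JOIN ref_customers c
       ON c.customer_id = s.customer_id;
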
5.  What is Full load & Incremental or Refresh load?
Full Load: completely erasing the contents of one or more tables and reloading with fresh data.

Incremental Load: applying ongoing changes to one or more tables based on a predefined schedule
Refresh Load: the table is truncated and the data is loaded again. We use this method to load static dimension tables or type tables.
Incremental Load: a method to capture only the newly created or updated records. This load is performed based on a flag or a date.
Full Load: when we load the data for the first time, either as a base load or a history load, the whole set of records is loaded in one stretch, depending on the volume.

6. How do we call shell scripts from informatica?
Specify the full path of the shell script in the post-session properties of the session/workflow.
You can also use a Command task to execute shell scripts.
You also need to use the keyword "call" before the full path of the shell script.

Using the same call command you can also call SQL stored procedures.

DS - Primary key in Fact Table

How do we maintain the primary key in a fact table?
A surrogate key is the best option.

Two main reasons to generate and maintain a surrogate key on DW side:

1. If your DW has multiple sources for a dimension or fact, the PK ID fields can have the same values coming from different sources. The only way you can handle this is by maintaining a composite primary key on these columns. Now let's imagine that you have some 10 dimensions in a subject area: you would expect to have only 10 keys in the fact, but by having the composite keys you would end up creating 20 or more keys on the fact. This would in turn adversely affect your query performance.

2. Another case: suppose some data migration activity takes place on the source side -- which is quite possible if the source system platform is changed, or your company acquires another company and integrates its data -- and the source-side architect decides to change the PK field value itself of a table in the source. Your DW would then see this as a new record and insert it, which would result in data inconsistency/discrepancy between the source and the DW, and it could be a nightmare to fix the issue. By having a separate surrogate key on the DW side that is generated based on the grain of the source table (not on the IDs), you are immune to any such PK value changes on the source side.
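
A minimal sketch of generating such a surrogate key on the warehouse side with a database sequence (Oracle-style syntax, names illustrative); inside Informatica the same role is usually played by a Sequence Generator transformation:

CREATE SEQUENCE seq_customer_key START WITH 1 INCREMENT BY 1;

-- The surrogate key is assigned by the warehouse, independent of any source-system ID
INSERT INTO dim_customer (customer_key, source_system, source_customer_id, customer_name)
VALUES (seq_customer_key.NEXTVAL, 'CRM_EU', '10045', 'Acme GmbH');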

DS - Data Integration and Diff. B/W FACT & DIMENSIONS

What is data integration?
Data from different sources (like flat files, CSV, Oracle, DB2, Teradata, mainframes etc.) is extracted and transformed into a single specific pattern (a unique structure); this is known as integration of data.
ETL tools like DataStage, Informatica and Pentaho (open source) are used to convert the data from different formats into that unique format.

OR

Data integration involves combining data residing in different sources and providing users with a unified view of these data. This process becomes significant in a variety of situations, both commercial and scientific (combining research results from different bioinformatics repositories, for example). Data integration appears with increasing frequency as the volume of data and the need to share existing data explode. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved.

***********************************
Difference between FACT AND DIMENSION TABLES

A fact table captures the data that measures the organization's business operations. A fact table might contain business sales events such as cash register transactions or the contributions and expenditures of a nonprofit organization. Fact tables usually contain large numbers of rows, sometimes in the hundreds of millions of records when they contain one or more years of history for a large organization.

Dimension tables contain attributes that describe fact records in the fact table. Some of these attributes provide descriptive information; others are used to specify how fact table data should be summarized to provide useful information to the analyst.

AND

A fact table can store a large amount of numerical data, so a fact table is big in size compared to a dimension table.

************

DS - Performance Tuning

Performance Tuning for DataStage Jobs

1. Select a suitable configuration file (number of nodes depending on data volume)
2. Select buffer memory correctly
3. Select the proper partitioning method
4. Turn off Runtime Column Propagation wherever it's not required
5. Take care with sorts
6. Handle null values (use Modify instead of Transformer)
7. Try to reduce Transformer usage (use Copy, Filter, Modify…)
8. Use a Data Set instead of a Sequential File in the middle of large jobs
9. Use a maximum of about 20 stages per job
10. Choose Join, Lookup or Merge depending on data volume
11. Stop propagation of unnecessary metadata between the stages
12. Don't sort data that is already sorted; stages like Modify and Data Set etc. also help increase performance

***********************

DataStage Performance Issues
This note explains factors that affect DataStage performance, and recommends performance improvement techniques.

Factors Affecting Performance

A DataStage job is essentially a batch program. It runs at the speed of its slowest element. It can be disk I/O-bound, communications I/O-bound, or CPU-bound. Even if a job is doing very little actual transformation work, it can still be CPU bound due to the amount of CPU required to extract data from a source database, or load it into a target database. Following are general guidelines about factors that affect DataStage performance:

* Raw CPU speed. The faster the better. Performance improves roughly linearly with single CPU speed.
* Multiple CPUs. If the jobs to be performed have some inherent parallelism (e.g., processing multiple separate large files), then adding additional CPUs will help. DataStage is not multithreaded (as of release 2.0), so additional CPUs help a single-stream job only if local source and/or target databases can use them productively (e.g., the databases themselves use multiple threads). If the fundamental job is processing one single large file that cannot be easily broken up, then extra CPUs will not help.
* Main memory. The more the better. In-memory aggregation resulting in many output groups can chew a lot of memory storing intermediate values. 40 Mbytes is a pretty standard minimum. 64-80 is better. If you will be using large in-memory lookup files, add memory to accommodate these files to avoid paging overhead.
* Disk I/O. Raw disk I/O speed may affect performance some for local source and/or target databases, but I/O performance to databases is often CPU intensive as well, and money is better spent on a faster CPU.
* Communications I/O. Remote data sources and data targets will tend to slow things way down. If using ODBC to access remote data sources or targets, the best ODBC drivers will support features like array binding and use internal pipelining to provide good performance. Still, most LANs will always be slower than a good local hard disk.
* Other programs running concurrently on the DataStage server machine will tend to slow down the performance of DataStage jobs.

Improving Performance

This section lists tips and techniques for improving DataStage job performance gleaned from field experience to date. Note, however, that some techniques may improve some jobs and not others.

System Configuration
* Upgrade to a faster single CPU and add more cache. This is the cheapest way to improve DataStage performance, and generally improves most jobs.
* Add more main memory. This helps especially if you use the AGGREGATOR stage (figure one rows worth of bytes for each resulting group, and add 20%).
* Move source and target data files/tables to separate disks and enable dual asynchronous I/O paths (e.g., dual SCSI controllers).
* Add additional CPUs if the DataStage job has inherent parallelism that can take advantage of them. A job uses multiple processes if there are multiple paths from a passive stage (sequential file, HASH or ODBC stage) through active stages to another passive stage.
* Use the fastest ODBC driver available for a data source or target. The Intersolv drivers are generally quite fast. Oracle's own ODBC driver is very slow.

Job Design
* Use column selection at the data source whenever possible. Remove unused columns from an ODBC or HASH input stage definition.
* Use row selection at the data source, whenever possible. For ODBC data source stages, add WHERE clauses to eliminate rows that are not interesting. This saves on the amount of data to be processed, validated and rejected.
* Perform joins, validation, and simple transformations at the source database, whenever possible. Databases are good at these things, and the more you can do during the process of data extraction, the less processing will be needed within the DataStage job. (However, UniVerse SQL is quite slow performing joins against independent tables, so avoid doing that wherever you can.)
* Use a database bulk loader stage rather than an ODBC stage to populate a target warehouse table (where available). The bulk loader stages can often take advantage of database options not available via ODBC, although the functionality is generally limited to INSERTing new rows.
* Optimizing an ODBC stage to load a non-UniVerse target warehouse table: Set the ODBC stage properties as follows:
* Set the "Transaction isolation level" to the lowest level consistent with the goals of the job. This minimizes the target database overhead in journalling.
* Set the "Rows per transaction" to 0 to have all inserts/updates performed in a single transaction, or set it to a number of at least 50-100 to have transaction commits on chunks of rows. This minimizes the overhead associated with transaction commits.
* Set the "Parameter array size" to a number that is at least 1024 divided by the total number of bytes per row to be written (e.g., set it to 10 or greater for 100-byte rows). This tells DataStage to pass that many rows in each ODBC request. This "array-binding" capability requires an ODBC driver that supports the "SQLParamOptions" (ODBC API Extension level 2) capability.
If the target database is remote Oracle, use the ORABULK stage to create the necessary data and control files for running the Oracle SQL*LOAD utility. Use an AFTER routine on the ORABULK stage to automate the copying of the data and control files to the target machine and start SQL*LOAD. To maximize loading performance, use either the structure of the DataStage job or the ORABULK stage to create multiple data files and then use the Oracle SQL*LOAD "PARALLEL" option to perform loading in parallel. Use the SQL*LOAD "DIRECT" option (if available and configured on the Oracle database) to bypass SQL for even greater loading performance.
When using an ODBC stage to populate a warehouse target table, use straight INSERT or UPDATE rather than other update actions. This allows target database array binding which is fast. Other options may require an attempt to UPDATE a row first, followed by an INSERT if the UPDATE fails.
When accessing UniVerse source data, use the HASH stage type rather than the ODBC stage type whenever you can. HASH stage access is somewhat faster than ODBC stage access. If you need the functionality of ODBC (for dynamic normalization/flattening), configure the UniVerse data source in the uvodbc.config file to have "DBMSTYPE = UNIVERSE" rather than "DBMSTYPE = ODBC". Also, if the UniVerse file/table is local, configure it in uvodbc.config with "HOST = localhost" rather than the explicit local host name. This allows UniVerse to perform SQL operations in the same process as the caller instead of spawning an extra server process.
Remote UniVerse operations are quite slow and should generally be avoided. If source data is in a remote UniVerse file, use a BEFORE routine to copy the file (e.g., via FTP) to the DataStage server machine, then run fnuxi if needed, and then access the local file copy from the DataStage job. Similarly, if the data target is a remote UniVerse file, create the file locally on the DataStage server machine and use an AFTER routine to copy the file to the target machine and run fnuxi if needed.
Avoid remote reference lookups on secondary inputs to Transformer stages. If you must perform such lookups on every data row (e.g., for validation and transformation), arrange to copy the reference data into a local HASHed file (e.g., via a "BEFORE" exit routine or another ODBC=>HASH job) first, and then use the local HASHed file for reference lookups. You can request that the file/table being used for lookups be copied into memory. This works best when the reference files fit into real memory and don't cause extra paging.
If multiple CPUs are available on the DataStage server machine, split the work into multiple parallel paths within one job, or into multiple jobs.
(Note: a parallel path travels from a passive stage (SEQ, HASH or ODBC) through independent active stages to another passive stage. If two paths share any active stage, they are not parallel, and are performed in a single process and cannot take advantage of multiple CPUs.)
Perform aggregation as early as possible in a job. This minimizes the amount of data that needs to be processed in later job stages.
Aggregation is faster if the input data is sorted on some or all of the columns to be grouped.
Avoid intermediate files where possible, since this adds considerable disk I/O to the job.
Avoid validation and transformation that is not really needed. Every item of work done on a column value in a row adds per-row overhead. If you have control over the quality of the arriving data, then do not validate things that will always be true.
Move large data files/tables in bulk where possible, instead of using remote database access. Use a "BEFORE" routine to copy source data into a local file or table.
If the warehouse target tables are in UniVerse, try to write them using a HASH stage rather than an ODBC stage. HASH stage writes are somewhat faster than SQL INSERTs. If the ultimate target warehouse file/table is a remote UniVerse file (not table), consider writing to a local UniVerse file (on the DataStage server machine) and then using an "AFTER" routine to copy the file to the remote target warehouse machine and run fnuxi on it there.

Transform Functions
* Avoid writing custom transform functions unless you must. If possible, use BASIC expressions and built-in functions as much as possible.
For example, let us say that an input data row contains a separate DATE field in "mm/dd/yy" format and a TIME field in "hh:mm:ss" format. The output row to the target database needs a single TIMESTAMP field, which requires a string of the form "yyyy-mm-dd hh:mm:ss". Rather than define a custom transform function, use the following derivation expression:
OCONV(ICONV(Input.DATE,"D2/"),"D4-YMD[4,2,2]"):" ":Input.TIME
* When writing custom transform functions (e.g., in BASIC), avoid known slow BASIC operations (e.g., concatenation of new values to the beginning or end of a string, pattern matching). Use built-in BASIC functions and expressions as much as possible (because they are implemented efficiently) and avoid calling additional BASIC programs. Perform initialization into private named COMMON once to minimize per-row overhead. Keep in mind that every transform will be executed once per row.
* If multiple custom transforms are needed, do them if possible as separate transform functions instead of one function with switch arguments. This minimizes the code path on individual columns that need only one of the transforms.

DS - Environment Variables

What is an Environment Variable?
Basically, an environment variable is a predefined variable that we can use while creating a DataStage job. We can set it either at the project level or at the job level. Once we set a specific variable, it becomes available in the project/job.
We can also define new environment variables. For that, we go to the DataStage Administrator.

    OR

These are variables used at the project or job level. We can use them to configure the job, i.e., we can associate the configuration file (without this you cannot run your job) or increase the sequential file or dataset read/write buffers.

ex: $APT_CONFIG_FILE

Like the above, there are many more environment variables. Go to Job Properties, click on the Parameters tab, then click on "Add Environment Variable" to see most of the environment variables.

***************************
DataStage EE environment variables
The default environment variables settings are provided during the Datastage installation (common for all users).
Users have a few options to override the default settings with Datastage client applications:
# With Datastage Administrator - project-wide defaults for general environment variables, set per project in the Projects tab under Properties -> General Tab -> Environment...
# With Datastage Designer - settings at the job level in Job Properties
# With Datastage Director - settings per run, which override all other settings and are very useful for testing and debugging.

The Datastage environment variables are grouped and each variable falls into one of the categories.
Basically the default values set up during an installation are reasonable and in most cases there is no need to modify them.

Environment variables overview

Listed below are only the environment variables that are candidates for adjustment in real-life project deployments. Please refer to the DataStage help for details on variables not listed here.
General variables
# LD_LIBRARY_PATH - specifies the location of dynamic libraries on Unix
# PATH - Unix shell search path
# TMPDIR - temporary directory
Parallel properties
# APT_CONFIG_FILE - the parallel job configuration file. It points to the active configuration file on the server. Please refer to Datastage EE configuration guide for more details on creating a config file.
# APT_DISABLE_COMBINATION - prevents operators (stages) from being combined into one process. Used mainly for benchmarks.
# APT_ORCHHOME - home path for parallel content.
# APT_STRING_PADCHAR - defines a pad character which is used when a varchar is converted to a fixed length string 

*******************************

Creating project specific environment variables-

- Start up DataStage Administrator.
- Choose the project and click the "Properties" button.
- On the General tab click the "Environment..." button.
- Click on the "User Defined" folder to see the list of job specific environment variables.
There are two types of variables - string and encrypted. If you create an encrypted environment variable it will appear as the string "*******" in the Administrator tool and will appear as junk text when saved to the DSParams file or when displayed in a job log. This provides robust security of the value.

Migrating Project Specific Job Parameters. It is possible to set or copy job specific environment variables directly to the DSParams file in the project directory. There is also a DSParams.keep file in this directory and if you make manual changes to the DSParams file you will find Administrator can roll back those changes to DSParams.keep. It is possible to copy project specific parameters between projects by overwriting the DSParams and DSParams.keep files. It may be safer to just replace the User Defined section of these files and not the General and Parallel sections.

Environment Variables as Job Parameters
- Open up a job.
- Go to Job Properties and move to the Parameters tab.
- Click on the "Add Environment Variables..." button.
- Set the Default value of the new parameter to "$PROJDEF".

When the job parameter is first created it has a default value the same as the Value entered in the Administrator. By changing this value to $PROJDEF you instruct DataStage to retrieve the latest Value for this variable at job run time. If you have an encrypted environment variable it should also be an encrypted job parameter. Set the value of these encrypted job parameters to $PROJDEF. You will need to type it in twice to the password entry box.

Using Environment Variable Job Parameters
These job parameters are used just like normal parameters by adding them to stages in your job enclosed by the # symbol.
Database=#$DW_DB_NAME#
Password=#$DW_DB_PASSWORD#
File=#$PROJECT_PATH#/#SOURCE_DIR

Tuesday, January 25, 2011

DS - Definitions PART-VII

What is a Star Schema?
A star schema is a way of organising the tables such that results can be retrieved from the database easily and quickly in a warehouse environment. Usually a star schema consists of one or more dimension tables around a fact table, which looks like a star, and that is how it got its name.

What is a general purpose scheduling tool?
The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.

How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. It is not unusual for 100 years to be represented in a time dimension, with one row per day.
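
A sketch of such a loader in plain SQL (Oracle-style CONNECT BY; table and column names are illustrative):

-- Generate one row per day for roughly 100 years starting 1950-01-01
INSERT INTO dim_date (date_key, calendar_date, day_name, month_name, year_num)
SELECT TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')),  -- e.g. 19500101
       d,
       TO_CHAR(d, 'Day'),
       TO_CHAR(d, 'Month'),
       EXTRACT(YEAR FROM d)
FROM  (SELECT DATE '1950-01-01' + LEVEL - 1 AS d
       FROM   dual
       CONNECT BY LEVEL <= 36525);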

What are modeling tools available in the Market?
These tools are used for Data/dimension modeling

Tool Name                  Company Name
Erwin                           Computer Associates
Embarcadero                Embarcadero Technologies
Rational Rose               IBM Corporation
Power Designer            Sybase Corporation
Oracle Designer            Oracle Corporation

Name some of modeling tools available in the Market?
These tools are used for Data/dimension modeling
1. Oracle Designer
2. ERWin (Entity Relationship for windows)
3. Informatica (Cubes/Dimensions)
4. Embarcadero
5. Power Designer Sybase

What is VLDB?
Answer 1: VLDB stands for Very Large Database.
It is an environment or storage space managed by a relational database management system (RDBMS) consisting of vast quantities of information.

Answer 2: VLDB doesn't refer to the size of the database or the vast amount of information stored. It refers to the window of opportunity to take a backup of the database.

Window of opportunity refers to the time interval; if the DBA is unable to take a backup within the specified time, then the database is considered a VLDB.

What are Data Marts?
A data mart is a focused subset of a data warehouse that deals with a single area of data (like a single department) and is organized for quick analysis.

What are the steps to build the data warehouse?
Gathering business requirements
Identifying Sources
Identifying Facts
Defining Dimensions
Define Attributes
Redefine Dimensions & Attributes
Organize Attribute Hierarchy & Define Relationship
Assign Unique Identifiers
Additional conventions: Cardinality/Adding ratios

What is Difference between E-R Modeling and Dimensional Modeling?

The basic difference is that E-R modeling has both a logical and a physical model, while a dimensional model has only a physical model.

E-R modeling is used for normalizing the OLTP database design.

Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.


Why fact table is in normal form?
Basically the fact table consists of the index keys of the dimension/lookup tables and the measures. So whenever we have the keys in a table, that itself implies that the table is in normal form.

What are the advantages data mining over traditional approaches?
Data mining is used for estimating the future. For example, if we take a company/business organization, by using the concept of data mining we can predict the future of the business in terms of revenue (or) employees (or) customers (or) orders etc.

Traditional approaches use simple algorithms for estimating the future, but they do not give accurate results when compared to data mining.

What are the various ETL tools in the market?
Various ETL tools used in market are:
Informatica
Data Stage
Oracle Warehouse Builder
Ab Initio
Data Junction

What is a CUBE in datawarehousing concept?
Cubes are logical representation of multidimensional data. The edge of the cube contains dimension members and the body of the cube contains data values.

What is data validation strategies for data mart validation after loading process?
Data validation is to make sure that the loaded data is accurate and meets the business requirements.
Strategies are the different methods followed to meet those validation requirements.

What is the datatype of the surrogate key?
The datatype of the surrogate key is integer or numeric (number).


What is degenerate dimension table?
Degenerate dimensions: if a table contains values which are neither dimensions nor measures, they are called degenerate dimensions. Ex: invoice id, empno.

What is Dimensional Modelling?
Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e., the dimensions on which the facts are calculated.

What are the methodologies of Data Warehousing.?
Every company has a methodology of its own, but to name a few, the SDLC methodology and the AIM methodology are used as standards. Other methodologies are AMM, the World Class methodology and many more.

What is a linked cube?
A linked cube is one in which a sub-set of the data can be analysed in great detail. The linking ensures that the data in the cubes remains consistent.

DS - Definitions PART-V

What is Normalization, First Normal Form, Second Normal Form , Third Normal Form?

1. Normalization is a process for assigning attributes to entities. It reduces data redundancies, helps eliminate data anomalies, and produces controlled redundancies to link tables.

2. Normalization is the analysis of functional dependencies between attributes/data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields/relations.

1NF: repeating groups must be eliminated, dependencies can be identified, and all key attributes are defined; there are no repeating groups in the table.

2NF: the table is already in 1NF and includes no partial dependencies (no attribute depends on a portion of the primary key). It may still exhibit transitive dependency, i.e. attributes may be functionally dependent on non-key attributes.

3NF: the table is already in 2NF and contains no transitive dependencies.

What is Fact table?
Fact Table contains the measurements or metrics or facts of business process. If your business process is "Sales", then a measurement of this business process such as "monthly sales number" is captured in the Fact table. Fact table also contains the foreign keys for the dimension tables.

What are conformed dimensions?
Answer1: Conformed dimensions mean the exact same thing with every possible fact table to which they are joined. Ex: the Date dimension is connected to all facts like Sales facts, Inventory facts, etc.

Answer2: Conformed dimensions are dimensions which are common to the cubes. (cubes are the schemas contains facts and dimension tables)
Consider Cube-1 contains F1, D1, D2, D3 and Cube-2 contains F2, D1, D2, D4 are the Facts and Dimensions here D1, D2 are the Conformed Dimensions

What are the Different methods of loading Dimension tables?
Conventional Load: Before loading the data, all the Table constraints will be checked against the data.
Direct load:(Faster Loading)
All the Constraints will be disabled. Data will be loaded directly. Later the data will be checked against the table constraints and the bad data won't be indexed.

What are conformed dimensions?
Conformed dimensions are dimensions which can be used across multiple data marts in combination with multiple fact tables accordingly.

What are Data Marts?
Data marts are designed to help managers make strategic decisions about their business.
Data marts are subsets of the corporate-wide data that are of value to a specific group of users.

There are two types of data marts:

1. Independent data marts - sourced from data captured from OLTP systems, external providers or from data generated locally within a particular department or geographic area.
2. Dependent data marts - sourced directly from enterprise data warehouses.

What is a level of Granularity of a fact table?
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to put the sales data for each transaction. Level of granularity then means how much detail you are willing to keep for each transactional fact: product sales with respect to each minute, or aggregated up to the minute.

How are the Dimension tables designed?
Most dimension tables are designed using normalization principles up to 2NF. In some instances they are further normalized to 3NF.
Find where data for this dimension are located.
Figure out how to extract this data.
Determine how to maintain changes to this dimension (see more on this in the next section).

What are non-additive facts?
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

What type of Indexing mechanism do we need to use for a typical datawarehouse?
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the other types of clustered/non-clustered, unique/non-unique indexes.

To my knowledge, SQLServer does not support bitmap indexes. Only Oracle supports bitmaps.
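
In Oracle, for example, the syntax is as follows (a sketch with illustrative table and column names):

-- Bitmap indexes on the low-cardinality foreign keys of the fact table
CREATE BITMAP INDEX bix_sales_date    ON fact_sales (date_key);
CREATE BITMAP INDEX bix_sales_product ON fact_sales (product_key);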

What is a Snowflake Schema?
In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.

What is real time data-warehousing?
Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing.
Real-time activity is activity that is happening right now. The activity could be anything such as the sale of widgets. Once the activity is complete, there is data about it.

Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.


What are slowly changing dimensions?
SCD stands for slowly changing dimensions. Slowly changing dimensions are of three types:

SCD1: only the updated values are maintained.
Ex: if a customer address is modified, we update the existing record with the new address.

SCD2: historical information and current information are maintained by using
A) effective dates,
B) versions,
C) flags, or a combination of these.

SCD3: by adding new columns to the target table, we maintain both historical and current information.
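
A minimal SQL sketch of the SCD2 approach using effective dates and a current-record flag (table, column and key names are illustrative):

-- Step 1: close out the existing current record for the changed customer
UPDATE dim_customer
SET    end_date     = SYSDATE,
       current_flag = 'N'
WHERE  customer_id  = '10045'
AND    current_flag = 'Y';

-- Step 2: insert a new version of the row carrying the new address
INSERT INTO dim_customer
       (customer_key, customer_id, address, start_date, end_date, current_flag)
VALUES (dim_customer_seq.NEXTVAL, '10045', 'New Street 1', SYSDATE, NULL, 'Y');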

What are Semi-additive and factless facts and in which scenario will you use such kinds of fact tables?

Snapshot facts are semi-additive; when we maintain aggregated facts we go for semi-additive facts.
EX: Average daily balance

A fact table without numeric fact columns is called factless fact table.
Ex: Promotion Facts

We use it while maintaining the promotion values of a transaction (ex: product samples), because this table doesn't contain any measures.

Differences between star and snowflake schemas?
Star schema - all dimensions are linked directly to the fact table.
Snowflake schema - dimensions may be interlinked or may have one-to-many relationships with other tables.

DS - Definitions PART-V

What are Aggregate tables?
An aggregate table contains the summary of existing warehouse data, grouped to certain levels of dimensions. Retrieving the required data from the actual table, which may have millions of records, takes more time and also affects server performance. To avoid this we can aggregate the table to the required level and use it. These tables reduce the load on the database server, increase the performance of the query and return the result very quickly.
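
A sketch of building such an aggregate from a detailed fact table (names are illustrative):

-- Monthly sales aggregate rolled up from the daily-grain fact table
CREATE TABLE agg_sales_monthly AS
SELECT d.year_num,
       d.month_num,
       f.product_key,
       SUM(f.sales_amount) AS total_sales,
       SUM(f.quantity)     AS total_quantity
FROM   fact_sales f
JOIN   dim_date   d ON d.date_key = f.date_key
GROUP BY d.year_num, d.month_num, f.product_key;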

What is Dimensional Modelling? Why is it important ?
Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e., the dimensions on which the facts are calculated.

Why is Data Modeling Important?
Data modeling is probably the most labor intensive and time consuming part of the development process. Why bother especially if you are pressed for time? A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints.

The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end-users.

The data model is also detailed enough to be used by the database developers to use as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long-term. Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.

What is data mining?
Data mining is a process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be used to mine data for the most high-risk people to insure in a certain geographical area.

What is ODS?
1. ODS means Operational Data Store.
2. A collection of operational or base data that is extracted from operational databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse to assure that summarized and derived data is calculated properly. The ODS may further become the enterprise shared operational database, allowing operational systems that are being reengineered to use the ODS as their operational database.

What is a dimension table?

A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only textual attributes.

What is a lookup table?
A lookup table is one that is used when updating a warehouse. When the lookup is placed on the target table (fact table/warehouse) based on the primary key of the target, it updates the table by allowing only new or updated records, based on the lookup condition.


Why should you put your data warehouse on a different system than your OLTP system?
Answer1: An OLTP system is basically "data oriented" (ER model) and not "subject oriented" (dimensional model). That is why we design a separate, subject-oriented OLAP system. Moreover, if a complex query is fired on an OLTP system it will cause a heavy overhead on the OLTP server, which affects the day-to-day business directly.

Answer2: The loading of a warehouse will likely consume a lot of machine resources. Additionally, users may create queries or reports that are very resource intensive because of the potentially large amount of data available. Such loads and resource needs will conflict with the needs of the OLTP systems for resources and will negatively impact those production systems.

What is ETL?

ETL stands for extraction, transformation and loading.

ETL provides developers with an interface for designing source-to-target mappings, transformations and job control parameters.
Extraction: take data from an external source and move it to the warehouse pre-processor database.
Transformation: the transform data task allows point-to-point generating, modifying and transforming of data.
Loading: the load data task adds records to a database table in the warehouse.

What does level of Granularity of a fact table signify?
Granularity: the first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:

Determine which dimensions will be included.
Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.