Monday, January 31, 2011

DS - Performance Tuning

Performance Tuning for DataStage Jobs

1. Select suitable configurations file (nodes depending on data volume)
2. Select buffer memory correctly
3. Select proper partition
4. Turn off Runtime Column propagation wherever it's not required
5. Taking care about sort
6. Handling null values (use modify instead of transformer)
7. Try to decrease the transformer use (use copy, filter, modify…)
8. Use dataset instead of sequential file in the middle of the vast jobs
9. Take maximum 20 stages for a job
10. Select Join or Lookup or Merge (depending on data volume)
11. Stop propagation of unnecessary metadata between the stages
Don't sort previously sorted and some stages like modify, dataset...etc increases performance

***********************

DataStage Performance Issues
This note explains factors that affect DataStage performance, and recommends performance improvement techniques. Factors Affecting Performance A DataStage job is essentially a batch program. It runs at the speed of its slowest element. It can be disk I/O-bound, communications I/O-bound, or CPU-bound. Even if a job is doing very little actual transformation work, it can still be CPU bound due to the amount of CPU required to extract data from a source database, or load it into a target database. Following are general guidelines about factors that affect DataStage performance:

* Raw CPU speed. The faster the better. Performance improves roughly linearly with single CPU speed.
* Multiple CPUs. If the jobs to be performed have some inherent parallelism (e.g., processing multiple separate large files), then adding additional CPUs will help. DataStage is not multithreaded (as of release 2.0), so additional CPUs help a single-stream job only if local source and/or target databases can use them productively (e.g., the databases themselves use multiple threads). If the fundamental job is processing one single large file that cannot be easily broken up, then extra CPUs will not help.
* Main memory. The more the better. In-memory aggregation resulting in many output groups can chew a lot of memory storing intermediate values. 40 Mbytes is a pretty standard minimum. 64-80 is better. If you will be using large in-memory lookup files, add memory to accommodate these files to avoid paging overhead.
* Disk I/O. Raw disk I/O speed may affect performance some for local source and/or target databases, but I/O performance to databases is often CPU intensive as well, and money is better spent on a faster CPU.
* Communications I/O. Remote data sources and data targets will tend to slow things way down. If using ODBC to access remote data sources or targets, the best ODBC drivers will support features like array binding and use internal pipelining to provide good performance. Still, most LANs will always be slower than a good local hard disk.
* Other programs running concurrently on the DataStage server machine will tend to slow down the performance of DataStage jobs. Improving Performance This section lists tips and techniques for improving DataStage job performance gleaned from field experience to date. Note, however, that some techniques may improve some jobs and not others. System Configuration
* Upgrade to a faster single CPU and add more cache. This is the cheapest way to improve DataStage performance, and generally improves most jobs.
* Add more main memory. This helps especially if you use the AGGREGATOR stage (figure one rows worth of bytes for each resulting group, and add 20%).
* Move source and target data files/tables to separate disks and enable dual asynchronous I/O paths (e.g., dual SCSI controllers).
* Add additional CPUs if the DataStage job has inherent parallelism that can take advantage of them. A job uses multiple processes if there are multiple paths from a passive stage (sequential file, HASH or ODBC stage) through active stages to another passive stage.
* Use the fastest ODBC driver available for a data source or target. The Intersolv drivers are generally quite fast. Oracles own ODBC driver is very slow. Job Design
* Use column selection at the data source whenever possible. Remove unused columns from an ODBC or HASH input stage definition.
* Use row selection at the data source, whenever possible. For ODBC data source stages, add WHERE clauses to eliminate rows that are not interesting. This saves on the amount of data to be processed, validated and rejected.
* Perform joins, validation, and simple transformations at the source database, whenever possible. Databases are good at these things, and the more you can do during the process of data extraction, the less processing will be needed within the DataStage job. (However, UniVerse SQL is quite slow performing joins against independent tables, so avoid doing that wherever you can.)
* Use a database bulk loader stage rather than an ODBC stage to populate a target warehouse table (where available). The bulk loader stages can often take advantage of database options not available via ODBC, although the functionality is generally limited to INSERTing new rows.
* Optimizing an ODBC stage to load a non-UniVerse target warehouse table: Set the ODBC stage properties as follows:
* Set the "Transaction isolation level" to the lowest level consistent with the goals of the job. This minimizes the target database overhead in journalling.
* Set the "Rows per transaction" to 0 to have all inserts/updates performed in a single transaction, or set it to a number of at least 50-100 to have transaction commits on chunks of rows. This minimizes the overhead associated with transaction commits.
* Set the "Parameter array size" to a number that is at least 1024 divided by the total number of bytes per row to be written (e.g., set it to 10 or greater for 100-byte rows). This tells DataStage to pass that many rows in each ODBC request. This "array-binding" capability requires an ODBC driver that supports the "SQLParamOptions" (ODBC API Extension level 2) capability.
If the target database is remote Oracle, use the ORABULK stage to create the necessary data and control files for running the Oracle SQL*LOAD utility. Use an AFTER routine on the ORABULK stage to automate the copying of the data and control files to the target machine and start SQL*LOAD. To maximize loading performance, use either the structure of the DataStage job or the ORABULK stage to create multiple data files and then use the Oracle SQL*LOAD "PARALLEL" option to perform loading in parallel. Use the SQL*LOAD "DIRECT" option (if available and configured on the Oracle database) to bypass SQL for even greater loading performance.
When using an ODBC stage to populate a warehouse target table, use straight INSERT or UPDATE rather than other update actions. This allows target database array binding which is fast. Other options may require an attempt to UPDATE a row first, followed by an INSERT if the UPDATE fails.
When accessing UniVerse source data, use the HASH stage type rather than the ODBC stage type whenever you can. HASH stage access is somewhat faster than ODBC stage access. If you need the functionality of ODBC (for dynamic normalization/flattening), configure the UniVerse data source in the uvodbc.config file to have "DBMSTYPE = UNIVERSE" rather than "DBMSTYPE = ODBC". Also, if the UniVerse file/table is local, configure it in uvodbc.config with "HOST = localhost" rather than the explicit local host name. This allows UniVerse to perform SQL operations in the same process as the caller instead of spawning an extra server process.
Remote UniVerse operations are quite slow and should generally be avoided. If source data is in a remote UniVerse file, use a BEFORE routine to copy the file (e.g., via FTP) to the DataStage server machine, then run fnuxi if needed, and then access the local file copy from the DataStage job. Similarly, if the data target is a remote UniVerse file, create the file locally on the DataStage server machine and use an AFTER routine to copy the file to the target machine and run fnuxi if needed.
Avoid remote reference lookups on secondary inputs to Transformer stages. If you must perform such lookups on every data row (e.g., for validation and transformation), arrange to copy the reference data into a local HASHed file (e.g., via a "BEFORE" exit routine or another ODBC=>HASH job) first, and then use the local HASHed file for reference lookups. You can request that the file/table being used for lookups be copied into memory. This works best when reference files fit into real memory and dont cause extra paging.).
If multiple CPUs are available on the DataStage server machine, split the work into multiple parallel paths within one job, or into multiple jobs.
(Note: a parallel path travels from a passive stage (SEQ, HASH or ODBC) through independent active stages to another passive stage. If two paths share any active stage, they are not parallel, and are performed in a single process and cannot take advantage of multiple CPUs.)
Perform aggregation as early as possible in a job. This minimizes the amount of data that needs to be processed in later job stages.
Aggregation is faster if the input data is sorted on some or all of the columns to be grouped.
Avoid intermediate files where possible, since this adds considerable disk I/O to the job.
Avoid validation and transformation that is not really needed. Every item of work done on a column value in a row adds per-row overhead. If you have control over the quality of the arriving data, then do not validate things that will always be true.
Move large data files/tables in bulk where possible, instead of using remote database access. Use a "BEFORE" routine to copy source data into a local file or table.
If the warehouse target tables are in UniVerse, try to write them using a HASH stage rather than an ODBC stage. HASH stage writes are somewhat faster than SQL INSERTs. If the ultimate target warehouse file/table is a remote UniVerse file (not table), consider writing to a local UniVerse file (on the DataStage server machine) and then using an "AFTER" routine to copy the file to the remote target warehouse machine and run fnuxi on it there. Transform Functions
* Avoid writing custom transform functions unless you must. If possible, use BASIC expressions and built-in functions as much as possible.
For example, let us say that an input data row contains a separate DATE field in "mm/dd/yy" format and a TIME field in "hh:mm:ss" format. The output row to the target database needs a single TIMESTAMP field, which requires a string of the form "yyyy-mm-dd hh:mm:ss". Rather than define a custom transform function, use the following derivation expression:
OCONV(ICONV(Input.DATE,"D2/"),"D4-YMD[4,2,2]"):" ":Input.TIME
* When writing custom transform functions (e.g., in BASIC), avoid known slow BASIC operations (e.g., concatenation of new values to the beginning or end of a string, pattern matching). Use built-in BASIC functions and expressions as much as possible (because they are implemented efficiently) and avoid calling additional BASIC programs. Perform initialization into private named COMMON once to minimize per-row overhead. Keep in mind that every transform will be executed once per row.
* If multiple custom transforms are needed, do them if possible as separate transform functions instead of one function with switch arguments. This minimizes the code path on individual columns that need only one of the transforms.

No comments:

Post a Comment