
1. Explain about your projects: architecture, dimension and fact tables, sources and targets, transformations used, frequency of populating data, database size.

2. What is dimensional modeling?
Unlike the ER model, the dimensional model is asymmetric, with one large central table (the fact table) connected to multiple dimension tables. It is also called a star schema.

3. What are mapplets?
Mapplets are reusable objects that represent a collection of transformations. Transformations that cannot be included in mapplets are:
COBOL source definitions
Joiner transformations
Normalizer transformations
Non-reusable Sequence Generator transformations
Pre- or post-session procedures
Target definitions
XML source definitions
IBM MQ source definitions
PowerMart 3.5-style lookup functions

4. What are the transformations that use a cache for performance?
Aggregator, Lookup, Joiner and Rank.

5. What are the active and passive transformations?
An active transformation changes the number of rows that pass through the mapping:
1. Source Qualifier
2. Filter
3. Router
4. Rank
5. Update Strategy
6. Aggregator
7. Advanced External Procedure
8. Normalizer
9. Joiner
Passive transformations do not change the number of rows that pass through the mapping:
1. Expression
2. Lookup
3. Stored Procedure
4. External Procedure
5. Sequence Generator
6. XML Source Qualifier

6. What is a Lookup transformation?
It is used to look up data in a relational table, view, or synonym. The Informatica server queries the lookup table based on the lookup ports in the transformation, compares the lookup port values to the lookup table column values based on the lookup condition, and passes the result to other transformations and the target. It is used to:
Get a related value
Perform a calculation
Update slowly changing dimension tables.


Difference between connected and unconnected lookups. Which is better?
Connected:
Receives input values directly from the pipeline.
Can use a dynamic or static cache.
The cache includes all lookup columns used in the mapping.
Can return multiple columns from the same row.
If there is no match, it can return default values; default values can be specified.

Unconnected:
Receives input values from the result of a :LKP expression in another transformation.
Only a static cache can be used.
The cache includes all lookup/output ports in the lookup condition and the lookup or return port.
Can return only one column from each row.
If there is no match it returns NULL; default values cannot be specified.

Explain the various caches:
Static: caches the lookup table before executing the transformation. Rows are not added dynamically.
Dynamic: caches the rows as and when they are passed.
Unshared: within a mapping, if the lookup table is used in more than one transformation, the cache built for the first lookup can be used for the others. It cannot be used across mappings.
Shared: if the lookup table is used in more than one transformation/mapping, the cache built for the first lookup can be used for the others. It can be used across mappings.
Persistent: if the cache generated for a lookup needs to be preserved for subsequent use, a persistent cache is used. It does not delete the index and data files. It is useful only if the lookup table remains constant.
What are the uses of the index and data caches? The conditions are stored in the index cache and the records from the lookup are stored in the data cache.

7. Explain the Aggregator transformation.
The Aggregator transformation allows you to perform aggregate calculations such as average, sum, max, min, etc. Unlike the Expression transformation, the Aggregator can perform calculations on groups; the Expression transformation permits calculations on a row-by-row basis only.
Performance issues? The Informatica server performs calculations as it reads, and stores the necessary group and row data in an aggregate cache. Create sorted input ports and pass the input records to the Aggregator in sorted form, grouped by port.
Incremental aggregation? In the session properties there is an option for performing incremental aggregation. When the Informatica server performs incremental aggregation, it passes new source data through the mapping and uses the historical cache (index and data cache) data to perform the new aggregation calculations incrementally.
What are the uses of the index and data cache? The group data is stored in the index files and the row data is stored in the data files.


8. Explain the Update Strategy transformation.
The update strategy defines how source rows are flagged for insert, update, delete, or reject at the targets.
What are the update strategy constants? DD_INSERT (0), DD_UPDATE (1), DD_DELETE (2), DD_REJECT (3).
If DD_UPDATE is defined in the update strategy and the session is set to treat source rows as INSERT, what happens? Hint: if anything other than Data Driven is specified in the session, the update strategy in the mapping is ignored.
What are the three areas where rows can be flagged for particular treatment? In the mapping, in the session Treat Source Rows As setting, and in the session target options.
What is the use of Forward/Reject rows in a mapping?

9. Explain the Expression transformation.
The Expression transformation is used to calculate values in a single row before writing to the target.
What are the default values for variables? Hint: String = NULL, Number = 0, Date = 1/1/1753.

10. Difference between the Router and Filter transformations?
In a Filter transformation the records are filtered based on one condition and the rejected rows are discarded. In a Router multiple conditions can be placed, and the rejected rows can be assigned to a port.

11. In how many ways can you filter records?
1. Source Qualifier
2. Filter transformation
3. Router transformation
4. Rank
5. Update Strategy

12. How do you call the Stored Procedure and External Procedure transformations?
An external procedure can be called in the pre-session and post-session tags in the session property sheet. Stored procedures are called in the Mapping Designer by three methods:
1. Select the icon and add a Stored Procedure transformation.
2. Select Transformation – Import Stored Procedure.
3. Select Transformation – Create and then select Stored Procedure.

13. Explain the Joiner transformation and where it is used.
While a Source Qualifier transformation can join data originating from a common source database, the Joiner transformation joins two related heterogeneous sources residing in different locations or file systems:
Two relational tables in separate databases
Two flat files in different file systems
Two different ODBC sources
In one transformation how many sources can be coupled? Two sources can be coupled; if more than two are to be joined, add another Joiner in the hierarchy.
What are the join options? Normal (default), Master Outer, Detail Outer, Full Outer.

14. Explain the Normalizer transformation.
The Normalizer transformation normalizes records from COBOL and relational sources, allowing you to organize the data according to your own needs. A Normalizer transformation can appear anywhere in a data flow when you normalize a relational source. Use a Normalizer transformation instead of the Source Qualifier transformation when you normalize a COBOL source. When you drag a COBOL source into the Mapping Designer workspace, the Normalizer transformation appears, creating input and output ports for every column in the source.

15. What is the Source Qualifier transformation?
When you add a relational or flat file source definition to a mapping, you need to connect it to a Source Qualifier transformation. The Source Qualifier represents the records that the Informatica server reads when it runs a session. It can be used to:
Join data originating from the same source database.
Filter records when the Informatica server reads the source data.
Specify an outer join rather than the default inner join.
Specify sorted ports.
Select only distinct values from the source.
Create a custom query to issue a special SELECT statement for the Informatica server to read the source data.

16. What is the Rank transformation?
It filters the required number of records from the top or from the bottom.

17. What is the target load order option?
It defines the order in which the Informatica server loads data into the targets. This is to avoid integrity constraint violations.

18. How do you identify bottlenecks in mappings?
Bottlenecks can occur in:
1. Targets
The most common performance bottleneck occurs when the Informatica server writes to a target database. You can identify a target bottleneck by configuring the session to write to a flat file target. If session performance increases significantly when you write to a flat file, you have a target bottleneck.
Solutions: drop or disable indexes or constraints; perform a bulk load (ignores the database log); increase the commit interval (recovery is compromised); tune the database for RBS, dynamic extension, etc.
2. Sources
Set a Filter transformation after each Source Qualifier that lets no records through; if the time taken is the same, there is a source problem. You can also identify a source problem with a read test session: copy the mapping with only the sources and Source Qualifiers, remove all transformations, and connect to a file target. If the performance is the same, there is a source bottleneck. Using a database query: copy the read query directly from the log and execute it against the source database with a query tool. If the time it takes to execute the query and the time to fetch the first row are significantly different, the query can be modified using optimizer hints.
Solutions: optimize queries using hints (see the SQL sketch at the end of this answer); use indexes wherever possible.
3. Mapping
If both source and target are OK, the problem could be in the mapping. Add a Filter transformation before the target; if the time is the same, there is a mapping problem. Alternatively, look at the performance monitor counters in the session property sheet. High error rows and rows in the lookup cache indicate a mapping bottleneck.


Solutions:
Optimize single-pass reading.
Optimize the Lookup transformation:
1. Cache the lookup table. When caching is enabled, the Informatica server caches the lookup table and queries the cache during the session; when this option is not enabled, the server queries the lookup table on a row-by-row basis. Caches can be static, dynamic, shared, unshared or persistent.
2. Optimize the lookup condition. When multiple conditions are placed, the condition with the equality sign should take precedence.
3. Index the lookup table. The cached lookup table should be indexed on the ORDER BY columns (the session log contains the ORDER BY statement). For an uncached lookup, since the server issues a SELECT statement for each row passing into the Lookup transformation, it is better to index the lookup table on the columns used in the condition.
Optimize the Filter transformation: you can improve efficiency by filtering early in the data flow. Instead of using a Filter transformation halfway through the mapping to remove a sizable amount of data, use a Source Qualifier filter to remove those same rows at the source. If it is not possible to move the filter into the Source Qualifier, move the Filter transformation as close to the Source Qualifier as possible to remove unnecessary data early in the data flow.
Optimize the Aggregator transformation:
1. Group by simpler columns, preferably numeric columns.
2. Use sorted input. Sorted input decreases the use of aggregate caches; the server assumes all input data is sorted and performs the aggregate calculations as it reads.
3. Use incremental aggregation in the session property sheet.
Optimize the Sequence Generator transformation:
1. Try creating a reusable Sequence Generator transformation and use it in multiple mappings.
2. The Number of Cached Values property determines the number of values the Informatica server caches at one time.
Optimize the Expression transformation:
1. Factor out common logic.
2. Minimize aggregate function calls.
3. Replace common sub-expressions with local variables.
4. Use operators instead of functions.
4. Sessions
If you do not have a source, target, or mapping bottleneck, you may have a session bottleneck. You can identify a session bottleneck by using the performance details. The Informatica server creates performance details when you enable Collect Performance Data on the General tab of the session properties. Performance details display information about each Source Qualifier, target definition, and individual transformation. All transformations have some basic counters that indicate the number of input rows, output rows, and error rows. Any value other than zero in the readfromdisk and writetodisk counters for Aggregator, Joiner, or Rank transformations indicates a session bottleneck. Low BufferInput_efficiency and BufferOutput_efficiency counters also indicate a session bottleneck. Small cache size, low buffer memory, and small commit intervals can cause session bottlenecks.
5. System (networks)
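As a concrete illustration of the "optimize queries using hints" and "index the lookup table" suggestions above, here is a minimal Oracle SQL sketch; the table, index and column names are hypothetical, not from the original document:

-- Hinted read query that could be pasted into the Source Qualifier as a SQL override.
SELECT /*+ INDEX(ord ord_date_idx) PARALLEL(ord, 4) */
       ord.order_id, ord.customer_id, ord.order_amount
FROM   orders ord
WHERE  ord.order_date >= TO_DATE('2003-01-01', 'YYYY-MM-DD');

-- Index on the column used in the lookup condition / ORDER BY of a cached lookup.
CREATE INDEX customer_dim_lkp_idx ON customer_dim (customer_id);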


19. How do you improve session performance?
1. Run concurrent sessions.
2. Partition the session (PowerCenter).
3. Tune parameters: DTM buffer pool, buffer block size, index cache size, data cache size, commit interval, tracing level (Normal, Terse, Verbose Init, Verbose Data). The session has memory to hold 83 sources and targets; if there are more, the DTM buffer can be increased. The Informatica server uses the index and data caches for the Aggregator, Rank, Lookup and Joiner transformations: it stores the transformed data from these transformations in the data cache before returning it to the data flow, and stores their group information in the index cache. If the allocated data or index cache is not large enough to store the data, the server stores it in a temporary disk file as it processes the session data, and each time the server pages to disk, performance slows; this can be seen from the counters. Since the data cache is generally larger than the index cache, it should be allocated more memory than the index cache.
4. Remove the staging area.
5. Turn off session recovery.
6. Reduce error tracing.

20. What are the tracing levels?
Normal (default): logs initialization and status information, errors encountered, and skipped rows due to transformation errors; summarizes session results but not at the row level.
Terse: logs initialization, error messages, and notification of rejected data.
Verbose Init: in addition to normal tracing, logs additional initialization information, the names of index and data files used, and detailed transformation statistics.
Verbose Data: in addition to Verbose Init, records row-level logs.

21. What are slowly changing dimensions?
Slowly changing dimensions are dimension tables that have slowly increasing data as well as updates to existing data.

22. What are mapping parameters and variables?
A mapping parameter is a user-defined constant that takes a value before running a session. It can be used in Source Qualifier expressions, Expression transformations, etc.
Steps:
Define the parameter in the Mapping Designer under Parameters & Variables.
Use the parameter in expressions.
Define the value for the parameter in the parameter file.
A mapping variable is defined similarly to a parameter, except that its value can change. It picks up its value in the following order:
1. From the session parameter file
2. As stored in the repository object from the previous run
3. As defined in the initial values in the Designer
4. Default values

DOC : 2

Q: What is a Data Warehouse? A: A Data Warehouse is the "corporate memory". Academics will say it is a subject-oriented, point-in-time, integrated, time-variant, non-volatile, inquiry-only collection of operational data. Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently from traditional relational databases.

Q: What is ETL? A: ETL is the data warehouse acquisition process of Extracting (E), Transforming or Transporting (T) and Loading (L) data from source systems into the data warehouse.

Q: What is the difference between a Data Warehouse and a Data Mart? A: A data mart contains department (particular business line) data, whereas a data warehouse contains enterprise-wide data.

Q: What is the difference between a W/H and an OLTP application? A: OLTP databases are designed to maintain atomicity (normalized), consistency and integrity of data (the "ACID" tests, many constraints). An OLTP application is update-intensive (many small updates). Warehouses are time-referenced, subject-oriented, non-volatile (read only) and integrated. Since a data warehouse is not updated, these constraints are relaxed.

Q: What is the difference between OLAP, ROLAP, MOLAP and HOLAP? A: On Line Analytical Processing (OLAP), Relational OLAP (use RDBMS), Multi dimensional OLAP (cube), Hybrid OLAP (ROLAP+MOLAP).

ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the data is really stored in a Relational Database (RDBMS).

MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with dimensions, but the data is stored in a multi-dimensional database (MDBMS) like Oracle Express Server. In a MOLAP system most queries have a finite answer, and performance is usually critical and fast.

HOLAP stands for Hybrid OLAP, a combination of both worlds. Seagate Software's Holos is an example HOLAP environment. In a HOLAP system one will find queries on aggregated data as well as on detailed data.

Q: What is the difference between an ODS and a W/H? A: An ODS is a staging area where we bring all OLTP data on a real-time basis and put it in a de-normalized form. A W/H contains data for a longer period and is non-volatile (read only) and integrated in nature.

Q: What Oracle tools can be used to design and build a W/H? A: Oracle Warehouse Builder.

Q: When should one use an MD-database (multi-dimensional database) and not a relational one? A: To develop an analytical application allowing users to slice and dice measures against various contexts (dimensions).

Q: What is a star schema? Why does one design this way? A: A star schema has only one fact table, containing all the measures, and multiple dimension tables directly linked to the fact table, containing the various contexts against which the measures have been taken. This helps address real-life analytical problems by providing multidimensional cube views to the users.

It allows for the highest level of metadata flexibility, low maintenance as the data warehouse matures, and the best possible performance.
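A minimal sketch of a typical star-schema query, assuming a hypothetical SALES_FACT fact table and CUSTOMER_DIM / TIME_DIM dimension tables joined on surrogate keys (these names are illustrative, not from the original document):

SELECT t.calendar_year,
       c.customer_region,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact   f,
       customer_dim c,
       time_dim     t
WHERE  f.customer_key = c.customer_key
AND    f.time_key     = t.time_key
GROUP BY t.calendar_year, c.customer_region;

Each dimension contributes one equi-join to the fact table, which is what makes the star layout simple to query and to optimize.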


Q: When should you use a STAR and when a SNOWFLAKE schema? A: We should always avoid a SNOWFLAKE schema and de-normalize it to a STAR.

Q: What is the difference between Oracle Express and Oracle Discoverer? A: Express is an MD database and development environment. Discoverer is an ad-hoc end-user query tool.

Q: What is the difference between View and Materialized View?

A: In a materialized view, loading or replication takes place only once; its data is saved in a physical table, so data access is fast due to direct access to the table. A view has to query the base tables every time it is referenced, which results in slower performance.

Q: How can Oracle materialized views be used to speed up data warehouse queries? A: Using the "Query Rewrite" feature, Oracle may access data from available materialized views instead of the base tables. This eliminates some table joins.
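A minimal sketch of such a materialized view in Oracle SQL, reusing the hypothetical SALES_FACT and CUSTOMER_DIM tables from the star-schema example; the object names and refresh options are illustrative:

CREATE MATERIALIZED VIEW sales_by_region_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT c.customer_region, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f, customer_dim c
WHERE  f.customer_key = c.customer_key
GROUP BY c.customer_region;

-- Query rewrite must also be enabled for the session (or instance).
ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;

With this in place, a query that aggregates sales by region can be transparently rewritten against the materialized view instead of joining the base tables.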

Q: What Oracle features can be used to optimize my warehouse system? A: Bitmap indexes, join indexes, enabling "Query Rewrite" to use materialized views, setting the parameter STAR_TRANSFORMATION_ENABLED = TRUE, partitioning, parallel query (PARALLEL_MAX_SERVERS > 0 and a degree of parallelism greater than 1 on the table), transportable tablespaces to transfer data between Oracle databases, etc.
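A hedged sketch of a few of these features in Oracle SQL; the object names and the degree of parallelism are hypothetical, typical warehouse settings:

-- Bitmap index on a low-cardinality fact-table foreign key, as used by star transformation.
CREATE BITMAP INDEX sales_fact_cust_bix ON sales_fact (customer_key);

-- Enable star transformation for the session.
ALTER SESSION SET star_transformation_enabled = TRUE;

-- Give the fact table a degree of parallelism greater than 1.
ALTER TABLE sales_fact PARALLEL (DEGREE 4);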

Q: What do you know about Informatica and ETL? A: Informatica is a very useful GUI-based ETL tool.

Q: FULL and DELTA files. Historical and ongoing load. A: A FULL file contains the complete data as of today, including history data; a DELTA file contains only the changes since the last extract.

Q: Power Center / Power Mart – which products have you worked with? A: Power Center has global and local repositories, whereas Power Mart has only a local repository.

Q: Explain what tools you have used in Power Center and/or Power Mart. A: Designer, Server Manager, and Repository Manager.

Q: What is a mapping? A: A mapping represents the data flow between sources and targets through a chain of transformations.

Q: What components must a mapping contain? A: Source definitions, transformations, target definitions and connectors.

Q: What is a transformation? What types are there? A: A transformation is a repository object that generates, modifies, or passes data; each transformation performs a specific function. There are two types of transformations:

1. Active: affects the rows during the transformation and can change the number of rows that pass through it. E.g. Aggregator, Filter, Joiner, Normalizer, Rank, Router, Source Qualifier, Update Strategy, ERP Source Qualifier, Advanced External Procedure.


2. Passive: does not change the number of rows that pass through it. E.g. Expression, External Procedure, Input, Lookup, Stored Procedure, Output, Sequence Generator, XML Source Qualifier.

Q: Which transformations can be overridden at the server? A: Source Qualifier and Lookup transformations.

Q: What are connected and unconnected transformations? Give examples.

Q: What are the options/types to run a stored procedure? A:

Normal: During a session, the stored procedure runs where the transformation exists in the mapping on a row-by-row basis. This is useful for calling the stored procedure for each row of data that passes through the mapping, such as running a calculation against an input port. Connected stored procedures run only in normal mode.

Pre-load of the Source. Before the session retrieves data from the source, the stored procedure runs. This is useful for verifying the existence of tables or performing joins of data in a temporary table.

Post-load of the Source. After the session retrieves data from the source, the stored procedure runs. This is useful for removing temporary tables.

Pre-load of the Target. Before the session writes data to the target, the stored procedure runs. This is useful for verifying target tables or disk space on the target system.

Post-load of the Target. After the session writes data to the target, the stored procedure runs. This is useful for re-creating indexes on the database.

It must contain at least one Input and one Output port.

Q: What kinds of sources and targets can be used in Informatica? A: Sources may be flat files, relational databases or XML. Targets may be relational tables, XML or flat files.

Q: Transformations: what are the different transformations you have worked with? A: Source Qualifier (XML, ERP, MQ), Joiner, Expression, Lookup, Filter, Router, Sequence Generator, Aggregator, Update Strategy, Stored Procedure, External Procedure, Advanced External Procedure, Rank, Normalizer.

Q: What are active/passive transformations? A: Passive transformations do not change the number of rows passing through them, whereas active transformations can change the number of rows passing through them.
Active: Filter, Aggregator, Rank, Joiner, Source Qualifier
Passive: Expression, Lookup, Stored Procedure, Sequence Generator

Q: What are connected/unconnected transformations? A:

Connected transformations are part of the mapping pipeline. Their input and output ports are connected to other transformations.

Unconnected transformations are not part of the mapping pipeline; they are not linked in the map with any input or output ports. E.g. in an unconnected Lookup you can pass multiple values to the transformation, but only one column of data will be returned from it. Unconnected: Lookup, Stored Procedure.

Q: In target load ordering, what do you order – targets or Source Qualifiers? A: Source Qualifiers. If there are multiple targets in the mapping which are populated from multiple sources, then we can use target load ordering.

Q: Have you used constraint-based load ordering? Where do you set this? A: Constraint-based loading can be used when you have multiple targets in the mapping and the target tables have a PK-FK relationship in the database. It is set in the session properties: set Treat Rows As to INSERT and check the "Constraint based load ordering" box on the Advanced tab.

Q: If you have a FULL file that you have to match and load into a corresponding table, how will you go about it? Will you use a Joiner transformation? A: Use a Joiner and join the file and the Source Qualifier.

Q: If you have 2 files to join, which file will you use as the master file? A: Use the file with fewer records as the master file.

Q: If a Sequence Generator (with an increment of 1) is connected to (say) 3 targets and each target uses the NEXTVAL port, what value will each target get? A: Each target will get values in multiples of 3.

Q: Have you used the Abort and Decode functions? A: Abort can be used to abort/stop the session on an error condition. If the primary key column contains NULL and you need to stop the session from continuing, you may use the ABORT function in the default value for the port. It can be used with the IIF and DECODE functions to abort the session.

Q: Have you used SQL override? A: It is used to override the default SQL generated in the Source Qualifier / Lookup transformation.


Q: If you make a local transformation reusable by mistake, can you undo the reusable action?

A: No

Q: What is the difference between the Filter and Router transformations? A: A Filter can filter records based on one condition only, whereas a Router can be used to filter records on multiple conditions. E.g. suppose out of 50 records only 10 match the filter condition: in a Filter transformation 10 will pass and 40 will be rejected, whereas in a Router those 40 can be tested against other conditions and routed to another target.

Q: Lookup transformations: cached/uncached. A: When the Lookup transformation is cached, the Informatica server caches the data and index. This is done at the beginning of the session, before reading the first record from the source. If the Lookup is uncached, the Informatica server reads the data from the database for every record coming from the Source Qualifier.

Q: Connected/unconnected – if there is no match for the lookup, what is returned? A: An unconnected Lookup returns NULL if there is no matching record found in the Lookup transformation, whereas a connected Lookup can return the default value.

Q: What is persistent cache? A: When the Lookup is configured with a persistent cache, the Informatica server does not delete the cache files after completion of the session. In the next run the Informatica server uses these cache files from the previous session.

Q: What is the dynamic lookup strategy? A: The Informatica server compares the data in the lookup table and the cache; if no matching record is found in the cache file, it modifies the cache file by inserting the record. You may use only equality (=) in the lookup condition. If multiple matches are found in the lookup, Informatica fails the session. By default the Informatica server creates a static cache.

Q: Mapplets: what are the 2 transformations used only in mapplets? A: Mapplet Input / Source Qualifier, and Mapplet Output.

Q: Have you used shortcuts? A: Shortcuts may be used to refer to another mapping; Informatica refers to the original mapping. If any changes are made to the mapping/mapplet, they are immediately reflected in the mapping where it is used.

Q: If you used a database when importing sources/targets that was dropped later on, will your mappings still be valid? A: No.

Q: In an Expression transformation, how can you store a value from the previous row? A: By creating a variable in the transformation.

Q: How does Informatica do variable initialization? Number/String/Date


A: Number – 0, String – blank, Date – 1/1/1753

Q: Have you used the Informatica debugger? A: The debugger is used to test the mapping during development. You can set breakpoints in the mappings and analyze the data.

Q: What do you know about the Informatica server architecture? Load Manager, DTM, Reader, Writer, Transformer. A:

Load Manager is the first process started when the session runs. It checks for validity of mappings, locks sessions and other objects.

DTM process is started once the Load Manager has completed its job. It starts a thread for each pipeline.

Reader scans data from the specified sources. Writer manages the target/output data. Transformer performs the task specified in the mapping.

Q: Have you used partitioning in sessions? (Not available with PowerMart.) A: It is available in PowerCenter. It can be configured in the session properties.

Q: Have you used the external loader? What is the difference between normal and bulk loading? A: An external loader / bulk load performs a direct data load to the table/data files, bypassing the SQL layer and not logging the data. During a normal load, data passes through the SQL layer and is logged to the archive log, and as a result it is slower.

Q: Do you enable/disable decimal arithmetic in the session properties? A: Disabling decimal arithmetic will improve session performance, but it converts numeric values to double, leading to reduced accuracy.

Q: When would you use multiple update strategies in a mapping? A: When you would like to insert and update records in a Type 2 dimension table.

Q: When would you truncate the target before running the session? A: When we want to load the entire data set, including history, in one shot and the update strategy does only DD_INSERT (no DD_UPDATE or DD_DELETE).

Q: How do you use the Stored Procedure transformation in a mapping? A: Inside the mapping we can use a Stored Procedure transformation, pass input parameters and get back the output parameters. When handled through the session, it can be invoked in either pre-session or post-session scripts.

Q: What did you do in the stored procedure? Why did you use a stored procedure instead of an expression? A:

Q: When would you use SQ, Joiner and Lookup? A:

If we are using multiples source tables and they are related at the database, then we can use a single SQ.

If we need to Lookup values in a table or Update Slowly Changing Dimension tables then we can use Lookup transformation.


Joiner is used to join heterogeneous sources, e.g. Flat file and relational tables.

Q: How do you create a batch load? What are the different types of batches? A: A batch is created in the Server Manager and contains multiple sessions. First create the sessions and then create a batch, dragging the sessions into the batch from the session list window. Batches may be sequential or concurrent: a sequential batch runs the sessions one after another, while concurrent sessions run in parallel, thus optimizing server resources.

Q: How did you handle reject data? What file does Informatica create for bad data? A: Informatica saves the rejected data in a .bad file. Informatica adds a row indicator to each rejected record indicating whether the row was rejected by the Writer or the Target. Additionally, for every column there is a column indicator specifying whether the data was rejected due to overflow, null, truncation, etc.

Q: How did you handle runtime errors? If the session stops abnormally how were you managing the reload process?

Q: Have you used the pmcmd command? What can you do using this command? A: pmcmd is a command line program. Using this command you can start sessions, stop sessions and recover sessions.

Q: What are the two default repository user groups? A: Administrators and Public.

Q: What are the privileges of the Default Repository and Extended Repository user? A: Default Repository Privileges:

o Use Designer
o Browse Repository
o Create Sessions and Batches

Extended Repository Privileges:
o Session Operator
o Administer Repository
o Administer Server
o Super User

Q: How many different locks are available for repository objects? A: There are five kinds of locks on repository objects:

Read lock. Created when you open a repository object in a folder for which you do not have write permission. Also created when you open an object with an existing write lock.

Write lock. Created when you create or edit a repository object in a folder for which you have write permission.

Execute lock. Created when you start a session or batch, or when the Informatica Server starts a scheduled session or batch.

Fetch lock. Created when the repository reads information about repository objects from the database.

Save lock. Created when you save information to the repository.

Q: What is Session Process?


A: The Load Manager process. Starts the session, creates the DTM process, and sends post-session email when the session completes.

Q: What is the DTM process? A: The DTM process creates threads to initialize the session, read, write, and transform data, and handle pre- and post-session operations.

Q: When the Informatica Server runs a session, what tasks are handled? A:

Load Manager (LM):
o LM locks the session and reads session properties.
o LM reads the parameter file.
o LM expands the server and session variables and parameters.
o LM verifies permissions and privileges.
o LM validates source and target code pages.
o LM creates the session log file.
o LM creates the DTM (Data Transformation Manager) process.

Data Transformation Manager (DTM):
o DTM allocates DTM process memory.
o DTM initializes the session and fetches the mapping.
o DTM executes pre-session commands and procedures.
o DTM creates reader, transformation, and writer threads for each source pipeline. If the pipeline is partitioned, it creates a set of threads for each partition.
o DTM executes post-session commands and procedures.
o DTM writes historical incremental aggregation and lookup data to disk, and writes persisted sequence values and mapping variables to the repository.
o Load Manager sends post-session email.

Q: What is a code page? A: A code page contains the encoding to specify characters in a set of one or more languages.

Q: How do you handle performance on the server side? A: The Informatica tool has no role to play here; the server administrator will take up the issue.

Q: What are the DTM (Data Transformation Manager) parameters? A:

DTM memory parameters – default buffer block size, data and index cache sizes; Reader parameter – line sequential buffer length for flat files; General parameters – commit interval (source and target); Others – enabling lookup cache; Event-based scheduling – indicator file to wait for.

DOC : 3

Answers to Informatica questions


1. Large Datasets Processing Techniques How do you handle large datasets?

By using bulk utility mode at the session level and, if possible, by disabling constraints after consulting with the DBA. Using bulk utility mode means that no writing takes place in the rollback segment, so loading is faster; however, the pitfall is that recovery is not possible.

2. Large Datasets Processing Techniques What restrictions do you impose on the Source Qualifier to account for large datasets?

We can put as many filter/join conditions as possible in the Source Qualifier and analyze and tune the query. We can also write a custom query for each partition's Source Qualifier when we partition the session.

3. Large Datasets Processing Techniques How do you manage incremental loading for large datasets?

The incremental loading option at the session level can be used; however, an explicit incremental loading technique is preferred, as it gives more control in handling the case.

4. Large Datasets Processing Techniques When is it more convenient to join in the database and when in Informatica?

Definitely at the database level, in the Source Qualifier query itself, rather than using a Joiner transformation.

5. Recovery of Rejected/Interrupted Batch Jobs How do you recover from a failure in a session?

Check the log file and find the error. Do a cause-and-effect analysis, solve it, and then either restart the session in recovery mode or simply restart the session, depending on the database activity.

We have to run the session in recovery mode if any data has already been committed to the target in this load; otherwise we can restart the session.

The server can recover the same session more than once. That is, if a session fails while running in recovery mode, you can re-run the session in recovery mode until the session completes successfully. This is called nested recovery.

6. Recovery of Rejected/Interrupted Batch Jobs Can you state the mechanism to recover data from the .bad file?

When performing recovery, the server creates a single reject file. The server appends rejected rows from the recovery session (or sessions) to the session reject file. This allows you to correct and load all rejected rows from the completed session.

In case of a load failure an entry is made in the OPB_SERV_ENTRY(?) table, from which the extent of loading can be determined.

7. Recovery of Rejected/Interrupted Batch Jobs How does the recovery mode work in informatica?

We have to run the session in recovery mode if any data has already been committed to the target in this load; otherwise we can restart the session.

The server can recover the same session more than once. That is, if a session fails while running in recovery mode, you can re-run the session in recovery mode until the session completes successfully. This is called nested recovery.

8. Recovery of Rejected/Interrupted Batch Jobs Can you suggest other mechanisms to deal with data recovery?

By using database logging (rollback/commit), or by using an external loader utility like SQL*Loader.


9. Mapping/Session Performance Tuning How do you perform tuning in a mapping?

In the case of a Joiner, the query itself can be fine-tuned. In the case of aggregate and filter combinations, keep filters first and aggregates after the filters. Use a Router transformation instead of a Filter transformation wherever possible. In a similar fashion, performance tuning can be done depending on the existing scenarios.

10. Mapping/Session Performance Tuning What parameters can be tweaked to get better performance from a session?

DTM shared memory, index cache memory, data cache memory, indexing, using persistent cache, increasing the commit interval, etc.

11. Mapping/Session Performance Tuning How do you measure session performance?

By checking the "Collect Performance Data" check box.

You create performance details by selecting Perform Monitor in the session property sheet before running the session. By evaluating the final performance details, you can determine where session performance slows down. Monitoring also provides session-specific details that can help tune the following:

Buffer block size

Index and data cache size for Aggregator, Rank, Lookup, and Joiner transformations

Lookup transformations

Before using performance details to improve session performance you must do the following:

Enable monitoring

Understand performance counters

12. Mapping/Session Performance Tuning Can you name the mechanism by which Informatica collects information about the performance of a mapping?

You can view session performance details through the Server Manager or by locating and opening the performance details file. The Informatica Server names the file session_name.PERF and stores it in the same directory as the session log.

13. Strengths and Weaknesses of the Tool Lookup Concepts and Applications (Connected, Unconnected and Dynamic)

If only one value is needed from the lookup table, we can use an unconnected lookup as well as a connected lookup; if we need multiple columns from the lookup table, we have to use a connected lookup. When you use a dynamic cache, the Informatica Server inserts rows into the cache as it passes rows to the target.

14. Strengths and Weaknesses of the Tool Active Transformations Concept and Use

Active transformations are those where the number of records going out of the transformation may differ from the number of records that entered it, for example Filter, Router, Aggregator and Source Qualifier. Passive transformations are those where the number of records flowing out of the transformation is the same as the number of records that entered it, for example Lookup and Expression.

15. Strengths and Weaknesses of the Tool Limitations of the Use of Batches

We cannot create a batch as a reusable object in Informatica 5.x, but we can reuse a set of sessions as a worklet in Informatica 6.x.

16. Strengths and Weaknesses of the Tool Invoking Informatica Outside the UI

You can use the command line program pmcmd to communicate with the Informatica Server. You can perform the following actions with pmcmd:

Determine if the Informatica Server is running.
Start sessions and batches.
Stop sessions and batches.
Recover sessions.
Stop the Informatica Server.

You can configure repository usernames and passwords as environment variables for pmcmd.

DOC : 4

Informatica

Duration : 1 Hr. Max. Marks : 100

1. Where exactly is the source and target information stored? (2)
2. What is the difference between Power Mart and Power Center? Elaborate. (2)
3. What are variable ports? List two situations when they can be used. (2)
4. What are the parts of the Informatica Server? (2)
5. How does the server recognise the source and target databases? Elaborate on this. (2)
6. List the transformation used for each of the following: (10)

a) Heterogeneous sources
b) Homogeneous sources
c) Find the 5 highest paid employees within a department
d) Create a summary table
e) Generate surrogate keys

7. What is the difference between sequential batch and concurrent batch and which is recommended and why ? (2)

8. Designer is used for ____________________ (1)
9. Repository Manager is used for __________________ (1)
10. Server Manager is used for ______________________ (1)
11. Server is used for _____________________________ (1)
12. A session S_MAP1 is in Repository A. While running the session an error message is displayed: 'server hot-ws270 is connected to Repository B'. What does it mean? (2)
13. How do you do error handling in Informatica? (2)
14. How do you implement scheduling in Informatica? (2)
15. What is the meaning of upgradation of a repository? (2)
16. How can you run a session without using the Server Manager? (2)
17. What is an indicator file and where is it used? (2)
18. What are pre- and post-session stored procedures? Write a suitable example. (2)
19. Consider two cases:
1. Power Center Server and Client on the same machine
2. Power Center Server and Client on different machines
What is the basic difference in these two setups and which is recommended? (2)

20. The Informatica Server and Client are on different machines. You run a session from the Server Manager by specifying the source and target databases. It displays an error. You are confident that everything is correct. Then why is it displaying the error? (2)

21. When you connect to the repository for the first time it asks you for the user name & password of both the repository and the database, but subsequent times it asks only for the repository password. Why? (2)

22. What is the difference between normal and bulk loading? Which one is recommended? (2)

23. What is a test load? (2)
24. What is incremental aggregation and when should it be implemented? (2)
25. How can you use an Oracle sequence in Informatica? You also have an Informatica Sequence Generator transformation. Which one is better to use? (2)
26. What is the difference between a shortcut of an object and a copy of an object? Compare them. (2)
27. What are a mapplet and a reusable transformation? (2)
28. How do you implement configuration management in Informatica? (3)
29. What are Business Components in Informatica? (2)
30. A Dimension object created in Oracle can be imported in Designer (T/F) (1)
31. Cubes contain measures (T/F) (1)
32. COM components can be used in Informatica (T/F) (1)
33. Lookup is an active transformation (T/F) (1)
34. What is the advantage of a persistent cache? When should it be used? (1)
35. When will you use SQL override in a Lookup transformation? (2)
36. The two different admin users created for the repository are ______ and ______ (1)
37. The two default user groups created in the repository are ____ and ______ (1)
38. A mapping contains:
Source table S_Time (Start_Year, End_Year)
Target table Tim_Dim (Date, Day, Month, Year, Quarter)
Stored Procedure transformation: a procedure has two input parameters, I_Start_Year and I_End_Year, and output parameters O_Date, Day, Month, Year, Quarter.
If this session is run, how many rows will be available in the target and why? (5)

39. Two sources S1, S2 containing measures M1, M2, M3; 4 dimensions D1, D2, D3, D4; 1 fact F1 containing measures M1, M2, M3 and dimension surrogate keys K1, K2, K3, K4.

(a) Write a SQL statement to populate the fact table F1.
(b) Design a mapping in Informatica for loading the fact table F1. (5)

40. What is the difference between a connected lookup and an unconnected lookup? (2)
41. What is the difference between the lookup cache and the lookup index? (2)
42. When should one create a Lookup transformation? (2)
43. How do you handle performance issues in Informatica? Where can you monitor the performance? (3)
44. List and discuss two approaches for updating a target table in Informatica and how they differ. (3)
45. You have created a Lookup transformation for a condition which, if true, returns multiple rows. When you go to the target you see that only one row has come through and not all. Why is it so and how can it be corrected? (2)
46. Where are the log files generally stored? Can you change the path of the file? What can the path be? (2)
47. Where is the cache (lookup, index) created and how can you see it? (2)


Key for Informatica Questions
1. Informatica repository tables.
2. Power Center has all the functionality: distributed metadata (repository), a global repository, and it can register multiple Informatica servers. One can share metadata across repositories. It can connect to varied sources like PeopleSoft, SAP etc. It has bridges which can transport metadata from other tools (like Erwin). It costs around 200K US $.

Power Mart is a subset of Power Center: one repository, and it can register only one Informatica server. It cannot connect to varied sources like PeopleSoft, SAP etc. It costs around 50K US $.

3. Variable ports are used to store intermediate values:
1. To simplify a complex expression by breaking it up into variable ports
2. To temporarily store data
3. To store values from prior rows
4. To capture multiple return values from stored procedures
5. To compare values

4. Load Manager, Data Transformation Manager (DTM), Reader, Writer.

5. Specify the database connections for the source and target in the Server Manager. More importantly, configure the connect strings pointing to the source and target databases on the workstation where the server is installed.

6.
a) Joiner
b) Source Qualifier
c) Rank
d) Aggregator
e) Sequence Generator

7. In a sequential batch one session ends and then the next begins, whereas in a concurrent batch the sessions run simultaneously depending on CPU availability. If the start of one session is dependent on the completion of another, sequential should be adopted; otherwise concurrent can be used. Concurrent batches utilise CPU resources efficiently.

8. Designer is used for importing/creating your source and target definitions, and for creating mappings, mapplets and reusable transformations.

9. Repository Manager is used for:
Creating/adding/backing up/restoring/copying a repository
User administration and privileges
Creating folders
Folder privileges

10. Server Manager is used for configuring the server, creating source and target data sources, configuring and monitoring sessions and their properties, starting/stopping sessions, scheduling, and error handling.

11. Server is actually used for loading data from source to targets.

12. The Informatica server is currently not connected to Repository A; it is connected to Repository B. It has to be configured for Repository A to run session S_MAP1.

13. Error handling is very primitive. Log files can be generated which contain error details and codes; the error code can be checked against the troubleshooting guide and corrective action taken. The detail in the log file can be increased by setting an appropriate tracing level in the session properties. We can also specify that a session stops after 1, 2 or n errors.

14. The option for scheduling is present in the session properties (double-click the session in the Server Manager and choose the scheduling option).

15. The repository was created with an earlier version of Informatica and has to be upgraded to the new version.

16. By typing pmcmd …. from the command prompt.
17. An indicator file is used for event-based scheduling when you don't know when the source data will be available.

A shell command, script or batch file creates and sends this indicator file to a directory local to the Informatica Server. The server waits for the indicator file to appear before running the session.

18. They are the code (e.g. SQL statements) to be executed before and after running the session respectively,

i.e. they are executed as pre-session and post-session commands. Pre-session: disable the constraints (e.g. pk_fact) on the target fact table for bulk loading. Post-session: re-enable the constraints on the target fact table after loading is complete (see the sketch below).
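A minimal sketch of such pre- and post-session SQL in Oracle, keeping the pk_fact constraint name from the answer above; the fact table name is hypothetical:

-- Pre-session command: disable the primary key constraint before the bulk load.
ALTER TABLE fact_table DISABLE CONSTRAINT pk_fact;

-- Post-session command: re-enable the constraint once the load is complete.
ALTER TABLE fact_table ENABLE CONSTRAINT pk_fact;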

19. The server and client should be on separate machines. One can have multiple clients connected to the server, so everyone can work independently and share server resources. Option 2 is preferred, as CPU resources are utilised optimally.

20. The connect strings for the source and target databases are not configured on the workstation containing the server, though they may be on the client machine.

21. It may store the information in a cache or in the repository tables.


22. Normal loading: the database log is generated, so it takes more time. Bulk loading bypasses the database log and the constraints are disabled; it is faster than a normal load, but recovery is difficult in the case of bulk loading.

23. The test load option can be chosen to test whether the load would take place or not, without running the session against all the rows on the source side. It applies to normal loading only; the number of rows to be tested can be specified in the target options of the session properties.

24. If the source changes only incrementally and you can capture changes, you can configure the session to process only those changes. This allows the Informatica Server to update your target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session. Therefore, only use incremental aggregation if:

Your mapping includes an aggregate function.

The source changes only incrementally.

You can capture incremental changes. You might do this by filtering source data by timestamp.
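
A minimal sketch of capturing incremental changes by timestamp, assuming a hypothetical source table SRC_ORDERS with a LAST_UPDATE_TS audit column (all names and the literal are illustrative):

-- Only rows changed since the previous successful run are extracted.
SELECT order_id, customer_id, order_amt, last_update_ts
FROM   src_orders
WHERE  last_update_ts > TO_DATE('2004-01-31 23:59:59', 'YYYY-MM-DD HH24:MI:SS');
-- In practice the literal would be replaced by a mapping parameter or a value
-- read from a control table recording the last successful load time.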

Before implementing incremental aggregation, consider the following issues:

Whether it is appropriate for the session

What to do before enabling incremental aggregation

When to reinitialize the aggregate caches

25. An Oracle sequence can be used in a PL/SQL stored procedure, which in turn can be used with Informatica's Stored Procedure transformation. It depends upon the user's needs, but an Oracle sequence provides greater control.
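
A hedged sketch of that approach: an Oracle sequence wrapped in a stored function that a Stored Procedure transformation could call for each row (all object names are illustrative):

CREATE SEQUENCE dim_customer_seq START WITH 1 INCREMENT BY 1 CACHE 100;

CREATE OR REPLACE FUNCTION next_customer_key RETURN NUMBER IS
  v_key NUMBER;
BEGIN
  -- SELECT ... FROM dual is the classic way to read NEXTVAL inside PL/SQL.
  SELECT dim_customer_seq.NEXTVAL INTO v_key FROM dual;
  RETURN v_key;
END next_customer_key;
/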

26. Shortcut: the folder containing the object has to be a shared folder; the shortcut points to the original object in that shared folder.

Copy: actually copies the object, so it increases the repository size because the copy is saved, unlike a shortcut, which is just an instance of the object in the shared folder. With a shortcut, a change in the original object is reflected in all instances of the object, which is not the case with a copy.

27. A mapplet is a reusable component that can be plugged into different mappings. It uses transformations, input ports and output ports, and is developed in the Mapplet Designer. A reusable transformation is developed in the Transformation Developer and can be invoked from more than one mapping or mapplet.

28. There are several methods to do this. Some of them are: taking a backup of the repository as a binary file and treating it as a configurable item; implementing the folder versioning utility in Informatica.


29. Business components allow you to organize, group, and display sources and mapplets in a single location in your repository folder. For example, you can create groups of source tables that you call Purchase Orders and Payment Vouchers. You can then organize the appropriate source definitions into logical groups and add descriptive names for them.

Business components let you access data from all operational systems within your organization through source and mapplet groupings representing business entities. You can think of business components as tools that let you view your sources and mapplets in a meaningful way using hierarchies and directories. A business component is a reference to any of the following objects:

Source, Mapplet, Shortcut to a source, Shortcut to a mapplet

30. F
31. T
32. T
33. F. Can also be true when a SQL override is used.

34. When the lookup cache is saved in the Lookup transformation it is called a persistent cache. The first time the session runs it is saved to disk and then used in subsequent runs of the session. It is used when the lookup table is static, i.e. does not change frequently.

35. Use a SQL override when you have more than one lookup table, or to use a WHERE condition to reduce the records held in the cache.

36. Administrator and the database user of the repository schema.
37. Administrator and Public.
38. Only one row, with the last date of End_Year; each subsequent row overwrites the previous one.
39. INSERT INTO F1 SELECT D1.k1, D2.k2, D3.k3, D4.k4, M1, M2, M3 FROM S1, S2, D1, D2, D3, D4 WHERE (join condition between S1 and S2) AND (join conditions between the production keys in the dimensions and the corresponding columns in sources S1 and S2).

The k's are the surrogate keys of the dimensions.
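
A cleaned-up sketch of the statement in question 39, assuming the dimensions carry surrogate keys K1..K4 and production keys PK1..PK4, and that the join columns shown exist in S1 and S2 (all column names are assumptions):

INSERT INTO f1 (k1, k2, k3, k4, m1, m2, m3)
SELECT d1.k1, d2.k2, d3.k3, d4.k4,      -- surrogate keys looked up from the dimensions
       s1.m1, s1.m2, s2.m3              -- measures taken from the sources
FROM   s1, s2, d1, d2, d3, d4
WHERE  s1.order_id = s2.order_id        -- join between the two sources
AND    d1.pk1 = s1.cust_code            -- production keys in the dimensions matched
AND    d2.pk2 = s1.prod_code            --   to the corresponding source columns
AND    d3.pk3 = s2.region_code
AND    d4.pk4 = s2.date_code;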

[Mapping layout: a Source Qualifier on sources S1 and S2 feeds Lookups for D1, D2, D3 and D4, which load the fact table.]


40. A connected Lookup is part of the mapping data flow; it can give multiple output values for a condition and supports default values. An unconnected Lookup is not part of the mapping data flow; it is called from other transformations (e.g. an Expression transformation), has a return port that returns one value (generally a flag), and does not support default values.

41. The lookup cache consists of an index cache and a data cache. Index cache: contains the columns used in the lookup condition. Data cache: contains the output columns other than the condition columns.

42. A Lookup is created to: get a value from another table based on a certain condition; handle slowly changing dimensions; verify whether a record exists in the target table or not.

43. There are several aspects to performance handling. Some of them are: source tuning; target tuning; repository tuning; session performance tuning; incremental change identification on the source side; software, hardware (use multiple servers) and network tuning; bulk loading; and use of the appropriate transformations.

To monitor this: set the performance detail criteria, enable performance monitoring, and monitor the session at runtime and/or check the performance monitor file.

44. Update Strategy transformation: we can write our own logic, so it is flexible. Normal insert/update/delete (with the appropriate variation of the update option): it is configured in the session properties; any change in the row causes an update, so it is inflexible.

45. A Lookup transformation returns either the first match, the last match, or an error in the case of multiple matches. Use a Joiner transformation for this.

46. By default it is the log file directory ($PMSessionLogDir) as specified in Server Manager. It can be overridden, but it has to be a directory local to the server.

47. The cache is created on the server and some default memory is allocated for it. Once that memory is exceeded, cache files can be seen in the cache directory on the server.

Tuning


Data warehousing operations process high volumes of data and they have a high correlation with the goals of parallel operations.

OLTP applications have a high transaction volume and they correlate more with serial operations.

The following operations can be performed in parallel by Oracle: Parallel Query Parallel DML Parallel DDL Parallel recovery Parallel loading Parallel propagation (for replication)

Parallelization of operations is recommended in the following situations: long elapsed time; a high number of rows processed. To minimize parsing, bind variables should be used in SQL statements within OLTP applications. In this way all users can share the same SQL statements while using fewer resources for parsing.

Excessive use of triggers for frequent events such as logons, logoffs, and error events can degrade performance since these events affect all users.

The V$SORT_USAGE view can be queried to see the session and SQL statement associated with a temporary segment.
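
A possible query for this (column names as in Oracle 8i; they may differ slightly between releases):

-- Link temporary-segment usage back to the owning session.
SELECT s.sid, s.serial#, s.username,
       u.tablespace, u.segtype, u.blocks
FROM   v$session s, v$sort_usage u
WHERE  s.saddr = u.session_addr;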

Tools such as TKPROF, the SQL Trace Facility, and Oracle Trace can be used to find the problem statements and stored procedures.

Five ways to improve SQL statement efficiency: restructure the indexes; restructure the statement; modify or disable triggers; restructure the data; keep statistics current and use plan stability to preserve execution plans.

One can use the results of the EXPLAIN PLAN statement to compare the execution plans and costs of the two statements and determine which is more efficient.

If the application has statements that use the NOT IN operator, one should consider rewriting them so that they use the NOT EXISTS operator. This would allow such statements to use an index, if one exists.
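
An illustrative rewrite (table and column names are assumed): customers that have placed no orders.

-- NOT IN form: the subquery result prevents effective use of an index on ORDERS.CUSTOMER_ID.
SELECT c.customer_id
FROM   customers c
WHERE  c.customer_id NOT IN (SELECT o.customer_id FROM orders o);

-- NOT EXISTS rewrite: the correlated subquery can probe an index on ORDERS.CUSTOMER_ID.
SELECT c.customer_id
FROM   customers c
WHERE  NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);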

Join order can have a significant effect on performance. The join order must be chosen so as to join fewer rows to tables later in the join order.

Filter conditions dominate the choice of driving table and index. In general, the driving table should be the one containing the filter condition that eliminates the highest percentage of the table.


Optimization (determining the execution plan) takes place before the database knows what values will be substituted into the query. An execution plan should not, therefore, depend on what those values are.

When the condition comes back false for one part of the UNION ALL query, that part is not evaluated further.

WHERE (NOT) EXISTS is a useful alternative to IN / NOT IN in subqueries.

The use of DISTINCT must be minimized. DISTINCT always creates a SORT; all the data must be instantiated before results can be returned.

Tips for restructuring the data: introduce derived values; avoid GROUP BY in response-critical code; implement missing entities and intersection tables; reduce network load; migrate, replicate and partition data. One can find indexes that are not referenced in execution plans by processing all of the application's SQL through EXPLAIN PLAN and capturing the resulting plans.

UPDATE statements that modify indexed columns and INSERT and DELETE statements that modify indexed tables take longer than if there were no index. Such SQL statements must modify data in indexes as well as data in tables. They also generate additional undo and redo information.

If all columns selected by a query are in a composite index, Oracle can return these values from the index without accessing the table.

The optimizer can choose such an access path for a SQL statement only if it contains a construct that makes the access path available.

If new indexes are created to tune a statement that is currently parsed, Oracle invalidates the statement. When the statement is next executed, the optimizer automatically chooses a new execution plan that could potentially use the new index.

One can use the FULL hint to force the optimizer to choose a full table scan instead of an index scan.
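
For example, assuming a hypothetical EMP table with an index on DEPTNO:

-- Force a full table scan even though the index on DEPTNO is available.
SELECT /*+ FULL(e) */ e.empno, e.ename
FROM   emp e
WHERE  e.deptno = 10;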

A preferable method to decide whether an index is good or bad is to compare the optimizer cost (in the first row of EXPLAIN PLAN output) of the plans with and without the index.

Parallel execution uses indexes effectively.

The fast full index scan is an alternative to a full table scan when there is an index that contains all the keys that are needed for the query.

A fast full scan is faster than a normal full index scan in that it can use multiblock I/O and can be parallelized just like a table scan. Unlike regular index scans, however, one cannot use keys and the rows will not necessarily come back in sorted order.
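
A hedged example, assuming a composite index EMP_DEPT_SAL_IDX on (DEPTNO, SAL) that contains every column the query needs:

-- Ask the optimizer for a fast full scan of the index instead of a table scan.
SELECT /*+ INDEX_FFS(e emp_dept_sal_idx) */ deptno, SUM(sal)
FROM   emp e
GROUP BY deptno;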

The parallel degree of the index is set independently: the index does not inherit the degree of parallelism of the table.


Use the ALTER INDEX ... REBUILD statement to reorganize or compact an existing index or to change its storage characteristics.

ALTER INDEX ... REBUILD is usually faster than dropping and re-creating an index, because this statement uses the fast full scan feature.

One can coalesce leaf blocks of an index using the ALTER INDEX statement with the COALESCE option. This allows combining leaf levels of an index to free blocks for re-use. One can also rebuild the index online.
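
For example (the index name is assumed):

ALTER INDEX emp_dept_idx REBUILD;          -- reorganize/compact, optionally changing storage
ALTER INDEX emp_dept_idx REBUILD ONLINE;   -- rebuild without blocking DML, where supported
ALTER INDEX emp_dept_idx COALESCE;         -- merge adjacent leaf blocks to free space for re-use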

A function-based index is an index on an expression. Oracle strongly recommends using function-based indexes whenever possible.

One can create function-based indexes for any repeatable SQL function. Oracle recommends using function-based indexes for range scans and for functions in ORDER BY clauses.

Function-based indexes permit Oracle to bypass computing the value of the expression when processing SELECT and DELETE statements. When processing INSERT and UPDATE statements, however, Oracle evaluates the function to process the statement.

Oracle treats indexes with columns marked DESC as function-based indexes. The columns marked DESC are sorted in descending order.

One must set the session parameter QUERY_REWRITE_ENABLED to TRUE to enable function-based indexes for queries.
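
A minimal sketch, assuming an EMP table queried case-insensitively on ENAME (the privilege and COMPATIBLE requirements of the release also apply):

ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;         -- allow queries to use the index
CREATE INDEX emp_upper_ename_idx ON emp (UPPER(ename)); -- index on an expression

-- A query such as this can now use the function-based index instead of a full scan:
SELECT empno, ename FROM emp WHERE UPPER(ename) = 'SMITH';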

Bitmap indexes are highly advantageous for complex ad hoc queries that contain lengthy WHERE clauses.

Bitmap indexes are created not only for efficient space usage, but also for efficient execution.

If a bitmap index is created on a unique key column, it requires more space than a regular B*-tree index.

For columns where each value is repeated hundreds or thousands of times, a bitmap index typically is less than 25% of the size of a regular B*-tree index. The bitmaps themselves are stored in compressed format.

Because of their different performance characteristics, one should keep B*-tree indexes on high-cardinality data, while creating bitmap indexes on low-cardinality data.

Bitmap indexes benefit data warehousing applications but they are not appropriate for OLTP applications with a heavy load of concurrent INSERTs, UPDATEs, and DELETEs.

A B*-tree index entry contains a single rowid. Therefore, when the index entry is locked, a single row is locked. With bitmap indexes, an entry can potentially contain a range of rowids. When a bitmap index entry is locked, the entire range of rowids is locked. The number of rowids in this range affects concurrency.


For bulk inserts and updates where many rows are inserted or many updates are made in a single statement, performance with bitmap indexes can be better than with regular B*-tree indexes.

The COMPATIBLE initialization parameter must be set to 7.3.2 or higher to use bitmap indexes.

The index views USER_INDEXES, ALL_INDEXES, and DBA_INDEXES indicate bitmap indexes by the word BITMAP appearing in the TYPE column. A bitmap index cannot be declared as UNIQUE.

In a bitmap index, a bitmap for each key value is used instead of a list of rowids.

Each bit in the bitmap corresponds to a possible rowid. If the bit is set, then it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so the bitmap index provides the same functionality as a regular index even though it uses a different representation internally.

Bitmap indexing efficiently merges indexes that correspond to several conditions in a WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically.

Bitmap indexes are not suitable for OLTP applications with large numbers of concurrent transactions modifying the data.

The advantages of using bitmap indexes are greatest for low cardinality columns: that is, columns in which the number of distinct values is small compared to the number of rows in the table.
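
For example, on a hypothetical SALES fact table whose REGION and CHANNEL columns have only a handful of distinct values:

CREATE BITMAP INDEX sales_region_bix  ON sales (region);
CREATE BITMAP INDEX sales_channel_bix ON sales (channel);

-- Bitmaps on several low-cardinality columns can be combined for ad hoc WHERE clauses.
SELECT COUNT(*) FROM sales WHERE region = 'WEST' AND channel = 'DIRECT';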

B-tree indexes are most effective for high-cardinality data: that is, data with many possible values, such as CUSTOMER_NAME or PHONE_NUMBER.

Bitmap indexes index nulls, whereas all other index types do not.

Indexing of nulls can be useful for some types of SQL statements, such as queries with the aggregate function COUNT.

The CREATE_BITMAP_AREA_SIZE initialization parameter determines the amount of memory allocated for bitmap creation. The default value is 8 MB. A larger value may lead to faster index creation.

As a general rule, the higher the cardinality, the more memory is needed for optimal performance. One cannot dynamically alter this parameter at the system or session level.

To use bitmap access paths for B*-tree indexes, the rowids stored in the indexes must be converted to bitmaps. After such a conversion, the various Boolean operations available for bitmaps can be used.

Bitmap index restrictions: for bitmap indexes with direct load, the SORTED_INDEX flag does not apply; bitmap indexes are not considered by the rule-based optimizer; bitmap indexes cannot be used for referential integrity.

Features available only with cost-based optimization: partitioned tables, index-organized tables, reverse key indexes, parallel execution, star transformations and star joins.

One must gather statistics for tables to obtain accurate execution plans.

Cost-based optimization is used for efficient star query performance. Similarly, it can be used with hash joins and histograms. Cost-based optimization is always used with parallel execution and with partitioned tables.

To maintain the effectiveness of the cost-based optimizer, one must keep statistics current.

Cost-based optimization can be enabled by one of the following methods: set the OPTIMIZER_MODE parameter to its default value of CHOOSE; issue an ALTER SESSION SET OPTIMIZER_MODE statement with the ALL_ROWS or FIRST_ROWS option, to enable cost-based optimization for a session only; or enable cost-based optimization for an individual SQL statement by using any hint other than RULE.

The plans generated by the cost-based optimizer depend on the sizes of the tables. The execution plan produced by the optimizer can vary depending upon the optimizer’s goal.

To change the goal of the cost-based approach for all SQL statements in a session, issue an ALTER SESSION SET OPTIMIZER_MODE statement with the ALL_ROWS or FIRST_ROWS option.

To specify the goal of the cost-based approach for an individual SQL statement, use the ALL_ROWS or FIRST_ROWS hint.
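
For example (the ORDERS table is illustrative):

ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS;   -- goal for the whole session

SELECT /*+ ALL_ROWS */ order_id, status          -- goal for one statement only
FROM   orders
WHERE  status = 'OPEN';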

For uniformly distributed data, the cost-based approach fairly accurately determines the cost of executing a particular statement.

For nonuniformly distributed data, Oracle allows to create histograms that describe data distribution patterns of a particular column. Oracle stores these histograms in the data dictionary for use by the cost-based optimizer.

Create histograms on columns that are frequently used in WHERE clauses of queries and that have highly skewed data distributions. To do this, use the GATHER_TABLE_STATS procedure of the DBMS_STATS package.
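
A hedged example, creating a histogram with up to 75 buckets on a skewed SAL column of SCOTT.EMP:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'SCOTT',
    tabname    => 'EMP',
    method_opt => 'FOR COLUMNS sal SIZE 75');  -- histogram on the SAL column
END;
/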

Column statistics appear in the data dictionary views USER_TAB_COLUMNS, ALL_TAB_COLUMNS and DBA_TAB_COLUMNS.

Because the cost-based approach relies on statistics, generate statistics for all tables, clusters, and all types of indexes accessed by SQL statements before using the cost-based approach.


If the size and data distribution of these tables change frequently, generate these statistics regularly to ensure the statistics accurately represent the data in the tables.

Oracle generates statistics using one of the following techniques: estimation based on random data sampling; exact computation; user-defined statistics collection methods.

Estimation never scans the entire table, whereas the exact computation performs full table scan to generate statistics.

Some statistics are always computed, regardless of whether computation or estimation is specified. If estimation is specified and the time saved by estimating statistics is negligible, Oracle computes the statistics.

The COMPUTE STATISTICS clause is used to gather index statistics when creating or rebuilding an index (CREATE INDEX ... COMPUTE STATISTICS or ALTER INDEX ... REBUILD COMPUTE STATISTICS). If the COMPUTE STATISTICS clause is not used, or if major DML changes are made afterwards, the GATHER_INDEX_STATS procedure of the DBMS_STATS package is used to collect index statistics.
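
Illustrative forms (index and owner names assumed):

CREATE INDEX emp_dept_idx ON emp (deptno) COMPUTE STATISTICS;  -- gather while creating

ANALYZE INDEX emp_dept_idx COMPUTE STATISTICS;                 -- older ANALYZE form

BEGIN
  DBMS_STATS.GATHER_INDEX_STATS(ownname => 'SCOTT', indname => 'EMP_DEPT_IDX');
END;
/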

Analyzing a table uses more system resources than analyzing an index. It may be helpful to analyze the indexes for a table separately, or collect statistics during index creation with a higher sampling rate.

Plan Stability prevents certain database environment changes from affecting the performance characteristics of applications.

Plan Stability preserves execution plans in "stored outlines". Oracle can create a stored outline for one or all SQL statements. The optimizer then generates equivalent execution plans from the outlines when the use of stored outlines is enabled.

Using stored outlines also stabilizes the generated execution plan if the optimizer changes in subsequent Oracle releases.

Plan Stability relies on preserving execution plans at a point in time when performance is satisfactory.

An outline consists primarily of a set of hints that is equivalent to the optimizer’s results for the execution plan generation of a particular SQL statement.

When Oracle creates an outline, Plan Stability examines the optimization results using the same data used to generate the execution plan. That is, Oracle uses the input to the execution plan to generate an outline and not the execution plan itself.

One cannot modify an outline. One can embed hints in SQL statements, but this has no effect on how Oracle uses outlines because Oracle considers a SQL statement that is revised with hints to be different from the original SQL statement stored in the outline.

If the outline usage is disabled by setting the system/session parameter USE_STORED_OUTLINES to FALSE, Oracle does not attempt to match SQL text to outlines.


Oracle stores outline data in the OL$ table and hint data in the OL$HINTS table.

Oracle can automatically create outlines for all SQL statements, or one can create them for specific SQL statements. In either case, the outlines derive their input from the rule-based or cost-based optimizers.

Oracle creates stored outlines automatically when the parameter CREATE_STORED_OUTLINES is set to TRUE.

When USE_STORED_OUTLINES is set to FALSE and CREATE_STORED_OUTLINES is set to TRUE, Oracle creates outlines but does not use them.

When the use of stored outlines is activated, Oracle always uses the cost-based optimizer. This is because outlines rely on hints, and to be effective, most hints require the cost-based optimizer.

The information about stored outlines and the related hint data can be found in the views USER_OUTLINES and USER_OUTLINE_HINTS.

The DDL statements CREATE, DROP, and ALTER are used to manipulate a specific outline.

The procedures in the OUTLN_PKG package are used to manage stored outlines and their outline categories.
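
A hedged sketch of working with stored outlines (the outline name, category and statement are illustrative):

ALTER SESSION SET CREATE_STORED_OUTLINES = TRUE;   -- capture outlines for subsequent statements

CREATE OR REPLACE OUTLINE ord_by_cust FOR CATEGORY app_outlines
  ON SELECT order_id, status FROM orders WHERE customer_id = 101;  -- outline for one statement

SELECT name, category, used FROM user_outlines;    -- review what has been stored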

Hints are used to specify: the optimization approach for SQL statements; the goal of the cost-based approach for SQL statements; the access path for a table accessed by the statement; the join order for a join statement; a join operation for a join statement.

Oracle ignores hints in all SQL statements in those environments that use PL/SQL Version 1, such as SQL*Forms Version 3 triggers, Oracle Forms 4.5, and Oracle Reports 2.5.

The DBMS_SHARED_POOL package allows to keep objects in shared memory, so they do not age out with the normal LRU mechanism.

EXPLAIN PLAN is a SQL statement listing the access path used by the query optimizer. Each plan output from the EXPLAIN PLAN command has a row that provides the statement type.

Oracle Trace collects significant Oracle server event data such as all SQL events and Wait events.

Identifying resource-intensive SQL statements is easy with Oracle Trace.

SQL trace files record SQL statements issued by a connected process and the resources used by these statements.


The SQL trace facility can be enabled for any session. It records in an operating system text file the resource consumption of every parse, execute, fetch, commit, or rollback request made to the server by the session.

TKPROF summarizes the trace files produced by the SQL trace facility, optionally including the EXPLAIN PLAN output.

The Oracle Diagnostics Pack comprises: Oracle Capacity Planner, Oracle Performance Manager, Oracle Advanced Event Tests and Oracle Trace.

Oracle Tuning Pack optimizes system performance by identifying and tuning major database and application bottlenecks such as inefficient SQL, poor data structures, and improper use of system resources.

The Oracle Tuning Pack contains: Oracle Expert, Oracle SQL Analyze, Oracle Tablespace Manager, Oracle Index Tuning Wizard and Oracle Auto-Analyze.

Oracle Expert provides automated database performance tuning. Performance problems detected by Oracle Diagnostics Pack and other Oracle monitoring applications can be analyzed and solved with Oracle Expert.

Oracle Expert automates the process of collecting and analyzing data and contains a rules-based inference engine that provides "expert" database tuning recommendations, implementation scripts, and reports.

The following scripts are used to display the most recent plan table output: UTLXPLS.SQL - To show plan table output for serial processing. UTLXPLP.SQL - To show plan table output with parallel execution columns.

The SQL trace facility and TKPROF enable one to accurately assess the efficiency of the SQL statements an application runs.

SQL trace generates the following statistics for each statement: parse, execute and fetch counts; CPU and elapsed times; physical and logical reads; number of rows processed; misses on the library cache; the username under which each parse occurred; each commit and rollback.

One can enable the SQL trace facility for a session or for an instance.


One can run the TKPROF program to format the contents of the trace file and place the output into a readable output file.

TKPROF can optionally perform the following tasks: determine the execution plans of SQL statements; create a SQL script to store the statistics in the database.

TKPROF reports each statement executed with the resources it has consumed, the number of times it was called, and the number of rows which it processed.

When the SQL trace facility is enabled for a session, Oracle generates a trace file containing statistics for the traced SQL statements of that session. When the SQL trace facility is enabled for an instance, Oracle creates a separate trace file for each process.

Prerequisites for using the SQL trace facility: the TIMED_STATISTICS, USER_DUMP_DEST and MAX_DUMP_FILE_SIZE parameters must be set.

The TIMED_STATISTICS parameter enables and disables the collection of timed statistics, such as CPU and elapsed times, by the SQL trace facility, as well as the collection of various statistics in the dynamic performance tables. The default value of FALSE disables timing. A value of TRUE enables timing. Enabling timing causes extra timing calls for low-level operations. This is a session parameter.

When the SQL trace facility is enabled at the instance level, every call to the server produces a text line in a file in operating system’s file format. The maximum size of these files (in operating system blocks) is limited by the initialization parameter MAX_DUMP_FILE_SIZE. The default is 500. This is a session parameter.

To enable the SQL trace facility for the current session, enter: ALTER SESSION SET SQL_TRACE = TRUE;

One can also enable the SQL trace facility for a session by using the DBMS_SESSION.SET_SQL_TRACE procedure.

To enable the SQL trace facility for an instance, set the value of the SQL_TRACE initialization parameter to TRUE. Statistics will be collected for all sessions.
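
Putting the pieces together for the current session (a sketch; instance-level tracing is set through the initialization parameters mentioned above):

ALTER SESSION SET TIMED_STATISTICS = TRUE;    -- collect CPU and elapsed times
ALTER SESSION SET SQL_TRACE = TRUE;           -- start writing the trace file

-- Equivalent PL/SQL interface, and the call to stop tracing when finished:
EXECUTE DBMS_SESSION.SET_SQL_TRACE(TRUE);
ALTER SESSION SET SQL_TRACE = FALSE;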

Oracle Trace is a general-purpose data collection product and is part of the Oracle Enterprise Manager systems management product family.

Major performance factors in well-designed systems: Network Issues CPU Issues Memory Issues Software Issues I/O Issues

Why Informatica_____________________________________________________________________________________________________

No Hand Coding


There is NO programming required and NO code generated in the background. Everything is done through a simple, Windows-standard GUI, with a drop 'n drag, point 'n click approach. This saves considerable development time and even more maintenance time. The biggest benefit to the client is that there is no code to maintain. This equates to substantial savings, based on the fact that most IT organizations allocate 70 percent of their annual budget to maintenance of legacy code.

Rapid Deployment of Data Integration Projects

PowerMart minimizes or eliminates the risks associated with failed data warehouse projects and the expense of developing and maintaining old code. PowerMart helps companies speed up the design and deployment of data warehouse applications. Its unique engine-based architecture maximizes design efficiencies, and ensures highest-quality decision support output by promoting close cooperation between data warehouse designers and users. Customers select Informatica because they can successfully implement and manage a data warehouse much more easily than with a manual coding approach or with code generation products. In fact, according to the Meta Group, over 85% of data warehouses implemented in the past using code generation technology have either failed or been significantly delayed as a result of the technology being too complex to implement and manage. Industry experience shows that maintaining code costs four times as much as developing it.

Metadata Metadata (Data about Data) is a critical asset to all IT organizations. PowerMart is metadata centric. It is at the center of our architecture. Metadata is captured automatically as users interact with the client toolset. The metadata repository is stored in an OPEN, relational database (KEY DIFFERENTIATOR). There is nothing proprietary about it. Because of that, it is extensible, shareable by other tools (front end, design tools, etc.) Also, because of the metadata centric architecture, PowerMart is self-documenting, capturing AUTOMATICALLY all technical and business metadata at every level in the process.

Strong Management Framework PowerMart provides a lights-out operation of the ETL process by email/pager notification. The scheduler is VERY comprehensive, allowing for very customized scheduling. Error and exception handling is done by the tool and logging in files, as well as automatically placed in the metadata repository. There is full recovery upon failure, as an option. Pre and Post session activities are easily kicked off through access to the operating system. The load process can be kicked off from the command line, allowing integration with a third party scheduler. Real time monitoring can be done through a monitor screen. Dependency analysis is easily accomplished and can also be viewed in one of the canned reports provided through the run-time version of Crystal Reports.

Centralized Management of Distributed/Networked Data Marts PowerCenter (an upgrade from PowerMart) allows centralized management of distributed, networked data marts. The Global Metadata Repository (GDS) stores all public metadata, along with objects such as transformations, data models, and source definitions, so that they are shareable and re-usable by any data mart on the network. There is also a Parallel Engines option for increased scalability, allowing processing to be spread across multiple engines and run in parallel.

Proven Track Record Informatica is now being used by over 600 companies world-wide.


Analyst Approval: PowerMart is the recognized leader of second-generation technology for building and managing data warehouses, operational data stores and data marts. Analysts throughout the technical industry continue to honor Informatica’s technology and practices. Below is a list of some of the most recent awards:

The Data Warehousing Institute’s “Best Practices” award for the best Data Extraction, Cleansing and Transformation, for the second year in a row.

Ovum Evaluates: Data Warehousing Tools and Strategies. The PowerMart suite received the highest ranking (the second time this year) among nine competing data warehouse technology vendors.

The Red Herring magazine, in its annual “Herring 100” issue, named Informatica one of 1998’s top 50 private companies in digital technology.

1999 Intelligent Enterprise Dozen Award: designated a “Company to Watch” in the December 15th issue of Intelligent Enterprise.

DOC : 5 (ORACLE)

1. What are the components of Physical database structure of Oracle Database?.

An ORACLE database consists of three types of files: one or more data files, two or more redo log files, and one or more control files.

2. What are the components of Logical database structure of ORACLE database?

Tablespaces and the Database's Schema Objects.

3. What is a Tablespace? A database is divided into logical storage units called tablespaces. A tablespace is used to group related logical structures together.

4. What is SYSTEM tablespace and When is it Created? Every ORACLE database contains a tablespace named SYSTEM, which is automatically created when the database is created. The SYSTEM tablespace always contains the data dictionary tables for the entire database.

5. Explain the relationship among Database, Tablespace and Data file. Each database is logically divided into one or more tablespaces. One or more data files are explicitly created for each tablespace.

6. What is a schema? A schema is a collection of database objects belonging to a user.

7. What are Schema Objects ? Schema objects are the logical structures that directly refer to the database's data. Schema objects include tables, views, sequences, synonyms, indexes, clusters, database triggers, procedures, functions, packages and database links.

8. Can objects of the same Schema reside in different tablespaces ?


Yes.

9. Can a Tablespace hold objects from different Schemes ? Yes.

10. What is a Table ? A table is the basic unit of data storage in an ORACLE database. The tables of a database hold all of the user-accessible data. Table data is stored in rows and columns.

11. What is a View ? A view is a virtual table. Every view has a Query attached to it. (The Query is a SELECT statement that identifies the columns and rows of the table(s) the view uses.)

12. Do View contain Data ? Views do not contain or store data.

13. Can a View based on another View ? Yes.

14. What are the advantages of Views ? Provide an additional level of table security, by restricting access to a predetermined set of rows and columns of a table. Hide data complexity. Simplify commands for the user. Present the data from a different perspective than that of the base table. Store complex queries.

15. What is a Sequence ? A sequence generates a serial list of unique numbers for numerical columns of a database's tables.

16. What is a Synonym ? A synonym is an alias for a table, view, sequence or program unit.

17. What are the type of Synonyms? There are two types of Synonyms Private and Public.

18. What is a Private Synonym ? A private synonym can be accessed only by its owner.

19. What is a Public Synonym ? A public synonym can be accessed by any user of the database.

20. What are synonyms used for ? Synonyms are used to : Mask the real name and owner of an object. Provide public access to an object Provide location transparency for tables,views or program units of a remote database. Simplify the SQL statements for database users.

21. What is an Index ? An index is an optional structure associated with a table that gives direct access to rows and can be created to increase the performance of data retrieval. An index can be created on one or more columns of a table.


22. How are Indexes Updated ? Indexes are automatically maintained and used by ORACLE. Changes to table data are automatically incorporated into all relevant indexes.

23. What are Clusters ? Clusters are groups of one or more tables physically stored together because they share common columns and are often used together.

24. What is a Cluster Key ? The related columns of the tables in a cluster are called the cluster key.

25. What is Index Cluster ? A Cluster with an index on the Cluster Key.

26. What is a Hash Cluster ? A row is stored in a hash cluster based on the result of applying a hash function to the row's cluster key value. All rows with the same hash key value are stored together on disk.

27. When can a Hash Cluster be used ? Hash clusters are a better choice when a table is often queried with equality queries. For such queries the specified cluster key value is hashed, and the resulting hash key value points directly to the area on disk that stores the specified rows.

28. What is Database Link ? A database link is a named object that describes a "path" from one database to another.

29. What are the types of Database Links ? Private Database Link, Public Database Link & Network Database Link.

30. What is Private Database Link ? Private database link is created on behalf of a specific user. A private database link can be used only when the owner of the link specifies a global object name in a SQL statement or in the definition of the owner's views or procedures.

31. What is Public Database Link ?

Public database link is created for the special user group PUBLIC. A public database link can be used when any user in the associated database specifies a global object name in a SQL statement or object definition.

32. What is a Network Database Link ? A network database link is created and managed by a network domain service. A network database link can be used when any user of any database in the network specifies a global object name in a SQL statement or object definition.

33. What is Data Block ?


ORACLE database's data is stored in data blocks. One data block corresponds to a specific number of bytes of physical database space on disk.

34. How is the Data Block size defined ? A data block size is specified for each ORACLE database when the database is created. A database uses and allocates free database space in ORACLE data blocks. The block size is specified in the INIT.ORA file and cannot be changed later.

35. What is Row Chaining ? In some circumstances, all of the data for a row in a table may not fit in the same data block. When this occurs, the data for the row is stored in a chain of data blocks (one or more) reserved for that segment.

36. What is an Extent ? An Extent is a specific number of contiguous data blocks, obtained in a single allocation, used to store a specific type of information.

37. What is a Segment ? A segment is a set of extents allocated for a certain logical structure.

38. What are the different type of Segments ? Data Segment, Index Segment, Rollback Segment and Temporary Segment.

39. What is a Data Segment ? Each Non-clustered table has a data segment. All of the table's data is stored in the extents of its data segment. Each cluster has a data segment. The data of every table in the cluster is stored in the cluster's data segment.

40. What is an Index Segment ? Each Index has an Index segment that stores all of its data.

41. What is Rollback Segment ? A Database contains one or more Rollback Segments to temporarily store "undo" information.

42. What are the uses of Rollback Segment ?

DOC : 6

Informatica questions

1) How do you handle large datasets?
Ans: By using bulk utility mode at the session level and, if possible, by disabling constraints after consulting with the DBA. Using bulk utility mode means that no writing takes place in the rollback segment, so loading is faster; the pitfall is that recovery is not possible.
2) When is it more convenient to join in the database or in Informatica?
Ans: Definitely at the database level, in the Source Qualifier query itself, rather than using a Joiner transformation.
----------------------------------------------------------------------------------


3) How does recovery mode work in Informatica?
Ans: In case of load failure an entry is made in the OPB_SERV_ENTRY(?) table, from which the extent of loading can be determined.
----------------------------------------------------------------------------------

4) What parameters can be tweaked to get better performance from a session?
Ans: DTM shared memory, index cache memory, data cache memory, indexing, using a persistent cache, increasing the commit interval, etc.
----------------------------------------------------------------------------------

5) How do you measure session performance?
Ans: By checking the “Collect Performance Data” check box.
----------------------------------------------------------------------------------

6) Is it possible to invoke an Informatica batch or session outside the Informatica UI?
Ans: Yes, using PMCMD.
----------------------------------------------------------------------------------

7) Limitations of handling long datatypes?
Ans: When the length of a datatype (e.g. varchar2(4000)) goes beyond 4000, Informatica treats it as varchar2(2000).
----------------------------------------------------------------------------------

DOC : 7

What is a data-warehouse? A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management decision-making processes. (OR) A data warehouse is a collection of data gathered from one or more data repositories to create a new, central database. Characteristics: storage of large volumes of data; historical data; load and save, no updates; reporting system; query and analysis; trends and forecasting.

What are Data Marts? Data mart is restricted to a single business process or single business group. Union of data marts equal data warehouse

What is an ER Diagram? An Entity-Relationship diagram. Relationships could be (a) one to one, (b) one to many (crow's foot style), (c) many to many.

What is a Star Schema?

Star schema: a modeling paradigm that has a single object in the middle (the fact table) connected radially to a number of objects (dimension tables) around it; the dimensions are de-normalized.

What is Dimensional Modeling?

What is a Snowflake Schema?


Snowflake structure: Snowflake is a star schema with normalized dimensions.

Data cleaning - filling in missing values, smoothing noisy data, identifying & removing outliers, correcting inconsistencies, etc.;

What are the Different methods of loading Dimension tables?

What are Aggregate tables?

After fact tables are built, any necessary aggregate fact tables must be built. Aggregate tables are structured to define "totals" of data stored in granular fact tables. This pre-summarization allows for faster extracts from the warehouse, avoiding costly repeats of "sum/group by" SQL requests. Aggregate tables may be built at the staging level or may be built as a post-load process in the warehouse DBMS itself.

What is the Difference between OLTP and OLAP?

OLTP OLAP

Functional: day to day operations || Decision support

Db design: application oriented || subject oriented

Data : Current up to date || Historical data

Detailed, flat relational || Summarized, Isolated

Unit of work: Short, simple, transaction || Complex query

What is ETL?

Processes of Extracting, Transforming (or Transporting) and Loading (ETL) data from source systems into the data warehouse (or)

Extract, Transform and Load – a set of database utilities used to extract information from one database, transform it and load it into a second database, typically a data warehouse. These tools are particularly useful for aggregating data from different database suppliers, e.g., Oracle to Sybase.

What are the various ETL tools in the Market?

Data Stage, Data Junction, Ab initio

What are the various Reporting tools in the Market?

Seagate Crystal Reports, Ms Access, Business Objects

What is Fact table?

A table in a star schema that contains facts. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. (OR)


The tables which are extracted from heterogeneous sources and used in the Data Warehouse.

What is a dimension table?

Dimension tables describe the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called Lookup or Reference tables.

What is a lookup table?

What is a general purpose scheduling tool? Name some of them?

What are modeling tools available in the Market? Name some of them?

Visio-based database modeling component - Visual Studio .NET

What is real time data-warehousing?

An enterprise-wide implementation that replicates data from the same publication table on different servers/platforms to a single subscription table. This implementation effectively consolidates data from multiple sources.

What is data mining?

Data mining is about the discovery of knowledge, rules, or patterns in large quantities of data.

The process of finding hidden patterns and relationships in data. For instance, a consumer goods company may track 200 variables about each consumer. There are scores of possible relationships among the 200 variables. Data mining tools will identify the significant relationships

What is Normalization? First Normal Form, Second Normal Form , Third Normal Form?

Normalization is a step-by-step process of removing redundancies and dependencies of attributes in a data structure. The condition of the data at completion of each step is described as a "normal form."

What is ODS?

The form that data warehouse takes in the operational environment. Operational data stores can be updated, do provide rapid and consistent time, and contain only a limited amount of historical data.

What type of Indexing mechanism do we need to use for a typical datawarehouse?

Bitmap, Btree

Which columns go to the fact table and which columns go the dimension table?

(Consider how the user states the requirement: “I need to see <measure> broken down by <element>.” Elements before “broken down by” become fact measures; elements after it become dimension elements.)

What is the level of granularity of a fact table? What does this signify? (With weekly-level summarization, for example, there is no need to keep the invoice number in the fact table any more.)


Granularity is the level of detail of the data stored in a data warehouse

How are the Dimension tables designed?

De-Normalized , Wide, Short , Use Surrogate Keys, Contain Additional date fields and flags.

What are slowly changing dimensions? What are non-additive facts? (e.g. inventory levels, account balances in a bank)

What is VLDB?

(If the database is too large to back up within the available time frame, it is a VLDB.)

What are SCD1, SCD2 and SCD3? How do you load the time dimension?

What are semi-additive and factless facts? In which scenarios would you use such kinds of fact tables?

A fact table without any metrics in it is factless fact.

what are conformed dimensions?

Conformed dimensions can be used to analyze facts from two or more data marts. Suppose you have a “shipping” data mart (telling you what you’ve shipped to whom and when) and a “sales” data mart (telling you who has purchased what and when). Both marts require a “customer” dimension and a “time” dimension. If they’re the same dimension, then you have conforming dimensions, allowing you to extract and manipulate facts relating to a particular customer from both marts, answering questions such as whether late shipments have affected sales to that customer

Differences between star and snowflake

A snowflake schema is a set of tables comprised of a single, central fact table surrounded by normalized dimension hierarchies

A star schema is a set of tables comprised of a single, central fact table surrounded by de-normalized dimensions

ETL Questions:

What is a staging area? Do we need it? What is the purpose of a staging area?

Data staging is actually a collection of processes used to prepare source system data for loading a data warehouse. Staging includes the following steps:

Source data extraction, Data transformation (restructuring),

Data transformation (data cleansing, value transformations),

Surrogate key assignments


What is a three tier data warehouse?

Three tiered data warehousing means there are 3 tiers of data, each designed to meet a specific set of end user requirements

Operational Data Systems (Tier 1)

operations of a business on a day to day basis

Data Warehouse (Tier 2)

This data layer may be comprised of multiple data structures; the operational data store (ODS) for tactical decision support applications which require transaction level detail as well as the data warehouse which provides a single common set of data bases designed specifically for all decision support applications in a business. Data Mart (Tier 3)

This tier is customized for a specific department or set of users like sales/marketing analysts, financial analysts, customer satisfaction, etc.

What are the various methods of getting incremental records or delta records from the source systems? What are the various tools? - Name a few

What is latest version of Power Center / Power Mart?

Power center 7.0, Power mart 7.0

What is the difference between Power Center & Power Mart?

Informatica PowerCenter license - has all options, including distributed metadata, the ability to organize repositories into a data mart domain and to share metadata across repositories.

PowerMart - a limited license (all features except distributed metadata and multiple registered servers). Only local repository can be created

What are the various transformation available?

Source Qualifier, Filter, Router, Joiner, Aggregate, Expression, Rank, Lookup, Update, Sequence Generator, Stored Procedure, External Stored Procedure, Adv St Procedure, XML, Normalization

What are the modules in Power Mart?

What are active transformation / Passive transformations?

Active transformation can change the number of rows that pass through it. Eg: The filter transformation removes rows that do not meet the filter conditions

Passive transformation does not change the number of rows that pass through it

Eg: an Expression transformation performs calculations on data and passes all rows through the transformation.

What are the different Lookup methods used in Informatica?


Static and Dynamic

Can Informatica load heterogeneous targets from heterogeneous sources? How do we call shell scripts from informatica? What is Informatica Metadata and where is it stored?

Data about data, it’s stored in the repository

What is a mapping, session, worklet, workflow, mapplet?

How can we use mapping variables in Informatica? Where do we use them? What are parameter files ? Where do we use them?

Can we override a native sql query within Informatica? Where do we do it? How do we do it?

Yes, with Override sql query in the Mapping – Properties

Eg: Select * from…. Where….. It’s advised not to use ORDER BY Clause here.

Can we use procedural logic inside Informatica? If yes, how? If not, how can we use external procedural logic in Informatica?

Do we need an ETL tool? When do we go for the tools in the market? How do we extract SAP data Using Informatica? What is ABAP? What are IDOCS? How to determine what records to extract?

Timestamps * Deletes are logical, with timestamped deletes * Triggers on source system tables (generally we don't do this as it decreases source system efficiency) * Application integration software (TIBCO, MQSeries) * File compares (least preferred method) * Snapshots in Oracle (daily) * Oracle Streams

What is Full load & Incremental or Refresh load?

Techniques of Error Handling - Ignore ,Rejecting bad records to a flat file , loading the records and reviewing them (default values)

What are snapshots? What are materialized views & where do we use them? What is a materialized view log?

What is partitioning? What are the types of partitioning? When do we Analyze the tables? How do we do it?

BI QUESTIONS

Compare ETL & manual development?

Business Intelligence

What is Business Intelligence?

What is OLTP?


OLTP – Online transaction processing: Defines the transaction processing that supports the daily business operations

What is OLAP?

Online Analytical Processing: “Drilling down” on various data dimensions to gain a more detailed view of the data. For instance, a user might begin by looking at North American sales and then drill down on regional sales, then sales by state, and then sales by major metro area. Enables a user to view different perspectives of the same data to facilitate decision-making.

What is OLAP, MOLAP, ROLAP, DOLAP, HOLAP?

Examples? ROLAP = relational OLAP: the users see cubes but under the hood it is pure relational tables; MicroStrategy is a ROLAP product. MOLAP = multidimensional OLAP: the users see cubes and under the hood there is one big cube; Oracle Express used to be a MOLAP product. DOLAP = desktop OLAP: the users see many cubes and under the hood there are many small cubes; Cognos PowerPlay. HOLAP = hybrid OLAP: combines MOLAP and ROLAP; Essbase.

Name some of the standard Business Intelligence tools in the market?

What are the various modules in the Business Objects product suite? What is a Universe? What is BAS? What is its function? How do we enhance the functionality of the reports in BO? (VBA??)

DOC :7

What is a data-warehouse?

A data warehouse is subject-oriented, integrated, time-variant and non-volatile [data] collection in support of management decision making processes. (OR) A data warehouse, is a collection of data gathered from one or more data repositories to create a new, central database.

Storage of large volumes of data, Historical data, Load and save - no updates

Reporting system, Query and analysis, Trends and forecasting

What are Data Marts?

A data mart is restricted to a single business process or a single business group.

The union of the data marts equals the data warehouse.

What is ER Diagram?

Entity-Relationship diagram

Relationships could be (a) one-to-one, (b) one-to-many (crow's foot notation), or (c) many-to-many.

What is a Star Schema?

Star schema: a modeling paradigm that has a single object in the middle (the fact table) connected radially to a number of objects (dimension tables) around it – denormalized.
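Eg: a minimal star-join sketch (the fact and dimension tables below are hypothetical); the central fact table is joined radially to each dimension on its surrogate key:

SELECT d.calendar_month,
       p.product_category,
       SUM(f.sales_amount) AS sales
FROM   sales_fact  f,
       date_dim    d,
       product_dim p
WHERE  f.date_key    = d.date_key
AND    f.product_key = p.product_key
GROUP  BY d.calendar_month, p.product_category;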

What is Dimensional Modeling?

What is a Snowflake Schema?

Snowflake structure: Snowflake is a star schema with normalized dimensions.

Data cleaning: filling in missing values, smoothing noisy data, identifying and removing outliers, correcting inconsistencies, etc.

What are the Different methods of loading Dimension tables?

What are Aggregate tables?

After fact tables are built, any necessary aggregate fact tables must be built. Aggregate tables are structured to define "totals" of data stored in granular fact tables. This pre-summarization allows for faster extracts from the warehouse, avoiding costly repeats of "sum/group by" SQL requests. Aggregate tables may be built at the staging level or may be built as a post-load process in the warehouse DBMS itself.
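Eg: a hedged sketch, reusing the hypothetical sales_fact and date_dim tables from above; building the monthly aggregate once avoids repeating the sum/group-by against the detailed fact table:

CREATE TABLE sales_agg_monthly AS
SELECT d.calendar_month,
       f.product_key,
       SUM(f.sales_amount) AS sales_amount,
       SUM(f.quantity)     AS quantity
FROM   sales_fact f, date_dim d
WHERE  f.date_key = d.date_key
GROUP  BY d.calendar_month, f.product_key;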

What is the Difference between OLTP and OLAP?

OLTP || OLAP

Functional: day-to-day operations || Decision support

DB design: application oriented || Subject oriented

Data: current, up to date || Historical data

Detailed, flat relational || Summarized, isolated

Unit of work: short, simple transaction || Complex query

What is ETL?

Processes of Extracting, Transforming (or Transporting) and Loading (ETL) data from source systems into the data warehouse (or)

Extract, Transform and Load – a set of database utilities used to extract information from one database, transform it and load it into a second database, such as a data warehouse. These tools are particularly useful for aggregating data from different database suppliers, e.g., Oracle and Sybase.

What are the various ETL tools in the Market?


DataStage, Data Junction, Ab Initio

What are the various Reporting tools in the Market?

Seagate Crystal Reports, MS Access, Business Objects

What is Fact table?

A table in a star schema that contains facts. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. (OR)

The tables which are extracted from heterogeneous sources and used in the Data Warehouse.

What is a dimension table?

Dimension tables describe the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called Lookup or Reference tables.

What is a lookup table?

What is a general purpose scheduling tool? Name some of them?

What are modeling tools available in the Market? Name some of them?

Visio-based database modeling component - Visual Studio .NET

What is real time data-warehousing?

An enterprise-wide implementation that replicates data from the same publication table on different servers/platforms to a single subscription table. This implementation effectively consolidates data from multiple sources.

What is data mining?

Data mining is about the discovery of knowledge, rules, or patterns in large quantities of data.

The process of finding hidden patterns and relationships in data. For instance, a consumer goods company may track 200 variables about each consumer. There are scores of possible relationships among the 200 variables. Data mining tools will identify the significant relationships

What is Normalization? First Normal Form, Second Normal Form , Third Normal Form?

Normalization is a step-by-step process of removing redundancies and dependencies of attributes in a data structure. The condition of the data at completion of each step is described as a "normal form."
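Eg: an illustrative sketch only (hypothetical tables, Oracle-style syntax): repeating customer attributes on every order row creates redundancy; splitting them into their own table moves the design toward third normal form, where non-key attributes depend only on the key of their own table:

-- Before: customer_name and customer_city repeat on every order row
--   orders(order_id, customer_id, customer_name, customer_city, order_amount)

-- After:
CREATE TABLE customers (
  customer_id   NUMBER PRIMARY KEY,
  customer_name VARCHAR2(100),
  customer_city VARCHAR2(100)
);

CREATE TABLE orders (
  order_id     NUMBER PRIMARY KEY,
  customer_id  NUMBER REFERENCES customers (customer_id),
  order_amount NUMBER
);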

What is ODS?

The form that the data warehouse takes in the operational environment. Operational data stores can be updated, provide rapid and consistent response times, and contain only a limited amount of historical data.


What type of indexing mechanism do we need to use for a typical data warehouse?

Bitmap and B-tree indexes
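Eg (Oracle-style syntax assumed; table and column names are hypothetical): bitmap indexes suit low-cardinality dimension columns, while B-tree indexes (the default) suit high-cardinality keys:

-- Bitmap index on a low-cardinality attribute
CREATE BITMAP INDEX ix_cust_gender ON customer_dim (gender);

-- B-tree index on a high-cardinality foreign key in the fact table
CREATE INDEX ix_sales_cust ON sales_fact (customer_key);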

Which columns go to the fact table and which columns go the dimension table?

A rule of thumb: take the user requirement "My users need to see <measure> broken down by <element>". Everything before "broken down by" is a fact measure; everything after it is a dimension element.

What is the level of granularity of a fact table? What does this signify? (At a weekly level of summarization there is no need to keep the invoice number in the fact table anymore.)

Granularity is the level of detail of the data stored in a data warehouse

How are the Dimension tables designed?

De-normalized, wide, short; use surrogate keys; contain additional date fields and flags.
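Eg: a hedged sketch of such a dimension (hypothetical columns, Oracle-style syntax):

CREATE TABLE customer_dim (
  customer_key   NUMBER PRIMARY KEY,   -- surrogate key
  customer_id    VARCHAR2(30),         -- natural key from the source system
  customer_name  VARCHAR2(100),
  city           VARCHAR2(50),
  region         VARCHAR2(50),
  effective_date DATE,                 -- additional date fields
  expiry_date    DATE,
  current_flag   CHAR(1)               -- flag column
);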

What are slowly changing dimensions? What are non-additive facts? (e.g., inventory levels, account balances in a bank)

What is VLDB?

VLDB stands for Very Large Database. (If a database is too large to back up within the available time frame, it is a VLDB.)

What are SCD1, SCD2 and SCD3? How do you load the time dimension?

What are semi-additive and factless facts? And in which scenario would you use such fact tables?

Semi-additive facts (for example inventory levels or account balances) can be summed across some dimensions but not across time. A fact table without any metrics in it is a factless fact table.

What are conformed dimensions?

Conformed dimensions can be used to analyze facts from two or more data marts. Suppose you have a “shipping” data mart (telling you what you’ve shipped to whom and when) and a “sales” data mart (telling you who has purchased what and when). Both marts require a “customer” dimension and a “time” dimension. If they’re the same dimension, then you have conformed dimensions, allowing you to extract and manipulate facts relating to a particular customer from both marts, answering questions such as whether late shipments have affected sales to that customer.
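Eg: a minimal drill-across sketch (hypothetical shipments_fact and sales_fact tables sharing a conformed customer_dim); each fact is aggregated separately and the results are joined through the conformed dimension:

SELECT c.customer_name,
       sh.late_shipments,
       sa.total_sales
FROM   customer_dim c,
       (SELECT customer_key, COUNT(*) AS late_shipments
        FROM   shipments_fact
        WHERE  actual_ship_date > promised_ship_date
        GROUP  BY customer_key) sh,
       (SELECT customer_key, SUM(sales_amount) AS total_sales
        FROM   sales_fact
        GROUP  BY customer_key) sa
WHERE  sh.customer_key = c.customer_key
AND    sa.customer_key = c.customer_key;
-- Inner joins for brevity; outer joins would keep customers with no shipments or sales.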

Differences between star and snowflake

A snowflake schema is a set of tables comprised of a single, central fact table surrounded by normalized dimension hierarchies

A star schema is a set of tables comprised of a single, central fact table surrounded by de-normalized dimensions
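Eg: a hedged sketch (hypothetical tables, Oracle-style syntax) of the same product dimension in both styles:

-- Star: the category attribute is carried, denormalized, on the dimension
--   product_dim(product_key, product_name, category_name)

-- Snowflake: the category attribute is normalized into its own table
CREATE TABLE category_dim (
  category_key  NUMBER PRIMARY KEY,
  category_name VARCHAR2(50)
);

CREATE TABLE product_dim (
  product_key  NUMBER PRIMARY KEY,
  product_name VARCHAR2(100),
  category_key NUMBER REFERENCES category_dim (category_key)
);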


ETL Questions:

What is a staging area? Do we need it? What is the purpose of a staging area?

Data staging is actually a collection of processes used to prepare source system data for loading a data warehouse. Staging includes the following steps:

Source data extraction

Data transformation (restructuring)

Data transformation (data cleansing, value transformations)

Surrogate key assignment (see the sketch below)
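Eg: a hedged sketch of the surrogate-key step (Oracle-style sequence, hypothetical staging and dimension tables); in practice rows whose natural key already exists in the dimension would be filtered out or routed to an update path first:

-- Assign warehouse surrogate keys to new rows arriving from staging
INSERT INTO customer_dim (customer_key, customer_id, customer_name)
SELECT customer_dim_seq.NEXTVAL,   -- surrogate key from a sequence
       s.customer_id,              -- natural key from the source
       s.customer_name
FROM   stg_customers s;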

What is a three tier data warehouse?

Three tiered data warehousing means there are 3 tiers of data, each designed to meet a specific set of end user requirements

Operational Data Systems (Tier 1)

Supports the operations of a business on a day-to-day basis.

Data Warehouse (Tier 2)

This data layer may be composed of multiple data structures: the operational data store (ODS) for tactical decision-support applications that require transaction-level detail, as well as the data warehouse, which provides a single common set of databases designed specifically for all decision-support applications in a business.

Data Mart (Tier 3)

This tier is customized for a specific department or set of users like sales/marketing analysts, financial analysts, customer satisfaction, etc.

What are the various methods of getting incremental records or delta records from the source systems? What are the various tools? - Name a few

What is latest version of Power Center / Power Mart?

PowerCenter 7.0, PowerMart 7.0

What is the difference between Power Center & Power Mart?

Informatica PowerCenter license - has all options, including distributed metadata, the ability to organize repositories into a data mart domain and to share metadata across repositories.

PowerMart - a limited license (all features except distributed metadata and multiple registered servers). Only a local repository can be created.

What are the various transformations available?

Source Qualifier, Filter, Router, Joiner, Aggregator, Expression, Rank, Lookup, Update Strategy, Sequence Generator, Stored Procedure, External Procedure, Advanced External Procedure, XML Source Qualifier, Normalizer

What are the modules in Power Mart?

What are active transformation / Passive transformations?

Active transformation can change the number of rows that pass through it. Eg: The filter transformation removes rows that do not meet the filter conditions

Passive transformation does not change the number of rows that pass through it

Eg: The Expression transformation performs calculations on data and passes all rows through the transformation.

What are the different Lookup methods used in Informatica?

Static and Dynamic


Frequently Asked Questions about Data Warehousing. Published in Portal Feature in June 2003.

(DataFlux would like to thank Wayne Eckerson at The Data Warehousing Institute for his contributions to some of these questions.)

What is the main purpose of a data warehouse?

The primary function of a data warehouse is to provide organizations with a single version of the truth – a single, encompassing view of their data. Data management is crucial to data warehousing, because without it a data warehouse will be ineffective. A complete data management strategy ensures that data in the data warehouse is consistent, accurate and reliable.


Why is data management so important to data warehousing?

The basic fact is that organizations have a limited appreciation of the quality of data residing in the operational systems with the majority having no data management processes in place at all. A survey conducted by The Data Warehousing Institute (TDWI) shows that around 44 percent of respondents said that their data quality was worse than they had anticipated. Additionally, 40 percent admitted to costs, problems and losses directly attributed to data quality issues.

Further studies by business analysts conclude that poor data quality is the main cause of failure and limited acceptance of data warehousing and business intelligence projects. Poor data quality is costing U.S. organizations billions of dollars every year in lost sales and lower customer satisfaction rates due to the lack of accurate information available.

How is a data warehouse different from a normal database?

Every company conducting business inputs valuable information into transactional-oriented data stores. The distinguishing traits of these online transaction processing (OLTP) databases are that they handle very detailed, day-to-day segments of data, are very write-intensive by nature and are designed to maximize data input and throughput while minimizing data contention and resource-intensive data lookups.

By contrast, a data warehouse is constructed to manage aggregated, historical data records, is very read-intensive by nature and is oriented to maximize data output. Usually, a data warehouse is fed a daily diet of detailed business data in overnight batch loads with the intricate daily transactions being aggregated into more historical and analytically formatted database objects. Naturally, since a data warehouse is a collection of a business entity’s historical information, it tends to be much larger in terms of size than its OLTP counterpart.

Is data stewardship important to data warehousing?

Effective data warehousing/data management requires organizations to adopt a data stewardship approach. Stewardship is different than ownership. A steward is a person who is expected to exercise responsible care over an asset that he or she does not own. The data is actually owned by the enterprise. The steward is responsible for caring for that asset.

Data stewardship is important but establishing a data stewardship program is very difficult! One immediate challenge to a stewardship program is to identify the group or person responsible for a set of data.

What does data management consist of?

Data management, as it relates to data warehousing, consists of four key areas associated with improving the management, and ultimately the usability and reliability, of data. These are:

Data profiling: Understanding the data we have.

Data quality: Improving the quality of data we have.

Data integration: Combining similar data from multiple sources.

Data augmentation: Improving the value of the data.

How does data warehousing benefit from data profiling?

Data profiling deciphers the content and the structure of the information being moved into the data warehouse. This profiling, or discovery process, enables organizations to more quickly understand data quality and structure issues before any information is moved into trusted information stores. By deciphering data issues, profiling helps organizations build a sound strategy for ensuring the accuracy and quality of information being moved into their data warehouse. Of course, profiling simply identifies data issues; data quality and data integration software is almost always required to correct them.
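Eg: a hedged profiling sketch against a hypothetical staging table; even simple counts reveal completeness and cardinality issues before the data moves onward:

SELECT COUNT(*)                                       AS total_rows,
       COUNT(DISTINCT customer_id)                    AS distinct_customers,
       SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS missing_emails,
       MIN(last_update_ts)                            AS oldest_change,
       MAX(last_update_ts)                            AS newest_change
FROM   stg_customers;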

Why combine data warehousing and data quality?

Bill Inmon, father of data warehousing, says the purpose of the ETL phase is to load the data warehouse with integrated and cleansed data. Data management is a key component in preparing data for entry into the data warehouse. The integration of data warehousing and data quality provides the ability to manage data quality on an enterprise-wide scale, solving issues for both data stewards/business analysts and IT/data warehousing professionals. A true data quality solution must address the entire process – IT/data warehousing professionals need data management tools that function within their ETL environment. Data stewards/business analysts need data management tools that simplify the complex business rules governing the algorithms and methodologies that identify true errors in the data.

How do data warehousing and OLAP differ?

The answer to this depends on with whom you have the conversation. While there have been attempts to make a relational database do online analytical processing (termed ROLAP), the fact is that the relational and OLAP engines are quite distinct in how they store and access data. Most often, a data warehouse uses a relational engine for its data management. The reasons for this are many, but the central driving force is that relational engines have the maturity and capability to handle the heavy load and storage requirements of very large data warehouses.

The OLAP engines differ from their relational cousins in that they use a different object foundation. While the two-dimensional table is the main logical storage structure for the relational database, the OLAP engine uses (primarily) a three-dimensional cube structure with the third dimension most often being time. While OLAP databases can handle fairly large databases, their capabilities do not match that of relational engines when it comes to managing hundreds of gigabytes of data. They do, however, make excellent data mart candidates.

Can I deliver information to decision-makers they can trust?

The answer lies in your approach to data management. When an organization treats its data as a strategic asset and deploys a data warehousing strategy to support this important asset, then decision-makers can trust the information that they receive. A data management solution merged with ETL manages data on an enterprise-wide scale, solving issues for both business analysts and IT professionals. It provides business analysts with easy-to-use tools that simplify the data auditing and analysis processes. Data warehousing professionals get data quality administration tools that function within their ETL environments. This approach emphasizes not only the construction of quality data with loading into a warehouse but also the ongoing management of your warehouse, providing increased automation of data transformations, integration of external information and simplified management of complex job dependencies.

What is the single most important objective in building a data warehouse?

While a solid foundational design is extremely important to a data warehouse, it ranks second to consistent, accurate and reliable data. Perhaps the most difficult task in initially creating and maintaining a data warehouse is ensuring the validity of the information stored within the database itself. The collecting and cleansing of data from many disparate enterprise systems is not an easy task, and a data warehouse project usually fails because the data cannot be validated and relied upon by the decision-makers using the warehouse itself.

How can data warehousing improve my bottom line?

OLTP systems are not designed to help decision-makers spot the buying patterns of their customers or assist knowledge workers with the analysis of historical inputs and outputs of a company’s inventory. However, these types of data analysis are critical in the fast-paced and competitive business world, especially in the e-commerce arena where the competition is only a mouse click away. Making solid, informed business decisions mandates that companies make the most of the information they collect. To accomplish that, the data warehouse must be organized to answer the questions that today’s knowledge workers ask.

Where should data management start?

A successful data management program has both proactive and reactive components. The proactive component consists of establishing the overall governance, defining the roles and responsibilities, establishing the quality expectations and the supporting business practices, and deploying a technical environment that supports these business practices. Specialized tools are often needed in this technical environment.

There are many opportunities to improve data quality at a point of data integration. The most logical point is at the source of the data. Data sources have various formats, reside in multiple platforms and are often widely distributed. Some data sources are more complete, while others have missing or incorrect values. By performing corrective maintenance and preventing data quality issues at the source, the data warehousing effort becomes more effective.

The point of passage of data from the operational environment into the data warehouse is a very good place to address data quality. In order to address the completeness of data coming from multiple sources, it is necessary to first address the issue of data quality in the source applications and then address the issue of compatibility of data as the data is merged. Data quality tools provide robust matching logic to facilitate merging of disparate data across data sources.

What is the "next generation" of data warehousing?

The next generation of data warehousing is real-time or active data warehousing. An active data warehouse also implies near real-time analysis. The key is to get large volumes of data into the data warehouse for next-day analysis. For example, many e-business executives want to see reports on the previous day's Web activity. But this often involves corralling gigabytes of data. The technologies to support active data warehousing can be complex. But companies that want and need to translate data into action quickly will rely on active or real-time data warehousing tools and techniques to achieve this goal.

What is IT's data management pain?

IT department heads tell us that decision-makers in their organization have trouble getting the high-quality information needed to determine how best to allocate resources, manage costs, add and retain the right customers, attain profit targets and more. It is their responsibility to find smarter ways to use technology for satisfying the business needs and show how IT contributes to the organization. IT says that it is faced with the challenges of so much data constantly coming in and everyone wanting immediate answers to their questions. It is hard to produce accurate information and meet everyone's needs.

Another challenge is that data comes from so many sources and it is not standardized. "We have duplicate data on individual customers that doesn't match up. Different reports on the same question yield different answers. We do not have a consistent view of information. We don't know the best way to achieve high-quality data in a low-risk manner. We can't even tell management how much time and effort it will take to clean up the data."

And another IT department states, "We have records and fields from various data sources, platforms and systems but we don't have the ability to extract and transform the data into information that users can trust. We know that if our systems don't produce accurate information, our entire organization suffers."

How can I increase the ROI of my data warehouse?

Closed-loop analysis is the holy grail of business intelligence. Closed-loop analysis can make collected information more valuable. With more than 50 percent of data warehousing initiatives failing as a result of poor data quality, it is time for organizations to think seriously about the impact of data quality. In many cases, closing the loop involves a context switch between analytical and operational applications. In most cases, the actionable element can be automated through custom programming. For instance, an agent that tracks inventory in a data warehouse can kick off a purchase order for new parts when inventory levels get below a certain point.
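Eg: a purely illustrative sketch (hypothetical tables, Oracle-style syntax) of that automated, actionable element: a scheduled query that queues purchase orders for items that have fallen below their reorder point:

INSERT INTO purchase_order_queue (product_key, reorder_qty)
SELECT i.product_key,
       i.reorder_qty
FROM   inventory_snapshot_fact i
WHERE  i.snapshot_date = TRUNC(SYSDATE)
AND    i.on_hand_qty   < i.reorder_point;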

CRM applications make good use of closed-loop processing. Here, marketing managers use a CRM package to gather information about customers, analyze the data and create new campaigns based on customers' tendency to purchase certain types of goods through specific channels in response to specific offers. The campaign runs, data is collected and reviewed, and the process repeats itself.

What is the impact of e-business on corporate data warehouse initiatives?

Analyzing data in near real time is a big requirement in the e-business world. The Web makes real-time data analysis even harder because Web sites generate so much data. We know some companies that extract 2GB of data a day from their various Web sites and turn them into reports for their executives to see the next day. How do you capture, integrate and report on all this data, most of which is simply text strings with fast-changing variables that are difficult to decode? That's the challenge for e-business intelligence vendors.

What is the key to data warehousing success?

Setting realistic expectations from the start and understanding that a data warehousing project is never totally complete. It begins with matching your existing technical skills to the business value you want to achieve. Finding the right pace with the right balance will be your organization's biggest challenge.

Also, when starting a data warehousing project, hire an expert who has built a data warehouse before to provide assistance. This person can help sell the concept to upper management, train staff about the core principles and techniques, and set a proper direction based on tough lessons learned in the trenches. Large vendors, such as IBM, have trained many consultants in a comprehensive methodology for building data warehouses. They can make good partners, and it would also be wise to bring in a consultant without a vendor affiliation to provide a second opinion.
