Advantages of an ETL Tool

Many IT departments still rely on SQL scripts executed within their database to move data from one database instance to another. This works fairly well, even with very basic SQL like:

INSERT INTO table1 (column1, column2, someInt, someVarChar)
SELECT table2.column1, table2.column2, 8, 'some string etc.'
FROM table2
WHERE table2.criteria = 'some value';

The above SQL script should work on most database products. For Oracle Database, even more sophisticated capabilities exist for moving data from one place to another, for example the COPY FROM command in SQL*Plus:

COPY FROM HR/your_password@ORCL -
CREATE NEW_TABLE_SALESMEN (EMPLOYEE_ID, SA_MAN) -
USING SELECT EMPLOYEE_ID, LAST_NAME FROM EMPLOYEES -
WHERE JOB_CODE='SALESMEN';

This would “copy” the full data set from one schema to another and create a new table. Even incremental loads (that is, only the changed data) can be implemented quickly using the MERGE statement (often called “upsert”), e.g.:

MERGE INTO employees e
    USING hr_records h
    ON (e.id = h.emp_id)
  WHEN MATCHED THEN
    UPDATE SET e.address = h.address
  WHEN NOT MATCHED THEN
    INSERT (id, address)
    VALUES (h.emp_id, h.address);

All of the above sample code is also very efficient, since the complete processing is done within the database by the SQL optimizer.

So to summarize: the above code is easy to write, can be scheduled with the database scheduler, and performs very efficiently within the database. So why do ETL tools exist for this task?
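As a brief aside, scheduling such a load inside the database is indeed straightforward. The following is a minimal sketch using Oracle's DBMS_SCHEDULER; the job name and the nightly schedule are made up for illustration, and the job action simply wraps the MERGE from above in a PL/SQL block:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
    job_name        => 'NIGHTLY_EMPLOYEE_MERGE',   -- hypothetical job name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN
                          MERGE INTO employees e
                          USING hr_records h
                          ON (e.id = h.emp_id)
                          WHEN MATCHED THEN
                            UPDATE SET e.address = h.address
                          WHEN NOT MATCHED THEN
                            INSERT (id, address)
                            VALUES (h.emp_id, h.address);
                        END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY;BYHOUR=2',      -- run nightly at 02:00 (illustrative)
    enabled         => TRUE);
END;
/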

For more complex data movement design and operation, a dedicated ETL tool should definitely be considered for the following advantages:

  1. More flexible configuration in terms of data flow and handling of connection deployment
  2. Monitoring of data loading workflows
  3. Robustness against errors during data loading, due to restartability from the point of failure
  4. Debugging functionality during development
  5. Structured and standardized design processes
  6. Orchestration of different loading procedures and modeling of the dependencies between loading procedures (load sequence)
  7. Easier parallelization (performance) of multiple workflows
  8. Implementation of business rules and transformation of the data
  9. Visual representation of the data flow (data lineage documentation created by the tool)
  10. Multi-user development (check-in / check-out)
  11. Built-in versioning capabilities (source code control)
  12. Easier definition of aggregation rules for data
  13. Quicker implementation of new requirements in existing mappings (e.g. adding a new column)
  14. Easier implementation of Change Data Capture by using pre-built models to utilize the same mapping for both tasks
  15. Simpler access to 3rd-party sources which need to be incorporated into the loading procedure

After all, the Oracle BI Applications product also ships with an ETL tool for data movement, such as Informatica or ODI.

Informatica Monitoring using SQL Scripts

Informatica offers the Workflow Monitor and the log files to check how long certain workflows have run. This is very nice, but sometimes one wants to have the runtime statistics in a better format. This can be achieved using the following SQL, which queries the internal Informatica repository tables (the Informatica database schema).

The following query returns the workflow runtime, but also how long the Source Qualifier took and how long the target load took:

 
SELECT OPB_SWIDGINST_LOG.INSTANCE_NAME                                               AS det,
       TO_CHAR(OPB_SWIDGINST_LOG.START_TIME, 'yyyy-mm-dd hh24:mi:ss')                AS start_det,
       TO_CHAR(OPB_SWIDGINST_LOG.END_TIME, 'yyyy-mm-dd hh24:mi:ss')                  AS ende_det,
       -- runtime ("Laufzeit") of the single transformation instance in seconds
       ROUND((OPB_SWIDGINST_LOG.END_TIME - OPB_SWIDGINST_LOG.START_TIME) * 24*60*60) AS det_laufzeit,
       -- sum of the transformation runtimes per workflow run
       SUM(ROUND((OPB_SWIDGINST_LOG.END_TIME - OPB_SWIDGINST_LOG.START_TIME) * 24*60*60))
         OVER (PARTITION BY OPB_WFLOW_RUN.WORKFLOW_RUN_ID)                           AS det_sum_lfz,
       -- runtime of the task (session) instance in seconds
       ROUND((OPB_TASK_INST_RUN.END_TIME - OPB_TASK_INST_RUN.START_TIME) * 24*60*60) AS wf_laufzeit,
       OPB_WFLOW_RUN.WORKFLOW_NAME,
       TO_CHAR(OPB_TASK_INST_RUN.START_TIME, 'yyyy-mm-dd hh24:mi:ss')                AS start_wf,
       TO_CHAR(OPB_TASK_INST_RUN.END_TIME, 'yyyy-mm-dd hh24:mi:ss')                  AS ende_wf,
       OPB_WFLOW_RUN.WORKFLOW_RUN_ID
FROM   OPB_TASK_INST_RUN, OPB_WFLOW_RUN, OPB_SUBJECT, OPB_SWIDGINST_LOG
WHERE  OPB_SUBJECT.SUBJ_ID = OPB_TASK_INST_RUN.SUBJECT_ID
AND    OPB_WFLOW_RUN.WORKFLOW_ID = OPB_TASK_INST_RUN.WORKFLOW_ID
AND    OPB_WFLOW_RUN.WORKFLOW_RUN_ID = OPB_TASK_INST_RUN.WORKFLOW_RUN_ID
AND    OPB_SWIDGINST_LOG.WORKFLOW_RUN_ID = OPB_TASK_INST_RUN.WORKFLOW_RUN_ID
AND    OPB_SWIDGINST_LOG.TASK_INSTANCE_ID = OPB_TASK_INST_RUN.INSTANCE_ID
AND    OPB_SWIDGINST_LOG.WIDGET_TYPE IN (3, 2)         -- 3 = Source Qualifier, 2 = Target
--AND  OPB_WFLOW_RUN.WORKFLOW_NAME LIKE '%WORKFLOW%'   -- optional: filter on a specific workflow
AND    OPB_TASK_INST_RUN.START_TIME > TO_DATE('2014-01-01', 'yyyy-mm-dd')
ORDER BY OPB_WFLOW_RUN.WORKFLOW_RUN_ID DESC,
         OPB_TASK_INST_RUN.START_TIME DESC,
         OPB_SWIDGINST_LOG.START_TIME ASC;

The result set in SQL Developer looks like this:

(Screenshot: Informatica runtimes per transformation and workflow)

The following query is a subset of the above and shows only the workflow runtimes:

SELECT DISTINCT
       OPB_WFLOW_RUN.WORKFLOW_NAME,
       TO_CHAR(OPB_TASK_INST_RUN.START_TIME, 'yyyy-mm-dd hh24:mi:ss')                AS start_wf,
       TO_CHAR(OPB_TASK_INST_RUN.END_TIME, 'yyyy-mm-dd hh24:mi:ss')                  AS ende_wf,
       -- runtime ("Laufzeit") of the task (session) instance in seconds
       ROUND((OPB_TASK_INST_RUN.END_TIME - OPB_TASK_INST_RUN.START_TIME) * 24*60*60) AS wf_laufzeit,
       OPB_WFLOW_RUN.WORKFLOW_RUN_ID
FROM   OPB_TASK_INST_RUN, OPB_WFLOW_RUN, OPB_SUBJECT, OPB_SWIDGINST_LOG
WHERE  OPB_SUBJECT.SUBJ_ID = OPB_TASK_INST_RUN.SUBJECT_ID
AND    OPB_WFLOW_RUN.WORKFLOW_ID = OPB_TASK_INST_RUN.WORKFLOW_ID
AND    OPB_WFLOW_RUN.WORKFLOW_RUN_ID = OPB_TASK_INST_RUN.WORKFLOW_RUN_ID
AND    OPB_SWIDGINST_LOG.WORKFLOW_RUN_ID = OPB_TASK_INST_RUN.WORKFLOW_RUN_ID
AND    OPB_SWIDGINST_LOG.TASK_INSTANCE_ID = OPB_TASK_INST_RUN.INSTANCE_ID
AND    OPB_SWIDGINST_LOG.WIDGET_TYPE IN (3, 2)         -- 3 = Source Qualifier, 2 = Target
--AND  OPB_WFLOW_RUN.WORKFLOW_NAME LIKE '%WORKFLOW%'   -- optional: filter on a specific workflow
AND    OPB_TASK_INST_RUN.START_TIME > TO_DATE('2014-01-01', 'yyyy-mm-dd')
ORDER BY OPB_WFLOW_RUN.WORKFLOW_RUN_ID DESC;

The result set in SQL Developer looks like this:

(Screenshot: Informatica workflow runtimes)

This shows how the internal Informatica repository can be used to easily query runtime information.
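Building on the same repository tables, one could go a step further and aggregate the runtimes. The following is only a sketch, not an official Informatica query: it derives each workflow run's duration from the first task start and the last task end, then averages those durations per workflow name. Column semantics can vary between repository versions, so the numbers should be verified against the Workflow Monitor.

SELECT workflow_name,
       COUNT(*)                AS runs,
       ROUND(AVG(run_seconds)) AS avg_runtime_sec,
       ROUND(MAX(run_seconds)) AS max_runtime_sec
FROM (
       -- one row per workflow run: duration from first task start to last task end
       SELECT OPB_WFLOW_RUN.WORKFLOW_NAME   AS workflow_name,
              OPB_WFLOW_RUN.WORKFLOW_RUN_ID AS workflow_run_id,
              (MAX(OPB_TASK_INST_RUN.END_TIME) - MIN(OPB_TASK_INST_RUN.START_TIME)) * 24*60*60 AS run_seconds
       FROM   OPB_TASK_INST_RUN, OPB_WFLOW_RUN
       WHERE  OPB_WFLOW_RUN.WORKFLOW_ID     = OPB_TASK_INST_RUN.WORKFLOW_ID
       AND    OPB_WFLOW_RUN.WORKFLOW_RUN_ID = OPB_TASK_INST_RUN.WORKFLOW_RUN_ID
       AND    OPB_TASK_INST_RUN.START_TIME  > TO_DATE('2014-01-01', 'yyyy-mm-dd')
       GROUP BY OPB_WFLOW_RUN.WORKFLOW_NAME, OPB_WFLOW_RUN.WORKFLOW_RUN_ID
     )
GROUP BY workflow_name
ORDER BY avg_runtime_sec DESC;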