Identify an organisational requirement for a Data Warehouse
Understand what is involved in creation and ongoing administration
Understand how to improve an organisation's data and information
Identify and analyse potential data sources and how to improve business intelligence
Design a reporting plan to improve information management
Improve data integrity and quality
Create a database schema suitable for a Data Warehouse
Reporting
Decision making
Business Intelligence
What information do you need?
Databases
Documents
Websites
Emails
Be aware of privacy issues
Capture public accounts, e.g. sales and info
Header information can be extracted and loaded
Strip all images and HTML
Images and attachments can be loaded individually
Access is via POP or IMAP protocols
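Once a message has been retrieved over POP or IMAP, its header information can be pulled out with Python's standard `email` module. A minimal sketch, assuming a raw message already fetched from the mailbox (the addresses and field names here are illustrative only):

```python
from email.parser import Parser

# A raw message as it might be retrieved over POP3 or IMAP
# (addresses are illustrative only).
raw = (
    "From: sales@example.com\r\n"
    "To: info@example.com\r\n"
    "Subject: Quarterly figures\r\n"
    "Date: Mon, 01 Mar 2010 09:00:00 +1000\r\n"
    "\r\n"
    "Body text, stripped before loading.\r\n"
)

msg = Parser().parsestr(raw)

# Extract only the header fields for loading; the body, images
# and attachments can be handled separately.
record = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "date": msg["Date"],
}
```

The `record` dictionary is then ready to be loaded as one row of an email table.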
Research
Log Files
Evan Leybourn
Repository of Organisational Information
Resolve conflicts between data from different sources.
Information is the processed and refined version of the data for human usage.
Understand how to combine data sources into a consolidated database
Data from disparate sources is stored for reporting, decision making and business intelligence.
Data is the raw output of any database, website, log file or other data source.
Increase the efficiency, reliability and accessibility of your data
Standardise and simplify information management within an organisation
Cross-application reporting, improving business intelligence
Identify obsolete, duplicate and redundant data
Correlate records that may have different identifying keys.
Differentiate identical primary keys from different records.
What data do you have?
Store records accurately, efficiently and without duplication.
Do you own your own data?
This is an iterative process.
Design of the Consolidation Database and Data Marts
Extraction, Transformation and Load (ETL) of the data
Business level reporting on the data
Analysis and Reverse Engineering
From a direct database connection
From a schema diagram such as an E.R. Diagram
From a remote connection such as ODBC/JDBC
As a last resort, the schema can be reverse engineered from text extracts
Access is via a direct connection or ODBC
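Extraction from a database source amounts to connecting and pulling every row the warehouse needs. A minimal sketch using Python's built-in `sqlite3` as a stand-in source; against a production system the connection would be direct or via ODBC/JDBC, and the table name is illustrative:

```python
import sqlite3

# In-memory database standing in for a production source system;
# a real source would be reached via a direct connection or ODBC/JDBC.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])

# Extraction: pull every row, exactly as the ETL tool would.
rows = conn.execute("SELECT * FROM customer").fetchall()
```

Each tuple in `rows` then feeds into the transformation step.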
Identify and discuss potential data sources
Text files and word processor documents
Presentations
Spreadsheets
Which can be analysed as if they were a database table
Access is via file access on a local path or URL
Ensure that you strip all advertising and navigation
The extraction tool should follow links where required
Access is via URL through HTTP Path
Identify and standardise all common data from different sources.
Validation checks scan the data source for logical errors. If a check fails, the offending rows can be dropped from the load.
This database should contain all the information from your data sources as read-only, historical and aggregated data
Data Marts are subsets of the consolidation database.
During the ETL process, records can be discovered that contain errors.
Level of information based on access rights.
Access is via URL through HTTP Path
Improved decision making
Complete understanding of organisational data
Break vendor lock-in
Simplify application migration
Giving end-users more access to organisation data
Build reports to improve business capability and identify organisational weakness
The ETL process is very time-consuming
Computer processing requirements are very high
A security policy to cover data access and storage
A strict design, scope, prototyping and release cycle to deliver a working system in a timely manner.
Statistical information is usually stored in simple log files which can be processed and integrated into the Data Warehouse.
Access is via file access on a local path or URL
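Processing a log file is itself a small ETL job: read each line, parse out the fields, and load them as rows. A sketch assuming a web-server-style log format (the format and field names are illustrative, not a specific server's):

```python
import re

# Two lines in a common web-server log style (illustrative format).
log_lines = [
    '192.0.2.1 - - [01/Mar/2010:09:00:00 +1000] "GET /index.html" 200',
    '192.0.2.2 - - [01/Mar/2010:09:00:05 +1000] "GET /about.html" 404',
]

# Pull out the host, timestamp, request and status code so the
# statistics can be loaded into the Data Warehouse.
pattern = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]+)" (\d{3})$')

records = []
for line in log_lines:
    m = pattern.match(line)
    if m:
        host, timestamp, request, status = m.groups()
        records.append({"host": host, "timestamp": timestamp,
                        "request": request, "status": int(status)})
```

Lines that fail to match are simply skipped, which doubles as a basic validation check.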
What quantity of data is being dealt with?
How frequently and when does this data change?
How much of this data is duplicated elsewhere?
A snowflake schema is a normalised type of star schema.
How satisfactory are the current reports?
How much of this data is obsolete or irrelevant?
How is this data used?
How can we access and extract the data?
Is external permission required before extracting this data, and how much will that cost?
Does the organisation have the skills in-house to perform the analysis?
Consider your reporting requirements. How would you create a star schema to represent that data?
The first step in the ETL process
In principle, a Data Warehouse should contain a complete historical log of the original data.
The simplest Data Warehouse schema.
The process involves extracting and building:
Contains definitions and representations of data elements
Define the human meaning of data
Define how data elements are stored in the DBMS
Should be comprehensive and simple to understand
Define historical data, structure, and extraction
New data sources need to be analysed and added
Custom reports will regularly need to be written
Data validation and integrity checks need to be run
Standard database maintenance and tuning
Most E.R. modellers provide some reverse engineering tools.
In many countries, reverse engineering the artefact or process is lawful provided it is obtained legitimately.
The second step in the ETL process. A series of rules or modifications is applied to each row.
Any column that has been extracted from the data source can be explicitly deleted from the load.
If your extraction query contains a wildcard (*), then you may wish to delete extraneous columns.
Arbitrary columns, with pre-set values, can be added to the consolidation database during the extraction process.
This is most often used to add descriptive details alongside extracted information.
Any column that has been extracted from the data source can be modified and transformed into any style you require.
Different applications do this in different ways. One of the most powerful methods is regular expressions.
The final step in the ETL process.
A report can be as simple as a representation of a Key Performance Indicator (KPI), which can:
Garbage In, Garbage Out refers to the fact that computers will unquestioningly process whatever input they are given, regardless of quality, and produce correspondingly flawed output.
A regular expression, or regex for short, is a pattern describing text, and optionally a replacement string.
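The tag-stripping substitution `s/<[^<>]+>//g` used in the transform examples can be expressed with Python's `re` module. A minimal sketch (the sample body text is illustrative):

```python
import re

# The substitution s/<[^<>]+>//g in Python: remove anything that
# looks like an HTML/XML tag from an email body before loading it.
body = "<p>Quarterly <b>figures</b> attached.</p>"
clean = re.sub(r"<[^<>]+>", "", body)
```

The pattern matches any run of non-angle-bracket characters enclosed in `<` and `>`, and the empty replacement string deletes every occurrence.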
Unchanged: Update the end timestamp to the current timestamp.
Updated: Create a new row with the start and end timestamp set to the current timestamp.
New (Inserted): Create a new row with the start and end timestamp set to the current timestamp.
Deleted: This row will never be extracted from the source.
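The four rules above can be sketched as a single pass over the warehouse rows. This is only an illustration of the logic: the row shape, function name and integer timestamps are assumptions, not part of any particular ETL tool.

```python
def apply_timestamp_rules(warehouse, extracted, now):
    """Sketch of the four rules; row shape and names are illustrative."""
    seen = {row["key"]: row for row in extracted}
    for row in list(warehouse):          # copy: we append while looping
        incoming = seen.pop(row["key"], None)
        if incoming is None:
            # Deleted: this row is never extracted from the source again.
            continue
        if incoming["data"] == row["data"]:
            # Unchanged: update the end timestamp to the current timestamp.
            row["end"] = now
        else:
            # Updated: new row with start and end set to the current timestamp.
            warehouse.append({"key": row["key"], "data": incoming["data"],
                              "start": now, "end": now})
    for incoming in seen.values():
        # New (inserted): new row with start and end set to the current timestamp.
        warehouse.append({"key": incoming["key"], "data": incoming["data"],
                          "start": now, "end": now})
    return warehouse

warehouse = [
    {"key": 1, "data": "a", "start": 0, "end": 0},  # unchanged
    {"key": 2, "data": "b", "start": 0, "end": 0},  # updated
    {"key": 3, "data": "x", "start": 0, "end": 0},  # deleted
]
extracted = [{"key": 1, "data": "a"},
             {"key": 2, "data": "c"},
             {"key": 4, "data": "d"}]               # new
result = apply_timestamp_rules(warehouse, extracted, now=1)
```

Because updated and deleted rows keep their old end timestamps, the warehouse preserves a complete historical log of the source data.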
<input>
<table>email</table>
<transform>
<type>regex</type>
<in>body</in>
<out>body</out>
<action>s/<[^<>]+>//g</action>
</transform>
</input>
<input>
<select>SELECT * FROM customer</select>
<table>person</table>
<transform>
<type>delete</type>
<in>modified_time</in>
</transform>
</input>
By building an intelligent semantic framework around the data contained within a Data Warehouse, the report generator can accurately respond to automated and free-form queries.
<input>
<select>SELECT * FROM company</select>
<transform>
<type>add</type>
<out>organisation_type</out>
<value>Company</value>
</transform>
</input>
<input>
<select>SELECT * FROM ti</select>
<transform>
<type>add</type>
<out>organisation_type</out>
<value>Tertiary Institution</value>
</transform>
</input>
There are eleven characters (or metacharacters) with special meanings:
What DBMS is used organisationally?
CREATE TABLE person (
sys_person_id BIGSERIAL PRIMARY KEY,
person_id TEXT,
title TEXT,
family_name TEXT,
given_names TEXT,
gender TEXT,
date_of_birth TEXT,
passphrase TEXT,
start_timestamp_id TIMESTAMP REFERENCES dw_timestamp,
end_timestamp_id TIMESTAMP REFERENCES dw_timestamp,
UNIQUE (person_id, start_timestamp_id)
);
Your ETL system needs to be able to join multiple columns together from the source data.
The joined values should be separated by an optional delimiter.
Your ETL system needs to be able to split a column apart based on a given delimiter.
The delimiter can either be a string or a regular expression.
For data validation purposes you must be able to drop entire rows from the load based on the value, or transformed value, of a given column.
This ensures that your Data Warehouse (or Data Mart) is in a stable and consistent state at all times.
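The three capabilities above (join, split, and drop-on-validation) can each be sketched in a few lines of Python. The function names are illustrative, and drop is interpreted here as rejecting rows whose value fails the validation pattern:

```python
import re

def split_column(value, delimiter):
    # Split one value into several, e.g. email_address -> username, domain.
    return value.split(delimiter)

def join_columns(values, delimiter=""):
    # Join several values, separated by an optional delimiter.
    return delimiter.join(values)

def drop_row(value, pattern):
    # True if the row should be dropped: here, when the value
    # does not match the validation pattern.
    return re.match(pattern, value) is None

username, domain = split_column("sales@example.com", "@")
address = join_columns([username, domain], "@")
invalid = drop_row("not-an-address", r"^[\w.\-]+@[\w.\-]+\.\w{2,4}$")
```

These correspond directly to the `split`, `join` and `drop` transform types in the configuration examples that follow.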
<input>
<select>SELECT * FROM email</select>
<table>email</table>
<transform>
<type>drop</type>
<in>email_address</in>
<action>/^([a-zA-Z0-9_\.\-])+\@(([a-zA-Z0-9\-])+\.)+([a-zA-Z0-9]{2,4})+$/</action>
</transform>
</input>
<input>
<select>SELECT * FROM email</select>
<table>email</table>
<transform>
<type>split</type>
<in>email_address</in>
<out>username,domain</out>
<action>@</action>
</transform>
</input>
<input>
<select>SELECT * FROM email</select>
<table>email</table>
<transform>
<type>join</type>
<in>username,domain</in>
<out>address</out>
<action>@</action>
</transform>
</input>
Business Objects
Netezza
SAP Business Information Warehouse
Teradata
Pentaho
Jasper Reports
YALE
OpenI