Course Outline

Identify an organisational requirement for a Data Warehouse

Understand what is involved in creation and ongoing administration

Understand how to improve an organisation's data and information

Identify and analyse potential data sources and how to improve business intelligence

Design a reporting plan to improve information management

Improve data integrity and quality

Create a database schema suitable for a Data Warehouse

Reporting

Decision making

Business Intelligence

What information do you need?

Databases

Documents

Websites

Emails

Be aware of privacy issues

Capture public accounts, e.g. sales and info addresses

Header information can be extracted and loaded

Strip all images and HTML

Images and attachments can be loaded individually

Access is via POP or IMAP protocols
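
As a rough illustration of the points above (not part of the original course material), the sketch below pulls message headers from a mailbox over IMAP, leaving bodies and attachments to be loaded separately; the server, account and password are placeholders.

import email
import imaplib

with imaplib.IMAP4_SSL("imap.example.org") as conn:
    conn.login("sales@example.org", "app-password")
    conn.select("INBOX", readonly=True)
    _, ids = conn.search(None, "ALL")
    for msg_id in ids[0].split():
        # Fetch only the header block; bodies and attachments can be loaded individually.
        _, data = conn.fetch(msg_id, "(BODY.PEEK[HEADER])")
        msg = email.message_from_bytes(data[0][1])
        print(msg["From"], msg["Date"], msg["Subject"])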

Research

Log Files

Overview

Data Warehousing and Business Intelligence Systems

Evan Leybourn

eleybourn@gmail.com

Multiple Data Sources

What is a Data Warehouse?

Repository of Organisational Information

Resolve conflicts between data from different sources.

Information is the processed and refined version of the data for human usage.

Understand how to combine data sources into a consolidated database

Important

Data from disparate sources is stored for reporting and analysis.

Data is the raw output of any database, website, log file or other data source.

Central Repository

Increase the efficiency, reliability and accessibility of your data

Standardise and simplify information management within an organisation

Cross-application reporting, improving business intelligence

Identify obsolete, duplicate and redundant data

Correlate records that may have different identifying keys.

Differentiate identical primary keys that belong to different records.

What data do you have?

Store records accurately, efficiently and without duplication.

Do you own your own data?

Discussion

Data Warehouse Phases

Data Warehouse Applications

This is an iterative process.

Analysis

Design

ETL

Reporting

Design of the Consolidation Database and Data Marts

Extraction, Transformation and Load (ETL) of the data

Business level reporting on the data

Analysis and Reverse Engineering

Rationale

Resolving Errors

Writing Good Reports

Narrowing your Results

Data Sources

Databases

Snowflake Schema

Outcomes

Databases and the Structural Schema

Dimensions

From a direct database connection

From a schema diagram such as an E.R. Diagram

From a remote connection such as ODBC/JDBC

As a last resort, they can be extracted from plain-text exports

Access is via a direct connection or ODBC

Identify and discuss potential data sources

Discussion

Office Documents and Corporate Knowledge

Text files and word processor documents

Presentations

Spreadsheets

Which can be analysed as if they were a database table

Access is via file access on a local path or URL

Fact

Data Integrity

Data Validation

Data Standardisation

Snowflakes

Websites with research and reference data

Ensure that you strip all advertising and navigation

The extraction tool should follow links where required

Access is via URL through HTTP Path

Corporate Emails

Identify and standardise all common data from different sources.

  • Gender: Male / Female => M / F etc.
  • Case Sensitivity: university => University
  • Addresses: Line 1 / Line 2 => Company / Street
  • Integrity checks scan the data source for physical and data errors
  • If the check fails, skip the insertion and raise an error flag
  • Many older applications are prone to these integrity problems
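
As a rough sketch of the standardisation step above (the field names are assumptions, not part of the original course material), each extracted row can be passed through a small mapping function before it is loaded:

GENDER_MAP = {"male": "M", "m": "M", "female": "F", "f": "F"}

def standardise(row: dict) -> dict:
    # Assumed column names; adjust to the real source schema.
    out = dict(row)
    out["gender"] = GENDER_MAP.get(str(row.get("gender", "")).strip().lower(), "U")
    out["organisation"] = str(row.get("organisation", "")).strip().title()
    out["company"] = row.get("address_line_1", "")   # Line 1 -> Company
    out["street"] = row.get("address_line_2", "")    # Line 2 -> Street
    return out

print(standardise({"gender": "Female", "organisation": "university",
                   "address_line_1": "Example Pty Ltd", "address_line_2": "1 Sample St"}))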

Validation checks scan the data source for logical errors. If the check fails, either:

  • Insert into the warehouse and raise an error flag, or
  • Skip the insertion and raise an error flag
  • Most reports display aggregated information to represent an organisational fact.
  • Advanced reporting tools can "drill down" into reports to extract more detailed information.
  • Users can look at the organisation level to see trends and statistics, and then further interrogate points of interest.
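
A minimal sketch of such a validation check (the rule and column name are assumptions), showing both of the failure options listed above: flag the record and load it, or skip the insertion entirely:

from datetime import date

def validate_dob(row: dict, skip_on_error: bool = False):
    # Logical check: date_of_birth must parse and fall in a plausible range.
    try:
        dob = date.fromisoformat(row["date_of_birth"])
        valid = date(1900, 1, 1) <= dob <= date.today()
    except (KeyError, ValueError):
        valid = False
    if valid:
        return row, None
    if skip_on_error:
        return None, "skipped: invalid date_of_birth"                # skip and flag
    return {**row, "error_flag": True}, "flagged: invalid date_of_birth"  # load and flag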

The Consolidation Database

  • Data sources must be appropriate to the business requirements
  • Irrelevant data can produce misleading information
  • The input data must be accurate
  • Most importantly, the person writing the report must understand the data
  • Users must be fully trained in the content and structure

This database should contain all the information from your data sources as read-only, historical and aggregated data

  • Can become very large
  • Historical information must be stored
  • Use an incremental extraction/insertion system

Data Marts are subsets of the consolidation database.

  • The Data Marts also contain historical and incremental data
  • Most Data Marts are highly denormalised for rapid querying
  • Indexes should be put on frequently used fields
  • Each schema should be designed for a report subject

Source Analysis

Research Papers and Abstracts

During the ETL process, records can be discovered that contain errors.

  • All erroneous records should be flagged for DBA attention
  • Erroneous records should always be resolved in the master data source to ensure stable and reliable data
  • Erroneous records can also identify logical flaws in the design of the Data Warehouse

Level of information based on access rights.

Access is via URL through HTTP Path

Improved decision making

Complete understanding of organisational data

Break vendor lock-in

Simplify application migration

Giving end-users more access to organisation data

Build reports to improve business capability and identify organisational weakness

The ETL process is very time-consuming

Computer processing requirements are very high

A security policy to cover data access and storage

A strict design, scope, prototyping and release cycle to deliver a working system in a timely manner.

Log Files from corporate servers

Statistical information is usually stored in simple log files which can be processed and integrated into the Data Warehouse.

Access is via file access on a local path or URL

Long Term Concerns

What quantity of data is being dealt with?

How frequently and when does this data change?

How much of this data is duplicated elsewhere?

Long Term Benefits

Data Marts

A normalised type of star schema.

  • The dimension data has been grouped into multiple tables instead of one large table.
  • Saves space and reduces redundancy
  • Increases the number of dimension tables and requires more foreign key joins.
  • Reduced query performance.
  • Less simple than a star schema

How satisfactory are the current reports?

How much of this data is obsolete or irrelevant?

How is this data used?

How can we access and extract the data?

Is external permission required before extracting this data, and how much will that cost?

Does the organisation have the skills in-house to perform the analysis?

Historical Extraction

Data Access Control

Web Services

Extraction

Star Schema

Scheduled Reporting

Consolidation and Data Mart

Differences

Dimensions

Discussion

Consider your reporting requirements. How would you create a star schema to represent that data?

Fact

  • A collection of protocols and standards used for exchanging data
  • A common standard across all programming languages and platforms
  • Can be utilised to allow other internal and third party applications to query the Data Warehouse.
  • Query Speed: Data Marts are optimised for querying
  • Schema: A Data Mart Schema is denormalised
  • Transformation: Simple or no transformation is required from the Consolidation Database
  • Written to meet a general business requirement
  • Users who cannot create reports can access the pre-run scheduled reports
  • Interaction required by the Data Warehouse administrators is minimised
  • Complicated and processor intensive reports to be run during non-peak times

The first step in the ETL process

  • Databases: Data is extracted via remote SQL queries
  • Documents/Log Files: Data is extracted via local directory paths or remote URLs
  • Websites: Data is extracted via remote URLs
  • Email: Data is extracted through an email connection
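
A minimal sketch of the database case listed above, assuming a PostgreSQL source and the psycopg2 driver (neither is mandated by the course material): extraction is simply a remote SQL query whose rows feed the transformation step.

import psycopg2

def extract(dsn: str, query: str = "SELECT * FROM customer"):
    # Connect to the remote source and stream the result set row by row.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            columns = [c[0] for c in cur.description]
            for record in cur:
                yield dict(zip(columns, record))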

In principle, a Data Warehouse should contain a complete historical log of the original data.

  • A record may store the fact that it is no longer current, e.g. cancelled.
  • A record may store who modified it and when, but not the changes themselves.
  • A record may store the changes in an audit log file, which may be difficult to process and extract from.
  • Data security and access control is critical in any data-rich environment
  • Sensitive and personal information should not be available to users who do not have a need to know
  • Users should have their access to the data and stored reports restricted based on the business requirements and trust level.

Building a Data Dictionary

Reverse Engineering

The simplest Data Warehouse schema.

  • A central fact table, surrounded by dimension or lookup tables.
  • Understandable for business users.
  • The fact tables are mostly in third normal form (3NF).
  • Dimension tables are in de-normalised second normal form (2NF), which involves some duplication of data.
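
As an illustration only (the sales subject area and column names are assumptions), a star schema can be sketched as one central fact table with foreign keys out to a handful of dimension tables; SQLite is used here simply to keep the sketch self-contained.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date,
    product_id INTEGER REFERENCES dim_product,
    store_id   INTEGER REFERENCES dim_store,
    quantity   INTEGER,
    amount     REAL
);
""")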

A Data Warehouse is Never Complete

The process involves extracting and building:

  • Metadata on all the available relations
  • Relationships between elements
  • A complete data model
  • A data dictionary for future reference

Contains definitions and representations of data elements

Define the human meaning of data

Define how data elements are stored in the DBMS

Should be comprehensive and simple to understand

Define historical data, structure, and extraction

New data sources need to be analysed and added

Custom reports will regularly need to be written

Data validation and integrity checks need to be run

Standard database maintenance and tuning

Reverse Engineering Applications

Most E.R. modellers provide some reverse engineering tools.

  • DBDesigner4 / MySQL Workbench
  • Microsoft Visio

In many countries, reverse engineering an artefact or process is lawful provided it was obtained legitimately.

Important

Status and Dashboarding

Ad-hoc Reporting

Transformation

Load

DBMS Choices

Historical Data

Garbage In, Garbage Out

The Semantic Web

Understanding Incremental

The second step in the ETL process. A series of rules or modifications is applied to each row.

Deleting Data

Adding Data

Modifying Data

  • PostgreSQL
  • Oracle
  • Primary DBMS used within your organisation
  • Written to answer a single business requirement.
  • Normally it does not need to be run again
  • Discarded after use.

Any column that has been extracted from the data source can be explicitly deleted from the load.

If your extraction query contains a wildcard (*), then you may wish to delete extraneous columns.

Arbitrary columns, with pre-set values, can be added to the consolidation database during the extraction process.

This is most often used to add descriptive details alongside extracted information.

Any column that has been extracted from the data source can be modified and transformed into any style you require.

Different applications do this in different ways. One of the most powerful methods is regular expressions.

The final step in the ETL process.

  • Any triggers are fired as each row is inserted.
  • Depending on the triggers and indexes, this can be very time-consuming.
  • It can be faster to drop all the indices on the tables for the load and recreate them after the ETL process.
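
A minimal sketch of the load step (the table and index names are assumptions): drop an index before the bulk insert and recreate it afterwards, which can shorten the overall load time.

import sqlite3

def bulk_load(conn: sqlite3.Connection, rows):
    # Assumes a person table similar to the CREATE TABLE example later in these notes.
    conn.execute("DROP INDEX IF EXISTS idx_person_family_name")
    conn.executemany(
        "INSERT INTO person (person_id, family_name, given_names) VALUES (?, ?, ?)",
        rows,
    )
    conn.execute("CREATE INDEX idx_person_family_name ON person (family_name)")
    conn.commit()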

A report can be as simple as a representation of a Key Performance Indicator (KPI), which can:

  • Identify and correct negative trends
  • Identify new trends
  • Measure efficiencies/inefficiencies
  • Monitor the performance of individual sections
  • A digital dashboard can display rapid and real time statistics
  • Timestamp each record when it gets inserted
  • Timestamp each record when it gets superseded
  • Each group of like records make up an incremental snapshot of a single data source record.

Regular Expressions (REGEX)

Garbage In, Garbage Out refers to the fact that computers will unquestioningly process any input, regardless of its quality or purpose.

  • Ensure that all input is correctly analysed
  • Validate and correct misleading data
  • Check the ETL tool error logs regularly
  • Confirm the output against a secondary source

A regular expression, or regex for short, is a pattern describing text, and optionally a replacement string.

  • Match a string: m/SEARCH/
  • Search and Replace: s/SEARCH/REPLACE/
  • Global Search and Replace: s/SEARCH/REPLACE/g
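
The same three operations can be tried directly with Python's re module (a sketch for illustration only, not part of the original material):

import re

text = "Data warehousing can improve business intelligence"
print(re.search(r"house", text))                 # m/SEARCH/ : first match
print(re.sub(r"house", "mart", text, count=1))   # s/SEARCH/REPLACE/ : single replace
print(re.sub(r"i", "I", text))                   # s/SEARCH/REPLACE/g : global replace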

Unchanged: Update the end timestamp to the current timestamp.

Updated: Create a new row with the start and end timestamp set to the current timestamp.

New (Inserted): Create a new row with the start and end timestamp set to the current timestamp.

Deleted: This row will never be extracted from the source.
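
A minimal sketch of those rules (the key and timestamp column names are assumptions), comparing the freshly extracted rows against the current warehouse snapshot:

from datetime import datetime

def incremental_merge(current: dict, extracted: dict, now=None):
    """Both arguments map a business key to a dict of source columns."""
    now = now or datetime.now()
    merged = []
    for key, row in extracted.items():
        existing = current.get(key)
        if existing is not None and existing["data"] == row:
            # Unchanged: extend the existing snapshot's end timestamp.
            merged.append({"data": row, "start_ts": existing["start_ts"], "end_ts": now})
        else:
            # New or updated: open a fresh snapshot row.
            merged.append({"data": row, "start_ts": now, "end_ts": now})
    # Deleted source rows never appear in the extraction, so nothing is written for them.
    return merged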

<input>
  <table>email</table>
  <transform>
    <type>regex</type>
    <in>body</in>
    <out>body</out>
    <action>s/<[^<>]+>//g</action>
  </transform>
</input>

<input>
  <select>SELECT * FROM customer</select>
  <table>person</table>
  <transform>
    <type>delete</type>
    <in>modified_time</in>
  </transform>
</input>

Literal Characters

Character Classes

Character Ranges

By building an intelligent semantic framework around the data contained within a Data Warehouse, the report generator can accurately respond to automated and free-form queries.

  • A project to give a semantic meaning, understandable by machines, to documents on the Web
  • Consists of a data model called Resource Description Framework (RDF), interchange formats and notations such as RDF Schema (RDFS) and the Web Ontology Language (OWL)
  • These facilitate formal description of concepts, terms, and relationships within a given domain

<input>
  <select>SELECT * FROM company</select>
  <transform>
    <type>add</type>
    <out>organisation_type</out>
    <value>Company</value>
  </transform>
</input>

<input>
  <select>SELECT * FROM ti</select>
  <transform>
    <type>add</type>
    <out>organisation_type</out>
    <value>Tertiary Institution</value>
  </transform>
</input>

  • You can use a hyphen inside a character class to specify a range of characters.
  • "[0-9]" matches a single digit between "0" and "9".
  • You can use more than one range. eg "[0-9a-zA-Z]"
  • The most basic regular expression consists of literal characters, e.g.: "a".
  • It will match the first occurrence of that character in the string.
  • If the string is "Data warehousing can improve business intelligence", it will match the first a after the D.
  • A character class will match only one out of a range of possible characters.
  • To match an "a" or an "e", use "[ae]" or "[ea]".
  • You could use this in "gr[ae]y" to match either "gray" or "grey", but not "graay" or "graey".
  • "\d" matches a single character that is a digit
  • "\w" matches a "word character" (alphanumeric characters plus underscore)
  • "\s" matches a whitespace character (includes tabs and line breaks).

Negation

Special Characters

  • Typing a caret "^" after the opening square bracket will negate the character class.
  • "h[^o]" matches "he" in "hello", but it will not match "hope".

There are eleven characters (or metacharacters) with special meanings:

  • [ (Open Square Bracket) - Start a character class
  • \ (backslash) - Turns metacharacters into normal characters.
  • ^ (caret) - Within a character class. This will negate the character class.
  • ^ (caret) - Outside a character class. This will match the beginning of a line.
  • $ (dollar sign) - This will match the end of a line.
  • . (period or dot) - Matches one character regardless of the character.
  • | (vertical bar or pipe symbol) - Matches either or.
  • ? (question mark) - Matches the preceding character 0 or 1 times.
  • * (asterisk or star) - Matches the preceding character 0 or more times.
  • + (plus sign) - Matches the preceding character 1 or more times.
  • ( (opening round bracket) - Start a character group
  • ) (closing round bracket) - End a character group
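
A few of these can be exercised with Python's re module (a sketch for illustration only):

import re

print(re.findall(r"gr[ae]y", "gray grey graay"))              # character class -> ['gray', 'grey']
print(re.search(r"h[^o]", "hello hope").group())              # negated class  -> 'he'
print(bool(re.match(r"^Data.*house\.$", "Data warehouse.")))  # anchors, . and * -> True
print(re.findall(r"colou?r", "color colour"))                 # optional char  -> ['color', 'colour']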

Example Table

Joining Data

Splitting Data

Dropping Rows

What DBMS is used organisationally?

Discussion

CREATE TABLE person (
    sys_person_id       BIGSERIAL PRIMARY KEY,
    person_id           TEXT,
    title               TEXT,
    family_name         TEXT,
    given_names         TEXT,
    gender              TEXT,
    date_of_birth       TEXT,
    passphrase          TEXT,
    start_timestamp_id  TIMESTAMP REFERENCES dw_timestamp,
    end_timestamp_id    TIMESTAMP REFERENCES dw_timestamp,
    UNIQUE (person_id, start_timestamp_id)
);

Your ETL system needs to be able to join multiple columns together from the source data.

The joined values can be separated by an optional delimiter.

Your ETL system needs to be able to split data apart based on a given delimiter.

The delimiter can either be a string or a regular expression.

For data validation purposes you must be able to drop entire rows from the load based on the value, or transformed value, of a given column.

This ensures that your Data Warehouse (or Data Mart) is in a stable and consistent state at all times.

<input>
  <select>SELECT * FROM email</select>
  <table>email</table>
  <transform>
    <type>drop</type>
    <in>email_address</in>
    <action>/^([a-zA-Z0-9_\.\-])+\@(([a-zA-Z0-9\-])+\.)+([a-zA-Z0-9]{2,4})+$/</action>
  </transform>
</input>

<input>
  <select>SELECT * FROM email</select>
  <table>email</table>
  <transform>
    <type>split</type>
    <in>email_address</in>
    <out>username,domain</out>
    <action>@</action>
  </transform>
</input>

<input>
  <select>SELECT * FROM email</select>
  <table>email</table>
  <transform>
    <type>join</type>
    <in>username,domain</in>
    <out>address</out>
    <action>@</action>
  </transform>
</input>

Commercial Applications

Open Source Applications

Business Objects

Netezza

SAP Business Information Warehouse

Teradata

Pentaho

Jasper Reports

YALE

OpenI


Architecture overview: Data Sources → ETL → Consolidation Database → Data Marts → Reporting Engine
