Data Warehousing
Presentation by Evan Leybourn, 3 May 2012

Data Warehousing and
Business Intelligence Systems

Evan Leybourn
Identify an organisational requirement for a Data Warehouse
Understand how to improve an organisation's data and information
Understand what is involved in creation and ongoing administration
Identify and analyse potential data sources and how to improve business intelligence
Design a reporting plan to improve information management
Improve data integrity and quality
Create a database schema suitable for a Data Warehouse
Course Outline
What is a Data Warehouse?
Repository of Organisational Information
Data from disparate sources is stored for
Reporting
Decision making
Business Intelligence
Data is the raw output of any database,
website, log files or other data source.
Information is the processed and refined version of the data for human usage.
What data do you have?
What information do you need?
Central Repository
Increase the efficiency, reliability and accessibility of your data
Standardise and simplify information management within an organisation
Cross-application reporting, improving business intelligence
Identify obsolete, duplicate and redundant data
Multiple Data Sources
Resolve conflicts between data from different sources.
Correlate records that may have different identifying keys.
Differentiate identical primary keys from different records.
Understand how to combine data sources into a consolidated database
Store records accurately, efficiently and without duplication.
Important
Do you own your own data?
Discussion
Data Warehouse Phases
Analysis and Reverse Engineering
This is an iterative process.
Data Warehouse Applications
Open Source Applications
Pentaho
Jasper Reports
YALE
OpenI
Commercial Applications
Business Objects
Netezza
SAP Business Information Warehouse
Teradata
Overview
Analysis
Design of the Consolidation Database and Data Marts
Design
Extraction, Transformation and Load (ETL) of the data
ETL
Business level reporting on the data
Reporting
Data Sources
Databases
Documents
Websites
Emails
Research
Log Files
Identify and discuss potential data sources
Discussion
Source Analysis
How much of this data is duplicated elsewhere?
Does the organisation have the skills in-house to perform the analysis?
How satisfactory are the current reports?
How is this data used?
How frequently and when does this data change?
Is external permission required before extracting this data, and how much will that cost?
How can we access and extract the data?
How much of this data is obsolete or irrelevant?
What quantity of data is being dealt with?
From a direct database connection
From a schema diagram such as an E.R. Diagram
From a remote connection such as ODBC/JDBC
As a last resort they can be extracted from text extracts


Access is via a direct connection or ODBC
Databases and the Structural Schema
Text files and word processor documents
Presentations
Spreadsheets
Which can be analysed as if they were a database table


Access is via file access on a local path or URL
Office Documents and Corporate Knowledge
Ensure that you strip all advertising and navigation
The extraction tool should follow links where required


Access is via URL through HTTP Path
Websites with research and reference data
Be aware of privacy issues
Capture public accounts, e.g. sales and info
Header information can be extracted and loaded
Strip all images and HTML
Images and attachments can be loaded individually


Access is via POP or IMAP protocols
Corporate Emails
Level of information based on access rights.


Access is via URL through HTTP Path
Research Papers and Abstracts
Statistical information is usually stored in simple log files which can be processed and integrated into the Data Warehouse.


Access is via file access on a local path or URL
Log Files from corporate servers
Building a Data Dictionary
Contains definitions and representations of data elements
Define the human meaning of data
Define how data elements are stored in the DBMS
Should be comprehensive and simple to understand
Define historical data, structure, and extraction
Reverse Engineering
The process involves extracting and building:

Metadata on all the available relations
Relationships between elements
A complete data model
A data dictionary for future reference
Reverse Engineering Applications
Most E.R. modellers provide some reverse engineering tools.

DBDesigner4 / MySQL Workbench
Microsoft Visio
In many countries, reverse engineering the artefact or process is lawful provided it is obtained legitimately.
Important
A Data Warehouse is Never
Complete
New data sources need to be analysed and added
Custom reports will regularly need to be written
Data validation and integrity checks need to be run
Standard database maintenance and tuning
Outcomes
Long Term Benefits
Improved decision making
Complete understanding of organisational data
Break vendor lock-in
Simplify application migration
Give end-users more access to organisational data
Build reports to improve business capability and identify organisational weakness
Long Term Concerns
The ETL process is very time-consuming
Computer processing requirements are very high
A security policy is needed to cover data access and storage
A strict design, scope, prototyping and release cycle is needed to deliver a working system in a timely manner.
Databases
DBMS Choices
PostgreSQL
Oracle
Primary DBMS used within your organisation
Historical Data
Timestamp each record when it gets inserted
Timestamp each record when it gets superseded
Each group of like records makes up an incremental snapshot of a single data source record.
Star Schema
The simplest Data Warehouse schema.

A central fact table, surrounded by dimension or lookup tables.
Understandable for business users.
The fact tables are mostly in third normal form (3NF)
Dimensional tables are in denormalised second normal form (2NF), which involves some duplication of data.
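To make the layout concrete, here is a minimal star schema sketch in PostgreSQL (one of the DBMS choices discussed later); the fact and dimension table and column names are illustrative assumptions, not taken from the presentation.

-- A central fact table surrounded by denormalised dimension tables.
CREATE TABLE dim_date (
    date_id    SERIAL PRIMARY KEY,
    full_date  DATE,
    month_name TEXT,
    year       INTEGER
);

CREATE TABLE dim_product (
    product_id   SERIAL PRIMARY KEY,
    product_name TEXT,
    category     TEXT  -- repeated per product: denormalised 2NF
);

CREATE TABLE fact_sales (
    sale_id    BIGSERIAL PRIMARY KEY,
    date_id    INTEGER REFERENCES dim_date,
    product_id INTEGER REFERENCES dim_product,
    quantity   INTEGER,
    amount     NUMERIC(12,2)
);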
The Consolidation Database
This database should contain all the information from your data sources as read-only, historical and aggregated data

Can become very large
Historical information must be stored
Use an incremental extraction/insertion system
Data Marts
Data Marts are subsets of the consolidation database.

The Data Marts also contain historical and incremental data
Most Data Marts are highly denormalised for rapid querying
Indexes should be put on frequently used fields
Each schema should be designed for a report subject
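As a sketch of such a report-subject schema (table, column and index names are assumed, not from the presentation), a monthly sales Data Mart might look like:

-- Dimension attributes are copied inline (denormalised) so reports need
-- no joins; the most frequently filtered fields are indexed.
CREATE TABLE mart_monthly_sales (
    month_start  DATE,
    product_name TEXT,
    category     TEXT,
    region       TEXT,
    total_amount NUMERIC(14,2),
    order_count  INTEGER
);
CREATE INDEX idx_mart_sales_month  ON mart_monthly_sales (month_start);
CREATE INDEX idx_mart_sales_region ON mart_monthly_sales (region);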
What DBMS is used organisationally?
Discussion
[Diagram: star schema with a central fact table surrounded by dimension tables]
Snowflake Schema
A normalised type of star schema.
The dimension data has been grouped into multiple tables instead of one large table.
Saves space and reduces redundancy
Increases the number of dimension tables and requires more foreign key joins.
Reduced query performance.
[Diagram: snowflake schema with a central fact table and normalised dimension tables]
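As a sketch, the assumed dim_product dimension from the star schema example above could be snowflaked like this (names are again assumptions):

-- The category attribute is normalised into its own table, saving space
-- and removing duplication, at the cost of an extra join at query time.
CREATE TABLE dim_category (
    category_id   SERIAL PRIMARY KEY,
    category_name TEXT
);

CREATE TABLE dim_product_sf (
    product_id   SERIAL PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category
);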
Consider your reporting requirements. How would you create a star schema to represent that data?
Discussion
Snowflakes
Example Table
CREATE TABLE person (
sys_person_id BIGSERIAL PRIMARY KEY,
person_id TEXT,
title TEXT,
family_name TEXT,
given_names TEXT,
gender TEXT,
date_of_birth TEXT,
passphrase TEXT,
start_timestamp_id TIMESTAMP REFERENCES dw_timestamp,
end_timestamp_id TIMESTAMP REFERENCES dw_timestamp,
UNIQUE (person_id, start_timestamp_id)
);
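The person table references a dw_timestamp table that is not shown in the transcript; one plausible minimal definition (an assumption, not part of the original example, and it would need to be created before person) is:

-- Assumed companion table, keyed on the timestamp value so that the
-- TIMESTAMP foreign keys in person can reference it directly.
CREATE TABLE dw_timestamp (
    timestamp_id TIMESTAMP PRIMARY KEY,
    load_batch   TEXT  -- e.g. which ETL run created the snapshot (illustrative)
);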
Understanding Incremental
Unchanged: Update the end timestamp to the current timestamp.
Updated: Create a new row with the start and end timestamp set to the current timestamp.
New (Inserted): Create a new row with the start and end timestamp set to the current timestamp.
Deleted: This row will never be extracted from the source.
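A rough PostgreSQL sketch of these rules against the person example follows; the staging_person table and the choice of columns compared are assumptions.

-- Register the current load timestamp (person references dw_timestamp).
INSERT INTO dw_timestamp (timestamp_id)
SELECT now()
WHERE NOT EXISTS (SELECT 1 FROM dw_timestamp WHERE timestamp_id = now());

-- Unchanged rows: push the end timestamp forward on the current snapshot.
UPDATE person p
   SET end_timestamp_id = now()
  FROM staging_person s
 WHERE p.person_id = s.person_id
   AND p.family_name = s.family_name;  -- ...compare the remaining columns too

-- New and updated rows: insert a fresh snapshot with start = end = now().
INSERT INTO person (person_id, family_name, given_names,
                    start_timestamp_id, end_timestamp_id)
SELECT s.person_id, s.family_name, s.given_names, now(), now()
  FROM staging_person s
 WHERE NOT EXISTS (
         SELECT 1 FROM person p
          WHERE p.person_id = s.person_id
            AND p.family_name = s.family_name);  -- ...and the other columns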
Consolidation and Data Mart
Differences
Query Speed: Data Marts are optimised for querying
Schema: A Data Mart Schema is denormalised
Transformation: Simple or no transformation is required from the Consolidation Database
Rationale
Transformation
The second step in the ETL process: a series of rules and modifications is applied to each row.
Load
The final step in the ETL process.

Any triggers are fired as each row is inserted.
Depending on the triggers and indexes, this can be very time-consuming.
It can be faster to drop all the indices on the tables for the load and recreate them after the ETL process.
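A sketch of that approach for the person example table (the index name and the bulk-load file path are hypothetical):

-- Drop the index, bulk load, then rebuild it once the data is in place.
DROP INDEX IF EXISTS idx_person_family_name;

COPY person (person_id, family_name, given_names)
FROM '/path/to/person_extract.csv' WITH (FORMAT csv);  -- hypothetical extract file

CREATE INDEX idx_person_family_name ON person (family_name);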
Historical Extraction
In principle, a Data Warehouse should contain a complete historical log of the original data.

A record may store the fact that it is no longer current, e.g. cancelled.
A record may store the modified by and time, but not the changes.
A record may store the changes in an audit log file, which may be difficult to process and extract from.
Data Integrity
Integrity checks scan the data source for physical and data errors
If the check fails, skip the insertion and raise an error flag
Many older applications are prone to these integrity problems
Data Validation
Validation checks scan the data source for logical errors. If the check fails:

Insert into the warehouse and raise an error flag or
Skip the insertion and raise an error flag
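As an illustration against the person example, a validation query might look for logically impossible values; the specific date-of-birth check is an assumption.

-- Flag rows whose date_of_birth parses as a date (so integrity passes)
-- but lies in the future (logically invalid).
SELECT sys_person_id, date_of_birth
  FROM person
 WHERE date_of_birth ~ '^\d{4}-\d{2}-\d{2}$'
   AND date_of_birth::date > current_date;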
Resolving Errors
During the ETL process, records can be discovered that contain errors.

All erroneous records should be flagged for DBA attention
Erroneous records should always be resolved in the master data source to ensure stable and reliable data
Erroneous records can also identify logical flaws in the design of the Data Warehouse
Garbage In, Garbage Out
Garbage In, Garbage Out refers to the fact that computers will unquestioningly process any input, regardless of its quality or purpose

Ensure that all input is correctly analysed
Validate and correct misleading data
Check the ETL tool error logs regularly
Confirm the output against a secondary source
Extraction
The first step in the ETL process
Databases: Data is extracted via remote SQL queries
Documents/Log Files: Data is extracted via local directory paths or remote URLs
Websites: Data is extracted via remote URLs
Email: Data is extracted through an email connection
[Diagram: Data Sources -> ETL -> Consolidation Database -> Data Marts -> Reporting Engine]
Data Standardisation
Identify and standardise all common data from different sources.
Gender: Male / Female => M / F etc.
Case Sensitivity: university => University
Addresses: Line 1 / Line 2 => Company / Street
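A sketch of the gender standardisation in SQL, using the person example table; the raw source values shown are assumptions about what the incoming data might contain.

-- Map free-form gender values onto the standard single-letter codes.
UPDATE person
   SET gender = CASE
                  WHEN lower(gender) IN ('male', 'm')   THEN 'M'
                  WHEN lower(gender) IN ('female', 'f') THEN 'F'
                  ELSE gender
                END;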
Deleting Data
Any column that has been extracted from the data source can be explicitly deleted from the load.

If your extraction query contains a wildcard (*), then you may wish to delete extraneous columns.
Adding Data
Arbitrary columns, with pre-set values, can be added to the consolidation database during the extraction process.

This is most often used to add descriptive details alongside extracted information.
Modifying Data
Any column that has been extracted from the data source can be modified and transformed into any style you require.

Different applications do this in different ways. One of the most powerful methods is regular expressions.
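For example, in PostgreSQL regexp_replace can strip HTML tags from an email body, mirroring the s/<[^<>]+>//g transform shown later in the ETL configuration (table and column names follow that example):

-- Remove every HTML tag from the body column ('g' replaces all matches).
UPDATE email
   SET body = regexp_replace(body, '<[^<>]+>', '', 'g');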
Joining Data
Your ETL system needs to be able to join multiple columns together from the source data.

The joined values should be separated by an optional delimiter.
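In PostgreSQL this can be done with concat_ws, mirroring the username/domain join shown later in the ETL configuration:

-- Join username and domain with '@' as the delimiter.
SELECT concat_ws('@', username, domain) AS address
  FROM email;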
Splitting Data
Your ETL system needs to be able to split a column apart based on a given delimiter.

The delimiter can either be a string or a regular expression.
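In PostgreSQL, split_part does this for a plain string delimiter, mirroring the email_address split shown later in the ETL configuration:

-- Split email_address on '@' into its two parts.
SELECT split_part(email_address, '@', 1) AS username,
       split_part(email_address, '@', 2) AS domain
  FROM email;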
Dropping Rows
For data validation purposes you must be able to drop entire rows from the load based on the value, or transformed value, of a given column.

This ensures that your Data Warehouse (or Data Mart) is in a stable and consistent state at all times.
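A sketch of dropping invalid rows at load time; the staging_email table is an assumption, and the pattern follows the drop transform in the ETL configuration below.

-- Only load rows whose email address matches the expected pattern.
INSERT INTO email (email_address, body)
SELECT email_address, body
  FROM staging_email
 WHERE email_address ~ '^[A-Za-z0-9_.-]+@([A-Za-z0-9-]+\.)+[A-Za-z0-9]{2,4}$';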
eleybourn@gmail.com
<input>
<select>SELECT * FROM company</select>
<transform>
<type>add</type>
<out>organisation_type</out>
<value>Company</value>
</transform>
</input>
<input>
<select>SELECT * FROM ti</select>
<transform>
<type>add</type>
<out>organisation_type</out>
<value>Tertiary Institution</value>
</transform>
</input>
<input>
<select>SELECT * FROM customer</select>
<table>person</table>
<transform>
<type>delete</type>
<in>modified_time</in>
</transform>
</input>
<input>
<table>email</table>
<transform>
<type>regex</type>
<in>body</in>
<out>body</out>
<action>s/<[^<>]+>//g</action>
</transform>
</input>
<input>
<select>SELECT * FROM email</select>
<table>email</table>
<transform>
<type>drop</type>
<in>email_address</in>
<action>/^([a-zA-Z0-9_\.\-])+\@(([a-zA-Z0-9\-])+\.)+([a-zA-Z0-9]{2,4})+$/</action>
</transform>
</input>
<input>
<select>SELECT * FROM email</select>
<table>email</table>
<transform>
<type>join</type>
<in>username,domain</in>
<out>address</out>
<action>@</action>
</transform>
</input>
<input>
<select>SELECT * FROM email</select>
<table>email</table>
<transform>
<type>split</type>
<in>email_address</in>
<out>username,domain</out>
<action>@</action>
</transform>
</input>
A regular expression, or regex for short, is a pattern describing text, and optionally a replacement string.

Match a string: m/SEARCH/
Search and Replace: s/SEARCH/REPLACE/
Global Search and Replace: s/SEARCH/REPLACE/g
Regular Expressions (REGEX)
Literal Characters
The most basic regular expression consists of literal characters, e.g.: "a".
It will match the first occurrence of those characters in the string.
If the string is "Data warehousing can improve business intelligence", it will match the first a after the D.
Character Classes
A character class will match only one out of a range of possible characters.
To match an "a" or an "e", use "[ae]" or "[ea]".
You could use this in "gr[ae]y" to match either "gray" or "grey", though not "graay" or "graey".
"\d" matches a single character that is a digit
"\w" matches a "word character" (alphanumeric characters plus underscore)
"\s" matches a whitespace character (includes tabs and line breaks).
Character Ranges
You can use a hyphen inside a character class to specify a range of characters.
"[0-9]" matches a single digit between "0" and "9".
You can use more than one range. eg "[0-9a-zA-Z]"
Special Characters
. (period or dot) - Matches one character regardless of the character.
| (vertical bar or pipe symbol) - Matches either or.
? (question mark) - Matches the preceding character 0 or 1 times.
* (asterisk or star) - Matches the preceding character 0 or more times.
+ (plus sign) - Matches the preceding character 1 or more times.
( (opening round bracket) - Start a character group
) (closing round bracket) - End a character group
Special Characters
There are eleven characters (or metacharacters) with special meanings:

[ (Open Square Bracket) - Start a character class
\ (backslash) - Turns metacharacters into normal characters.
^ (caret) - Within a character class. This will negate the character class.
^ (caret) - Outside a character class. This will match the beginning of a line.
$ (dollar sign) - This will match the end of a line.
Negation
Typing a caret "^" after the opening square bracket will negate the character class.
"h[^o]" matches "he" in "hello", but it will not match "hope".
Writing Good Reports
Status and Dashboarding
Ad-hoc Reporting
Written to answer a single business requirement.
Normally it does not need to be run again
Discarded after use.
Web Services
A collection of protocols and standards used for exchanging data
A common standard across all programming languages and platforms
Can be utilised to allow other internal and third party applications to query the Data Warehouse.
Narrowing your Results
Most reports display aggregated information to represent an organisational fact.
Advanced reporting tools can "drill down" into reports to extract more detailed information.
Users can look at the organisation level to see trends and statistics, and further interrogate points of interest.
The Semantic Web
By building an intelligent semantic framework around the data contained within a Data Warehouse, the report generator can accurately respond to automated and free-form queries.

A project to give a semantic meaning, understandable by machines, to documents on the Web
Consists of a data model called Resource Description Framework (RDF), interchange formats and notations such as RDF Schema (RDFS) and the Web Ontology Language (OWL)
These facilitate formal description of concepts, terms, and relationships within a given domain
Data Access Control
Data security and access control is critical in any data-rich environment
Sensitive and personal information should not be available to users who do not have a need to know
Users should have their access to the data and stored reports restricted based on the business requirements and trust level.
Scheduled Reporting
Written to meet a general business requirement
Users who cannot create reports can access the pre-run scheduled reports
Interaction required by the Data Warehouse administrators is minimised
Complicated and processor-intensive reports can be run during non-peak times
A report can be as simple as a representation of a Key Performance Indicator (KPI), which can:

Identify and correct negative trends
Identify new trends
Measure efficiencies/inefficiencies
Monitor the performance of individual sections
A digital dashboard can display rapid and real time statistics
Data sources must be appropriate to the business requirements
Irrelevant data can provide misleading information
The input data must be accurate
Most importantly, the person writing the report must understand the data
Users must be fully trained in the content and structure