Redshift best practices
Transcript (10 July 2015)

Some more materials
How can I learn SQL?
Optimizing query performance
How Redshift works
How to use Redshift like a pro?
How can I get access?
Examples
How can I use it?
Basic rules
What's in there?
Redshift docs are amazing
This talk is based on Chartio's white paper
Redshift is built on top of PostgreSQL
Differences:
http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html
http://www.w3schools.com/sql/default.asp
http://www.sqlcourse.com/select.html
use it!
http://aws.amazon.com/documentation/redshift/
each team has its own schema and user group
team members can write to their team's tables
everyone can read all the tables
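As a rough sketch, the setup behind those rules could look like this in SQL (the schema and group names here are made up for illustration, not Prezi's real ones):

-- each team gets a schema and a user group (illustrative names)
CREATE SCHEMA growth;
CREATE GROUP growth_team;

-- team members can create and write tables in their own schema
GRANT USAGE, CREATE ON SCHEMA growth TO GROUP growth_team;

-- everyone can read all the tables
GRANT USAGE ON SCHEMA growth TO PUBLIC;
GRANT SELECT ON ALL TABLES IN SCHEMA growth TO PUBLIC;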
ask the data team
datasupport@prezi.com
Prezi's data warehouse
data to be analyzed interactively
data for Chartio
use the redshift command on hadoopclient.prezi.com
or set up Workbench locally
https://docs.google.com/a/prezi.com/document/d/1rLRki7XXX7BrWm_Y2c--KBaer39pYoPpaKqyqoFBSfw/edit
http://chartio.com/blog/2014/10/amazon_redshift_distribution_style
Prezi has a 4-node cluster
dense storage vs. dense compute
dense storage: cheaper, huge storage, slower computation
dense compute: for performance-intensive workloads, less storage, expensive
data is distributed between the nodes
distribution has a huge effect on performance
even one slow query can block the cluster
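If you suspect a runaway query, the system tables can show what is currently running so it can be cancelled; a quick sketch (the pid is just an example):

-- what is running right now, longest first (duration is in microseconds)
SELECT pid, user_name, duration, substring(query, 1, 60) AS query_text
FROM stv_recents
WHERE status = 'Running'
ORDER BY duration DESC;

-- cancel the offending query by its pid
CANCEL 12345;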
Workload management
Distributing your table
EVEN
data gets distributed in a round-robin fashion
default setting
don't use it if you want to join this table
KEY
distribute by one column
same values are on the same node
use it for efficiently joining big tables
ALL
the table will be copied to all nodes
use it for slowly changing tables that get joined very often
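A minimal sketch of the three styles in CREATE TABLE form (table and column names are made up for illustration):

-- EVEN (the default): rows are spread round-robin across the nodes
CREATE TABLE raw_events (event_id bigint, user_id bigint, event_time timestamp)
DISTSTYLE EVEN;

-- KEY: rows with the same user_id land on the same node, so joins on user_id stay local
CREATE TABLE events_by_user (event_id bigint, user_id bigint, event_time timestamp)
DISTKEY (user_id);

-- ALL: every node gets a full copy; good for small, slowly changing tables that are joined a lot
CREATE TABLE countries (country_code char(2), country_name varchar(64))
DISTSTYLE ALL;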
Sortkey
determines the order of data on each node
can improve the performance of selections and joins
If recent data is queried most frequently, specify the timestamp column as the leading column of the sort key.
If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
If you frequently join a table, specify the join column as both the sort key and the distribution key.
if you use your table for several purposes... well, there will be a trade-off somewhere
The Redshift tutorial is pretty straightforward:
http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
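For example, a purely illustrative event table that is mostly filtered on recent dates and often joined to users on user_id could be declared like this:

-- event_time leads the sort key because recent data is queried most often;
-- user_id is the distkey (and second sort key column) because it is joined on a lot
CREATE TABLE user_events (
    event_id    bigint,
    user_id     bigint,
    event_time  timestamp,
    event_type  varchar(32)
)
DISTKEY (user_id)
COMPOUND SORTKEY (event_time, user_id);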
Other tips and tricks
try to minimize the result size
filter
aggregate
filter before joining
select only the columns you need
use LIMIT
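Putting those tips together, a rough sketch of the shape a query should have (using the user_table / prezi_table example described further down; the date and the 'pro' filter are just illustrative):

-- filter and aggregate before joining, keep only the columns you need, LIMIT while exploring
SELECT u.location, count(*) AS prezi_count
FROM (SELECT user_id, location FROM user_table WHERE licence = 'pro') u
JOIN (SELECT user_id FROM prezi_table WHERE creation_date >= '2015-06-01') p
  ON p.user_id = u.user_id
GROUP BY u.location
ORDER BY prezi_count DESC
LIMIT 10;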
use primary and unique keys
Redshift doesn't enforce them
the query planner uses them for optimisation
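So it is still worth declaring them; a sketch using the example user_table described below (the column types are assumptions):

-- Redshift will NOT reject duplicate user_ids or usernames,
-- but the planner uses the declared constraints to pick better plans
CREATE TABLE user_table (
    user_id            bigint PRIMARY KEY,
    username           varchar(128) UNIQUE,
    registration_date  date,
    location           varchar(64),
    licence            varchar(16)
);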
EXPLAIN
you can get the execution plan to see the order of execution
see the relative cost of the steps
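Just put EXPLAIN in front of the query, e.g. with the example tables below:

-- prints the scan, join and aggregation steps with their relative costs, without running the query
EXPLAIN
SELECT u.location, count(*)
FROM prezi_table p
JOIN user_table u ON u.user_id = p.user_id
GROUP BY u.location;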
let's say we have a user_table and a prezi_table
user_table: user_id, username, registration_date, location, licence
prezi_table: prezi_id, user_id, creation_date, path_steps
number of prezis created by US users
top 10 countries with the most prezis
number of prezis created on the first day of registration
average path steps in prezi for pro/enjoy/public users
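As a worked sketch for the "first day of registration" exercise (assuming creation_date is a timestamp and registration_date a date):

-- number of prezis created on the first day of registration
SELECT count(*) AS first_day_prezis
FROM prezi_table p
JOIN user_table u ON u.user_id = p.user_id
WHERE trunc(p.creation_date) = u.registration_date;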
https://github.com/prezi/flowkeeper-core-jobs/blob/master/jobs/new-payment-events/staging-all-payment-events/init.sql
The query timeout is set to 10 minutes
if 10 minutes are not enough:
SET query_group TO "override_query_timeout";
SET statement_timeout TO 3600000;
Maximum possible value is 7200000 (= 2 hours, as statement_timeout is in milliseconds)
https://github.com/prezi/flowkeeper-core-jobs/blob/master/jobs/warehouse-user/metrics-user/init.sql
http://docs.aws.amazon.com/redshift/latest/dg/c-query-planning.html
add your username/password to your .bash_profile
http://wiki.prezi.com/index.php?title=How_to_use_Chartio
datadoc.prezi.com
https://github.com/prezi/flowkeeper/blob/master/docs/redshift_sql_recommendations.md
Style guidelines and some tips
Loading data
from S3: use the COPY command
from another Redshift table: use SELECT INTO or INSERT INTO ... SELECT
individual INSERTs are slow
http://docs.aws.amazon.com/redshift/latest/dg/r_INSERT_30.html
http://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_INTO.html
Note: if you delete a lot of data, the table should be VACUUMed after the DELETE
think about TRUNCATE instead of frequent deletes
if you automate it with Flowkeeper, then we take care of the syntax
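A minimal sketch of both loading paths plus the cleanup note (the bucket, credentials and backup table are placeholders):

-- load from S3 with COPY
COPY prezi_table
FROM 's3://my-bucket/prezi_table/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '\t' GZIP;

-- copy from another Redshift table in one statement instead of row-by-row INSERTs
INSERT INTO prezi_table_backup
SELECT * FROM prezi_table
WHERE creation_date >= '2015-01-01';

-- reclaim space and re-sort after big DELETEs
VACUUM prezi_table;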
Do you have questions?
datasupport@prezi.com
Reach out to the Data Team
Metrics
Data Services
Analytics
core datasets
datadoc.prezi.com
lots of experience with Redshift
Flowkeeper
Redshift performance
Redshift support
Redshift and Chartio user management
analytical insights
(up to 8TB, now we have 4TB)
https://github.com/prezi/flowkeeper-core-jobs/blob/master/jobs/prezi-create/prezi-creates/prezi-creates-init.sql
https://github.com/prezi/flowkeeper-user-jobs/blob/master/jobs/teams/onlinemarketing/segmentation_answers/init.sql
number of prezis created in the last 30 days by pro/public/enjoy/edu users