Send the link below via email or IMCopy
Present to your audienceStart remote presentation
- Invited audience members will follow you as you navigate and present
- People invited to a presentation do not need a Prezi account
- This link expires 10 minutes after you close the presentation
- A maximum of 30 users can follow your presentation
- Learn more about this feature in our knowledge base article
Do you really want to delete this prezi?
Neither you, nor the coeditors you shared it with will be able to recover it again.
Make your likes visible on Facebook?
Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.
Redshift best practices
Transcript of Redshift best practices
How can I learn SQL?
Optimizing query performance
How Redshift works
How to use Redshift like a pro?
How can I get access?
How can I use it?
What's in there?
Redshift docs are amazing
This talk is based on Chartio's white paper
Redshift is built on the top of postgreSQL
each team has its own schema and user group
team member can write the team's tables
everyone can read all the tables
ask the data team
Prezi's data warehouse
data to be analyzed interactively
data for Chartio
command on hadoopclient.prezi.com
or set up Workbench locally
Prezi has a 4-node cluster
for performance intensive workloads
data is distributed between the nodes
distribution has a huge effect on performance
even one slow query can block the cluster
Distributing your table
data gets distributed in a round-robin fashion
DON'T if you want to join this table
distribute by one column
same values are on the same node
use if for effectively joining big tables
the table will be copied to all nodes
use it for slowly changing table that get joined very often
can improve performance of selections, joins
recent data is queried
most frequently, specify the timestamp column as the leading column for the sort key.
If you do frequent
range filtering or equality filtering
on one column, specify that column as the sort key.
frequently join a table
, specify the join column as both the sort key and the distribution key.
Redshift tutorial is pretty straightforward:
if you use your table for several purposes....well, there will be a trad-off somewhere
determines the order of data on each node
Other tips and tricks
try to minimize the result size
filter before joining
use minimal number of columns
use primary and unique keys
Redshift doesn't enforce if
query planner uses it for optimisation
you can get the execution plan to see what is the order of execution
see the relative cost of the steps
let's say we have a
number of prezis created by US users/ top 10 countries with the most prezis
number of prezis created on the first day of registration
average path steps in prezi for pro/enjoy/public users
The query timeout is set to 10 minutes
SET query_group TO "override_query_timeout";
SET statement_timeout TO 3600000;
if 10 minutes are not enough:
Maximum possible value is 7200000 (=2 hours, as statement_timeout is in millisecond.)
add your username/password to your .bash_profile
Style guidelines and some tips
from another Redshift table
with COPY command
if you delete a lot of data, the table should be VACUUMed
think about TRUNCATE instead of frequent deletes
use SELECT INTO or INSERT INTO ... SELECT
individual insert is slow
if you automatize it with Flowkeeper, the we take care about the syntax
Do you have questions?
Reach out for Data Team
big experience with Redshift
Redshift and Chartio user management
(up to 8TB, now we have 4TB)
number of prezis created in the last 30 days by pro/public/enjoy/edu users