How to build a data warehouse that people will actually use

on 10 July 2015

Transcript of How to build a data warehouse that people will actually use

Redshift
Flowkeeper
Flowkeeper & Redshift
Challenges
Jenkins must be dummy-proof
Prezi's data warehouse
How to build a data warehouse that people will actually use
Göbölös-Szabó Julianna
julianna.gobolos-szabo@prezi.com
@gszjulcsi
2008-2011
2011-2013
2014-
Redshift in Prezi
very fast
we own the data
combined with Chartio it beats GoodData
SQL
you need to understand distributions
lack of indexing
not (that) cheap
scales well
driver is in bash, jobs are in bash
define job types to make your buddies' lives easier
e.g. redshift loader, pig job, redshift transform
Btw: it's in Go
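The slides above mention defining job types so that non-engineers only fill in parameters. A minimal sketch of that idea in Go, the language the tool is written in; the type names (`RedshiftLoad`, `RedshiftTransform`) and generated commands are illustrative assumptions, not Flowkeeper's real API:

```go
package main

import "fmt"

// Hypothetical sketch: each job type hides operational boilerplate,
// so an analyst only supplies a few parameters.
type Job interface {
	Name() string
	Command() string // what the scheduler actually runs
}

// RedshiftLoad copies a dataset from S3 into a Redshift table.
type RedshiftLoad struct {
	Table, S3Path string
}

func (j RedshiftLoad) Name() string { return "load-" + j.Table }
func (j RedshiftLoad) Command() string {
	return fmt.Sprintf("COPY %s FROM '%s'", j.Table, j.S3Path)
}

// RedshiftTransform materialises one table from a SQL statement.
type RedshiftTransform struct {
	Target, SQL string
}

func (j RedshiftTransform) Name() string { return "transform-" + j.Target }
func (j RedshiftTransform) Command() string {
	return fmt.Sprintf("CREATE TABLE %s AS %s", j.Target, j.SQL)
}

func main() {
	jobs := []Job{
		RedshiftLoad{Table: "page_views", S3Path: "s3://bucket/page_views/"},
		RedshiftTransform{Target: "daily_views", SQL: "SELECT ..."},
	}
	for _, j := range jobs {
		fmt.Printf("%s: %s\n", j.Name(), j.Command())
	}
}
```

With a closed set of job types like this, the framework can also enforce conventions (e.g. good table definitions) at the point where a job is declared.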
how to enforce good table definitions
target person: not-that-technical analysts
future: submit jobs from flowtracker
easy to kick off
nice visualisations
OLAP
connector for ZenDesk and GetSatisfaction
http://www.gooddata.com/customers/prezi
doesn't scale well
vendor lock-in
too complex, not intuitive
you must aggregate your data first
people love SQL
integrates with Chartio
dense storage nodes vs dense compute nodes
  • dense storage: cheaper, huge storage, slower computation
  • dense compute: for performance-intensive workloads, less storage, expensive
data is distributed between the nodes
distribution has a huge effect on performance
especially compared to pig
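Because Redshift lacks conventional indexes and spreads rows across nodes, the distribution and sort keys in the table definition carry most of the performance weight. A sketch of such a definition, with a hypothetical `page_views` table (names and columns are illustrative), emitted here from Go:

```go
package main

import "fmt"

// pageViewsDDL is a sketch of a Redshift table definition.
// DISTKEY controls which node each row lands on (co-locating rows
// that are joined on user_id avoids data shuffling between nodes);
// SORTKEY controls on-disk order, letting range scans on time skip
// whole blocks, which stands in for the missing indexes.
const pageViewsDDL = `
CREATE TABLE page_views (
    user_id    BIGINT,
    url        VARCHAR(2048),
    viewed_at  TIMESTAMP
)
DISTKEY (user_id)    -- co-locate rows joined on user_id
SORTKEY (viewed_at); -- prune blocks on time-range scans
`

func main() {
	fmt.Print(pageViewsDDL)
}
```

A join keyed on a column other than the DISTKEY forces Redshift to redistribute rows across nodes at query time, which is where most of the "you need to understand distributions" pain comes from.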
We need to rewrite it
Goal: make data usage smooth for non-engineers
UX researchers
analysts
product managers
git, ssh, pig
sql, working locally
Loading data to Redshift
Jenkins for jobs
flowtracker
in etl
can be used through
Workbench locally
(we have 4 of these)
(old) ETL framework:
became dysfunctional
only engineers (can) write jobs
hard to maintain
handle dependencies
make recovery super easy
make it easy to develop and operate
we started to use CI and proper deployment process
dependency graph based on input/output datasets
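The dependency-graph idea above can be sketched in Go (not Flowkeeper's actual code): jobs declare the datasets they read and write, the scheduler derives edges by matching outputs to inputs, and a topological order gives the run order, which also makes recovery easy (rerun a failed job and everything downstream of it):

```go
package main

import "fmt"

// Job declares the datasets it reads and writes; field names and
// the runOrder helper are illustrative assumptions.
type Job struct {
	Name            string
	Inputs, Outputs []string
}

// runOrder returns job names so that every producer runs before
// its consumers (Kahn's topological sort over the derived graph).
func runOrder(jobs []Job) []string {
	producer := map[string]string{} // dataset -> job that writes it
	for _, j := range jobs {
		for _, out := range j.Outputs {
			producer[out] = j.Name
		}
	}
	deps := map[string][]string{} // job -> jobs that consume its outputs
	indeg := map[string]int{}
	for _, j := range jobs {
		indeg[j.Name] += 0
		for _, in := range j.Inputs {
			if p, ok := producer[in]; ok {
				deps[p] = append(deps[p], j.Name)
				indeg[j.Name]++
			}
		}
	}
	var ready, order []string
	for _, j := range jobs {
		if indeg[j.Name] == 0 {
			ready = append(ready, j.Name)
		}
	}
	for len(ready) > 0 {
		n := ready[0]
		ready = ready[1:]
		order = append(order, n)
		for _, m := range deps[n] {
			if indeg[m]--; indeg[m] == 0 {
				ready = append(ready, m)
			}
		}
	}
	return order
}

func main() {
	jobs := []Job{
		{Name: "report", Inputs: []string{"daily_views"}},
		{Name: "load", Outputs: []string{"page_views"}},
		{Name: "aggregate", Inputs: []string{"page_views"}, Outputs: []string{"daily_views"}},
	}
	fmt.Println(runOrder(jobs)) // producers come before consumers
}
```

Deriving the graph from input/output datasets rather than from explicit job-to-job links means a job's dependencies stay correct even as other jobs are added or renamed.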
provide good tooling
https://prezi.com/hza1klnkdmn-/go-meetup-2014-09-flowkeeper/
in flowkeeper