Big data

My learning on big data

Nikhil Dhankani, 22 November 2012


Transcript of Big data

BIGDATA

How big is big data? What is big data? Can you give some examples?

Not all large datasets are big, and it is not only the volume of data that makes data BIG.

Wiki - "In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools."

Consider a large retailer:
Handles more than 1 million customer transactions every hour

Has customer profile data - location, age, gender, annual/monthly salary, criminal/fraud record, etc.

Has terabytes of previous sales information

Has data about online history - clicks, wishlists

Has past CRM data

Customer service call logs

Social data - sentiments, plans, feedback

Product-related data

Units of data:
0|1 - bit
8 bits - 1 byte
1024 bytes - 1 KB
1024 KB - 1 MB
1024 MB - 1 GB
1024 GB - 1 TB
1024 TB - 1 Petabyte
1024 PB - 1 Exabyte
1024 EB - 1 Zettabyte
1024 ZB - 1 Yottabyte

Definition

Industry - the 3 Vs: Volume, Velocity, Variety

Gartner - "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

Facebook (25+ TB/day) and Twitter (12+ TB/day) generate enormous data. Sensors, RFID and social networking sites also generate enormous amounts of data.

TBs? PBs? GBs? EBs? ZBs? YBs? CERN generated 40 TB per _ _ _

Let's revisit:
Flights generating data @ 10 TB / 30 minutes
Customer-specific promotional offers - Target
GFS, MapReduce - Apache Hadoop (HDFS, MapReduce)
BigTable - HBase
Structured/Unstructured
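As an aside, the unit ladder listed above is just repeated factors of 1024, which makes the conversions easy to check; a small Python sketch (the helper name is illustrative, not part of any library):

```python
# Each named unit in the ladder is 1024x the one before it,
# so a unit n rungs above the byte holds 1024**n bytes.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def bytes_in(unit: str) -> int:
    """Bytes in one of the named units, per the 1024x ladder."""
    return 1024 ** UNITS.index(unit)

print(bytes_in("TB"))                    # 1099511627776 bytes in a terabyte
print(bytes_in("PB") // bytes_in("TB"))  # 1024 terabytes in a petabyte
```

At Facebook's quoted 25+ TB/day, a single petabyte would fill up in roughly 41 days.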
As the stat says, 20/80 - roughly 20% structured, 80% unstructured.

Data scientist at Target, Andrew Pole, says:
“If you use a credit card or a coupon, or fill out a survey, or mail in a refund, or call the customer help line, or open an e-mail we’ve sent you or visit our Web site, we’ll record it and link it to your Guest ID.”

GPS-enabled: flights, cabs, railways, smartphones
Scheduled, fed
At rest / in motion

Examples: the Twitter stream, coordinates of the Sandy storm, 20 years of POS data, baby names since 1888.

Why?

Can we search the web as we type?

Can banks detect risk and fraud?

Can retailers target potential customers?

Can doctors predict future diseases and analyse the body faster?

Can we analyse sentiment about a product, organization, or movie?
Can hedge funds compare growth?

Hadoop

Based on the GFS and MapReduce white papers from Google.
Hadoop is a data storage and processing system.
It is scalable, fault-tolerant and distributed.
It is able to store any kind of data in its native format, inexpensively.
It runs on clusters of commodity servers, each of which has local CPU and storage.
It has two critical components: HDFS and MapReduce.

HDFS

HDFS is the storage system for a Hadoop cluster.
When data arrives at the cluster, it is broken into pieces (chunks/blocks) and distributed among the different servers participating in the cluster.
Each server stores just a small fragment of the complete data set, and each piece of data is replicated on more than one server (3+).
Single master coordinates access
No data caching
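The chunking-and-replication idea described above can be sketched in a few lines of Python. This is a toy model, not the HDFS implementation: the function names are mine, and real HDFS uses 64/128 MB blocks with rack-aware placement rather than simple round-robin.

```python
import itertools

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Break incoming data into fixed-size chunks (HDFS-style blocks)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks: list, servers: list, replicas: int = 3) -> dict:
    """Assign each block to `replicas` distinct servers, round-robin."""
    placement = {}
    ring = itertools.cycle(range(len(servers)))
    for idx, _ in enumerate(blocks):
        start = next(ring)
        placement[idx] = [servers[(start + r) % len(servers)]
                          for r in range(replicas)]
    return placement

# A toy 250-byte "file" split into 100-byte blocks across 4 servers:
blocks = split_into_blocks(b"x" * 250, block_size=100)
print(place_blocks(blocks, ["node1", "node2", "node3", "node4"]))
```

With 3 replicas, losing any single server still leaves two live copies of every block, which is what lets Hadoop treat component failure as routine.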
Familiar API and interface.

GFS assumptions:

High component failure rates
Huge number (millions) of large files (64 MB or larger)
Files are written once and, if required, appended to
Large streaming reads
High sustained throughput valued over low latency

MapReduce

Automatic parallelization & distribution
Fault tolerance
Status & Monitoring tools
Clean abstraction for programmers
Analytical jobs can be distributed, in parallel, to each of the servers storing part of the data
Each server evaluates the question against its local fragment simultaneously and reports its results back for collation into a comprehensive answer.

Map

Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs, e.g. (lineNumber, line). map() produces one or more intermediate values along with an output key from the input, e.g. (word, wordCount).

map(String input_key, String input_value) {
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        emit(w, 1);
}

Reduce

reduce(String output_key, Iterator<int> intermediate_values) {
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += v;
    emit(output_key, result);
}

After the map phase is over, all the intermediate values for a given output key are combined into a list. reduce() then combines those intermediate values into one or more final values for that same output key. map() and reduce() calls run in parallel, but the reduce phase cannot start until the map phase has completely finished.

What more to look out for?

Google Dremel - Apache Drill
Large-scale
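Before moving on: the map/reduce word-count pseudocode can be exercised as plain, single-machine Python. This sketch (simple_mapreduce and the function names are mine, not a Hadoop API) runs the job end to end, grouping intermediate values between the two phases just as described:

```python
from collections import defaultdict

def map_fn(document: str):
    """map(): emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word: str, counts: list) -> tuple:
    """reduce(): sum the list of counts for one word."""
    return (word, sum(counts))

def simple_mapreduce(documents: list) -> dict:
    # Map phase: run map_fn over every input record.
    intermediate = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            intermediate[word].append(count)
    # The grouping above is the "shuffle"; the reduce phase only
    # starts once every map call has finished.
    return dict(reduce_fn(w, counts) for w, counts in intermediate.items())

print(simple_mapreduce(["big data is big", "data at rest data in motion"]))
# 'big' -> 2, 'data' -> 3, every other word -> 1
```

In Hadoop the same two functions would run distributed: each server maps its local HDFS blocks, and reducers pull the grouped values over the network.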
+ ad-hoc querying of data
+ with radically lower latencies
+ that are especially apt for data exploration

It is possible to scan over petabytes of data in seconds, to answer ad hoc queries and, presumably, power compelling visualizations.

IBM BigSheets
EMC offerings

Why not Big Data?

Danah Boyd, known for her public commentary on technology, has raised a few concerns with Big Data:

1) Bigger Data are Not Always Better Data
quality matters more than quantity
2) Not All Data are Created Equal
data collected through surveys, interviews, observations, and experiments is different from social network data
3) What and Why are Different Questions
Nobody loves Big Data better than marketers. And nobody misinterprets Big Data better than marketers.
4) Be Careful of Your Interpretations
Just because data is accessible doesn't mean that using it is ethical.

QUESTIONS???

My thoughts - Why? What next? Why not?

- Nikhil Dhankani

References
https://vimeo.com/3584536
GFS white paper
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/ (the How? part)