CS 463 - Big Data
by A Bautista
9 December 2014


Big Data
"Throughout the 60-year history of computer science, the emphasis has been on the algorithm as the main subject of study. But some recent work in AI suggests that for many problems, it makes more sense to worry about the data and be less picky about what algorithm to apply. "

Russell & Norving. (2010) "Artificial Intelligence:
A Modern Approach." 3rd Edition. Prentice Hall
New technology solutions.
References:
https://hadoop.apache.org/
Akerkar, R. (2013) "Big Data Computing" (http://opac.library.csupomona.edu:80/record=b2759870~S4)
Lam, C. (2009) "Hadoop in Action" (http://opac.library.csupomona.edu:80/record=b1743801~S4)
Kudyba, S. (2014) "Big Data, Mining, and Analytics" (http://opac.library.csupomona.edu:80/record=b2761416~S4)
Plunkett, T. (2014) "Oracle Big Data Handbook" (http://opac.library.csupomona.edu:80/record=b2759088~S4)
Definition
Methodology & Technology
Applications
Ethical Issues
What is Big Data?
Data that, because of its size, speed, or format, cannot easily be stored, manipulated, or analyzed with traditional methods like spreadsheets, relational databases, or common statistical software.

Volume: how much data there is.
Velocity: how fast the data is processed; streaming data delivered in real time.
Variety: the many different formats the data comes in.
Why? Software lowers cost and increases capacity & performance.
Types of Data
Structured
Photographs
Sources:
Use of different devices to communicate
Demand for video and music
Internet
People are increasingly "living" online
We are generating data faster than ever!
Uses Hadoop to generate over 100 billion personalized recommendations every week.
Amazon, NetApp, and Google allow organizations of all sizes to start benefiting from the potential of Big Data processing.
Where public Big Data sets need to be utilized, running everything in the cloud also makes a lot of sense, as the data does not have to be downloaded to an organization's own system. For example, Amazon Web Services already hosts many public data sets. These include US and Japanese Census data, and many genomic and other medical and scientific Big Data repositories.

Google, Amazon, and Facebook have already demonstrated how Big Data can permit the delivery of highly personalized search results, advertising, and product recommendations.


In May 2012 the US Government announced an investment of $200 million in Big Data projects "to greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data".
What's Next?
Big Data may also help farmers to accurately forecast bad weather and crop failures. Governments may use Big Data to predict and plan for civil unrest or pandemics.
By using Big Data techniques to analyze the 12 terabytes of tweets written every day, it is already becoming possible to conduct real-time sentiment analysis to find out how the world feels about things. Such a service is already offered for free [Sentiment140.com].
In a recent report on Big Data, the McKinsey Global Institute estimated that the US healthcare sector could achieve $300 billion in efficiency and quality savings every year by leveraging Big Data, so cutting healthcare expenditures by about 8 per cent. Across Europe, they also estimate that using Big Data could save at least $149 billion in government administration costs per year. More broadly, in manufacturing firms, integrating Big Data across R&D, engineering and production may significantly reduce time to market and improve product quality.
May increase sustainability by improving traffic management in cities and permitting the smarter operation of electricity generation infrastructures.
(1) collect real-time data
(2) process data as it flows
(3) explore and visualize.
Scenarios:

1) As an ETL (Extract-Transform-Load) and Filtering Platform – Hadoop platforms can read in the raw data, apply appropriate filters and logic, and output a structured summary or refined data set.

2) As an Exploration Engine – Once the data is in the MapReduce cluster, it makes sense to use tools that analyze the data where it sits. Because the refined output stays in the Hadoop cluster, new data can be added to existing data summaries. Once the data is distilled, it can be loaded into corporate systems so users have wider access to it.

3) As an Archive – With cheap storage in a distributed cluster, lots of data can be kept "active" for continuous analysis.

Searching, log processing, recommendation systems, data warehousing, video and image analysis.

Scalable
Reliable
Fault-tolerant
Leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn's 125-million-member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as "social media feeds", may contain valuable, real-time information on LinkedIn members' opinions, activities, and mood states.
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte [Facebook, Google]
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Storing massive data sets using low-cost storage
Global Internet Traffic
Cisco Global Cloud Index
How to obtain value from a global phenomenon?
Organizations need to analyze data to make decisions for achieving greater efficiency, profits, and productivity
Semistructured
Unstructured
Complex simulations
Video
3D models
Audio
Location data
PDF
Flat files
EDI
Text
Opportunity to look at what is happening with their customers and respond automatically based on predictive analysis.
Google
Society is inundated with digital information. Big Data has been part of the business world for many years, and data is growing exponentially.
Challenge: deal with the data.
Opportunity: improve operations and reduce costs through the capture and analysis of data.
Big Data
Generating value from Big Data: store & process large sets of data.
Organizations: logistics, financial services, healthcare. Predicting customer behavior.
Customers: generate digital footprints (electric, phone, credit card, cable bills). They protect their information and prefer to decide where, when, and how to search for products and services.
How?
Tweets Globally
6,000 /sec
500,000,000 /day
200,000,000,000 /year
Books
Blog posts
Comments on news articles
Tweets
80% of data is unstructured
Enterprise Data: nicely formatted (structured) data
Siri
Yelp
Analytics
Internet-based
Banking
Insurance
Healthcare
Science & Engineering
For research: NASA detecting planets outside the solar system
Excel spreadsheet
Privacy is an important issue:
People don't want their personal information to be public
Care needs to be taken when we're working with big data, especially when personally identifiable information is present, in order to ensure privacy.
https://immersion.media.mit.edu/demo
Content
Alberto Bautista
Relational Databases
Microsoft SQL
MySQL
Oracle
XML
JSON
NoSQL Databases
ETL stands for Extract, Transform, and Load. The term developed from data warehousing, where data typically resided in one or more large storage systems or data warehouses but wasn't analyzed there. Instead, the data had to be pulled out of storage (the extract stage) and then converted to the appropriate format for analysis; especially if you're pulling data from several different sources, which may be in different formats, you have to transform all of them into a single common format so you can work on them simultaneously. Once the data is ready, you load it, as a separate step, into the analysis software.

When you're dealing with Hadoop, you don't have to be as aware of the extract, transform, load process, because it doesn't happen in quite the same way. There's less inspection of the data, and the platform doesn't force you to think about these steps deliberately. On the other hand, you miss some opportunities to better understand your data and to check for errors along the way.
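As a rough illustration of the three stages, here is a minimal sketch in Java; the file names, the "customer,amount" record format, and the log delimiter are all invented for the example.

    import java.nio.file.*;
    import java.util.*;
    import java.util.stream.*;

    // Hypothetical ETL sketch: two differently formatted sources are
    // extracted, transformed into one common format, and loaded into
    // a single output file for analysis.
    public class EtlSketch {
        public static void main(String[] args) throws Exception {
            // Extract: pull raw records out of storage
            List<String> csvRows  = Files.readAllLines(Paths.get("orders.csv"));
            List<String> logLines = Files.readAllLines(Paths.get("orders.log"));

            // Transform: convert both sources to one common format
            // (here: "customer,amount") so they can be analyzed together
            Stream<String> fromCsv = csvRows.stream().skip(1);      // drop CSV header
            Stream<String> fromLog = logLines.stream()
                .map(line -> line.replace(" | ", ","));             // unify delimiter

            // Load: write the unified records into the analysis store
            List<String> unified = Stream.concat(fromCsv, fromLog)
                .collect(Collectors.toList());
            Files.write(Paths.get("orders_unified.csv"), unified);
        }
    }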
Challenges:
building and maintaining large clusters
Example: WordCount
Count the number of times each word occurs in a set of documents.
Input: list(<String file_name, String file_content>)
NoSQL Databases: MongoDB, CouchDB

Hadoop
Distributes or "
maps
" large data sets across multiple nodes. Each of these nodes then performs processing on the data, and from this creates a summary. The summaries created on each node are then aggregated in the so-termed "
Reduce
" stage.
Processes extremely large amounts of data. Distributes the storage and processing across numerous servers (nodes), which individually solve different parts of the larger problem and then integrate them back for the final result.


Viable platform for Big Data
Collection of software applications
Open-source project from Apache
Java-based software framework
Runs on commodity hardware

Hadoop was the name for the stuffed animal that belonged to the son of one of the developers. It was a stuffed elephant, which explains the logo as well.
Hadoop Ecosystem
HDFS
MapReduce
Pig
Hive
Jaql
HBase
Cassandra
Oozie
Lucene
Avro
Mahout
Streams
Hadoop is not a single thing. It's a collection of software applications that are used to work with big data. It's a framework or platform that consists of several different modules.
Pig: a platform in Hadoop used to write MapReduce programs, the process by which you split things up and then gather back and combine the results. It uses its own language, called Pig Latin.
Hive: summarizes, queries, and analyzes the data that's in Hadoop. It uses a SQL-like language called HiveQL, and this is the component most people will use to actually work with the data. Between the Hadoop Distributed File System, MapReduce (or YARN), Pig, and Hive, you've covered most of what people use when they're using Hadoop. Other components are also available; for instance, HBase is a NoSQL (nonrelational, "not only SQL") database for Hadoop.
Processing part of Hadoop
A cluster may range from 50 to 2,000 or more servers. Hadoop processes vast quantities of data across these large, lower-cost distributed computing infrastructures by allocating partitioned data sets to numerous servers (nodes), which individually solve different parts of the larger problem and then integrate them back for the final result.
Hadoop Distributed File System (HDFS)
MapReduce
Hadoop Cluster
Big Data
Client
Program
Result
Fault-tolerant: If a function dies in its process, the framework can recover by transferring the function to another node (holding the duplicate data)
High-performance parallel/distributed data processing framework
The framework can issue the same task to multiple nodes and take the result of the node that finishes first.
Hadoop takes care of the details: file I/O, failure recovery, networking (see the driver sketch below).
Scales up
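To make the division of labor concrete, here is a minimal sketch of a job driver in Java: the application only names its mapper, reducer, and paths, and the framework handles everything listed above. It assumes the WordCount mapper and reducer classes sketched later under "Pseudo-code for map & reduce".

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal job driver: file I/O, scheduling, failure recovery, and
    // networking are all left to the framework.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);  // sketched below
            job.setReducerClass(WordCount.IntSumReducer.class);   // sketched below
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }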
Map function: performs independent record transformations. It receives a key-value pair of some generic types and outputs a list of key-value pairs: (k1, v1) -> list(k2, v2)

Reduce function: aggregates all map outputs. For every unique key k2, it receives a list of values v2 and outputs a list of key-value pairs of k3, v3: (k2, list(v2)) -> list(k3, v3)
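In the Hadoop Java API these two signatures look roughly as follows; this is a schematic sketch, with K1...V3 as placeholder type parameters and MyMapper/MyReducer as invented names.

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Schematic map function: (K1, V1) -> list(K2, V2)
    class MyMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
        @Override
        protected void map(K1 key, V1 value, Context context)
                throws IOException, InterruptedException {
            // transform one input record into zero or more pairs:
            // context.write(k2, v2);
        }
    }

    // Schematic reduce function: (K2, list(V2)) -> list(K3, V3)
    class MyReducer<K2, V2, K3, V3> extends Reducer<K2, V2, K3, V3> {
        @Override
        protected void reduce(K2 key, Iterable<V2> values, Context context)
                throws IOException, InterruptedException {
            // aggregate all values for one key into output pairs:
            // context.write(k3, v3);
        }
    }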

Framework:
scheduling tasks, rerunning failed tasks
Hadoop Cluster
The Big Picture
Divides input into smaller parts
Redistributes the parts among nodes
Data Replication
Self-healing
Scalable (linear)
Nodes are connected to each other through the software.
Hadoop Cluster
Input
map()
shuffle
reducer()
"Move code-to-data" philosophy
Data
Computation
Windows: Cygwin to enable shell scripting (www.cygwin.com)
JDK: 1.6 or higher (http://java.sun.com/javase/downloads/index.jsp)
Hadoop: (http://hadoop.apache.org/core/releases.html/)
Important (inner) classes
if (sum > 4)
word.set( itr.nextToken().toLowerCase() );
Storage: disks (HDFS). Analysis: processors (MapReduce).
[Diagram: input data divided into blocks D1, D2, D3 and distributed across Nodes 1-9]
HDFS has a master node (NameNode) and its slave component (DataNode). To communicate with the file system, we talk to the NameNode, which keeps track of the nodes where data has been allocated.
NameNode DataNode Interaction: the NameNode (master of the file system) knows on what DataNodes (slave components) the data is located.
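A minimal sketch of a client interacting with HDFS in Java; the cluster address and file paths are assumptions for illustration. The client code only ever addresses the NameNode; block transfers to and from the DataNodes happen behind the FileSystem API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // hypothetical NameNode address
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);             // talks to the NameNode
            fs.copyFromLocalFile(new Path("/tmp/input.txt"),  // local source
                                 new Path("/data/input.txt"));// HDFS destination
            System.out.println("stored: " + fs.exists(new Path("/data/input.txt")));
            fs.close();
        }
    }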
Who?
CBS Interactive
Walt Disney
Wal-mart
General Electric
Nokia
Bank of America
Watson (IBM)
eBay
There is a trade-off between convenience and privacy.
One of the most powerful applications of Big Data is Data Mining and its close cousin, Text Analytics. Data Mining uses statistical procedures to find unexpected patterns in data. Those patterns might include unexpected associations between variables, or people who cluster together in unanticipated ways.

Text Analytics is sufficiently distinct to be its own field. The goal here is to take the actual content of text data, such as tweets or customer reviews, and find meaning and pattern in the words.

Predictive Analytics tries to predict future events based on past observations. There are a few well-known popular examples. The first is in baseball, as shown in the book and the movie Moneyball, where statistical analysis is used to identify an offensive player's scoring ability, beyond the standard criteria people have used for a hundred years, such as batting averages, RBIs (runs batted in), and stolen bases. Baseball has an enormous data set because it is very easy to count the discrete events that occur, so analysts can go back and work with an extraordinarily large data set for the sport.


Companies do share limited amounts of information with third parties.
Information is also stolen from companies. Several companies have had their data stolen, including credit card information, addresses, and other important personal information.

Companies sometimes have had to give their information to courts or government regulators as part of lawsuits that could have put them out of business completely if they did not provide the information. The trick is that while this is a legal process, it is not something the users originally agreed to, so there is a violation of trust even if what's happening is technically legal.
Sun Microsystems CEO Scott McNealy (1999): "You have zero privacy anyway. Get over it."
Key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
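As a minimal sketch, a custom key type might look like this in Java; the YearKey class and its single field are invented for illustration.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical custom key: write()/readFields() make it serializable
    // by the framework; compareTo() lets the framework sort keys between
    // the map and reduce phases.
    public class YearKey implements WritableComparable<YearKey> {
        private int year;

        public YearKey() {}                      // framework needs a no-arg constructor
        public YearKey(int year) { this.year = year; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(year);                  // serialize
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            year = in.readInt();                 // deserialize
        }

        @Override
        public int compareTo(YearKey other) {
            return Integer.compare(year, other.year);  // sort order
        }
    }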
[Diagram: a client submits a program to the Hadoop cluster; copies of the data are distributed across Nodes 1-9, and the final result is returned to the client]
Moving Computation is Cheaper than Moving Data
MapReduce
JobTracker TaskTracker Interaction: the JobTracker (on the master node, alongside the NameNode) assigns tasks to TaskTrackers running on the DataNodes (DataNode 1-3).
[Diagram: map tasks on different nodes produce partial results R6, R8, and R9; the reduce step combines them into R_Final(R6, R8, R9)]
The original MapReduce engine has now been replaced by Apache Hadoop YARN, which stands for Yet Another Resource Negotiator (sometimes people just call it MapReduce, too). YARN allows a lot of things the original MapReduce couldn't do. The original MapReduce did batch processing: you had to get everything together at once, split it out at once, and wait until it was done before you got your result. YARN can do batch processing, but it can also do stream processing, where things come in as fast as possible and go out simultaneously, and it can also do graph processing, for data such as social network connections.
MapReduce Program

Input: list of (key, value) pairs
map(): (K1, V1) -> list(K2, V2)
reduce(): (K2, list(V2)) -> list(K3, V3)
Aggregate results from map phase
Data Flow

Input (one or more files) -> split -> map() -> shuffle & sort -> reduce() -> Output

map(): <k1, v1> -> list<k2, v2>
shuffle & sort: list<k2, v2> -> <k2, list(v2)>
reduce(): <k2, list(v2)> -> list<k3, v3>
WordCount data flow

File Set:
file_1: "one two foo foo"
file_2: "two foo foo foo"

Input: <String file_name, String file_content>

map() output: list( <String word, Integer 1> )
file_1 -> list( <"one", 1>, <"two", 1>, <"foo", 1>, <"foo", 1> )
file_2 -> list( <"two", 1>, <"foo", 1>, <"foo", 1>, <"foo", 1> )

Shuffle & sort output: <k, list(v)>
<"one", list(1) >
<"two", list(1, 1) >
<"foo", list(1, 1, 1, 1, 1) >

reduce() output: list<k, v>
<"one", 1>
<"two", 2>
<"foo", 5>
Pseudo-code for map & reduce
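A sketch of the two inner classes in Java, following the standard Hadoop WordCount example; the lowercasing line is the fragment shown earlier under "Important (inner) classes".

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // map(): (file offset, line of text) -> list((word, 1))
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken().toLowerCase());
                    context.write(word, one);   // emit (word, 1)
                }
            }
        }

        // reduce(): (word, list(1, 1, ...)) -> (word, total count)
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();           // add up the 1s for this word
                }
                result.set(sum);
                context.write(key, result);     // emit (word, count)
            }
        }
    }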
Disadvantages
Complex to install, configure, administer
Skilled programmers required
Advantages
Open-source
Accessible to anyone with Big Data
Recommendation Engines / Suggestions
Spotify
Amazon
Google
Yahoo
Netflix
Hulu
eBay has 9 petabytes of data in its Hadoop and Teradata clusters. With 97 million active buyers and sellers, it serves 2 billion page views and 75 billion database calls each day. Like others, eBay is racing to put in place the analytics infrastructure to (1) collect real-time data; (2) process data as it flows; (3) explore and visualize.
Example:
Automotive Industry
Improvements in Car Design for Driver Safety
Stolen data from companies

Companies share data
Government
Privacy is an important issue
Edward Snowden
Whistleblower?
Hero ?
NSA - surveillance
PRISM
Boundless Informant
MYSTIC
Google
Yahoo
Gmail
Facebook
Skype
Apple
Microsoft
AOL
Large Hadron Collider - 40 TB/sec
Big Data provides valuable services
Amazon
Netflix
Spotify
They give us just a piece of information, but exactly what we need.
National Highway Traffic Safety Administration (NHTSA) - 2013
#1 distraction: use of mobile phones for texting while driving
Solution: a mechanism that takes control from the driver... in process (5-7 years)
Testing: 4,000 data sets 2/5sec.; the analysis takes weeks
With Big Data technology:
Savings: analysis time
Identify defect patterns