Transcript of Big Data
"Throughout the 60-year history of computer science, the emphasis has been on the algorithm as the main subject of study. But some recent work in AI suggests that for many problems, it makes more sense to worry about the data and be less picky about what algorithm to apply. "
Russell & Norvig. (2010) "Artificial Intelligence: A Modern Approach." 3rd Edition. Prentice Hall
New technology solutions.
Akerkar, R. (2013) "Big Data Computing" (http://opac.library.csupomona.edu:80/record=b2759870~S4)
Lam, C. (2009) "Hadoop in Action" (http://opac.library.csupomona.edu:80/record=b1743801~S4)
Kudyba, S. (2014) "Big Data, Mining, and Analytics" (http://opac.library.csupomona.edu:80/record=b2761416~S4)
Plunkett, T. (2014) "Oracle Big Data Handbook" (http://opac.library.csupomona.edu:80/record=b2759088~S4)
Because of its size, speed, or format, it cannot be easily stored, manipulated or analyzed with traditional methods, like spreadsheets, relational databases, or common statistical software.
What is BD?
How fast that data is processed
Streaming data to be delivered in real time
How much data.
Software lowers cost and increases capacity & performance.
Types of Data
Use of different devices to communicate
Demand for video and music
People are increasingly
We are generating data faster than ever!
Uses Hadoop to generate over 100 billion personalized recommendations every week.
Amazon, NetApp and Google allow organizations of all sizes to start benefiting from the potential of Big Data processing.
Where public Big Data sets need to be utilized, running everything in the cloud also makes a lot of sense, as the data does not have to be downloaded to an organization's own system. For example, Amazon Web Services already hosts many public data sets. These include US and Japanese Census data, and many genomic and other medical and scientific Big Data repositories.
Google, Amazon and Facebook have already demonstrated how Big Data can permit the delivery of highly personalised search results, advertising, and product recommendations.
Big Data may also help farmers to accurately forecast bad weather and crop failures
$200 million in Big Data projects
"to greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data". (May 2012)
Governments may use Big Data to predict and plan for civil unrest or pandemics.
By using Big Data techniques to analyze the 12 terabytes of tweets written every day, it is already becoming possible to conduct real-time sentiment analysis to find out how the world feels about things. Such a service is already offered for free [Sentiment140.com].
In a recent report on Big Data, the McKinsey Global Institute estimated that the US healthcare sector could achieve $300 billion in efficiency and quality savings every year by leveraging Big Data, so cutting healthcare expenditures by about 8 per cent. Across Europe, they also estimate that using Big Data could save at least $149 billion in government administration costs per year. More broadly, in manufacturing firms, integrating Big Data across R&D, engineering and production may significantly reduce time to market and improve product quality.
May increase sustainability by improving traffic management in cities and permitting the smarter operation of electricity generation infrastructures.
(1) collect real-time data
(2) process data as it flows
(3) explore and visualize.
As an ETL (Extract, Transform, Load) and Filtering Platform
– Hadoop platforms can read in the raw data, apply appropriate filters and logic, and output a structured summary or refined data set.
As an exploration engine
– Once the data is in the MapReduce cluster, using tools to analyze data where it sits makes sense. As the refined output is in a Hadoop cluster, new data can be added to existing data summaries. Once the data is distilled, it can be loaded into corporate systems so users have wider access to it.
As an Archive
– With cheap storage in a distributed cluster, lots of data can be kept “active” for continuous analysis.
Searching, log processing, recommendation systems, data warehousing, video and image analysis.
Leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn’s 125 million member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as “social media feeds”, may contain valuable, real-time information on LinkedIn members' opinions, activities, and mood states.
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte [Facebook, Google]
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
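The ladder above can be checked with a few lines of arithmetic. This is only a sketch: the helper names `to_bytes` and `convert` are made up for illustration, it assumes the decimal convention (1 kilobyte = 1000 bytes), and "brontobyte" and "geopbyte" are informal names taken from the slide rather than standardized prefixes.

```python
# Units from the slide, each 1000 times the previous one.
# Assumption: 1 kilobyte = 1000 bytes (decimal convention).
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte",
         "exabyte", "zettabyte", "yottabyte", "brontobyte", "geopbyte"]


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes."""
    exponent = UNITS.index(unit.lower()) + 1  # kilobyte -> 1000**1
    return amount * 1000 ** exponent


def convert(amount, from_unit, to_unit):
    """Convert between two units of the ladder."""
    steps = UNITS.index(from_unit.lower()) - UNITS.index(to_unit.lower())
    return amount * 1000 ** steps


one_terabyte = to_bytes(1, "terabyte")
tweets_per_day_gb = convert(12, "terabyte", "gigabyte")
```

For example, the 12 terabytes of daily tweets mentioned earlier correspond to 12,000 gigabytes.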
Storing massive data sets using low-cost storage
Global Internet Traffic
Cisco Global Cloud Center
How to obtain value from a global phenomenon?
Organizations need to analyze data to make decisions for achieving greater efficiency, profits, and productivity
Opportunity to look at what is happening with their customers and respond automatically based on predictive analysis.
Society inundated with digital information
deal with data
Big Data has been part of the business world for many years.
Data growing exponentially
improve operations and reduce costs through the capture and analysis of data
from Big Data:
Store & process large sets of data.
Generate digital footprints (electric, phone, credit card, cable bills.)
Protective of their information. They prefer to decide where, when, and how to search for products and services.
Predicting customer behavior
Comments on news articles
Nicely formatted data
Science & Engineering
For Research - NASA, detect planets outside solar system
Privacy is an important issue:
People don't want their personal information to be public
Care needs to be taken when we're working with big data, especially when personally identifiable information is present, in order to ensure privacy.
ETL stands for extract, transform and load. This is a term that developed from data warehousing, where data typically resided in one or more large storage systems or data warehouses, but wasn't analyzed there. Instead, the data had to be pulled out of storage; that's the extract stage. Then it had to be converted to the appropriate format for analysis, especially if you're pulling data from several different sources that may be in different formats; that's the transform stage. Once it's ready, you have to load it, as a separate step, into the analysis software.
You have to transform all of them into a single common format so you can do all your work on them simultaneously.
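To make the three stages concrete, here is a minimal sketch of an ETL pass in Python. Everything in it is hypothetical: the two inline sources (one CSV, one JSON), the field names, and the in-memory SQLite table that stands in for "the analysis software".

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw exports from two source systems in different formats.
CSV_SOURCE = "customer,amount\nalice,10.50\nbob,3.25\n"
JSON_SOURCE = '[{"name": "carol", "total_cents": 799}]'


def extract():
    """Extract: pull raw records out of each source."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transform: convert both sources into one (customer, amount) format."""
    common = [(r["customer"], float(r["amount"])) for r in csv_rows]
    common += [(r["name"], r["total_cents"] / 100) for r in json_rows]
    return common


def load(rows):
    """Load: put the unified rows into a table that can be queried."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db


db = load(transform(*extract()))
row_count = db.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

The transform step is where the differing field names and units (dollars versus cents) are reconciled, which is also where many data errors tend to surface.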
When you're dealing with Hadoop, you don't have to be so aware of the extract, transform, load process, because it doesn't really happen in quite the same way. There's not so much inspection of the data, and Hadoop doesn't force you to think about it so deliberately. On the other hand, you miss some opportunities to better understand your data and to check for errors along the way.
built and maintain large clusters
Count the number of times each word occurs in a set of documents.
list(<String file_name, String file_content>)
Distributes or "maps" large data sets across multiple nodes. Each of these nodes then performs processing on the data, and from this creates a summary. The summaries created on each node are then aggregated in the so-termed "reduce" step.
Processes extremely large amounts of data. Distributes the storage and processing across numerous servers (nodes), which individually solve different parts of the larger problem and then integrate them back for the final result.
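The split, summarize, and aggregate pattern described here can be sketched on a single machine. This is an illustration only: threads stand in for cluster nodes, and the function names (`partition`, `summarize`, `aggregate`) are invented for the example, not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_nodes):
    """Split the data into num_nodes roughly equal parts."""
    return [data[i::num_nodes] for i in range(num_nodes)]


def summarize(part):
    """Work done independently on one node: a partial summary."""
    return {"count": len(part), "total": sum(part)}


def aggregate(summaries):
    """Combine the per-node summaries into the final result."""
    return {
        "count": sum(s["count"] for s in summaries),
        "total": sum(s["total"] for s in summaries),
    }


data = list(range(1, 101))              # stand-in for a large data set
parts = partition(data, num_nodes=4)
with ThreadPoolExecutor(max_workers=4) as pool:   # threads stand in for nodes
    summaries = list(pool.map(summarize, parts))
result = aggregate(summaries)
```

Each part is processed without reference to the others, which is what lets the work spread across many servers.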
Viable platform for Big Data
Collection of software applications
Open-source project from Apache
Java-based software framework
Runs on commodity hardware
Hadoop was the name for the stuffed animal that belonged to the son of one of the developers. It was a stuffed elephant, which explains the logo as well.
Hadoop is not a single thing. It's a collection of software applications that are used to work with big data. It's a framework or platform that consists of several different modules.
A platform in Hadoop that's used to write MapReduce programs, the process by which you split things up and then gather back the results and combine them. It uses its own language. It's called the Pig Latin Programming Language.
Summarizes, queries, and analyzes the data that's in Hadoop. It uses a SQL-like language called HiveQL, and this is the one that most people are going to use to actually work with the data. So between the Hadoop Distributed File System, MapReduce or YARN, and Pig and Hive, you've covered most of what people use when they're using Hadoop. On the other hand, there are other components available. For instance, HBase is a NoSQL database, that is, a nonrelational or "not only SQL" database for Hadoop.
Processing part of Hadoop
The number of servers in a cluster ranges from 50 to 2,000 or more. Processes vast quantities of data across large, lower-cost distributed computing infrastructures by allocating partitioned data sets to numerous servers (nodes), which individually solve different parts of the larger problem and then integrate them back for the final result.
Hadoop Distributed File System (HDFS)
Fault-tolerant: If a function dies in its process, the framework can recover by transferring the function to another node (holding the duplicate data)
High-performance parallel/distributed data processing framework
The framework can issue the same task to multiple nodes and take the result of the node that finishes first.
Hadoop takes care of details: file I/O, failure recovery, networking.
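The idea of issuing the same task to several nodes and keeping whichever result arrives first can be sketched as follows. This is not Hadoop code: threads simulate nodes, the delays are made up, and `speculative_run` is a hypothetical helper name.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task(node_delay, value):
    """Simulate one node computing value squared after some delay."""
    time.sleep(node_delay)
    return value * value


def speculative_run(value, node_delays):
    """Issue the same task to every node and keep the first result."""
    pool = ThreadPoolExecutor(max_workers=len(node_delays))
    futures = [pool.submit(run_task, d, value) for d in node_delays]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)   # do not block on the slower copies
    return next(iter(done)).result()


answer = speculative_run(7, node_delays=[0.3, 0.01, 0.2])
```

Because every copy computes the same thing, it does not matter which node wins; the duplicate work is the price paid for not waiting on a slow node.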
Map function: performs independent record transformations. It receives a (key, value) pair of some generic types and outputs a list of (key, value) pairs: (k1, v1) -> list(k2, v2)
Reduce function: aggregates all map outputs.
For every unique key k2, it receives a list of values v2.
It outputs a list of (key, value) pairs: (k2, list(v2)) -> list(k3, v3)
scheduling tasks, rerunning failed tasks
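Rerunning a failed task on another node that holds a duplicate of the data can be sketched like this. The node records, the `NodeFailure` exception, and `run_with_recovery` are all hypothetical names for illustration; a real framework detects failures and reschedules in far more elaborate ways.

```python
class NodeFailure(Exception):
    """Raised when a simulated node dies while running a task."""


def run_on_node(node, task, data):
    """Run the task on a node; nodes marked as not alive fail."""
    if not node["alive"]:
        raise NodeFailure(node["name"])
    return task(data)


def run_with_recovery(task, data, replicas):
    """Try each node holding a copy of the data until one succeeds."""
    attempts = []
    for node in replicas:
        attempts.append(node["name"])
        try:
            return run_on_node(node, task, data), attempts
        except NodeFailure:
            continue  # reschedule on the next replica
    raise RuntimeError("all replicas failed")


replicas = [
    {"name": "node-a", "alive": False},  # this node has died
    {"name": "node-b", "alive": True},   # holds a duplicate of the data
]
result, attempts = run_with_recovery(sum, [1, 2, 3], replicas)
```

Recovery only works because the data was stored on more than one node in the first place.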
The Big Picture
Divides input into smaller parts
Redistributes the parts among nodes
nodes are connected through the software to each other.
"Move code-to-data" philosophy
- cygwin to enable shell scripting (www.cygwin.com)
- Java 1.6 or higher (http://java.sun.com/javase/downloads/index.jsp)
Important (inner) classes
if (sum > 4)
word.set( itr.nextToken().toLowerCase() );
It has a master node (NameNode) and its slave component (DataNode)
If we want to communicate to the file system, we talk to the NameNode.
The NameNode keeps track of the nodes where data has been allocated, so it knows on which nodes the data is located.
Master of File System
Knows on what DataNodes the data is located
Slave component of File System
NameNode DataNode Interaction
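The division of labor between the two components can be modeled in a few lines. This is a toy sketch, not the HDFS API: the class and method names, the 4-character block size, and the round-robin placement with two copies per block are all assumptions chosen to keep the example small. For brevity, `read` lives on the toy NameNode, whereas a real client asks the NameNode for locations and then fetches blocks from the DataNodes itself.

```python
class DataNode:
    """Slave component: stores the actual blocks."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Master: stores no file data, only where each block lives."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # file name -> list of (block_id, [node names])

    def write(self, filename, content, block_size=4):
        chunks = [content[i:i + block_size]
                  for i in range(0, len(content), block_size)]
        self.block_map[filename] = []
        for n, chunk in enumerate(chunks):
            block_id = f"{filename}#{n}"
            # Place copies on consecutive nodes, round-robin.
            targets = [self.datanodes[(n + r) % len(self.datanodes)]
                       for r in range(self.replication)]
            for node in targets:
                node.blocks[block_id] = chunk
            self.block_map[filename].append(
                (block_id, [t.name for t in targets]))

    def locate(self, filename):
        """Answer the question: on which DataNodes is this file's data?"""
        return self.block_map[filename]

    def read(self, filename):
        by_name = {d.name: d for d in self.datanodes}
        return "".join(by_name[nodes[0]].blocks[block_id]
                       for block_id, nodes in self.block_map[filename])


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode(nodes)
namenode.write("log.txt", "one two foo")
locations = namenode.locate("log.txt")
```

Here `locations` lists, for each block of the file, the DataNodes that hold a copy.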
Bank of America
There is trade-off between convenience and privacy.
The most powerful applications of big data are in Data Mining and its close cousin, Text Analytics. Data Mining uses statistical procedures to find unexpected patterns in data. Those patterns might include unexpected associations between variables or people who cluster together in unanticipated ways.
Text Analytics is sufficiently distinct to be its own field. The goal here is to take the actual content of text data, such as tweets or customer reviews, and find meaning and pattern in the words.
Predictive Analytics tries to predict future events based on past observations. In the popular world, there are a few well-known examples of predictive analytics. The first is in baseball, as shown in the book and the movie Moneyball, where statistical analysis is used to help identify an offensive player's scoring ability. The standard criteria that people have used in baseball for a hundred years are things like batting averages, RBIs (runs batted in), and stolen bases. Baseball has an enormous data set, because it's very easy to count discrete events that occur, so you're able to go back and work with an extraordinarily large data set for sports.
Companies do share limited amounts of information with third parties.
Information is stolen from companies. Several companies have had their data stolen, including credit card information, addresses, and other important personal information.
Companies have sometimes had to give their information to courts or government regulators as part of lawsuits that could have put them out of business completely if they did not provide the information. The trick with that one is that, while it is a legal process, it is not something that the users originally agreed to, and so there is a violation of trust even if what's happening is technically legal.
Sun Microsystems CEO
"You have zero privacy anyway.
Get over it."
key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Moving Computation is Cheaper
than Moving Data
JobTracker TaskTracker Interaction
R_Final (R6, R8, R9)
Now, the original MapReduce program has been replaced by Apache Hadoop YARN, which stands for Yet Another Resource Negotiator. Sometimes people just call it MapReduce, too. YARN allows a lot of things that the original MapReduce couldn't do. The original MapReduce did batch processing, which meant you had to get everything together at once, split it out at once, and wait until it was done before you got your result. YARN can do batch processing, but it can also do stream processing, which means things are coming in as fast as possible and going out simultaneously, and it can do graph processing, which covers data such as social network connections. That's a special kind of data.
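The difference between batch and stream processing can be shown with a small sketch. It is plain Python, not YARN code, and the function names are invented for the example: the batch version needs the whole input before it produces anything, while the stream version emits an updated result as each record arrives.

```python
def batch_total(records):
    """Batch: wait for the whole input, then produce one result."""
    collected = list(records)   # everything must be gathered first
    return sum(collected)


def stream_totals(records):
    """Stream: emit an updated result as each record arrives."""
    running = 0
    for value in records:
        running += value
        yield running


incoming = [5, 3, 8]
batch_result = batch_total(incoming)
stream_results = list(stream_totals(incoming))
```

Both end at the same total; the streaming version simply makes intermediate answers available along the way.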
<k1, v1> list(<k2, v2>)
<k2, list(v2)> list(<k3, v3>)
Input: list of (key, value) pairs
Map: (K1, V1) -> list(K2, V2)
Reduce: (K2, list(V2)) -> list(K3, V3)
Aggregate results from map phase
<k2, list(v2) >
list( <String word, Integer 1> )
<String file_name, String file_content>
one two foo foo
two foo foo foo
<k, list(v) >
<"two", list(1, 1) >
<"foo", list(1, 1, 1, 1, 1) >
<"two", 2 >
<"foo", 5 >
Pseudo-code for map & reduce
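The word count walk-through above can be reproduced with a short single-machine sketch. It is written in Python rather than against Hadoop's Java API, so `map_fn`, `reduce_fn`, and the `map_reduce` driver are illustrative names only, and the file names `doc1.txt` and `doc2.txt` are made up. The two document contents are the ones from the example.

```python
from collections import defaultdict


def map_fn(file_name, file_content):
    """<file_name, file_content> -> list(<word, 1>)"""
    return [(word.lower(), 1) for word in file_content.split()]


def reduce_fn(word, counts):
    """<word, list(1, 1, ...)> -> list(<word, total>)"""
    return [(word, sum(counts))]


def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record independently.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle: group all values by key, giving <k2, list(v2)>.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: aggregate each group.
    output = []
    for key, values in grouped.items():
        output.extend(reduce_fn(key, values))
    return grouped, output


documents = [("doc1.txt", "one two foo foo"),
             ("doc2.txt", "two foo foo foo")]
grouped, counts = map_reduce(documents, map_fn, reduce_fn)
```

In this sketch the grouping step runs in one loop; on a cluster, the framework performs that grouping across nodes between the map and reduce phases.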
Complex to install, configure, administer
Anyone with Big Data
Recommendation Engines / Suggestions
Has 9 petabytes of data in its Hadoop and Teradata clusters. With 97 million active buyers and sellers, it has 2 billion page views and 75 billion database calls each day. eBay, like others, is racing to put in the analytics infrastructure to (1) collect real-time data; (2) process data as it flows; (3) explore and visualize.
Improvements in Car Design for Driver Safety
Stolen data from companies
Companies share data
Privacy is an important issue
NSA - surveillance
Large Hadron Collider - 40 TB/sec
Big Data Provides valuable services
Give a piece of info but exactly what we need
list( <k, v> )
National Highway Traffic Safety Administration (NHTSA) - 2013
Use of mobile phones for texting while driving
: Mechanism takes control from the driver...
In process... (5-7 years)
Testing: 4,000 data sets 2/5sec.
The analysis takes weeks
With Big Data technology:
Savings: analysis time
Identify defect patterns