Big Data


Duygu Sinanc

on 13 January 2014


Transcript of Big Data

Big Data: A Review

Big data
is a term for massive data sets with large, varied, and complex structures that are difficult to store, analyze, and visualize for further processing or results.

The process of researching massive amounts of data to reveal hidden patterns and secret correlations is known as
big data analytics

This information is useful to companies and organizations, helping them gain richer, deeper insights and an advantage over the competition.
Source of Data
Data are generated from online transactions, emails, videos, audio, images, click streams, logs, posts, search queries, health records, social networking interactions, scientific data, sensors, and mobile phones and their applications.
Big Data Components

Volume,
or the size of data, which is now larger than terabytes and petabytes.

Velocity,
which is required not only for big data, but also for all processes.

Variety:
big data comes from a great variety of sources.

Value:
after big data is produced and processed, it should create added value for the organization.
Benefits & Barriers
Benefits of big data include:
  • better targeted marketing
  • more direct business insights
  • customer-based segmentation
  • recognition of sales and market opportunities
  • automated decision making
  • definition of customer behaviors
  • greater return on investment
  • quantification of risks and market trends
  • understanding of business change
  • better planning and forecasting
  • identification of consumer behavior from click streams
  • increased production yield

Potential barriers to implementing big data analytics include:
  • inexperienced staff and difficulty hiring big data experts
  • cost
  • lack of business sponsorship
  • difficulty of designing analytic systems
  • lack of analytics support in current database software
  • scalability problems
  • inability to make big data usable for end users
  • data loading that is not fast enough in current database software
  • lack of a compelling business case
Big Data Samples
Healthcare:
clinical decision support systems, individual analytics applied to patient profiles, performance-based pricing for personnel, analysis of disease patterns to improve public health
Manufacturing:
improved demand forecasting, supply chain planning, sales support, improved production operations, and web-search-based applications
Personal location data:
smart routing, geo targeted advertising, emergency response, urban planning and new business models
Public sector:
creating transparency through accessible related data, discovering needs, improving performance, decision making with automated systems to decrease risks, customizing actions for suitable products and services
Retail:
in-store behavior analysis, variety and price optimization, product placement design, improved performance, labor input optimization, distribution and logistics optimization, web-based markets
Social network:
understanding user intelligence for more targeted advertising, marketing campaigns, capacity planning, customer behavior, buying patterns and sentiment analytics
Big Data Methods (MapReduce)
MapReduce can be divided into two stages:

Map Step:
The master node chops the data into many smaller subproblems. A worker node processes a subset of these smaller problems under the control of the JobTracker node and stores the result in its local file system, where a reducer is able to access it.

Reduce Step:
This step analyzes and merges input data from map steps. There can be multiple reduce tasks to parallelize aggregation, and these tasks are executed on worker nodes under control of the JobTracker.
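The two stages above can be sketched with a minimal, single-process word-count simulation in Python. This is an illustrative sketch of the programming model, not Hadoop's actual API; the input splits are invented for the example.

```python
from collections import defaultdict

def map_step(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def reduce_step(pairs):
    # Reduce: merge the mapped pairs, aggregating counts per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each "worker" maps one split; the reducer merges all intermediate pairs.
splits = ["big data is big", "data is everywhere"]
intermediate = [pair for doc in splits for pair in map_step(doc)]
print(reduce_step(intermediate))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the map calls run on different worker nodes and the framework shuffles intermediate pairs to the reducers; the logic per stage is the same.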
Big Data Methods (Hadoop)
Hadoop is a Java-based framework and a heterogeneous open source platform.
It is not a replacement for a database, a warehouse, or an ETL (Extract, Transform, Load) strategy.
Hadoop includes a distributed file system, analytics and data storage platforms, and a layer that manages parallel computation, workflow, and configuration administration.
HDFS (Hadoop Distributed File System) runs across the nodes in a Hadoop cluster and connects the file systems on many input and output data nodes into one big file system.
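As a rough illustration of how a distributed file system spreads one big file over many data nodes, here is a toy Python sketch. The block size, node names, and replication factor are illustrative assumptions for the example, not real HDFS defaults or its placement policy.

```python
def place_blocks(file_size, block_size, nodes, replication=2):
    # Split a file into fixed-size blocks and assign each block
    # (plus replicas) to data nodes round-robin.
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

# A 500-unit file with 128-unit blocks spread over three data nodes.
print(place_blocks(500, 128, ["node1", "node2", "node3"]))
```

The client then sees one logical file, while each block lives on several nodes so that reads can be parallelized and a node failure does not lose data.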
Big Data Methods (HPCC)
HPCC (High Performance Computing Cluster) Systems
is a distributed open source computing platform that provides big data workflow management services. Unlike Hadoop, HPCC's data model is defined by the user. Furthermore, the HPCC platform does not require third-party tools such as GreenPlum, Cassandra, an RDBMS, Oozie, etc.
MapReduce is a programming framework for distributed computing, created by Google, that uses the divide-and-conquer method to break complex big data problems down into small units of work and process them in parallel.
Hadoop was created under the inspiration of BigTable (Google's data storage system), the Google File System, and MapReduce.
HPCC clusters can be run as Thor or Roxie; Hadoop clusters perform MapReduce processing.
In HPCC environments, ECL is the primary programming language, whereas Hadoop MapReduce processes are based on Java.
The HPCC platform builds multikey and multivariate indexes on its distributed file system, while Hadoop's HBase provides a column-oriented database.
On the same 400-node hardware configuration, HPCC finished in 6 minutes 27 seconds while Hadoop took 25 minutes 28 seconds, showing that HPCC was faster than Hadoop in this comparison.
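The quoted benchmark times imply a speedup factor that can be checked directly:

```python
hpcc_seconds = 6 * 60 + 27      # 6 min 27 s = 387 s
hadoop_seconds = 25 * 60 + 28   # 25 min 28 s = 1528 s
speedup = hadoop_seconds / hpcc_seconds
print(f"HPCC was {speedup:.2f}x faster on this benchmark")  # 3.95x
```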
Comparison Between HPCC Systems Platform and Hadoop Architecture
Knowledge Discovery Methods from Big Data
Knowledge discovery involves a variety of analysis methods, such as distributed programming, pattern recognition, data mining, natural language processing, sentiment analysis, statistics, visual analysis, and human-computer interaction. The architecture must therefore support various methods and analysis techniques.
A comprehensive knowledge discovery architecture must support data preparation and batch analytics that properly handle errors, missing values, and unusable formats, and must process both structured and semi-structured data.
It is essential to make results accessible and easy to use. For this reason, open source tools, popular standards, web-based architectures, and publicly available results are used to address this issue.
Thank you for your patience

Questions & Answers
Advantages of Using Big Data for Security*
• Low barrier to experimentation

• Security is often about detecting anomalies, so you need a full-spectrum view: enough data to know what constitutes normal and what constitutes abnormal.

• The goal with many information security solutions is to translate “back office intelligence” into “customer facing protection”.

• To make the most accurate decisions, we need to take advantage of all the intelligence available to us. Big data techniques can be used to extract the most value from this wealth of information.

• Big data techniques are also useful in doing more broad visualization of security-related metrics. Having such a big picture understanding can help identify root causes to problems. In contrast, many traditional approaches only address symptoms rather than causes.

• Big data techniques can lead to entirely new sets of security capabilities and there are a wealth of new opportunities waiting to be uncovered.
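The anomaly-detection point above can be made concrete with a toy z-score check: learn a baseline from observed values and flag anything far from it. The threshold and the login counts are invented for the example; real systems use far richer baselines.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    # Flag values more than `threshold` standard deviations from the mean.
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical daily login counts; the final spike stands out from the baseline.
logins = [101, 98, 103, 99, 102, 100, 97, 500]
print(find_anomalies(logins))  # [500]
```

This also shows why "enough data" matters: with only a handful of samples, the baseline mean and deviation are unreliable, and normal behavior cannot be distinguished from abnormal.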
Disadvantages of Using Big Data for Security*
• Analysis tools are still very immature

• Highly skilled analysts are hard to find

• While there has been a rapid proliferation of big data technologies out there, not all of them are well baked enough to be used in production environments.

• Security decision-making needs to be rapid and that does not always align with the batch-oriented processing of large data sets.

• There are no one-size fits all big data technologies. You have to understand both the problem you are trying to solve and the technology you are thinking of leveraging to solve it.

• Big data techniques are powerful, but not every security-related problem requires them, nor can they magically solve every problem that comes up. Instead, it’s important to apply domain expertise and common sense first.

• Before focusing on big data, focus on good data. Many people try to apply sophisticated data mining techniques, but on data that might be poorly collected.

Examination of the studies in literature
Comparison of methods
Knowledge discovery methods from big data
Consideration of big data privacy and security
Aim of This Study
Source of Data
From the dawn of civilisation through to 2003, humans produced 5 exabytes of data.
Now we produce 5 exabytes of data every 2 days.

In the first quarter of 2013:
2.7 billion people – 40% of the world’s population – are online*
6.8 billion mobile subscriptions*
Facebook : 1.1 billion monthly active users, 751 million mobile users**
Twitter : 288 million monthly active users, 500 million registered accounts**
YouTube : 1 billion unique monthly visitors, 6 billion hours of videos are watched every month**
Google+ : 359 million monthly active users**
LinkedIn : 200 million users, 2 new users join it every second**
Source of Data
Big Data Components
Benefits & Barriers
Big Data Samples
Big Data Methods
Knowledge Discovery Methods from Big Data
Privacy & Security Issues
Advantages of Using Big Data for Security
Disadvantages of Using Big Data for Security
Automated tools are needed that collect diverse data types and normalize them, so the data can be mined for threats such as evidence of malware, anomalies, or phishing.

Security procedures are needed to capture and analyze network traffic such as metadata, packet captures, flows, and log information.

Analytics engines must be able to process massive volumes of fast-changing data in real time.

Active controls are needed, such as additional user authentication, blocking data transfers, or simplifying analysts' decision making.

N-tier infrastructures are needed that scale across vectors and have the ability to process large and complex searches and queries.
Because the size and scope of data vary with every source, data are not easily manageable at the lowest level. Organizations may or may not agree on what constitutes data privacy.
There are no analysis tools for privacy. (at least not in Turkey)

Massive aggregation of data resources requires securing components such as execution cores, disk controllers, memory, and network interfaces.

Keeping data in one place makes it a target for attackers seeking to sabotage the organization, so big data stores must be properly controlled.

Records must be archived and protected according to standards and regulations.

A lot of data, such as personal information, skills or weaknesses, psychological profiles, health status, and political opinions, may be obtained and sold without authorization.
Organizations should protect their investments by choosing security products that use agile, analytics-based technologies rather than static equipment.

Standardized views of indicators of compromise should be created in machine-readable form so they can be shared at scale by trusted sources.

To ensure authentication, a cryptographically secure communication framework has to be implemented.
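One minimal building block for such a framework is a keyed message authentication code. This sketch uses Python's standard-library HMAC support; the key and messages are invented for the example, and a real deployment would manage keys through a proper key-management system.

```python
import hmac
import hashlib

SECRET_KEY = b"example-shared-secret"  # illustrative only; never hard-code real keys

def sign(message: bytes) -> str:
    # Produce an HMAC-SHA256 tag for the message.
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def verify(message: bytes, tag: str) -> bool:
    # Constant-time comparison resists timing attacks.
    return hmac.compare_digest(sign(message), tag)

tag = sign(b"transfer-record-42")
print(verify(b"transfer-record-42", tag))  # True: authentic message
print(verify(b"tampered-record", tag))     # False: tampered message
```

Only a party holding the shared secret can produce a valid tag, so a verified tag authenticates both the sender and the integrity of the message.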

Controls should follow the principle of least privilege, especially for access rights, except for administrators who need physical access to the data.

When organizations categorize knowledge, they can identify data that has little value and no need to be retained, so that it is no longer available for theft.
The three main HPCC components are :

Thor is a massively parallel ETL engine that enables data integration at scale and provides batch-oriented data manipulation.

Roxie is a high-throughput, low-latency, ultra-fast structured query and response engine that provides efficient multi-user retrieval of data.

ECL (Enterprise Control Language) is an easy-to-use programming language that automatically distributes the workload between nodes; it has an extensible machine learning library and is optimized for big data operations and query transactions.