Big Data

Priyanka Modi

on 18 November 2013

Transcript of Big Data

Introduction to Big Data
Big Data
Large corpuses of data coming together
Sensors collecting information through web logs, security flows, and monitoring information.
Need of Big Data
Data is being generated on the order of terabytes and petabytes.
Data from sensors, web logs, imagery, and device streams is growing.
Dimensions of Big Data
Big Data spans six dimensions:
Classification of Big Data
On the basis of nature Big Data can be classified as follows:
Unstructured Data
Semi-Structured Data
Structured Data
Data at rest
Importance of Big Data
Big Data plays an important role in various domains, such as:

Decision Making
Business Intelligence
Social Development
Sentiment Analysis

Data in motion
Data in many forms
Data in doubt
Decision Making
Data in many ways
Intermingling of data
Studies have shown that businesses that adopted Big Data strategies early and enabled data-driven decision making achieved 5% to 6% greater productivity.

This has helped many sectors, such as telecom, retail, and manufacturing.

Business Intelligence
Unstructured Data
No identifiable structure
Consists of mainly loosely structured data
This kind of data is inconsistent and always unique.
Examples: media files, Word files, PDF files, PowerPoint presentations
In Business Intelligence, data is analyzed for many purposes, such as system log analytics and social media analytics for risk assessment, customer relations, brand management, etc.

Big Data analytics enables real-time analysis of data, which supports Business Intelligence tools.
Semi-Structured Data
No fixed schema
Also known as 'self-describing' data
Contains information about the schema of the data contained in it.
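As a small illustration of 'self-describing' data, the sketch below parses a JSON record in Python. The record and its field names are invented for this example; the point is only that the field names travel with the values, so no external schema is needed to read them.

```python
import json

# Hypothetical record: the keys describe the values they carry.
raw = '{"user": "alice", "tags": ["hadoop", "hdfs"], "visits": 3}'

record = json.loads(raw)

# Recover a rough schema (field name -> value type) from the data itself.
schema = {field: type(value).__name__ for field, value in record.items()}
print(schema)
```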
Social Development
Structured Data
Industries such as healthcare, government agencies such as NSSO (through their surveys), and NGOs collect data from a wide mix of people. This information is helpful in studying the general social and economic condition of people and thus helps in planning developmental projects.
This type of data is organized into a relational schema and can be analyzed with the help of simple queries.
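A minimal sketch of what "analyzed with simple queries" can look like, using Python's built-in sqlite3 module. The table name, columns, and rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120), ("south", 80), ("north", 50)],
)

# Because the data has a fixed schema, one simple query aggregates it.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)
conn.close()
```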
Heterogeneity and Incompleteness
Challenges faced during Big Data analysis
The analysis of Big Data involves multiple distinct phases, as shown in the figure below, each of which introduces challenges. The top challenges cited are the rate of data growth and the cost and effort required to contain or store it. The various challenges faced are:
Unlike humans, computers cannot understand heterogeneous data. Machine analysis algorithms expect homogeneous data. Therefore, data must be structured before it is analyzed.
To cope with this challenge, Google applies MapReduce to the complex heterogeneous data it gathers from the internet.
This refers to the sheer volume of data being accumulated (terabytes).
Working with this volume of data requires distributing parts of the problem to multiple machines to handle in parallel.
This is better described as the rate at which data is acquired and processed, relative to the volume of data. Consider a fraudulent credit card transaction: ideally, it should be stopped before the transaction takes place at all. To handle such a scenario, the right result is needed at the right time. Hence, velocity is a major challenge.
Managing the privacy of data is a major concern for all kinds of organizations. It is an important part of data analysis to protect the privacy of people.

A lack of security may allow attackers to steal sensitive information assets (intellectual property, credit card numbers, customer databases), commit fraud, or otherwise damage the enterprise. Many Big Data security solutions have emerged in the market, such as Gazzang and Ncrypt™.
Access and Sharing
This refers to the need for proper access rights. A large amount of data is closely held by corporations and is not publicly accessible because a culture of secrecy exists.
Human Collaboration
Human input has great value at all stages of the analysis pipeline. Despite great advances in computational analysis and Big Data, the importance of human collaboration has not diminished.
Technology Outlook
This section describes the technology selected for storing, managing, and analyzing Big Data.
Hadoop is an open-source framework that allows distributed storage and processing of large data sets over clusters of computers. It has become the de facto standard for storing, processing, and analyzing hundreds of terabytes, and even petabytes, of data. The Hadoop library itself is designed to detect and handle failures at the application layer, enabling it to deliver a highly available service.
Hadoop Distributed File System (HDFS)
HDFS is the file system used by Hadoop. It is well suited to storing terabytes and petabytes of data.
HDFS lets you connect nodes contained within clusters over which data files are distributed. One can then access and store the data files as one seamless file system.
Data in a Hadoop cluster is broken down into smaller pieces called blocks and distributed throughout the cluster. The default block size is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x onward).
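To make the block arithmetic concrete, here is a small sketch that computes how a file would be divided at the 64 MB block size mentioned above. The function name and the 200 MB file are hypothetical; it only models the sizes, not HDFS's actual placement or replication.

```python
BLOCK_SIZE_MB = 64  # block size quoted in the text


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds the leftover data.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(200)
print(len(blocks), blocks)
```

A 200 MB file therefore maps to three full blocks plus one partial block, each of which can live on a different node.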
Map Reduce
MapReduce is a framework for performing distributed data processing using the Map-Reduce programming paradigm.
The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes.
Distributing the computation solves the issue of data too large to fit onto a single machine.
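The divide-and-run-in-parallel idea can be sketched with the classic word count. This is a single-process imitation in plain Python, not Hadoop's API: the function names are illustrative, and on a real cluster each input split would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(splits):
    mapped = [pair for line in splits for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big", "data in motion"])
print(counts)
```

Because each call to map_phase depends only on its own split, the map work can be spread over as many machines as there are splits.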
Job Tracker
It is the master of the system which manages the jobs and resources in the cluster.
If a job fails on a node, it reallocates the job to another node.
Task Tracker
These are the slaves that are deployed on each machine.
TaskTracker also periodically sends a heartbeat message to JobTracker, which helps JobTracker decide whether to delegate a new task to that node.
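The relationship between these messages and task delegation can be illustrated with a toy scheduler. This is only a sketch under assumptions: the node names, the timestamps, the 10-second timeout, and the round-robin policy are all invented here and do not reflect Hadoop's real scheduling logic.

```python
def assign_tasks(tasks, last_seen, now, timeout=10):
    """Assign tasks round-robin to nodes that reported recently.

    last_seen maps a node name to the time of its latest message.
    A node that has been silent for longer than `timeout` gets no work.
    """
    alive = sorted(node for node, seen in last_seen.items() if now - seen <= timeout)
    if not alive:
        raise RuntimeError("no live nodes to run tasks on")
    return {task: alive[i % len(alive)] for i, task in enumerate(tasks)}


# node-c has been silent for 31 time units, so it is skipped.
last_seen = {"node-a": 100, "node-b": 95, "node-c": 70}
plan = assign_tasks(["t1", "t2", "t3"], last_seen, now=101)
print(plan)
```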
Job History Server
It is a process for serving up information about historically completed applications so that the JobTracker does not need to track them.
Hadoop Based Search Engine Q-Search
Description of Project
The first major issue in this project was the type of data to be searched. The data we chose is highly unstructured and large in volume. The files contain many pictures, along with different types of tags and formatting elements, which need to be removed for effective searching. The second major issue was to ensure that searching completes in optimum time and gives reliable output; for this we used various methods such as caching. We also sorted the displayed results by number of hits. To improve the user experience, the displayed results are paginated.
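The transcript does not say which file formats the project handled or how tags were removed. As one possible sketch, assuming the markup is HTML-like, Python's standard html.parser can keep the text and drop tags and images; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content, ignoring tags such as <b> or <img>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text pieces and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


text = strip_markup('<p>Big <b>Data</b></p><img src="x.png"> search')
print(text)
```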
Use Case Diagram
Working of Project
User enters a Search String
Caching Module Checks in Cache
This module is the first to be invoked. It searches the cache for the query entered by the user. On a hit, it redirects the flow to the Display Result module; otherwise it invokes the Create File List module.
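The cache-first flow can be sketched as follows. The names search and slow_search and the dictionary used as a cache are stand-ins for illustration; the project itself stores results in a cache database.

```python
def search(query, cache, full_search):
    """Return (results, served_from_cache) for a query."""
    if query in cache:
        # Cache hit: go straight to displaying results.
        return cache[query], True
    # Cache miss: run the full search, then remember the answer.
    results = full_search(query)
    cache[query] = results
    return results, False


calls = []


def slow_search(query):
    """Placeholder for the file-list and HDFS search steps."""
    calls.append(query)
    return [("doc1.txt", 3)]


cache = {}
first = search("hadoop", cache, slow_search)
second = search("hadoop", cache, slow_search)
print(first, second, calls)
```

The second call never reaches slow_search, which is the saving the caching module is meant to provide.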
Create File List
This module is invoked when the required results are not present in the cache, and it generates the file list. The file list is obtained by browsing through the file system.
HDFS Search
It searches for the key in all files mentioned in the file list (obtained from the Create File List module). For each type of file, the required driver is identified and run so that it calls the corresponding Mappers and Reducers, which generate the results and store them in temporary files.
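A rough sketch of what such a mapper and reducer might compute: the mapper emits a hit count per line, and the reducer totals hits per file. This runs in one process on in-memory data; the file names, contents, and whole-word matching rule are assumptions, and the project's actual drivers are not shown in the transcript.

```python
from collections import defaultdict


def mapper(filename, line, key):
    """Emit (filename, hits) when the key occurs as a word in the line."""
    hits = line.lower().split().count(key.lower())
    if hits:
        yield filename, hits


def reducer(filename, hit_counts):
    """Total the hits reported for one file."""
    return filename, sum(hit_counts)


def search_files(files, key):
    grouped = defaultdict(list)
    for filename, lines in files.items():
        for line in lines:
            for name, hits in mapper(filename, line, key):
                grouped[name].append(hits)
    return dict(reducer(name, counts) for name, counts in grouped.items())


files = {
    "a.txt": ["hadoop stores data", "data data everywhere"],
    "b.txt": ["no match here"],
    "c.txt": ["Data in motion"],
}
hits = search_files(files, "data")
print(hits)
```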
Update Cache
The temporary files generated are read and stored in the cache for faster retrieval the next time the same word is searched. Each output file is read and its results are stored in the cache database.
Delete Temporary Files
Once the output results are stored in the cache database, this module deletes the generated temporary files, releasing the space they occupied.
Manage Cache
Whenever a new result is added to the cache, this module checks whether the number of distinct search keys stored in the cache exceeds a fixed limit (5000). If it does, the module deletes the Least Frequently Used result from the cache. It also updates the counter of the currently searched key.
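The eviction rule can be sketched with a small Least Frequently Used cache. The class and method names are hypothetical, and the example uses a limit of 2 instead of 5000 so that the eviction is visible.

```python
class LFUCache:
    """Keep at most `limit` distinct keys, evicting the least-used one."""

    def __init__(self, limit=5000):
        self.limit = limit
        self.results = {}  # search key -> cached results
        self.counts = {}   # search key -> number of times used

    def get(self, key):
        if key not in self.results:
            return None
        self.counts[key] += 1  # update the counter of the searched key
        return self.results[key]

    def put(self, key, value):
        if key not in self.results and len(self.results) >= self.limit:
            # Evict the key with the smallest usage counter.
            victim = min(self.counts, key=self.counts.get)
            del self.results[victim]
            del self.counts[victim]
        self.results[key] = value
        self.counts[key] = self.counts.get(key, 0) + 1


lfu = LFUCache(limit=2)
lfu.put("hadoop", ["a.txt"])
lfu.put("hdfs", ["b.txt"])
lfu.get("hadoop")            # "hadoop" is now used more often than "hdfs"
lfu.put("mapreduce", ["c.txt"])
print(sorted(lfu.results))
```

The linear scan for the victim is fine for a sketch; a production cache would track frequencies in a structure that avoids scanning every key.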
Display Content
This module can also be invoked directly by the Caching Module if the required results are found in the cache. It displays the results in decreasing order of the number of hits. The module also provides access to each corresponding file through a separate link for that file.
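Ordering by hits, combined with the pagination mentioned in the project description, might look like the sketch below. The function name, page size, and sample hit counts are assumptions for illustration.

```python
def paginate(results, page, page_size=10):
    """Return one page of (filename, hits) pairs, highest hit count first."""
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    start = (page - 1) * page_size
    return ranked[start:start + page_size]


hits = {"a.txt": 3, "b.txt": 7, "c.txt": 1}
page1 = paginate(hits, page=1, page_size=2)
page2 = paginate(hits, page=2, page_size=2)
print(page1, page2)
```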
Various Phases in Big Data Processing
Data Collection and Storage
Best Practices for Harnessing Big Data
Use sandboxing
Dimensionalize the data
Embed Analytics into operational workflow/routine
Gather business requirements before gathering data
Opportunities in Big Data
Quantum Computing
Machine Learning
Data Filtering, Aggregation and Representation
Data Modeling and Analysis
Query Processing and Information Extraction
Interpretation and Visualization of Big Data
This phase must support data acquisition with low, predictable latency, both in capturing data and in executing short, simple queries.
Large corpuses of data are kept in their original location (disk) rather than being moved from one disk to another.
Hadoop is a newer tool that helps organize and integrate data on the original storage cluster.
The infrastructure should deliver results quickly, scale to extreme data volumes, and improve response times and decision making.
New data needs to be analyzed in the context of the old to provide new perspectives and solutions to old problems.
Big Data is generally noisy and not fully trustworthy, but it still holds valuable information worth extracting.
This phase extracts useful information from 'raw' data through predictive modelling.
For data visualization, one of the best approaches is to present the data graphically, extracting the important information so as to communicate the meaning of the data.
This makes it possible to spot patterns that are generally not apparent when observing raw values, and makes the data more interpretable for users.
Technology Outlook
Result Display