
Cost Minimization for Big Data Processing in


Anjani Nanda

on 4 December 2014


Transcript of Cost Minimization for Big Data Processing in

Cost Minimization for Big Data Processing in Geo-Distributed Data Centers
Anjani Nandan
M.tech S.E.

Big data is any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications.
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
Big data "size" is a constantly moving target; as of 2012 it ranged from a few dozen terabytes to many petabytes of data.
Big data is also a set of techniques and technologies that require new forms of integration to uncover large hidden values from large data sets.
What is big data?
Impact of explosive growth
They proposed data center resizing (DCR) to reduce the computation cost by adjusting the number of activated servers via task placement.
They distribute data across geographically distributed data centers to lower the electricity cost.
They distribute data based on the number of users and the distribution of industries.
Efforts ....
LIN GU, (Student Member, IEEE),
DEZE ZENG, (Member, IEEE),
PENG LI, (Member, IEEE)
SONG GUO, (Senior Member, IEEE)

How much data?
Google processes 20 PB a day (2009)

Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day
CERN’s Large Hadron Collider (LHC) generates 15 PB a year

Explosive Growth of demand....
Why cost minimization?
Gartner (a US research firm) predicts that by 2015, 71% of worldwide data center hardware spending will come from big data processing, which will surpass $126.2 billion.
challenge cont...
Second, the links in networks vary in transmission rate and cost according to their unique features, e.g., the distances and physical optical fiber facilities between data centers. However, the existing routing strategy among data centers fails to exploit the link diversity of data center networks. Due to storage and computation capacity constraints, not all tasks can be placed onto the same server on which their corresponding data reside.
Third, the Quality-of-Service (QoS) of big data tasks has not been considered in existing work. Similar to conventional cloud services, big data applications also exhibit a Service-Level Agreement (SLA) between a service provider and the requesters. Existing studies on general cloud computing tasks mainly focus on the computation capacity constraints, while ignoring the constraints of transmission rate.
Observation & Strategies ....
We are the first to consider the cost minimization problem of big data processing with joint consideration of data placement, task assignment and data routing.
To describe the rate-constrained computation and transmission in big data processing, we propose a two-dimensional Markov chain and derive the expected task completion time in closed form.
Based on the closed-form expression, we formulate the cost minimization problem as a mixed-integer nonlinear program (MINLP) to answer the following questions:
1) How to place the data chunks in the servers.
2) How to distribute tasks onto servers without violating the resource constraints.
3) How to resize data centers to achieve the operation cost minimization goal.
To deal with the high computational complexity of solving the MINLP, we linearize it as a mixed-integer linear programming (MILP) problem, which can be solved using a commercial solver. Through extensive numerical studies, we show the high efficiency of our proposed joint-optimization based algorithm.
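The slides do not say which linearization is applied, so the following is only a sketch of one standard technique, not necessarily the authors' method. It assumes the nonlinearity is a product z = x * y of a binary variable x and a continuous variable y bounded by 0 <= y <= upper; the names `x`, `y`, `z`, `upper` and `linearized_bounds` are hypothetical. The four linear constraints z <= upper * x, z <= y, z >= y - upper * (1 - x) and z >= 0 then leave exactly one feasible value for z, namely x * y.

```python
def linearized_bounds(x, y, upper):
    """Interval [lo, hi] that the four linear constraints allow for z.

    Constraints (standard product linearization, assumed here):
        z >= 0
        z >= y - upper * (1 - x)
        z <= upper * x
        z <= y
    """
    lo = max(0.0, y - upper * (1 - x))
    hi = min(upper * x, y)
    return lo, hi


# Check on a few sample points that the interval collapses to x * y.
samples = [(x, y) for x in (0, 1) for y in (0.0, 0.25, 1.0)]
results = [linearized_bounds(x, y, upper=1.0) for x, y in samples]
products = [x * y for x, y in samples]
```

With the variables normalized to one unit, as in the system model, `upper=1.0` is a natural bound.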
A. Server cost minimization
A key issue in large-scale data centers is the electricity and operating cost. Reducing the electricity cost has therefore received significant attention from both academia and industry. One approach reduces the electricity cost by routing user requests to geo-distributed data centers whose sizes are updated to match the requests, using a holistic approach of workload balancing.
B. Big data management
A key issue in big data management is reliable and effective data placement. To achieve this goal, they propose a scheduling algorithm that takes energy efficiency into account in addition to fairness and data locality. Another work presents a new design philosophy that provides agile and deep data analytics for one of the world's largest networks at Fox Audience Network, using the Greenplum parallel database system.
C. Data placement
A key issue in data placement is video on demand (VoD). They proposed Volley, an automated data placement mechanism, for geo-distributed DCR.

System Model
There is a set I of data centers, and each data center i ∈ I consists of a set J_i of servers that are connected to a switch m_i ∈ M with a local transmission cost of C_L. In general, the transmission cost C_R for inter-data center traffic is greater than C_L, i.e., C_R > C_L. Without loss of generality, all servers in the network have the same computation resource and storage capacity, both of which are normalized to one unit. We use J to denote the set of all servers.
System model cont.
The weight of each link w(u,v), representing the corresponding communication cost, can be defined as

$$
w(u,v) =
\begin{cases}
C_R, & \text{if } u, v \in M, \\
C_L, & \text{otherwise.}
\end{cases}
$$
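The link weight rule can be sketched in a few lines of Python. This is a minimal illustration: the cost values `C_R = 4.0` and `C_L = 1.0` and the node names are assumptions chosen only to satisfy C_R > C_L, and `link_weight` is a hypothetical helper name.

```python
# Hypothetical cost values; the only requirement from the model is C_R > C_L.
C_R = 4.0  # inter-data center transmission cost (switch to switch)
C_L = 1.0  # local transmission cost (any link involving a server)


def link_weight(u, v, switches, c_r=C_R, c_l=C_L):
    """Return the communication cost of link (u, v).

    The link costs c_r when both endpoints are switches (members of M),
    and c_l otherwise.
    """
    return c_r if u in switches and v in switches else c_l


# Example topology: two switches and one server attached to m1.
switches = {"m1", "m2"}
inter_dc = link_weight("m1", "m2", switches)  # both endpoints in M
local = link_weight("m1", "j1", switches)     # one endpoint is a server
```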

As for task distribution, the sum of the task rates assigned to the servers should be equal to the overall task rate.
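The formula for this constraint is not reproduced in the transcript, so the check below is only a sketch of the stated condition. The dictionary layout (server name mapped to assigned rate) and the function name `rates_are_consistent` are assumptions made for illustration.

```python
import math


def rates_are_consistent(assigned_rates, overall_rate, tol=1e-9):
    """Check that per-server task rates add up to the overall task rate.

    assigned_rates: dict mapping a server name to the rate assigned to it
                    (hypothetical layout, not taken from the slides).
    """
    return math.isclose(sum(assigned_rates.values()), overall_rate, abs_tol=tol)


# Example: an overall rate of 1.0 split across three servers.
split = {"j1": 0.5, "j2": 0.3, "j3": 0.2}
ok = rates_are_consistent(split, overall_rate=1.0)
bad = rates_are_consistent({"j1": 0.5, "j2": 0.3}, overall_rate=1.0)
```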

A server shall be activated if there are data chunks placed
onto it or tasks assigned to it.
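The activation rule can be illustrated as follows. This is a sketch under assumed data structures: `chunk_placement` maps each server to the set of chunks stored on it, `task_rates` maps each server to its assigned task rate, and `active_servers` is a hypothetical helper name.

```python
def active_servers(chunk_placement, task_rates):
    """Return the set of servers that must be activated.

    A server is active if it stores at least one data chunk or has a
    positive task rate assigned to it.
    """
    servers = set(chunk_placement) | set(task_rates)
    return {
        s
        for s in servers
        if chunk_placement.get(s) or task_rates.get(s, 0) > 0
    }


# Example: j1 holds data, j2 only runs tasks, j3 is unused.
placement = {"j1": {"k1"}, "j2": set(), "j3": set()}
rates = {"j2": 0.5, "j3": 0.0}
active = active_servers(placement, rates)
```

In this example only `j1` and `j2` are active, so a resizing scheme could switch `j3` off.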
We jointly study data placement, task assignment, data center resizing and routing to minimize the overall operational cost in large-scale geo-distributed data centers for big data applications.
Question?
First, data locality may result in a waste of resources.
For example, most of the computation resources of a server with less popular data may stay idle. The low resource utilization further causes more servers to be activated and hence a higher operating cost.
Data centers
As an impact of explosive growth, we need large databases for storing data.
A data center is a collection of many servers that provides a storage area.
It also processes data for geographically distributed areas in a controlled environment.
Large industries generally have data centers.
Google has data centers in 8 countries across 4 continents.
