Hadoop and Hive Integration
Hadoop Distributed File System (HDFS)
Designed for low-cost hardware with high fault tolerance.
Stores large data sets reliably and streams them at high bandwidth.
Dynamically scales across thousands of servers, ensuring economical growth.
Map-Reduce (M-R) Process:
System design enhancement in distributed systems aims to optimize performance, reliability, and usability while accommodating changing needs and environments.
Hive: Data warehouse infrastructure built on Hadoop.
Enables easy data summarization, reporting, ad hoc querying, and analysis.
Provides HiveQL, a SQL-based query language for querying data stored in HDFS.
Significantly reduces time for job completion, making complex tasks manageable in minutes.
Availability
Openness
Reliability
Handles terabytes and petabytes of data daily.
Two Major Phases: Map and Reduce.
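The two phases can be illustrated with a minimal single-process sketch. This is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are assumptions made for illustration only. In a real cluster the map and reduce calls run in parallel on worker nodes and the grouping step is done by the framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key (done by the framework between phases)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    intermediate = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input records standing in for web server log lines.
    print(word_count(["GET /home", "GET /profile", "POST /home"]))
```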
Steps:
Emphasizes high throughput over low latency, suitable for batch processing.
Utilizes a write-once-read-many access model for files.
Exposes a file system namespace; user data is stored in blocks across Data Nodes.
Single Name Node simplifies system architecture, managing all metadata.
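The division of labor between the Name Node and the Data Nodes can be sketched with a toy in-memory model. Everything here is an assumption for illustration: the class names (`ToyNameNode`, `ToyCluster`), the 4-byte block size, and the round-robin placement are hypothetical and far simpler than real HDFS, which also replicates each block and uses much larger blocks.

```python
from itertools import cycle
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # bytes; deliberately tiny so the split is visible


class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, data_nodes: List[str]) -> None:
        self.block_map: Dict[str, List[Tuple[int, str]]] = {}
        self._placement = cycle(data_nodes)  # assumed round-robin placement

    def allocate(self) -> str:
        """Pick the Data Node that will hold the next block."""
        return next(self._placement)


class ToyCluster:
    """One Name Node plus several Data Nodes, all simulated in memory."""

    def __init__(self, data_nodes: List[str]) -> None:
        self.name_node = ToyNameNode(data_nodes)
        # Each Data Node stores raw block bytes keyed by (path, block index).
        self.storage: Dict[str, Dict[Tuple[str, int], bytes]] = {
            node: {} for node in data_nodes
        }

    def write(self, path: str, data: bytes) -> None:
        """Write once: a path that already exists cannot be overwritten."""
        if path in self.name_node.block_map:
            raise FileExistsError(f"{path} is write-once and already exists")
        placements: List[Tuple[int, str]] = []
        for index, start in enumerate(range(0, len(data), BLOCK_SIZE)):
            node = self.name_node.allocate()
            self.storage[node][(path, index)] = data[start:start + BLOCK_SIZE]
            placements.append((index, node))
        self.name_node.block_map[path] = placements

    def read(self, path: str) -> bytes:
        """Read many: look up block locations, then fetch from Data Nodes."""
        placements = self.name_node.block_map[path]
        return b"".join(self.storage[node][(path, i)] for i, node in placements)


if __name__ == "__main__":
    cluster = ToyCluster(["dn1", "dn2", "dn3"])
    cluster.write("/logs/day1", b"hello world!")
    print(cluster.name_node.block_map["/logs/day1"])
    print(cluster.read("/logs/day1"))
```

The Name Node never touches file contents in this sketch; it only records block locations, which is why a single metadata server keeps the design simple.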
Memcached server
A Memcached server is a distributed memory-caching system that stores data in memory as key-value pairs.
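A common way to use such a key-value cache in front of a database is the cache-aside pattern, sketched below. This is a minimal illustration, not Facebook's implementation and not the Memcached client API: `ToyCache` is a hypothetical in-process stand-in for a Memcached server, and `load_user` is a hypothetical stand-in for a MySQL query.

```python
from typing import Callable, Dict, Optional


class ToyCache:
    """In-memory key-value store standing in for a Memcached server."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        """Return the cached value, or None on a cache miss."""
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        """Store a value under the given key."""
        self._store[key] = value


def cached_read(cache: ToyCache, key: str, load: Callable[[str], str]) -> str:
    """Cache-aside read: try the cache first, fall back to the database."""
    value = cache.get(key)
    if value is None:
        value = load(key)      # slow path: query the backing database
        cache.set(key, value)  # populate the cache for later readers
    return value


if __name__ == "__main__":
    db_calls = []

    def load_user(key: str) -> str:
        # Hypothetical database lookup; records each call to show cache hits.
        db_calls.append(key)
        return f"profile-of-{key}"

    cache = ToyCache()
    print(cached_read(cache, "user:42", load_user))
    print(cached_read(cache, "user:42", load_user))
    print(len(db_calls))
```

Only the first read reaches the database; repeated reads of the same key are served from memory, which is the reason for placing a cache tier in front of the database.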
Facebook's Reliance on Hadoop Platform
New App Requirements: Facebook adopts applications needing high write throughput, cost-effective storage, and low latency.
"What events are occurring on the server?"
Hadoop Integration
Apache Hive
4-Tier Architecture:
Hadoop is chosen for its ability to handle unstructured and structured data.
Ideal for data discovery and processing large volumes of data.
Two Main Components:
Map-Reduce (M-R): Computation
Hadoop Distributed File System (HDFS): Storage
Challenges that were faced:
How were they addressed?
Hadoop Architecture:
Master Node:
Job Tracker
Task Tracker
Name Node (NN)
Data Node (DN)
Worker Nodes
Facebook is recognized as the largest online social network system, and it is designed as a distributed system. The datacenters behind the Facebook network are robust, scalable, reliable, and secure, allowing Facebook to be highly accessible from anywhere with high availability. The system is responsible for processing large quantities of data, known as "Big Data," and it interconnects users through friendship relations, allowing for synchronous and asynchronous communications of user-generated content.
Tier 3:
Tier 4:
Tier 2:
Scalability and Management: Scaling MySQL clusters quickly presents challenges in load balancing and management overhead.
Tier 1:
Author name:
->MySQL and Memcached Usage
->MySQL Performance Issues
Web Servers
Scribe-Hadoop Clusters (Log Aggregation)
Hive-Hadoop Clusters (Data Warehousing)
Database (Federated MySQL)
Asma Salem
University of Jordan
Alternatives:
Publication date:
Stores core user data and application data.
Facebook's centralized datacenters in the US (Santa Clara and Palo Alto in California; Ashburn in Virginia) pose bandwidth and latency challenges for users outside the US.
Facebook utilizes Content Delivery Networks (CDNs) geographically distributed across regions such as Russia, Egypt, Sweden, and the UK.
Process client requests.
Aggregate logs.
Facebook Distributed System Case Study For Distributed System Inside Facebook
Datacenters
Receive uncompressed logs from web servers.
Aggregate and store logs for analysis.
August 2017
Scalability: Ability to handle increasing workload efficiently.
Reliability: Ensuring system availability and fault tolerance.
Components:
Systems Design
Big Data Processing and Analysis
Huge Storage
Production Cluster:
High priority jobs with strict deadlines.
Stores and analyzes production data.
Ad-hoc Cluster:
Low priority batch jobs.
User-initiated ad-hoc analysis on historical data.
Affiliation:
INTERNATIONAL JOURNAL OF TECHNOLOGY ENHANCEMENTS AND EMERGING ENGINEERING RESEARCH, VOL 2, ISSUE 7 152
ISSN 2347-4289
Cloud Computing:
https://www.researchgate.net/publication/318946789
Cloud computing in distributed systems involves delivering computing resources and services over a network, enabling scalable and flexible access to resources without local hardware.
Scalability
Security
Key points:
Hadoop RPC in Facebook's Distribution System:
Social Network:
>Virtual Org - Social networks act like online communities for sharing resources.
>Social Cloud - used to share scalable data
>Service level agreements (SLAs)
>Cloud Hosts Social Networks - Cloud platforms can be the foundation of social networks.
>Facebook Apps - Use app data to offer richer features.
>Isolated App Communication (FBJS)
MORE FEATURES: