Data Science

MapReduce consists of two functions:
- A map function
- A reduce function
In Hadoop:
- MapReduce is a distributed computing framework
- HDFS is a distributed storage framework
As a result, the map and reduce functions run on many different computers

A map function
- Takes the data stored on each DataNode as input
- Outputs key-value pairs
  - A key is a piece of data from the DataNode
  - A value is an aggregation computed on that DataNode
- Each value only reflects the data on an individual DataNode
A reduce function
- Takes in the key-value pairs from each DataNode
- Outputs combined key-value pairs covering all DataNodes
  - A key is a piece of data found across the DataNodes
  - A value is the aggregation across all DataNodes
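The map and reduce steps described above can be sketched as a single-process word count. This is only an illustration, not Hadoop's API: the names `map_function`, `shuffle`, `reduce_function`, and the `blocks_by_datanode` data are all hypothetical, and the three "DataNodes" are simulated with a plain dictionary.

```python
from collections import defaultdict

def map_function(block):
    """Count words in one node's block; emits (word, local_count) pairs."""
    counts = defaultdict(int)
    for word in block.split():
        counts[word.lower()] += 1
    return list(counts.items())

def shuffle(mapped_outputs):
    """Group the values emitted by every node under their shared key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)

def reduce_function(key, values):
    """Combine the per-node counts for one key into a single total."""
    return key, sum(values)

# Hypothetical data: one text block per simulated DataNode.
blocks_by_datanode = {
    "datanode-1": "apple banana apple",
    "datanode-2": "banana cherry",
    "datanode-3": "apple cherry cherry",
}

mapped = [map_function(block) for block in blocks_by_datanode.values()]
grouped = shuffle(mapped)
totals = dict(reduce_function(key, values) for key, values in grouped.items())
print(totals)  # {'apple': 3, 'banana': 2, 'cherry': 3}
```

Each map output only reflects one node's block (for example, `apple` is counted as 2 on `datanode-1`), while the reduce output reflects every block.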

A MapReduce application consists of two main services:
- One JobTracker
- Some TaskTrackers
A JobTracker has the following properties:
- It acts as the master server
- It communicates with the NameNode
- It ensures the execution of submitted jobs is completed
A TaskTracker has the following properties:
- It communicates with the DataNodes
- It is responsible for performing the actual work
- Meaning, it performs the mapping, shuffling, and reducing tasks

A client submits an application to the JobTracker
- The JobTracker separates the application into tasks
- These tasks include the map, shuffle, and reduce functions
The JobTracker requests metadata from the NameNode
- This metadata includes the location of relevant data
The NameNode provides the JobTracker with metadata
- This metadata describes the location of the DataNodes
- Only the DataNodes holding relevant data are included
The JobTracker locates available TaskTrackers
- It tries to find TaskTrackers that are:
  - Available
  - As close as possible to the relevant DataNodes
The JobTracker submits its tasks to the TaskTrackers
- Only the chosen TaskTrackers are included
The TaskTrackers execute their individual tasks
- They communicate with their specified DataNodes
- TaskTrackers send progress reports to the JobTracker
- They do this by sending heartbeat signals
- If the JobTracker doesn't receive a heartbeat signal in time, it will assume the TaskTracker has failed
- Then, it will reschedule that TaskTracker's tasks on a different TaskTracker
The TaskTrackers complete all individual tasks
- They update the JobTracker
The JobTracker updates the job's status to complete
- Client applications can now poll the JobTracker for information
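The scheduling and failure-handling steps above can be sketched as a small simulation. This is a rough model under stated assumptions, not Hadoop code: the `TaskTracker` and `JobTracker` classes, the `HEARTBEAT_TIMEOUT` value, and the task and node names are all hypothetical, and "closeness" is reduced to whether a tracker sits on the same node as the data.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value, not a Hadoop default

@dataclass
class TaskTracker:
    name: str
    node: str                    # DataNode this tracker runs alongside
    available: bool = True
    last_heartbeat: float = 0.0  # time of the most recent heartbeat

@dataclass
class JobTracker:
    trackers: list
    assignments: dict = field(default_factory=dict)  # task -> tracker name

    def pick_tracker(self, data_node, now):
        """Prefer a live, available tracker on the node that holds the data."""
        live = [t for t in self.trackers
                if t.available and now - t.last_heartbeat <= HEARTBEAT_TIMEOUT]
        if not live:
            raise RuntimeError("no available TaskTracker")
        # False sorts before True, so data-local trackers are chosen first.
        return min(live, key=lambda t: t.node != data_node)

    def submit(self, task, data_node, now):
        tracker = self.pick_tracker(data_node, now)
        self.assignments[task] = tracker.name
        return tracker.name

    def check_heartbeats(self, task_locations, now):
        """Reschedule tasks whose tracker missed the heartbeat window."""
        by_name = {t.name: t for t in self.trackers}
        rescheduled = []
        for task, name in list(self.assignments.items()):
            if now - by_name[name].last_heartbeat > HEARTBEAT_TIMEOUT:
                by_name[name].available = False  # treat the tracker as failed
                self.submit(task, task_locations[task], now)
                rescheduled.append(task)
        return rescheduled

trackers = [TaskTracker("tt-1", "datanode-1"), TaskTracker("tt-2", "datanode-2")]
job_tracker = JobTracker(trackers)
task_locations = {"map-0": "datanode-1", "map-1": "datanode-2"}

for task, node in task_locations.items():
    job_tracker.submit(task, node, now=0.0)
first_assignments = dict(job_tracker.assignments)

# tt-2 keeps sending heartbeats; tt-1 goes silent.
trackers[1].last_heartbeat = 12.0
rescheduled = job_tracker.check_heartbeats(task_locations, now=15.0)

print(first_assignments)        # {'map-0': 'tt-1', 'map-1': 'tt-2'}
print(rescheduled)              # ['map-0']
print(job_tracker.assignments)  # {'map-0': 'tt-2', 'map-1': 'tt-2'}
```

In this sketch the failed tracker's task moves to a surviving tracker even though it is no longer data-local, which mirrors the trade-off between locality and availability described in the steps above.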
