MapReduce source code analysis: the task-assignment process
As mentioned earlier, job initialization creates a series of TaskInProgress objects and caches them in memory. They sit there until the TaskTracker nodes send heartbeats to the JobTracker asking for work; the scheduler on the JobTracker side (JobQueueTaskScheduler by default) then hands out tasks, and the concrete implementation is its assignTasks method.
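Before looking at the scheduler itself, it helps to see where it is called from. The snippet below is only a conceptual sketch of the JobTracker's heartbeat handling, not the verbatim source; assignTasks, Task, LaunchTaskAction, TaskTrackerAction and HeartbeatResponse are real Hadoop classes, while the surrounding variable names and the omitted bookkeeping are illustrative:

// Conceptual sketch: inside JobTracker.heartbeat(), after deciding that this
// tracker may accept new tasks.
List<Task> tasks = taskScheduler.assignTasks(taskTrackerStatus);
List<TaskTrackerAction> actions = new ArrayList<TaskTrackerAction>();
if (tasks != null) {
  for (Task task : tasks) {
    // each assigned Task is sent back to the TaskTracker as a LaunchTaskAction
    actions.add(new LaunchTaskAction(task));
  }
}
// the actions are packed into the HeartbeatResponse returned to the TaskTracker,
// which then launches the corresponding map/reduce tasks locally
response.setActions(actions.toArray(new TaskTrackerAction[actions.size()]));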
Core algorithm of assignTasks:
1. For a given TaskTracker, compute how many slots are available on it. The scheduler tries to spread tasks evenly across the nodes for load balancing (see the worked example after this list).
Concretely:
The calculation is done separately for map and reduce tasks:
First a cluster-wide load factor is computed: (total tasks wanted by the running jobs in the queue - the tasks those jobs have already finished) / the cluster's total slot capacity (obtained by scanning every job in the jobQueue).
Then the slots available on this node: min(ceil(factor * the node's total slots), the node's total slots) - the number of tasks already running on the node.
2. It then calls JobInProgress's obtainNewLocalMapTask, obtainNewNonLocalMapTask and obtainNewReduceTask in turn. Each returns a Task, which is wrapped as a LaunchTaskAction and sent back to the TaskTracker for execution. Taking obtainNewLocalMapTask as an example, it ultimately calls findNewMapTask in the same class; findNewMapTask returns the task closest to the TaskTracker (it tries the local node, then the local rack, then the rest of the data center, pulling candidates from the non-running-task cache, the Map<Node, List<TaskInProgress>> created and populated by createCache during job initialization).
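As a concrete illustration of step 1, here is a small, self-contained example with hypothetical numbers that mirrors the arithmetic assignTasks performs below (it is not part of the Hadoop source):

public class SlotMathExample {
  public static void main(String[] args) {
    // Hypothetical cluster: 100 map slots in total, 30 map tasks still
    // pending or running across all running jobs in the queue.
    int clusterMapCapacity = 100;
    int remainingMapLoad = 30;
    double mapLoadFactor = (double) remainingMapLoad / clusterMapCapacity; // 0.3

    // Hypothetical tracker: 4 map slots, 1 map already running on it.
    int trackerMapCapacity = 4;
    int trackerRunningMaps = 1;
    int trackerCurrentMapCapacity =
        Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity),
                 trackerMapCapacity);                             // ceil(1.2) = 2
    int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps; // 1

    System.out.println("load factor = " + mapLoadFactor
        + ", current capacity = " + trackerCurrentMapCapacity
        + ", available map slots = " + availableMapSlots);
  }
}

Even though this tracker has 4 map slots and 3 of them are idle, it is offered only 1 new map task, because the cluster as a whole is only 30% loaded; the rest of the work is left for other trackers' heartbeats, which is exactly the load-balancing effect described above.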
Key excerpts from the source:
The assignTasks method:
public synchronized List<Task> assignTasks(TaskTrackerStatus taskTracker)
throws IOException {
ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
final int numTaskTrackers = clusterStatus.getTaskTrackers();
final int clusterMapCapacity = clusterStatus.getMaxMapTasks();
final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks();
Collection<JobInProgress> jobQueue =
jobQueueJobInProgressListener.getJobQueue();
//
// Get map + reduce counts for the current tracker.
//
final int trackerMapCapacity = taskTracker.getMaxMapTasks();
final int trackerReduceCapacity = taskTracker.getMaxReduceTasks();
final int trackerRunningMaps = taskTracker.countMapTasks();
final int trackerRunningReduces = taskTracker.countReduceTasks();
//Here taskTracker is the TaskTrackerStatus sent with the heartbeat; it carries the node's maximum map/reduce slot counts and the numbers of currently running map and reduce tasks
// Assigned tasks
List<Task> assignedTasks = new ArrayList<Task>();
//
// Compute (running + pending) map and reduce task numbers across pool
//
int remainingReduceLoad = 0;
int remainingMapLoad = 0;
synchronized (jobQueue) {
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() == JobStatus.RUNNING) {
remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
if (job.scheduleReduces()) {
remainingReduceLoad +=
(job.desiredReduces() - job.finishedReduces());
}
}
}
}
// Compute the 'load factor' for maps and reduces
double mapLoadFactor = 0.0;
if (clusterMapCapacity > 0) {
mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
}
double reduceLoadFactor = 0.0;
if (clusterReduceCapacity > 0) {
reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
}
//
// In the below steps, we allocate first map tasks (if appropriate),
// and then reduce tasks if appropriate. We go through all jobs
// in order of job arrival; jobs only get serviced if their
// predecessors are serviced, too.
//
//
// We assign tasks to the current taskTracker if the given machine
// has a workload that's less than the maximum load of that kind of
// task.
// However, if the cluster is close to getting loaded i.e. we don't
// have enough _padding_ for speculative executions etc., we only
// schedule the "highest priority" task i.e. the task from the job
// with the highest priority.
//
final int trackerCurrentMapCapacity =
Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity),
trackerMapCapacity);
int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;
boolean exceededMapPadding = false;
if (availableMapSlots > 0) {
exceededMapPadding =
exceededPadding(true, clusterStatus, trackerMapCapacity);
}
int numLocalMaps = 0;
int numNonLocalMaps = 0;
scheduleMaps:
for (int i=0; i < availableMapSlots; ++i) {
synchronized (jobQueue) {
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() != JobStatus.RUNNING) {
continue;
}
Task t = null;
// Try to schedule a node-local or rack-local Map task
t =
job.obtainNewLocalMapTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts());
if (t != null) {
assignedTasks.add(t);
++numLocalMaps;
// Don't assign map tasks to the hilt!
// Leave some free slots in the cluster for future task-failures,
// speculative tasks etc. beyond the highest priority job
if (exceededMapPadding) {
break scheduleMaps;
}
// Try all jobs again for the next Map task
break;
}
// Try to schedule an off-switch or speculative Map task
t =
job.obtainNewNonLocalMapTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts());
if (t != null) {
assignedTasks.add(t);
++numNonLocalMaps;
// We assign at most 1 off-switch or speculative task
// This is to prevent TaskTrackers from stealing local-tasks
// from other TaskTrackers.
break scheduleMaps;
}
}
}
}
int assignedMaps = assignedTasks.size();
//
// Same thing, but for reduce tasks
// However we _never_ assign more than 1 reduce task per heartbeat
//
final int trackerCurrentReduceCapacity =
Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
trackerReduceCapacity);
final int availableReduceSlots =
Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
boolean exceededReducePadding = false;
if (availableReduceSlots > 0) {
exceededReducePadding = exceededPadding(false, clusterStatus,
trackerReduceCapacity);
synchronized (jobQueue) {
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() != JobStatus.RUNNING ||
job.numReduceTasks == 0) {
continue;
}
Task t =
job.obtainNewReduceTask(taskTracker, numTaskTrackers,
taskTrackerManager.getNumberOfUniqueHosts()
);
if (t != null) {
assignedTasks.add(t);
break;
}
// Don't assign reduce tasks to the hilt!
// Leave some free slots in the cluster for future task-failures,
// speculative tasks etc. beyond the highest priority job
if (exceededReducePadding) {
break;
}
}
}
}
// ... debug logging elided ...
return assignedTasks;
}
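The exceededPadding helper called above is not reproduced here. Conceptually it decides whether the cluster is so close to full that only the highest-priority job should still get tasks, leaving some slack for task failures and speculative execution. The following is a simplified sketch of that idea, not the verbatim JobQueueTaskScheduler source (in particular, the real method also skips padding on very small clusters and synchronizes on the job queue):

// Simplified sketch of the padding check (assumes the padFraction and
// jobQueueJobInProgressListener fields of JobQueueTaskScheduler).
private boolean exceededPadding(boolean isMapTask,
                                ClusterStatus clusterStatus,
                                int maxTaskTrackerSlots) {
  int totalTasks = isMapTask ? clusterStatus.getMapTasks()
                             : clusterStatus.getReduceTasks();
  int totalTaskCapacity = isMapTask ? clusterStatus.getMaxMapTasks()
                                    : clusterStatus.getMaxReduceTasks();
  int totalNeededTasks = 0;
  for (JobInProgress job : jobQueueJobInProgressListener.getJobQueue()) {
    if (job.getStatus().getRunState() != JobStatus.RUNNING) {
      continue;
    }
    totalNeededTasks += isMapTask ? job.desiredMaps() : job.desiredReduces();
    // reserve at most one tracker's worth of slots, proportional to demand
    int padding = Math.min(maxTaskTrackerSlots,
                           (int) (totalNeededTasks * padFraction));
    if (totalTasks + padding >= totalTaskCapacity) {
      return true; // cluster nearly full: stop assigning beyond this job
    }
  }
  return false;
}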
The obtainNewLocalMapTask method:
public synchronized Task obtainNewLocalMapTask(TaskTrackerStatus tts,
int clusterSize,
int numUniqueHosts)
throws IOException {
if (!tasksInited.get()) {
LOG.info("Cannot create task split for " + profile.getJobID());
return null;
}
int target = findNewMapTask(tts, clusterSize, numUniqueHosts, maxLevel,
status.mapProgress());
if (target == -1) {
return null;
}
Task result = maps[target].getTaskToRun(tts.getTrackerName());
//The maps array caches the TaskInProgress objects; getTaskToRun returns a Task, concretely a MapTask or a ReduceTask
if (result != null) {
addRunningTaskToTIP(maps[target], result.getTaskID(), tts, true);
}
return result;
}
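findNewMapTask, shown next, walks the non-running-task cache from the TaskTracker's own node up through its parents in the network topology. That cache is the Map<Node, List<TaskInProgress>> mentioned at the beginning, built by createCache during job initialization. A simplified sketch of how it is populated follows (not the verbatim JobInProgress.createCache; splits, maps, maxLevel and jobtracker stand for the corresponding JobInProgress fields, and the real code also resolves unknown hosts into the topology and de-duplicates entries at the rack level):

// Simplified sketch: register every TaskInProgress under the node holding its
// split and under that node's ancestors, up to maxLevel levels.
Map<Node, List<TaskInProgress>> cache =
    new IdentityHashMap<Node, List<TaskInProgress>>();
for (int i = 0; i < splits.length; i++) {
  for (String host : splits[i].getLocations()) {   // hosts that store this split
    Node node = jobtracker.getNode(host);          // host -> topology node
    for (int level = 0; level < maxLevel; ++level) {
      List<TaskInProgress> tips = cache.get(node);
      if (tips == null) {
        tips = new ArrayList<TaskInProgress>();
        cache.put(node, tips);
      }
      tips.add(maps[i]);
      node = node.getParent();                     // e.g. host -> rack -> ...
    }
  }
}
// A lookup on the tracker's own node therefore yields node-local TIPs, a lookup
// on its parent yields rack-local TIPs, and so on, which is exactly the
// bottom-up search findNewMapTask performs.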
The findNewMapTask method:
It returns the index into the maps[] array of the task closest to the TaskTracker.
private synchronized int findNewMapTask(final TaskTrackerStatus tts,final int clusterSize,
final int numUniqueHosts,
final int maxCacheLevel,
final double avgProgress) {
if (numMapTasks == 0) {
LOG.info("No maps to schedule for " + profile.getJobID());
return -1;
}
String taskTracker = tts.getTrackerName();
TaskInProgress tip = null;
//
// Update the last-known clusterSize
//
this.clusterSize = clusterSize;
if (!shouldRunOnTaskTracker(taskTracker)) {
return -1;
}
// Check to ensure this TaskTracker has enough resources to
// run tasks from this job
long outSize = resourceEstimator.getEstimatedMapOutputSize();
long availSpace = tts.getResourceStatus().getAvailableSpace();
if(availSpace < outSize) {
LOG.warn("No room for map task. Node " + tts.getHost() +
" has " + availSpace +
" bytes free; but we expect map to take " + outSize);
return -1; //see if a different TIP might work better.
}
// For scheduling a map task, we have two caches and a list (optional)
// I) one for non-running task
// II) one for running task (this is for handling speculation)
// III) a list of TIPs that have empty locations (e.g., dummy splits),
// the list is empty if all TIPs have associated locations
// First a look up is done on the non-running cache and on a miss, a look
// up is done on the running cache. The order for lookup within the cache:
// 1. from local node to root [bottom up]
// 2. breadth wise for all the parent nodes at max level
// We fall to linear scan of the list (III above) if we have misses in the
// above caches
Node node = jobtracker.getNode(tts.getHost());
//
// I) Non-running TIP :
//
// 1. check from local node to the root [bottom up cache lookup]
// i.e if the cache is available and the host has been resolved
// (node!=null)
if (node != null) {
Node key = node;
int level = 0;
// maxCacheLevel might be greater than this.maxLevel if findNewMapTask is
// called to schedule any task (local, rack-local, off-switch or speculative),
// or it might be NON_LOCAL_CACHE_LEVEL (i.e. -1) if findNewMapTask is to
// only schedule off-switch/speculative tasks
int maxLevelToSchedule = Math.min(maxCacheLevel, maxLevel);
for (level = 0;level < maxLevelToSchedule; ++level) {
List <TaskInProgress> cacheForLevel = nonRunningMapCache.get(key);
if (cacheForLevel != null) {
tip = findTaskFromList(cacheForLevel, tts,
numUniqueHosts,level == 0);
if (tip != null) {
// Add to running cache
scheduleMap(tip);
// remove the cache if its empty
if (cacheForLevel.size() == 0) {
nonRunningMapCache.remove(key);
}
return tip.getIdWithinJob();
}
}
key = key.getParent();
}
// Check if we need to only schedule a local task (node-local/rack-local)
if (level == maxCacheLevel) {
return -1;
}
}
//2. Search breadth-wise across parents at max level for non-running
// TIP if
// - cache exists and there is a cache miss
// - node information for the tracker is missing (tracker's topology
// info not obtained yet)
// collection of node at max level in the cache structure
Collection<Node> nodesAtMaxLevel = jobtracker.getNodesAtMaxLevel();
// get the node parent at max level
Node nodeParentAtMaxLevel =
(node == null) ? null : JobTracker.getParentNode(node, maxLevel - 1);
for (Node parent : nodesAtMaxLevel) {
// skip the parent that has already been scanned
if (parent == nodeParentAtMaxLevel) {
continue;
}