MapReduce Data Flow and Execution Flow
The MapReduce data flow (a toy sketch follows this list):
- The input files are loaded up front (each map task reads its split locally where possible)
- The map phase processes the input and produces intermediate results
- The shuffle phase routes intermediate results with the same key to the same node for processing
- The reduce phase processes them and produces the result output
- The result output is saved to HDFS
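In miniature, the same flow can be reproduced in a single process. The sketch below is a toy illustration only (plain Java, no Hadoop; the class and all names are invented), with a sorted map standing in for the shuffle's partition/sort/merge step:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy single-process model of the flow above: "map" emits (key, value)
// pairs, "shuffle" groups equal keys together, "reduce" folds each group.
public class MiniFlow {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("a b", "b c", "a c");
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        // map + shuffle: emit (word, 1) for every word, grouped by key
        for (String line : input)
            for (String word : line.split(" "))
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        // reduce: (word, [1, 1, ...]) -> (word, count)
        grouped.forEach((word, ones) ->
                System.out.println(word + "\t" + ones.size()));
    }
}
```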
A concrete example: the code below is the SecondarySort example shipped with the Hadoop MapReduce examples, slightly rewritten and annotated; the execution flow it exercises is walked through stage by stage afterwards.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * This is an example Hadoop Map/Reduce application.
 * It reads text input files that must contain two integers per line.
 * The output is sorted by the first and second number and grouped on the
 * first number.
 *
 * To run: bin/hadoop jar build/hadoop-examples.jar secondarysort in-dir out-dir
 */
public class SecondarySort {

    /**
     * Define a pair of integers that are writable.
     * They are serialized in a byte-comparable format.
     */
    public static class IntPair implements WritableComparable<IntPair> {
        private int first = 0;
        private int second = 0;

        /** Set the left and right values. */
        public void set(int left, int right) {
            first = left;
            second = right;
        }

        public int getFirst() { return first; }

        public int getSecond() { return second; }

        /**
         * Read the two integers.
         * Encoded as: MIN_VALUE -> 0, 0 -> -MIN_VALUE, MAX_VALUE -> -1
         */
        @Override
        public void readFields(DataInput in) throws IOException {
            first = in.readInt() + Integer.MIN_VALUE;
            second = in.readInt() + Integer.MIN_VALUE;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(first - Integer.MIN_VALUE);
            out.writeInt(second - Integer.MIN_VALUE);
        }

        // hashCode() is used by the HashPartitioner,
        // the default partitioner in MapReduce.
        @Override
        public int hashCode() {
            return first * 157 + second;
        }

        @Override
        public boolean equals(Object right) {
            if (right instanceof IntPair) {
                IntPair r = (IntPair) right;
                return r.first == first && r.second == second;
            } else {
                return false;
            }
        }

        /** A Comparator that compares serialized IntPair. */
        public static class Comparator extends WritableComparator {
            public Comparator() {
                super(IntPair.class);
            }

            // Compares keys; called many times during sorting.
            @Override
            public int compare(byte[] b1, int s1, int l1,
                               byte[] b2, int s2, int l2) {
                return compareBytes(b1, s1, l1, b2, s2, l2);
            }
        }

        static {
            // Note: without this registration, keys would be compared
            // with key.compareTo() instead. Register this comparator:
            WritableComparator.define(IntPair.class, new Comparator());
        }

        // Used for key comparison if no WritableComparator is registered.
        @Override
        public int compareTo(IntPair o) {
            if (first != o.first) {
                return first < o.first ? -1 : 1;
            } else if (second != o.second) {
                return second < o.second ? -1 : 1;
            } else {
                return 0;
            }
        }
    }

    /** Partition based on the first part of the pair. */
    public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
        @Override
        public int getPartition(IntPair key, IntWritable value, int numPartitions) {
            return Math.abs(key.getFirst() * 127) % numPartitions;
        }
    }

    /**
     * Compare only the first part of the pair, so that reduce is called once
     * for each value of the first part.
     */
    public static class FirstGroupingComparator implements RawComparator<IntPair> {
        // Called on serialized keys, many times, during grouping.
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return WritableComparator.compareBytes(b1, s1, Integer.SIZE / 8,
                    b2, s2, Integer.SIZE / 8);
        }

        // Not observed to be called while monitoring; its purpose is unclear.
        @Override
        public int compare(IntPair o1, IntPair o2) {
            int l = o1.getFirst();
            int r = o2.getFirst();
            return l == r ? 0 : (l < r ? -1 : 1);
        }
    }

    /**
     * Read two integers from each line and generate a key, value pair
     * as ((left, right), right).
     */
    public static class MapClass extends Mapper<LongWritable, Text, IntPair, IntWritable> {
        private final IntPair key = new IntPair();
        private final IntWritable value = new IntWritable();

        @Override
        public void map(LongWritable inKey, Text inValue, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(inValue.toString());
            int left = 0;
            int right = 0;
            if (itr.hasMoreTokens()) {
                left = Integer.parseInt(itr.nextToken());
                if (itr.hasMoreTokens()) {
                    right = Integer.parseInt(itr.nextToken());
                }
                key.set(left, right);
                value.set(right);
                context.write(key, value);
            }
        }
    }

    /**
     * A reducer class that writes a separator line for each new value of the
     * first integer, then emits every value of the group.
     */
    public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable> {
        private static final Text SEPARATOR =
                new Text("------------------------------------------------");
        private final Text first = new Text();

        @Override
        public void reduce(IntPair key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(SEPARATOR, null);
            first.set(Integer.toString(key.getFirst()));
            for (IntWritable value : values) {
                context.write(first, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hard-coded test paths; these stand in for command-line arguments.
        String[] ars = new String[] { "hdfs://data2.kt:8020/test/input",
                "hdfs://data2.kt:8020/test/output" };
        conf.set("fs.default.name", "hdfs://data2.kt:8020/");
        String[] otherArgs = new GenericOptionsParser(conf, ars).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: secondarysort <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "secondary sort");
        job.setJarByClass(SecondarySort.class);
        job.setMapperClass(MapClass.class);
        // No Combiner: a Combiner's output type <Text, IntWritable> would not
        // match the reducer's input type <IntPair, IntWritable>.
        // job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        // Partitioning function.
        job.setPartitionerClass(FirstPartitioner.class);
        // setSortComparatorClass: after partitioning, every partition is sorted
        // with the configured key comparator; the reducer also uses it to sort
        // all the map outputs it receives.
        // job.setSortComparatorClass(GroupingComparator2.class);
        // Grouping function.
        job.setGroupingComparatorClass(FirstGroupingComparator.class);
        // The map output is <IntPair, IntWritable>; a custom key type must be
        // declared explicitly with setMapOutputKeyClass.
        job.setMapOutputKeyClass(IntPair.class);
        // job.setMapOutputValueClass(IntWritable.class);
        // The reduce output is <Text, IntWritable>.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
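For instance, with a hypothetical input file like the following and a single reducer, the output is grouped on the first integer and sorted on the second (the separator lines come from the Reduce class above):

```
4 3
4 1
3 2
1 5
1 3
```

```
------------------------------------------------
1	3
1	5
------------------------------------------------
3	2
------------------------------------------------
4	1
4	3
```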
Map
In the map phase, the InputFormat defined with job.setInputFormatClass splits the input data set into small chunks (splits), and also supplies a RecordReader implementation. The default is TextInputFormat, whose RecordReader uses the byte offset of each text line as the key and the line itself as the value. That is why the custom Mapper's input type is <LongWritable, Text>. The framework then calls the custom Mapper's map method, feeding it the <LongWritable, Text> key/value pairs one at a time. The result is a list of pairs typed by the Mapper's declared output key and value classes, here <IntPair, IntWritable>.
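The default input format needs no configuration, but making it explicit shows where the mapper's input pairs come from (a sketch; the offsets assume the split begins with these two lines):

```java
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
// Its RecordReader feeds the mapper (byte offset, line text) pairs, e.g.
// for a split starting with the lines "12 7" and "3 42":
//   map(new LongWritable(0), new Text("12 7"), context)
//   map(new LongWritable(5), new Text("3 42"), context)
```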
Partitioner
At the end of the map phase, the class set with job.setPartitionerClass first partitions this list; each partition maps to one reducer. Within each partition, the records are then sorted by the key comparator class set with job.setSortComparatorClass. This is, in itself, already a secondary sort. If no key comparator class has been set with job.setSortComparatorClass, the compareTo method implemented by the key is used instead.
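When no partitioner class is set, Hadoop falls back to HashPartitioner, whose logic is essentially the following (shown as a stand-in class DefaultLikePartitioner so it does not clash with the real one):

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default HashPartitioner: mask off the sign bit of the
// key's hash, then take it modulo the number of reduce tasks.
public class DefaultLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```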
Shuffle
Each partition is shipped, according to the partitioning rules, to the reducer that will process it.
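The number of partitions is simply the number of reduce tasks configured on the job, so getPartition must return a value in [0, numPartitions). For example (the count 4 is arbitrary):

```java
job.setNumReduceTasks(4); // FirstPartitioner above then spreads keys over 4 reducers
```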
Sort
In the reduce phase, once a reducer has received all the map outputs mapped to it, it again sorts all the key/value pairs with the key comparator class set with job.setSortComparatorClass. It then builds one value iterator per key, and this is where grouping applies: using the grouping comparator class set with job.setGroupingComparatorClass, any two keys that this comparator considers equal belong to the same group, and their values go into the same value iterator.
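To see the grouping rule in isolation: under FirstGroupingComparator, two keys that differ only in the second integer compare as equal, so their values end up in one iterator. A minimal sketch, assuming it sits in the same package as the SecondarySort class above:

```java
public class GroupingDemo {
    public static void main(String[] args) {
        SecondarySort.IntPair a = new SecondarySort.IntPair();
        SecondarySort.IntPair b = new SecondarySort.IntPair();
        a.set(1, 3);
        b.set(1, 5);
        // Prints 0: (1,3) and (1,5) land in the same reduce group, so
        // reduce() sees key group "1" with the value iterator [3, 5].
        System.out.println(new SecondarySort.FirstGroupingComparator().compare(a, b));
    }
}
```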
Reduce
Finally, the Reducer's reduce method is invoked; its input is each key together with its value iterator. As before, the input and output types must match the ones declared on the custom Reducer.
具體的例子:
是hadoop mapreduce example中的例子,自己改寫了一下并加入的注釋
import java.io.DataInput; import java.io.DataOutput; import java.io.IOException; import java.util.StringTokenizer;import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.RawComparator; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.WritableComparable; import org.apache.hadoop.io.WritableComparator; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Partitioner; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.util.GenericOptionsParser;
import com.catt.cdh.mr.example.SecondarySort2.FirstPartitioner; import com.catt.cdh.mr.example.SecondarySort2.Reduce;
/**
}</intpair,></text,></out></in></intwritable></intpair,></longwritable,></intpair></intpair,></intpair></pre>