Mahout Quick Start Tutorial


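Mahout is a powerful data mining tool: a collection of distributed machine learning algorithms that includes Taste (a distributed collaborative filtering implementation), classification, clustering, and more. Mahout's main advantage is that it is built on Hadoop. Many algorithms that previously ran on a single machine have been converted to the MapReduce model, which greatly increases both the amount of data they can handle and their processing performance.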

I. Installing and configuring Mahout

1. Download and extract Mahout
http://archive.apache.org/dist/mahout/
tar -zxvf mahout-distribution-0.9.tar.gz
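
Before setting MAHOUT_HOME it can help to confirm which directory the archive unpacks into. The following is a minimal sketch, assuming the archive has already been downloaded to the current directory; archive_top_dir is a hypothetical helper name, not part of Mahout.

```shell
#!/usr/bin/env bash
# archive_top_dir is a hypothetical helper: it prints the top-level
# directory stored in a .tar.gz archive without extracting it.
archive_top_dir() {
  # -t lists entries, -z handles gzip, -f names the archive file.
  # awk keeps only the first path component of the first entry.
  tar -tzf "$1" | awk -F/ 'NR == 1 { print $1 }'
}

# Only runs when the archive is present in the current directory.
if [ -f mahout-distribution-0.9.tar.gz ]; then
  echo "Archive unpacks into: $(archive_top_dir mahout-distribution-0.9.tar.gz)"
fi
```

The printed directory name is the one MAHOUT_HOME should end with in the next step.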

2. Configure environment variables

# set mahout environment

export MAHOUT_HOME=/mnt/jediael/mahout/mahout-distribution-0.9
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export PATH=$MAHOUT_HOME/conf:$MAHOUT_HOME/bin:$PATH
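
A quick sanity check of these variables can catch a mistyped path early. The sketch below is only an illustration: check_mahout_env is a hypothetical helper, and it assumes the launcher script lives at $MAHOUT_HOME/bin/mahout, as the PATH setting above implies.

```shell
#!/usr/bin/env bash
# check_mahout_env is a hypothetical helper, not part of Mahout.
# It returns 0 when the exported variables point at real locations.
check_mahout_env() {
  # MAHOUT_HOME must be set and point at an existing directory.
  if [ -z "${MAHOUT_HOME:-}" ] || [ ! -d "$MAHOUT_HOME" ]; then
    echo "MAHOUT_HOME is unset or not a directory" >&2
    return 1
  fi
  # The launcher script is assumed to be $MAHOUT_HOME/bin/mahout.
  if [ ! -x "$MAHOUT_HOME/bin/mahout" ]; then
    echo "no executable launcher at $MAHOUT_HOME/bin/mahout" >&2
    return 1
  fi
  # Fall back to $MAHOUT_HOME/conf when MAHOUT_CONF_DIR is not set.
  if [ ! -d "${MAHOUT_CONF_DIR:-$MAHOUT_HOME/conf}" ]; then
    echo "MAHOUT_CONF_DIR does not exist" >&2
    return 1
  fi
  echo "Mahout environment looks consistent: $MAHOUT_HOME"
}
```

After sourcing the exports, calling `check_mahout_env || echo "fix the paths first"` reports the first problem it finds.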

3. Install Mahout
[jediael@master mahout-distribution-0.9]$ pwd
/mnt/jediael/mahout/mahout-distribution-0.9
[jediael@master mahout-distribution-0.9]$ mvn install

4. Verify that Mahout was installed successfully
Run the mahout command. If it lists the available algorithms, the installation succeeded:

    [jediael@master mahout-distribution-0.9]$ mahout  
    Running on hadoop, using /mnt/jediael/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=  
    MAHOUT-JOB: /mnt/jediael/mahout/mahout-distribution-0.9/examples/target/mahout-examples-0.9-job.jar  
    An example program must be given as the first argument.  
    Valid program names are:  
      arff.vector: : Generate Vectors from an ARFF file or directory  
      baumwelch: : Baum-Welch algorithm for unsupervised HMM training  
      canopy: : Canopy clustering  
      cat: : Print a file or resource as the logistic regression models would see it  
      cleansvd: : Cleanup and verification of SVD output  
      clusterdump: : Dump cluster output to text  
      clusterpp: : Groups Clustering Output In Clusters  
      cmdump: : Dump confusion matrix in HTML or text formats  
      concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix  
      cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)  
      cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.  
      evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes  
      fkmeans: : Fuzzy K-means clustering  
      hmmpredict: : Generate random sequence of observations by given HMM  
      itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering  
      kmeans: : K-means clustering  
      lucene.vector: : Generate Vectors from a Lucene index  
      lucene2seq: : Generate Text SequenceFiles from a Lucene index  
      matrixdump: : Dump matrix in CSV format  
      matrixmult: : Take the product of two matrices  
      parallelALS: : ALS-WR factorization of a rating matrix  
      qualcluster: : Runs clustering experiments and summarizes results in a CSV  
      recommendfactorized: : Compute recommendations using the factorization of a rating matrix  
      recommenditembased: : Compute recommendations using item-based collaborative filtering  
      regexconverter: : Convert text files on a per line basis based on regular expressions  
      resplit: : Splits a set of SequenceFiles into a number of equal splits  
      rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}  
      rowsimilarity: : Compute the pairwise similarities of the rows of a matrix  
      runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model  
      runlogistic: : Run a logistic regression model against CSV data  
      seq2encoded: : Encoded Sparse Vector generation from Text sequence files  
      seq2sparse: : Sparse Vector generation from Text sequence files  
      seqdirectory: : Generate sequence files (of Text) from a directory  
      seqdumper: : Generic Sequence File dumper  
      seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives  
      seqwiki: : Wikipedia xml dump to sequence file  
      spectralkmeans: : Spectral k-means clustering  
      split: : Split Input data into test and train sets  
      splitDataset: : split a rating dataset into training and probe parts  
      ssvd: : Stochastic SVD  
      streamingkmeans: : Streaming k-means clustering  
      svd: : Lanczos Singular Value Decomposition  
      testnb: : Test the Vector-based Bayes classifier  
      trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model  
      trainlogistic: : Train a logistic regression using stochastic gradient descent  
      trainnb: : Train the Vector-based Bayes classifier  
      transpose: : Take the transpose of a matrix  
      validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set  
      vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors  
      vectordump: : Dump vectors from a sequence file to text  
      viterbi: : Viterbi decoding of hidden states from given output states sequence  
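
Rather than scanning the list by eye, the check can be scripted. This is a minimal sketch: has_mahout_program is a hypothetical helper that reads the listing on stdin, and it assumes each entry keeps the "name: : description" layout shown above.

```shell
#!/usr/bin/env bash
# has_mahout_program is a hypothetical helper: it reads the launcher's
# program listing on stdin and succeeds only if NAME is listed exactly.
has_mahout_program() {
  # Listing lines look like "  kmeans: : K-means clustering", so the
  # first whitespace-separated field is the program name plus a colon.
  awk -v n="$1:" '$1 == n { found = 1 } END { exit !found }'
}

# Only runs where the mahout launcher is on PATH. Output is merged with
# 2>&1 because the listing may be written to either stream.
if command -v mahout >/dev/null 2>&1; then
  mahout 2>&1 | has_mahout_program kmeans && echo "kmeans driver is available"
fi
```

The comparison is an exact match on the name, so asking for kmeans does not accidentally match fkmeans or streamingkmeans.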

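II. Verifying Mahout with a simple example
1. Start Hadoop
2. Download the test data
Download synthetic_control.data from http://archive.ics.uci.edu/ml/databases/synthetic_control/
The sample dataset is also easy to find with a web search.
3. Upload the test data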
hadoop fs -put synthetic_control.data testdata
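
A truncated download is an easy mistake to miss, so it may be worth checking the file's shape before uploading it. The sketch below assumes the dataset matches its UCI description (600 time series of 60 values each, whitespace-separated); check_synthetic_control is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_synthetic_control is a hypothetical helper: it succeeds only if
# the file has exactly 600 rows and every row has exactly 60 fields.
check_synthetic_control() {
  awk 'NF != 60 { bad++ } END { exit !(NR == 600 && bad == 0) }' "$1"
}

# Only runs where hadoop is installed and the data file is present.
# The upload happens only when the shape check passes.
if command -v hadoop >/dev/null 2>&1 && [ -f synthetic_control.data ]; then
  check_synthetic_control synthetic_control.data &&
    hadoop fs -put synthetic_control.data testdata
fi
```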
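4. Run Mahout's k-means clustering algorithm with the following command:
mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
In this run the clustering took about 9 minutes to complete.
5. View the clustering results
Run hadoop fs -ls output to list the clustering results. The relative path resolves to the current user's HDFS home directory, here /user/jediael/output: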
[jediael@master mahout-distribution-0.9]$ hadoop fs -ls output  
Found 15 items  
-rw-r--r--   2 jediael supergroup        194 2015-03-07 15:07 /user/jediael/output/_policy  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusteredPoints  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-0  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-1  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-10-final  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-2  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-3  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-4  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-5  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-6  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-7  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:06 /user/jediael/output/clusters-8  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-9  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/data  
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/random-seeds 
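
The directory whose name ends in -final holds the last iteration's clusters. As a rough sketch, the listing can be filtered for that directory and passed to the clusterdump program from the list in section I. final_clusters_dir is a hypothetical helper, and the local output file name clusters.txt is an arbitrary choice.

```shell
#!/usr/bin/env bash
# final_clusters_dir is a hypothetical helper: it reads `hadoop fs -ls`
# output on stdin and prints the path whose name ends in "-final".
final_clusters_dir() {
  # The path is the last whitespace-separated field of each listing line.
  awk '$NF ~ /\/clusters-[0-9]+-final$/ { print $NF }'
}

# Only runs where both hadoop and mahout are on PATH.
if command -v hadoop >/dev/null 2>&1 && command -v mahout >/dev/null 2>&1; then
  final_dir=$(hadoop fs -ls output | final_clusters_dir)
  if [ -n "$final_dir" ]; then
    # -i: cluster directory, -p: clustered points, -o: local text output.
    mahout clusterdump -i "$final_dir" -p output/clusteredPoints -o clusters.txt
  fi
fi
```

With the listing above, the helper selects /user/jediael/output/clusters-10-final.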

Source: http://blog.csdn.net/jediael_lu/article/details/44117367
