Rapid Machine Learning Prototyping: Ramp


Ramp is a Python library for rapidly prototyping machine learning solutions. It is a lightweight, pandas-based machine learning framework that plugs into existing Python machine learning and statistics tools (such as scikit-learn and rpy2). Ramp provides a simple, declarative syntax for exploring features, algorithms, and transformations quickly and efficiently.

Why Ramp?

  • Clean, declarative syntax

  • Complex feature transformations

    Chain and combine features:

    Normalize(Log('x'))
    Interactions([Log('x1'), (F('x2') + F('x3')) / 2])

    Reduce feature dimension:

    DimensionReduction([F('x%d'%i) for i in range(100)], decomposer=PCA(n_components=3))

    Incorporate residuals or predictions to blend with other models:

    Residuals(simple_model_def) + Predictions(complex_model_def)
  • Data context awareness

    Any feature that uses the target ("y") variable will automatically respect the current training and test sets. Similarly, preparation data (a feature's mean and stdev, for example) is stored and tracked between data contexts; the sketch after this list illustrates the idea in plain pandas.

  • Composability

    All features, estimators, and their fits are composable, pluggable and storable.

  • Easy extensibility

    Ramp has a simple API, allowing you to plug in estimators from scikit-learn, rpy2 and elsewhere, or easily build your own feature transformations, metrics, feature selectors, reporters, or estimators. A small variation at the end of the Iris example below sketches adding another scikit-learn estimator to the same cross-validation run.
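
To make the data-context idea concrete without leaning on Ramp's own classes, here is a minimal pandas-only sketch (the column name and the train/test split are invented for illustration): the statistics a Normalize-style feature needs are computed on the training rows only, then reused unchanged when the feature is built for the test rows, so preparation data never leaks out of the training context.

import pandas as pd

# toy frame standing in for a data context: first four rows are "train", last two are "test"
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]})
train_idx, test_idx = df.index[:4], df.index[4:]

# preparation step: statistics come from the training rows only
mean, std = df.loc[train_idx, 'x'].mean(), df.loc[train_idx, 'x'].std()

# the stored statistics are reused verbatim for the test rows,
# so the test set never influences the preparation data
train_normed = (df.loc[train_idx, 'x'] - mean) / std
test_normed = (df.loc[test_idx, 'x'] - mean) / std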

Quick Start

Getting started with Ramp: Classifying insults

Or, the quintessential Iris example:

import pandas
from ramp import *
import urllib2
import sklearn
# the model list below references these scikit-learn submodules by attribute,
# so import them explicitly
from sklearn import decomposition, ensemble, linear_model


# fetch and clean iris data from UCI
data = pandas.read_csv(urllib2.urlopen(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"))
data = data.drop([149]) # bad line
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
data.columns = columns


# all features
features = [FillMissing(f, 0) for f in columns[:-1]]

# features, log transformed features, and interaction terms
expanded_features = (
    features +
    [Log(F(f) + 1) for f in features] +
    [
        F('sepal_width') ** 2,
        combo.Interactions(features),
    ]
)


# Define several models and feature sets to explore,
# run 5 fold cross-validation on each and print the results.
# We define 2 models and 4 feature sets, so this will be
# 4 * 2 = 8 models tested.
shortcuts.cv_factory(
    data=data,

    target=[AsFactor('class')],
    metrics=[
        [metrics.GeneralizedMCC()],
        ],
    # report feature importance scores from Random Forest
    reporters=[
        [reporters.RFImportance()],
        ],

    # Try out two algorithms
    model=[
        sklearn.ensemble.RandomForestClassifier(
            n_estimators=20),
        sklearn.linear_model.LogisticRegression(),
        ],

    # and 4 feature sets
    features=[
        expanded_features,

        # Feature selection
        [trained.FeatureSelector(
            expanded_features,
            # use random forest's importance to trim
            selectors.RandomForestSelector(classifier=True),
            target=AsFactor('class'), # target to use
            n_keep=5, # keep top 5 features
            )],

        # Reduce feature dimension (pointless on this dataset)
        [combo.DimensionReduction(expanded_features,
                            decomposer=decomposition.PCA(n_components=4))],

        # Normalized features
        [Normalize(f) for f in expanded_features],
    ]
)
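
Because cv_factory just takes lists of estimators and feature sets, extending the comparison is mostly a matter of growing those lists. The variation below is a sketch, not part of the original example: it reuses data, features, and expanded_features from above, adds scikit-learn's SVC with default (untuned) parameters, and drops the RFImportance reporter since a support vector machine exposes no forest-style importances.

from sklearn import svm

shortcuts.cv_factory(
    data=data,

    target=[AsFactor('class')],
    metrics=[
        [metrics.GeneralizedMCC()],
        ],

    # three estimators this time
    model=[
        sklearn.ensemble.RandomForestClassifier(n_estimators=20),
        sklearn.linear_model.LogisticRegression(),
        svm.SVC(),
        ],

    # compare the raw features against the expanded set
    features=[
        features,
        expanded_features,
    ]
)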

Project homepage: http://www.baiduhome.net/lib/view/home/1404267926436
