How to Choose a Machine Learning Algorithm
How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “good enough” algorithm for your problem, or a place to start, here are some general guidelines I’ve found to work well over the years.
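To make that concrete, here’s a minimal sketch of the try-several-and-cross-validate approach, assuming scikit-learn in Python (the library, the candidate models, and the synthetic dataset are my choices for illustration, not something prescribed above):

```python
# Score a few candidate classifiers by 5-fold cross-validation and keep
# the best. Dataset is synthetic; swap in your own X and y.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

In a real search you’d also sweep each model’s parameters (e.g., with GridSearchCV), per the parenthetical above.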
How large is your training set?
If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren’t powerful enough to provide accurate models.
You can also think of this as a generative model vs. discriminative model distinction.
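One quick way to see this for yourself, assuming scikit-learn and a synthetic dataset (both my choices for the sketch): train a high-bias Naive Bayes and a low-bias kNN on growing slices of the same training set and compare test accuracy. NB typically leads at small sizes, and kNN catches up or overtakes it as the training set grows.

```python
# Compare a high-bias classifier (Naive Bayes) with a low-bias one (kNN)
# at several training-set sizes, scoring on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for n in (50, 200, 1000, len(X_tr)):
    nb = GaussianNB().fit(X_tr[:n], y_tr[:n])
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:n], y_tr[:n])
    print(f"n={n:5d}  NB={nb.score(X_te, y_te):.3f}  kNN={knn.score(X_te, y_te):.3f}")
```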
Advantages of some particular algorithms
Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).
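For a feel of how little machinery NB needs, here’s a minimal text-classification sketch, again assuming scikit-learn; the four-document corpus and its labels are invented for illustration:

```python
# Naive Bayes really is "a bunch of counts": MultinomialNB over raw word
# counts. Each word contributes independently to the score, which is
# exactly why it can't model "these two features together" interactions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["great acting great plot", "terrible plot", "great film", "terrible acting"]
labels = [1, 0, 1, 0]  # 1 = liked, 0 = disliked

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["great plot terrible acting"]))
```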
Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
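Here’s a hedged sketch of those perks, regularization, probability outputs, and online updates, assuming scikit-learn and synthetic data; SGDClassifier with logistic loss is one way to get the online-gradient-descent behavior mentioned above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Batch fit with L2 regularization; smaller C means stronger regularization.
clf = LogisticRegression(penalty="l2", C=0.5, max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:2]))  # class probabilities, not just labels

# Online variant: partial_fit folds in new batches as they arrive.
# (loss="log_loss" in scikit-learn >= 1.1; older versions call it "log".)
online = SGDClassifier(loss="log_loss")
online.partial_fit(X[:500], y[:500], classes=np.unique(y))
online.partial_fit(X[500:], y[500:])  # update with new data, no full refit
```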
Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe); they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
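The A-at-the-ends, B-in-the-middle example is easy to reproduce; here’s a small sketch under the same scikit-learn assumption, with a one-feature dataset invented for the purpose:

```python
# Class A at both ends of feature x, class B in the middle: not linearly
# separable on x, but a random forest carves it up with axis splits.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=(600, 1))
y = ((x[:, 0] < 1) | (x[:, 0] > 2)).astype(int)  # A (=1) at ends, B (=0) in middle

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x, y)
print(forest.predict([[0.5], [1.5], [2.5]]))  # expect [1 0 1]
```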
Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.
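A quick illustration of the kernel point, assuming scikit-learn and a toy two-circles dataset: the classes aren’t linearly separable in the raw 2-D space, but an RBF kernel handles them easily.

```python
# Two concentric circles: a linear SVM scores near chance, while an
# RBF-kernel SVM separates them almost perfectly.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

print("linear:", SVC(kernel="linear").fit(X, y).score(X, y))  # ~0.5
print("rbf:   ", SVC(kernel="rbf").fit(X, y).score(X, y))     # ~1.0
```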
But…
Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).
And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.
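One simple reading of “choose them all”, sketched with scikit-learn (the Netflix-winning blends were far more elaborate, but the spirit is the same): a soft-voting ensemble that averages the probability estimates of several of the classifiers above.

```python
# Soft-voting ensemble over three of the classifiers discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across models
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```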