A Data Science and Machine Learning Knowledge Map and Resource Collection for Programmers

Posted by GretaColeba 8 years ago | Tags: data mining, machine learning

Table of Contents generated with DocToc

  • DataScience & Machine Learning Reference
  • Introduction & Overview
    • Collections: Resource Roundups
    • Video Courses
    • Blogs & Forum
    • Data Process
    • Machine Learning
    • Natural Language Processing
    • Deep Learning
    • Recommend System
  • CrawlerSE: Crawlers and Search Engines
    • Search Engine
  • Data Visual: Data Visualization
    • Collections: Resource Roundups
      • Cross-Disciplinary Databases and Search Engines
    • Social Network
    • Driving Data
    • Competition: Machine Learning Competitions

DataScience & Machine Learning Reference

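This article collects the resources the author gathered while learning data science. It focuses on introductory material and survey-style resources for each field and does not dig deep into the research frontier; readers who want more can follow the author's "Practical Handbook of Data Science and Machine Learning for Programmers" (程序猿的數據科學與機器學習實戰手冊). The article starts with an introductory overview of data science and machine learning, then lists resources, books, and tutorials, then gives reference articles for each specific field, and finally introduces a set of practical tools. The author's diagram of the data science and machine learning landscape follows; it is part of the author's series on programming worldview and methodology: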

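This article will be updated as the author's perspective and skills grow through study and practice. The author is not a pure machine learning or data mining researcher and approaches the subject mainly from an engineering angle, looking for the parts that can be put to use in engineering work.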

Introduction & Overview

Introduction

Machine Learning

Deep Learning

Statistics

News: Industry News

Application: Real-World Use Cases of Data Mining, Machine Learning, and Deep Learning

Resources

Collections: Resource Roundups

Books

Video Courses

Blogs & Forum

Methodology

Data Process

Machine Learning

Natural Language Processing

Deep Learning

Application

Recommend System

CrawlerSE: Crawlers and Search Engines

Crawler

Search Engine

Toolkits

Language

Python

Java

Matlab

R

ClusterComputing

Data Visual: Data Visualization

Books

Video Courses

Toolkits

Data Sets

Collections: Resource Roundups

Single Databases

Cross-Disciplinary Databases and Search Engines

Text

  • 20 Newsgroups: The text of 20,000 messages taken from 20 Usenet newsgroups, for text analysis, classification, etc. 61.6MB
  • Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more. 20GB
  • SMS Spam Collection: A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering. 204KB, http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
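
As a rough illustration of the spam-filtering task that the SMS Spam Collection entry describes, the sketch below trains a tiny multinomial naive Bayes classifier using only the Python standard library. It assumes the data is laid out as one message per line in the form label, tab, message text, with the labels `ham` and `spam`; that layout is an assumption here and should be checked against the downloaded file. The four sample messages are invented for illustration and are not taken from the dataset, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented sample in the assumed "label<TAB>message" layout.
# These messages are NOT from the real dataset.
SAMPLE = (
    "ham\tare we still meeting for lunch today\n"
    "spam\twin a free prize now claim your free cash\n"
    "ham\tcall me when you get home\n"
    "spam\tfree entry claim your prize now\n"
)


def parse(text):
    """Split each non-empty line into a (label, tokens) pair."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        label, message = line.split("\t", 1)
        rows.append((label, message.lower().split()))
    return rows


def train(rows):
    """Count messages per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, tokens in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    return doc_counts, word_counts


def classify(tokens, doc_counts, word_counts):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training messages with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][w] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


rows = parse(SAMPLE)
doc_counts, word_counts = train(rows)
print(classify("claim your free prize".split(), doc_counts, word_counts))
print(classify("are you home".split(), doc_counts, word_counts))
```

To try this on the real collection, one would read the downloaded file into a string and pass it to `parse` in place of `SAMPLE`, holding out part of the rows to measure accuracy. Whitespace tokenization is the simplest possible choice and would likely need refining for real messages.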

Social Network

Media: Audio, Video, and Images

  • Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173MB total
  • Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480KB
  • NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5GB total
  • One Million Songs: Audio features and metadata for a 10,000-song subset of the one million popular songs dataset, for recognition and classification. 1.8GB
  • Hate Speech Identification: A sampling of Twitter posts that have been judged on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66MB
  • Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138KB; use the Flickr API to get the images

Recognition

  • Human Activity Recognition with Smartphones: Sensor data for recognizing human activity (walking, sitting, etc.). 25MB, https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

Driving Data

Domain: Domain-Specific Data

Sports

  • Football Strategy: Thousands of scenarios for making the best coaching decisions. 876KB total
  • Horses for Courses: Horse-racing data for predicting race results. 19MB total
  • NBA & MLB Stats: Current and past season stats for teams and players, for fantasy sports predictions.

Medicines

Alien

  • UFO Reports: 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at nuforc.org. 14.6MB total

Foods

  • Wine Quality: Chemical properties of red and white wines (in separate files) along with quality ratings, for classification. 3 files, 343KB total

Finance

Others

Competition: Machine Learning Competitions

Career

 

Source: https://github.com/wxyyxc1992/DataScience-And-MachineLearning-Handbook-For-Coders/blob/master/DataScience-Reference.md

 
