A Data Science and Machine Learning Knowledge Map and Resource Collection for Programmers
Table of Contents generated with DocToc
- DataScience & Machine Learning Reference
- Introduction & Overview
- Resources
    - Collections: Resource Roundups
    - Video Courses
    - Blogs & Forum
- Methodology
    - Data Process: Data Processing
    - Machine Learning
    - Natural Language Processing
    - Deep Learning
- Application
    - Recommend System: Recommender Systems
    - CrawlerSE: Crawlers and Search Engines
        - Search Engine
- Data Visual: Data Visualization
- Data Sets
    - Collections: Resource Roundups
        - Cross-Disciplinary Databases and Search Engines
    - Social Network
    - Driving Data
    - Collections: Resource Roundups
- Competition: Machine Learning Competitions

DataScience & Machine Learning Reference
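This document collects all the resources the author gathered while learning data science. It focuses on introductory material and survey-style resources for each field and does not dig deeply into cutting-edge research; readers who want more can follow the author's Data Science and Machine Learning Handbook for Coders. The document begins with an introductory overview of data science and machine learning, then provides a series of resources, books, and tutorials, then lists reference articles for each specific field, and finally introduces a set of practical tools. The author's diagram of the data science and machine learning landscape belongs to the author's series on programming worldview and methodology.

This document will be improved continually as the author's perspective and skills grow through study and practice. The author is not a pure machine learning or data mining researcher and approaches the subject mainly from an engineering angle, looking for areas that can be combined with and applied to engineering work.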
Introduction & Overview
Introduction
Machine Learning
- Visual Intro To Machine Learning: an illustrated walkthrough of using decision trees to classify homes in New York and San Francisco
- A Gentle Guide to Machine Learning
- Machine Learning basics for a newbie
- What is machine learning, and how does it work?
Deep Learning
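- A Fun Overview of Machine Learning Concepts: From Multivariate Fitting and Neural Networks to Deep Learning, for Anyone Interested
- [Translation] An Intuitive Explanation of Neural Networks: a very accessible explanation of convolutional neural networks.
- Deep-Learning-Papers-Reading-Roadmap: a paper reading roadmap compiled for anyone interested in deep learning
- A Programmer's Introductory Guide to Deep Learning: from Fei Lianghong's talk at the QCon 2016 Global Software Development Conference (Shanghai).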
Statistics
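News: Industry News
Application: Real-World Use Cases of Data Mining, Machine Learning, and Deep Learning
Resources
Collections: Resource Roundups
- An Incomplete Collection of Introductory Machine Learning Resources: a themed collection from Machine Learning Daily.
- Top-down learning path: Machine Learning for Software Engineers: a path to advancing in machine learning, aimed at software engineers
Books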
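- 2014 - Data Science from Scratch
- 2012 - Li Hang: Statistical Learning Methods
- 2015 - Data Mining: The Textbook
- 2016 - Zhou Zhihua: Machine Learning
- 2012 - Machine Learning: A Probabilistic Perspective
- 2012 - An Accessible Introduction to Machine Learning (Chinese edition)
- Nanjing University, Department of Computer Science and Technology: Data Mining course
Video Courses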
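- University of Illinois at Urbana-Champaign: Text Mining and Analytics
- National Taiwan University: Machine Learning Techniques
- Stanford: Machine Learning course
- CS224d: Deep Learning for Natural Language Processing
- Unsupervised Feature Learning and Deep Learning: a tutorial series from Stanford on unsupervised feature learning and deep learning
- Xiaoxiang: Deep Learning video course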
Blogs & Forum
Methodology
Data Process: Data Processing
Machine Learning
Natural Language Processing
Deep Learning
Application
Recommend System: Recommender Systems
CrawlerSE: Crawlers and Search Engines
Crawler
Search Engine
Toolkits
Language
Python
- Jupyter: interactive programming and data presentation
- data-science-ipython-notebooks: a collection of IPython-based data science code examples
- The Open Source Data Science Masters
Java
Matlab
R
ClusterComputing
- Mahout
    - MLlib
DeepLearning: Deep Learning Toolkits
- Evaluation of Deep Learning Toolkits
- Deep learning system programming models explained through code: TensorFlow vs. CNTK
- tensorflow-playground: Play with neural networks!
- dl-docker: packages commonly used deep learning tools into a single Docker image
- deep-learning-models: Keras code and weights files for popular deep learning models.
- Top Deep Learning Projects
Data Visual: Data Visualization
Books
Video Courses
Toolkits
Data Sets
Collections: Resource Roundups
- awesome-public-datasets: An awesome list of high-quality open datasets in public domains (on-going).
- Wikimedia Dumps: downloadable data dumps of Wikimedia wikis
- Reddit Datasets: the Reddit board for discussing datasets
- Militarized Interstate Disputes: Nearly 200 years of international threats, conflicts, etc. for modelling or prediction. Includes action taken, level of hostility, fatalities, and outcomes. Multiple datasets, e.g., 962KB, 179KB. http://www.correlatesofwar.org/data-sets/MIDs
Standalone Databases
- http://archive.ics.uci.edu/ml/
- http://crawdad.org/
- http://data.austintexas.gov
- http://snap.stanford.edu/data/index.html
- http://data.cityofchicago.org
- http://data.govloop.com
- http://data.gov.uk/
- http://data.gov.in
- http://data.medicare.gov
- http://www.dados.gov.pt/pt/catalogodados/catalogodados.aspx
- http://data.sfgov.org
- http://data.sunlightlabs.com
- https://datamarket.azure.com/
- http://econ.worldbank.org/datasets
- http://gettingpastgo.socrata.com
- http://public.resource.org/
- http://timetric.com/public-data/
- http://www.bls.gov/
- http://www.crunchbase.com/
- http://www.dartmouthatlas.org/
- http://www.data.gov/
- http://www.datakc.org
- http://dbpedia.org
- http://www.factual.com/
- http://www.freebase.com/
- http://www.infochimps.com
- http://build.kiva.org/
- http://www.imdb.com/interfaces
- http://knoema.com
- http://daten.berlin.de/
- http://www.qunb.com
- http://databib.org/
- http://datacite.org/
- http://data.reegle.info/
- http://data.wien.gv.at/
- http://data.gov.bc.ca
Cross-Disciplinary Databases and Search Engines
- https://www..com/datasets
- http://usgovxml.com
- http://aws.amazon.com/datasets
- http://databib.org
- http://datacite.org
- http://figshare.com
- http://linkeddata.org
- http://thewebminer.com/
- http://thedatahub.org
- http://ckan.net
- http://quandl.com
- Open Data Inception (lists 2500+ open data sources)
Text
- 20 Newsgroups: The text from 20000 messages taken from 20 Usenet newsgroups for text analysis, classification, etc. 61.6MB
- Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more. 20GB
- SMS Spam Collection: A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering. 204KB. http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
Social Network
- http://enigma.io
- http://www.ufindthem.com/
- http://NetworkRepository.com (a machine learning data repository with interactive visual analytics)
- http://MLvis.com
- Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other using Yahoo Messenger; can be used to identify key social contacts/influencers. Add dataset to cart to access. 28MB in total.
Media: Audio, Video, and Images
- Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173MB in total
- Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480KB
- NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from image. Multiple files, over 5GB total
- One Million Songs: Audio features and metadata for a subset (10,000) of the one million popular songs dataset for recognition/classification. 1.8GB
- Hate Speech Identification: A sampling of Twitter posts that have been judged based on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66MB
- Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138KB, use Flickr API to get images
Recognition
- Human Activity Recognition with Smartphones: Sensor data for recognizing the human activity - walking, sitting, etc. 25MB. https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones
Driving Data
Domain: Domain-Specific Data
Sports
- Football Strategy: Thousands of scenarios to make the best coaching decisions. 876KB in total
- Horses for Courses: Horse-racing data for predicting race results. 19MB in total
- NBA & MLB Stats: Current and past season stats for teams and players for fantasy sports predictions.
Medicines
- National Survey on Drug Use and Health: Predict drug use based on health survey questions. 2GB in total
- Prostate Cancer: Tumor and nontumor samples, used to recognize prostate cancer. 4.8MB in total
- Record of Heart Sound: Recordings of normal and abnormal heartbeats, used to recognize heart murmur, etc. 47.7MB in total
Alien
- UFO Reports: 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at nuforc.org. 14.6MB in total.
Foods
- Wine Quality: Chemical properties of red and white wines (separately) and quality, for classification. 3 files, 343KB in total.
Finance
Others
Competition: Machine Learning Competitions
- Alibaba Tianchi: practice competition for newcomers
- Kaggle: official competitions for newcomers, a good way to get started
- Kaggle Tutorial: a complete tutorial based on the hotel recommendation competition
- Driven Data
- Innocentive
- Crowdanalytix
- Tunedit
- DataFountain: DF, a professional Chinese data competition platform designated by the CCF
Career
Source: https://github.com/wxyyxc1992/DataScience-And-MachineLearning-Handbook-For-Coders/blob/master/DataScience-Reference.md