深度學習:CNN的反向求導及練習

jopen 11年前發布 | 133K 次閱讀 深度學習 機器學習

作者:tornadomeet

前言:

CNN作為DL中最成功的模型之一,有必要對其更進一步研究它。雖然在前面的博文Stacked CNN簡單介紹中 有大概介紹過CNN的使用,不過那是有個前提的:CNN中的參數必須已提前學習好。而本文的主要目的是介紹CNN參數在使用bp算法時該怎么訓練,畢竟 CNN中有卷積層和下采樣層,雖然和MLP的bp算法本質上相同,但形式上還是有些區別的,很顯然在完成CNN反向傳播前了解bp算法是必須的。本文的實 驗部分是參考斯坦福UFLDL新教程UFLDL:Exercise: Convolutional Neural Network里面的內容。完成的是對MNIST數字的識別,采用有監督的CNN網絡,當然了,實驗很多參數結構都按照教程上給的,這里并沒有去調整。


實驗基礎:

CNN反向傳播求導時的具體過程可以參考論文Notes on Convolutional Neural Networks, Jake Bouvrie,該論文講得很全面,比如它考慮了pooling層也加入了權值、偏置值及非線性激發(因為這2種值也需要learn),對該論文的解讀可參考zouxy09的博文CNN卷積神經網絡推導和實現。 除了bp算法外,本人認為理解了下面4個子問題,基本上就可以弄懂CNN的求導了(bp算法這里就不多做介紹,網上資料實在是太多了),另外為了講清楚一 些細節過程,本文中舉的例子都是簡化了一些條件的,且linux下文本和公式編輯太難弄了,文中難免有些地方會出錯,大家原諒下。

首先我們來看看CNN系統的目標函數,設有樣本(xi, yi)共m個,CNN網絡共有L層,中間包含若干個卷積層和pooling層,最后一層的輸出為f(xi),則系統的loss表達式為(對權值進行了懲罰,一般分類都采用交叉熵形式):

深度學習:CNN的反向求導及練習


問題一:求輸出層的誤差敏感項。

現在只考慮個一個輸入樣本(x, y)的情形,loss函數和上面的公式類似是用交叉熵來表示的,暫時不考慮權值規則項,樣本標簽采用one-hot編碼,CNN網絡的最后一層采用 softmax全連接(多分類時輸出層一般用softmax),樣本(x,y)經過CNN網絡后的最終的輸出用f(x)表示,則對應該樣本的loss值 為:

深度學習:CNN的反向求導及練習

其中f(x)的下標c的含義見公式:

深度學習:CNN的反向求導及練習

因為x通過CNN后得到的輸出f(x)是一個向量,該向量的元素值都是概率值,分別代表著x被分到各個類中的概率,而f(x)中下標c的意思就是輸出向量中取對應c那個類的概率值。

采用上面的符號,可以求得此時loss值對輸出層的誤差敏感性表達式為:

深度學習:CNN的反向求導及練習

其中e(y)表示的是樣本x標簽值的one-hot表示,其中只有一個元素為1,其它都為0.

其推導過程如下(先求出對輸出層某個節點c的誤差敏感性,參考Larochelle關于DL的課件:Output layer gradient),求出輸出層中c節點的誤差敏感值:

深度學習:CNN的反向求導及練習

由上面公式可知,如果輸出層采用sfotmax,且loss用交叉熵形式,則最后一層的誤差敏感值就等于CNN網絡輸出值f(x)減樣本標簽值 e(y),即f(x)-e(y),其形式非常簡單,這個公式是不是很眼熟?很多情況下如果model采用MSE的loss,即loss=1/2* (e(y)-f(x))^2,那么loss對最終的輸出f(x)求導時其結果就是f(x)-e(y),雖然和上面的結果一樣,但是大家不要搞混淆了,這2 個含義是不同的,一個是對輸出層節點輸入值的導數(softmax激發函數),一個是對輸出層節點輸出值的導數(任意激發函數)。而在使用MSE的 loss表達式時,輸出層的誤差敏感項為(f(x)-e(y)).*f(x)’,兩者只相差一個因子。

這樣就可以求出第L層的權值W的偏導數:

深度學習:CNN的反向求導及練習

輸出層偏置的偏導數:

深度學習:CNN的反向求導及練習

上面2個公式的e(y)和f(x)是一個矩陣,已經把所有m個訓練樣本考慮進去了,每一列代表一個樣本。


問題二:當接在卷積層的下一層為pooling層時,求卷積層的誤差敏感項。

假設第l(小寫的l,不要看成數字’1’了)層為卷積層,第l+1層為pooling層,且pooling層的誤差敏感項為: 深度學習:CNN的反向求導及練習 ,卷積層的誤差敏感項為:深度學習:CNN的反向求導及練習 , 則兩者的關系表達式為:

深度學習:CNN的反向求導及練習

這里符號●表示的是矩陣的點積操作,即對應元素的乘積。卷積層和unsample()后的pooling層節點是一一對應的,所以下標都是用j表示。后面的符號深度學習:CNN的反向求導及練習表示的是第l層第j個節點處激發函數的導數(對節點輸入的導數)。

其中的函數unsample()為上采樣過程,其具體的操作得看是采用的什么pooling方法了。但unsample的大概思想 為:pooling層的每個節點是由卷積層中多個節點(一般為一個矩形區域)共同計算得到,所以pooling層每個節點的誤差敏感值也是由卷積層中多個 節點的誤差敏感值共同產生的,只需滿足兩層見各自的誤差敏感值相等,下面以mean-pooling和max-pooling為例來說明。

假設卷積層的矩形大小為4×4, pooling區域大小為2×2, 很容易知道pooling后得到的矩形大小也為2*2(本文默認pooling過程是沒有重疊的,卷積過程是每次移動一個像素,即是有重疊的,后續不再聲 明),如果此時pooling后的矩形誤差敏感值如下:

深度學習:CNN的反向求導及練習

則按照mean-pooling,首先得到的卷積層應該是4×4大小,其值分布為(等值復制):

深度學習:CNN的反向求導及練習

因為得滿足反向傳播時各層間誤差敏感總和不變,所以卷積層對應每個值需要平攤(除以pooling區域大小即可,這里pooling層大小為2×2=4)),最后的卷積層值

分布為:

深度學習:CNN的反向求導及練習

mean-pooling時的unsample操作可以使用matalb中的函數kron()來實現,因為是采用的矩陣Kronecker乘積。C=kron(A, B)表示的是矩陣B分別與矩陣A中每個元素相乘,然后將相乘的結果放在C中對應的位置。比如:

深度學習:CNN的反向求導及練習

如果是max-pooling,則需要記錄前向傳播過程中pooling區域中最大值的位置,這里假設pooling層值1,3,2,4對應的pooling區域位置分別為右下、右上、左上、左下。則此時對應卷積層誤差敏感值分布為:

深度學習:CNN的反向求導及練習

當然了,上面2種結果還需要點乘卷積層激發函數對應位置的導數值了,這里省略掉。


問題三:當接在pooling層的下一層為卷積層時,求該pooling層的誤差敏感項。

假設第l層(pooling層)有N個通道,即有N張特征圖,第l+1層(卷積層)有M個特征,l層中每個通道圖都對應有自己的誤差敏感值,其計算依據為第l+1層所有特征核的貢獻之和。下面是第l+1層中第j個核對第l層第i個通道的誤差敏感值計算方法:

深度學習:CNN的反向求導及練習

符號★表示的是矩陣的卷積操作,這是真正意義上的離散卷積,不同于卷積層前向傳播時的相關操作,因為嚴格意義上來講,卷積神經網絡中的卷積操作本質 是一個相關操作,并不是卷積操作,只不過它可以用卷積的方法去實現才這樣叫。而求第i個通道的誤差敏感項時需要將l+1層的所有核都計算一遍,然后求和。 另外因為這里默認pooling層是線性激發函數,所以后面沒有乘相應節點的導數。

舉個簡單的例子,假設拿出第l層某個通道圖,大小為3×3,第l+1層有2個特征核,核大小為2×2,則在前向傳播卷積時第l+1層會有2個大小為2×2的卷積圖。如果2個特征核分別為:

深度學習:CNN的反向求導及練習 深度學習:CNN的反向求導及練習

反向傳播求誤差敏感項時,假設已經知道第l+1層2個卷積圖的誤差敏感值:

深度學習:CNN的反向求導及練習深度學習:CNN的反向求導及練習

離散卷積函數conv2()的實現相關子操作時需先將核旋轉180度(即左右翻轉后上下翻轉),但這里實現的是嚴格意義上的卷積,所以在用 conv2()時,對應的參數核不需要翻轉(在有些toolbox里面,求這個問題時用了旋轉,那是因為它們已經把所有的卷積核都旋轉過,這樣在前向傳播 時的相關操作就不用旋轉了。并不矛盾)。且這時候該函數需要采用’full’模式,所以最終得到的矩陣大小為3×3,(其中3=2+2-1),剛好符第l 層通道圖的大小。采用’full’模式需先將第l+1層2個卷積圖擴充,周圍填0,padding后如下:

深度學習:CNN的反向求導及練習 深度學習:CNN的反向求導及練習

擴充后的矩陣和對應的核進行卷積的結果如下情況:

深度學習:CNN的反向求導及練習

深度學習:CNN的反向求導及練習

可以通過手動去驗證上面的結果,因為是離散卷積操作,而離散卷積等價于將核旋轉后再進行相關操作。而第l層那個通道的誤差敏感項為上面2者的和,呼應問題三,最終答案為:

深度學習:CNN的反向求導及練習

那么這樣問題3這樣解的依據是什么呢?其實很簡單,本質上還是bp算法,即第l層的誤差敏感值等于第l+1層的誤差敏感值乘以兩者之間的權值,只不 過這里由于是用了卷積,且是有重疊的,l層中某個點會對l+1層中的多個點有影響。比如說最終的結果矩陣中最中間那個0.3是怎么來的呢?在用2×2的核 對3×3的輸入矩陣進行卷積時,一共進行了4次移動,而3×3矩陣最中間那個值在4次移動中均對輸出結果有影響,且4次的影響分別在右下角、左下角、右上 角、左上角。所以它的值為2×0.2+1×0.1+1×0.1-1×0.3=0.3, 建議大家用筆去算一下,那樣就可以明白為什么這里可以采用帶’full’類型的conv2()實現。


問題四:求與卷積層相連那層的權值、偏置值導數。

前面3個問題分別求得了輸出層的誤差敏感值、從pooling層推斷出卷積層的誤差敏感值、從卷積層推斷出pooling層的誤差敏感值。下面需要 利用這些誤差敏感值模型中參數的導數。這里沒有考慮pooling層的非線性激發,因此pooling層前面是沒有權值的,也就沒有所謂的權值的導數了。 現在將主要精力放在卷積層前面權值的求導上(也就是問題四)。

假設現在需要求第l層的第i個通道,與第l+1層的第j個通道之間的權值和偏置的導數,則計算公式如下:

深度學習:CNN的反向求導及練習

其中符號⊙表示矩陣的相關操作,可以采用conv2()函數實現。在使用該函數時,需將第l+1層第j個誤差敏感值翻轉。

例如,如果第l層某個通道矩陣i大小為4×4,如下:

深度學習:CNN的反向求導及練習

第l+1層第j個特征的誤差敏感值矩陣大小為3×3,如下:

深度學習:CNN的反向求導及練習

很明顯,這時候的特征Kij導數的大小為2×2的,且其結果為:

深度學習:CNN的反向求導及練習

而此時偏置值bj的導數為1.2 ,將j區域的誤差敏感值相加即可(0.8+0.1-0.6+0.3+0.5+0.7-0.4-0.2=1.2),因為b對j中的每個節點都有貢獻,按照多項式的求導規則(和的導數等于導數的和)很容易得到。

為什么采用矩陣的相關操作就可以實現這個功能呢?由bp算法可知,l層i和l+1層j之間的權值等于l+1層j處誤差敏感值乘以l層i處的輸入,而 j中某個節點因為是由i+1中一個區域與權值卷積后所得,所以j處該節點的誤差敏感值對權值中所有元素都有貢獻,由此可見,將j中每個元素對權值的貢獻 (尺寸和核大小相同)相加,就得到了權值的偏導數了(這個例子的結果是由9個2×2大小的矩陣之和),同樣,如果大家動筆去推算一下,就會明白這時候為什 么可以用帶’valid’的conv2()完成此功能。


實驗總結:

  1. 卷積層過后,可以先跟pooling層,再通過非線性傳播函數。也可以是先通過非線性傳播函數再經過pooling層。
  2. CNN的結構本身就是一種規則項。
  3. 實際上每個權值的學習率不同時優化會更好。
  4. 發現Ng以前的ufldl中教程里面softmax并沒有包含偏置值參數,至少他給的start code里面沒有包含,嚴格來說是錯誤的。
  5. 當輸入樣本為多個時,bp算法中的誤差敏感性也是一個矩陣。每一個樣本都對應有自己每層的誤差敏感性。
  6. 血的教訓啊,以后循環變量名不能與終止名太相似,一不小心引用下標是就弄錯,比如for filterNum = 1:numFilters 時一不小心把下標用numFilters去代替了,花了大半天去查這個debug.

7.  matlab中conv2()函數在卷積過程中默認是每次移動一個像素,即重疊度最高的卷積。


實驗結果:

按照網頁教程UFLDL:Exercise: Convolutional Neural NetworkUFLDL:Optimization: Stochastic Gradient Descent,對MNIST數據庫進行識別,完成練習中YOU CODE HERE部分后,該CNN網絡的識別率為:

95.76%

只采用了一個卷積層+一個pooling層+一個softmax層。卷積層的特征個數為20,卷積核大小為9×9, pooling區域大小為2×2.

在進行實驗前,需下載好本實驗的標準代碼:https://github.com/amaas/stanford_dl_ex

然后在common文件夾放入下載好的MNIST數據庫,見http://yann.lecun.com/exdb/mnist/.注意MNIST文件名需要和代碼中的保持一致。


實驗代碼:

cnnTrain.m:


%% Convolution Neural Network Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started in building a single.
%  layer convolutional nerual network. In this exercise, you will only
%  need to modify cnnCost.m and cnnminFuncSGD.m. You will not need to 
%  modify this file.

%%======================================================================
%% STEP 0: Initialize Parameters and Load Data
%  Here we initialize some parameters used for the exercise.

% Configuration
imageDim = 28;
numClasses = 10;  % Number of classes (MNIST images fall into 10 classes)
filterDim = 9;    % Filter size for conv layer,9*9
numFilters = 20;   % Number of filters for conv layer
poolDim = 2;      % Pooling dimension, (should divide imageDim-filterDim+1)

% Load MNIST Train
addpath ./common/; 
images = loadMNISTImages('./common/train-images-idx3-ubyte');
images = reshape(images,imageDim,imageDim,[]);
labels = loadMNISTLabels('./common/train-labels-idx1-ubyte');
labels(labels==0) = 10; % Remap 0 to 10

% Initialize Parameters,theta=(2165,1),2165=9*9*20+20+100*20*10+10
theta = cnnInitParams(imageDim,filterDim,numFilters,poolDim,numClasses);

%%======================================================================
%% STEP 1: Implement convNet Objective
%  Implement the function cnnCost.m.

%%======================================================================
%% STEP 2: Gradient Check
%  Use the file computeNumericalGradient.m to check the gradient
%  calculation for your cnnCost.m function.  You may need to add the
%  appropriate path or copy the file to this directory.

DEBUG=false;  % set this to true to check gradient
%DEBUG = true;
if DEBUG
    % To speed up gradient checking, we will use a reduced network and
    % a debugging data set
    db_numFilters = 2;
    db_filterDim = 9;
    db_poolDim = 5;
    db_images = images(:,:,1:10);
    db_labels = labels(1:10);
    db_theta = cnnInitParams(imageDim,db_filterDim,db_numFilters,...
                db_poolDim,numClasses);

    [cost grad] = cnnCost(db_theta,db_images,db_labels,numClasses,...
                                db_filterDim,db_numFilters,db_poolDim);


    % Check gradients
    numGrad = computeNumericalGradient( @(x) cnnCost(x,db_images,...
                                db_labels,numClasses,db_filterDim,...
                                db_numFilters,db_poolDim), db_theta);

    % Use this to visually compare the gradients side by side
    disp([numGrad grad]);

    diff = norm(numGrad-grad)/norm(numGrad+grad);
    % Should be small. In our implementation, these values are usually 
    % less than 1e-9.
    disp(diff); 

    assert(diff < 1e-9,... 'Difference too large. Check your gradient computation again'); end; %%====================================================================== %% STEP 3: Learn Parameters % Implement minFuncSGD.m, then train the model. % 因為是采用的mini-batch梯度下降法,所以總共對樣本的循環次數次數比標準梯度下降法要少 % 很多,因為每次循環中權值已經迭代多次了 options.epochs = 3; options.minibatch = 256; options.alpha = 1e-1; options.momentum = .95; opttheta = minFuncSGD(@(x,y,z) cnnCost(x,y,z,numClasses,filterDim,... numFilters,poolDim),theta,images,labels,options); save('theta.mat','opttheta'); %%====================================================================== %% STEP 4: Test % Test the performance of the trained model using the MNIST test set. Your % accuracy should be above 97% after 3 epochs of training testImages = loadMNISTImages('./common/t10k-images-idx3-ubyte'); testImages = reshape(testImages,imageDim,imageDim,[]); testLabels = loadMNISTLabels('./common/t10k-labels-idx1-ubyte'); testLabels(testLabels==0) = 10; % Remap 0 to 10 [~,cost,preds]=cnnCost(opttheta,testImages,testLabels,numClasses,... filterDim,numFilters,poolDim,true); acc = sum(preds==testLabels)/length(preds); % Accuracy should be around 97.4% after 3 epochs fprintf('Accuracy is %f\n',acc);


cnnConvolve.m:


function convolvedFeatures = cnnConvolve(filterDim, numFilters, images, W, b)
%cnnConvolve Returns the convolution of the features given by W and b with
%the given images
%
% Parameters:
%  filterDim - filter (feature) dimension
%  numFilters - number of feature maps
%  images - large images to convolve with, matrix in the form
%           images(r, c, image number)
%  W, b - W, b for features from the sparse autoencoder,傳進來的w已經是矩陣的形式
%         W is of shape (filterDim,filterDim,numFilters)
%         b is of shape (numFilters,1)
%
% Returns:
%  convolvedFeatures - matrix of convolved features in the form
%                      convolvedFeatures(imageRow, imageCol, featureNum, imageNum)

numImages = size(images, 3);
imageDim = size(images, 1);
convDim = imageDim - filterDim + 1;

convolvedFeatures = zeros(convDim, convDim, numFilters, numImages);

% Instructions:
%   Convolve every filter with every image here to produce the 
%   (imageDim - filterDim + 1) x (imageDim - filterDim + 1) x numFeatures x numImages
%   matrix convolvedFeatures, such that 
%   convolvedFeatures(imageRow, imageCol, featureNum, imageNum) is the
%   value of the convolved featureNum feature for the imageNum image over
%   the region (imageRow, imageCol) to (imageRow + filterDim - 1, imageCol + filterDim - 1)
%
% Expected running times: 
%   Convolving with 100 images should take less than 30 seconds 
%   Convolving with 5000 images should take around 2 minutes
%   (So to save time when testing, you should convolve with less images, as
%   described earlier)


for imageNum = 1:numImages
  for filterNum = 1:numFilters

    % convolution of image with feature matrix
    convolvedImage = zeros(convDim, convDim);

    % Obtain the feature (filterDim x filterDim) needed during the convolution

    %%% YOUR CODE HERE %%%
    filter = W(:,:,filterNum);
    bc = b(filterNum);

    % Flip the feature matrix because of the definition of convolution, as explained later
    filter = rot90(squeeze(filter),2);

    % Obtain the image
    im = squeeze(images(:, :, imageNum));

    % Convolve "filter" with "im", adding the result to convolvedImage
    % be sure to do a 'valid' convolution

    %%% YOUR CODE HERE %%%
    convolvedImage = conv2(im, filter, 'valid');

    % Add the bias unit
    % Then, apply the sigmoid function to get the hidden activation
    %%% YOUR CODE HERE %%%
    convolvedImage = sigmoid(convolvedImage+bc);

    convolvedFeatures(:, :, filterNum, imageNum) = convolvedImage;
  end
end

end

function sigm = sigmoid(x)
    sigm = 1./(1+exp(-x));
end


cnnPool.m:


function pooledFeatures = cnnPool(poolDim, convolvedFeatures)
%cnnPool Pools the given convolved features
%
% Parameters:
%  poolDim - dimension of pooling region
%  convolvedFeatures - convolved features to pool (as given by cnnConvolve)
%                      convolvedFeatures(imageRow, imageCol, featureNum, imageNum)
%
% Returns:
%  pooledFeatures - matrix of pooled features in the form
%                   pooledFeatures(poolRow, poolCol, featureNum, imageNum)
%     

numImages = size(convolvedFeatures, 4);
numFilters = size(convolvedFeatures, 3);
convolvedDim = size(convolvedFeatures, 1);

pooledFeatures = zeros(convolvedDim / poolDim, ...
        convolvedDim / poolDim, numFilters, numImages);

% Instructions:
%   Now pool the convolved features in regions of poolDim x poolDim,
%   to obtain the 
%   (convolvedDim/poolDim) x (convolvedDim/poolDim) x numFeatures x numImages 
%   matrix pooledFeatures, such that
%   pooledFeatures(poolRow, poolCol, featureNum, imageNum) is the 
%   value of the featureNum feature for the imageNum image pooled over the
%   corresponding (poolRow, poolCol) pooling region. 
%   
%   Use mean pooling here.

%%% YOUR CODE HERE %%%
    % convolvedFeatures(imageRow, imageCol, featureNum, imageNum)
    % pooledFeatures(poolRow, poolCol, featureNum, imageNum)
    for numImage = 1:numImages
        for numFeature = 1:numFilters
            for poolRow = 1:convolvedDim / poolDim
                offsetRow = 1+(poolRow-1)*poolDim;
                for poolCol = 1:convolvedDim / poolDim
                    offsetCol = 1+(poolCol-1)*poolDim;
                    patch = convolvedFeatures(offsetRow:offsetRow+poolDim-1, ...
                        offsetCol:offsetCol+poolDim-1,numFeature,numImage); %取出一個patch
                    pooledFeatures(poolRow,poolCol,numFeature,numImage) = mean(patch(:));
                end
            end            
        end
    end

end


cnnCost.m:


function [cost, grad, preds] = cnnCost(theta,images,labels,numClasses,...
                                filterDim,numFilters,poolDim,pred)
% Calcualte cost and gradient for a single layer convolutional
% neural network followed by a softmax layer with cross entropy
% objective.
%                            
% Parameters:
%  theta      -  unrolled parameter vector
%  images     -  stores images in imageDim x imageDim x numImges
%                array
%  numClasses -  number of classes to predict
%  filterDim  -  dimension of convolutional filter                            
%  numFilters -  number of convolutional filters
%  poolDim    -  dimension of pooling area
%  pred       -  boolean only forward propagate and return
%                predictions
%
%
% Returns:
%  cost       -  cross entropy cost
%  grad       -  gradient with respect to theta (if pred==False)
%  preds      -  list of predictions for each example (if pred==True)


if ~exist('pred','var')
    pred = false;
end;


imageDim = size(images,1); % height/width of image
numImages = size(images,3); % number of images
lambda = 3e-3; % weight decay parameter     

%% Reshape parameters and setup gradient matrices

% Wc is filterDim x filterDim x numFilters parameter matrix
% bc is the corresponding bias

% Wd is numClasses x hiddenSize parameter matrix where hiddenSize
% is the number of output units from the convolutional layer
% bd is corresponding bias
[Wc, Wd, bc, bd] = cnnParamsToStack(theta,imageDim,filterDim,numFilters,...
                        poolDim,numClasses); %the theta vector cosists wc,wd,bc,bd in order

% Same sizes as Wc,Wd,bc,bd. Used to hold gradient w.r.t above params.
Wc_grad = zeros(size(Wc));
Wd_grad = zeros(size(Wd));
bc_grad = zeros(size(bc));
bd_grad = zeros(size(bd));

%%======================================================================
%% STEP 1a: Forward Propagation
%  In this step you will forward propagate the input through the
%  convolutional and subsampling (mean pooling) layers.  You will then use
%  the responses from the convolution and pooling layer as the input to a
%  standard softmax layer.

%% Convolutional Layer
%  For each image and each filter, convolve the image with the filter, add
%  the bias and apply the sigmoid nonlinearity.  Then subsample the 
%  convolved activations with mean pooling.  Store the results of the
%  convolution in activations and the results of the pooling in
%  activationsPooled.  You will need to save the convolved activations for
%  backpropagation.
convDim = imageDim-filterDim+1; % dimension of convolved output
outputDim = (convDim)/poolDim; % dimension of subsampled output

% convDim x convDim x numFilters x numImages tensor for storing activations
activations = zeros(convDim,convDim,numFilters,numImages);

% outputDim x outputDim x numFilters x numImages tensor for storing
% subsampled activations
activationsPooled = zeros(outputDim,outputDim,numFilters,numImages);

%%% YOUR CODE HERE %%%
convolvedFeatures = cnnConvolve(filterDim, numFilters, images, Wc, bc); %前向傳播,已經經過了激發函數
activationsPooled = cnnPool(poolDim, convolvedFeatures);


% Reshape activations into 2-d matrix, hiddenSize x numImages,
% for Softmax layer
activationsPooled = reshape(activationsPooled,[],numImages);

%% Softmax Layer
%  Forward propagate the pooled activations calculated above into a
%  standard softmax layer. For your convenience we have reshaped
%  activationPooled into a hiddenSize x numImages matrix.  Store the
%  results in probs.

% numClasses x numImages for storing probability that each image belongs to
% each class.
probs = zeros(numClasses,numImages);

%%% YOUR CODE HERE %%%
%Wd=(numClasses,hiddenSize),probs的每一列代表一個輸出
M = Wd*activationsPooled+repmat(bd,[1,numImages]); 
M = bsxfun(@minus,M,max(M,[],1));
M = exp(M);
probs = bsxfun(@rdivide, M, sum(M)); %why rdivide?

%%======================================================================
%% STEP 1b: Calculate Cost
%  In this step you will use the labels given as input and the probs
%  calculate above to evaluate the cross entropy objective.  Store your
%  results in cost.

cost = 0; % save objective into cost

%%% YOUR CODE HERE %%%
% cost = -1/numImages*labels(:)'*log(probs(:));
% 首先需要把labels弄成one-hot編碼
groundTruth = full(sparse(labels, 1:numImages, 1));
cost = -1./numImages*groundTruth(:)'*log(probs(:))+(lambda/2.)*(sum(Wd(:).^2)+sum(Wc(:).^2)); %加入一個懲罰項
% cost = -1./numImages*groundTruth(:)'*log(probs(:));

% Makes predictions given probs and returns without backproagating errors.
if pred
    [~,preds] = max(probs,[],1);
    preds = preds';
    grad = 0;
    return;
end;

%% 將c步和d步合成在一起了
%======================================================================
% STEP 1c: Backpropagation
%  Backpropagate errors through the softmax and convolutional/subsampling
%  layers.  Store the errors for the next step to calculate the gradient.
%  Backpropagating the error w.r.t the softmax layer is as usual.  To
%  backpropagate through the pooling layer, you will need to upsample the
%  error with respect to the pooling layer for each filter and each image.  
%  Use the kron function and a matrix of ones to do this upsampling 
%  quickly.

% STEP 1d: Gradient Calculation
%  After backpropagating the errors above, we can use them to calculate the
%  gradient with respect to all the parameters.  The gradient w.r.t the
%  softmax layer is calculated as usual.  To calculate the gradient w.r.t.
%  a filter in the convolutional layer, convolve the backpropagated error
%  for that filter with each image and aggregate over images.

%%% YOUR CODE HERE %%%
%%% YOUR CODE HERE %%%
% 網絡結構: images--> convolvedFeatures--> activationsPooled--> probs
% Wd = (numClasses,hiddenSize)
% bd = (hiddenSize,1)
% Wc = (filterDim,filterDim,numFilters)
% bc = (numFilters,1)
% activationsPooled = zeros(outputDim,outputDim,numFilters,numImages);
% convolvedFeatures = (convDim,convDim,numFilters,numImages)
% images(imageDim,imageDim,numImges)
delta_d = -(groundTruth-probs); % softmax layer's preactivation,每一個樣本都對應有自己每層的誤差敏感性。
Wd_grad = (1./numImages)*delta_d*activationsPooled'+lambda*Wd;
bd_grad = (1./numImages)*sum(delta_d,2); %注意這里是要求和
delta_s = Wd'*delta_d; %the pooling/sample layer's preactivation
delta_s = reshape(delta_s,outputDim,outputDim,numFilters,numImages);

%進行unsampling操作,由于kron函數只能對二維矩陣操作,所以還得分開弄
%delta_c = convolvedFeatures.*(1-convolvedFeatures).*(1./poolDim^2)*kron(delta_s, ones(poolDim)); 
delta_c = zeros(convDim,convDim,numFilters,numImages);
for i=1:numImages
    for j=1:numFilters
        delta_c(:,:,j,i) = (1./poolDim^2)*kron(squeeze(delta_s(:,:,j,i)), ones(poolDim));
    end
end
delta_c = convolvedFeatures.*(1-convolvedFeatures).*delta_c;

% Wc_grad = convn(images,rot90(delta_c,2,'valid'))+ lambda*Wc;
for i=1:numFilters
    Wc_i = zeros(filterDim,filterDim);
    for j=1:numImages  
        Wc_i = Wc_i+conv2(squeeze(images(:,:,j)),rot90(squeeze(delta_c(:,:,i,j)),2),'valid');
    end
   % Wc_i = convn(images,rot180(squeeze(delta_c(:,:,i,:))),'valid');
    % add penalize
    Wc_grad(:,:,i) = (1./numImages)*Wc_i+lambda*Wc(:,:,i);

    bc_i = delta_c(:,:,i,:);
    bc_i = bc_i(:);
    bc_grad(i) = sum(bc_i)/numImages;
end

%% Unroll gradient into grad vector for minFunc
grad = [Wc_grad(:) ; Wd_grad(:) ; bc_grad(:) ; bd_grad(:)];

end

function X = rot180(X)
    X = flipdim(flipdim(X, 1), 2);
end


minFuncSGD.m:


function [opttheta] = minFuncSGD(funObj,theta,data,labels,...
                        options)
% Runs stochastic gradient descent with momentum to optimize the
% parameters for the given objective.
%
% Parameters:
%  funObj     -  function handle which accepts as input theta,
%                data, labels and returns cost and gradient w.r.t
%                to theta.
%  theta      -  unrolled parameter vector
%  data       -  stores data in m x n x numExamples tensor
%  labels     -  corresponding labels in numExamples x 1 vector
%  options    -  struct to store specific options for optimization
%
% Returns:
%  opttheta   -  optimized parameter vector
%
% Options (* required)
%  epochs*     - number of epochs through data
%  alpha*      - initial learning rate
%  minibatch*  - size of minibatch
%  momentum    - momentum constant, defualts to 0.9


%%======================================================================
%% Setup
assert(all(isfield(options,{'epochs','alpha','minibatch'})),...
        'Some options not defined');
if ~isfield(options,'momentum')
    options.momentum = 0.9;
end;
epochs = options.epochs;
alpha = options.alpha;
minibatch = options.minibatch;
m = length(labels); % training set size
% Setup for momentum
mom = 0.5;
momIncrease = 20;
velocity = zeros(size(theta));

%%======================================================================
%% SGD loop
it = 0;
for e = 1:epochs

    % randomly permute indices of data for quick minibatch sampling
    rp = randperm(m);

    for s=1:minibatch:(m-minibatch+1)
        it = it + 1;

        % increase momentum after momIncrease iterations
        if it == momIncrease
            mom = options.momentum;
        end;

        % get next randomly selected minibatch
        mb_data = data(:,:,rp(s:s+minibatch-1)); % 取出當前的mini-batch的訓練樣本
        mb_labels = labels(rp(s:s+minibatch-1));

        % evaluate the objective function on the next minibatch
        [cost grad] = funObj(theta,mb_data,mb_labels);

        % Instructions: Add in the weighted velocity vector to the
        % gradient evaluated above scaled by the learning rate.
        % Then update the current weights theta according to the
        % sgd update rule

        %%% YOUR CODE HERE %%%
        velocity = mom*velocity+alpha*grad; % 見ufldl教程Optimization: Stochastic Gradient Descent
        theta = theta-velocity;

        fprintf('Epoch %d: Cost on iteration %d is %f\n',e,it,cost);
    end;

    % aneal learning rate by factor of two after each epoch
    alpha = alpha/2.0;

end;

opttheta = theta;

end


computeNumericalGradient.m:


function numgrad = computeNumericalGradient(J, theta)
% numgrad = computeNumericalGradient(J, theta)
% theta: a vector of parameters
% J: a function that outputs a real-number. Calling y = J(theta) will return the
% function value at theta. 

% Initialize numgrad with zeros
numgrad = zeros(size(theta));

%% ---------- YOUR CODE HERE --------------------------------------
% Instructions: 
% Implement numerical gradient checking, and return the result in numgrad.  
% (See Section 2.3 of the lecture notes.)
% You should write code so that numgrad(i) is (the numerical approximation to) the 
% partial derivative of J with respect to the i-th input argument, evaluated at theta.  
% I.e., numgrad(i) should be the (approximately) the partial derivative of J with 
% respect to theta(i).
%                
% Hint: You will probably want to compute the elements of numgrad one at a time. 

epsilon = 1e-4;

for i =1:length(numgrad)
    oldT = theta(i);
    theta(i)=oldT+epsilon;
    pos = J(theta);
    theta(i)=oldT-epsilon;
    neg = J(theta);
    numgrad(i) = (pos-neg)/(2*epsilon);
    theta(i)=oldT;
    if mod(i,100)==0
       fprintf('Done with %d\n',i);
    end;
end;


%% ---------------------------------------------------------------
end


cnnInitParams.m:


function theta = cnnInitParams(imageDim,filterDim,numFilters,...
                                poolDim,numClasses)
% Initialize parameters for a single layer convolutional neural
% network followed by a softmax layer.
%                            
% Parameters:
%  imageDim   -  height/width of image
%  filterDim  -  dimension of convolutional filter                            
%  numFilters -  number of convolutional filters
%  poolDim    -  dimension of pooling area
%  numClasses -  number of classes to predict
%
%
% Returns:
%  theta      -  unrolled parameter vector with initialized weights

%% Initialize parameters randomly based on layer sizes.
assert(filterDim < imageDim,'filterDim must be less that imageDim'); Wc = 1e-1*randn(filterDim,filterDim,numFilters); outDim = imageDim - filterDim + 1; % dimension of convolved image % assume outDim is multiple of poolDim assert(mod(outDim,poolDim)==0,... 'poolDim must divide imageDim - filterDim + 1'); outDim = outDim/poolDim; hiddenSize = outDim^2*numFilters; % we'll choose weights uniformly from the interval [-r, r] r = sqrt(6) / sqrt(numClasses+hiddenSize+1); Wd = rand(numClasses, hiddenSize) * 2 * r - r; bc = zeros(numFilters, 1); %初始化為0 bd = zeros(numClasses, 1); % Convert weights and bias gradients to the vector form. % This step will "unroll" (flatten and concatenate together) all % your parameters into a vector, which can then be used with minFunc. theta = [Wc(:) ; Wd(:) ; bc(:) ; bd(:)]; end


cnnParamsToStack.m:


function [Wc, Wd, bc, bd] = cnnParamsToStack(theta,imageDim,filterDim,...
                                 numFilters,poolDim,numClasses)
% Converts unrolled parameters for a single layer convolutional neural
% network followed by a softmax layer into structured weight
% tensors/matrices and corresponding biases
%                            
% Parameters:
%  theta      -  unrolled parameter vectore
%  imageDim   -  height/width of image
%  filterDim  -  dimension of convolutional filter                            
%  numFilters -  number of convolutional filters
%  poolDim    -  dimension of pooling area
%  numClasses -  number of classes to predict
%
%
% Returns:
%  Wc      -  filterDim x filterDim x numFilters parameter matrix
%  Wd      -  numClasses x hiddenSize parameter matrix, hiddenSize is
%             calculated as numFilters*((imageDim-filterDim+1)/poolDim)^2 
%  bc      -  bias for convolution layer of size numFilters x 1
%  bd      -  bias for dense layer of size hiddenSize x 1

outDim = (imageDim - filterDim + 1)/poolDim;
hiddenSize = outDim^2*numFilters;

%% Reshape theta
indS = 1;
indE = filterDim^2*numFilters;
Wc = reshape(theta(indS:indE),filterDim,filterDim,numFilters);
indS = indE+1;
indE = indE+hiddenSize*numClasses;
Wd = reshape(theta(indS:indE),numClasses,hiddenSize);
indS = indE+1;
indE = indE+numFilters;
bc = theta(indS:indE);
bd = theta(indE+1:end);


end


2013.12.30:

微博網友@路遙_機器學習利用matlab自帶的優化函數conv2,實現的mean-pooling,可以大大加快速度,大家可以參考。cnnPool.m文件里面:

tmp = conv2(convolvedFeatures(:,:,numFeature,numImage), ones(poolDim),'valid'); 
pooledFeatures(:,:,numFeature,numImage) =1./(poolDim^2)*tmp(1:poolDim:end,1:poolDim:end);



參考資料:

Deep learning:三十八(Stacked CNN簡單介紹)

UFLDL:Convolutional Neural Network

UFLDL:Exercise: Convolutional Neural Network

UFLDL:Optimization: Stochastic Gradient Descent

zouxy09博文:Deep Learning論文筆記之(四)CNN卷積神經網絡推導和實現

論文Notes on Convolutional Neural Networks, Jake Bouvrie

Larochelle關于DL的課件:Output layer gradient

github.com/rasmusbergpalm/DeepLearnToolbox/blob/master/CNN/cnnbp.m

https://github.com/amaas/stanford_dl_ex

http://yann.lecun.com/exdb/mnist/


文章出處:http://www.cnblogs.com/tornadomeet
 本文由用戶 jopen 自行上傳分享,僅供網友學習交流。所有權歸原作者,若您的權利被侵害,請聯系管理員。
 轉載本站原創文章,請注明出處,并保留原始鏈接、圖片水印。
 本站是一個以用戶分享為主的開源技術平臺,歡迎各類分享!