Developing a Deep Learning Caption Generation Model in Python from Scratch
Starting from data preprocessing, this article describes in detail how to build an image captioning system using VGG and a recurrent neural network, and will help readers understand and implement automatic image captioning with Keras and TensorFlow. All of the code is explained, making the article well suited for readers new to image captioning who want to understand the whole process.
Caption generation is a challenging artificial intelligence problem that involves generating a textual description for a given image.
It requires both computer vision methods to understand the content of the image and a natural language processing model to turn that understanding into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem.
Deep learning methods have demonstrated state-of-the-art results on caption generation. What is most impressive about these methods is that, given an image, a single end-to-end model can be used to predict a caption, without sophisticated data preparation or a pipeline of specifically designed models.
In this tutorial, you will discover how to develop a deep learning model to generate captions for photographs from scratch.
After completing this tutorial, you will know:
- How to prepare photo and text data for training a deep learning model.
- How to design and train a deep learning caption generation model.
- How to evaluate a trained caption generation model and use it to generate captions for entirely new photographs.
Tutorial Overview
This tutorial is divided into six parts:
1. Photo and Caption Dataset
2. Prepare Photo Data
3. Prepare Text Data
4. Develop the Deep Learning Model
5. Evaluate the Model
6. Generate New Captions
Python Environment
This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3. You must have Keras (version 2.0 or higher) installed with either the TensorFlow or Theano backend. The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.
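A minimal version-check script along these lines can confirm the environment (the exact version strings will differ on your machine):
# print the versions of the key libraries
import keras
import tensorflow
import numpy
import sklearn
print('keras: %s' % keras.__version__)
print('tensorflow: %s' % tensorflow.__version__)
print('numpy: %s' % numpy.__version__)
print('scikit-learn: %s' % sklearn.__version__)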
I recommend running the code on a system with a GPU. You can get access to GPUs cheaply on Amazon Web Services; see the guide on how to run Jupyter notebooks on AWS GPUs.
Photo and Caption Dataset
A good dataset to use for image captioning is the Flickr8K dataset. The reason is that it is realistic and relatively small, so you can download it and build models on your workstation even using just a CPU.
The definitive description of the dataset is in the 2013 paper "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics".
The authors describe the dataset as follows:
We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.
The dataset is available for free. You must complete a request form, and a link to the dataset will be emailed to you. Request form: https://illinois.edu/fb/sec/1713398.
Within a short time, you will receive an email that contains links to two files:
- Flickr8k_Dataset.zip (1 Gigabyte): an archive of all photographs.
- Flickr8k_text.zip (2.2 Megabytes): an archive of all text descriptions for the photographs.
Download the datasets and unzip them into your current working directory. You will have two directories:
- Flicker8k_Dataset: contains 8,092 photographs in JPEG format.
- Flickr8k_text: contains a number of files with different sources of descriptions for the photographs.
The dataset has a predefined training dataset (6,000 images), development dataset (1,000 images), and test dataset (1,000 images).
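A quick way to confirm these split sizes once the text archive is unzipped is to count the filenames in each list; a minimal sketch, assuming the default directory layout:
# count the photos in each predefined split
for split in ['trainImages', 'devImages', 'testImages']:
	path = 'Flickr8k_text/Flickr_8k.%s.txt' % split
	with open(path) as f:
		names = [line for line in f.read().split('\n') if len(line) > 0]
	print('%s: %d photos' % (split, len(names)))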
One measure that can be used to evaluate the skill of a model is the BLEU score. For reference, below are some approximate BLEU scores achieved by skillful models on the test dataset (taken from the 2017 paper "Where to put the Image in an Image Caption Generator"):
- BLEU-1: 0.401 to 0.578.
- BLEU-2: 0.176 to 0.390.
- BLEU-3: 0.099 to 0.260.
- BLEU-4: 0.059 to 0.170.
We describe the BLEU metric in more detail later, in the section on evaluating the model. Next, let's look at how to load the images.
Prepare Photo Data
We will use a pre-trained model to interpret the content of the photos, and there are many candidates to choose from. In this case, we will use the Oxford Visual Geometry Group, or VGG, model that won the 2014 ImageNet competition.
Keras provides this pre-trained model directly. Note that the first time you use this model, Keras will download the model weights from the internet, which are about 500 Megabytes. This may take a while depending on your internet connection.
We could use this model as part of a broader image caption generation model. The problem is that it is a large model, and running every photo through the network each time we want to test a new language model configuration (downstream) is redundant.
Instead, we can pre-compute the "photo features" using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. It is no different from running the photo through the full VGG model; we just do it once, in advance.
This optimization makes training our models faster and consumes less memory. We can load the VGG model in Keras using the VGG16 class. We will remove the last layer of the loaded model, as this is the layer used to predict a classification for the photo. We are not interested in classifying images, but we are interested in the internal representation of the photo right before the classification is made. These are the "features" that the model has extracted from the photo.
Keras also provides tools for reshaping a loaded photo to the size preferred by the model (e.g., a 3-channel 224 x 224 pixel image).
Below is the extract_features() function which, given a directory name, will load each photo, prepare it for VGG, and collect the predicted features from the VGG model. The photo features are a 4,096-element vector, and the function returns a dictionary mapping image identifiers to image features.
# extract features from each photo in the directory
def extract_features(directory):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# summarize
	print(model.summary())
	# extract features from each photo
	features = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get features
		feature = model.predict(image, verbose=0)
		# get image id
		image_id = name.split('.')[0]
		# store feature
		features[image_id] = feature
		print('>%s' % name)
	return features
We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to the file features.pkl.
The complete example is listed below:
from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# summarize
	print(model.summary())
	# extract features from each photo
	features = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get features
		feature = model.predict(image, verbose=0)
		# get image id
		image_id = name.split('.')[0]
		# store feature
		features[image_id] = feature
		print('>%s' % name)
	return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))
Running this data preparation step may take a while depending on your hardware, perhaps one hour on a modern workstation with a CPU.
At the end of the run, you will have the extracted features stored in features.pkl for later use. This file will be about 127 Megabytes in size.
Prepare Text Data
The dataset contains multiple descriptions for each photograph, and the text of the descriptions requires some minimal cleaning. First, we load the file containing all of the descriptions.
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
Each photo has a unique identifier, which appears both in the photo filename and in the text file of descriptions.
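Concretely, each line of Flickr8k.token.txt pairs an identifier (the photo filename plus a caption index) with one description, roughly like the following illustrative sample:
1000268201_693b08cb0e.jpg#0 A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1 A girl going into a wooden building .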
Next, we will step through the photo descriptions. Below defines the load_descriptions() function that, given the loaded document text, returns a dictionary of photo identifiers to descriptions, where each identifier maps to a list of one or more textual descriptions.
# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# create the list if needed
		if image_id not in mapping:
			mapping[image_id] = list()
		# store description
		mapping[image_id].append(image_desc)
	return mapping

# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
Next, we need to clean the description text. The descriptions are already tokenized, which makes them easy to work with.
We will clean the text in the following ways in order to reduce the size of the vocabulary of words we need to work with:
- Convert all words to lowercase.
- Remove all punctuation.
- Remove all words that are one character or less in length (e.g. 'a').
- Remove all words that contain numbers.
Below defines the clean_descriptions() function that, given the dictionary of image identifiers to descriptions, steps through each description and cleans the text.
import string

def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc_list in descriptions.items():
		for i in range(len(desc_list)):
			desc = desc_list[i]
			# tokenize
			desc = desc.split()
			# convert to lower case
			desc = [word.lower() for word in desc]
			# remove punctuation from each token
			desc = [w.translate(table) for w in desc]
			# remove hanging 's' and 'a'
			desc = [word for word in desc if len(word)>1]
			# remove tokens with numbers in them
			desc = [word for word in desc if word.isalpha()]
			# store as string
			desc_list[i] = ' '.join(desc)

# clean descriptions
clean_descriptions(descriptions)
Once cleaned, we can summarize the size of the vocabulary.
Ideally, we want a vocabulary that is both expressive and as small as possible. A smaller vocabulary results in a smaller model that trains faster.
For reference, we can transform the clean descriptions into a set and print its size to get an idea of the size of our dataset vocabulary.
# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
	# build a set of all description words
	all_desc = set()
	for key in descriptions.keys():
		[all_desc.update(d.split()) for d in descriptions[key]]
	return all_desc

# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
Finally, we can save the dictionary of image identifiers and descriptions to a new file named descriptions.txt, with one image identifier and description per line.
Below defines the save_descriptions() function that, given a dictionary containing the mapping of identifiers to descriptions and a filename, saves that mapping to file.
# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
	lines = list()
	for key, desc_list in descriptions.items():
		for desc in desc_list:
			lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# save descriptions
save_descriptions(descriptions, 'descriptions.txt')
Putting this all together, the complete listing is provided below:
import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# create the list if needed
		if image_id not in mapping:
			mapping[image_id] = list()
		# store description
		mapping[image_id].append(image_desc)
	return mapping

def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc_list in descriptions.items():
		for i in range(len(desc_list)):
			desc = desc_list[i]
			# tokenize
			desc = desc.split()
			# convert to lower case
			desc = [word.lower() for word in desc]
			# remove punctuation from each token
			desc = [w.translate(table) for w in desc]
			# remove hanging 's' and 'a'
			desc = [word for word in desc if len(word)>1]
			# remove tokens with numbers in them
			desc = [word for word in desc if word.isalpha()]
			# store as string
			desc_list[i] = ' '.join(desc)

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
	# build a set of all description words
	all_desc = set()
	for key in descriptions.keys():
		[all_desc.update(d.split()) for d in descriptions[key]]
	return all_desc

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
	lines = list()
	for key, desc_list in descriptions.items():
		for desc in desc_list:
			lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
# save to file
save_descriptions(descriptions, 'descriptions.txt')
Running the example first prints the number of loaded photo descriptions (8,092) and the size of the clean vocabulary (8,763 words).
Loaded: 8,092
Vocabulary Size: 8,763
Finally, the clean descriptions are written to descriptions.txt.
Taking a look at the file, we can see that the descriptions are ready for modeling. Note that the order of descriptions in your file may vary.
2252123185_487f21e336 bunch on people are seated in stadium
2252123185_487f21e336 crowded stadium is full of people watching an event
2252123185_487f21e336 crowd of people fill up packed stadium
2252123185_487f21e336 crowd sitting in an indoor stadium
2252123185_487f21e336 stadium full of people watch game
...
Develop the Deep Learning Model
In this section, we will define the deep learning model and fit it on the training dataset. This section is divided into the following parts:
1. Loading Data.
2. Defining the Model.
3. Fitting the Model.
4. Complete Example.
Loading Data
First, we must load the prepared photo and text data so that we can use it to fit the model.
We are going to train the model on all of the photos and descriptions in the training dataset. While training, we will monitor the performance of the model on the development dataset and use that performance to decide when to save models to file.
The train and development datasets have been predefined in the Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files respectively, both of which contain lists of photo filenames. From these filenames, we can extract the photo identifiers and use them to filter the photos and descriptions for each set.
The load_set() function below will load a predefined set of identifiers given the train or development set filename.
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)
Now, we can load the photos and descriptions using the predefined set of train or development identifiers.
Below is the load_clean_descriptions() function that loads the cleaned text descriptions from descriptions.txt for a given set of identifiers and returns a dictionary of identifiers to lists of text descriptions.
The model we will develop generates a caption for a given photo one word at a time, with the sequence of previously generated words provided as input. We therefore need a "first word" to kick off the generation process and a "last word" to signal the end of the caption.
We will use the strings startseq and endseq for this purpose. These tokens are added to the descriptions as they are loaded. It is important to do this now, before we encode the text, so that the tokens are also encoded correctly.
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions
Next, we can load the photo features for a given dataset.
Below defines the load_photo_features() function that loads the entire set of pre-computed photo features and then returns the subset of interest for a given set of photo identifiers.
This is not very efficient; nevertheless, it will get us up and running quickly.
# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features
We can pause here and test everything developed so far.
The complete code example is listed below:
from pickle import load

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
Running this example first loads the 6,000 photo identifiers of the training dataset. These identifiers are then used to load the clean description text and the pre-computed photo features.
Dataset: 6,000
Descriptions: train=6,000
Photos: train=6,000
The description text needs to be encoded to numbers before it can be presented to the model as input or compared to the model's predictions.
The first step in encoding the data is to create a consistent mapping from words to unique integer values. Keras provides the Tokenizer class, which can learn this mapping from the loaded description data.
Below defines the to_lines() function, which converts the dictionary of descriptions into a list of strings, and the create_tokenizer() function, which fits a Tokenizer on the loaded photo description text.
# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
We can now encode the text.
Each description will be split into words. The model is provided one word and the photo, and it generates the next word. Then the first two words of the description are provided to the model as input, together with the image, to generate the next word. This is how the model will be trained.
For example, the input sequence "little girl running in field" would be split into six input-output pairs to train the model:
X1      X2 (text sequence)                           y (word)
photo   startseq,                                    little
photo   startseq, little,                            girl
photo   startseq, little, girl,                      running
photo   startseq, little, girl, running,             in
photo   startseq, little, girl, running, in,         field
photo   startseq, little, girl, running, in, field,  endseq
Later, when the model is used to generate descriptions, the generated words will be concatenated and recursively provided as input to generate a caption for the image.
The create_sequences() function below, given the tokenizer, a maximum sequence length, and the dictionary of all descriptions and photos, transforms the data into input-output pairs for training the model. There are two input arrays to the model: one for the photo features and one for the encoded text. The model output is the encoded next word in the text sequence.
The input text is encoded as integers, which will be fed to a word embedding layer. The photo features will be fed directly to another part of the model. The model outputs a prediction in the form of a probability distribution over all words in the vocabulary.
The output data is therefore a one-hot encoded version of each word, representing an idealized probability distribution with 0 values at all word positions except the actual word position, which has a value of 1.
# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos):
	X1, X2, y = list(), list(), list()
	# walk through each image identifier
	for key, desc_list in descriptions.items():
		# walk through each description for the image
		for desc in desc_list:
			# encode the sequence
			seq = tokenizer.texts_to_sequences([desc])[0]
			# split one sequence into multiple X,y pairs
			for i in range(1, len(seq)):
				# split into input and output pair
				in_seq, out_seq = seq[:i], seq[i]
				# pad input sequence
				in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
				# encode output sequence
				out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
				# store
				X1.append(photos[key][0])
				X2.append(in_seq)
				y.append(out_seq)
	return array(X1), array(X2), array(y)
We also need to calculate the maximum number of words in the longest description. A short helper function named max_length() is defined below.
# calculate the length of the description with the most words
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)
We now have enough to load the data for the training and development datasets and transform the loaded data into input-output pairs for fitting a deep learning model.
Defining the Model
We will define the deep learning model based on the "merge-model" described by Marc Tanti, et al. in their 2017 papers:
- Where to put the Image in an Image Caption Generator,2017
- What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?,2017
The authors provide a nice schematic of the model in those papers.
We will describe the model in three parts:
- Photo Feature Extractor: a 16-layer VGG model pre-trained on the ImageNet dataset. We have pre-processed the photos with the VGG model (without the output layer) and will use the extracted features predicted by this model as input.
- Sequence Processor: a word embedding layer for handling the text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer.
- Decoder: both the feature extractor and sequence processor output a fixed-length vector. These are merged together and processed by a Dense layer to make the final prediction.
The Photo Feature Extractor model expects input photo features to be a vector of 4,096 elements. These are processed by a Dense layer to produce a 256-element representation of the photo.
The Sequence Processor model expects input sequences with a predefined length (34 words), which are fed into an Embedding layer that uses a mask to ignore padded values. This is followed by an LSTM layer with 256 memory units.
Both of the input models produce a 256-element vector. Further, both input models use regularization in the form of 50% dropout. This is to reduce overfitting of the training dataset, as this model configuration learns very fast.
The Decoder model merges the vectors from both input models using an addition operation. The result is fed to a Dense 256-neuron layer, and then to a final output Dense layer that makes a softmax prediction over the entire output vocabulary for the next word in the sequence.
The define_model() function below defines and returns the model, ready to be fit.
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor model
	inputs1 = Input(shape=(4096,))
	fe1 = Dropout(0.5)(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	# sequence model
	inputs2 = Input(shape=(max_length,))
	se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
	se2 = Dropout(0.5)(se1)
	se3 = LSTM(256)(se2)
	# decoder model
	decoder1 = add([fe2, se3])
	decoder2 = Dense(256, activation='relu')(decoder1)
	outputs = Dense(vocab_size, activation='softmax')(decoder2)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam')
	# summarize model
	print(model.summary())
	plot_model(model, to_file='model.png', show_shapes=True)
	return model
To get a sense of the structure of the model, specifically the shapes of the layers, see the summary listed below.
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_2 (InputLayer)             (None, 34)            0
____________________________________________________________________________________________________
input_1 (InputLayer)             (None, 4096)          0
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 34, 256)       1940224     input_2[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 4096)          0           input_1[0][0]
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 34, 256)       0           embedding_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 256)           1048832     dropout_1[0][0]
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 256)           525312      dropout_2[0][0]
____________________________________________________________________________________________________
add_1 (Add)                      (None, 256)           0           dense_1[0][0]
                                                                   lstm_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 256)           65792       add_1[0][0]
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 7579)          1947803     dense_2[0][0]
====================================================================================================
Total params: 5,527,963
Trainable params: 5,527,963
Non-trainable params: 0
____________________________________________________________________________________________________
We also create a plot to visualize the structure of the network, which helps in understanding the two streams of input.
Plot of the caption generation deep learning model.
Fitting the Model
Now that we know how to define the model, we can fit it on the training dataset.
The model learns fast and quickly overfits the training dataset. For this reason, we will monitor the skill of the trained model on the holdout development dataset. When the skill of the model on the development dataset improves at the end of an epoch, we will save the whole model to file.
At the end of the run, we can then use the saved model with the best skill on the development dataset as our final model.
We can achieve this by defining a ModelCheckpoint in Keras that monitors the minimum loss on the validation dataset and saves the model to a file whose name contains both the training and validation loss.
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
The checkpoint is then specified in the call to fit() via the callbacks argument. We must also specify the development dataset in fit() via the validation_data argument.
We will only fit the model for 20 epochs; given the amount of training data, each epoch may take 30 minutes on modern hardware.
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
Complete Example
The complete example for fitting the model on the training data is listed below:
from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos):
	X1, X2, y = list(), list(), list()
	# walk through each image identifier
	for key, desc_list in descriptions.items():
		# walk through each description for the image
		for desc in desc_list:
			# encode the sequence
			seq = tokenizer.texts_to_sequences([desc])[0]
			# split one sequence into multiple X,y pairs
			for i in range(1, len(seq)):
				# split into input and output pair
				in_seq, out_seq = seq[:i], seq[i]
				# pad input sequence
				in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
				# encode output sequence
				out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
				# store
				X1.append(photos[key][0])
				X2.append(in_seq)
				y.append(out_seq)
	return array(X1), array(X2), array(y)

# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor model
	inputs1 = Input(shape=(4096,))
	fe1 = Dropout(0.5)(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	# sequence model
	inputs2 = Input(shape=(max_length,))
	se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
	se2 = Dropout(0.5)(se1)
	se3 = LSTM(256)(se2)
	# decoder model
	decoder1 = add([fe2, se3])
	decoder2 = Dense(256, activation='relu')(decoder1)
	outputs = Dense(vocab_size, activation='softmax')(decoder2)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam')
	# summarize model
	print(model.summary())
	plot_model(model, to_file='model.png', show_shapes=True)
	return model

# train dataset
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)
# dev dataset
# load test set
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)
# fit model
# define the model
model = define_model(vocab_size, max_length)
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
Running the example first prints a summary of the loaded training and development datasets.
Dataset: 6,000
Descriptions: train=6,000
Photos: train=6,000
Vocabulary Size: 7,579
Description Length: 34
Dataset: 1,000
Descriptions: test=1,000
Photos: test=1,000
Following the summary of the model, we can get an idea of the total number of training and validation (development) input-output pairs.
Train on 306,404 samples, validate on 50,903 samples
The model then runs, saving the best model to a .h5 file along the way.
On my run, the best validation result was saved to the file:
- model-ep002-loss3.245-val_loss3.612.h5
This model was saved at the end of epoch 2 with a loss of 3.245 on the training dataset and a loss of 3.612 on the development (validation) dataset. Your specific results will vary. If you ran the example on AWS, copy the model file back to your current working directory.
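As a quick check, the saved checkpoint can be reloaded before moving on; a minimal sketch (the filename below is from my run, yours will differ):
from keras.models import load_model
# confirm the checkpoint saved by ModelCheckpoint reloads cleanly
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
print(model.summary())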
Evaluate the Model
Once the model is fit, we can evaluate the skill of its predictions on the holdout test dataset.
We will evaluate the model by generating descriptions for all photos in the test dataset and scoring those predictions with a standard cost function.
First, we need to be able to generate a description for a photo using a trained model. This involves passing in the start token 'startseq', generating one word, then calling the model recursively with the previously generated words as input, until the end-of-sequence token 'endseq' is predicted or the maximum description length is reached.
The generate_desc() function below implements this behavior, generating a textual description given a trained model and a prepared photo as input. It calls the word_for_id() function to map an integer prediction back to a word.
# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
	# seed the generation process
	in_text = 'startseq'
	# iterate over the whole length of the sequence
	for i in range(max_length):
		# integer encode input sequence
		sequence = tokenizer.texts_to_sequences([in_text])[0]
		# pad input
		sequence = pad_sequences([sequence], maxlen=max_length)
		# predict next word
		yhat = model.predict([photo,sequence], verbose=0)
		# convert probability to integer
		yhat = argmax(yhat)
		# map integer to word
		word = word_for_id(yhat, tokenizer)
		# stop if we cannot map the word
		if word is None:
			break
		# append as input for generating the next word
		in_text += ' ' + word
		# stop if we predict the end of the sequence
		if word == 'endseq':
			break
	return in_text
We will generate predictions for all photos in the test dataset and in the training dataset.
The evaluate_model() function below evaluates a trained model against a given dataset of photo descriptions and photo features. The actual and predicted descriptions are collected and evaluated using the corpus BLEU score, which summarizes how close the generated text is to the expected text.
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
	actual, predicted = list(), list()
	# step over the whole set
	for key, desc_list in descriptions.items():
		# generate description
		yhat = generate_desc(model, tokenizer, photos[key], max_length)
		# store actual and predicted
		references = [d.split() for d in desc_list]
		actual.append(references)
		predicted.append(yhat.split())
	# calculate BLEU score
	print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
	print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
	print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
	print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
BLEU scores are used in text translation to evaluate translated text against one or more reference translations.
Here, we compare each generated description against all of the reference descriptions for the photograph, then calculate BLEU scores for cumulative 1-, 2-, 3-, and 4-gram matches.
The NLTK Python library implements the BLEU score calculation in its corpus_bleu() function. A score closer to 1.0 is better; a score closer to 0 is worse.
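To make the metric concrete, here is a tiny, self-contained sketch of corpus_bleu() on made-up data: one predicted caption scored against two reference captions.
from nltk.translate.bleu_score import corpus_bleu
# one hypothesis, paired with a list of reference token lists (toy data)
references = [[['dog', 'runs', 'on', 'the', 'beach'], ['a', 'dog', 'is', 'running', 'on', 'the', 'beach']]]
candidates = [['dog', 'is', 'running', 'on', 'the', 'beach']]
print('BLEU-1: %f' % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))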
We can put this together with the functions from the earlier section for loading the data. We first load the training dataset in order to prepare a Tokenizer, so that we can encode generated words as input sequences for the model. It is critical that we encode the generated words with exactly the same encoding scheme that was used when training the model.
We then use these functions to load the test dataset. The complete example is listed below:
from numpy import argmax
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)

# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
	# seed the generation process
	in_text = 'startseq'
	# iterate over the whole length of the sequence
	for i in range(max_length):
		# integer encode input sequence
		sequence = tokenizer.texts_to_sequences([in_text])[0]
		# pad input
		sequence = pad_sequences([sequence], maxlen=max_length)
		# predict next word
		yhat = model.predict([photo,sequence], verbose=0)
		# convert probability to integer
		yhat = argmax(yhat)
		# map integer to word
		word = word_for_id(yhat, tokenizer)
		# stop if we cannot map the word
		if word is None:
			break
		# append as input for generating the next word
		in_text += ' ' + word
		# stop if we predict the end of the sequence
		if word == 'endseq':
			break
	return in_text

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
	actual, predicted = list(), list()
	# step over the whole set
	for key, desc_list in descriptions.items():
		# generate description
		yhat = generate_desc(model, tokenizer, photos[key], max_length)
		# store actual and predicted
		references = [d.split() for d in desc_list]
		actual.append(references)
		predicted.append(yhat.split())
	# calculate BLEU score
	print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
	print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
	print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
	print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

# prepare tokenizer on train set
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare test set
# load test set
filename = 'Flickr8k_text/Flickr_8k.testImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# load the model
filename = 'model-ep002-loss3.245-val_loss3.612.h5'
model = load_model(filename)
# evaluate model
evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
Running the example prints the BLEU scores. We can see that the scores fit within, and are close to the top of, the expected range for a skillful model on this problem, and that is without any particular tuning of the chosen model configuration.
BLEU-1: 0.579114
BLEU-2: 0.344856
BLEU-3: 0.252154
BLEU-4: 0.131446
Generate New Captions
Now that we know how to develop and evaluate a caption generation model, how can we use it?
Almost everything we need to generate captions for entirely new photographs is in the model file. We also need the Tokenizer, to encode generated words for the model, and the maximum length of input sequences that was used when we defined the model.
We can hard-code the maximum sequence length. With the text encoding, we can create the tokenizer once and save it to file, so that we can load it quickly whenever we need it without requiring the entire Flickr8K dataset. An alternative would be to use our own vocabulary file and mapping-to-integers function during training.
We can create the Tokenizer as before and save it as the pickle file tokenizer.pkl. The complete example is listed below:
from keras.preprocessing.text import Tokenizer
from pickle import dump

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))
We can now load the tokenizer whenever we need it, without having to load the entire training dataset of annotations. Next, let's generate a description for a new image. Below is a photograph I chose at random from Flickr.
A dog on the beach.
We will generate a description for it using our model. Download the photograph and save it to your local directory with the filename 'example.jpg'. First, we must load the Tokenizer from tokenizer.pkl and define the maximum length of sequences to generate, which is needed when padding the inputs.
# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
Then we must load the model, as before.
# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
Next, we must load the photo we wish to describe and extract its features.
We could redefine the whole model and add the VGG-16 model to it, or we can use the VGG model to predict the features and use them as inputs to our existing model. We will do the latter, using a modified version of the extract_features() function from the data preparation step that works on a single photo.
# extract features from a single photo
def extract_features(filename):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# load the photo
	image = load_img(filename, target_size=(224, 224))
	# convert the image pixels to a numpy array
	image = img_to_array(image)
	# reshape data for the model
	image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
	# prepare the image for the VGG model
	image = preprocess_input(image)
	# get features
	feature = model.predict(image, verbose=0)
	return feature

# load and prepare the photograph
photo = extract_features('example.jpg')
We can then generate descriptions using the generate_desc() function defined when evaluating the model. The complete example for generating a description for an entirely new, standalone photograph is listed below:
from pickle import load
from numpy import argmax
from keras.preprocessing.sequence import pad_sequences
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.models import load_model

# extract features from a single photo
def extract_features(filename):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# load the photo
	image = load_img(filename, target_size=(224, 224))
	# convert the image pixels to a numpy array
	image = img_to_array(image)
	# reshape data for the model
	image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
	# prepare the image for the VGG model
	image = preprocess_input(image)
	# get features
	feature = model.predict(image, verbose=0)
	return feature

# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
	# seed the generation process
	in_text = 'startseq'
	# iterate over the whole length of the sequence
	for i in range(max_length):
		# integer encode input sequence
		sequence = tokenizer.texts_to_sequences([in_text])[0]
		# pad input
		sequence = pad_sequences([sequence], maxlen=max_length)
		# predict next word
		yhat = model.predict([photo,sequence], verbose=0)
		# convert probability to integer
		yhat = argmax(yhat)
		# map integer to word
		word = word_for_id(yhat, tokenizer)
		# stop if we cannot map the word
		if word is None:
			break
		# append as input for generating the next word
		in_text += ' ' + word
		# stop if we predict the end of the sequence
		if word == 'endseq':
			break
	return in_text

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
# load and prepare the photograph
photo = extract_features('example.jpg')
# generate description
description = generate_desc(model, tokenizer, photo, max_length)
print(description)
In this case, the description generated was as follows:
startseq dog is running across the beach endseq
Removing the start and end tokens (a small helper for this is sketched below), this is perhaps exactly the caption we would hope the model to generate. We have now generated a textual description for an image with a model, end to end. The implementation here is basic and simple, but it is a foundation for moving on to more powerful image captioning models, and we hope it has given readers a hands-on understanding of how such models work.
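The cleanup itself can be a tiny helper; a minimal sketch (not part of the original tutorial, and the function name is mine):
# strip the startseq/endseq wrapper tokens from a generated description (hypothetical helper)
def cleanup_caption(description):
	tokens = description.split()
	# drop the leading 'startseq' and a trailing 'endseq' if present
	if tokens and tokens[0] == 'startseq':
		tokens = tokens[1:]
	if tokens and tokens[-1] == 'endseq':
		tokens = tokens[:-1]
	return ' '.join(tokens)

print(cleanup_caption('startseq dog is running across the beach endseq'))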
Source: https://www.jiqizhixin.com/articles/2017-12-11-6