Lucene 6.0 in Action: Hot Index Backup and Recovery

Posted by l915349931 | 38K reads | Lucene · Search engines

Key Issues in Index Backup

  • The simplest approach is to close the IndexWriter and copy the index files one by one. But if the index is large, the backup takes a long time, and while it runs the application cannot modify the index; many search applications cannot tolerate such a long pause in indexing.
  • What about leaving the IndexWriter open, then? That does not work either: if the index files change while they are being copied, the backup ends up corrupted.
  • Another point: backing up an already corrupted index is pointless, so what you back up must always be the files as of the last successful commit.
  • If a backup run is going to overwrite a previous backup, it must first delete any files in the backup that do not appear in the current snapshot, since the current index no longer references them. If each backup goes to a fresh path, a plain copy suffices.

Hot Index Backup

Since version 2.3, Lucene has shipped a hot-backup mechanism: SnapshotDeletionPolicy. It lets you back up, without closing the IndexWriter, the exact set of files referenced by the most recent commit, so you can maintain a continuous backup image of the index. You may wonder: what happens if the index changes while the backup is running?

That is exactly where SnapshotDeletionPolicy shines. Once you take a snapshot with SnapshotDeletionPolicy.snapshot(), the IndexWriter will not delete any file referenced by that commit point for as long as the writer stays open, even while it continues to apply updates and merges. A long-running copy is therefore safe: what you are copying is a snapshot of the index, and every file the snapshot references remains on disk for the snapshot's lifetime.

The index consequently occupies more disk space than usual while a backup is in progress. When the backup finishes, call SnapshotDeletionPolicy.release(IndexCommit commit) to release that commit, so that the IndexWriter may delete the snapshotted files once they are superseded or no longer referenced.

Note that Lucene writes each index file exactly once. This means an incremental backup can be driven purely by comparing file names: you do not need to inspect file contents or modification timestamps, because once a file referenced by a snapshot has been fully written, it never changes again.
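Because the files never change, an incremental backup boils down to a set difference on file names. Below is a minimal, Lucene-free sketch of that idea using only `java.nio.file` (the class and method names are mine, not from the article); in a real backup the file list would come from `SnapshotDeletionPolicy.snapshot().getFileNames()`, and `write.lock` would never appear in it:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class IncrementalBackup {
    /**
     * Syncs backupDir to the set of snapshot file names: deletes backup files
     * the snapshot no longer references, then copies only the files the backup
     * does not have yet. Equal names imply equal content because Lucene index
     * files are write-once.
     */
    public static void sync(Path indexDir, Path backupDir,
                            Set<String> snapshotFiles) throws IOException {
        Files.createDirectories(backupDir);
        // Step 1: drop backup files that are absent from the current snapshot.
        try (Stream<Path> existing = Files.list(backupDir)) {
            for (Path p : existing.collect(Collectors.toList())) {
                if (!snapshotFiles.contains(p.getFileName().toString())) {
                    Files.delete(p);
                }
            }
        }
        // Step 2: copy only files that are new since the last backup.
        for (String name : snapshotFiles) {
            Path target = backupDir.resolve(name);
            if (!Files.exists(target)) {
                Files.copy(indexDir.resolve(name), target);
            }
        }
    }
}
```

Trusting equal names to mean equal content is safe precisely because of the write-once guarantee; with mutable files this shortcut would be unsound.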

The segments.gen file used to be rewritten on every commit, so backup code had to re-copy it every time; in Lucene 6.0, however, segments.gen has been removed from the index file format, so it no longer needs any special handling. The write.lock file should not be copied during a backup.

SnapshotDeletionPolicy has two limitations:

  • It keeps only one usable snapshot alive at a time. You can lift this restriction by building a corresponding deletion policy that retains multiple snapshots simultaneously.
  • The current snapshot is not persisted to disk. This means that if you close the old IndexWriter and open a new one, the snapshot is gone; you therefore must not close the IndexWriter before the backup finishes, or you will get org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed. This restriction is also easy to lift: persist the current snapshot to disk and protect it when opening the new IndexWriter, so the backup can keep running across the close of the old writer and the opening of the new one.

A Hot-Backup Solution

```java
import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Collection;

import static org.apache.lucene.document.Field.Store.YES;

/**
 * Tests hot backup of a Lucene index.
 */
public class TestIndexBackupRecovery {

    public static void main(String[] args) throws IOException, InterruptedException {
        String f = "D:/index_test";
        String d = "D:/index_back";
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
        indexWriterConfig.setIndexDeletionPolicy(
                new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy()));
        IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(f)), indexWriterConfig);

        Document document = new Document();
        document.add(new StringField("ID", "111", YES));
        document.add(new IntPoint("age", 111));
        document.add(new StoredField("age", 111));
        writer.addDocument(document);
        writer.commit();

        document = new Document();
        document.add(new StringField("ID", "222", YES));
        document.add(new IntPoint("age", 333));
        document.add(new StoredField("age", 333));
        writer.addDocument(document);

        for (int i = 0; i < 1000; i++) {
            document = new Document();
            document.add(new StringField("ID", "333", YES));
            document.add(new IntPoint("age", 1000000 + i));
            document.add(new StoredField("age", 1000000 + i));
            document.add(new StringField("desc", "ABCDEFG" + i, YES));
            writer.addDocument(document);
        }
        writer.deleteDocuments(new TermQuery(new Term("ID", "333")));
        writer.commit();

        backupIndex(writer, f, d);
        writer.close();
    }

    public static void backupIndex(IndexWriter indexWriter, String indexDir,
                                   String backupIndexDir) throws IOException {
        IndexWriterConfig config = (IndexWriterConfig) indexWriter.getConfig();
        SnapshotDeletionPolicy snapshotDeletionPolicy =
                (SnapshotDeletionPolicy) config.getIndexDeletionPolicy();
        IndexCommit snapshot = snapshotDeletionPolicy.snapshot();
        // Set the index commit point. The default is null, which opens the last commit.
        config.setIndexCommit(snapshot);
        Collection<String> fileNames = snapshot.getFileNames();
        File[] dest = new File(backupIndexDir).listFiles();
        if (dest != null && dest.length > 0) {
            // First delete backup files that no longer appear in this snapshot.
            for (File file : dest) {
                boolean delete = true;
                String destFileName = file.getName(); // includes the extension
                for (String fileName : fileNames) {
                    if (fileName.equals(destFileName)) {
                        delete = false;
                        break; // leave the inner loop
                    }
                }
                if (delete) {
                    file.delete();
                }
            }
            // Then copy the files that are new in this snapshot.
            for (String fileName : fileNames) {
                boolean copy = true;
                for (File file : dest) {
                    // Already in the backup, so no copy is needed: Lucene index files
                    // are write-once, so an identical name implies identical data and
                    // no hash check is required.
                    if (file.getName().equals(fileName)) {
                        copy = false;
                        break;
                    }
                }
                if (copy) {
                    File from = new File(indexDir + File.separator + fileName);       // source
                    File to = new File(backupIndexDir + File.separator + fileName);   // destination
                    FileUtils.copyFile(from, to);
                }
            }
        } else {
            // No backup exists yet: copy everything.
            for (String fileName : fileNames) {
                File from = new File(indexDir + File.separator + fileName);
                File to = new File(backupIndexDir + File.separator + fileName);
                FileUtils.copyFile(from, to);
            }
        }
        snapshotDeletionPolicy.release(snapshot);
        // Delete index commits that are no longer referenced.
        indexWriter.deleteUnusedFiles();
    }
}
```

Index File Format

So what exactly does an index store? An index holds a sequence of documents; a document is a sequence of fields; a field is a named sequence of terms; a term is a sequence of bytes.

According to the Summary of File Extensions, the index file formats present in Lucene 6.0 are as follows:

| Name | Extension | Brief Description |
|------|-----------|-------------------|
| Segments File | segments_N | Stores information about a commit point |
| Lock File | write.lock | The write lock prevents multiple IndexWriters from writing to the same file |
| Segment Info | .si | Stores metadata about a segment |
| Compound File | .cfs, .cfe | An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles |
| Fields | .fnm | Stores information about the fields |
| Field Index | .fdx | Contains pointers to field data |
| Field Data | .fdt | The stored fields for documents |
| Term Dictionary | .tim | The term dictionary, stores term info |
| Term Index | .tip | The index into the term dictionary |
| Frequencies | .doc | Contains the list of docs which contain each term along with frequency |
| Positions | .pos | Stores position information about where a term occurs in the index |
| Payloads | .pay | Stores additional per-position metadata information such as character offsets and user payloads |
| Norms | .nvd, .nvm | Encodes length and boost factors for docs and fields |
| Per-Document Values | .dvd, .dvm | Encodes additional scoring factors or other per-document information |
| Term Vector Index | .tvx | Stores offset into the document data file |
| Term Vector Documents | .tvd | Contains information about each document that has term vectors |
| Term Vector Fields | .tvf | The field-level info about term vectors |
| Live Documents | .liv | Info about what documents are live |
| Point Values | .dii, .dim | Holds indexed points, if any |
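As a quick sanity check of which of these formats a given index actually uses, you can tally the files in the index directory by extension. This small sketch is mine, not from the article; grouping all `segments_N` files under one `segments` key is an arbitrary choice:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class IndexFileSummary {
    /**
     * Returns a map from file extension (e.g. ".cfs") to the number of files
     * in dir carrying that extension. segments_N files have no extension and
     * are grouped under the key "segments".
     */
    public static Map<String, Long> byExtension(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                        .map(n -> n.startsWith("segments") ? "segments"
                                : n.contains(".") ? n.substring(n.lastIndexOf('.')) : n)
                        .collect(Collectors.groupingBy(e -> e, Collectors.counting()));
        }
    }
}
```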

A Lucene index stores both forward information and inverted information.

Forward information: segments (segments_N) -> fields (.fnm/.fdx/.fdt) -> terms (.tvx/.tvd/.tvf)

Inverted information: term dictionary (.tim) -> postings lists (.doc/.pos)

Restoring an Index

The steps to restore an index are:

  • Close every reader and writer on the index directory; only then can the files be restored. On Windows, if another process still has these files open, the restore program will not be able to overwrite them.
  • Delete all files in the current index directory. If deletion fails with an "Access is denied" error, re-check that the previous step has completed.
  • Copy the files from the backup directory into the index directory. Make sure this copy cannot run into errors such as a full disk, because such errors would corrupt the restored index.
  • A corrupted index can be checked and repaired with CheckIndex (org.apache.lucene.index).
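Steps 2 and 3 above reduce to wiping the live directory and recursively copying the backup back. A minimal stdlib sketch of those two steps (names and paths are mine), assuming step 1 is done — all readers and writers closed — and that any copy error must abort the recovery:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

public class IndexRestore {
    /** Wipes indexDir and repopulates it from backupDir. */
    public static void restore(Path backupDir, Path indexDir) throws IOException {
        // Step 2: delete everything under the live index directory
        // (deepest entries first, so directories are empty when removed).
        if (Files.exists(indexDir)) {
            try (Stream<Path> walk = Files.walk(indexDir)) {
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
        // Step 3: copy the backup back. Any failure here (e.g. disk full)
        // must abort the recovery, so exceptions are propagated.
        try (Stream<Path> walk = Files.walk(backupDir)) {
            for (Path src : (Iterable<Path>) walk::iterator) {
                Path dst = indexDir.resolve(backupDir.relativize(src).toString());
                if (Files.isDirectory(src)) {
                    Files.createDirectories(dst);
                } else {
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```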

Lucene is generally good at avoiding the most common failure modes. If the application hits a full disk or an OutOfMemoryException, it loses only the documents buffered in memory; documents already committed to the index are preserved intact, and the index stays consistent. The same holds for a JVM crash, an uncaught exception, a killed process, an operating-system crash, or a sudden power failure.

The recovery routine below first tries to open an IndexWriter on the live index; if that fails, it restores from the backup, and if the restore itself fails, it falls back to repairing the index. Here `log` is an SLF4J-style logger, and `IndexUtils.checkIndex` wraps the checkIndex method shown in the next section.

```java
/**
 * @param source            the backup (source) index directory
 * @param dest              the live (destination) index directory
 * @param indexWriterConfig the writer configuration
 */
public static void recoveryIndex(String source, String dest,
                                 IndexWriterConfig indexWriterConfig) {
    IndexWriter indexWriter = null;
    try {
        indexWriter = new IndexWriter(FSDirectory.open(Paths.get(dest)), indexWriterConfig);
    } catch (IOException e) {
        log.error("", e);
    } finally {
        if (indexWriter != null && indexWriter.isOpen()) {
            // The IndexWriter opened normally, so no recovery is needed.
            try {
                indexWriter.close();
            } catch (IOException e) {
                log.error("", e);
            }
        } else {
            // The IndexWriter could not be opened: restore the index from the backup.
            // For simplicity, wipe the corrupted index directory first; for a very
            // large index you could diff the files instead of deleting everything.
            try {
                FileUtils.deleteDirectory(new File(dest));
                FileUtils.copyDirectory(new File(source), new File(dest));
            } catch (IOException e) {
                log.error("", e);
                // Restoring from the backup failed, so fall back to the last resort:
                // repairing the index.
                log.info("Check index {} now!", dest);
                try {
                    IndexUtils.checkIndex(dest);
                } catch (IOException | InterruptedException e1) {
                    log.error("Check index error!", e1);
                }
            }
        }
    }
}
```

Repairing an Index

When everything else fails to fix a corrupted index, your last option is the CheckIndex tool. Besides reporting the index's detailed status, it can repair the index; it does so by forcibly removing the broken segments. Be aware that this also drops every document those segments contain, so the tool is mainly for getting a search application running again in an emergency. If you have a backup and it is intact, prefer restoring the index over repairing it.

```java
/**
 * CheckIndex examines every byte of the index, so this can take a long time
 * on a large index.
 *
 * @throws IOException
 * @throws InterruptedException
 */
public void checkIndex(String indexFilePath) throws IOException, InterruptedException {
    CheckIndex checkIndex = new CheckIndex(FSDirectory.open(Paths.get(indexFilePath)));
    checkIndex.setInfoStream(System.out);
    CheckIndex.Status status = checkIndex.checkIndex();
    if (status.clean) {
        System.out.println("Check Index successfully!");
    } else {
        System.out.println("Starting repair index files...");
        // exorciseIndex() writes a new segments file, but it does not remove
        // unreferenced files; those are only removed the next time an
        // IndexWriter is opened. It drops every document in the broken segments.
        checkIndex.exorciseIndex(status);
        checkIndex.close();
        // Verify that the index can be opened after the repair.
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(Paths.get(indexFilePath)),
                new IndexWriterConfig(new StandardAnalyzer()));
        System.out.println(indexWriter.isOpen());
        indexWriter.close();
    }
}
```

If the index is intact, the output looks like this:

```
Segments file=segments_2 numSegments=2 version=6.0.0 id=2jlug2dgsc4tmgkf5rck1xgm4
  1 of 2: name=_0 maxDoc=1
    version=6.0.0
    id=2jlug2dgsc4tmgkf5rck1xgm1
    codec=Lucene60
    compound=true
    numFiles=3
    size (MB)=0.002
    diagnostics = {java.runtime.version=1.8.0_45-b15, java.vendor=Oracle Corporation, java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.0.0, os=Windows 7, os.arch=amd64, os.version=6.1, source=flush, timestamp=1464159740092}
    no deletions
    test: open reader.........OK [took 0.051 sec]
    test: check integrity.....OK [took 0.000 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [2 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [1 terms; 1 terms/docs pairs; 0 tokens] [took 0.007 sec]
    test: stored fields.......OK [2 total field count; avg 2.0 fields per doc] [took 0.008 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [1 fields, 1 points] [took 0.001 sec]
  2 of 2: name=_1 maxDoc=1001
    version=6.0.0
    id=2jlug2dgsc4tmgkf5rck1xgm3
    codec=Lucene60
    compound=true
    numFiles=4
    size (MB)=0.023
    diagnostics = {java.runtime.version=1.8.0_45-b15, java.vendor=Oracle Corporation, java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.0.0, os=Windows 7, os.arch=amd64, os.version=6.1, source=flush, timestamp=1464159740329}
    has deletions [delGen=1]
    test: open reader.........OK [took 0.003 sec]
    test: check integrity.....OK [took 0.000 sec]
    test: check live docs.....OK [1000 deleted docs] [took 0.000 sec]
    test: field infos.........OK [3 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [1 terms; 1 terms/docs pairs; 0 tokens] [took 0.008 sec]
    test: stored fields.......OK [2 total field count; avg 2.0 fields per doc] [took 0.012 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [1 fields, 1001 points] [took 0.001 sec]
No problems were detected with this index.
Took 0.159 sec total.
Check Index successfully!
```

After corrupting the index (by deleting a .cfe file), running the check again outputs:

```
Segments file=segments_2 numSegments=2 version=6.0.0 id=2jlug2dgsc4tmgkf5rck1xgm4
  1 of 2: name=_0 maxDoc=1
    version=6.0.0
    id=2jlug2dgsc4tmgkf5rck1xgm1
    codec=Lucene60
    compound=true
    numFiles=3
FAILED
    WARNING: exorciseIndex() would remove reference to this segment; full exception:
java.nio.file.NoSuchFileException: D:\index_test\_0.cfe
	at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:79)
	...
	...
	at java.lang.reflect.Method.invoke(Method.java:497)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
  2 of 2: name=_1 maxDoc=1001
    version=6.0.0
    id=2jlug2dgsc4tmgkf5rck1xgm3
    codec=Lucene60
    compound=true
    numFiles=4
    size (MB)=0.023
    diagnostics = {java.runtime.version=1.8.0_45-b15, java.vendor=Oracle Corporation, java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.0.0, os=Windows 7, os.arch=amd64, os.version=6.1, source=flush, timestamp=1464159740329}
    has deletions [delGen=1]
    test: open reader.........OK [took 0.059 sec]
    test: check integrity.....OK [took 0.000 sec]
    test: check live docs.....OK [1000 deleted docs] [took 0.001 sec]
    test: field infos.........OK [3 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [1 terms; 1 terms/docs pairs; 0 tokens] [took 0.013 sec]
    test: stored fields.......OK [2 total field count; avg 2.0 fields per doc] [took 0.016 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [1 fields, 1001 points] [took 0.002 sec]
WARNING: 1 broken segments (containing 1 documents) detected
Took 0.165 sec total.
Starting repair index files...
true
```

     

Source: http://blog.jobbole.com/108094/

     
