Parsing Huge XML Documents with Go

Posted by jopen 13 years ago | 47K views | Tags: Google Go/Golang development, Go

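I have recently been working with some of Wikipedia's XML files. Some of them are extremely large; the latest revisions file, for example, is 36 GB uncompressed. I have experimented with XML parsing in several languages, and in the end I found Go to be a very good fit.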

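Go's standard library includes a general-purpose XML package (encoding/xml) that also makes encoding straightforward. A simple way to process XML is to parse the whole document into memory in one go, but that is not feasible for a 36 GB file.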

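We can parse the document as a stream instead, but the examples available online are few and rather simplistic, so here is the sample code I use to parse Wikipedia. (Full example code at https://github.com/dps/go-xml-parse/blob/master/go-xml-parse.go)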

Here is a fragment of the Wikipedia XML.

// <page>
//     <title>Apollo 11</title>
//     <redirect title="Foo bar" />
//     ...
//     <revision>
//     ...
//       <text xml:space="preserve">
//       {{Infobox Space mission
//       |mission_name=&lt;!--See above--&gt;
//       |insignia=Apollo_11_insignia.png
//     ...
//       </text>
//     </revision>
// </page>

In our Go code, we define a struct to match the <page> element.

type Redirect struct {
    Title string `xml:"title,attr"`
} 

type Page struct {
    Title string `xml:"title"`
    Redir Redirect `xml:"redirect"`
    Text string `xml:"revision>text"`
}
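
As a quick check of how these struct tags map onto the XML, the following minimal sketch unmarshals a single small <page> element with xml.Unmarshal. The samplePage constant and the decodePage helper are illustrative names that do not appear in the original code, and the sample text is shortened from the fragment above.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Redirect captures the title attribute of a <redirect> element.
type Redirect struct {
	Title string `xml:"title,attr"`
}

// Page mirrors the parts of a <page> element we care about.
type Page struct {
	Title string   `xml:"title"`
	Redir Redirect `xml:"redirect"`
	Text  string   `xml:"revision>text"` // <text> nested inside <revision>
}

// samplePage is a shortened, hypothetical page used only for illustration.
const samplePage = `<page>
  <title>Apollo 11</title>
  <redirect title="Foo bar" />
  <revision>
    <text xml:space="preserve">{{Infobox Space mission}}</text>
  </revision>
</page>`

// decodePage unmarshals one complete <page> element held in memory.
func decodePage(data string) (Page, error) {
	var p Page
	err := xml.Unmarshal([]byte(data), &p)
	return p, err
}

func main() {
	p, err := decodePage(samplePage)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("title=%q redirect=%q text=%q\n", p.Title, p.Redir.Title, p.Text)
}
```

This loads the whole input into memory, so it is only suitable for small fragments like this one, not for the full dump.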

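Now we tell the parser that the Wikipedia document contains a number of <page> elements and try to read it, so let's see how this works as a stream. It is actually quite simple once you understand the principle: iterate over the tokens in the file, and when you encounter the StartElement of a <page> tag, use the magical decoder.DecodeElement API to unmarshal the entire element into an object, then move on to the next one.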

decoder := xml.NewDecoder(xmlFile) 

for {
    // Read tokens from the XML document in a stream.
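    t, err := decoder.Token()
    if err != nil {
        // io.EOF signals the normal end of the document;
        // any other error means the XML could not be read.
        break
    }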
    // Inspect the type of the token just read.
    switch se := t.(type) {
    case xml.StartElement:
        // If we just read a StartElement token
        // ...and its name is "page"
        if se.Name.Local == "page" {
            var p Page
            // Decode the whole <page> element that starts here into
            // the variable p, which is a Page (see above).
            if err := decoder.DecodeElement(&p, &se); err != nil {
                continue // skip a page that fails to decode
            }
            // Do some stuff with the page.
            p.Title = CanonicalizeTitle(p.Title)
            ...
        }
...

I hope this saves you some time when you need to parse a large XML file of your own.

Original translation by OSChina.NET / link to the original article


Article URL: http://www.baiduhome.net/lib/view/open1340583480342.html
