Python: Reading a Text File Line by Line, With and Without Buffering

Posted by SylArmenta 9 years ago | 3K views | Python

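Requirements

A recent project required reading a fairly large file, on the order of 100,000 lines.

When Java reads a file through a buffered stream, it creates an internal buffer array in JVM memory, so each read is served from that block of memory.

Put simply: without a buffer, every read goes disk -> JVM memory -> your code. With a buffer, a chunk is read ahead into JVM memory, which avoids going back to the disk for every small read.

This helps because operating on memory is much faster than operating on disk.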
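The same effect can be made visible in Python. The following is a minimal sketch, not a benchmark: `CountingRaw` and `count_reads` are hypothetical helpers written for this illustration, the "disk" is an in-memory byte string, and the 8192-byte buffer size is an assumption chosen for the example. It counts how many low-level reads are needed to iterate over the same lines with and without a buffer in front of the raw stream.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory stand-in for a disk file that counts low-level reads."""

    def __init__(self, data):
        super().__init__()
        self._data = data
        self._pos = 0
        self.read_calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Every call here stands for one trip to the underlying device.
        self.read_calls += 1
        chunk = self._data[self._pos:self._pos + len(b)]
        n = len(chunk)
        if n:
            b[:n] = chunk
        self._pos += n
        return n


def count_reads(data, buffered):
    """Iterate over all lines; return (line_count, low_level_read_calls)."""
    raw = CountingRaw(data)
    # Assumption for illustration: an 8192-byte buffer in front of the raw stream.
    stream = io.BufferedReader(raw, buffer_size=8192) if buffered else raw
    lines = 0
    for _ in stream:
        lines += 1
    return lines, raw.read_calls


if __name__ == "__main__":
    sample = b"line\n" * 10000
    print("buffered   (lines, reads):", count_reads(sample, buffered=True))
    print("unbuffered (lines, reads):", count_reads(sample, buffered=False))
```

Both runs see the same lines, but the buffered run needs far fewer low-level reads, which is the saving described above.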

Incidentally, Java also has memory-mapped files, which can help with reading and writing large files.
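Python offers a comparable facility in its standard `mmap` module. The sketch below is only an illustration of that idea and is not one of the methods measured in this article; the function name `count_nonblank_lines_mmap` is made up for the example.

```python
import mmap


def count_nonblank_lines_mmap(path):
    """Count non-blank lines by walking a read-only memory map of the file."""
    count = 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Note: mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mm.readline() returns b"" once the end of the map is reached.
            for line in iter(mm.readline, b""):
                if line.strip():
                    count += 1
    return count
```

The operating system pages the file in on demand, so the whole file does not have to be copied into a Python object first.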

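Approach

A large file should not be read into memory all at once, because that can exhaust memory. (When memory allows, though, is reading everything at once actually faster?)
A large file can be read one line at a time, since each line can be discarded once it has been processed.

We can also read a large file one chunk at a time, which amounts to a form of buffering: read a chunk into a buffer, then process that chunk. In principle this is somewhat faster than going line by line.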
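The chunk-at-a-time idea can be sketched by hand as follows. This is an illustration under assumptions, not the code measured later in the article: the generator name `iter_lines_chunked` and the 64 KB default chunk size are invented for the example, and the file is assumed to be UTF-8 text. The only subtle part is carrying the unfinished last line of one chunk over into the next.

```python
def iter_lines_chunked(path, chunk_size=64 * 1024):
    """Yield lines (without the trailing newline) by reading fixed-size chunks."""
    leftover = ""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # The last piece may be an incomplete line: keep it for the next chunk.
            pieces = (leftover + chunk).split("\n")
            leftover = pieces.pop()
            for line in pieces:
                yield line
    # A final line with no trailing newline is still a line.
    if leftover:
        yield leftover
```

Each chunk is processed and then dropped, so memory use stays bounded by the chunk size plus the length of one line.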

Method 1: Read line by line

We can open a file and loop over its lines with a for loop, for example:

import time

def method1(newName):
    s1 = time.perf_counter()
    oldLine = '0'
    count = 0
    # Iterating over the file object yields one line at a time
    with open(newName) as f:
        for line in f:
            newLine = line
            if newLine != oldLine:
                # Skip blank lines
                if newLine.strip():
                    nu = newLine.split()[0]  # first whitespace-separated field
                    oldLine = newLine
                    count += 1
    print("deal %s lines" % count)
    e1 = time.perf_counter()
    print("cost time " + str(e1 - s1))

Let's test it:

fileName = 'E:\\pythonProject\\ruisi\\correct_re.txt'
method1(fileName)

Output

deal 218376 lines
cost time 0.288900734402

Method 1.1: A variant of line-by-line reading

def method11(newName):
    s1 = time.perf_counter()
    oldLine = '0'
    count = 0
    with open(newName) as f:
        while True:
            # readline() returns an empty string only at end of file
            line = f.readline()
            if not line:
                break
            # Skip blank lines
            if line.strip():
                newLine = line
                if newLine != oldLine:
                    nu = newLine.split()[0]
                    oldLine = newLine
                    count += 1
    print("deal %s lines" % count)
    e1 = time.perf_counter()
    print("cost time " + str(e1 - s1))

deal 218376 lines
cost time 0.371977884619

The elapsed time is close to Method 1, just slightly higher.

Method 2: Line by line, using the fileinput module

import fileinput

def method2(newName):
    s1 = time.perf_counter()
    oldLine = '0'
    count = 0
    # fileinput.input() yields the lines of the named file one at a time
    with fileinput.input(newName) as f:
        for line in f:
            newLine = line
            # Skip blank lines
            if newLine.strip():
                if newLine != oldLine:
                    nu = newLine.split()[0]
                    oldLine = newLine
                    count += 1
    print("deal %s lines" % count)
    e1 = time.perf_counter()
    print("cost time " + str(e1 - s1))

deal 218376 lines
cost time 0.514534051673

The elapsed time here is nearly twice that of Method 1 (about 1.8x).
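For context, `fileinput` is designed to treat several files (or standard input) as a single stream, and it tracks the current file name and line number as it goes. That bookkeeping is one plausible reason for the extra time seen above, though this was not verified here. The sketch below shows the kind of use the module is aimed at; the helper name `iter_tagged_lines` is made up for this example.

```python
import fileinput


def iter_tagged_lines(paths):
    """Yield (filename, line_number_within_file, line) across several files."""
    with fileinput.input(files=paths) as stream:
        for line in stream:
            # filename() and filelineno() describe the line that was just read.
            yield stream.filename(), stream.filelineno(), line
```

When only one file is read and none of this tracking is needed, plain iteration over the file object as in Method 1 is the simpler choice.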

Method 3: With buffering, reading a batch of lines (about 10 KB) at a time

def method3(newName):
    s1 = time.perf_counter()
    oldLine = '0'
    count = 0
    with open(newName) as f:
        while True:
            # The argument is a size hint in bytes, not a number of lines
            lines = f.readlines(10 * 1024)
            # print(len(lines))
            if not lines:
                break
            for line in lines:
                # Skip blank lines
                if line.strip():
                    newLine = line
                    if newLine != oldLine:
                        nu = newLine.split()[0]
                        oldLine = newLine
                        count += 1
    print("deal %s lines" % count)
    e1 = time.perf_counter()
    print("cost time " + str(e1 - s1))

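Note
The argument to readlines(sizehint) limits the size in bytes, not the number of lines.
Under Python 2, which the measurements here were taken with, there is an internal buffer whose default size is 8 KB. If the value passed is smaller than 8*1024, reads still happen in 8 KB units, and printing len(lines) shows roughly 290 each time.
Only when the value is larger than 8 KB does the printed len(lines) change.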

deal 218376 lines
cost time 0.296652349397

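Performance here is still slightly worse than Method 1. You can adjust how much is read per batch by changing the size hint, which yields different timings.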
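To try different size hints, a small harness along these lines may help. It is a sketch under assumptions: the function name `read_in_batches` is invented for this example, and it targets Python 3, where `readlines(hint)` stops once the accumulated line lengths exceed the hint. Batch sizes may therefore differ from the roughly 290 lines per batch reported in the note above.

```python
import time


def read_in_batches(path, size_hint):
    """Return (line_count, batch_count, seconds) for readlines(size_hint)."""
    start = time.perf_counter()
    line_count = 0
    batch_count = 0
    with open(path) as f:
        while True:
            # size_hint is an approximate size per batch, not a line count.
            lines = f.readlines(size_hint)
            if not lines:
                break
            batch_count += 1
            line_count += len(lines)
    return line_count, batch_count, time.perf_counter() - start
```

Calling it with several hints on the same file, for example 1024, 10 * 1024 and 100 * 1024, shows how the number of batches and the elapsed time change while the line count stays the same.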

Method 4: Read everything into memory at once

def method4(newName):
    s1 = time.perf_counter()
    oldLine = '0'
    count = 0
    with open(newName) as f:
        # readlines() with no argument loads every line into a list
        for line in f.readlines():
            # Skip blank lines
            if line.strip():
                newLine = line
                if newLine != oldLine:
                    nu = newLine.split()[0]
                    oldLine = newLine
                    count += 1
    print("deal %s lines" % count)
    e1 = time.perf_counter()
    print("cost time " + str(e1 - s1))

Output

deal 218376 lines
cost time 0.30108883108

Conclusion

Recommended approach:

with open('foo.txt', 'r') as f:
    for line in f:
        do_something(line)  # placeholder for your own per-line processing

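For a large file you can build an index that records the starting position of each line, and then jump to any line with file.seek(). If the file's contents change, the index has to be rebuilt. There are many ways to build such an index, but all of them require one full pass over the file.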
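One simple way to build such an index is sketched below. It is an illustration rather than a definitive implementation: the names `build_line_index` and `read_line_at` are hypothetical, and the file is opened in binary mode so that the recorded offsets are exact byte positions.

```python
def build_line_index(path):
    """Scan the file once and return the byte offset at which each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_line_at(path, offsets, line_number):
    """Return line `line_number` (0-based) as bytes, using the prebuilt index."""
    with open(path, "rb") as f:
        f.seek(offsets[line_number])
        return f.readline()
```

The index costs one integer per line, so it is far smaller than the file itself, and each lookup afterwards is a single seek followed by one readline().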



Article URL: http://www.baiduhome.net/code/view/1454997926558
