MySQL 如何存儲大數據
最近,在工作中遇到了MySQL中如何存儲長度較長的字段類型問題,于是花了一周多的時間抽空學習了一下,并且記錄下來。
MySQL大致的邏輯存儲結構在這篇文章中有介紹,做為基本概念: InnoDB 邏輯存儲結構
注:文中所指的大數據指的是長度較長的數據字段,包括varchar/varbinay/text/blob。
Compact行格式
我們首先來看一下行格式為Compact是如何存儲大數據的:
mysql> select version();
+-----------+
| version() |
+-----------+
| 5.1.73 |
+-----------+
1 row in set (0.01 sec)
mysql> show table status like 'row'\G;
*************************** 1. row ***************************
Name: row
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 1
Avg_row_length: 81920
Data_length: 81920
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: 2017-01-04 21:46:02
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)
我們建立一張測試表,插入數據:
CREATE TABLE `row` (
`content` varchar(65532) NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=latin1
mysql> insert into row(content) select repeat('a',65532);
Query OK, 1 row affected (0.03 sec)
Records: 1 Duplicates: 0 Warnings: 0
我們使用 py_innodb_page_info.py 工具來查看表中的頁分布:
[root@localhost mysql]# python py_innodb_page_info.py -v com/row.ibd
page offset 00000000, page type <File Space Header>
page offset 00000001, page type <Insert Buffer Bitmap>
page offset 00000002, page type <File Segment inode>
page offset 00000003, page type <B-tree Node>, page level <0000>
page offset 00000004, page type <Uncompressed BLOB Page>
page offset 00000005, page type <Uncompressed BLOB Page>
page offset 00000006, page type <Uncompressed BLOB Page>
page offset 00000007, page type <Uncompressed BLOB Page>
Total number of page: 8:
Insert Buffer Bitmap: 1
Uncompressed BLOB Page: 4
File Space Header: 1
B-tree Node: 1
File Segment inode: 1
可以看出,第4頁的 <B-tree Node>, page level <0000> 格式為數據頁,存放著MySQL的行數據。 <Uncompressed BLOB Page> 可以理解為MySQL存放大數據的地方,暫且叫作外部存儲頁。Compact格式沒有將大數據全部放在數據頁中,而是將一部分數據放在了外部存儲頁中。那么,是全部數據在外部存儲頁中,還是一部分數據。假如是一部分數據,這一部分是多少呢?
我們使用 hexdump -Cv row.ibd 查看一下數據頁 <B-tree Node>, page level <0000> ,也就是第4頁:
3073 0000c000 8c 25 17 57 00 00 00 03 ff ff ff ff ff ff ff ff |.%.W....????????|
3074 0000c010 00 00 00 00 00 07 3a b8 45 bf 00 00 00 00 00 00 |......:?E?......|
3075 0000c020 00 00 00 00 00 02 00 02 03 a6 80 03 00 00 00 00 |.........?......|
3076 0000c030 00 7f 00 05 00 00 00 01 00 00 00 00 00 00 00 00 |................|
3077 0000c040 00 00 00 00 00 00 00 00 00 13 00 00 00 02 00 00 |................|
3078 0000c050 00 02 00 f2 00 00 00 02 00 00 00 02 00 32 01 00 |...?.........2..|
3079 0000c060 02 00 1c 69 6e 66 69 6d 75 6d 00 02 00 0b 00 00 |...infimum......|
3080 0000c070 73 75 70 72 65 6d 75 6d 14 c3 00 00 10 ff f1 00 |supremum.?...??.|
3081 0000c080 00 00 00 04 03 00 00 00 00 13 12 80 00 00 00 2d |...............-|
3082 0000c090 01 10 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |..aaaaaaaaaaaaaa|
3083 0000c0a0 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa|
3084 0000c0b0 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa|
3085 0000c0c0 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa|
....
....
3128 0000c370 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa|
3129 0000c380 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa|
3130 0000c390 61 61 00 00 00 02 00 00 00 04 00 00 00 26 00 00 |aa...........&..|
3131 0000c3a0 00 00 00 00 fc fc 00 00 00 00 00 00 00 00 00 00 |....??..........|
3132 0000c3b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
3133 0000c3c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
3134 0000c3d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
...
...
4093 0000ffc0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
4094 0000ffd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
4095 0000ffe0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
4096 0000fff0 00 00 00 00 00 70 00 63 01 a1 6c 2b 00 07 3a b8 |.....p.c.?l+..:?|
我們可以看出,數據頁中存儲了一部分數據,算下來一共是768字節,然后剩余部分存儲在外部存儲頁中。那么數據頁與外部存儲頁、外部存儲頁與外部存儲頁是如何連接在一起的呢?
我們觀察這一行:
3130 0000c390 61 61 00 00 00 02 00 00 00 04 00 00 00 26 00 00 |aa...........&..|
3131 0000c3a0 00 00 00 00 fc fc 00 00 00 00 00 00 00 00 00 00 |................|
這一行是前綴768字節的結尾。注意最后的20個字節:
- 00 00 00 02:4字節,代表外部存儲頁所在的space id
- 00 00 00 04:4字節,代表第一個外部頁的Page no
- 00 00 00 26:4字節,值為38,指向blob頁的header
- 00 00 00 00 00 00 fc fc:8字節,代表該列存在外部存儲頁的總長度。此處的值為64764,加上前綴768正好是65532。(注意一點,雖然表示BLOB長度的是8字節,實際只有4個字節能使用,所有對于BLOB字段,存儲數據的最大長度為4GB。)
驗證下第一個外部存儲頁的頭部信息:
4097 00010000 cd c3 b6 8e 00 00 00 04 00 00 00 00 00 00 00 00 |??.............|
4098 00010010 00 00 00 00 00 06 b8 a2 00 0a 00 00 00 00 00 00 |......??........|
4099 00010020 00 00 00 00 00 02 00 00 3f ca 00 00 00 05 61 61 |........??....aa|
4100 00010030 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa|
...
...
前38個字節為File Header(關于InnoDB數據頁的詳細結構請參見《MySQL技術內幕 InnoDB存儲引擎》4.4),這個簡單提一下:
- cd c3 b6 8e:4字節,該頁的checksum。
- 00 00 00 04:4字節,頁偏移,此頁為表空間中的第5個頁。
- 00 00 00 00:4字節,當前頁的上一個頁。此頁為 <Uncompressed BLOB Page> ,所以沒有上一頁。
- 00 00 00 00:4字節,當前頁的下一個頁。此頁為 <Uncompressed BLOB Page> ,所以沒有下一頁。
- 00 00 00 00 00 06 b8 a2:8字節,該頁最后被修改的日志序列位置LSN。
- 00 0a:2字節,頁類型,0x000A代表BLOB頁。
- 00 00 00 00 00 00 00 00:8字節,略過。
- 00 00 00 02:頁屬于哪個表空間,此處指表空間的ID為2。
之后是4字節的 00 00 3f ca ,這里的值為16330,代表此BLOB頁的有效數據的字節數。 00 00 00 05 代表下一個BLOB頁的page number。
我們看最后一個 <Uncompressed BLOB Page> ,第8個頁:
7169 0001c000 fa 78 9b 27 00 00 00 07 00 00 00 00 00 00 00 00 |?x.'............|
7170 0001c010 00 00 00 00 00 07 3a b8 00 0a 00 00 00 00 00 00 |......:?........|
7171 0001c020 00 00 00 00 00 02 00 00 3d 9e ff ff ff ff 61 61 |........=.????aa|
7172 0001c030 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa|
7173 0001c040 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa|
...
...
最后一頁的有效數據大小為0x00003d9e=15774,768+16330*3+15774 = 65532字節,符合初始插入數據的大小。
由于這是最后一個 <Uncompressed BLOB Page> ,所以指向下一個 <Uncompressed BLOB Page> 的指針為ff ff ff ff。
由此我們可以很清晰的看出數據頁與BLOB頁的連接關系(引用淘寶數據庫月報上的一張圖):

我們來再看一個比較有意思的例子。:
CREATE TABLE `testblob` (
`blob1` blob NOT NULL,
`blob2` blob NOT NULL,
`blob3` blob NOT NULL,
`blob4` blob NOT NULL,
`blob5` blob NOT NULL,
`blob6` blob NOT NULL,
`blob7` blob NOT NULL,
`blob8` blob NOT NULL,
`blob9` blob NOT NULL,
`blob10` blob NOT NULL,
`blob11` blob NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
mysql> insert into testblob select repeat('a',1000),repeat('b',1000),repeat('c',1000),repeat('d',1000),repeat('e',1000),repeat('f',1000),repeat('g',1000),repeat('h',1000),repeat('i',1000),repeat('j',1000),repeat('k',1000);
ERROR 1030 (HY000): Got error 139 from storage engine
我們建立一張新表,有11個blob字段。然后向每個字段插入1000字節的數據,MySQL會提示 ERROR 1030 (HY000): Got error 139 from storage engine ,什么意思呢?
InnoDB是以B+樹來組織數據的,假如每一行數據都占據一整個Page頁,那么B+樹將退化為單鏈表,所以InnoDB規定了一個Page必須包含兩行數據。也就是一行數據存儲在Page上的大小大概為8000字節。
而上面的例子,一行數據有11個1000字節的數據,Page層肯定放不下,所以在Page層留下768*11=8448字節,已經超過了8000字節,所以MySQL會提示 ERROR 1030 (HY000): Got error 139 from storage engine 。我們很輕松的定義一個字段,來存儲11000個字節,但是卻無法將他們分成11個字段來存儲,有點意思!
那么如何解決上面的問題呢?
- 將行格式轉為接下來要說的Dynamic格式。此種格式只用20字節指向外部存儲空間。
- 將多個blob字段轉為一個blob字段。多個字段可以用數組存儲,然后json_encode打包進blob。
我們向表中插入一條有效記錄:
mysql> insert into testblob(blob1,blob2,blob3,blob4,blob5,blob6,blob7,blob8,blob9) select repeat('a',8000),repeat('b',8000),repeat('c',8000),repeat('d',8000),repeat('e',8000),repeat('f',8000),repeat('g',8000),repeat('h',8000),repeat('i',8000);
Query OK, 1 row affected (0.12 sec)
Records: 1 Duplicates: 0 Warnings: 0
[root@localhost mysql]# python py_innodb_page_info.py -v com/testblob.ibd
page offset 00000000, page type <File Space Header>
page offset 00000001, page type <Insert Buffer Bitmap>
page offset 00000002, page type <File Segment inode>
page offset 00000003, page type <B-tree Node>, page level <0000>
page offset 00000004, page type <Uncompressed BLOB Page>
page offset 00000005, page type <Uncompressed BLOB Page>
page offset 00000006, page type <Uncompressed BLOB Page>
page offset 00000007, page type <Uncompressed BLOB Page>
page offset 00000008, page type <Uncompressed BLOB Page>
page offset 00000009, page type <Uncompressed BLOB Page>
page offset 0000000a, page type <Uncompressed BLOB Page>
page offset 0000000b, page type <Uncompressed BLOB Page>
page offset 0000000c, page type <Uncompressed BLOB Page>
Total number of page: 13:
Insert Buffer Bitmap: 1
Uncompressed BLOB Page: 9
File Space Header: 1
B-tree Node: 1
File Segment inode: 1
我們可以看出這一行數據有9個外部存儲頁,而我們一共就插入了9列數據,是不是當每一列的數據在page頁放不下,都單獨申請一個外部存儲頁,而互相之前不共享外部存儲頁。我們看一下page頁的結構就知道了:
3130 0000c390 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa|
3131 0000c3a0 61 61 61 61 00 00 00 05 00 00 00 04 00 00 00 26 |aaaa...........&|
...
...
3180 0000c6b0 62 62 62 62 62 62 62 62 00 00 00 05 00 00 00 05 |bbbbbbbb........|
3181 0000c6c0 00 00 00 26 00 00 00 00 00 00 1c 40 63 63 63 63 |...&.......@cccc|
...
...
3229 0000c9c0 63 63 63 63 63 63 63 63 63 63 63 63 00 00 00 05 |cccccccccccc....|
3230 0000c9d0 00 00 00 06 00 00 00 26 00 00 00 00 00 00 1c 40 |.......&.......@|
...
...
根據前面的分析,我們現在可以看出,外部存儲頁是不共享的,即使一個列的數據多出一個字節,這一個字節也是獨占一個16KB空間的大小,這很浪費存儲空間。(當然,這對現代計算機可能不是問題,呵呵)。
說了這么多,總結下Compact格式存儲大數據的缺點:
- 由于存在768字節的前綴在Page頁,所以會存在能定義一個字段,存儲11000字節,但是不能定義11個字段,每個字段存儲1000字節的"bug"。
- 外部存儲頁不共享,即使多余一個字節也是獨享16KB的頁面。
Dynamic行格式
接著我們首先看一下行格式為Dynamic是如何存儲大數據的:
mysql> select version();
+-----------+
| version() |
+-----------+
| 5.7.14 |
+-----------+
1 row in set (0.00 sec)
mysql> show table status like 'row'\G;
*************************** 1. row ***************************
Name: row
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 0
Avg_row_length: 0
Data_length: 16384
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: 2017-01-03 22:45:16
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)
創建和compact格式一樣的表:
CREATE TABLE `row` (
`content` varchar(65532) NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=latin1
insert into row(content) select repeat('a',65532);
Query OK, 1 row affected (0.03 sec)
Records: 1 Duplicates: 0 Warnings: 0
看下頁分布:
[root@localhost mysql]# python py_innodb_page_info.py -v row.ibd
page offset 00000000, page type <File Space Header>
page offset 00000001, page type <Insert Buffer Bitmap>
page offset 00000002, page type <File Segment inode>
page offset 00000003, page type <B-tree Node>, page level <0000>
page offset 00000004, page type <Uncompressed BLOB Page>
page offset 00000005, page type <Uncompressed BLOB Page>
page offset 00000006, page type <Uncompressed BLOB Page>
page offset 00000007, page type <Uncompressed BLOB Page>
page offset 00000008, page type <Uncompressed BLOB Page>
Total number of page: 9:
Insert Buffer Bitmap: 1
Uncompressed BLOB Page: 5
File Space Header: 1
B-tree Node: 1
File Segment inode: 1
第4頁是數據頁,第5-9頁是二進制頁。我們直接看磁盤中第4頁的數據:
3073 0000c000 dc 2d b0 f5 00 00 00 03 ff ff ff ff ff ff ff ff |.-..............|
3074 0000c010 00 00 00 00 00 a3 4b 59 45 bf 00 00 00 00 00 00 |......KYE.......|
3075 0000c020 00 00 00 00 00 36 00 02 00 a6 80 03 00 00 00 00 |.....6..........|
3076 0000c030 00 7f 00 05 00 00 00 01 00 00 00 00 00 00 00 00 |................|
3077 0000c040 00 00 00 00 00 00 00 00 00 64 00 00 00 36 00 00 |.........d...6..|
3078 0000c050 00 02 00 f2 00 00 00 36 00 00 00 02 00 32 01 00 |.......6.....2..|
3079 0000c060 02 00 1c 69 6e 66 69 6d 75 6d 00 02 00 0b 00 00 |...infimum......|
3080 0000c070 73 75 70 72 65 6d 75 6d 14 c0 00 00 10 ff f1 00 |supremum........|
3081 0000c080 00 00 00 02 00 00 00 00 00 07 07 a7 00 00 01 1b |................|
3082 0000c090 01 10 00 00 00 36 00 00 00 04 00 00 00 26 00 00 |.....6.......&..|
3083 0000c0a0 00 00 00 00 ff fc 00 00 00 00 00 00 00 00 00 00 |................|
3084 0000c0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
3085 0000c0c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
3086 0000c0d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
3087 0000c0e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
...
...
...
和Compact格式有著明顯的不同,當大數據在Page頁存放不下時,Dynamic行格式不會留768字節在Page頁,并且將全部大數據都放在外部存儲頁。具體的數據頁和外部存儲頁的連接關系同Compact格式一樣。
我們再看看Dynamic格式的外部存儲頁是不是每一個列獨享外部存儲空間,還是同Compact格式實驗過程一樣:
CREATE TABLE `testblob` (
`blob1` blob NOT NULL,
`blob2` blob NOT NULL,
`blob3` blob NOT NULL,
`blob4` blob NOT NULL,
`blob5` blob NOT NULL,
`blob6` blob NOT NULL,
`blob7` blob NOT NULL,
`blob8` blob NOT NULL,
`blob9` blob NOT NULL,
`blob10` blob NOT NULL,
`blob11` blob NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
mysql> insert into testblob(blob1,blob2,blob3,blob4,blob5,blob6,blob7,blob8,blob9,blob10,blob11) select repeat('a',8000),repeat('b',8000),repeat('c',8000),repeat('d',8000),repeat('e',8000),repeat('f',8000),repeat('g',8000),repeat('h',8000),repeat('i',8000),repeat('j',8000),repeat('k',8000);
Query OK, 1 row affected (0.10 sec)
Records: 1 Duplicates: 0 Warnings: 0
看一下外部存儲頁數據:
4599 00011f60 61 61 61 61 61 61 61 61 61 61 61 61 61 61 00 00 |aaaaaaaaaaaaaa..|
4600 00011f70 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
好的,可以不用向下看其他列的了,Dynamic的外部存儲頁也不是共享的。
但是MySQL為什么要這么設計呢?可能是為了實現簡單吧,沿著鏈表通過有效數據大小就能讀取blob的全部數據。假如多個字段的blob混在一起,可能設計更復雜,要更新每個字段的偏移量之類的,更新的話頁數據管理也比較麻煩。我的個人猜測,呵呵。
總結下Dynamic格式存儲大數據的特點:
- 當數據頁放不下時,MySQL會將大數據全部放在外部存儲頁,數據頁只留指向外部存儲頁的指針。
- 外部存儲頁不共享,即使多余一個字節也是獨享16KB的頁面。
將列放入外部存儲頁的標準
當一行中的數據不能在數據頁中放下,需要申請外部存儲頁時,MySQL需要決定將哪一列的數據放到外部存儲頁,遵循的規則如下:
- 長度固定的字段不會被放到外部存儲頁(int、char(N)等)
- 長度小于20字節的字段不會被放到外部存儲頁。(假如放到外部存儲頁,不僅會單獨占據16KB,還要額外的20字節指針,沒有必要)
- 對于Compact和REDUNDANT格式的行數據,長度小于768字節的字段不會被放到外部存儲頁。(這個原因很顯然,本來就不夠768字節的前綴,總不能生搬硬湊吧)。
當有多個大數據字段滿足上面條件,需要被放到外部存儲頁時,MySQL會優先選擇大的字段放到外部存儲頁,因為這樣可以最大限度的省下數據頁的空間,使得更多的字段能夠被放到數據頁。
由于有較多的實驗過程,所以顯得比較亂,建議看到這篇文章人自己實踐一遍,畢竟自己動手會思考更多的問題與細節,理解的也比較深刻,哈哈哈。
參考資料: http://dev.mysql.com/doc/refman/5.7/en/column-count-limit.html
http://mysqlserverteam.com/externally-stored-fields-in-innodb/
https://www.percona.com/blog/2010/02/09/blob-storage-in-innodb/
http://mysql.taobao.org/monthly/2016/02/01/
來自:https://github.com/zhangyachen/zhangyachen.github.io/issues/96