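PostgreSQL: a streaming real-time statistics application handling millions of records per second
PipelineDB is a streaming relational database developed on top of PostgreSQL (version 0.8.1 is based on 9.4.4). It processes streaming data automatically and stores only the processed results, not the raw data, which makes it a good fit for the real-time stream processing that is popular today, for example website traffic statistics, monitoring statistics for IT services, App Store browsing statistics, and so on.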
http://www.postgresql.org/about/news/1596/

PipelineDB, an open-source relational streaming-SQL database, publicly released version (0.7.7) today and made the product available as open-source via their website and GitHub. PipelineDB is based on, and is wire compatible with, PostgreSQL 9.4 and has added functionality including continuous SQL queries, probabilistic data structures, sliding windowing, and stream-table joins. For a full description of PipelineDB and its capabilities see their technical documentation.

PipelineDB’s fundamental abstraction is what is called a continuous view. These are much like regular SQL views, except that their defining SELECT queries can include streams as a source to read from. The most important property of continuous views is that they only store their output in the database. That output is then continuously updated incrementally as new data flows through streams, and raw stream data is discarded once all continuous views have read it.

Let's look at a canonical example:

CREATE CONTINUOUS VIEW v AS SELECT COUNT(*) FROM stream

Only one row would ever physically exist in PipelineDB for this continuous view, and its value would simply be incremented for each new event ingested.

For more information on PipelineDB as a company, product and for examples and benefits, please check out their first blog post on their new website.
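The "only the output is stored" property described in the announcement can be illustrated with a small Python model. This is only a sketch of the idea, not PipelineDB code: the class name `ContinuousCountView` and its `ingest` method are hypothetical names made up for this illustration.

```python
class ContinuousCountView:
    """Toy model of: CREATE CONTINUOUS VIEW v AS SELECT COUNT(*) FROM stream.

    Hypothetical illustration only; PipelineDB itself is implemented in C
    inside the PostgreSQL server.
    """

    def __init__(self):
        # The single aggregate value is the only state that is ever kept.
        self.count = 0

    def ingest(self, event):
        # The event is read once, folded into the aggregate, and then dropped.
        self.count += 1


view = ContinuousCountView()
for event in ({"x": i} for i in range(100_000)):
    view.ingest(event)

# Storage stays constant no matter how many events were ingested.
print(view.count)
```

However many events flow through, the model holds one integer, which mirrors the single physical row mentioned in the announcement.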
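Example:
Create dynamic continuous views. The stream does not need to be defined as a table first, which is great; it feels very much like NoSQL.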
pipeline=# CREATE CONTINUOUS VIEW v0 AS SELECT COUNT(*) FROM stream;
CREATE CONTINUOUS VIEW
pipeline=# CREATE CONTINUOUS VIEW v1 AS SELECT COUNT(*) FROM stream;
CREATE CONTINUOUS VIEW
Activate the continuous views
pipeline=# ACTIVATE;
ACTIVATE 2
Write data to the stream
pipeline=# INSERT INTO stream (x) VALUES (1);
INSERT 0 1
pipeline=# SET stream_targets TO v0;
SET
pipeline=# INSERT INTO stream (x) VALUES (1);
INSERT 0 1
pipeline=# SET stream_targets TO DEFAULT;
SET
pipeline=# INSERT INTO stream (x) VALUES (1);
INSERT 0 1
-- To stop receiving stream data, simply deactivate
pipeline=# DEACTIVATE;
DEACTIVATE 2
Query the continuous views
pipeline=# SELECT count FROM v0;
 count
-------
     3
(1 row)

pipeline=# SELECT count FROM v1;
 count
-------
     2
(1 row)

pipeline=#
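Why v0 ends up at 3 and v1 at 2 follows from the stream_targets setting used for the second insert. The following Python sketch models that routing; the `insert` helper and the `views` dictionary are hypothetical stand-ins, and the assumption is that the default setting delivers each row to every continuous view reading the stream.

```python
def insert(views, row, targets=None):
    """Deliver one row to the targeted views (all views when targets is None).

    `targets=None` models `SET stream_targets TO DEFAULT`; this is an
    assumption made for the illustration, not PipelineDB source code.
    """
    for name in (targets if targets is not None else list(views)):
        views[name] += 1  # each view is a COUNT(*) over the stream


views = {"v0": 0, "v1": 0}

insert(views, {"x": 1})                  # default targets: v0 and v1
insert(views, {"x": 1}, targets=["v0"])  # SET stream_targets TO v0
insert(views, {"x": 1})                  # back to the default

print(views)
```

Three rows reach v0, but only the first and third reach v1, which matches the query results in the session.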
Trying it out in a local virtual machine
Installation
[root@digoal soft_bak]# rpm -ivh pipelinedb-0.8.1-centos6-x86_64.rpm
Preparing...                ########################################### [100%]
   1:pipelinedb             ########################################### [100%]
/sbin/ldconfig: /opt/gcc4.9.3/lib/libstdc++.so.6.0.20-gdb.py is not an ELF file - it has the wrong magic bytes at the start.
/sbin/ldconfig: /opt/gcc4.9.3/lib64/libstdc++.so.6.0.20-gdb.py is not an ELF file - it has the wrong magic bytes at the start.

    ____  _            ___            ____  ____
   / __ \(_)___  ___  / (_)___  ___  / __ \/ __ )
  / /_/ / / __ \/ _ \/ / / __ \/ _ \/ / / / __  |
 / ____/ / /_/ /  __/ / / / / /  __/ /_/ / /_/ /
/_/   /_/ .___/\___/_/_/_/ /_/\___/_____/_____/
       /_/

PipelineDB successfully installed. To get started, initialize a
database directory:

pipeline-init -D <data directory>

where <data directory> is a nonexistent directory where you'd
like all of your database files to live.

You can find the PipelineDB documentation at:

http://docs.pipelinedb.com
Configuration
[root@digoal soft_bak]# cd /usr/lib/pipelinedb
[root@digoal pipelinedb]# ll
total 16
drwxr-xr-x 2 root root 4096 Oct 15 10:47 bin
drwxr-xr-x 5 root root 4096 Oct 15 10:47 include
drwxr-xr-x 6 root root 4096 Oct 15 10:47 lib
drwxr-xr-x 4 root root 4096 Oct 15 10:47 share
[root@digoal pipelinedb]# useradd pdb
[root@digoal pipelinedb]# vi /home/pdb/.bash_profile
# add by digoal
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1953
export PGDATA=/data01/pg_root_1953
export LANG=en_US.utf8
export PGHOME=/usr/lib/pipelinedb
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGDATABASE=pipeline
export PGUSER=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi

[root@digoal pipelinedb]# mkdir /data01/pg_root_1953
[root@digoal pipelinedb]# chown pdb:pdb /data01/pg_root_1953
[root@digoal pipelinedb]# chmod 700 /data01/pg_root_1953
[root@digoal pipelinedb]# su - pdb
pdb@digoal-> which psql
/usr/lib/pipelinedb/bin/psql
Initialize the database
pdb@digoal-> psql -V
psql (PostgreSQL) 9.4.4
pdb@digoal-> cd /usr/lib/pipelinedb/bin/
pdb@digoal-> ll
total 13M
-rwxr-xr-x 1 root root  62K Sep 18 01:01 clusterdb
-rwxr-xr-x 1 root root  62K Sep 18 01:01 createdb
-rwxr-xr-x 1 root root  66K Sep 18 01:01 createlang
-rwxr-xr-x 1 root root  63K Sep 18 01:01 createuser
-rwxr-xr-x 1 root root  44K Sep 18 01:02 cs2cs
-rwxr-xr-x 1 root root  58K Sep 18 01:01 dropdb
-rwxr-xr-x 1 root root  66K Sep 18 01:01 droplang
-rwxr-xr-x 1 root root  58K Sep 18 01:01 dropuser
-rwxr-xr-x 1 root root 776K Sep 18 01:01 ecpg
-rwxr-xr-x 1 root root  28K Sep 18 00:57 gdaladdo
-rwxr-xr-x 1 root root  79K Sep 18 00:57 gdalbuildvrt
-rwxr-xr-x 1 root root 1.3K Sep 18 00:57 gdal-config
-rwxr-xr-x 1 root root  33K Sep 18 00:57 gdal_contour
-rwxr-xr-x 1 root root 188K Sep 18 00:57 gdaldem
-rwxr-xr-x 1 root root  74K Sep 18 00:57 gdalenhance
-rwxr-xr-x 1 root root 131K Sep 18 00:57 gdal_grid
-rwxr-xr-x 1 root root  83K Sep 18 00:57 gdalinfo
-rwxr-xr-x 1 root root  90K Sep 18 00:57 gdallocationinfo
-rwxr-xr-x 1 root root  42K Sep 18 00:57 gdalmanage
-rwxr-xr-x 1 root root 236K Sep 18 00:57 gdal_rasterize
-rwxr-xr-x 1 root root  25K Sep 18 00:57 gdalserver
-rwxr-xr-x 1 root root  77K Sep 18 00:57 gdalsrsinfo
-rwxr-xr-x 1 root root  49K Sep 18 00:57 gdaltindex
-rwxr-xr-x 1 root root  33K Sep 18 00:57 gdaltransform
-rwxr-xr-x 1 root root 158K Sep 18 00:57 gdal_translate
-rwxr-xr-x 1 root root 168K Sep 18 00:57 gdalwarp
-rwxr-xr-x 1 root root  41K Sep 18 01:02 geod
-rwxr-xr-x 1 root root 1.3K Sep 18 00:51 geos-config
lrwxrwxrwx 1 root root    4 Oct 15 10:47 invgeod -> geod
lrwxrwxrwx 1 root root    4 Oct 15 10:47 invproj -> proj
-rwxr-xr-x 1 root root  20K Sep 18 01:02 nad2bin
-rwxr-xr-x 1 root root 186K Sep 18 00:57 nearblack
-rwxr-xr-x 1 root root 374K Sep 18 00:57 ogr2ogr
-rwxr-xr-x 1 root root  77K Sep 18 00:57 ogrinfo
-rwxr-xr-x 1 root root 283K Sep 18 00:57 ogrlineref
-rwxr-xr-x 1 root root  47K Sep 18 00:57 ogrtindex
-rwxr-xr-x 1 root root  30K Sep 18 01:01 pg_config
-rwxr-xr-x 1 root root  30K Sep 18 01:01 pg_controldata
-rwxr-xr-x 1 root root  33K Sep 18 01:01 pg_isready
-rwxr-xr-x 1 root root  39K Sep 18 01:01 pg_resetxlog
-rwxr-xr-x 1 root root 183K Sep 18 01:02 pgsql2shp
lrwxrwxrwx 1 root root    4 Oct 15 10:47 pipeline -> psql
-rwxr-xr-x 1 root root  74K Sep 18 01:01 pipeline-basebackup
lrwxrwxrwx 1 root root    9 Oct 15 10:47 pipeline-config -> pg_config
-rwxr-xr-x 1 root root  44K Sep 18 01:01 pipeline-ctl
-rwxr-xr-x 1 root root 355K Sep 18 01:01 pipeline-dump
-rwxr-xr-x 1 root root  83K Sep 18 01:01 pipeline-dumpall
-rwxr-xr-x 1 root root 105K Sep 18 01:01 pipeline-init
-rwxr-xr-x 1 root root  50K Sep 18 01:01 pipeline-receivexlog
-rwxr-xr-x 1 root root  56K Sep 18 01:01 pipeline-recvlogical
-rwxr-xr-x 1 root root 153K Sep 18 01:01 pipeline-restore
-rwxr-xr-x 1 root root 6.2M Sep 18 01:01 pipeline-server
lrwxrwxrwx 1 root root   15 Oct 15 10:47 postmaster -> pipeline-server
-rwxr-xr-x 1 root root  49K Sep 18 01:02 proj
-rwxr-xr-x 1 root root 445K Sep 18 01:01 psql
-rwxr-xr-x 1 root root 439K Sep 18 01:02 raster2pgsql
-rwxr-xr-x 1 root root  62K Sep 18 01:01 reindexdb
-rwxr-xr-x 1 root root 181K Sep 18 01:02 shp2pgsql
-rwxr-xr-x 1 root root  27K Sep 18 00:57 testepsg
-rwxr-xr-x 1 root root  63K Sep 18 01:01 vacuumdb
pdb@digoal-> pipeline-init -D $PGDATA -U postgres -E UTF8 --locale=C -W
pdb@digoal-> cd $PGDATA
pdb@digoal-> ll
total 108K
drwx------ 5 pdb pdb 4.0K Oct 15 10:57 base
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 global
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_clog
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_dynshmem
-rw------- 1 pdb pdb 4.4K Oct 15 10:57 pg_hba.conf
-rw------- 1 pdb pdb 1.6K Oct 15 10:57 pg_ident.conf
drwx------ 4 pdb pdb 4.0K Oct 15 10:57 pg_logical
drwx------ 4 pdb pdb 4.0K Oct 15 10:57 pg_multixact
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_notify
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_replslot
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_serial
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_snapshots
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_stat
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_stat_tmp
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_subtrans
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_tblspc
drwx------ 2 pdb pdb 4.0K Oct 15 10:57 pg_twophase
-rw------- 1 pdb pdb    4 Oct 15 10:57 PG_VERSION
drwx------ 3 pdb pdb 4.0K Oct 15 10:57 pg_xlog
-rw------- 1 pdb pdb   88 Oct 15 10:57 pipelinedb.auto.conf
-rw------- 1 pdb pdb  23K Oct 15 10:57 pipelinedb.conf
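Parameters related to stream processing include, for example, the buffer memory size, whether commits are synchronous, the batch size, the number of worker and combiner processes, and so on.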
pipelinedb.conf
#------------------------------------------------------------------------------
# CONTINUOUS VIEW OPTIONS
#------------------------------------------------------------------------------

# size of the buffer for storing unread stream tuples
#tuple_buffer_blocks = 128MB

# synchronization level for combiner commits; off, local, remote_write, or on
#continuous_query_combiner_synchronous_commit = off

# maximum amount of memory to use for combiner query executions
#continuous_query_combiner_work_mem = 256MB

# maximum memory to be used by the combiner for caching; this is independent
# of combiner_work_mem
#continuous_query_combiner_cache_mem = 32MB

# the default fillfactor to use for continuous views
#continuous_view_fillfactor = 50

# the time in milliseconds a continuous query process will wait for a batch
# to accumulate
# continuous_query_max_wait = 10

# the maximum number of events to accumulate before executing a continuous query
# plan on them
#continuous_query_batch_size = 10000

# the number of parallel continuous query combiner processes to use for
# each database
#continuous_query_num_combiners = 2

# the number of parallel continuous query worker processes to use for
# each database
#continuous_query_num_workers = 2

# allow direct changes to be made to materialization tables?
#continuous_query_materialization_table_updatable = off

# inserts into streams should be synchronous?
#synchronous_stream_insert = off

# continuous views that should be affected when writing to streams.
# it is string with comma separated values for continuous view names.
#stream_targets = ''
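The interplay between continuous_query_batch_size and continuous_query_max_wait can be pictured as micro-batching: a batch is handed off either when it is full or when it has waited long enough. The Python sketch below is only a model under that assumption; the `micro_batches` function is hypothetical, and arrival times are passed in explicitly so the example is deterministic.

```python
def micro_batches(events, batch_size, max_wait_ms):
    """Group (arrival_ms, payload) pairs into batches.

    A batch is emitted when it holds `batch_size` events, or when the next
    event arrives `max_wait_ms` or more after the batch was started.
    Hypothetical model of the two settings, not PipelineDB's scheduler.
    """
    batch, started = [], None
    for arrival_ms, payload in events:
        if batch and arrival_ms - started >= max_wait_ms:
            yield batch          # waited long enough: flush what we have
            batch = []
        if not batch:
            started = arrival_ms  # a new batch starts with this event
        batch.append(payload)
        if len(batch) >= batch_size:
            yield batch          # batch is full: flush immediately
            batch = []
    if batch:
        yield batch              # flush the remainder at end of input


events = [(0, "a"), (1, "b"), (2, "c"), (30, "d"), (31, "e")]
by_size = list(micro_batches(events, batch_size=3, max_wait_ms=10))
by_wait = list(micro_batches(events, batch_size=10, max_wait_ms=10))
print(by_size, by_wait)
```

With a small batch size the first three events are flushed as soon as the batch fills; with a large batch size the same three events are flushed because the fourth arrives after the wait has expired.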
Start the database. You can see that PostGIS is supported out of the box. A small gripe: was this project developed specifically for NASA?
pdb@digoal-> pipeline-ctl start
pdb@digoal-> psql pipeline postgres
psql (9.4.4)
Type "help" for help.

pipeline=# \l
                             List of databases
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges
-----------+----------+----------+---------+-------+-----------------------
 pipeline  | postgres | UTF8     | C       | C     |
 template0 | postgres | UTF8     | C       | C     | =c/postgres          +
           |          |          |         |       | postgres=CTc/postgres
 template1 | postgres | UTF8     | C       | C     | =c/postgres          +
           |          |          |         |       | postgres=CTc/postgres
(3 rows)

pipeline=# \dx
                                      List of installed extensions
       Name       | Version  |   Schema   |                             Description
------------------+----------+------------+---------------------------------------------------------------------
 plpgsql          | 1.0      | pg_catalog | PL/pgSQL procedural language
 postgis          | 2.2.0dev | pg_catalog | PostGIS geometry, geography, and raster spatial types and functions
 postgis_topology | 2.2.0dev | topology   | PostGIS topology spatial types and functions
(3 rows)
Check which functions PipelineDB has added. Some come in as extensions, such as PostGIS; others are worth borrowing and can be used directly.
pipeline=# select proname from pg_proc order by oid desc;
......
second
minute
hour
day
month
year
......
cmsketch_empty
tdigest_add
tdigest_empty
tdigest_empty
bloom_add
bloom_empty
bloom_empty
hll_add
hll_empty
hll_empty
......
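As you can see, PipelineDB adds the hll (HyperLogLog), bloom (Bloom filter), tdigest (t-digest), and cmsketch (Count-Min Sketch) algorithms. There is much more to explore, for example continuous views that support grouping sets and window queries.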
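To give a feel for what such probabilistic structures trade off, here is a minimal Bloom filter in Python. It is a generic textbook sketch, not PipelineDB's bloom_add or bloom_empty implementation; the class name, the default sizes, and the use of SHA-256 for hashing are all assumptions chosen for the illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership sketch: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


seen = BloomFilter()
for user_id in range(100):
    seen.add(f"user-{user_id}")

print(seen.might_contain("user-42"))
```

The filter occupies a fixed 1024 bits regardless of how many items are added, which is the same property that makes these structures attractive for streams whose raw data is not kept.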
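Performance test in a virtual machine on my own laptop:
Create 5 dynamic continuous views. A dynamic continuous view is one that does not require a base table to be created.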
CREATE CONTINUOUS VIEW v0 AS SELECT COUNT(*) FROM stream;
CREATE CONTINUOUS VIEW v1 AS SELECT sum(x::int),count(*),avg(y::int) FROM stream;
CREATE CONTINUOUS VIEW v001 AS SELECT sum(x::int),count(*),avg(y::int) FROM stream1;
CREATE CONTINUOUS VIEW v002 AS SELECT sum(x::int),count(*),avg(y::int) FROM stream2;
CREATE CONTINUOUS VIEW v003 AS SELECT sum(x::int),count(*),avg(y::int) FROM stream3;
Activate the stream statistics
activate;
Check the data dictionary
select relname from pg_class where relkind='C';
Bulk insert test
pdb@digoal-> vi test.sql
insert into stream(x,y,z) select generate_series(1,1000),1,1;
insert into stream1(x,y,z) select generate_series(1,1000),1,1;
insert into stream2(x,y,z) select generate_series(1,1000),1,1;
insert into stream3(x,y,z) select generate_series(1,1000),1,1;
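Test results. Note that pgbench must be run with the simple or extended query mode here; with prepared, only the last SQL statement takes effect. It is not yet clear whether this is a bug in PipelineDB or in pgbench.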
pdb@digoal-> /opt/pgsql/bin/pgbench -M extended -n -r -f ./test.sql -P 1 -c 10 -j 10 -T 100000
progress: 1.0 s, 133.8 tps, lat 68.279 ms stddev 58.444
progress: 2.0 s, 143.9 tps, lat 71.623 ms stddev 53.880
progress: 3.0 s, 149.5 tps, lat 66.452 ms stddev 49.727
progress: 4.0 s, 148.3 tps, lat 67.085 ms stddev 55.484
progress: 5.1 s, 145.7 tps, lat 68.624 ms stddev 67.795
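About 580,000 records are ingested per second (roughly 145 transactions per second, each running 4 inserts of 1,000 rows), while the statistics for all 5 continuous views are kept up to date.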
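The per-second figure can be re-derived from the pgbench progress lines and the shape of test.sql. The short Python calculation below assumes each transaction runs the four inserts in test.sql with 1,000 rows each; the constant names are made up for this illustration.

```python
import re

# The five progress lines reported by pgbench in the run above.
progress_lines = """\
progress: 1.0 s, 133.8 tps, lat 68.279 ms stddev 58.444
progress: 2.0 s, 143.9 tps, lat 71.623 ms stddev 53.880
progress: 3.0 s, 149.5 tps, lat 66.452 ms stddev 49.727
progress: 4.0 s, 148.3 tps, lat 67.085 ms stddev 55.484
progress: 5.1 s, 145.7 tps, lat 68.624 ms stddev 67.795
"""

INSERTS_PER_TXN = 4     # test.sql contains four INSERT statements
ROWS_PER_INSERT = 1000  # each one selects generate_series(1,1000)

tps_values = [float(v) for v in re.findall(r"([\d.]+) tps", progress_lines)]
mean_tps = sum(tps_values) / len(tps_values)
rows_per_second = mean_tps * INSERTS_PER_TXN * ROWS_PER_INSERT

print(f"mean tps: {mean_tps:.1f}, rows/s: {rows_per_second:,.0f}")
```

The mean of these five samples is a little over 144 tps, which works out to somewhat under 580,000 rows per second, in line with the rounded figure quoted in the text.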
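On a physical machine, I estimate this could reach the level of 5 million records per second. I will try it later when I have time.
Because everything is done in memory, it is very fast.
PipelineDB uses background worker processes to process and merge the data.
The top output during the stress test is as follows: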
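One way to picture the worker and combiner processes (both appear in the configuration file and in the process list) is as a two-stage aggregation: a worker reduces each batch of events to a partial result, and a combiner merges that partial result into the stored row. The Python sketch below models this for a sum/count/avg view; the `Partial`, `worker`, and `combiner` names are hypothetical, and the exact division of labor inside PipelineDB is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Mergeable state for sum(), count(*) and avg() over one column."""
    total: int = 0
    count: int = 0

    def merge(self, other):
        # Sums and counts add up, so partial results can be combined in any order.
        return Partial(self.total + other.total, self.count + other.count)

    @property
    def avg(self):
        # avg is derived from (total, count); it is never stored on its own.
        return self.total / self.count if self.count else None


def worker(batch):
    """Reduce one batch of raw values to a partial aggregate."""
    return Partial(sum(batch), len(batch))


def combiner(stored, partial):
    """Fold a worker's partial result into the stored (materialized) state."""
    return stored.merge(partial)


stored = Partial()
for batch in ([1] * 1000, [1] * 1000, [1] * 500):
    stored = combiner(stored, worker(batch))  # the raw batch is not kept

print(stored.count, stored.avg)
```

Keeping avg as a (total, count) pair is what makes it mergeable; averaging the per-batch averages would give the wrong answer when batches differ in size.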
top - 11:23:07 up 2:49, 4 users, load average: 1.83, 3.08, 1.78
Tasks: 177 total, 5 running, 172 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.6%us, 15.0%sy, 10.3%ni, 63.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 3916744k total, 605084k used, 3311660k free, 27872k buffers
Swap: 1048572k total, 0k used, 1048572k free, 401748k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11469 pdb   25   5  405m  75m  67m R 52.9  2.0  1:56.45 pipeline: bgworker: worker0 [pipeline]
12246 pdb   20   0  400m  69m  67m S 14.3  1.8  0:10.55 pipeline: postgres pipeline [local] idle
12243 pdb   20   0  400m  69m  67m S 13.3  1.8  0:10.45 pipeline: postgres pipeline [local] idle
12248 pdb   20   0  400m  69m  67m S 13.3  1.8  0:10.40 pipeline: postgres pipeline [local] idle
12244 pdb   20   0  400m  69m  67m S 12.6  1.8  0:10.50 pipeline: postgres pipeline [local] idle
12237 pdb   20   0  400m  69m  67m R 12.3  1.8  0:10.52 pipeline: postgres pipeline [local] idle
12247 pdb   20   0  402m  70m  67m R 12.3  1.8  0:10.70 pipeline: postgres pipeline [local] idle
12245 pdb   20   0  401m  69m  67m S 12.0  1.8  0:10.78 pipeline: postgres pipeline [local] idle
12235 pdb   20   0  400m  69m  67m S 11.3  1.8  0:10.88 pipeline: postgres pipeline [local] idle
12239 pdb   20   0  400m  69m  67m S 11.0  1.8  0:10.79 pipeline: postgres pipeline [local] idle
12241 pdb   20   0  400m  69m  67m S 11.0  1.8  0:10.53 pipeline: postgres pipeline [local] idle
11466 pdb   20   0  119m 1480  908 R  5.3  0.0  0:58.39 pipeline: stats collector process
11468 pdb   25   5  401m  12m 9744 S  2.3  0.3  0:16.49 pipeline: bgworker: combiner0 [pipeline]
12228 pdb   20   0  678m 3408  884 S  2.3  0.1  0:02.36 /opt/pgsql/bin/pgbench -M extended -n -r -f ./test.sql -P 1 -c 10 -j 10 -T 100000
11464 pdb   20   0  398m  17m  16m S  1.7  0.4  0:10.47 pipeline: wal writer process
11459 pdb   20   0  398m 153m 153m S  0.0  4.0  0:00.37 /usr/lib/pipelinedb/bin/pipeline-server
11460 pdb   20   0  115m  852  424 S  0.0  0.0  0:00.02 pipeline: logger process
11462 pdb   20   0  398m 3336 2816 S  0.0  0.1  0:00.06 pipeline: checkpointer process
11463 pdb   20   0  398m 2080 1604 S  0.0  0.1  0:00.08 pipeline: writer process
11465 pdb   20   0  401m 4460 1184 S  0.0  0.1  0:00.33 pipeline: autovacuum launcher process
11467 pdb   20   0  398m 1992 1056 S  0.0  0.1  0:00.00 pipeline: continuous query scheduler process

pdb@digoal-> psql
psql (9.4.4)
Type "help" for help.

pipeline=# select * from v0;
  count
---------
 9732439
(1 row)

pipeline=# select * from v1;
    sum     |  count  |          avg
------------+---------+------------------------
 4923514276 | 9837585 | 1.00000000000000000000
(1 row)

pipeline=# select * from v001;
     sum      |  count   |          avg
--------------+----------+------------------------
 505023543131 | 11036501 | 1.00000000000000000000
(1 row)

pipeline=# select * from v002;
      sum      |  count   |          avg
---------------+----------+------------------------
 1005065536319 | 12119513 | 1.00000000000000000000
(1 row)

pipeline=# select * from v003;
     sum     |  count   |          avg
-------------+----------+------------------------
 14948355485 | 29867002 | 1.00000000000000000000
(1 row)
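After writing 1 billion stream records, the database is still only 13 MB in size, because the stream data lives in memory and is discarded as soon as it has been processed.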
pipeline=# \l+
                                                              List of databases
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges   | Size  | Tablespace |                Description
-----------+----------+----------+---------+-------+-----------------------+-------+------------+--------------------------------------------
 pipeline  | postgres | UTF8     | C       | C     |                       | 13 MB | pg_default | default administrative connection database
 template0 | postgres | UTF8     | C       | C     | =c/postgres          +| 12 MB | pg_default | unmodifiable empty database
           |          |          |         |       | postgres=CTc/postgres |       |            |
 template1 | postgres | UTF8     | C       | C     | =c/postgres          +| 12 MB | pg_default | default template for new databases
           |          |          |         |       | postgres=CTc/postgres |       |            |
(3 rows)
If your application has a similar scenario, congratulations: you have found your killer tool.
[References]
https://github.com/pipelinedb/pipelinedb
https://www.pipelinedb.com/