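Troubleshooting Notes: Nginx Log Rotation via logrotate Stopped Running
One night, just as I had settled into bed and was about to drift off, an alert SMS arrived with its "melodious" ringtone. I checked my phone: disk usage on one of the web servers had passed 90%. I had no choice but to get up, open my laptop, connect to the V*N, log in to the machine, and start dealing with it.
First, check which partition is in trouble: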
[tabalt@localhost ~]$ df -lh
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol03
                       20G  3.7G   16G  20% /
tmpfs                  32G     0   32G   0% /dev/shm
/dev/sda1              97M   54M   39M  59% /boot
/dev/mapper/VolGroup00-LogVol01
                     1008M   34M  924M   4% /tmp
/dev/mapper/VolGroup00-LogVol02
                      4.0G  371M  3.4G  10% /var
/dev/mapper/VolGroup00-data
                      493G  449G   44G  91% /data
Clearly the data volume (mounted at /data) is nearly full, so look at the size of each directory inside it:
[tabalt@localhost ~]$ cd /data/; sudo du -sh *
16K lost+found
392G nginx
4.2M php
...
[tabalt@localhost ~]$ sudo du -sh nginx/logs/
392G nginx/logs/
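To rank the entries instead of comparing the sizes by eye, something like the following could be used. It is only a sketch: largest_entries is a hypothetical helper name, and it assumes du, sort and head behave as they do on a typical Linux system. Directories that your user cannot read will need elevated privileges.

```shell
# List the largest entries directly under a directory, biggest first.
largest_entries() {
    # $1: directory to inspect, $2: how many entries to show (default 5)
    # du -sk prints one size in KiB per entry; sort -rn orders them numerically
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-5}"
}

# Example: show the three largest entries under /data
# largest_entries /data 3
```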
Almost all of the space is taken by the logs directory under nginx, which holds Nginx's access logs and error logs. Check the file sizes there:
[tabalt@localhost ~]$ cd nginx/logs/; ls -lh
-rw-r--r-- 1 nobody root 196G Jan 8 03:44 allweb.log
-rw-r--r-- 1 nobody root 0 Jan 7 23:55 www_domain_com_error.log
-rw-r--r-- 1 nobody root 49G Jan 8 03:44 www_domain_com.log
...
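The access log files are all very large, and head allweb.log showed entries from several days earlier. We have logrotate configured to rotate the Nginx logs every day, so was something wrong with that configuration? logrotate had been installed and configured by a colleague on the ops team, so the first step was to find the configuration files:
[tabalt@localhost ~]$ locate logrotate.conf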
/etc/logrotate.conf
/usr/local/nginx/conf/logrotate.conf
/usr/local/php/etc/logrotate.conf
/usr/share/man/man5/logrotate.conf.5.gz
The first two are the ones likely to be relevant. Start with the main configuration file:
[tabalt@localhost ~]$ cat /etc/logrotate.conf
# see "man logrotate" for details
# rotate log files weekly
weekly
# keep 4 weeks worth of backlogs
rotate 4
# create new (empty) log files after rotating old ones
create
# use date as a suffix of the rotated file
dateext
# uncomment this if you want your log files compressed
#compress
# RPM packages drop log rotation information into this directory
include /etc/logrotate.d
# no packages own wtmp and btmp -- we'll rotate them here
/var/log/wtmp {
monthly
create 0664 root utmp
minsize 1M
rotate 1
}
/var/log/btmp {
missingok
monthly
create 0600 root utmp
rotate 1
}
# system-specific logs may be also be configured here.
[tabalt@localhost ~]$ ll /etc/logrotate.d/
total 24
-rw-r--r--. 1 root root 103 Dec 8 2011 dracut
-rw-r--r--. 1 root root 135 Nov 11 2010 iptraf
-rw-r--r--. 1 root root 329 Aug 24 2010 psacct
-rw-r--r--. 1 root root 68 Aug 23 2010 sa-update
-rw-r--r--. 1 root root 210 Aug 3 2011 syslog
-rw-r--r--. 1 root root 100 Dec 9 2011 yum
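Neither the main configuration nor /etc/logrotate.d contains anything for the Nginx log directory, so look at the logrotate configuration file under the Nginx conf directory:
[tabalt@localhost ~]$ cat /usr/local/nginx/conf/logrotate.conf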
/data/nginx/logs/* {
dateext
dateformat -%Y%m%d-%s
rotate 120
maxage 7
olddir archive
missingok
nocreate
sharedscripts
postrotate
test ! -f /var/run/nginx.pid || kill -USR1 `cat /var/run/nginx.pid`
endscript
}
include /usr/local/nginx/conf/logrotate.d
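This file rotates /data/nginx/logs/, which is what we were looking for. The configuration itself looks fine, so is logrotate simply not running properly? Check logrotate's record of past rotations:
[tabalt@localhost ~]$ cat /var/lib/logrotate.status | grep '/data/nginx/logs/allweb.log'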
"/data/nginx/logs/allweb.log" 2016-12-29
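The status file confirms that allweb.log was last rotated on 2016-12-29, several days earlier. Why did rotation stop after that? logrotate is run on a schedule by cron, and after some digging I found two cron entries that invoke it:
[tabalt@localhost ~]$ sudo cat /etc/cron.daily/logrotate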
#!/bin/sh
/usr/sbin/logrotate /etc/logrotate.conf
EXITVALUE=$?
if [ $EXITVALUE != 0 ]; then
/usr/bin/logger -t logrotate "ALERT exited abnormally with [$EXITVALUE]"
fi
exit 0
[tabalt@localhost ~]$ cat /etc/cron.d/nginx
55 23 * * * root sleep `perl -e "print int(rand(120))"` && /usr/sbin/logrotate -v -f /usr/local/nginx/conf/logrotate.conf
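Checking the status file one log at a time, as was done earlier for allweb.log, only works when you already know which file to suspect. A rough sketch of a broader check follows. stale_entries is a hypothetical helper that lists every entry in a logrotate status file whose last rotation date is older than a given cutoff. It assumes each entry is a quoted path followed by a year-month-day date, as in the output shown earlier. Some versions append a time of day, which the sketch ignores.

```shell
# Print status entries whose last rotation date is older than a cutoff.
stale_entries() {
    # $1: path to the status file, $2: cutoff date as YYYY-MM-DD
    awk -v cutoff="$2" '
        BEGIN {
            split(cutoff, c, "-")
            limit = c[1] * 10000 + c[2] * 100 + c[3]
        }
        # Entry lines start with a double quote; the header line does not
        /^"/ {
            split($NF, d, "-")
            # Compare numerically so that unpadded dates such as 2017-1-8 work
            if (d[1] * 10000 + d[2] * 100 + d[3] < limit) print
        }
    ' "$1"
}

# Example: list everything not rotated since 7 January 2017
# stale_entries /var/lib/logrotate.status 2017-01-07
```

Comparing the dates as numbers rather than as strings is deliberate: month and day are not necessarily zero-padded in the status file.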
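So the main configuration can be ignored; only the Nginx-specific one matters. Run it by hand, adding -d, which makes logrotate do a debug dry run without actually rotating anything:
[tabalt@localhost ~]$ sudo /usr/sbin/logrotate -v -f -d /usr/local/nginx/conf/logrotate.conf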
reading config file /usr/local/nginx/conf/logrotate.conf
reading config info for /data/nginx/logs/*
olddir is now archive
including /usr/local/nginx/conf/logrotate.d
reading config file www_domain_com.conf
reading config info for /data/app/src/www_domain_com/logs/*
olddir is now archive
error: www_domain_com.conf:12 error verifying olddir path /data/app/src/www_domain_com/logs/archive: No such file or directory
error: found error in file www_domain_com.conf, skipping
removing last 1 log configs
removing last 2 log configs
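It reported an error! According to the output, the archive directory (the olddir) for one of the included configurations, /data/app/src/www_domain_com/logs/archive, had for some reason gone missing. Judging by the "removing last ... log configs" lines, that error appears to have made logrotate discard the other parsed entries as well, which would explain why the logs under /data/nginx/logs/ were no longer being rotated. I quickly recreated the directory and ran the test again. The error was gone and the problem was solved.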
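Because a single error line was enough to stop rotation here, the dry run could be turned into an automated check. The sketch below is one possible approach, not part of the original setup. report_logrotate_errors is a hypothetical helper that reads logrotate's output on standard input, prints any lines beginning with "error:", and returns a non-zero status if it finds one.

```shell
# Scan logrotate output on stdin for error lines.
report_logrotate_errors() {
    # grep prints the matching lines and exits 0 when at least one matched
    if grep '^error:'; then
        return 1   # errors were found
    fi
    return 0       # no error lines in the input
}

# Example: dry-run the Nginx config and feed the combined output to the check
# sudo /usr/sbin/logrotate -d /usr/local/nginx/conf/logrotate.conf 2>&1 | report_logrotate_errors
```

Redirecting with 2>&1 means the check sees the messages whichever stream logrotate writes them to.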
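To sum up: because I was not familiar with the logrotate setup the ops colleague had put in place, the investigation took longer than it should have. Fortunately a disk space alert is not a very serious problem: manually deleting one of the larger files bought plenty of time to find the cause. Why the archive directory under the log directory disappeared is still a puzzle; my guess is that someone removed it by mistake. Directories and files that a production deployment depends on could also be monitored, so that problems like this are caught proactively.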
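The directory monitoring suggested above could start as small as the following sketch. missing_dirs is a hypothetical helper. In the usage line, the second path is the one from the error message, while the first is inferred from the olddir archive setting for /data/nginx/logs/ and is an assumption. How the result reaches an alerting system is left open.

```shell
# Print each required directory that does not exist.
# Returns non-zero if at least one is missing.
missing_dirs() {
    rc=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            printf '%s\n' "$dir"
            rc=1
        fi
    done
    return "$rc"
}

# Example: check the archive directories that logrotate expects
# missing_dirs /data/nginx/logs/archive /data/app/src/www_domain_com/logs/archive
```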
Source: http://tabalt.net/blog/nginx-logrotate-execution-failure/