We use Zabbix to monitor our servers, but recently the monitoring agent has been causing some problems of its own.
About once a week we send a fairly large mailshot out to our users. Zabbix monitors the size of the postfix mail queue on each of our mail servers, and stores it in its database so it can draw graphs and alert us if the mail queue gets too big. But here’s the problem: counting the mail queue is itself quite intensive, and it seems to be locking up the server when it runs.
After some investigation I found (in /etc/zabbix/zabbix_agentd.conf) that we were using the following command to measure the mailq:
[root@mx1 ~]# time mailq | grep -c '^[0-9A-Z]'
34619
real 0m6.590s
user 0m2.144s
sys 0m0.289s
As you can see, it took 6.59 seconds to run on a queue of about 35,000 messages. You could also run the postqueue command and look at the summary line at the end of the output:
[root@mx1 ~]# time postqueue -p | tail -5
-- 158346 Kbytes in 34621 Requests.
real 0m5.668s
user 0m0.075s
sys 0m0.225s
But again, this takes over 5 seconds for 35,000 mails. A much quicker way is to count the queue files directly:
[root@mx1 ~]# time find /var/spool/postfix/deferred/ /var/spool/postfix/active/ /var/spool/postfix/maildrop/ | wc -l
34640
real 0m0.033s
user 0m0.030s
sys 0m0.022s
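Using find is over 100 times faster than the other two methods (0.033 seconds against 5.7 and 6.6 seconds). Each of those commands reports a slightly different size for the mailq, but they are pretty close. The find figure is probably a little high because find also lists the queue directories themselves, not just the message files (adding -type f would restrict it to files), and the queue keeps changing between runs anyway. If anyone knows of an even quicker way to measure the queue size then please let me know!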
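For anyone who wants to reuse this, here is a minimal sketch of the find approach wrapped in a shell function. It is not the exact command from our config: the function name count_postfix_queue is made up for illustration, it assumes the same three queue directories under /var/spool/postfix used above, and it adds -type f so that only files are counted.

```shell
#!/bin/sh
# Sketch only: count_postfix_queue is a hypothetical helper name.
# Counts regular files in the deferred, active and maildrop queues.
# -type f skips the directories, which the plain find above also lists.
count_postfix_queue() {
    # First argument overrides the spool directory (assumed default below).
    spool="${1:-/var/spool/postfix}"
    # Errors (missing or unreadable directories) are discarded;
    # tr strips the padding that some wc implementations add.
    find "$spool/deferred" "$spool/active" "$spool/maildrop" -type f 2>/dev/null \
        | wc -l | tr -d ' '
}

# Print the count for the default spool directory (or the one given as $1).
count_postfix_queue "$@"
```

To feed this to Zabbix you could save it as a script and point a UserParameter line in zabbix_agentd.conf at it, using whatever item key you like. Bear in mind that the postfix queue directories are normally not readable by other users, so the account the agent runs as may need extra permissions (via sudo, for example) before the count is accurate.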