A journey to troubleshooting by sersync exclude bug

For the last half year, one of our file distributing syststem goes out of sync occasionally, which bothered me a lot.

The simplified system workflow looks like this:file distributing system

Files  are generated by the cms-node, which are then transfered to dist-node‘s /upload directory for furthur processing. On dist-node, there’s a sersync service which monitor file change events under /upload directory, when intrested event occurs, it will call some custom plugins to process the files, and then sync them to the /release directory.  Files under /release directory are all ready to publish, which will then be synced to all online-nodes by another sersync instance. Note that there are some files under dist-node‘s /upload directory which need special processing(say, /upload/special/*.html), it should be excluded from sersync’s monitor.

Now comes the problem.  Now and then we find that some files are different between dist-node‘s /release/special directory and online-nodes. Investigation shows that dist-node is the only system that has permission to publish files to online-nodes. After comparing the status & content of the different files,  we realized that  the file content on online-nodes is the result of  sersync processing on dist-node,  instead of special processing.   How can that be?

At first, we decide to  write a simple script to track the changes on dist-node‘s /release/special directory:

#!/usr/bin/python
############################################################################
#monitor_fs.py: a simple script to monitor directory for file change events.
############################################################################
import os, sys, time
import pyinotify

def log_msg(msg):
    now = time.strftime("%F %T")
    print "[%s]: %s" %( now, msg)
    sys.stdout.flush()

def stat(path):
    stat = os.stat(path)
    print path
    print "Acess Time:\t%s" % time.strftime("%F %T %z", time.localtime(stat.st_atime))  
    print "Change Time:\t%s" % time.strftime("%F %T %z", time.localtime(stat.st_ctime)) 
    print "Modify Time:\t%s" % time.strftime("%F %T %z", time.localtime(stat.st_mtime)) 
    print
    sys.stdout.flush()



class EventHandler(pyinotify.ProcessEvent):
    def __init__(self, watchManager, mask):
       self.wm = watchManager 
       self.mask = mask

    def process_default(self, event):
        log_msg("event: %s"  % event.maskname)
        filename = event.pathname
        if event.mask & pyinotify.IN_ISDIR:
            #add watch on new directory
            if event.mask & pyinotify.IN_CREATE:
                log_msg("created new dir:'%s'" % filename)
                self.wm.add_watch(filename, mask, rec=True)
            #remove watch on deleted directory
            elif event.mask & pyinotify.IN_DELETE:
                log_msg("deleted dir:'%s'" % filename)
                self.wm.del_watch(self.wm.get_wd(event.pathname))
        else:
			if not (event.mask & pyinotify.IN_DELETE):
				try:
					stat(filename)
				except OSError,e:
					log_msg(e)

if __name__ == '__main__': 
    wm = pyinotify.WatchManager() # Watch Manager
    mask = pyinotify.IN_CREATE | pyinotify.IN_CLOSE_WRITE | \
        pyinotify.IN_MOVED_TO | pyinotify.IN_DELETE | pyinotify.IN_ONLYDIR
    handler = EventHandler(wm, mask)
    notifier = pyinotify.Notifier(wm, handler)
    d = '/release/special'
    log_msg("watching %s" % d)
    wdd = wm.add_watch(d, mask, rec=True)

    notifier.loop()

We run monitor_fs.py > m.log for several days, then check m.log.

[2012-10-12 15:38:52]: event: IN_CREATE|IN_ISDIR
[2012-10-12 15:38:52]: created new dir:'/release/special/1e'
[2012-10-12 15:38:52]: event: IN_CREATE
[2012-10-12 15:38:52]: [Errno 2] No such file or directory: '/release/special/1e/.38.html.lmIk90'
[2012-10-12 15:38:52]: event: IN_CLOSE_WRITE
[2012-10-12 15:38:52]: [Errno 2] No such file or directory: '/release/special/1e/.38.html.lmIk90'
[2012-10-12 15:38:52]: event: IN_MOVED_TO
/release/special/1e/38.html
Acess Time:     2012-10-12 15:38:52 +0000
Change Time:    2012-10-12 15:38:52 +0000
Modify Time:    2012-10-12 15:42:51 +0000

At the very first sight of files like .38.html.lmIk90, we know for sure that it’s rsync’s tmp file, and we can confirm that it’s not the result of special processing, because result files of special process are moved directyl into /release/special, no tmp file should be created. The event sequence of special process look like this:

[2012-10-12 15:39:01]: event: IN_DELETE
[2012-10-12 15:39:01]: event: IN_CREATE
/release/special/1e/38.html
Acess Time:     2012-10-12 15:39:01 +0000
Change Time:    2012-10-12 15:39:01 +0000
Modify Time:    2012-10-12 15:39:01 +0000

[2012-10-12 15:39:01]: event: IN_CLOSE_WRITE
/release/special/1e/38.html
Acess Time:     2012-10-12 15:39:01 +0000
Change Time:    2012-10-12 15:39:01 +0000
Modify Time:    2012-10-12 15:39:01 +0000

From the stat time, we can see that sersync service process the file first(it is then published to online-nodes), then special process overrites the same file with its result. However, because the final file has an earlier mtime(guess why ^_^.), it can’t be published to online nodes. This explains our problem.

But how could this happen? we have excluded the special directory from sersync! it shouldn’t have gotten involved.

    <filter start="true">
    <exclude expression="(.*)\/special\/.*\.html$"></exclude>
    </filter>

In fact, we have changed files under /upload/special/ manually, and the monitor result tells us that they are correctly handled by special process. Meaning, correctly excluded by sersync.

So, Is it really a bug of sersync? Let’s check it out! Write a script to create/change a large number of files and see what will happen:

#!/usr/bin/perl
#########################################################
#trigger_sersync_bug.pl: try to create/update too many
# files to trigger possible sersync bug.
#########################################################

use File::Touch;
use File::Path;

use strict;
use warnings;

my $d = "/upload/special";

sub update_files{
	#for my $i ('aa' .. 'bb'){
	for my $i ('aa' .. 'zz'){
		for my $j ('aa' .. 'zz'){
			my $dir = "$d/$i/$j";
				print "create directory $dir\n";
				mkpath($dir);
				my $filename = int(rand(10000)) . ".html";
				print "create file: $filename\n";
				touch("$dir/$filename");
			
		}
	}
}

update_files();

With the above script, we are able to reproduce the problem! We managed to change a html file under /special and observed that it was processed by sersync(the exclude pattern failed!).
After repeated test, we find that with the following command sequence, sersync pattern would definitely fail:

mkdir /upload/special/tdir && touch /upload/special/tdir/t01.html

That is to say, A mkdir immediately followed by creating files under the new directory will cause sersync exclude pattern to fail! The funny part is, if given a little more delay after mkdir, eg:

mkdir /upload/special/tdir2 && sleep 1 && touch /upload/special/tdir2/t02.html

This doesn’t trigger the bug.

But I will just stop here now, because I don’t have time to read sersync code and fix the bug, maybe you can help to fix it, right?

Some tips from this story:

  • DO keep system time in sync between related system, or it will bite you sooner or later.
  • Some system programming knowledge and scripting skill may help you a lot in trouble shooting.
  • When you find a way to reproduce a problem, you are half done

Tested sersync version: sersync2.5

This entry was posted in Programming, System Administration and tagged , , , , . Bookmark the permalink.

Leave a Reply