SNTT: An Agent To Diagnose A Nasty Problem
10:02:50 PM

It's been ages since I've posted anything for Show 'n Tell Thursday.

Yeah, I know... It's Tuesday. So sue me.

I think that in all the years I've been doing Lotus Notes and Domino, there has been one type of really nasty problem that has been the most difficult to diagnose. Even experienced experts occasionally slip up and cause the problem, and even teams of experts looking at the symptoms can easily be totally stumped. I recently solved the fourth recurrence of a problem that had been baffling members of my team for weeks, and this is no slouch team -- it has more than 70 years of collective Notes/Domino experience. The root cause, which I finally did figure out, is exactly the type of problem that I'm talking about.

This type of problem can come and go without apparent explanation. That's what happened to us several times over the past few weeks. In one of the cases, faced with results that defied explanation for days, I actually came up with what appeared to be an explanation. It was a far-fetched explanation, but it at least had some grounding in experiences that I've had in Notes/Domino, so I was at least a little optimistic -- but when we tried deliberately to force conditions that would prove my explanation, we just couldn't make the problem occur! And besides, by the time we got to the point of testing my theory, the problem that we had been working on had gone away all by itself. Or so it seemed, until it came back again.

So, what was the problem? Before I tell you that, let me tell you what the symptom was.

We have a complex application, and a key part of several of the agents in the application is code that does NotesDatabase.getDocumentByUnid() calls. It works like a charm, of course. This code has been shipping for a long time. We've been in the middle, however, of testing new code for a new version, and we've set up some challenging environments for those tests. When iterating from one build to the next, sometimes our code just stopped working in one of the test environments. The NotesDatabase.getDocumentByUnid() calls started returning "Invalid universal id". There were no ReaderNames fields. ACLs were correct. Groups in the Domino Directory were correct. Permissions in the Server Document were correct. Agent runtime settings were correct. Everything was double- and triple-checked by multiple sets of highly experienced eyes. Debugging code in the application was activated and bumped up to its highest level, and everything was working -- except the getDocumentByUnid() call. A document was opened in the database containing the agent. A unid was read from a NotesItem in the document. The database containing the document that the unid pointed to was opened. But the getDocumentByUnid() call failed.

Much hair was pulled out. Tests were run. Databases were compacted and fixed up. The test environment was rebuilt, and everything started to work, but we're shipping a product and failures that have no explanation are not easily tolerated, even if we've worked around them. A Lotus support incident was opened. Yes... we were desperate for an explanation.

The diagnosis finally hit me when I was reviewing the results of some tests that one of our team members had done at the request of Lotus Support. These tests had worked while our code was failing. There were several differences between the tests and our production code. None of those differences seemed important, but I knew there had to be some difference that was significant, and I finally hit on it. When our code executed, the database was open, but it was the wrong database. The test code was using NotesSession.getDatabase(), but our code uses NotesDatabase.openByReplicaID(). We had more than one replica of the same database on the server, and the document with the unid we were looking for was in one replica, but NotesDatabase.openByReplicaID() was opening the other replica.

What makes this even more frustrating is that it isn't the first time I've been bitten by this problem. I've run into problems caused by multiple replicas of one database on a server several times over the years. Someone makes a "backup copy" of a database but keeps it inside the Domino\Data tree, and there you go. Behavior from that point on is unpredictable in several ways. The bad behavior that is most often seen is in replication with other servers, which can be baffling in its inconsistently freaky results; but code that accesses databases by replica ID is similarly freaky, as our situation demonstrated. You just never know which replica of the database your code is going to hit. One agent runs and stores replica IDs and unids in documents, and another agent runs later and finds the database by replica ID but doesn't find the unid.
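To make the failure mode concrete, here is a toy model in plain Java. None of it is the Domino API: ToyServer, ToyDatabase, the paths, and the "UNID-1" value are all hypothetical stand-ins. It assumes the "backup" copy was made before the document in question was created, and that a lookup by replica ID simply returns whichever matching file it reaches first, which stands in for the unpredictability described above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model only: these classes are hypothetical and are NOT the Domino API.
final class DuplicateReplicaToy {

    /** A database file: a path, a replica ID, and the unids of its documents. */
    static final class ToyDatabase {
        final String path;
        final String replicaId;
        final Set<String> unids;

        ToyDatabase(String path, String replicaId, String... unids) {
            this.path = path;
            this.replicaId = replicaId;
            this.unids = new HashSet<>(Arrays.asList(unids));
        }

        boolean contains(String unid) {
            return unids.contains(unid);
        }
    }

    /** A server that holds database files, keyed by path. */
    static final class ToyServer {
        private final Map<String, ToyDatabase> byPath = new LinkedHashMap<>();

        void add(ToyDatabase db) {
            byPath.put(db.path, db);
        }

        /** A path names exactly one file, so this lookup is unambiguous. */
        ToyDatabase openByPath(String path) {
            return byPath.get(path);
        }

        /**
         * Returns the first file with a matching replica ID. With duplicate
         * replicas on one server, "first" is arbitrary from the caller's
         * point of view (an assumption of this toy model).
         */
        ToyDatabase openByReplicaId(String replicaId) {
            for (ToyDatabase db : byPath.values()) {
                if (db.replicaId.equals(replicaId)) {
                    return db;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        ToyServer server = new ToyServer();
        // Stale "backup copy": same replica ID, but it predates the document.
        server.add(new ToyDatabase("backup\\test.nsf", "852565FD0075793B"));
        // The live database, which does contain the document.
        server.add(new ToyDatabase("projects\\test.nsf", "852565FD0075793B", "UNID-1"));

        ToyDatabase byId = server.openByReplicaId("852565FD0075793B");
        ToyDatabase byPath = server.openByPath("projects\\test.nsf");

        System.out.println("opened by replica ID: " + byId.path
                + ", has document: " + byId.contains("UNID-1"));
        System.out.println("opened by path: " + byPath.path
                + ", has document: " + byPath.contains("UNID-1"));
    }
}
```

In this model the lookup by replica ID lands on the stale copy and reports that the document is missing, while the lookup by path finds it. That mirrors the difference between our failing code and the test code that worked.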

Duplicate replicas are rare enough that you just don't think of them as an explanation automatically, but common enough that I suspect most 10+-year veterans of Notes/Domino have probably run into them at least once. And as evidenced by the man-hours that my team burned on it, the problem is maddeningly difficult to figure out when it happens.

So, what's to be done about this? No amount of reminding yourself "don't do it" is necessarily going to protect you from someone copying a directory full of databases to "make a backup" of the last release, or some such thing. Even very experienced developers and admins can slip up. Procedures that seem to be benign, and have not caused problems in the past, suddenly lead to utter chaos. My team's experience proves this. You can run the catalog task and look for duplicates by browsing through catalog.nsf -- but that's far from ideal. I've put in a request to Lotus to consider having the server generate warnings, but as of now Domino never warns you about this. A very simple agent, however, can detect duplicate replicas, wherever they are on your server. So that's what's to be done. That's what I've done, anyhow.

Here's the simple agent. It's just a few dozen lines of Java code. I considered naming it "Weirdness Detector", but settled on "Scan Replicas" instead. Download it, sign it, set the ACL and replicate it to your server, then run it via tell amgr run "scanreplicas.nsf" 'scan replicas'. It will print the filenames and replica IDs of any duplicates to the console and to log.nsf, like this:

10/02/2007 10:05:10 PM Agent Manager: Agent printing: Duplicate Replica ID:
projects\webmail\test.nsf and projects\iris\webmail\test.nsf | 852565FD0075793B

As written, this agent won't solve the problem for you, but it can help you diagnose it -- if you know that weirdness is happening and you think to run it.
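For anyone who wants to see the idea without downloading anything, here is a minimal sketch of the core check in plain Java. It is not the code of the downloadable agent. It assumes you have already collected a (file path, replica ID) pair for every database on the server, for example by walking the server with lotus.domino.DbDirectory and reading each Database's file path and replica ID. The class name, method names, and sample paths below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: plain Java, no Domino classes. All names here are hypothetical.
final class ReplicaScanSketch {

    /**
     * Groups file paths by replica ID and keeps only the replica IDs that
     * occur more than once. Each input row is {filePath, replicaId}.
     */
    static Map<String, List<String>> findDuplicates(List<String[]> databases) {
        Map<String, List<String>> byReplicaId = new LinkedHashMap<>();
        for (String[] db : databases) {
            String path = db[0];
            String replicaId = db[1];
            byReplicaId.computeIfAbsent(replicaId, k -> new ArrayList<>()).add(path);
        }
        Map<String, List<String>> duplicates = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : byReplicaId.entrySet()) {
            if (e.getValue().size() > 1) {
                duplicates.put(e.getKey(), e.getValue());
            }
        }
        return duplicates;
    }

    /** One report line per duplicated replica ID: "path and path | replicaId". */
    static List<String> formatReport(Map<String, List<String>> duplicates) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : duplicates.entrySet()) {
            lines.add(String.join(" and ", e.getValue()) + " | " + e.getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample data; a real scan would read these pairs from the server.
        List<String[]> databases = Arrays.asList(
                new String[] {"names.nsf", "85256A0000112233"},
                new String[] {"projects\\webmail\\test.nsf", "852565FD0075793B"},
                new String[] {"projects\\iris\\webmail\\test.nsf", "852565FD0075793B"});

        for (String line : formatReport(findDuplicates(databases))) {
            System.out.println("Duplicate Replica ID:");
            System.out.println(line);
        }
    }
}
```

Grouping by replica ID in a single pass keeps the scan cheap even on a server with many databases, which matters if you intend to run it on a schedule.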

Detecting the problem before weirdness starts... that's the next step. Since duplicate replicas can be created any time by anyone with permissions or access to the server filesystem, it might be a good idea to modify the agent code to send out an email alert whenever it detects duplicates and schedule it to run hourly. That, however, is left as an exercise for the reader. At least for now.

Comments

1. Devin Olson 10/03/2007 10:11:43 AM

Been there, done that, got the hat, and the shirt. And forgotten about it completely.

Thanks for reminding us all about this very sneaky problem.

Thanks also for the agent. It will help quite a bit.


2. Carol Anne 10/03/2007 11:58:01 AM

Could you share what minimum version of N/D one needs to run the NSF? Thanks.

3. Richard Schwartz 10/03/2007 02:28:11 PM

I've only tested it on 7.x, but I can't think of anything that I've used that would be incompatible with 6, or maybe even 5.


4. Ulrick Krause 10/04/2007 01:07:13 AM

Thanks for this. Great tip!

5. Kevin Pettitt 10/11/2007 11:50:04 AM

The OpenNTF project "DomainPatrol" has a reporting feature that will flag duplicate replicas. You'll have to run the scanner at least once to populate the database (basically with information from catalog.nsf), and run the report manually.

Haven't looked that closely, but I suspect it wouldn't take much to set up a scheduled agent based on this code to notify whenever a dupe is detected. I'm also wondering if there isn't a way to use event handlers somehow, perhaps as a way to trigger a scanning agent whenever a new database is added to the server. Of course, there may not be an "event" to handle if the database is added at the OS level. Some sort of hourly agent that isn't catalog.nsf-dependent might be the best approach, assuming you could keep any performance hit to a negligible level.



All opinions expressed here are my own, and do not represent positions of my employer.