Purifying HTML In RSS Feeds
09:54:26 AM

In preparation for Lotusphere, I've been working on a revision of my Domino-based bloggregator RSS reader code, with the objective of putting a decent web UI onto it. The latest beta is here. (Comments welcome, but please note: it's a work in progress, and I'm doing things like changing the functionality, tweaking the CSS and blowing away all the data without warning, so if you try it and it's behaving very badly, that's the likely explanation.)

Spending time on the web UI again, however, has raised an issue that I knew existed in the previous version, but had mostly ignored. The web UI was so bad that I hardly ever used it, but now it's starting to look decent and I want to make it work well. The problem is that a lot of RSS feeds send garbled HTML in their article abstracts: e.g., unbalanced "div" or "span" tags (one closing tag too many, and suddenly one of my block-level elements is prematurely closed!). The flip side is that some other RSS feeds send things like JavaScript and style sheets that can seriously mess up my presentation precisely when they're not garbled! A related problem is that if I try to save some bandwidth by truncating the stream that comes from feeds that send full articles, perfectly good HTML can become garbled HTML, with unclosed tags, unclosed comments, unclosed quotes within tags, etc. If that happens with an "img", an "a href", a "div", a "span", or in various other places, it can have really nasty effects on the rendering of a page.
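
To make the truncation problem concrete, here is a minimal Python sketch. It is not the Domino code: the function names, the 30-character limit, and the sample markup are all made up for illustration.

```python
def truncate(html: str, limit: int) -> str:
    """Naive bandwidth saver: keep only the first `limit` characters."""
    return html[:limit]


def ends_inside_tag(fragment: str) -> bool:
    """True if the last '<' has no '>' after it, i.e. the cut fell inside a tag."""
    return fragment.rfind("<") > fragment.rfind(">")


# Hypothetical, perfectly valid article body.
article = '<p>See <a href="http://example.com/post">the post</a> for details.</p>'
abstract = truncate(article, 30)

print(abstract)
print(ends_inside_tag(article), ends_inside_tag(abstract))
```

Here the cut lands in the middle of the "href" value, so the abstract ends with an open quote inside an open tag, and the "p" element is never closed either. Anything the page renders after that abstract is at risk of being swallowed by it.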

The solution is to "purify" the HTML, but how to do it? Right now I'm just doing an @ReplaceSubstring that finds certain opening and closing tags and inserts an underscore after the opening bracket. It deliberately mangles the HTML so that it is no longer HTML... but that approach is the sledgehammer: it destroys some potentially useful markup as well as eliminating harmful markup. It keeps my page's presentation intact, but it is distinctly unsatisfying to see the deliberately mangled HTML tags where the abstracts should be. Stripping all tags would probably be better, but that's also unsatisfying. An aggregator could do better than that. A more intelligent approach, however, is going to be a lot of work. Here's a list of things that I think a complete approach would need to do:

  • Detect and remove any tags that are closed without having been opened.
  • Detect and remove any script or style blocks.
  • Detect and fix any unclosed quotes for tag attributes.
  • Detect and fix any truncated tags (missing the closing angle bracket).
  • Detect and fix any unclosed block-level tags, "img" tags, or "a href" tags.
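
Most of the list above can be sketched with Python's standard html.parser module. This is an illustration under stated assumptions, not the bloggregator's code: Purifier, purify, VOID_TAGS and DROP_BLOCKS are hypothetical names, a truncated tag is simply cut off rather than repaired, and an unclosed quote is only handled when it sits in a tag truncated at the very end of the input.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never need a close tag
DROP_BLOCKS = {"script", "style"}                         # removed with their content


class Purifier(HTMLParser):
    """Re-emits a fragment while tracking which elements are still open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of cleaned output
        self.open_tags = []   # stack of elements opened but not yet closed
        self.dropping = None  # name of the script/style block being skipped

    def handle_starttag(self, tag, attrs):
        if self.dropping:
            return
        if tag in DROP_BLOCKS:
            self.dropping = tag
            return
        rendered = "".join(
            f' {name}="{escape(value or "", quote=True)}"' for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.dropping:
            if tag == self.dropping:
                self.dropping = None
            return
        if tag not in self.open_tags:
            return  # closed without having been opened: drop it
        # Close anything still open inside the element, then the element itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.dropping:
            self.out.append(escape(data, quote=False))


def purify(fragment: str) -> str:
    # A trailing '<' with no '>' after it is a truncated tag: cut it off.
    last_lt = fragment.rfind("<")
    if last_lt != -1 and ">" not in fragment[last_lt:]:
        fragment = fragment[:last_lt]
    parser = Purifier()
    parser.feed(fragment)
    parser.close()
    # Close whatever the feed left open, innermost first.
    parser.out.extend(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out)


print(purify('<div><b>bold</b> see <a href="http://exa'))
print(purify("<p>Hi<script>alert(1)</script> there"))
```

Note that this sketch passes attributes through untouched, so it does nothing about things like inline event handlers; filtering those would be a separate step.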

I actually think this could be done in column formulas now that we have the power of the ND6 formula language, but I doubt that it's a good idea to do it. I'm just guessing, but looping character-by-character through fields in a view that might have thousands of documents is probably a Bad Thing. The script that retrieves the RSS feed in the first place is probably the much better place to do it.

Comments

1. Stephan H. Wissel 01/20/2005 12:39:05 PM

Hi Richard,
you might want to consider a staggered approach. The first scrub to apply to "dirty" HTML is jTidy.
You can use either a Java agent or LS2J. The parameters allow you to receive clean xHTML with all problems fixed. After that you could apply an XSLT transformation (since xHTML is XML) that filters out the elements that you don't want (this way it is configurable). The question here is whether to use a positive or a negative list.
Like the idea? Let me know!

2. Richard Schwartz 01/20/2005 01:17:26 PM

Great idea, Stephen! I'll probably try this after 'sphere. I figured there had to be something like that out there, but I didn't find it when I searched. LS2J in my feed retrieval agent, or a Java WebQueryOpen agent, should be possible. I don't think I'll do the XSLT though. It makes my brain hurt almost as much as Perl does.


3. Stephan H. Wissel 01/21/2005 07:48:23 AM

Hi Richerd,
(I'm ready to swap the "e" for an "a" if you do the same *vbg*).
Glad you like the idea. And for the brain hurting: I could provide the sample XSLT sheets. I find XSLT cheaper than illegal drugs, but the mind-blowing effect is the same. Drop me an email when you are ready to go. And have fun on the Sphere.

4. Stephan H. Wissel 01/21/2005 10:27:38 AM

Your quest might be shorter than you think. Once you have figured out how to clean the HTML into valid xHTML using jTidy, head over to and retrieve the XSLT template that does the filtering for you:
Specially designed for you.



All opinions expressed here are my own, and do not represent positions of my employer.