@XmlEncode: UTF-16 Character Entity Reference Encoding In Lotus Notes @Formula Language
08:44:31 PM

When you're dealing with XML, you may have to deal with character set encodings. Twice. First there's the character set encoding of the XML source data stream itself, and then there's the character set of the content. Yes, they can be different, and yes, there are rules in the standards. But when you're not in complete control of what characters might need to be put into your XML content (e.g., an RSS feed), when you're not in complete control of what tools users are going to use to parse your XML (and some tools don't quite get the standards right), and when you're not even in complete control of exactly how your XML stream is generated, there's really only one safe thing to do: use XML character references.

Those are the things that look like this: &#xC3A9;. I have frequently made the mistake of calling them "character entities", but in researching this article I found that the proper name is character entity references. My bad.

An important thing to know about XML character references is that they are hex string representations of Unicode character encodings in UTF-16 format. And an important roadblock that you may run into if you are a Lotus Notes and Domino programmer working with XML is that there is no easy way to generate UTF-16 in Notes @Formula language -- but there is a way!

Inspired partly by Andrew's post, and partly by the fact that just within the past week I've been wrestling with character encoding issues (not for XML, but with programming in Java, @Formula and LotusScript... and trying to make them all work together!), I've come up with a formula solution that can turn this:

ABC & Ḑו > GHI

into this:


Click here for the code. The code assumes that the input is a field called plainString, and the output is a field called xmlString.

Note: this is not completely tested. One obvious thing stands out about it: I should be turning spaces into &nbsp; instead of into &#0x0020;. But it's a start, and even if it's not quite right on the XML encoding I am pretty sure that the UTF-16 encoding is right, and that's worth something on its own. Domino can do UTF-8 with the @URLEncode() function, but for some odd reason it doesn't seem to be able to do UTF-16.

At the beginning of the formula I do the UTF-8 conversion and massage the result to take care of a few named entities. The formula then enters the outer @While to walk through the string, find the UTF-8 and convert it to UTF-16. The two inner @While() loops merely walk a hex portion (the first loop) or a plain text portion (the second loop) of the URL-encoded text. The "%" signs are stripped from the hex portions (because of course we have different conventions for encoding URLs and content... ain't standards just wonderful!?).

Once a plain or hex token is found, an @If branches either into simple concatenation for the plain portion, or into the big @Do to take care of the UTF-8 to UTF-16 conversion. The @While loop within the @Do looks at two, four, or six hex digits at a time -- in UTF-8, a Unicode value up to 0xFFFF is expressed as one, two or three bytes, and the high bits of the first byte tell the code what to do. (See the table here (the third column, in particular) to see what the pattern is.)
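For readers who don't have Notes handy, the walk described above can be sketched in Python. This is not the @Formula code, just an illustration of the same idea under a few assumptions: `urllib.parse.quote` stands in for @URLEncode, the function names are made up for this example, plain ASCII characters (including spaces) are passed through untouched, and the references use the `&#x...;` form. Like the formula, it only handles UTF-8 sequences of up to three bytes.

```python
from urllib.parse import quote

# The five entity names that XML predefines.
NAMED = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&apos;"}


def utf8_to_code_points(data):
    """Decode 1-, 2- and 3-byte UTF-8 sequences by inspecting the lead byte."""
    points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: one byte
            points.append(lead)
            i += 1
        elif lead >> 5 == 0b110:     # 110xxxxx 10xxxxxx: two bytes
            points.append(((lead & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif lead >> 4 == 0b1110:    # 1110xxxx 10xxxxxx 10xxxxxx: three bytes
            points.append(((lead & 0x0F) << 12)
                          | ((data[i + 1] & 0x3F) << 6)
                          | (data[i + 2] & 0x3F))
            i += 3
        else:
            raise ValueError("lead byte not supported by this sketch")
    return points


def xml_encode(plain_string):
    """Percent-encode as UTF-8, then walk the hex and plain portions."""
    url_encoded = quote(plain_string, safe="", encoding="utf-8")
    out = []
    i = 0
    while i < len(url_encoded):
        if url_encoded[i] == "%":
            # Hex portion: gather the run of %XX tokens, dropping the "%" signs.
            hex_digits = ""
            while i < len(url_encoded) and url_encoded[i] == "%":
                hex_digits += url_encoded[i + 1:i + 3]
                i += 3
            for cp in utf8_to_code_points(bytes.fromhex(hex_digits)):
                ch = chr(cp)
                if ch in NAMED:
                    out.append(NAMED[ch])
                elif cp < 0x80:
                    out.append(ch)          # ordinary ASCII, spaces included
                else:
                    out.append("&#x%04X;" % cp)
        else:
            # Plain portion: copy everything up to the next "%".
            start = i
            while i < len(url_encoded) and url_encoded[i] != "%":
                i += 1
            out.append(url_encoded[start:i])
    return "".join(out)
```

Calling `xml_encode(plainString)` on the sample input above yields named entities for the ampersand and the greater-than sign and hex references for the two non-ASCII characters.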

Comments

1. Kerr 11/03/2006 09:48:32 AM


You say that XML character entity references "are hex string representations of Unicode character encodings in UTF-16 format." My understanding is that XML character entity references are simply Unicode code points and can be expressed either as hex or decimal. For code points below 2^16 there is no practical difference between the UTF-16BE (Big Endian) encoding and the Unicode code point. (Little-endian UTF-16 will be different.) For code points of 2^16 and above you will get surrogate pairs that will be way off.

Your conversion to UTF-16 is big endian, so you don't have a problem for almost anything that you could come across; code points greater than 2^16 being pretty rare. I do, however, think it would be better to explain that the XML character references are Unicode code points and UTF-16 is a multi-byte encoding of Unicode that can have surrogate pairs.

For example UTF-16BE for Unicode x1D11E (119070) is D834DD1E.
My understanding is that the correct xml character entity reference for this is &#x1D11E; or &#119070; not &#xD834;&#xDD1E;
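The difference is easy to check outside of Notes. A small Python sketch (the helper names are just for illustration) compares the UTF-16BE bytes of a character with the reference built from its code point:

```python
def utf16be_hex(ch):
    """Hex digits of the character's UTF-16BE encoding."""
    return ch.encode("utf-16-be").hex().upper()


def char_ref(ch):
    """Hexadecimal character reference built from the code point."""
    return "&#x%X;" % ord(ch)


clef = "\U0001D11E"      # U+1D11E, the example above
e_acute = "\u00e9"       # U+00E9, inside the 16-bit range

for ch in (e_acute, clef):
    print(utf16be_hex(ch), char_ref(ch), ord(ch))
```

For the 16-bit character the two hex strings carry the same value; for U+1D11E the UTF-16BE bytes are the surrogate pair while the reference uses the code point itself.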

You probably know this, but it's worth repeating for your readers. The only five named entity references defined in xml are
&amp; the ampersand (&)
&lt; the "less than" symbol (<)
&gt; the "greater than" symbol (>)
&quot; the double quote (")
&apos; the apostrophe / single quote (')

These are also the reserved characters that can't be used in all places. In general you can't substitute any other named entity reference, like &nbsp; unless it is declared in a DTD for the XML document. HTML and XHTML of course do define a huge selection of named entities.
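As a quick illustration in Python (a sketch, with helper names invented for this example), the standard library's `xml.sax.saxutils.escape` handles the ampersand and angle brackets by default and takes a dictionary for the two quote characters. The second helper shows that a parser with no DTD rejects &nbsp; but accepts the numeric reference for the same character:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_reserved(text):
    """Escape the five reserved characters with their predefined names."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


def parses(xml_text):
    """True if the text is accepted by a non-validating parser with no DTD."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(escape_reserved('a < b & "c"'))
print(parses("<a>&nbsp;</a>"))    # named entity that XML does not predefine
print(parses("<a>&#xA0;</a>"))    # numeric reference for U+00A0
```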

In my experience as long as you use utf-8 as the charset for the xml document, you can put anything in it and not have a problem. Since all compliant xml processors must handle utf-8 and utf-16, you should never have a problem. Problems only occur when the xml is serialised in one charset and read in another. This is usually a simple configuration issue and I can see no good reason to not use utf-8 unless you predominantly use CJK code points, in which case use UTF-16.
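A tiny Python sketch of that failure mode, using a made-up `roundtrip` helper: the same text is written in one charset and read back in another.

```python
def roundtrip(text, write_charset, read_charset):
    """Serialise text in one charset and read the bytes back in another."""
    return text.encode(write_charset).decode(read_charset)


text = "caf\u00e9"
print(roundtrip(text, "utf-8", "utf-8"))     # charsets agree
print(roundtrip(text, "utf-8", "latin-1"))   # charsets disagree: garbled
```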

2. Richard Schwartz 11/03/2006 10:39:05 AM

Kerr -- thanks for the clarifications. Brain freeze on the nbsp thing. Switching back and forth from HTML to XML chills the circuits, I guess. Really there's no reason for my code to do anything at all with space chars. They're perfectly legal as is. They're only converted as an artifact of leveraging the @UrlEncode, and I should fix that by converting them back.

Obviously you're right that decimal notation is legal instead of hex in the character entity references. A loop that just uses Notes' @Ascii function to get the decimal values might have been a much easier way, presuming that (despite the name) @Ascii actually does deal with character values greater than 0xFF -- which is entirely possible and I should check to see.

As for Unicode code points above 0xFFFF, to be honest I wasn't even thinking of them -- but I thought I read somewhere that XML character entity references must be expressed specifically in UTF-16, so I didn't think that code points above the 16 bit barrier were actually legal in XML. I could definitely be wrong about that. So many standards, so little time. For a general purpose UTF-8 to Unicode solution that isn't tied to UTF-16, one needs to deal with UTF-8 sequences that go beyond 3 bytes, which my code won't do.

As for using UTF-8 for encoding of the XML document as the safest way, I definitely agree. Sometimes, however, you may find that you're not in full control of XML generation in the context in which your @Formula is going to run. Anyhow, the core of this function, the UTF-16 conversion, is the real reason I wrote it. The XML character entity reference syntax generation was just an excuse to post it.

3. Kerr 11/03/2006 11:49:03 AM

Clearly you've got something that could be useful in some circumstances, so kudos there. To that end you might want to look at the output you get from using "UTF-16BE" as the charset for @URLEncode. It's pretty funky, but not in a good way. Still, it might help you out.

The XML spec says:
If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ; provide a decimal representation of the character's code point.

I'll leave the differences between Unicode and ISO/IEC 10646 as an exercise for the reader, but basically they have synchronised code points, so for the purposes of this discussion they are the same.

Clearly it would be nice if there was some method that would give you the code point for a given character. To play devil's advocate some more, the problem with an @XMLEncode function is: what do you encode? You obviously want to encode the reserved chars, but what about the rest? All characters have an equivalent entity reference and clearly you don't want to have your whole document encoded as entity references. So what characters do you need to encode? Any character that is outside a given charset?

I take your point that you might not be able to change the charset that is being used to encode the xml, and that is clearly a reason to do this. In that situation the @XMLEncode method could very well take the encoding charset as an argument and escape any characters that were out of band.
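Python happens to have this behaviour built in, which makes for a compact sketch of the idea. The function name and its default charset are assumptions for illustration; the `xmlcharrefreplace` error handler is part of the standard codec machinery and emits decimal references for anything the target charset can't represent.

```python
from xml.sax.saxutils import escape


def xml_encode_for_charset(text, charset="ascii"):
    """Escape reserved chars, then replace out-of-band chars with references."""
    escaped = escape(text)                      # handles &, < and >
    data = escaped.encode(charset, errors="xmlcharrefreplace")
    return data.decode(charset)


sample = "caf\u00e9 & \u1e10"
print(xml_encode_for_charset(sample, "ascii"))     # both non-ASCII chars escaped
print(xml_encode_for_charset(sample, "latin-1"))   # only U+1E10 is out of band
```

Characters that fit the target charset are left alone, so the document does not end up encoded entirely as references.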

For charsets in xml the final chapter of the O'Reilly XML in a Nutshell book is a good read.

4. Kerr 11/03/2006 11:57:50 AM

Oh, and @Ascii's not going to help you. It just replaces any out of band chars with '?' chars.
You're after something that does the opposite of @Char, but I can't think of anything that does that.

5. Richard Schwartz 11/03/2006 02:28:32 PM

More great info, Kerr... Thanks! I never would have thought of adding the "BE" to the end of the arg for @UrlEncode. Gave up after just trying "UTF-16". I'll have to take a look at the output from "UTF-16BE".

An @Unicode function that returned the code point would be useful. I may put in a request for it. We have Uni() in LotusScript, but that's different because LotusScript is a Unicode environment. An @Unicode function would have to do the LMBCS to Unicode mapping. Not that they don't have ways to do that... of course they do... but wiring it into the formula engine may not be as easy as it appears to us on the outside.

6. Andrew Pollack 11/04/2006 03:10:31 PM

After all these years, the few really key but obvious missing things in formula language remain in the realm of string & character set manipulation, and number / text conversion. Why there isn't an @Hex(#) and a string representation of hex 0x##, for example, is beyond me. @Hex, @Octal, etc., or else an @String( Base ; #Value ) would make huge sense. @TextToNumber should support &H or 0x, as well as &B, etc.

7. Alan Bell 11/06/2006 07:57:12 AM

The other thing about &nbsp; is that it is not the same thing as &#0x0020;. "nb" is "non-breaking", so it is semantically different from a space, which can be word-wrapped. U+00A0 is the Unicode code point for a non-breaking space.

8. Kerr 11/06/2006 11:53:04 AM

There are also the &ensp;(#x2002), &emsp;(#x2003) and &thinsp;(#x2009) entity refs in XHTML if you really need them.

All opinions expressed here are my own, and do not represent positions of my employer.