Wed, 13 Sep 2006, Jose wrote:
Unicode Technical Report #20 (Unicode in XML and other Markup
Languages) specifies that
Zero-width joiners/non-joiners (ZWJ and ZWNJ) are suitable for use within
the markup.
Yes, for affecting ligature and joining behavior. I mention this because
there is a popular word processor that uses ZWJ and ZWNJ quite
inappropriately for line break control.
Of course, the statement is of general nature: those characters are in
principle suitable for use in marked-up text. It does not guarantee or
prescribe that a particular markup system allows them or that they will be
interpreted by their Unicode semantics.
But when an XML file with the tags written in Malayalam
using ZWJs (in Malayalam, ZWJ is used to form certain characters) is
processed, an error is reported that the tag contained an invalid character.
Reported by which program? I first suspected that you may have tried to
enter these characters but they do not appear correctly in the declared or
implied character encoding.
But reading again, I notice that you are referring to _tags_ and might
actually mean the use of characters in element or attribute names, as
opposed to their use in content between tags. UTR #20 discusses the
latter, i.e. what you can use in document content proper - together with
markup, not _inside_ markup (tags).
The use of characters in element and attribute names is governed by the
rules of each markup language, basically by its _identifier_ syntax.
Generally, and in XML 1.0, control characters are excluded from that syntax,
and ZWJ and ZWNJ are format control characters (General Category:
Cf). Thus, an attempt to use them in element names would violate
well-formedness constraints, and an XML parser would report an error - not
about an invalid character per se but about a syntax error.
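
The General Category mentioned above is easy to check for yourself. Here is
a minimal sketch using Python's standard unicodedata module (the tool and
the helper name describe() are my own choices for illustration, not
anything taken from UTR #20 or the XML specification):

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

def describe(ch):
    """Return (code point, Unicode name, General Category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Print the properties of both joining controls.
for ch in (ZWNJ, ZWJ):
    print(describe(ch))
```

Both characters should be reported with category Cf, i.e. as format
characters rather than letters, digits or combining marks.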
In XML 1.1, ZWJ and ZWNJ are allowed in identifiers, but this is probably
of little practical value.
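
To illustrate the XML 1.1 rule, here is a rough sketch of a name check. The
code point ranges are my transcription of the NameStartChar and NameChar
productions in the XML 1.1 Recommendation, and the function name
is_xml11_name() is hypothetical. It only tests the Name syntax; it is not
a parser and says nothing about what a particular program accepts.

```python
# Code point ranges transcribed from the XML 1.1 NameStartChar production.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Extra ranges that NameChar allows after the first character.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]

def _in_ranges(cp, ranges):
    """True if code point cp falls inside any (low, high) pair."""
    return any(low <= cp <= high for low, high in ranges)

def is_xml11_name(name):
    """Check a string against the XML 1.1 Name production (sketch only)."""
    if not name:
        return False
    if not _in_ranges(ord(name[0]), NAME_START_RANGES):
        return False
    allowed = NAME_START_RANGES + NAME_EXTRA_RANGES
    return all(_in_ranges(ord(ch), allowed) for ch in name[1:])

# A Malayalam sequence ending in virama + ZWJ (illustrative input only).
malayalam = "\u0d05\u0d35\u0d28\u0d4d\u200d"
print(is_xml11_name(malayalam))
print(is_xml11_name("1abc"))  # a digit may not start a name
```

Note that U+200C and U+200D sit in the NameStartChar list itself, so under
these productions they are permitted anywhere in a name.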