Software & Application Miscellaneous: ZWJ&XML

  • jose / 116 / Sun, 31 Jan 2010 08:49:00 GMT / Comments (5)
  • Unicode Technical Report #20 (Unicode in XML and other Markup Languages) specifies that zero-width joiners and non-joiners (ZWJ and ZWNJ) are suitable for use within markup. But when an XML file whose tags are written in Malayalam using ZWJs is parsed (in Malayalam, ZWJ is used to form certain characters), an error is reported saying that the tag contains an invalid character. Can anyone tell me what is wrong?

  • http://software.itags.org/software-application/240735/
    1. Jose wrote:
      Unicode Technical Report #20 (Unicode in XML and other Markup
      Languages) specifies that
      Zero-width Joiners/ nonjoiners (ZWJ and ZWNJ) are suitable for use with
      in the markup. But when an xml file with the tags written in Malayalam
      using ZWJs (In Malayalam ZWJ is used to form certain characters) an
      error is reported that the tag contained an invalid character.
      Can anyone tell me what will be wrong?

      Can you include or link to a short example? Note that ZWJ is legal in
      the content of an XML document, but not in the names of elements.

      simonmontagu | Mon, 05 May 2008 14:59:00 GMT |
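
The distinction drawn in reply 1 (ZWJ in content versus ZWJ in an element name) can be tried out with a small sketch. It assumes Python's standard `xml.etree.ElementTree` parser; the element name `word`, the helper `parses`, and the sample string (Malayalam NA + VIRAMA + ZWJ, the older way of writing a chillu form) are illustrative choices, not taken from the original poster's file. Whether the second document is accepted depends on which XML name rules the underlying parser implements, so no result is claimed for it here.

```python
import xml.etree.ElementTree as ET

ZWJ = "\u200d"
# Illustrative sample: Malayalam NA + VIRAMA + ZWJ (assumed test string).
SAMPLE = "\u0d28\u0d4d" + ZWJ


def parses(document: str) -> bool:
    """Return True if ElementTree accepts the document as well-formed."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Case 1: ZWJ inside character data. It is ordinary content and is kept.
in_content = f"<word>{SAMPLE}</word>"
root = ET.fromstring(in_content)
print("ZWJ preserved in content:", ZWJ in root.text)

# Case 2: ZWJ inside the element name itself. The outcome depends on the
# name rules of the parser in use, so it is only reported, not assumed.
in_name = f"<{SAMPLE}>text</{SAMPLE}>"
print("ZWJ accepted in element name:", parses(in_name))
```

If the second line prints `False`, the parser is applying name rules that exclude ZWJ, which is the situation the original question describes.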

    2. Wed, 13 Sep 2006, Jose wrote:

      Unicode Technical Report #20 (Unicode in XML and other Markup
      Languages) specifies that
      Zero-width Joiners/ nonjoiners (ZWJ and ZWNJ) are suitable for use with
      in the markup.

      Yes, for affecting ligature and joining behavior. I mention this because
      there is a popular word processor that uses ZWJ and ZWNJ quite
      inappropriately for line break control.

      Of course, the statement is of general nature: those characters are in
      principle suitable for use in marked-up text. It does not guarantee or
      prescribe that a particular markup system allows them or that they will be
      interpreted by their Unicode semantics.

      But when an xml file with the tags written in Malayalam
      using ZWJs (In Malayalam ZWJ is used to form certain characters) an
      error is reported that the tag contained an invalid character.

      Reported by which program? I first suspected that you may have tried to
      enter these characters but they do not appear correctly in the declared or
      implied character encoding.

      But reading again, I notice that you are referring to _tags_ and might
      actually mean the use of characters in element or attribute names, as
      opposed to their use in content between tags. UTR #20 discusses the
      latter, i.e. what you can use in document content proper - together with
      markup, not _inside_ markup (tags).

      The use of characters in element and attribute names is governed by the
      rules of each markup language, basically by its _identifier_ syntax.
      Generally, and in XML 1.0, control characters are excluded from that
      syntax, and ZWJ and ZWNJ are format control characters by definition
      (General Category: Cf). Thus, an attempt to use them in element names
      would violate well-formedness constraints, and an XML parser would report
      an error - not about an invalid character per se but about a syntax error.

      In XML 1.1, ZWJ and ZWNJ are allowed in identifiers, but this is probably
      of little practical value.

      jukkak_korpela | Mon, 05 May 2008 15:00:00 GMT |
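
Two claims in reply 2 can be checked mechanically: that ZWJ and ZWNJ have General Category Cf, and that XML 1.1 admits them in names. The following is a minimal sketch, not a validator. The range tables are transcribed from the `NameStartChar` and `NameChar` productions of the XML 1.1 grammar as I understand them, and the function name `is_xml11_name` is hypothetical.

```python
import unicodedata

# Ranges transcribed from the XML 1.1 NameStartChar production (assumption:
# copied by hand from the grammar, so verify against the spec before reuse).
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional characters allowed after the first position (NameChar).
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_xml11_name(name: str) -> bool:
    """Check a string against the XML 1.1 Name production (sketch only)."""
    if not name or not _in_ranges(name[0], NAME_START_RANGES):
        return False
    return all(
        _in_ranges(ch, NAME_START_RANGES) or _in_ranges(ch, NAME_EXTRA_RANGES)
        for ch in name[1:]
    )


# Both joiners are format characters (Cf), not Cc control characters.
for ch in ("\u200c", "\u200d"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# A Malayalam name containing ZWJ passes the XML 1.1 name check.
print(is_xml11_name("\u0d28\u0d4d\u200d"))
```

The check covers only the lexical shape of a name; it says nothing about whether a given parser actually supports XML 1.1.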

    3. Jukka K. Korpela scripsit:

      In XML 1.1, ZWJ and ZWNJ are allowed in identifiers, but this is
      probably of little practical value.

      It has the merit that it allows identifiers to be spelled correctly
      in the various languages that *require* ZWJ or ZWNJ or both; Persian
      and several Indic languages come to mind.

      If you meant simply that XML 1.1 is not widely adopted, and it is
      therefore of little practical value to write documents in it, I
      must sadly agree.

      johncowan | Mon, 05 May 2008 15:01:00 GMT |

    4. As I recall, the problem with XML 1.1 adoption was that XML 1.1 was
      not fully backwards compatible with XML 1.0: there were XML 1.0
      documents that were not valid XML 1.1. That being the case, people
      just didn't see it worthwhile to have two incompatible parsers.

      As for ZWJ/NJ - the original intent was for these to not make any
      semantic difference. There is a UTC action to collect cases where they
      are being used to make a clear semantic difference (eg XXX means "sea
      gull" and XX<ZWNJ>X means "republican"), so any feedback on such cases
      would be useful.

      Mark

      markdavis | Mon, 05 May 2008 15:02:00 GMT |

    5. Mark Davis scripsit:

      As I recall, the problem with XML 1.1 adoption was that XML 1.1 was
      not fully backwards compatible with XML 1.0: there were XML 1.0
      documents that were not valid XML 1.1.

      In the sense that "XML 1.0" names a countably infinite set of abstract
      objects, true; in the sense that "XML 1.0" names a set
      of texts physically fixed in a tangible medium, I venture to doubt it.
      Specifically, I doubt that any Real World XML 1.0 documents contained
      any instances of U+007F through U+009F not as character references.

      In exactly the same sense, Unicode 2.0 was not backward compatible with
      Unicode 1.1, a fact which does not seem to have seriously impeded its
      adoption.

      The issues with XML 1.1 were in fact political; I say no more.

      As for ZWJ/NJ - the original intent was for these to not make any
      semantic difference. There is a UTC action to collect cases where they
      are being used to make a clear semantic difference (eg XXX means "sea
      gull" and XX<ZWNJ>X means "republican"), so any feedback on such cases
      would be useful.

      IIRC the leading case is the plural ending in Persian. It's not just
      a matter of a clear semantic difference: there is no semantic difference
      between "they're" and "theyre" in English, but the latter is unambiguously
      wrong in the standard orthography.

      johncowan | Mon, 05 May 2008 15:03:00 GMT |
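
The compatibility gap mentioned in reply 5 concerns characters U+007F through U+009F appearing literally rather than as character references. A rough way to test whether a given XML 1.0 text is affected is to scan it for those code points. The sketch below assumes the text has already been decoded to a Python string; the function name `raw_c1_controls` is hypothetical, and the exemption of U+0085 reflects my reading of the XML 1.1 `RestrictedChar` production rather than anything stated in the thread.

```python
def raw_c1_controls(text: str) -> list[tuple[int, str]]:
    """List literal U+007F..U+009F characters (except U+0085) with offsets.

    Character references such as "&#x82;" are plain ASCII in the source
    text, so they are not reported.
    """
    hits = []
    for offset, ch in enumerate(text):
        cp = ord(ch)
        if 0x7F <= cp <= 0x9F and cp != 0x85:
            hits.append((offset, f"U+{cp:04X}"))
    return hits


# A literal U+0082 is flagged; the same character as a reference is not.
print(raw_c1_controls("<a>caf\u0082</a>"))
print(raw_c1_controls("<a>caf&#x82;</a>"))
```

An empty result for a document suggests it would not be affected by this particular difference between the two versions.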