regex - Regular expression for invalid characters in XML
I am trying to figure out a way to find invalid characters in XML. According to the W3C recommendation, these are the valid characters in XML:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Converted to decimal, the valid XML characters are:
9 10 13 32-55295 57344-65533 65536-1114111
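As a quick sanity check of that hex-to-decimal conversion, here is a small Python sketch. The names XML_CHAR_RANGES and as_decimal are made up for illustration; the ranges themselves are the ones quoted above.

```python
# Valid XML 1.0 character ranges as (first, last) code points,
# copied from the list quoted in the question.
XML_CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def as_decimal(ranges):
    """Render each range in decimal, collapsing single-value ranges."""
    return [str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges]


print(" ".join(as_decimal(XML_CHAR_RANGES)))
# prints: 9 10 13 32-55295 57344-65533 65536-1114111
```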
I am trying to search for the invalid characters in Notepad++ using an appropriate regular expression.
A snippet of the XML:
<custom-attribute attribute-id="iscontendfeed">fal  se</custom-attribute> <custom-attribute attribute-id="pagenofollow">fal  se</custom-attribute> <custom-attribute attribute-id="pagenoindex">fal se</custom-attribute> <custom-attribute attribute-id="rrrecommendable">false</custom-attribute>
From the above example, I want a regular expression that finds the control characters embedded in the "fal se" values for me, because these are not allowed in XML.
I am not able to construct a regular expression for this.
These are the regular expressions I made for the numeric ranges:
32-55295 : (3[2-9]|[4-9][0-9]|[1-9][0-9]{2,3}|[1-4][0-9]{4}|5[0-4][0-9]{3}|55[01][0-9]{2}|552[0-8][0-9]|5529[0-5])
57344-65533 : (5734[4-9]|573[5-9][0-9]|57[4-9][0-9]{2}|5[89][0-9]{3}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-3])
65536-1114111 : (6(5(5(3[6-9]|[4-9][0-9])|[6-9][0-9]{2})|[6-9][0-9]{3})|[7-9][0-9]{4}|[1-9][0-9]{5}|1(0[0-9]{5}|1(0[0-9]{4}|1([0-3][0-9]{3}|4(0[0-9]{2}|1(0[0-9]|1[01])))))))
These regular expressions work when used separately, but I am not able to combine them into one complete regex.
Is there any way other than a regular expression to find the invalid characters? If not, please help me construct a regular expression that can find the invalid characters present in the XML.
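One non-regex route, sketched in Python: walk the text one character at a time and compare each code point against the ranges listed above. This assumes the file content has already been read into a string. The function names and the sample value with an embedded \x03 are hypothetical, chosen only to illustrate the idea.

```python
def is_valid_xml_char(ch: str) -> bool:
    """True if ch falls inside the XML 1.0 character ranges listed above."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_xml_chars(text: str):
    """Return (offset, code point) for every character not allowed in XML."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if not is_valid_xml_char(ch)]


# Hypothetical sample: an ETX control character (code 3) inside the value.
sample = '<custom-attribute attribute-id="pagenoindex">fal\x03se</custom-attribute>'
for offset, cp in find_invalid_xml_chars(sample):
    print(f"offset {offset}: U+{cp:04X}")
```

The offsets are positions in the Python string, not byte offsets in the file, so they only line up with an editor's column numbers for single-line input.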
First, the literal text you see displayed is allowed in XML. What is not allowed (if your list is correct) is the character with ASCII code 3. I hope I got that right.
Second, most regular expression flavors let you search for characters defined as \x00 (two hex digits) or \u0000 (four hex digits). Some flavors also allow \x{...}. This differs from flavor to flavor...
We can start with:
[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd]
[^...] defines a negated set of characters and character ranges (and more). Fill it with the allowed characters and ranges, and it matches everything else.
If your flavor understands \x{...}, it's easy to extend:
[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\x{10000}-\x{10ffff}]
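For comparison, here is how the same negated set might look in Python. This is only a sketch: Python's re module does not accept \x{...}, so the supplementary range is written with \UXXXXXXXX escapes instead. INVALID_XML_CHARS and invalid_positions are names invented for this example.

```python
import re

# Negated set of the allowed XML characters. Python's re module takes
# \xhh, \uhhhh and \Uhhhhhhhh escapes inside a character class.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def invalid_positions(text):
    """Return (offset, 'U+XXXX') for each character the pattern rejects."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in INVALID_XML_CHARS.finditer(text)
    ]


print(invalid_positions("fal\x03se"))
# prints: [(3, 'U+0003')]
```

Because Python strings are sequences of code points rather than UTF-16 code units, the single set is enough here and no surrogate handling is needed.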
Otherwise you have to search for surrogate pairs, character by character...
\x{10000}
is the same as \ud800\udc00
\x{10ffff}
is the same as \udbff\udfff
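The two equivalences above follow from the standard UTF-16 arithmetic: subtract 0x10000, put the top 10 bits on 0xD800 and the low 10 bits on 0xDC00. A minimal Python sketch (the function name to_surrogate_pair is made up for illustration):

```python
def to_surrogate_pair(code_point: int):
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


for cp in (0x10000, 0x10FFFF):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> \\u{high:04x}\\u{low:04x}")
# prints:
# U+10000 -> \ud800\udc00
# U+10FFFF -> \udbff\udfff
```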
That cannot be done in a single set. No fun ;) It's the negated version of:
[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udbff](?![\udc00-\udfff])|(?:[^\ud800-\udbff]|^)[\udc00-\udfff]
(from https://mathiasbynens.be/notes/javascript-unicode#matching-code-points)