regex - Regular expression for invalid characters in XML


I am trying to figure out a way to find invalid characters in XML. According to the W3C recommendation, these are the valid characters in XML:

#x9 | #xa | #xd | [#x20-#xd7ff] | [#xe000-#xfffd] | [#x10000-#x10ffff] 

Converted to decimal, the code points

9 10 13 32-55295 57344-65533 65536-1114111 

are the valid XML characters.
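The hex-to-decimal conversion above is easy to double-check mechanically. Here is a minimal Python sketch; the names `VALID_RANGES_HEX`, `to_decimal_ranges` and `format_ranges` are made up for illustration and are not part of any library:

```python
# Valid XML 1.0 character ranges, written in hex as in the recommendation.
VALID_RANGES_HEX = [
    ("9", "9"), ("A", "A"), ("D", "D"),
    ("20", "D7FF"), ("E000", "FFFD"), ("10000", "10FFFF"),
]

def to_decimal_ranges(ranges):
    """Convert (low, high) hex strings into (low, high) integers."""
    return [(int(lo, 16), int(hi, 16)) for lo, hi in ranges]

def format_ranges(ranges):
    """Render single values as 'n' and real ranges as 'low-high'."""
    return " ".join(str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges)

print(format_ranges(to_decimal_ranges(VALID_RANGES_HEX)))
# prints: 9 10 13 32-55295 57344-65533 65536-1114111
```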

I am trying to search in Notepad++ for the invalid characters using an appropriate regular expression.

A snippet of the XML:

        <custom-attribute attribute-id="iscontendfeed">fal &#11; se</custom-attribute>
        <custom-attribute attribute-id="pagenofollow">fal &#3; se</custom-attribute>
        <custom-attribute attribute-id="pagenoindex">fal &#13; se</custom-attribute>
        <custom-attribute attribute-id="rrrecommendable">false</custom-attribute>

From the above example, I want a regular expression that finds &#11; and &#3; for me, because these are not allowed in XML.

I am not able to construct a regular expression for this.

The regular expressions I made for the numeric ranges:

32-55295 : (3[2-9]|[4-9][0-9]|[1-9][0-9]{2,3}|[1-4][0-9]{4}|5[0-4][0-9]{3}|55[01][0-9]{2}|552[0-8][0-9]|5529[0-5])
57344-65533 : (5734[4-9]|573[5-9][0-9]|57[4-9][0-9]{2}|5[89][0-9]{3}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-3])
65536-1114111 : (6(5(5(3[6-9]|[4-9][0-9])|[6-9][0-9]{2})|[6-9][0-9]{3})|[7-9][0-9]{4}|[1-9][0-9]{5}|1(0[0-9]{5}|1(0[0-9]{4}|1([0-3][0-9]{3}|4(0[0-9]{2}|1(0[0-9]|1[01]))))))

These regular expressions work if used separately, but I am not able to combine them into one complete regex.

Is there any way other than a regular expression to find the invalid characters? If not, please help me construct a regular expression that can find the invalid characters present in the XML.

First, the literal text &#3; is allowed in XML; what is not allowed (if your list is correct) is the character with ASCII code 3. I hope I got that right.
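Since the snippet in the question contains character references rather than raw characters, it may be worth checking the references numerically instead of by pattern. As far as I know, XML 1.0 also requires a referenced character to come from the allowed set, so a parser would reject &#3; too. A minimal Python sketch of that check follows; `is_valid_xml_char` and `find_invalid_references` are hypothetical helper names, not library functions:

```python
import re

# Allowed code point ranges (the list from the question).
VALID_RANGES = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
                (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# Decimal (&#11;) and hexadecimal (&#xB;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def is_valid_xml_char(code_point):
    """True if the code point falls in one of the allowed ranges."""
    return any(lo <= code_point <= hi for lo, hi in VALID_RANGES)

def find_invalid_references(text):
    """Return (offset, reference) pairs for references to disallowed characters."""
    invalid = []
    for match in CHAR_REF.finditer(text):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_valid_xml_char(code_point):
            invalid.append((match.start(), match.group(0)))
    return invalid

sample = "fal &#11; se, fal &#3; se, fal &#13; se, fal &#xB; se"
for offset, reference in find_invalid_references(sample):
    print(offset, reference)
```

Comparing numbers avoids the long digit-range patterns entirely and also covers hexadecimal references and leading zeros.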

Second, many regular expression flavors let you search for characters defined as \x00 (two hex digits) or \u0000 (four hex digits). Some flavors also allow \x{...}; it differs from flavor to flavor...

We start with

[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd]

[^...] defines a negated set of characters and character ranges (and more). Fill it with the allowed characters and ranges.

If your flavor understands \x{}, it's easy to extend:

[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\x{10000}-\x{10ffff}] 
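For comparison, here is how the same negated set might look in Python. Python's `re` module does not understand `\x{...}`, so this sketch uses `\UXXXXXXXX` (eight hex digits) for the range above U+FFFF; `INVALID_XML_CHAR` and `find_invalid_chars` are illustrative names. Note that it looks for raw characters in already decoded text, not for references like &#3;.

```python
import re

# Negated class of everything the XML character list allows.
INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def find_invalid_chars(text):
    """Return (offset, 'U+XXXX') for every character outside the allowed set."""
    return [(m.start(), f"U+{ord(m.group()):04X}")
            for m in INVALID_XML_CHAR.finditer(text)]

sample = "fal\x0bse and fal\x03se, but \t tab, \r\n newline and \U0001F600 are fine"
print(find_invalid_chars(sample))
# prints: [(3, 'U+000B'), (14, 'U+0003')]
```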

Otherwise you have to search for the surrogate pairs, character by character...

\x{10000} is the same as \ud800\udc00

\x{10ffff} is the same as \udbff\udfff
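These two equivalences follow from the UTF-16 surrogate arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A small Python sketch of that calculation (the function name `to_surrogate_pair` is made up for illustration):

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

for cp in (0x10000, 0x10FFFF):
    high, low = to_surrogate_pair(cp)
    print(f"{cp:#x} -> {high:#x} {low:#x}")
# prints:
# 0x10000 -> 0xd800 0xdc00
# 0x10ffff -> 0xdbff 0xdfff
```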

That cannot be done in a single set. No fun ;) It's the negated version of

[\ud800-\udbff][\udc00-\udfff]|
[\ud800-\udbff](?![\udc00-\udfff])|
(?:[^\ud800-\udbff]|^)[\udc00-\udfff]

(from https://mathiasbynens.be/notes/javascript-unicode#matching-code-points)

