This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.
Section 7.6.1.1 Flags describes a flag "i" which places F&O pattern searches into case-insensitive mode. Patterns are mostly described in XML Schema, and include a capability to express a range in a bracket expression. Since XML Schema does not have case-insensitive matches, it does not define how a case-insensitive range works. This needs to be specified here. I can think of at least three possible definitions.
1. The first algorithm is to form the case-insensitive range of the first and second operands, then add in anything that is a case-insensitive version of something in this range.
2. In the second algorithm, let f be the lowercase version of the first operand, F be the uppercase equivalent of f, s be the lowercase version of the second operand, and S the uppercase equivalent of s. Let m be the minimum of f and F, and let M be the maximum of s and S. The range is m-M, everything between m and M inclusive.
3. The third algorithm is the union of the case-sensitive range f-s with the case-sensitive range F-S.
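The three candidate definitions can be compared with a small sketch. This is only an illustration under stated assumptions: Python's str.lower() and str.upper() stand in for the case mappings, ranges are taken by code point, algorithm 1 is read as "the range as written plus the case variants of its members", and the function names are hypothetical.

```python
def code_range(lo, hi):
    """Set of characters from lo to hi inclusive, ordered by code point."""
    return {chr(c) for c in range(ord(lo), ord(hi) + 1)}


def with_case_variants(chars):
    """Add the single-character lower/upper mapping of every member."""
    out = set(chars)
    for ch in chars:
        for v in (ch.lower(), ch.upper()):
            if len(v) == 1:
                out.add(v)
    return out


def algorithm1(first, second):
    # Range as written, then add every case variant of its members.
    return with_case_variants(code_range(first, second))


def algorithm2(first, second):
    # One contiguous range from min(f, F) to max(s, S).
    f, s = first.lower(), second.lower()
    F, S = f.upper(), s.upper()
    return code_range(min(f, F), max(s, S))


def algorithm3(first, second):
    # Union of the lower-case range f-s and the upper-case range F-S.
    f, s = first.lower(), second.lower()
    F, S = f.upper(), s.upper()
    return code_range(f, s) | code_range(F, S)


if __name__ == "__main__":
    for name, fn in (("1", algorithm1), ("2", algorithm2), ("3", algorithm3)):
        print(name, "".join(sorted(fn("a", "h"))))
```

For [a-h] the first and third definitions agree, while the second also picks up everything between "H" and "a" in code point order, such as "Z" and "[". For an unusual range such as [Z-a] the first and third definitions differ as well.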
Excellent point. I think the rule that works best is to expand the range, e.g. [a-h] becomes [abcdefgh], and then match this with the "i" flag, applying the existing rule in the spec "a character in the input string matches a character specified by the pattern if there is a default case mapping between the two characters as defined in section 3.13 of [The Unicode Standard]." (Is this the same as your first suggestion?) As far as I can tell by experiment, this seems to be the way it works in Java (which is modelled on Perl).
I'm having a bit more trouble divining the semantics for subtractions and negative groups. At present in Saxon
matches('G','[A-Z-[f-h]]','i') and
matches('G','[A-Z-[F-H]]','i')
both return true, which is a little surprising, while
matches('G','[A-Z-[F-Hf-h]]','i')
returns false. And
matches('G','[^G]','i') = false while
matches('G','[^F-H]','i') = true
I need to do a bit more investigation to see whether it's Java that's behaving this way, or whether it's a consequence of the way I translate XPath regex to Java regex syntax (I use James Clark's code for this, modified to handle the XPath extensions to Schema regex syntax). If anyone can do some experiments with Perl, that would be useful...
Michael Kay
A cheap temporary fix to some of these problems would be to say that it's an error to use the "i" flag with a regex that contains a negative character group, a character class subtraction, or a complemented category escape (such as "\P{Lu}"). That would keep our options open to get it right in the future.
Here are some observations from Java:
(a) it appears that a character matches a range if any case-variant of the character matches the range:
matches("D", "[A-Z]", "i") = true
matches("d", "[A-Z]", "i") = true
(b) this rule also works for subtractions:
matches("D", "[A-Z-[D]]", "i") = true
matches("d", "[A-Z-[D]]", "i") = true
(c) the rule doesn't work for negative character groups. Here it appears that ^d removes both "d" and "D" from the group (whereas the rule above would suggest that it removes neither):
matches("D", "[^d]", "i") = false
matches("d", "[^d]", "i") = false
(d) it appears that the "i" flag has no effect on category escapes:
matches("D", "\p{Lu}", "i") = true
matches("d", "\p{Lu}", "i") = false
matches("D", "\P{Lu}", "i") = false
matches("d", "\P{Lu}", "i") = true
This is a terribly empirical way of approaching a specification!
I suspect empiricism here is telling us less about the language spec and more about how carefully the implementors thought all the weird cases through. It would be interesting to see if different JVMs are consistent here.
I think we can go back to first principles a bit: we say "In case-insensitive mode, a character in the input string matches a character specified by the pattern if there is a default case mapping between the two characters as defined in section 3.13 of [The Unicode Standard]." In the case of a character range, I would take "a character specified by the pattern" to be every character in that character range, so if there is a default case mapping between the input string character and any of them, it's a match. Likewise for negative character ranges and so on. That is, you don't mess with the pattern, you check the input string with case folding against the pattern as written.
So I think (* = different from Java reported results):
matches("D", "[A-Z]", "i") = true
matches("d", "[A-Z]", "i") = true
* matches("D", "[A-Z-[D]]", "i") = false
* matches("d", "[A-Z-[D]]", "i") = false
matches("D", "[^d]", "i") = false
matches("d", "[^d]", "i") = false
matches("D", "\p{Lu}", "i") = true
* matches("d", "\p{Lu}", "i") = true
matches("D", "\P{Lu}", "i") = false
* matches("d", "\P{Lu}", "i") = false
The examples from Mike Kay's comment, matches('G','[A-Z-[f-h]]','i') and matches('G','[A-Z-[F-H]]','i'), are not well-formed in Perl: each operand of "-" must be a character, not a range. Perl does not support range subtraction directly (see below)... So, [A-Z-[f-h]] ends up matching the literal [f-h] and nothing else as far as I can tell.
The example matches('G','[A-Z-[F-Hf-h]]','i') is the same, matching the literal string [F-Hf-h] (I don't think it's specified that it works this way, so it's a bug that Perl doesn't trap this case I think).
The example matches('G','[^F-H]','i') does not match in Perl, neither with nor without the /i.
Note that the pattern [A-Z] might or might not match both a and z: a common collation order on Linux at least for case insensitive matching is aAbBcCdD...zZ, so A-Z excludes the "a". This doesn't affect Perl by default, as it uses Unicode codepoints unless you put use locale; in your Perl script (see man pages for perlre and perllocale, or run "perldoc perlre" to see them...)
"G" does not match /[^G]/i in Perl.
Perl's nearest equivalent for range subtraction is the zero-width negative lookahead assertion, (?!e), which matches only if it is not immediately followed by something that matches the contained expression e. Hence, /(?![f-h])[A-Z]/i matches b and w but not g or G.
I think the real question here is whether a range can introduce or exclude unexpected characters when case insensitive. I experimented, but the version of Perl I'm using doesn't like ranges in character classes if they are above codepoint 127 decimal for some reason, although it's otherwise 8-bit clean, and can match explicit characters in classes.
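The negative lookahead idiom can be tried outside Perl too. The sketch below uses Python's re module as a stand-in, on the assumption that it behaves like Perl for these ASCII-only patterns; it says nothing about Perl itself, and the helper name is hypothetical.

```python
import re

# Emulates "A-Z minus f-h" case-insensitively: the lookahead rejects any
# position where f-h (in either case) comes next, then [A-Z] consumes a letter.
SUBTRACTED = re.compile(r"(?![f-h])[A-Z]", re.IGNORECASE)


def matches_subtracted(ch):
    """True if the single character ch matches the emulated subtraction."""
    return SUBTRACTED.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in "bwgG":
        print(ch, matches_subtracted(ch))
```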
Mary, I'm having trouble understanding exactly what you mean by:
"Likewise for negative character ranges and so on. That is, you don't mess with the pattern, you check the input string with case folding against the pattern as written."
I was originally going to propose a spec which might be what you're suggesting:
Under the "i" flag, a string S matches a regex R if there is some case-variant S' of S such that S' matches R in the absence of the "i" flag. A string S' is a case-variant of S if the two strings are the same length and there is a default case mapping between each pair of corresponding characters in the two strings, as defined in section 3.13 of [The Unicode Standard].
This rule seems nice and simple, but it doesn't appear to be the same as Java or Perl, and one must ask whether it is (a) usable, and (b) implementable. It certainly has some surprises, for example "D" matches "[^D]" (because "d" matches "[^D]").
I think I will go back to proposing that the tricky cases should be errors. The rule I propose is: when the "i" flag is used, the regex must not include any of the following:
* a negative character group
* a character class subtraction
* a category escape (catEsc, complEsc, or charProp)
* any of the multi-character escapes \c, \i, \C, \I
* a back-reference
If any of these is present when the "i" flag is used, error FORGNNNN is raised.
The semantics of the "i" flag is then: A string S matches the regex R under the "i" flag if there exists a string S' that is a case-variant of S such that S' matches R in the absence of the "i" flag; with "case-variant" defined as above. In cases where it is necessary to know which characters matched (for example when $n appears in the replacement string of fn:replace()), the characters that matched are those from the original string S, not from S'.
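The "some case-variant S' of S matches R" rule can be tried by brute force. This is a minimal sketch under assumptions: Python's re module (without its own ignore-case flag) stands in for the regex engine, str.lower() and str.upper() approximate the default case mappings, re.search mirrors the unanchored behaviour of fn:matches, and the function names are hypothetical. It enumerates every case-variant string, so it is exponential in the input length and only useful for experiments.

```python
import re
from itertools import product


def char_variants(ch):
    """The character itself plus its single-character lower/upper mappings."""
    out = {ch}
    for v in (ch.lower(), ch.upper()):
        if len(v) == 1:
            out.add(v)
    return sorted(out)


def matches_i(s, pattern):
    """True if some case-variant of s matches pattern case-sensitively."""
    for candidate in product(*(char_variants(c) for c in s)):
        if re.search(pattern, "".join(candidate)):
            return True
    return False


if __name__ == "__main__":
    # The surprise noted above: "D" matches "[^D]" because "d" does.
    print(matches_i("D", "[^D]"))
```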
The definition of fn:replace() contains the rule: "If two alternatives within the pattern both match at the same position in the $input, then the match that is chosen is the one matched by the first alternative". I think it would be prudent to relax this rule so that when the "i" flag is used, it is implementation-dependent which match is chosen. That is, if the input string is "a" and the regex is "A|a", it's undefined whether the "A" or the "a" is matched. Michael Kay
Michael, I think I meant what your "I was going to propose..." text says. (As usual, of course, you said it better.) While the apparent difference with Perl and Java might be troubling, I think what we are discovering here is that they are both woefully underspecified. I do have at least one data point that the semantics you outline is implementable in XQuery regular expressions, FWIW. I am not happy with making these be an error, because we see plenty of scenarios where regular expressions are not literals, but constructed, and I think the semantics you suggest makes sense, even if it leads to results that may seem surprising until you think about it. I also think that your set of cases is too broad: By your rules patterns with such innocuous items as "[^\s]" or "\p{Zs}" cause errors in case-insensitive mode. I also don't see why patterns with back references should be errors. It gets to the point where you either have some pretty complex rules about when these constructs are "non-confusing" from a case-sensitivity point of view, and therefore OK, or you are limiting regular expressions in case-insensitive mode almost to the point of uselessness, where the workaround is, with a fair amount of pain, to reconstruct essentially the semantics proposed.
(This is a short proposal, but it's the result of a lot of work - the waste bin is full of my failed attempts. It's packed with meaning and needs to be read very carefully, with a close eye on the syntax in Schema Part 2.)
PROPOSAL
The detailed rules for the effect of the "i" flag are as follows. In these rules, one character is considered to be a *case-variant* of another character if there is a default case mapping between the two characters as defined in section 3.13 of [The Unicode Standard]. Note that the case-variants of a character under this definition are always single characters.
1. When a normal character (Char) is used as an atom, it represents the set containing that character and all its case-variants. For example, the regular expression "z" expands to "[zZ]".
2. A character range (charRange) represents the set containing all the characters that it would match in the absence of the "i" flag, together with their case-variants. For example, "[A-Z]" expands to "[A-Za-z]". This rule applies also to a character range used in a character class subtraction (charClassSub): thus [A-Z-[IO]] expands to [A-Za-z-[IOio]]. It also applies to a character range used as part of a negative character group: thus [^Q] expands to [^Qq].
3. A back-reference is compared using case-blind comparison: that is, each character must either be the same as the corresponding character of the previously matched string, or must be a case-variant of that character. For example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular expression "([md])[aeiou]\1" when the "i" flag is used.
4. All other constructs are unaffected by the "i" flag. For example, "\p{Lu}" continues to match upper-case letters only.
Michael Kay
First, I'd like to thank Michael for this proposal. It is certainly clear, and while there are behaviours that are perhaps unexpected, I think that is inevitable in this area. Acknowledging Michael's comments about the overflowing trashbin (and contributing a few crumpled sheets there myself), I nevertheless find myself unhappy with talking about "expanding" the regular expression and would prefer to shift to speaking about case-folding as applying to how the input string is matched. From an implementation point of view, expanding regular expressions has to be done on a case-by-case basis (no pun intended!). While it doesn't make it impossible to cache regular expressions (i.e. pre-analyze and parse them), it does make it trickier and less useful to do so, as the regular expression itself is no longer a sufficient key to what the analyzed regular expression is.
A consequence of this shift would be that case-folding would apply uniformly, so that, for example:
fn:matches( "d", "\p{Lu}", "i" ) = fn:matches( "d", "[A-Z]", "i" )
which is not the case under Michael's proposal. I would go on to argue that it would be good if both of these were true. One reason for making this so is that Datatypes says that "\P{Lu}" == [^\p{Lu}] and therefore you get some odd inconsistencies if you don't apply the case-folding to the category escapes as well.
All of which sums up to putting an obligation on me to come up with a counter-proposal. My general tack on this is to tweak two statements in XML Schema Datatypes that define what set of strings a character denotes and what set of strings a character class denotes. But I think Michael's case by case exposition is most excellent and clear, and so I continue with that, tweaking the verbiage to avoid the "expands" phrasing, treating it as clarification because those two rules are sufficient, and adding the additional cases that Michael's proposal doesn't touch.
COUNTER-PROPOSAL:
The detailed rules for the effect of the "i" flag are as follows. In these rules, one character is considered to be a *case-variant* of another character if there is a default case mapping between the two characters as defined in section 3.13 of [The Unicode Standard]. Note that the case-variants of a character under this definition are always single characters.
The rules for regular expressions in [XML Schema Part 2: Datatypes Second Edition] are modified under the influence of the "i" flag in the following way:
1. A normal character c denotes a set of strings that contains one single-character string "x" for each character x that is either c or a case-variant of c.
2. A character class C denotes a set of strings that contains one single-character string "x" for each character x that is either in the class or is a case-variant of some character in the class.
Specifically, the application of these rules means:
* When a normal character (Char) is used as an atom, it represents the set containing that character and all its case-variants. For example, the regular expression "z" matches the same set of characters as "[zZ]".
* A character range (charRange) is a character class, and therefore represents the set containing all the characters that it would match in the absence of the "i" flag, together with their case-variants. For example, "[A-Z]" matches the same set of characters as "[A-Za-z]".
* A character range used in character class subtraction (charClassSub) also represents the set containing all the characters that it would match in the absence of the "i" flag, together with their case-variants. For example, "[A-Z-[IO]]" matches the same set of characters as "[A-Za-z-[IOio]]".
* A negative character group (negCharGroup) is also a character class and the same rule applies. For example, "[^Q]" matches the same set of characters as "[^Qq]".
* A category escape (catEsc) is also a character class and the same rule applies. For example, "\p{Lu}" matches all the upper case letters and their case-variants, and thus the string "d" would match "\p{Lu}".
* A complement category escape (complEsc) is also a character class and the same rule applies. For example, "\P{Lu}" matches all characters that are neither upper case letters nor case-variants of upper case letters. Therefore "d" would not match "\P{Lu}".
* The same rule applies to single-character (SingleCharEsc) and multi-character (MultiCharEsc) escapes, although in practice this will have no effect.
* A back-reference is compared using case-blind comparison: that is, each character must either be the same as the corresponding character of the previously matched string, or must be a case-variant of that character. For example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular expression "([md])[aeiou]\1" when the "i" flag is used.
Use of the word "expand" was perhaps a bit careless. I only used it in examples, and by saying "A expands to B" I was merely trying to find a shorter way of saying "A with the i flag set matches the same set of strings as B without the i flag set". It wasn't intended to describe an algorithm, let alone an implementation (though I probably had one at the back of my mind). I appreciate what you're trying to achieve, which I think I can paraphrase as "if matches(S, P, "") is true, then matches(V(S), P, "i") is true if and only if V(S) is a case-variant of S." However, I don't think your proposal achieves this, and in fact I don't think it's a good idea anyway. I think there are some problems with your proposal. It's not true that a character range (charRange) is a character class (charClass), and it's not true that a negative character group is a character class. It is true that "[^Q]" is a charClass, but if we accept your rule 2, then I think the consequence is that [^Q] matches every character: in the absence of the "i" flag it matches "q", therefore in the presence of the "i" flag it also matches "Q". I think the meaning [^qQ] is more intuitive, and that's why I decided to move the rule down to the level of a charRange. It would be possible to define that a charClassEsc (such as \p{Lu}) matches case-variants of its "normal" set of strings. The reason I didn't do this was again to do with complements and subtraction. If you widen \p{Lu} to include case-variants of its usual characters, do you retain the meaning that \P{Lu} is the complement of \p{Lu} (in which case it matches a smaller set of characters than it did before), or do you retain the meaning that it matches all the characters it would normally match plus their case-variants (a larger set than before)? I felt it was best to cop out here and say its meaning is unchanged. 
In practice, I don't think this is a big problem, because most of the character blocks already include case-variants of characters, and those that don't, like Lu and Ll, exclude them very deliberately. Michael Kay
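The objection about [^Q] can be checked on a toy alphabet. The sketch below is only an illustration: the six-letter universe is hypothetical, and str.lower()/str.upper() stand in for the case mappings. It contrasts widening the whole class after negation (rule 2 of the counter-proposal as read above) with widening the range before negation (the charRange-level rule).

```python
# Hypothetical six-character universe, small enough to inspect by eye.
UNIVERSE = set("PQRpqr")


def variants(chars):
    """The given characters plus their lower- and upper-case forms."""
    return set(chars) | {c.lower() for c in chars} | {c.upper() for c in chars}


def negate(chars):
    """Complement within the toy universe."""
    return UNIVERSE - set(chars)


# Class-level rule: evaluate [^Q] first, then add case-variants of the result.
class_level = variants(negate({"Q"})) & UNIVERSE

# Range-level rule: widen Q to {Q, q} first, then negate.
range_level = negate(variants({"Q"}))

if __name__ == "__main__":
    print(sorted(class_level))  # every character, including Q and q
    print(sorted(range_level))  # everything except Q and q
```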
If we rephrase "expands" I'm happier with your proposal, even if we touch nothing else, although I'd still prefer to state some general rule rather than take it by cases, but I could live without doing so.
> I think there are some problems with your proposal. It's not true that a
> character range (charRange) is a character class (charClass), and it's not true
> that a negative character group is a character class.
Uh, yes it is. It do say in XML Schema part 2:
[11] charClass ::= charClassEsc | charClassExpr | WildcardEsc
[12] charClassExpr ::= '[' charGroup ']'
[13] charGroup ::= posCharGroup | negCharGroup | charClassSub
[23] charClassEsc ::= ( SingleCharEsc | MultiCharEsc | catEsc | complEsc )
I can fill in the posCharGroup and negCharGroup and so on, but I think you get the idea. Everything is a charClass.
I see your point with \p{Lu} and \P{Lu}; let's think about that a bit out loud to see where we get:
Let's just say for abbreviation's sake that normally \p{Lu} denotes the set {"A","B"}. \P{Lu} = [^\p{Lu}] so sayeth Datatypes, so this includes a set of lots and lots of single-character strings, including "a" and "b". If instead of using the handy abbreviation \p{Lu} we had spelled it out: [AB], denoting the set {"A","B"} and the complement would be [^AB], denoting a set containing lots and lots of single-character strings, including "a" and "b", so this is all consistent.
Under the rules of the "i" flag, if we say \p{Lu} means what it means with other character classes, it denotes the set {"A", "B", "a", "b"}. Following the equation from Datatypes we get that \P{Lu} denotes a set with lots and lots of characters but not "a" or "b". If we had written out \p{Lu} as [AB] that would also have denoted the set {"A","B","a","b"} and the complement [^AB] would have also denoted the set with lots and lots of characters but not "a" or "b". So again, this is entirely consistent.
Suppose, however, that under the rules of the "i" flag, we leave \p{Lu} and \P{Lu} alone.
The \p{Lu} denotes the set {"A","B"}, and \P{Lu} denotes the set with lots and lots of single character strings including "a" and "b". If, not knowing this handy abbreviation, I had written out \p{Lu} as [AB], I will denote a different set under the "i" flag: {"A","B","a","b"}. Likewise [^AB] will denote a set that does not include "a" and "b". I find this inconsistency pretty baffling to explain, and having to special case here makes implementation harder. So I think we should apply the rule consistently across all character classes.
Both of the recent proposals have had the example
For example, "[A-Z]" expands to "[A-Za-z]".
But I think that they would (both) imply [A-Za-zſK], if my understanding of the proposals (and http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) is correct. Both of these are listed as Common case mappings:
017F; C; 0073; # LATIN SMALL LETTER LONG S
212A; C; 006B; # KELVIN SIGN
Actually I'm fairly sure that the proposals imply that [a-z] expands to [A-Za-zſK] (as toLowercase() maps KELVIN SIGN to k).
However in the case of the actual example [A-Z] it depends on the intended meaning of:
one character is considered to be a *case-variant* of another character if there is a default case mapping between the two characters as defined in section 3.13 of [The Unicode Standard].
There is no case mapping of KELVIN SIGN into the range A-Z, only into the range a-z. However it would be pretty strange if [a-z] and [A-Z] did not denote the same set if i is set, so perhaps a "case variant" needs to be defined such that two characters are case variants if there are default Unicode case mappings that map the characters to the same character, so K and KELVIN SIGN would be case variants as they both lower case to k.
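The asymmetry described here is easy to observe. The sketch below assumes Python's built-in str.lower(), str.upper() and str.casefold(), which are based on the Unicode character database, as a stand-in for the default case mappings and for CaseFolding.txt; the helper names are hypothetical.

```python
KELVIN = "\u212a"  # KELVIN SIGN
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S


def maps_into(ch, lo, hi):
    """True if ch, or its single-character lower/upper mapping, lies in lo..hi."""
    candidates = (ch, ch.lower(), ch.upper())
    return any(lo <= v <= hi for v in candidates if len(v) == 1)


def fold_equal(a, b):
    """True if both characters case-fold to the same string."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(maps_into(KELVIN, "a", "z"))  # KELVIN SIGN lower-cases to k
    print(maps_into(KELVIN, "A", "Z"))  # but has no mapping into A-Z
    print(fold_equal(KELVIN, "K"))      # both fold to k
```

Comparing folded forms treats "K" and KELVIN SIGN as variants of each other, which is the symmetric behaviour the comment asks for.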
Response to Mary:
I said: * it's not true that a negative character group is a character class.
You said: Uh, yes it is. It do say in XML Schema part 2:
[11] charClass ::= charClassEsc | charClassExpr | WildcardEsc
[12] charClassExpr ::= '[' charGroup ']'
[13] charGroup ::= posCharGroup | negCharGroup | charClassSub
[23] charClassEsc ::= ( SingleCharEsc | MultiCharEsc | catEsc | complEsc )
I can fill in the posCharGroup and negCharGroup and so on, but I think you get the idea. Everything is a charClass.
I say: oh no it isn't! A negative character group is a charGroup, and a charGroup *enclosed in square brackets* is a charClass. But a negative character group on its own, without the square brackets, is not a charClass.
As regards \P{Lu}, you can maintain either one of two invariants
(a) \P{Lu} == [^\p{Lu}]
(b) if matches("X", P, "") then matches("x", P, "i") for any regex P
but you can't maintain both.
I think your logic is flawed here: "If we had written out \p{Lu} as [AB] that would also have denoted the set {"A","B","a","b"} and the complement [^AB] would have also denoted the set with lots and lots of characters but not "a" or "b". So again, this is entirely consistent." You're relying here on [^AB] meaning [^ABab]. But under your proposal that's not what it means. Under your proposal [^AB] matches every character. [^AB] is a charClass, therefore rule 2 applies, which says
A character class C denotes a set of strings that contains one single-character string "x" for each character x that is either in the class or is a case-variant of some character in the class.
If I'm reading that correctly (perhaps I'm not?) you're saying "a" is in the class [^AB], therefore "A" is also in the class [^AB].
In my proposal I'm breaking invariant (b): I'm saying that [^AB] is a *smaller* set of characters under the "i" flag than in the absence of the "i" flag. I think that's the right thing to do.
Having already broken that invariant, I'm then retaining invariant (a) with my proposed treatment of charClassEsc. Michael Kay
David: Thanks for that comment, which is somewhat orthogonal to the rest of the thread. I did have slight worries that the definition based on Unicode default case mappings might be a little problematic in cases where it's non-symmetric (or non-transitive, etc). Let's get the other stuff sorted and come back to that.
Michael: OK, I see now where my logic is flawed, thank you. That makes sense.
Let's now address David's concern about how we define case-variants. I suggest that rather than appealing directly to Unicode, we instead define it in terms of our own lower-case() and upper-case() functions (which are themselves defined in terms of Unicode). This seems to give a better chance of getting them consistent. The rule that seems to work is:
For characters C1 and C2, considered as strings of length one, C1 is a case-variant of C2 if (fn:lower-case(C1) eq fn:lower-case(C2) or fn:upper-case(C1) eq fn:upper-case(C2)) when compared using the Unicode codepoint collation.
Under this rule, x212A (Kelvin sign) is a case-variant of "k" and also of "K". So this leads to the revised proposal as follows:
PROPOSAL v2
The detailed rules for the effect of the "i" flag are as follows. In these rules, one character C2 is considered to be a *case-variant* of another character C1 if the following XPath expression returns true, when the two characters are considered as strings of length one, and the Unicode codepoint collation is used:
fn:lower-case(C1) eq fn:lower-case(C2) or fn:upper-case(C1) eq fn:upper-case(C2)
Note that the case-variants of a character under this definition are always single characters.
1. When a normal character (Char) is used as an atom, it represents the set containing that character and all its case-variants. For example, the regular expression "z" will match both "z" and "Z".
2. A character range (charRange) represents the set containing all the characters that it would match in the absence of the "i" flag, together with their case-variants. For example, the regular expression "[A-Z]" will match all the letters A-Z and all the letters a-z. It will also match certain other characters such as x212A (KELVIN SIGN), since fn:lower-case("K") is "k". This rule applies also to a character range used in a character class subtraction (charClassSub): thus [A-Z-[IO]] will match characters such as "A", "B", "a", and "b", but will not match "I", "O", "i", or "o". The rule also applies to a character range used as part of a negative character group: thus [^Q] will match every character except "Q" and "q" (these being the only case-variants of "Q" in Unicode).
3. A back-reference is compared using case-blind comparison: that is, each character must either be the same as the corresponding character of the previously matched string, or must be a case-variant of that character. For example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular expression "([md])[aeiou]\1" when the "i" flag is used.
4. All other constructs are unaffected by the "i" flag. For example, "\p{Lu}" continues to match upper-case letters only.
Michael Kay
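The case-variant test and rule 2 can be sketched directly. This is an illustration under assumptions, not an implementation of the spec: Python's str.lower() and str.upper() stand in for fn:lower-case and fn:upper-case, string equality stands in for the codepoint collation, and all function names are hypothetical.

```python
def is_case_variant(c1, c2):
    """The XPath test above, with str.lower/str.upper as stand-ins."""
    return c1.lower() == c2.lower() or c1.upper() == c2.upper()


def code_range(lo, hi):
    """Characters from lo to hi inclusive, by code point."""
    return [chr(c) for c in range(ord(lo), ord(hi) + 1)]


def in_chars_i(ch, chars):
    """True if ch is a case-variant of some character in chars."""
    return any(is_case_variant(ch, c) for c in chars)


def matches_a_to_z(ch):
    # [A-Z] under "i": the range plus all case-variants of its members.
    return in_chars_i(ch, code_range("A", "Z"))


def matches_a_to_z_minus_io(ch):
    # [A-Z-[IO]] under "i": both ranges are widened before subtracting.
    return matches_a_to_z(ch) and not in_chars_i(ch, "IO")


def matches_not_q(ch):
    # [^Q] under "i": Q is widened to its case-variants, then negated.
    return not in_chars_i(ch, "Q")


if __name__ == "__main__":
    print(matches_a_to_z("\u212a"))      # KELVIN SIGN
    print(matches_a_to_z_minus_io("i"))
    print(matches_not_q("q"))
```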
> So this leads to the revised proposal as follows: This works for me. Only comment is that every character is a case variant of itself, so your rules 1 and 3 can be compressed to 1. When a normal character (Char) is used as an atom, it represents the set of case-variants of that character. For example, the regular expression "z" expands to "[zZ]". 3. A back-reference is compared using case-blind comparison: that is, each character must be a case-variant of the corresponding character of the previously matched string. For example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular expression "([md])[aeiou]\1" when the "i" flag is used. I started to write this comment thinking that the re-write would make things clearer, highlighting that the characters are treated uniformly and there aren't really two cases here. However having done it perhaps it relies too much on the definition and the bit of extra redundancy in your wording is clearer, leave it to the editors to judge...
Further information on Perl -- case insensitivity (both in ranges and elsewhere) only affects a-z and A-Z, and not, for example, e-acute. This clearly wouldn't work for us! Liam
The WGs decided on 9/27 to accept Michael Kay's proposal in comment #16. See below.
The detailed rules for the effect of the "i" flag are as follows. In these rules, one character C2 is considered to be a *case-variant* of another character C1 if the following XPath expression returns true, when the two characters are considered as strings of length one, and the Unicode codepoint collation is used:
fn:lower-case(C1) eq fn:lower-case(C2) or fn:upper-case(C1) eq fn:upper-case(C2)
Note that the case-variants of a character under this definition are always single characters.
1. When a normal character (Char) is used as an atom, it represents the set containing that character and all its case-variants. For example, the regular expression "z" will match both "z" and "Z".
2. A character range (charRange) represents the set containing all the characters that it would match in the absence of the "i" flag, together with their case-variants. For example, the regular expression "[A-Z]" will match all the letters A-Z and all the letters a-z. It will also match certain other characters such as x212A (KELVIN SIGN), since fn:lower-case("K") is "k". This rule applies also to a character range used in a character class subtraction (charClassSub): thus [A-Z-[IO]] will match characters such as "A", "B", "a", and "b", but will not match "I", "O", "i", or "o". The rule also applies to a character range used as part of a negative character group: thus [^Q] will match every character except "Q" and "q" (these being the only case-variants of "Q" in Unicode).
3. A back-reference is compared using case-blind comparison: that is, each character must either be the same as the corresponding character of the previously matched string, or must be a case-variant of that character. For example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular expression "([md])[aeiou]\1" when the "i" flag is used.
4. All other constructs are unaffected by the "i" flag. For example, "\p{Lu}" continues to match upper-case letters only.
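Rule 3 can be illustrated with a hand-written matcher for the one example pattern. This is a sketch under assumptions: it is not a regex engine, it only checks whole three-character strings against "([md])[aeiou]\1", str.lower() and str.upper() stand in for fn:lower-case and fn:upper-case, and the function names are hypothetical.

```python
def is_case_variant(c1, c2):
    """Case-variant test from the accepted proposal, via str.lower/str.upper."""
    return c1.lower() == c2.lower() or c1.upper() == c2.upper()


def in_chars_i(ch, chars):
    """True if ch is a case-variant of some character in chars."""
    return any(is_case_variant(ch, c) for c in chars)


def matches_example(s):
    # Whole-string check for ([md])[aeiou]\1 under the "i" flag.
    if len(s) != 3:
        return False
    first, vowel, last = s
    return (
        in_chars_i(first, "md")              # group 1: m or d, any case
        and in_chars_i(vowel, "aeiou")       # a vowel, any case
        and is_case_variant(last, first)     # back-reference, case-blind
    )


if __name__ == "__main__":
    for word in ("Mum", "mom", "Dad", "DUD", "Mud"):
        print(word, matches_example(word))
```

The back-reference is compared against what group 1 actually matched in the input, so "Mud" fails even though "d" on its own would satisfy the group.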