Regex to match all words except a given list


Question

I am trying to write a replacement regular expression to surround all words in quotes except the words AND, OR and NOT.

I have tried the following for the match part of the expression:

    (?i)(?<word>[a-z0-9]+)(?<!and|not|or)

and

    (?i)(?<word>[a-z0-9]+)(?!and|not|or)

but neither works. The replacement expression is simple, and at the moment it puts quotes around every word:

    "${word}"

So

    This and This not That

becomes

    "This" and "This" not "That"

11/2/2008 3:04:29 AM

Accepted Answer

This is a little dirty, but it works:

    (?<!\b(?:and| or|not))\b(?!(?:and|or|not)\b)

In plain English, this matches any word boundary that is neither preceded nor followed by the whole word "and", "or", or "not". Only whole words count: the position after the word "sand", for example, still matches even though it is preceded by the letters "and".

The space in front of "or" in the look-behind pads that alternative to three characters, the same length as "and" and "not". It is needed for regex engines that only accept a fixed-length look-behind. Try it and see whether it already solves your problem.

EDIT: Applied to the string "except the words AND, OR and NOT." as a global replace with single quotes, this returns:

    'except' 'the' 'words' AND, OR and NOT.
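For anyone trying this in .NET (which the `${word}` replacement syntax in the question suggests), here is a minimal sketch of the same boundary idea. It rests on two assumptions that are not in the answer: the match is case-insensitive, and the padding space is dropped, since .NET accepts look-behind alternatives of different lengths. The class and method names are made up for illustration. One caveat with the padded form: `\b or` only matches when the space is directly preceded by a word character, so an "or" that follows punctuation or starts the string can still pick up a stray quote.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryQuoter
{
    // Hypothetical helper, not part of the answer. Same look-around idea as
    // the pattern above, minus the padding space before "or".
    private static readonly Regex Boundary = new Regex(
        @"(?<!\b(?:and|or|not))\b(?!(?:and|or|not)\b)",
        RegexOptions.IgnoreCase);

    // Inserts a single quote at every word boundary that does not touch
    // one of the operator words (and, or, not).
    public static string Quote(string input)
    {
        return Boundary.Replace(input, "'");
    }

    public static void Main()
    {
        Console.WriteLine(Quote("except the words AND, OR and NOT."));
        // prints: 'except' 'the' 'words' AND, OR and NOT.
    }
}
```

Because the pattern is zero-width, the replacement string is just the quote character itself; nothing from the input is consumed or rewritten.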
10/28/2008 12:13:43 PM

Expert Answer

Call me crazy, but I'm not a fan of fighting regex; I limit my patterns to simple things I can understand, and often cheat for the rest - for example via a MatchEvaluator:

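    // Requires: using System; using System.Text.RegularExpressions;
    string[] whitelist = new string[] { "and", "not", "or" };
    string input = "foo and bar or blop";
    // IgnoreCase lets [a-z0-9]+ match whole words such as "This" or "AND".
    string result = Regex.Replace(input, @"([a-z0-9]+)",
        delegate(Match match) {
            string word = match.Groups[1].Value;
            // Leave whitelisted words alone; wrap every other word in quotes.
            return Array.IndexOf(whitelist, word.ToLowerInvariant()) >= 0
                ? word : ("\"" + word + "\"");
        }, RegexOptions.IgnoreCase);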

(edited for more terse layout)
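If you want the same idea as a reusable method, something along these lines should do. This is only a sketch: the `WordQuoter` class, its `Quote` method, and the use of a `HashSet<string>` with a case-insensitive comparer are my own assumptions rather than part of the answer. It is run here against the sample sentence from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordQuoter
{
    // Words to leave untouched; the comparer makes the lookup case-insensitive,
    // so "AND", "And" and "and" are all treated as operators.
    private static readonly HashSet<string> Operators =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "or", "not" };

    // Wraps every alphanumeric word in double quotes unless it is an operator.
    public static string Quote(string input)
    {
        return Regex.Replace(
            input,
            @"[a-z0-9]+",
            m => Operators.Contains(m.Value) ? m.Value : "\"" + m.Value + "\"",
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Quote("This and This not That"));
        // prints: "This" and "This" not "That"
    }
}
```

Keeping the exclusion list in a set rather than in the pattern means new operator words can be added without touching the regex.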

10/28/2008 10:11:59 AM

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow