This week, I finally learned how to fix a regular expression problem that has long vexed me and stumped my Googling skills. The details are tricky to explain, but the simplest example is easy to describe. In many cases, a regular expression would get the last instance of a match in a string instead of the first instance. This has caused me all sorts of headaches when using preg_replace() to touch up some HTML in Drupal.
Of course, it's rarely necessary (or even a good idea) to try parsing HTML with regular expressions, and QueryPath is often a better solution. That said, I ignored my own advice and decided to use a regular expression in an input filter anyway. Let's say I want to wrap the text of any header tag (h1 through h6) in a span tag, so that I can better style it with CSS. Here's a basic preg_replace() call that you might expect would do the trick:
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";
$regex = "/(<h[1-6]>)(.*)(<\/h[1-6]>)/i";
$spanned = preg_replace($regex, '$1<span>$2</span>$3', $html);
- The regular expression looks for a header tag (h1 through h6): (<h[1-6]>). This will become $1 in the replacement string.
- Next, it grabs any text within the header tag: that's (.*), which corresponds to $2.
- Third, it finds the closing header tag (again, h1 through h6): (<\/h[1-6]>), which corresponds to $3.
- And finally, I included the i at the end for case-insensitivity, in case the HTML contains <H2> instead of <h2>.
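To make the three groups concrete, here is a minimal sketch that runs the same pattern against a single uppercase header and prints each captured piece. The sample string and the $matches variable are my own choices for illustration; with only one header in the string, greediness does not come into play yet.

```php
<?php
// One header only, written in uppercase to exercise the i modifier.
$html = "<H2>First Header</H2>";
$regex = "/(<h[1-6]>)(.*)(<\/h[1-6]>)/i";

preg_match($regex, $html, $matches);

// $matches[0] holds the full match; [1] through [3] are the capture groups.
echo $matches[1], "\n"; // <H2>
echo $matches[2], "\n"; // First Header
echo $matches[3], "\n"; // </H2>
```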
The replacement string is pretty simple: it just pieces the header back together. $1 is the opening tag, then an opening span, $2 is the text of the header tag, the closing span, then $3 is the closing tag.
Now, with all that in mind, you might expect the output to look like this:
<h2><span>First Header</span></h2> Some text. <h3><span>Second Header</span></h3>
But you would be wrong, just like I was wrong. What you would actually get is this:
<h2><span>First Header</h2> Some text. <h3>Second Header</span></h3>
The opening and closing span get split up across the string. Every time I needed to use a regular expression for something, this would happen, and I would curse under my breath a little bit.
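One way to see what is going on is to ask PHP how many matches the pattern actually finds. The sketch below uses preg_match_all() for that; the $count and $matches variable names are mine, and the sample HTML is the two-header string from this post.

```php
<?php
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";
$regex = "/(<h[1-6]>)(.*)(<\/h[1-6]>)/i";

// preg_match_all() returns the number of full-pattern matches it found.
$count = preg_match_all($regex, $html, $matches);

echo $count, "\n";         // 1
// The text captured by the middle group runs across both headers.
echo $matches[2][0], "\n"; // First Header</h2> Some text. <h3>Second Header
```

A single match, with the middle group swallowing the first closing tag and the second opening tag, is exactly what produces the split span.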
The problem here is that the middle match, the (.*), is "greedy." It just keeps matching characters up until the last place that the third part, (<\/h[1-6]>), will match. Because, remember, that will match on </h2> and </h3>, and it's not smart enough to make sure that the number in the closing tag matches the number in the opening tag (if there's a way to do that, I haven't found it yet). So, the regular expression matches the first opening tag, and the last closing tag, and helpfully wraps everything in between in a span tag. It sees our HTML string as containing only a single match.
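On the question of matching the number in the closing tag to the number in the opening tag, PCRE back-references look like one way to do it. This is a sketch of my own variation rather than the pattern used above: it adds a capture group around the digit, so the group numbers in the replacement shift by one. It happens to give the right output for this sample, but it is not a general cure for greediness, since two headers of the same level in one string would still be merged into a single match.

```php
<?php
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";

// Single quotes matter here: in a double-quoted PHP string, \2 would be
// read as an octal character escape instead of reaching the regex engine.
// Groups: 1 = opening tag, 2 = heading level, 3 = text, 4 = closing tag.
$regex = '/(<h([1-6])>)(.*)(<\/h\2>)/i';

$spanned = preg_replace($regex, '$1<span>$3</span>$4', $html);

echo $spanned, "\n";
// <h2><span>First Header</span></h2> Some text. <h3><span>Second Header</span></h3>
```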
The good news is that this is easy to fix. Like the i I tacked on there for case insensitivity, I can also tack on a U to make the regular expression "ungreedy," like so:
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";
$regex = "/(<h[1-6]>)(.*)(<\/h[1-6]>)/iU";
$spanned = preg_replace($regex, '$1<span>$2</span>$3', $html);
The only change here is the addition of the U at the end of the $regex variable. With that in place, the regular expression will find two matches in the HTML, and I finally get what I wanted all along:
<h2><span>First Header</span></h2> Some text. <h3><span>Second Header</span></h3>
The U modifier works for the entire regular expression, so it's good to use if you want your entire expression to be ungreedy. Just today, I learned from esteemed Lullabot James Sansbury that you can also be more specific about greediness by adding a ? after a * or + to make that one quantifier ungreedy. In our example, it would look like this:
$regex = "/(<h[1-6]>)(.*?)(<\/h[1-6]>)/i";
Placing the ? after the .*, I get the same result as I did when using the U modifier at the end. In this case, I'm only using a single * in my regular expression; if I had more than that one, I might want to use this method instead of the global modifier.
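Here is a small side-by-side sketch of the two approaches on the sample HTML. The variable names are mine. The third case goes slightly beyond what I described above: according to the PHP documentation, U inverts greediness, so a ? after a quantifier under U makes it greedy again. That is worth knowing before combining the two.

```php
<?php
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";
$replacement = '$1<span>$2</span>$3';

// Ungreedy via the ? suffix: only this one quantifier is affected.
$lazy = preg_replace('/(<h[1-6]>)(.*?)(<\/h[1-6]>)/i', $replacement, $html);

// Ungreedy via the U modifier: every quantifier in the pattern is affected.
$ungreedy = preg_replace('/(<h[1-6]>)(.*)(<\/h[1-6]>)/iU', $replacement, $html);

// Both together: under U, the ? flips this quantifier back to greedy.
$flipped = preg_replace('/(<h[1-6]>)(.*?)(<\/h[1-6]>)/iU', $replacement, $html);

var_dump($lazy === $ungreedy); // bool(true)
echo $flipped, "\n";
// <h2><span>First Header</h2> Some text. <h3>Second Header</span></h3>
```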
You can learn more about how i, U, and other regex modifiers work in the Pattern Modifiers documentation on php.net. There's also a handy tool called RegExr that will visualize the string as it's matched by a regular expression. Check out the original, greedy regex as compared to the revised, non-greedy alternative.