Aug 07, 2008
 

I’ve spent a lot of time wrestling with regex for PHP (functions like preg_match, preg_replace, preg_match_all) in the last week. Regular expressions are a very powerful and compact way of searching through strings for certain patterns. They are also the most frustrating thing I’ve ever learned in programming. There are a number of tutorials and reference guides online, but while they’re good at explaining the basic parts, they don’t mention how the parts come together. I even looked in some PHP books and was disappointed. I found these two tutorials to be pretty good, but both left out two crucial troubleshooting tips.

So, here are my solutions to two major problems I had while learning this stuff:

1. Your expression matches and returns too much.

Say your expression is “/<b>(.*)<\/b>/” (meaning, capture everything between the <b> tags) and the data is “<b>hey</b><b>blah</b>”. The slash in the closing tag is escaped because / is also the pattern delimiter; leave it unescaped and PHP complains about an unknown modifier. Simple as can be, except it captures everything all the way to the very last </b>, so it returns “hey</b><b>blah”. That’s because quantifiers like * and + are “greedy” by default. That is, they keep going to the very last possible match. The way to fix it is to add a ? after the quantifier so the regex now reads: “/<b>(.*?)<\/b>/”. The ? tells the preceding quantifier to be “lazy” (not greedy), so it stops at the first possible match, and it works for both * and +.
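Here’s a minimal sketch of the difference, using the data from the example above. The variable names ($data, $greedy, $lazy, $all) are just mine for illustration.

```php
<?php
// Sample data from the example above.
$data = '<b>hey</b><b>blah</b>';

// The / in </b> is escaped because / is also the pattern delimiter.

// Greedy: .* runs all the way to the last </b> in the string.
preg_match('/<b>(.*)<\/b>/', $data, $greedy);

// Lazy: .*? stops at the first </b> it can reach.
preg_match('/<b>(.*?)<\/b>/', $data, $lazy);

// preg_match_all with the lazy pattern collects each tag separately.
preg_match_all('/<b>(.*?)<\/b>/', $data, $all);

echo $greedy[1], "\n";          // hey</b><b>blah
echo $lazy[1], "\n";            // hey
echo implode(', ', $all[1]), "\n"; // hey, blah
```

In each result array, index 0 holds the full match and index 1 holds whatever the parentheses captured.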

2. Your expression doesn’t match or return anything, even though your expression is ridiculously simple.

I had a lot of trouble with this. You have to remember that the . matches any one character except newline characters. So, if you were using this pattern:

“/<b>(.*?)<\/b>/”

with this data:

“<b>

hey

</b>

<b>

blah

</b>”

and got no results, it’s because the . doesn’t match newlines, so the pattern can never get from the opening tag to the closing tag. Annoying, but there are a few ways to fix it. I chose to add the s modifier to the end of the pattern, which makes the . match newline characters too, so the regex engine effectively treats the whole string as a single line (the captured output keeps its newlines; it’s only one line as far as the regex is concerned). So, the fixed pattern looks like:

“/<b>(.*?)<\/b>/s”
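A quick sketch of the before and after, assuming multi-line data like the example above (I’ve left out the blank lines; the variable names are my own):

```php
<?php
// Sample data with newlines inside and between the tags.
$data = "<b>\nhey\n</b>\n<b>\nblah\n</b>";

// Without the s modifier, . refuses to cross a newline, so nothing matches.
$without = preg_match_all('/<b>(.*?)<\/b>/', $data, $m1);

// With the s modifier, . matches newlines too.
$with = preg_match_all('/<b>(.*?)<\/b>/s', $data, $m2);

// The captures still contain their newlines; trim() cleans them up.
$words = array_map('trim', $m2[1]);

echo $without, "\n";               // 0
echo $with, "\n";                  // 2
echo implode(', ', $words), "\n";  // hey, blah
```

preg_match_all returns the number of full matches, which makes it an easy way to check whether the modifier did its job.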

Hope this helps. 🙂

I’ve been sorta MIA for a while. Like I’ve mentioned before, I left my last job and have been working on some projects and picking up some one-off paying gigs. For example, last week, I transcribed around 25,000 words in about 9 hours. That’s a lot of words to write. As you may have noticed, I am no longer posting with any regularity here, and it will stay that way for the foreseeable future. I’ll get back to regular posting at some point, probably, but perhaps at a new site or something. Cheers!

  One Response to “Regular Expressions: How I Hate|Love Thee”

  1. […] I continually find myself continually frustrated with regex as well as continually in a state of wonder at its power, I found the above […]
