Regular Expressions: How I Hate|Love Thee

August 7, 2008 at 8:33 am | In Tech

I’ve spent a lot of time in the last week wrestling with regular expressions in PHP (functions like preg_match, preg_replace and preg_match_all). Regular expressions are a very powerful and compact way of searching through strings for certain patterns. They are also the most frustrating thing I’ve ever learned in programming. There are a number of tutorials and reference guides online, but while they’re good at explaining the basic parts, they don’t mention how the parts come together. I even looked in some PHP books and was disappointed. I found these two tutorials to be pretty good, but both left out two crucial troubleshooting tips.

So, here are my solutions to the two major problems I had while learning this stuff:

1. Your expression matches and returns too much.

Say your expression is “/<b>(.*)<\/b>/” (meaning, capture everything between the <b> tags; the slash in </b> is escaped because / is also the pattern delimiter) and your data is “<b>hey</b><b>blah</b>”. Simple as can be, except it captures everything all the way to the very last </b>, so it returns “hey</b><b>blah”. That’s because quantifiers like * and + are “greedy” by default in PHP’s regex engine. That is, they match as much as they possibly can. The way to fix it is to add a ? after the quantifier, so the pattern now reads: “/<b>(.*?)<\/b>/”. The ? tells the preceding quantifier to be “lazy” (not greedy), so it stops at the first possible match. It works for both * and +.
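
Here’s a minimal sketch of the difference, using the sample data from above. The variable names ($data, $greedy, $lazy, $all) are just made up for the example, and the slash in </b> is escaped as \/ because / is the pattern delimiter:

```php
<?php
// Sample data from the post: two bold tags back to back.
$data = '<b>hey</b><b>blah</b>';

// Greedy: .* runs on to the last </b> it can find.
preg_match('/<b>(.*)<\/b>/', $data, $greedy);

// Lazy: .*? stops at the first </b> it can find.
preg_match('/<b>(.*?)<\/b>/', $data, $lazy);

// preg_match_all collects every lazy match; index 1 holds the captured groups.
preg_match_all('/<b>(.*?)<\/b>/', $data, $all);

echo $greedy[1], "\n"; // hey</b><b>blah
echo $lazy[1], "\n";   // hey
print_r($all[1]);      // Array ( [0] => hey [1] => blah )
```

If you want every tag’s contents rather than just the first, preg_match_all with the lazy pattern is probably the one to reach for.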

2. Your expression doesn’t match or return anything, even though your expression is ridiculously simple.

I had a lot of trouble with this. You have to remember that the . matches any one character except newline characters. So, if you were using this pattern:

“/<b>(.*?)<\/b>/”

with this data:

“<b>

hey

</b>

<b>

blah

</b>”

and got no results, it’s because the . doesn’t match across newlines, so the pattern never reaches the closing </b> and nothing is captured. Annoying, but there are a few ways to fix it. I chose to add the s modifier to the end of the pattern, which makes the . match newline characters too, so the regex engine effectively treats the whole string as a single line (the captured output keeps its newlines; only the matching behaves as if it were one line). So, the fixed pattern looks like:

“/<b>(.*?)<\/b>/s”
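
To see the modifier at work, here’s a rough sketch with data shaped like the example above (I’ve used single newlines to keep it short, and the variable names are made up for illustration). The slash in </b> is escaped as \/ since / is the delimiter:

```php
<?php
// Multi-line data: each tag's contents sit on their own line.
$data = "<b>\nhey\n</b>\n<b>\nblah\n</b>";

// Without the s modifier, . stops at every newline, so nothing matches.
$without = preg_match_all('/<b>(.*?)<\/b>/', $data, $noMatches);

// With the s modifier, . also matches newlines.
$with = preg_match_all('/<b>(.*?)<\/b>/s', $data, $matches);

echo $without, "\n"; // 0
echo $with, "\n";    // 2

// The captured text still contains its newlines; trim() is one way to drop them.
print_r(array_map('trim', $matches[1])); // Array ( [0] => hey [1] => blah )
```

preg_match_all returns the number of full matches it found, which makes it a handy quick check when a pattern mysteriously returns nothing.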

Hope this helps. :)

I’ve been sorta MIA for a while. Like I’ve mentioned before, I left my last job and have been working on some projects and picking up some one-off paying gigs. For example, last week I transcribed around 25,000 words in about 9 hours. That’s a lot of words to write. As you may have noticed, I am no longer posting here with my usual regularity, and it will stay that way for the foreseeable future. I’ll get back to regular posting at some point, probably, but perhaps at a new site or something. Cheers!
