SPL Deepdive: RegexIterator
Tagged with: [ regexiterator ] [ spl ]
If everything goes according to plan (which is never the case), I’ll try to highlight some of the fascinating stuff that can be found inside the SPL. I do a lot of presentations about the SPL, and one of the things I like to tell people is that even though the SPL (iterators in particular) is a magnificent piece of code that is often underused and misunderstood, it does come with some quirks and glitches that aren’t documented properly.
Today, I’ll explain the RegexIterator in some depth. This iterator extends the FilterIterator, meaning it can be used to filter out unwanted entries from parent iterators.
A simple use-case would be to filter on certain names that are taken from a DirectoryIterator. This iterator is very simple in usage and pretty obvious for most people:
This iterator will now filter out all file names that do NOT start with “foo”.
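A minimal sketch of that use-case, assuming a throwaway directory with made-up file names (the directory and the names are illustrative, not taken from the original listing):

```php
<?php
// Build a temporary directory so the example is self-contained.
// The directory name and the file names are hypothetical.
$dir = sys_get_temp_dir() . '/regexiterator_demo_' . uniqid();
mkdir($dir);
$created = ['foo.txt', 'foobar.log', 'bar.txt'];
foreach ($created as $name) {
    touch($dir . '/' . $name);
}

// Keep only the entries whose file name starts with "foo".
$iterator = new RegexIterator(new DirectoryIterator($dir), '/^foo/');

$names = [];
foreach ($iterator as $file) {
    // $file is a file-info object, not a plain string.
    $names[] = $file->getFilename();
}
sort($names); // directory order is not guaranteed
print_r($names);

// Remove the temporary files again.
foreach ($created as $name) {
    unlink($dir . '/' . $name);
}
rmdir($dir);
```

The "." and ".." entries are dropped as well, simply because they do not match the pattern.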
How does it work:
First of all, the DirectoryIterator returns by default SplFileInfo objects, not file names. The RegexIterator::accept() method, the method that does the filtering, will cast anything that is not a string into a string, since that’s something we can apply our regular expression on. From there, it will call the pcre_exec() function, and return either a boolean true or false depending on whether or not any matches are found. When false is returned, the RegexIterator will not pass this element to the foreach, but continue with the next value.
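The string cast can be seen with any object that implements __toString(). The sketch below assumes a hypothetical Label class; only its string form is matched against the pattern, while the foreach still receives the original object:

```php
<?php
// Hypothetical value object, used only to demonstrate the string cast.
final class Label
{
    private $text;

    public function __construct($text)
    {
        $this->text = $text;
    }

    public function __toString()
    {
        return $this->text;
    }
}

$labels = new ArrayIterator([
    new Label('foo-1'),
    new Label('bar-2'),
    new Label('foo-3'),
]);

// Default MATCH mode: accept() matches the pattern against (string) $label.
$iterator = new RegexIterator($labels, '/^foo/');

$kept = [];
foreach ($iterator as $key => $label) {
    // The element that comes through is still the Label object itself.
    $kept[$key] = get_class($label) . ':' . $label;
}
print_r($kept);
```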
RegexIterator modes
If you look at the php.net documentation for the RegexIterator constructor, you’ll find that the iterator has three additional arguments that can be passed during initialization: $mode, $flags and $preg_flags.
The mode can be one of the following modes that are defined as constants inside the RegexIterator: ALL_MATCHES, GET_MATCH, MATCH, REPLACE and SPLIT.
You can change this mode after you have constructed the iterator by calling setMode($new_mode) on it. It’s even possible to change the mode inside a foreach() iteration if you like (even though I can’t think of any reason why you would want to do this).
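As a small sketch of switching modes on an existing iterator (the sample values are made up for illustration):

```php
<?php
$source   = new ArrayIterator(['foo1', 'bar', 'foo22']);
$iterator = new RegexIterator($source, '/foo(\d+)/');

// Default MATCH mode: matching elements come through unchanged.
$asStrings = iterator_to_array($iterator);

// Switch the same iterator to GET_MATCH and run it again.
$iterator->setMode(RegexIterator::GET_MATCH);
$asMatches = iterator_to_array($iterator);

print_r($asStrings);
print_r($asMatches);
```

The keys of the parent iterator are preserved in both runs, so the rejected element leaves a gap at key 1.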
The default RegexIterator mode is MATCH, meaning it will just check whether the regex matched the current element. It doesn’t do anything with the match results: it simply returns true when there was a match, and false otherwise.
The GET_MATCH mode behaves a bit differently. It will not only check to see if the regex matches on the current element, but it will also return information about the capture groups that matched.
Take for instance the following code:
It does not return the filtered elements directly. Instead, each element becomes an array whose first entry is the text matched by the complete pattern, optionally followed by one or more capture groups (sometimes called sub patterns), which you add to your regular expression with ().
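A minimal sketch of GET_MATCH with two capture groups, using made-up file names as input:

```php
<?php
$files = new ArrayIterator(['img_12.png', 'notes.txt', 'img_7.jpg']);

// Two capture groups: the image number and the file extension.
$iterator = new RegexIterator(
    $files,
    '/^img_(\d+)\.(\w+)$/',
    RegexIterator::GET_MATCH
);

$matches = [];
foreach ($iterator as $key => $match) {
    // $match[0] is the full match, $match[1] and $match[2] are the groups.
    $matches[$key] = $match;
}
print_r($matches);
```

The element "notes.txt" does not match, so it is filtered out just as it would be in MATCH mode.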
However, GET_MATCH will only match once inside each element. If there are multiple matches available, you won’t find them. For this, you can use ALL_MATCHES:
There is a catch though: as you can see, empty elements or elements that do not match will not get filtered out by the RegexIterator, but will show up as empty arrays. This is most likely a bug (filed as bug #66703).
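A sketch of ALL_MATCHES that also guards against the empty results described above. The input values are made up, and the explicit check on $matches[0] is an assumption about how you might want to deal with non-matching elements, not something the iterator does for you:

```php
<?php
$source = new ArrayIterator(['a1b22', 'none', 'c333']);

$iterator = new RegexIterator($source, '/\d+/', RegexIterator::ALL_MATCHES);

$found = [];
foreach ($iterator as $key => $matches) {
    // With the default ordering, $matches[0] lists every full match.
    // Skip elements for which nothing was found, in case they are
    // passed through as empty results.
    if ($matches[0] === []) {
        continue;
    }
    $found[$key] = $matches[0];
}
print_r($found);
```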
The SPLIT mode will split your elements using the given regular expression, just like preg_split() does:
The SPLIT mode does filter correctly. It will return an array with the split values, but there is no way of getting the original value (like you have with GET_MATCH or ALL_MATCHES).
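A minimal sketch of SPLIT on comma-separated strings (sample data made up for illustration):

```php
<?php
$source = new ArrayIterator(['a,b,c', 'nocomma', 'x,y']);

$iterator = new RegexIterator($source, '/,/', RegexIterator::SPLIT);

// Each accepted element is replaced by its pieces.
$parts = iterator_to_array($iterator);
print_r($parts);
```

The element "nocomma" contains no delimiter, so there is nothing to split and it is filtered out.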
The last mode is REPLACE, which allows you to replace values through regular expressions.
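A minimal sketch of REPLACE without any further configuration, using made-up values:

```php
<?php
$source = new ArrayIterator(['foo123', 'bar', 'foo7']);

// No replacement string has been configured yet.
$iterator = new RegexIterator($source, '/\d+/', RegexIterator::REPLACE);

$stripped = iterator_to_array($iterator);
print_r($stripped);
```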
Ok, so there isn’t much replacement going on here. It seems that it just checks if there are matches and, if so, removes those matches and returns the result. This is because the default replacement string that is used for REPLACE is actually empty. You can change it manually, but this is implemented in a somewhat hackish way, through a public property on the RegexIterator that you can set:
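A sketch of setting that property, assuming the public property is named replacement as in current PHP versions (the sample values are made up):

```php
<?php
$source = new ArrayIterator(['foo123', 'bar', 'foo7']);

$iterator = new RegexIterator($source, '/\d+/', RegexIterator::REPLACE);

// The replacement string is a plain public property on the iterator.
$iterator->replacement = '#';

$replaced = iterator_to_array($iterator);
print_r($replaced);
```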
I wasn’t kidding about it being hackish. The documentation suggests that the REPLACE mode is actually still under construction and not fully implemented. It does, however, support capture groups and placeholders, so something like this is perfectly valid (and seems to work without problems):
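A sketch with capture groups and placeholders, using made-up file names and a made-up output format:

```php
<?php
$source = new ArrayIterator(['report.txt', 'README', 'photo.jpg']);

// Group 1 is the base name, group 2 the extension.
$iterator = new RegexIterator(
    $source,
    '/^(\w+)\.(\w+)$/',
    RegexIterator::REPLACE
);

// Placeholders refer to the capture groups, as with preg_replace().
$iterator->replacement = '$2/$1';

$swapped = iterator_to_array($iterator);
print_r($swapped);
```

"README" has no extension and therefore does not match, so it is filtered out.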
Key or value?
Great! So I can filter values through the RegexIterator, as it will check the current values taken from the parent iterator. But what if I want to filter on the keys of the parent iterator instead? This is possible too: just pass RegexIterator::USE_KEY as $flags to RegexIterator::__construct().
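A minimal sketch of key-based filtering, with made-up keys and values:

```php
<?php
$settings = new ArrayIterator([
    'foo_a' => 1,
    'bar'   => 2,
    'foo_b' => 3,
]);

// USE_KEY makes the iterator match the pattern against the keys.
$iterator = new RegexIterator(
    $settings,
    '/^foo/',
    RegexIterator::MATCH,
    RegexIterator::USE_KEY
);

$filtered = iterator_to_array($iterator);
print_r($filtered);
```

The values are passed through untouched; only the keys decide what is kept.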
preg_flags
Besides modes and flags, there is a $preg_flags argument inside the constructor (also available through setPregFlags()).
The value of these flags depends on the actual mode that you are using. For instance, PREG_PATTERN_ORDER makes sense when using ALL_MATCHES, but not really when using the default MATCH. Obviously, PREG_SPLIT_* flags only make sense when using the SPLIT mode. See the PCRE documentation to find the flags and what they actually do.
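As a sketch of how a preg flag changes the outcome, the example below compares SPLIT with and without PREG_SPLIT_NO_EMPTY (the input strings are made up):

```php
<?php
$source = new ArrayIterator(['a,,b', 'c,d']);

// Plain SPLIT keeps the empty piece between the two commas.
$plain = new RegexIterator($source, '/,/', RegexIterator::SPLIT);

// The fifth constructor argument carries the preg flags.
$noEmpty = new RegexIterator(
    $source,
    '/,/',
    RegexIterator::SPLIT,
    0,
    PREG_SPLIT_NO_EMPTY
);

$withEmpty    = iterator_to_array($plain);
$withoutEmpty = iterator_to_array($noEmpty);

print_r($withEmpty);
print_r($withoutEmpty);
```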
If you are looking for some more information about the SPL, or any of the iterators, why not try my book? It’s available through Amazon or through php|architect: http://www.phparch.com/books/mastering-the-spl-library/ and contains a full overview of the SPL and the iterators.