SPL Deepdive: RegexIterator

Posted on 12 Feb 2014

If everything goes according to plan (which is never the case), I’ll try to highlight some of the fascinating things that can be found inside the SPL. I give a lot of presentations about the SPL, and one of the things I like to tell people is that even though the SPL, and its iterators in particular, is a magnificent piece of code that is often underused and misunderstood, it does come with some quirks and glitches that aren’t documented properly.

Today I’ll take an in-depth look at the RegexIterator. This iterator extends FilterIterator, meaning it can be used to filter out unwanted entries from the iterator it wraps.

A simple use case would be to filter on certain names that come from a DirectoryIterator. Usage is very simple and pretty obvious for most people:

$it = new \DirectoryIterator(".");
$it = new \RegexIterator($it, "/^foo/");

This iterator will now filter out all file names that do NOT start with “foo”.
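To try this without depending on whatever happens to live in your current directory, here is a minimal sketch that builds a throwaway directory first. The directory and file names are made up for the example, and it assumes the system temp directory is writable. Note that every item coming out of the iterator is still an object, so the file name has to be asked for explicitly.

```php
<?php
// Create a scratch directory with a few files (names are arbitrary).
$dir = sys_get_temp_dir() . '/regexit_' . uniqid();
mkdir($dir);
$files = ['foo1.txt', 'foobar.log', 'bar.txt'];
foreach ($files as $name) {
    touch($dir . '/' . $name);
}

$it = new \DirectoryIterator($dir);
$it = new \RegexIterator($it, '/^foo/');

// Collect the names inside the loop: DirectoryIterator hands out the same
// object on every step, so storing the objects themselves is not useful.
$names = [];
foreach ($it as $file) {
    $names[] = $file->getFilename();
}
sort($names); // directory order is not guaranteed
print_r($names);

// Clean up the scratch directory again.
foreach ($files as $name) {
    unlink($dir . '/' . $name);
}
rmdir($dir);
```

The "." and ".." entries are part of the directory listing too, but they do not start with “foo”, so they are filtered out along with bar.txt.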

How does it work:

First of all, DirectoryIterator returns SplFileInfo objects by default, not file names. RegexIterator::accept(), the method that does the filtering, casts anything that is not a string to a string, since that is what a regular expression can be applied to. From there, it calls the pcre_exec() function and returns a boolean true or false depending on whether any matches were found. When false is returned, the RegexIterator does not pass this element to the foreach, but continues with the next value.
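The same idea can be sketched in userland. This is not how RegexIterator is implemented internally (that happens in C); it is only a rough stand-in for the default mode, assuming PHP 8. The class names SimpleRegexFilter and Name are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical userland stand-in for RegexIterator in its default mode.
class SimpleRegexFilter extends \FilterIterator
{
    private string $pattern;

    public function __construct(\Iterator $inner, string $pattern)
    {
        parent::__construct($inner);
        $this->pattern = $pattern;
    }

    public function accept(): bool
    {
        // Cast the current element to a string, then test the pattern.
        $subject = (string) $this->current();
        return preg_match($this->pattern, $subject) === 1;
    }
}

// A tiny object with __toString(), standing in for SplFileInfo.
class Name
{
    public function __construct(private string $name) {}

    public function __toString(): string
    {
        return $this->name;
    }
}

$it = new SimpleRegexFilter(
    new \ArrayIterator(['foo.txt', 'bar.txt', new Name('foobar.log')]),
    '/^foo/'
);

$kept = [];
foreach ($it as $key => $value) {
    $kept[$key] = (string) $value;
}
print_r($kept);
```

Both the plain string and the object survive the filter, because the object is turned into a string before the pattern is applied. The keys of the inner iterator are kept as they were.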

RegexIterator modes

If you look at the php.net documentation for the RegexIterator constructor, you’ll find that the iterator has three additional arguments that can be passed during initialization: $mode, $flags and $preg_flags.

The mode can be one of the following, all defined as constants on RegexIterator: ALL_MATCHES, GET_MATCH, MATCH, REPLACE and SPLIT.

You can change this mode after you have constructed the iterator by calling setMode($new_mode) on it. It’s even possible to change the mode inside a foreach() iteration (even though I can’t think of any reason why you would want to).

The default RegexIterator mode is MATCH, meaning it only checks whether the regex matched. It doesn’t do anything with the results: the element is accepted when there was a match and skipped otherwise.

The GET_MATCH mode behaves a bit differently. It not only checks whether the regex matches the current element, it also returns the match itself together with any capture groups.
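A small sketch of switching modes on an existing iterator, using the default mode first and then one of the other constants from the list. Nothing here goes beyond setMode() and getMode(); the variable names are just for illustration.

```php
<?php
$it = new \RegexIterator(
    new \ArrayIterator(['foo', 'bar', 'bazbar']),
    '/^ba(.)/'
);

// Default mode: matching elements pass through untouched.
$plain = iterator_to_array($it);

// Same iterator, different mode: elements now become match arrays.
$it->setMode(\RegexIterator::GET_MATCH);
$groups = iterator_to_array($it);

print_r($plain);
print_r($groups);
```

The first print_r() shows the original strings "bar" and "bazbar", the second one shows arrays holding the match and its capture group for the same two keys.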

Take for instance the following code:

$it = new \ArrayIterator(array("foo", "bar", "bazbar"));
$it = new \RegexIterator($it, "/^ba(.)/", \RegexIterator::GET_MATCH);
print_r(iterator_to_array($it));

/*
  Output:
  Array
  (
    [1] => Array
        (
            [0] => bar
            [1] => r
        )

    [2] => Array
        (
            [0] => baz
            [1] => z
        )
  )
*/

It does not return the filtered elements directly. Instead, each element becomes an array whose first entry is the text matched by the complete pattern, optionally followed by one or more capture groups (sometimes called sub patterns), which you add to your regular expression with ().
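One way to think about GET_MATCH is as preg_match() applied to every element. The sketch below checks that assumption by running the same pattern by hand on the original value (looked up through the preserved key) and comparing the two arrays.

```php
<?php
$source  = ['foo', 'bar', 'bazbar'];
$pattern = '/^ba(.)/';

$it = new \RegexIterator(
    new \ArrayIterator($source),
    $pattern,
    \RegexIterator::GET_MATCH
);

$same = [];
foreach ($it as $key => $match) {
    // Run the same pattern by hand on the original element.
    preg_match($pattern, $source[$key], $expected);
    $same[$key] = ($match === $expected);
}
var_dump($same);
```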

However, GET_MATCH will only match once inside each element. If there are multiple matches available, you won’t find them. For this, you can use ALL_MATCHES:

$it = new \ArrayIterator(array("tmp", "foo", "bar", "bazbar"));
$it = new \RegexIterator($it, "/ba(.)/", \RegexIterator::ALL_MATCHES);
print_r(iterator_to_array($it));

/*
  Output:
  Array
  (
      [0] => Array
          (
              [0] => Array
                  (
                  )
              [1] => Array
                  (
                  )
          )
      [1] => Array
          (
              [0] => Array
                  (
                  )
              [1] => Array
                  (
                  )
          )
      [2] => Array
          (
              [0] => Array
                  (
                      [0] => bar
                  )
              [1] => Array
                  (
                      [0] => r
                  )
          )
      [3] => Array
          (
              [0] => Array
                  (
                      [0] => baz
                      [1] => bar
                  )
  
              [1] => Array
                  (
                      [0] => z
                      [1] => r
                  )
          )
  )
*/

There is a catch though: as you can see, empty elements or elements that do not match are not filtered out by the RegexIterator, but show up as empty arrays instead. This is most likely a bug (filed as bug #66703).

The SPLIT mode splits your elements on the given regular expression, just like preg_split() does:
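Until that is sorted out, a possible workaround is to wrap the iterator in a CallbackFilterIterator that drops results without any full matches. This is only a sketch of one approach; PHP versions newer than the one used for this post may already filter these entries, in which case the extra wrapper simply has nothing left to do.

```php
<?php
$it = new \ArrayIterator(['tmp', 'foo', 'bar', 'bazbar']);
$it = new \RegexIterator($it, '/ba(.)/', \RegexIterator::ALL_MATCHES);

// Keep only results whose list of full matches (index 0) is non-empty.
$it = new \CallbackFilterIterator($it, function (array $matches): bool {
    return !empty($matches[0]);
});

$result = iterator_to_array($it);
print_r(array_keys($result));
```

Only the keys of "bar" and "bazbar" remain, and their match arrays are untouched.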

$it = new \ArrayIterator(array("tmp", "foo", "bar", "bazbar"));
$it = new \RegexIterator($it, "/a/", \RegexIterator::SPLIT);
print_r(iterator_to_array($it));

/*
  Output:
  Array
  (
      [2] => Array
          (
              [0] => b
              [1] => r
          )
      [3] => Array
          (
              [0] => b
              [1] => zb
              [2] => r
          )
  )
*/

The SPLIT mode does filter correctly. It returns an array with the split values, but the result no longer contains the original value (whereas GET_MATCH and ALL_MATCHES at least give you the matched text).
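The keys do survive the split, as the output above shows. So if you still have the source array around, one workaround is to look the original value up by key. A minimal sketch, with $byOriginal being just an illustrative name:

```php
<?php
$source = ['tmp', 'foo', 'bar', 'bazbar'];
$it = new \RegexIterator(
    new \ArrayIterator($source),
    '/a/',
    \RegexIterator::SPLIT
);

// Pair every split result with the value it came from.
$byOriginal = [];
foreach ($it as $key => $parts) {
    $byOriginal[$source[$key]] = $parts;
}
print_r($byOriginal);
```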

The last mode is REPLACE, which allows you to replace values through regular expressions.

$it = new \ArrayIterator(array("tmp", "foo", "bar", "bazbar"));
$it = new \RegexIterator($it, "/a/", \RegexIterator::REPLACE);
print_r(iterator_to_array($it));

/*
  Output:
  Array
  (
      [2] => br
      [3] => bzbr
  )
*/

OK, so there isn’t much replacement going on here. It seems that it just checks whether there are matches and, if so, removes them and returns the result. This is because the default replacement string used for REPLACE is empty. You can change it manually, but this is implemented in a somewhat hackish way, through a public property on the RegexIterator that you can set:

$it = new \ArrayIterator(array("tmp", "foo", "bar", "bazbar"));
$it = new \RegexIterator($it, "/a/", \RegexIterator::REPLACE);
$it->replacement = "!";
print_r(iterator_to_array($it));

/*
  Output:
  Array
  (
      [2] => b!r
      [3] => b!zb!r
  )
*/

I wasn’t kidding about the hackish way. The documentation suggests that the REPLACE mode is still under construction and not fully implemented. It does, however, support capture groups and placeholders, so something like this is perfectly valid (and seems to work without problems):

$it = new \ArrayIterator(array("foo-123", "bar-456", "baz-789", "something-else"));
$it = new \RegexIterator($it, "/(.+)-(.+)/", \RegexIterator::REPLACE);
$it->replacement = "$2 -> $1";
print_r(iterator_to_array($it));

/*
  Output:
  Array
  (
      [0] => 123 -> foo
      [1] => 456 -> bar
      [2] => 789 -> baz
      [3] => else -> something
  )
*/
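If setting a public property after construction bothers you, the two steps can be hidden behind a small helper. The function regexReplaceIterator() below is hypothetical, it is not part of the SPL, and it does nothing more than what the examples above do by hand.

```php
<?php
// Hypothetical helper: builds a REPLACE-mode iterator with its replacement set.
function regexReplaceIterator(\Iterator $inner, string $pattern, string $replacement): \RegexIterator
{
    $it = new \RegexIterator($inner, $pattern, \RegexIterator::REPLACE);
    $it->replacement = $replacement; // the public property described above
    return $it;
}

$it = regexReplaceIterator(
    new \ArrayIterator(['foo-123', 'bar-456', 'plain']),
    '/(.+)-(.+)/',
    '$2 -> $1' // single quotes, so PHP never tries to interpolate $1 or $2
);

$result = iterator_to_array($it);
print_r($result);
```

The element "plain" contains no hyphen, so nothing is replaced and it is filtered out, just like "tmp" and "foo" in the earlier REPLACE example.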

Key or value?

Great! So I can filter values through the RegexIterator, as it checks the current values taken from the inner iterator. But what if I want to filter on the inner iterator’s keys instead? This is possible too: just pass RegexIterator::USE_KEY as $flags to RegexIterator::__construct().

$it = new \ArrayIterator(array("foo" => "123", "bar" => "456", "baz" => "789"));
$it = new \RegexIterator($it, "/^ba/");
print_r(iterator_to_array($it));

/*
  Output:
  Array 
  (
  )
*/

$it = new \ArrayIterator(array("foo" => "123", "bar" => "456", "baz" => "789"));
$it = new \RegexIterator($it, "/^ba/", \RegexIterator::MATCH, \RegexIterator::USE_KEY);
print_r(iterator_to_array($it));

/* 
   Output:
   Array
   (
      [bar] => 456
      [baz] => 789
   )
*/
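Like the mode, the flags can also be changed after construction, through setFlags(). A short sketch that runs the same iterator twice, first against the values and then against the keys:

```php
<?php
$it = new \RegexIterator(
    new \ArrayIterator(['foo' => '123', 'bar' => '456', 'baz' => '789']),
    '/^ba/'
);

$byValue = iterator_to_array($it);      // pattern is tested against the values

$it->setFlags(\RegexIterator::USE_KEY); // switch to testing the keys
$byKey = iterator_to_array($it);

print_r($byValue);
print_r($byKey);
```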

preg_flags

Besides modes and flags, there is a $preg_flags argument in the constructor (also available through setPregFlags()). Which values make sense depends on the mode you are using. For instance, PREG_PATTERN_ORDER makes sense with ALL_MATCHES, but not really with the default MATCH. Obviously, the PREG_SPLIT_* flags only make sense in SPLIT mode. See the PCRE documentation for the flags and what they do.
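As an illustration, here is a sketch of SPLIT mode with and without PREG_SPLIT_NO_EMPTY. The input string is made up; the flag itself is the regular one from preg_split(), passed as the fifth constructor argument.

```php
<?php
$source = new \ArrayIterator(['a,b,,c']);

// SPLIT without extra flags keeps the empty piece between the two commas.
$plain = iterator_to_array(
    new \RegexIterator($source, '/,/', \RegexIterator::SPLIT)
);

// PREG_SPLIT_NO_EMPTY as the fifth constructor argument drops it.
$noEmpty = iterator_to_array(
    new \RegexIterator($source, '/,/', \RegexIterator::SPLIT, 0, PREG_SPLIT_NO_EMPTY)
);

print_r($plain);
print_r($noEmpty);
```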

If you are looking for more information about the SPL, or any of the iterators, why not try my book? It’s available through Amazon or through php|architect: http://www.phparch.com/books/mastering-the-spl-library/ and contains a full overview of the SPL and the iterators.