Getting N Random Elements out of an Iterator – RandomIterator

hakre - random iterator

Considering there is an Iterator or Traversable with an unknown number of elements, I wondered if it is possible to get one or more random iterations out of it.


This first of all needs some iterations. Let’s create an example class that is able to provide N iterations – here from one up to 25 for simplicity:

class NIterations implements IteratorAggregate
{
    public function getIterator() {

        return new ArrayIterator(range(1, mt_rand(1, 25)));
    }
}

As simple usage example of it would be (nothing fancy, just to have it here):

$count = 0;
foreach (new NIterations() as $iteration) {
    $prefix = $count++ ? ', ' : '';
    printf("%s%s", $prefix, $iteration);
}
printf("\nNumber of Iterations: %d\n", $count);

# 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
# Number of Iterations: 16

After this has been setup, let’s introduce randomness to it.

Random with the help of an Array

To get one random of all the iterations, the most simple thing in PHP one could do would be to convert it into an array, shuffle it and pick the first element or use array_rand to pick by the key:

$cache  = iterator_to_array(new NIterations());
$random = $cache[array_rand($cache)];

echo "Random pick: ", $random, " (out of ", count($cache), " Iterations) \n";

# Random pick: 15 (out of 19 Iterations)

This would also allow to pick N number of elements as this can be configured. If the iteration is large however, the cache grows and grows and this method requires to cache all iterations.

Random by ca. 1.5 Iterations

To overcome this, a first idea I had was to count all elements inside the iterator (see iterator_count), then pick a random number based on this count and go there with a LimitIterator. This worked in my case, and so it does in this case if, and only if this operates on the same iterator and that iterator does not change the behavior between multiple iterations:

$iterator = (new NIterations())->getIterator();
$count    = iterator_count($iterator);
$offset   = mt_rand(1, $count) - 1;
$limit    = new LimitIterator($iterator, $offset, 1);
$limit->rewind();
$random = $limit->current();

echo "Random pick: ", $random, " (out of ", $count, " Iterations) \n";

# Random pick: 3 (out of 5 Iterations) 

This worked in my case, but what if you can not repeat the iteration? For example a NoRewindIterator would not allow this, even working directly with the NIterations object would not do it. Next to that it has the potential downside to iterate circa 1.5 times on average.

hakre - random iterator - separator

Using Reservoir Sampling

There is a technique named Reservoir Sampling that allows to do this in a single iteration.

It takes care that iteration is only done once to get N random iterations out of all steps in it. Only the randomly picked elements are kept as a copy. Because the total number of elements is not know upfront, it works by overwriting earlier picks if the probability would have chosen the current element (as each next element is also the current element, it’s probably easier to understand saying the next element). This in the end distributes the same chance across all elements.

  • If the list has one element, the chance that it is picked is 1:1
  • If the list has two elements, the chance for the first is still 1:1 because it can be overwritten by the second with a chance of 1:2
  • If the list has three elements, the chances for the first two are as outlined previously plus the chance to be overwritten by the third with a chance of 1:3
  • And so on, and so forth …

As this list outlines, the only thing that is need to be known is the current count and the previously picked “random” element. So only a cache per each random element is needed, the Reservoir.

A code example to pick one random element would be:

$reservoir = null;
$count     = 0;
foreach (new NIterations() as $iteration) {
    $count++;
    $take = mt_rand(1, $count) === 1;
    if ($take) $reservoir = $iteration;
}
$random = $reservoir;

echo "Random pick: ", $random, " (out of ", $count, " Iterations) \n";
# Random pick: 8 (out of 15 Iterations) 

It is also possible to increase the size of the reservoir so to pick more than one random element within the same iteration. This technique is able to work within an (existing) iteration and also requires only a single iteration.

So all I did was implementing such into a Traversable and called it RandomIterator. This implementation comes as well with selecting multiple elements, I take three random elements at once in the following example:

$randomIterations = new RandomIterator(new NIterations(), 3);
foreach ($randomIterations as $random)
{
    echo " * Random pick: ",  $random, "\n";
}
echo " (out of ", $randomIterations->getCount(), " Iterations) \n";

# * Random pick: 16
# * Random pick: 5
# * Random pick: 2
# (out of 24 Iterations) 

The source-code of the RandomIterator class is available as a gist.

hakre - random iterator

The Story Behind and some Discussion

First of all I did this out of interest. I had the smell that it should be possible to do within one iteration and wanted to find out how. The example finally showed that it works.

Kudos go to Alexander who was able to translate my words into the termini which then gave me more confidence with the technique.

Based on feedback by Ircmaxell and NikiC that I should check if it’s not biased, I’ve also run tests (see these exemplary tests). Turned out it’s not. Internally this is based on mt_rand which I was told is not biased, but that must not mean that ones implementation won’t be biased. This has been tested and there is also more information about the general principle of Reservoir sampling on Wikipedia.

So far for the plain algorithmic and implementation side.

Originally I was playing with DatePeriod which is traversable. Picking a random date from such a period could result in a pretty large array, so the idea was born to solve this by traversal. However to tell the full story I must say that I was under the wrong impression that DatePeriod would be any better than simple UNIX timestamps. But as it turned out later (after finishing all this), it is not. So solving the problem iteratively or via an array is a waste of resources in that exemplary case:

$start    = new Datetime('1st October 2012');
$end      = new Datetime('1st Jan 2013');

$random   = new DateTime('@' . mt_rand($start->getTimestamp(), $end->getTimestamp()));

This basically is done with the speed of light (choosing one from 7 948 800 possibilities took 0.00004 seconds), while the iterative approach:

...
$interval = new DateInterval('PT1S'); // Resolution: 1 Second
$period   = new DatePeriod($start, $interval, $end);

$random = new RandomIterator($period);

needs to go through 7 948 800 iterations and exemplary took 35.68737 seconds to finish. For those who are curious, doing the same with an array:

...

$cache  = iterator_to_array($period);
$random = $cache[array_rand($cache)];

easily hits the memory limit.

So as it turned out in the end, if you can spare the whole iteration, you might want to look into that first ;) Also if you’ve got enough memory, you might want to make use of an array to cache things first, too. Only if you really need Reservoir sampling, you should use it. The RandomIterator gives it an easy to use interface.

hakre - random iterator - separator

Resolution Count Random Value Array Random Iterator
Seconds 7 948 800 0.00004 -/- [1] 35.68737
Minutes 132 480 0.00004 0.47806 0.60578
Days 92 0.00004 0.00033 0.00044
Months 3 -/- [2] 0.00004 0.00005

[1] The system did not have enough memory to finish getting the random number.
[2] Using Unix timestamps and setting the resolution to one month, it is not possible to express this validly over a period of multiple months because the information is missing how long a month is. Otherwise the exemplary time (again) is 0.00004.


Feedback welcome, including finding a better name if there is one. I also thought about having a type called ReservoirSampling just to do manual sampling as well:

$reservoir = new ReservoirSampling(3);

$reservoir->add($element);
$reservoir->addIterations($period);
$reservoir->getSize();
$reservoir->getCount();
$reservoir->getIterator();

The original random date from a date-period question is here.

Images based on a work by Jago Bahaya; CC BY-SA 3.0

About these ads
This entry was posted in Developing, Hakre's Tips, PHP Development, Pressed and tagged , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s