Programming is a wonderful mix of art and science; source code is both a poem and a math problem. It should be as simple and elegant as it is functional and fast. This blog is about that (along with whatever else I feel like writing about).

Saturday, February 23, 2008

PHP: Reading Key/Value Pairs from a File

I'm working on an application in which I have to parse a file with key-value pairs, looking for a single value. I decided to make the datafile look like this:
key0|val0
key1|val1
key2|val2
...
key2000|val2000
I don't expect to have more than 2000 values in the file, but just in case, I set up a second sample file with 100000 values to see how that would affect things. (That kind of growth in usage would be awesome.)
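A quick script can generate sample files in this format; this is just a sketch, and the filename here is an arbitrary choice for illustration:

```php
<?php
// Sketch: generate a sample datafile of pipe-delimited key/value pairs.
// The filename and count are hypothetical, chosen to match the post's format.
function writeDatafile($filename, $count) {
    $f = fopen($filename, 'w');
    for ($i = 0; $i < $count; $i++) {
        fwrite($f, "key$i|val$i\n");
    }
    fclose($f);
}

writeDatafile('data2000.txt', 2000);
```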

My initial thoughts on this are to keep it simple. Performance is important, but I don't think spending the time on a binary-search-based file-seek algorithm would be worth it; that would also be premature optimization. So I want to check out the performance of a couple of simpler options first.

My first option was to loop through each line, explode on the | character, and compare the keys. Like this:
function searchLinearly($filename, $key) {
    // FILE_IGNORE_NEW_LINES keeps the trailing "\n" out of the returned value
    $lines = file($filename, FILE_IGNORE_NEW_LINES);

    foreach ($lines as $line) {
        list($k, $v) = explode('|', $line);

        if ($k == $key) {
            return $v;
        }
    }

    return false;
}
That's just about as simple as you can get. But how does it perform? I searched for a few keys throughout the file, and measured how long it took to find them.
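The post doesn't show the measurement code, but a minimal microtime()-based harness along these lines would produce timings like the ones below (the function name here is a hypothetical):

```php
<?php
// Sketch of a timing harness using microtime(true).
// $searchFn is any callable taking ($filename, $key), like searchLinearly.
function timeSearch($searchFn, $filename, $key) {
    $start = microtime(true);
    $result = call_user_func($searchFn, $filename, $key);
    $elapsed = microtime(true) - $start;
    printf("Key %s: %.5fs\n", $key, $elapsed);
    return $result;
}
```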

In a 2000 line datafile:
Key 10: 0.00215s
Key 150: 0.00233s
Key 900: 0.00555s
Key 1700: 0.00866s
In a 100000 line datafile:
Key 10: 0.11047s
Key 1500: 0.04429s
Key 50000: 0.11832s
Key 99000: 0.18710s
What's really strange about this is that key 1500 is consistently faster than key 10, whether I search for key 10 first or second, or just search once per run (in an attempt to separate caching issues). I can't explain that.

Another option is to use fscanf() instead of reading all the lines at once with file(). The idea is to define a format string (in the style of C's scanf(), not a regex) that matches the format of the lines in your file, and "scan" through the file, extracting the pieces of data along the way. That version looks like this:
function searchFscanf($filename, $key) {
    $f = fopen($filename, 'r');

    // %[^\|] matches a run of characters up to the next pipe
    while ($line = fscanf($f, "%[^\|]|%[^\|]\n")) {
        list($k, $v) = $line;

        if ($k == $key) {
            fclose($f);
            return $v;
        }
    }

    fclose($f);
    return false;
}
It's only slightly more complicated; does it perform faster?

In a 2000 line datafile:
Key 10: 0.00052s
Key 150: 0.00101s
Key 900: 0.00572s
Key 1700: 0.01033s
In a 100000 line datafile:
Key 10: 0.00052s
Key 1500: 0.00947s
Key 50000: 0.12980s
Key 99000: 0.20901s
I think these results are pretty interesting. Using fscanf() can actually be considerably faster, and it isn't affected by the total size of the file, because it only reads as far as the matching line. However, it is drastically affected by which key you're searching for: the more lines it has to scan past, the longer it takes, and the slowdown is steady.

In contrast, the file() method is more affected by the file size than by which key you're searching for, since it reads the entire file into memory up front (searching for later keys does take longer, but the effect is much smaller).

So which option is better? As the number of records increases, the fscanf() method becomes more and more advantageous: assuming early records are hit as often as later ones, considerably less time is spent searching than with the file() method. The same holds even when the number of records stays constant; it just isn't as noticeable or significant at that scale.

Unsurprisingly, fscanf() beats file() overall. But not by as much as I'd like. In order to maximize performance, another method is going to have to be devised: either a binary search that minimizes the number of comparisons made (O(log n) versus O(n)), or perhaps just using something like memcached to keep the datafile in memory at all times. Maybe both.
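For the binary search idea, here's a rough sketch over an array of lines already read with file(). Note the big assumption: the lines must be sorted by key in strcmp() order, which is not true of key0...key2000 as written (they'd need zero-padded numbers). Searching the raw file with fseek() instead would avoid reading everything but is more involved:

```php
<?php
// Sketch: binary search over a sorted array of "key|value" lines.
// Assumes lines are sorted by key in strcmp() order -- a real constraint
// on the datafile, not something this function checks.
function searchBinary(array $lines, $key) {
    $lo = 0;
    $hi = count($lines) - 1;

    while ($lo <= $hi) {
        $mid = (int)(($lo + $hi) / 2);
        list($k, $v) = explode('|', rtrim($lines[$mid], "\n"));
        $cmp = strcmp($key, $k);

        if ($cmp === 0) {
            return $v;
        } elseif ($cmp < 0) {
            $hi = $mid - 1;   // target sorts before this line
        } else {
            $lo = $mid + 1;   // target sorts after this line
        }
    }

    return false;
}
```

Each iteration halves the remaining range, so at most about log2(n) comparisons are needed; for 100000 lines that's 17 comparisons instead of up to 100000.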

Something to think about.

1 comment:

Anonymous said...

parse_ini_file will work pretty well. The syntax is slightly different, but it's fast.
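For what it's worth, the commenter's suggestion would look something like this. The datafile would have to be rewritten in ini syntax (key = value instead of key|value), and parse_ini_file() reads the whole file into an associative array, so lookups after that are hash lookups:

```php
<?php
// Sketch of the parse_ini_file() approach from the comment.
// Assumes the datafile uses ini syntax, e.g. "key0 = val0" per line.
function searchIni($filename, $key) {
    $pairs = parse_ini_file($filename); // associative array of key => value
    return isset($pairs[$key]) ? $pairs[$key] : false;
}
```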