I'm working on an application in which I have to parse a file with key-value pairs, looking for a single value. I decided to make the datafile look like this:
key0|val0
key1|val1
key2|val2
...
key2000|val2000

I don't expect to have more than 2000 values in the file, but just in case, I set up a second sample file with 100000 values to see how that would affect things. (That kind of growth in usage would be awesome.)
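Generating sample files like these is trivial; a throwaway loop along these lines does it (generateSampleFile and the filenames here are just hypothetical, not the exact script I used):

function generateSampleFile($filename, $count) {
    // write "keyN|valN" pairs, one per line
    $f = fopen($filename, 'w');
    for ($i = 0; $i <= $count; $i++) {
        fwrite($f, "key$i|val$i\n");
    }
    fclose($f);
}

generateSampleFile('data2000.txt', 2000);
generateSampleFile('data100000.txt', 100000);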
My initial thoughts on this are to keep it simple. Performance is important, but I don't think spending the time on a binary-search-style file seek algorithm would be worth it; besides, that would be premature optimization. So I want to check out the performance of a couple of simpler options first.
My first option was to loop through each line, explode on the | character, and compare the keys. Like this:
function searchLinearly($filename, $key) {
    // read the whole file into an array of lines and scan for the key
    $lines = file($filename);
    foreach ($lines as $line) {
        list($k, $v) = explode('|', $line);
        if ($k == $key) {
            return $v;
        }
    }
    return false;
}

That's just about as simple as you can get. But how does it perform? I searched for a few keys throughout the file, and measured how long it took to find them.
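Measuring is just a matter of wrapping the call with microtime(); something along these lines works (the filename and key here are placeholders):

$start = microtime(true);
$value = searchLinearly('data2000.txt', 'key10');
printf("Key 10: %.5fs\n", microtime(true) - $start);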
In a 2000 line datafile:
Key 10: 0.00215s
Key 150: 0.00233s
Key 900: 0.00555s
Key 1700: 0.00866s

In a 100000 line datafile:

Key 10: 0.11047s
Key 1500: 0.04429s
Key 50000: 0.11832s
Key 99000: 0.18710s

What's really strange about this is that key 1500 is consistently faster than key 10, whether I search for key 10 first or second, or just search once per run (in an attempt to separate caching issues). I can't explain that.
Another option is to use fscanf() instead of just using file() to read all the lines. The idea is to define a scanf-style format string that matches the lines in your file, and "scan" through the file, extracting the pieces of data along the way. That version looks like this:
function searchFscanf($filename, $key) {
    // read the file record by record, pulling out the key and value as we go
    $f = fopen($filename, 'r');
    while ($line = fscanf($f, "%[^\|]|%[^\|]\n")) {
        list($k, $v) = $line;
        if ($k == $key) {
            fclose($f);
            return $v;
        }
    }
    fclose($f);
    return false;
}

It's only slightly more complicated; does it perform faster?
In a 2000 line datafile:
Key 10: 0.00052s
Key 150: 0.00101s
Key 900: 0.00572s
Key 1700: 0.01033s

In a 100000 line datafile:

Key 10: 0.00052s
Key 1500: 0.00947s
Key 50000: 0.12980s
Key 99000: 0.20901s

I think these results are pretty interesting. Using fscanf() can actually be considerably faster, and it isn't affected by the size of the file. However, it is drastically affected by the key you're searching for; if you have to scan through a lot of lines, it slows down significantly and steadily.
In contrast, the file() method is more affected by the filesize than by which key you're searching for (searching for later keys does take longer, but it's not nearly as significant).
So which option is better? As the number of records increases, the fscanf() method appears to become more and more advantageous (assuming early records are hit as often as later ones, considerably less time is spent searching than with the file() method). The same holds even when the number of records stays constant; it's just not as noticeable or significant at that scale.
Unsurprisingly, fscanf() beats file(). But not by as much as I'd like. To maximize performance, another method will have to be devised; either a binary search that minimizes the number of comparisons made (O(log n) versus O(n)), or perhaps something like memcached to keep the datafile in memory at all times. Maybe both.
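The binary search version would mean keeping the datafile sorted by key (plain string order, e.g. by running it through sort first) and seeking around the file, snapping each probe to the start of a line. A rough sketch of what that could look like, with searchBinary and findLineStart as hypothetical helpers I haven't benchmarked:

function findLineStart($f, $pos) {
    // back up from $pos to the byte just after the previous newline
    // (or to the start of the file)
    while ($pos > 0) {
        fseek($f, $pos - 1);
        if (fgetc($f) === "\n") {
            break;
        }
        $pos--;
    }
    return $pos;
}

function searchBinary($filename, $key) {
    // assumes every line is "key|value\n" and the file is sorted by key
    // in plain string order
    $f = fopen($filename, 'r');
    $lo = 0;
    $hi = filesize($filename);
    $result = false;
    while ($lo < $hi) {
        $mid = (int)(($lo + $hi) / 2);
        $lineStart = findLineStart($f, $mid);  // snap the probe to a line boundary
        fseek($f, $lineStart);
        $line = fgets($f);
        list($k, $v) = explode('|', rtrim($line, "\n"), 2);
        $cmp = strcmp($key, $k);
        if ($cmp == 0) {
            $result = $v;
            break;
        } elseif ($cmp < 0) {
            $hi = $lineStart;                  // key sorts before this line
        } else {
            $lo = $lineStart + strlen($line);  // key sorts after this line
        }
    }
    fclose($f);
    return $result;
}

Each probe costs one seek and one short read, so the number of lines examined drops to O(log n) instead of O(n).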
Something to think about.
1 comment:
parse_ini_file will work pretty well. The syntax is slightly different, but it's fast.
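For reference, that would mean rewriting the datafile in INI-style "key = value" lines, and then something roughly like this on the PHP side (the filename is just a placeholder):

// data.ini would contain lines like:
//   key0 = val0
//   key1 = val1
$values = parse_ini_file('data.ini');  // parses the whole file into an associative array
if (isset($values['key10'])) {
    echo $values['key10'];
}

Note that parse_ini_file() still reads and parses the entire file on every call, so it's more a convenience than an asymptotic win, though the parsing happens in C rather than in PHP.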