Saturday, August 30, 2008

[HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

 

Reference: Bruce Momjian writes: -> http://archives.postgresql.org/pgsql-committers/2007-09/msg00402.php

Other references: Boyer Moore?? -> http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html

 

 

As a first time hacker of postgresql I’ve decided to attempt something that does not go too deep in to the heart of the software.

 

Before I make an attempt at writing a patch, I’ve decided to start benchmarking outside of Postgres. On reading the link underneath the Boyer Moore item in the in the new TODO list wiki, I learned that someone else had a shot at speeding up strpos around this time last year using KMP algotithm. On looking a little deeper into Boyer Moore’s algorithm it’s quite obvious that there is some overhead involved with preprocessing the search key (needle). My target has been to not have any trade offs with this possibly to be patch.  I didn’t want smaller searches to suffer.

 

I’d like to receive some feedback on the various functions in the attached before I go ahead and work on this any more. I’ve written 8 different versions of the function. Version 8 seems to be the fastest out of them, however I’ve not tested with multi byte chars.

 

Please note that these are just some initial ideas. For example startpos is not yet implemented, but there is very little overhead to that anyway.

 

For comparision sake I mocked up the current postgresql 8.3’s strpos() from varlena.c. I’ve no idea how accuratly this will be to the real version. Probably find out once I write the patch. I replaced strncmp with my own version as I was having problems with my compilers optimizer, it optimized too much and I was unable to benchmark. I felt running without and optimization was unfair as strncmp is. Perhaps if someone else wants to benchmark with –O2 on gcc they should first replace pg_strncmp with strncmp.

 

 

Currently I have a single file which can be compiled itself that contains benchmarks for the functions I’ve tested so far. All of these implement a stripped out version of the Boyer Moore algotithm. I’ve done away with one of the character maps in a bid to reduce the overhead involved creating them.

 

The position calculation will likely need changed for multi byte characters. I’m suspecting I’m not meant to be doing length calculations on pointers. Can someone confirm?

 

To keep this email short I’ve put most of the information required in as comments into the source of the program.

 

I’ve uploaded some benchmark results using the attached program. The benchmarking was done on windows…

Please see http://www.unixbeast.com/~fat/test1.html  there are 10 tests in total, in each I attempt to exploit weaknesses, some with my functions and some with the existing function.

 

My method for version 8 of the function probably looks quite unusual as it does some simple cost based optimisation

 

I look forward to receiving feedback on this.

 

David.

 

 

 

 

 

 

 

No comments: