Well, here it is 2:30 AM and I'm working on the key inversion/Huffman coding/postman sort for large textual datasets. I'm only three hours into it and it's working beautifully. Search results come back in well under half a second, but ingest is still quite slow.
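The key-inversion step can be sketched in a few lines. This is an illustrative Python version, not the actual code from this project (which is Perl); the names `build_index` and `search` are mine:

```python
from collections import defaultdict

def build_index(docs):
    """Build a word -> set-of-doc-ids inverted index (the 'key inversion' step)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, *terms):
    """Return the ids of documents containing every term (set intersection)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
idx = build_index(docs)
print(search(idx, "quick", "dog"))  # {3}
```

Once the index is built, a query is just a lookup plus an intersection, which is why search is fast even while ingest is slow.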
That's what I've been working on for the last couple of hours. Stemming will be important, and I plan to add a stemmer plugin architecture tomorrow or later this week.
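A stemmer plugin architecture could look something like the following. This is a hypothetical sketch in Python, not the planned Perl design; the registry name, decorator, and the naive suffix-stripping rule are all my own illustration (a real plugin would likely wrap something like the Porter algorithm):

```python
# Hypothetical plugin registry: stemmers register under a name,
# and the indexer picks one at runtime.
STEMMERS = {}

def register_stemmer(name):
    def wrap(fn):
        STEMMERS[name] = fn
        return fn
    return wrap

@register_stemmer("naive")
def naive_stem(word):
    # Strip a few common English suffixes; purely illustrative,
    # far cruder than a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem(word, plugin="naive"):
    return STEMMERS[plugin](word)

print(stem("running"))  # runn
```

The point of the plugin shape is that swapping stemmers (or languages) never touches the indexing code, only the registry.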
I've also found that performance isn't great on ext2 filesystems. I've read good things about ReiserFS from IBM. I'll look into it once I've identified a spare partition or a new disk.
Right now, I'm trying to avoid using databases or even specially formatted files. It's brain-dead simple! This will help with troubleshooting and debugging while I make sure the algorithms are correct.
Basically, just by using the simplest version of each step, I managed to build this in under three hours.
I've learned that links don't appear to consume inodes on the filesystem the way plain old files do. By using filesystem links I've managed to scale storage significantly beyond what flat files could handle (roughly 5,000 words equates to 750,000 tiny files, so it matters).
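A minimal sketch of the hard-link trick as I read it (Python rather than the actual Perl, and the on-disk layout here is my assumption): one real file per document, with each word's directory holding hard links to the documents containing it. Hard links share the target's inode, so postings don't each cost a fresh inode:

```python
import os
import tempfile

root = tempfile.mkdtemp()
docs_dir = os.path.join(root, "docs")
os.makedirs(docs_dir)

# One real file per document.
doc_path = os.path.join(docs_dir, "doc1")
with open(doc_path, "w") as f:
    f.write("the quick brown fox")

# A posting for the word "quick" is just a hard link to the document,
# not a new file with its own inode.
word_dir = os.path.join(root, "index", "quick")
os.makedirs(word_dir)
os.link(doc_path, os.path.join(word_dir, "doc1"))

# Both directory entries point at the same inode.
assert os.stat(doc_path).st_ino == os.stat(os.path.join(word_dir, "doc1")).st_ino
```

Listing a word's directory then enumerates exactly the documents containing that word, with no index file format at all, which fits the "no specially formatted files" constraint above.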
Eventually it will mostly be re-using existing slots, but initially every word is new. It took almost 50 minutes to index 40 files. I'll definitely have to improve that.
The good news is that even without the optimizations, search results from such a set come back in 0.02 seconds of user time, 0.03 seconds of system time, and 0.17 seconds of elapsed time at 28% CPU. That's wonderful.
Well, it's getting really late... query logic, a MySQL backend, and entry removal are all next.
glenn1you0 said on 2003-06-25 00:04:24:
OK, I'm collecting pieces and building small tests. What appears to be key to performance and scalability are a filesystem optimized for very small files and inode-level links, plus fast set operations. Perl isn't terribly great at set operations. Most implementations center on hashes, which work, but start to break down when the set is large and made up of many non-contiguous intervals. I've found an XS module that uses bit ranges managed by C code, and it appears complete. It's much faster. Not that I'm already optimizing for speed; rather, I need realistic performance just to keep building. Real optimizations will come later on.
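The bit-range idea can be illustrated without XS. This is a pure-Python sketch of the same technique (not that module's API): a set of document ids becomes an integer with bit i set for doc i, so intersection is a single bitwise AND instead of one hash probe per element:

```python
def to_bitset(ids):
    """Pack a set of non-negative doc ids into one integer bitmap."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits

def from_bitset(bits):
    """Unpack a bitmap back into a set of doc ids."""
    ids, i = set(), 0
    while bits:
        if bits & 1:
            ids.add(i)
        bits >>= 1
        i += 1
    return ids

a = to_bitset({1, 5, 900000})   # large, non-contiguous ids are fine
b = to_bitset({5, 7, 900000})
common = from_bitset(a & b)     # one AND does the whole intersection
assert common == {5, 900000}
```

The win over hashes is exactly in the case the comment describes: a large universe of ids with scattered gaps, where a hash must touch every member but a bitmap AND runs word-at-a-time in C.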