I've been playing with various entity and information extraction frameworks for the past couple of weeks with the goal of creating a web service for extracting the major topics from news articles. So far, my work, such as it is, has shown promise, but is not as robust or reliable as I would have hoped.
I just noticed that Reuters has apparently been working along the same lines and has opened its work to the public. On the one hand, I feel beaten. On the other, their service does not appear to be much better than my own, although they'll have a larger set of people re-training their decision trees than I'll ever have.
I've been using posts from my own blog as my testing ground, so I thought I'd throw my last post through and see what it churns out:
Organization: Entrepreneur Fund
Person: Len Gilbert, Tom Berreca, Darline Jean, Glenn Franxman, Matthew de Gannon, Sam Parker, Clark, Beth Higbee, Eleanor Cippel, Martha Stewart
Well, I don't even try to extract relations, so that's pretty cool.
IndustryTerm is pretty puzzling, although I understand why they were fooled by it.
My organization extraction appears to handle this better, insofar as I am extracting things like The Weather Channel, etc. from that post.
My City extraction is better insofar as it fetches the state along with the city name.
Person extraction is interesting. They got Martha Stewart, whereas I keep tripping over it: I call her Martha Stewart Living and pull out Omnimedia as an Organization.
For comparison's sake, here's the result of my own project:
Fugly formatting aside, I'm being way more aggressive in my extraction. I think I'm going to need better confidence heuristics. Also, I'm a lot slower than OpenCalais.
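To make "confidence heuristics" a bit more concrete, here's a minimal sketch of the simplest version: give every candidate a score and drop anything below a per-type threshold. This is an illustration, not what my extractor does today. The `Entity` type, the `filter_confident` helper, the scores, and the thresholds are all made up for the example.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    kind: str          # e.g. "Person", "Organization"
    text: str          # the extracted string
    confidence: float  # hypothetical score in [0, 1]


def filter_confident(entities, thresholds, default=0.5):
    """Keep only entities whose score meets the threshold for their kind."""
    return [
        e for e in entities
        if e.confidence >= thresholds.get(e.kind, default)
    ]


# Scores are invented purely for illustration.
candidates = [
    Entity("Person", "Martha Stewart Living", 0.40),
    Entity("Organization", "Omnimedia", 0.55),
    Entity("Organization", "The Weather Channel", 0.90),
]

# Being stricter about Person than Organization is an arbitrary choice here.
kept = filter_confident(candidates, {"Person": 0.6, "Organization": 0.5})
print([e.text for e in kept])
```

The hard part, of course, is coming up with scores that mean something; the filtering itself is trivial.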
There are all sorts of interesting differences, many of which probably come from the body of text they've got to train against. I'm still fooled by names like Matt de Ganon. The most interesting bit, though, is how the two systems treated the phrase "Martha Stewart Living Omnimedia".
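Eyeballing two result lists gets old fast, so here's a rough sketch of how the comparison could be automated with plain set operations. The `diff_by_type` helper is hypothetical, and the sample data is only a partial subset of the entities mentioned in this post, not the full output of either system.

```python
def diff_by_type(mine, theirs):
    """For each entity type, list what only one system found and what both found."""
    report = {}
    for kind in sorted(set(mine) | set(theirs)):
        a = set(mine.get(kind, []))
        b = set(theirs.get(kind, []))
        report[kind] = {
            "only_mine": sorted(a - b),
            "only_theirs": sorted(b - a),
            "both": sorted(a & b),
        }
    return report


# Partial sample drawn from the entities discussed in this post.
mine = {
    "Person": ["Martha Stewart Living"],
    "Organization": ["Omnimedia", "The Weather Channel"],
}
theirs = {
    "Person": ["Martha Stewart"],
    "Organization": ["Entrepreneur Fund"],
}

report = diff_by_type(mine, theirs)
for kind, result in report.items():
    print(kind, result)
```

Exact string matching is deliberately naive: it treats "Martha Stewart" and "Martha Stewart Living" as unrelated, which is exactly the kind of near miss worth flagging for a closer look.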