Researching better search functionality for the CUNY J-School network

Search is currently the dominant information retrieval paradigm, and WordPress’ internal search functionality is one step removed from atrocious. With that in mind, I’d like to significantly improve how search works on the J-School’s WordPress network. These are the notes I’m putting together as a part of my planning process.

A search for my name currently looks something like this:

Ideally, the search functionality should support these requirements:

  • Query across all of the content objects associated with the J-School’s primary website. These objects include posts, pages, events, blogs, databases, members, groups, and (coming soon) job opportunities. Eventually it would be nice to search attachments as well.
  • Expand a query to include content from any of the 216 (and counting) websites within the network. Filter results to a specific site, or by author, publication date, categories, or tags.
  • Highlight results based on matched keywords. If possible, show the sections of text matching the query.
  • Log queries and (optionally) provide analytics on search trends.

As far as I can tell, the options on the table are Sphinx, Solr, and search as a service from IndexTank. Sphinx appears to be the lowest-hanging fruit; Solr takes a couple of weeks to set up and configure, and IndexTank costs money for anything over 500 queries/day.

For Sphinx, there’s a WordPress plugin that makes it easier to integrate the two. The author has reasonably detailed documentation for installing Sphinx via the admin, if you choose to do that.

Another sysadmin has written a three-part series on extending WordPress search with Sphinx.

Extending search sources to custom fields is apparently as simple as adding to the select query.
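As a rough illustration of what "adding to the select query" might look like, here is a minimal Python sketch run against an in-memory SQLite database. The table and column names mirror WordPress' standard wp_posts and wp_postmeta tables; the `subtitle` custom field, the sample rows, and the function names are hypothetical. In a real Sphinx setup the equivalent SQL would presumably live in the source definition in the .conf file rather than in Python.

```python
import sqlite3

# Select query extended to pull in custom field values alongside the post.
# GROUP_CONCAT folds all of a post's wp_postmeta rows into one text column.
# (This is SQLite syntax; MySQL writes GROUP_CONCAT(m.meta_value SEPARATOR ' ').)
INDEX_QUERY = """
SELECT p.ID, p.post_title, p.post_content,
       COALESCE(GROUP_CONCAT(m.meta_value, ' '), '') AS custom_fields
FROM wp_posts AS p
LEFT JOIN wp_postmeta AS m ON m.post_id = p.ID
WHERE p.post_status = 'publish'
GROUP BY p.ID
ORDER BY p.ID
"""


def build_demo_db():
    """Create a throwaway database with hypothetical sample content."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE wp_posts (
            ID INTEGER PRIMARY KEY, post_title TEXT,
            post_content TEXT, post_status TEXT);
        CREATE TABLE wp_postmeta (
            meta_id INTEGER PRIMARY KEY, post_id INTEGER,
            meta_key TEXT, meta_value TEXT);
        INSERT INTO wp_posts VALUES (1, 'Spring events', 'Calendar of events.', 'publish');
        INSERT INTO wp_posts VALUES (2, 'Draft notes', 'Not ready.', 'draft');
        INSERT INTO wp_posts VALUES (3, 'About', 'About the school.', 'publish');
        INSERT INTO wp_postmeta VALUES (1, 1, 'subtitle', 'Workshops and lectures');
    """)
    return conn


def rows_to_index(conn):
    """Return one row per published post, custom fields included."""
    return conn.execute(INDEX_QUERY).fetchall()


if __name__ == "__main__":
    for row in rows_to_index(build_demo_db()):
        print(row)
```

The LEFT JOIN keeps posts that have no custom fields at all, so they still end up in the index with an empty `custom_fields` column.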

For WordPress multisite, the best way to dynamically add new blogs to the index appears to be editing the .conf file, although I’ll need to develop a way to assign a unique ID to every piece of content, since post IDs repeat from one blog to the next.
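One possible way to give every piece of content a network-wide unique identifier, sketched here in Python, is to pack the blog ID and the post ID into a single integer. This assumes the search index accepts 64-bit document IDs and that both IDs fit in 32 bits; the function names are hypothetical, and this is just one scheme, not necessarily what the plugin does.

```python
POST_ID_BITS = 32                       # assumed width reserved for the post ID
POST_ID_MASK = (1 << POST_ID_BITS) - 1  # largest value that fits in that width


def make_doc_id(blog_id: int, post_id: int) -> int:
    """Combine a blog ID and a post ID into one unique document ID."""
    if blog_id < 1 or post_id < 1:
        raise ValueError("blog_id and post_id must be positive")
    if blog_id > POST_ID_MASK or post_id > POST_ID_MASK:
        raise ValueError("IDs must fit in 32 bits each")
    # Blog ID goes in the high bits, post ID in the low bits.
    return (blog_id << POST_ID_BITS) | post_id


def split_doc_id(doc_id: int) -> tuple:
    """Recover (blog_id, post_id) from a combined document ID."""
    return doc_id >> POST_ID_BITS, doc_id & POST_ID_MASK


if __name__ == "__main__":
    doc_id = make_doc_id(blog_id=2, post_id=15)
    print(doc_id, split_doc_id(doc_id))
```

Because the mapping is reversible, a search result's document ID is enough to work out which blog and which post to link to.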

I intend to get Sphinx working on the development environment first, document the steps it took, then implement on production.


Stijn Debrouwere, January 24, 2011

I wouldn’t overestimate the amount of time you’d need simply to get started with Solr — it takes about fifteen minutes to get up and running. Configuration is a bitch, but that just kind of depends on how far you want to take things. That said, Sphinx is supposed to be pretty good as well, so I’m curious to hear how things will work out for you / CUNY. Let us know.


Daniel Bachhuber, January 25, 2011

That’s what I’ve heard too. Installing Solr itself isn’t so difficult; it’s Tomcat and all of the underlying parts. For now, I think Sphinx is the best option. I’ll definitely let you know how it goes.

Andrew Spittle, January 25, 2011

Interesting notes, Daniel. Any particular metrics you’ll track to judge the effectiveness of the improvements?

Daniel Bachhuber, January 25, 2011

Well, currently we’re not tracking search use at all so we unfortunately won’t have a standard of comparison. To gauge use I would track the number of queries per day, and whether they resulted in a clickthrough. Later, if the tools permit it, we can get into what topics are most commonly searched for, etc.
