Skip to content

A fulltext indexing solution for plain Ruby and Ruby On Rails based on the Xapian library

License

Notifications You must be signed in to change notification settings

gernotkogler/xapian_db

Repository files navigation

XapianDb

What’s in the box?

XapianDb is a ruby gem that combines features of nosql databases and fulltext indexing into one piece. The result: Rich documents and very fast queries. It is based on Xapian, an efficient and powerful indexing library.

XapianDb is inspired by xapian-fu and xapit. Thank you John and Ryan for your great work. It helped me learning to understand the xapian library and I borrowed an idea or two from you ;-)

Why yet another indexing gem?

In the good old days I used ferret and acts_as_ferret as my fulltext indexing solution and everything was fine. But time moved on and Ferret didn’t.

So I started to rethink fulltext indexing again. I looked for something that

  • is under active development

  • is fast

  • is lightweight and easy to install / deploy

  • is framework and database agnostic and works with pure POROS (plain old ruby objects)

  • is configurable anywhere, not just inside the model classes; I think that index configurations should not be part of the domain model

  • supports document configuration at the class level, not the database level; each class has its own document structure

  • integrates with popular Ruby / Rails ORMs like ActiveRecord or Datamapper through a plugin architecture

  • returns rich document objects that do not necessarily need a database roundtrip to render the search results (but know how to get the underlying object, if needed)

  • updates the index realtime (no scheduled reindexing jobs)

  • supports all major features of a full text indexer, namely wildcards!!

I tried hard but I couldn’t find such a thing so I decided to write it, based on the Xapian library.

If you found a bug or are looking for a missing feature, please post to the {Google Group}[http://groups.google.com/group/xapian_db]

Requirements

  • ruby 2.0.0 or newer

  • rails 3.0 or newer (if you want to use it with rails)

  • xapian-core and xapian-ruby binaries 1.2.x installed

Installing xapian binaries

On OSX, I recommend to install the binaries using homebrew like so: brew install xapian –ruby. Another option is to install and require the xapian-ruby gem; this works for linux, too. Make sure to add the gem to your Gemfile in this case. Make sure version 1.2.6 or newer is installed.

Getting started

If you want to use xapian_db in a Rails app, you need Rails 3 or newer.

For a first look, look at the examples in the examples folder. There’s the simple ruby script basic.rb that shows the basic usage of XapianDB without rails. In the basic_rails folder you’ll find a very simple Rails app using XapianDb.

The following steps assume that you are using xapian_db within a Rails app.

Configure your databases

Without a config file, xapian_db creates the database in the db folder for development and production environments. If you are in the test environment, xapian_db creates an in memory database. It assumes you are using ActiveRecord.

You can override these defaults by placing a config file named ‘xapian_db.yml’ into your config folder. Here’s an example:

# XapianDb configuration
defaults: &defaults
  adapter: datamapper # Available adapters: :active_record, :datamapper
  language: de        # Global language; can be overridden for specific blueprints
  term_min_length: 2  # Ignore single character terms
  enabled_query_flags: FLAG_PHRASE, FLAG_SPELLING_CORRECTION

development:
  database: db/xapian_db/development
  <<: *defaults

test:
  database: ":memory:" # Use an in memory database for tests
  <<: *defaults

production:
  database: db/xapian_db/production
  <<: *defaults

Available options

- adapter: :active_record|:datamapper, default: :active_record
- language: any iso language code, default: :none (activates spelling corrections, stemmer and stop words if an iso language code ist set)
- term_min_length: <n>, default: 1 (do not index terms shorter than n)
- term_splitter_count: <n>, default: 0 (see chapter Term Splitting)
- enabled_query_flags: <list of flags, separated by commas>
- disabled_query_flags: <list of flags, separated by commas>

The following query flags are enabled by default:

  - FLAG_WILDCARD
  - FLAG_BOOLEAN
  - FLAG_BOOLEAN_ANY_CASE
  - FLAG_SPELLING_CORRECTION

See the xapian docs for all available query flags; if you use the enabled_query_flags option, you must list all query flags that you want to enable since enabled_query_flags overwrites the defaults

If you do not configure settings for an environment in this file, xapian_db applies the defaults.

Configure an index blueprint

In order to get your models indexed, you must configure a document blueprint for each class you want to index. You can pass the class name as a symbol or as a string (if the class is namespaced):

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.attribute :name, :weight => 10
  blueprint.attribute :first_name
end

The example above assumes that you have a class Person with the methods name and first_name. Attributes will get indexed and are stored in the documents. You will be able to access the name and the first name in your search results.

If you want to index additional data but do not need access to it from a search result, use the index method:

blueprint.index :remarks, :weight => 5

If you want to declare multiple attributes or indexes with default options, you can do this in one statement:

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.attributes :name, :first_name, :profession
  blueprint.index      :notes, :remarks, :cv
end

Note that you cannot add options using this mass declaration syntax (e.g. blueprint.attributes :name, :weight => 10, :first_name is not valid).

Use blocks for complex evaluations of attributes or indexed values:

XapianDb::DocumentBlueprint.setup(:IndexedObject) do |blueprint|
  blueprint.attribute :complex do
    if @id == 1
      "One"
    else
      "Not one"
    end
  end
end

You may add a filter expression to exclude objects from the index. This is handy to skip objects that are not active, for example:

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.attributes :name, :first_name, :profession
  blueprint.index      :notes, :remarks, :cv
  blueprint.ignore_if {active == false}
end

You can add a type information to an attribute (default format is string). As of now the special types :string, :date, :date_time and :number are supported (and required for range queries):

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.attribute :age,           :as => :number
  blueprint.attribute :date_of_birth, :as => :date
  blueprint.attribute :name,          :as => :string
  blueprint.attribute :updated_at,    :as => :date_time
  blueprint.attribute :address,       :as => :json
end

If you don’t need field searches for an attribute, turn off the prefixed option (makes your index smaller and more efficient):

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.attribute :complex_object, prefixed: false
end

You can override the global adapter configuration in a specific blueprint. Let’s say you use ActiveRecord, but you have one more class that is not stored in the database, but you want it to be indexed:

XapianDb::DocumentBlueprint.setup(:SpecialClass) do |blueprint|
  blueprint.adapter :generic
  blueprint.index   :some_stuff
end

If you use associations in your blueprints, it might be a good idea to specify a base query to speed up rebuild_xapian_index calls (avoiding 1+n queries):

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.index :addresses, as: :json
  blueprint.base_query { |p| p.includes(:addresses) }
end

If you have configured a term_splitter_count, you might want to exclude certain attributes from the automatic term splitting:

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.attribute :age, :as => :number, :no_split => true
end

You can specify a natural sort order for each class using a method symbol or a block. If you don’t specify an order expression in your xapian query, the matches are ordered by relevance and - within the same relevance - by the natural sort order. If you don’t specify the natural sort order, it defaults to id. Examples:

XapianDb::DocumentBlueprint.setup(:House) do |blueprint|
  blueprint.natural_sort_order :number
end

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.natural_sort_order do
    "#{surname} #{name}"
  end
end

You can specify a Ruby method for preprocessing the indexed terms. The method needs to be a class method that takes one argument (the terms to be indexed) and returns a string. You can configure it globaly in the config or on a blueprint. For example, this setup will make words with accented e characters searchable by their accentless form:

class Util
  def self.strip_accents(terms)
    terms.gsub(/[éèêëÉÈÊË]/, "e")
  end
end

XapianDb::Config.setup do |config|
  config.indexer_preprocess_callback Util.method(:strip_accents)
end

You may use attributes from associated objects in a blueprint; if you do that and an associated object is updated, your objects should be reindexed, too. You can tell XapnaDB about those dependencies like so:

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.attribue :address, :as => :json

  blueprint.dependency :Address, when_changed: %i(street zip city) do |address|
    Person.joins{ address }.where{ adresses.id == my{ address.id } }
  end
end

The block you supply to the dependency declaration must return a collection of objects that should get reindexed, too.

If you want to manage the (re)indexing of your objects on your own, turn off autoindexing in your blueprint:

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.autoindex false
end

This will turn off the auto-reindexing for any object of the configured class. Use XapianDb.reindex(object) to trigger the reindexing logic in your code. It will also turn off the auto-deletion of the doc, when the object gets destroyed. Use XapianDb.delete_doc_with(object.xapian_id) to trigger deletion logic in your code.

Place these configurations either into the corresponding class or - I prefer to have the index configurations outside the models - into the file config/xapian_blueprints.rb.

IMPORTANT

  • Do not place them into an initializer, this will not work when cache_classes is set to false (default in config/development.rb)

  • Whenever you make a change to a blueprint configuration, you must rebuild your entire xapian index

Update the index

xapian_db injects some helper methods into your configured model classes that update the index automatically for you when you create, save or destroy models. If you already have models that should now go into the index, use the method rebuild_xapian_index:

Person.rebuild_xapian_index

To get info about the reindex process, use the verbose option:

Person.rebuild_xapian_index :verbose => true

In verbose mode, XapianDb will use the progressbar gem if available.

To rebuild the index for all blueprints, use

XapianDb.rebuild_xapian_index

You can update the index for a single object, too (e.g. to reevaluate an ignore_if block without modifying and saving the object):

XapianDb.reindex object

Query the index

A simple query looks like this:

results = XapianDb.search "Foo"

You can use wildcards and boolean operators:

results = XapianDb.search "fo* or baz"

You can query attributes:

results = XapianDb.search "name:Foo"

You can force the sort order:

results = XapianDb.search "name:Foo", order: [:name]

You can query objects of a specific class:

results = Person.search "name:Foo"

You can search for exact phrases (if the query flag is turned on):

results = XapianDb.search('"this exact sentence"')

If you want to paginate the result, pass the :per_page argument:

results = Person.search "name:Foo", :per_page => 20

If you want to limit the number of results, pass the :limit argument (handy if you use the query for autocompletion):

results = Person.search "name:Foo", :limit => 10

On class queries you can specifiy order options:

results = Person.search "name:Foo", :order => :first_name
results = Person.search "Fo*", :order => [:name, :first_name], :sort_decending => true

If you define an attribute with a supported type, you can do range searches:

XapianDb::DocumentBlueprint.setup(:Person) do |blueprint|
  blueprint.attribute :age,           :as => :number
  blueprint.attribute :date_of_birth, :as => :date
  blueprint.attribute :name,          :as => :string
end

result = XapianDb.search("date_of_birth:2011-01-01..2011-12-31")
result = XapianDb.search("age:30..40")
result = XapianDb.search("name:Adam..Chris")

Open Ranges are supported, too:

result = XapianDb.search("age:..40")
result = XapianDb.search("age:30..")

You can combine range query expressions with other expressions:

result = XapianDb.search("age:30..40 AND city:Aarau")

Process the results

XapianDb.search returns a resultset object. You can access the number of hits directly:

results.hits # Very fast, does not load the resulting documents; always returns the actual hit count
             # even if a limit option was set in the query

If you use a persistent database, the resultset may contain a spelling correction:

# Assuming you have at least one document containing "mouse"
results = XapianDb.search("moose")
results.spelling_suggestion # "mouse"

The results behave like an array:

doc = results.first
puts doc.score.to_s         # Get the relevance of the document
puts doc.indexed_class      # Get the type of the indexed object as a string, e.g. "Person"
puts doc.name               # We can access a single attribute
puts doc.attributes         # We can access all attributes as a hash
person = doc.indexed_object # Access the object behind this doc (lazy loaded)

Use a search result with will_paginate in a view:

<%= will_paginate @results %>

Or with kaminari:

<%= kaminari @results %>

Facets

If you want to implement a simple drilldown for your searches, you can use a global facets query:

search_expression = "Foo"
facets = XapianDb.facets(:name, search_expression)
facets.each do |name, count|
  puts "#{name}: #{count} hits"
end

If you want the facets based on the indexed class, use the special attribute :indexed_class:

search_expression = "Foo"
facets = XapianDb.facets(:indexed_class, search_expression)
facets.each do |klass, count|
  puts "#{klass.name}: #{count} hits"

  # This is how you would get all documents for the facet
  # doc = klass.search search_expression
end

A class level facet query is possible, too:

search_expression = "Foo"
facets = Person.facets(:name, search_expression)
facets.each do |name, count|
  puts "#{name}: #{count} hits"
end

Any attribute declared in a blueprint can be used for a facet query. Use facet queries on attributes that store atomic values like strings, numbers or dates. If you use it on attributes that contain collections (like an array of strings), you might get unexpected results.

Find similar documents

If you have a rearch result, you can search for similar documents by selecting one or more documents from your result and passing them to the find_similar_to method:

results = XapianDb.search("moose")
similar = XapianDb.find_similar_to results.first

It works like this: The xapian engine extracts the most selective terms from the passed documents. Then, a new query is executed with the retrieved terms combined with OR operators. This method works best if your models contain large amounts of text.

Transactions

You can execute a block of code inside a XapianDb transaction. This ensures that the changed objects in your block will get reindexed only if the block does not raise an exception.

XapianDb.transaction do
  object1.save
  object2.save
end

Bulk inserts / updates / deletes

When you change a lot of models, it is not very efficient to update the xapian index on each insert / update / delete. Instead, you can use the auto_indexing_disabled method with a block and rebuild the whole index afterwards:

XapianDb.auto_indexing_disabled do
  Person.each do |person|
    # change person
    person.save
  end
end
Person.rebuild_xapian_index

Add your own serializers for special objects

XapianDb serializes objects to xapian documents as strings by default.

However, dates need special handling to support date range queries. To support date range queries and allow the addition of other custom data types in the future, XapianDb uses a simple, extensible mechanism to serialize / deserialize your objects. An example on how to extend this mechanism is provided in examples/custom_serialization.rb.

Term Splitting

If you want to build a realtime search showing results while the user types, you might experience very poor performance and a huge memory load for the first typed characters (1*, 12*…). XapianDb allows you to configure the term_splitter_count to avoid this. If you configure a term_splitter_count of e.g. 2, the term “test” will get indexed with “t”, “te” and “test”. Now you can apply the “*” only for search terms that are longer than the configured term_splitter_count resulting in a much better performance and lower memory footprint.

Production setup

Since Xapian allows only one database instance to write to the index, the default setup of XapianDb will not work with multiple app instances trying to write to the same database (you will get lock errors). Therefore, XapianDb provides three solutions based on queueing systems to overcome this. The first solution uses beanstalk, the second one uses resque and the third uses sidekiq.

Installation with beanstalk

1. Install beanstalkd

Make sure you have the beanstalk daemon installed

OSX

The easiest way is to use macports or homebrew:

port install beanstalkd
brew install beanstalkd

Debian (Lenny)

# Add backports to /etc/apt/sources.list:
deb http://ftp.de.debian.org/debian-backports lenny-backports main contrib non-free
deb-src http://ftp.de.debian.org/debian-backports lenny-backports main contrib non-free

sudo apt-get update
sudo apt-get -t lenny-backports install libevent-1.4-2
sudo apt-get -t lenny-backports install libevent-dev
cd /tmp
wget --no-check-certificate https://github.com/downloads/kr/beanstalkd/beanstalkd-1.4.6.tar.gz
tar xvf beanstalkd-1.4.6.tar.gz
cd beanstalkd-1.4.6/
./configure
make
sudo make install

2. Add the beanstalk-client gem to your config

gem 'beanstalk-client' # Add this to your Gemfile
bundle install

3. Install the beanstalk worker script

rails generate xapian_db:install

4. Configure your production environment in config/xapian_db.yml

production:
  database: db/xapian_db/production
  writer:   beanstalk
  beanstalk_daemon: localhost:11300

5. start the beanstalk daemon

beanstalkd -d

6. start the beanstalk worker from within your Rails app root directory

RAILS_ENV=production script/beanstalk_worker start

If everything is fine, you should find a file namend beanstalk_worker.pid in tmp/pids. If something goes wrong, you’ll find beanstalk_worker.log instead showing the stack trace.

Important: Do not start multiple instances of this daemon!

Installation with Resque

1. Install and start redis

Install and start redis as described on the resque github page.

2. Add the resque gem to your config

gem 'resque'
bundle install

3. Configure XapianDb to use resque in production

production:
  database:     db/xapian_db/production
  writer:       resque
  resque_queue: my_queue

If you don’t specify a queue name XapianDb will use ‘xapian_db’ by default.

4. Start the resque worker

RAILS_ENV=production QUEUE=my_queue rake resque:work

Be sure to specify the correct queue name when starting the worker.

If you don’t provide a queue name, it WON’T take ‘xapian_db’ by default! Do not start multiple instances of this worker!

Installation with Sidekiq

1. Install and start redis

Install and start redis as described on the resque github page.

2. Add the sidekiq gem to your config

gem 'sidekiq'
bundle install

3. Configure XapianDb to use sidekiq in production

production:
  database:     db/xapian_db/production
  writer:       sidekiq
  sidekiq_queue: my_queue
  set_max_expansion: 100
  sidekiq_retry: false

If you don’t specify a queue name XapianDb will use ‘xapian_db’ by default. Additionally, if you don’t provide a ‘sidekiq_retry’ option, it will default to ‘false’.

4. Start sidekiq

RAILS_ENV=production bundle exec sidekiq

About

A fulltext indexing solution for plain Ruby and Ruby On Rails based on the Xapian library

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages