Lens-Indexer is a lookup service for Lens articles built on top of an ElasticSearch index. It provides full-text search over article metadata (title, abstract, authors, keywords, etc.) as well as over the content of article fragments.
Note: this is work in progress and should be considered experimental.
A demo instance of the service is running on heroku.com. It uses an ElasticSearch instance hosted on qbox.io. Click the following link to view a sample search result:
The index is seeded with the eLife corpus found at http://s3.amazonaws.com/elife-cdn/xml_files.txt.
- Node.js 0.10.x
- ElasticSearch 1.4.x
To simplify setup you can bring up a virtual machine using Vagrant (see Vagrantfile). However, since the setup is so simple, you may just want to inspect provision.sh to derive your own setup.
Clone the repo:
git clone https://github.com/elifesciences/lens-indexer.git
Pull in dependencies using npm:
cd lens-indexer
npm install
Adjust config.js to point to your ElasticSearch host:
var config = {
  host: 'https://your-id.qbox.io'
};
We use individual scripts to seed the ElasticSearch instance. You can run them individually, according to your use case. For instance, if you want to update the index without resetting it, just leave out step 01.
01 Configure Index
$ scripts/01_configure_index.js
This sets up and resets the article and fragment indexes.
02 Create List of URLs
$ scripts/02_create_list_of_urls.js
Takes the list of XML files from http://s3.amazonaws.com/elife-cdn/xml_files.txt and stores it in data/filelist.js.
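Conceptually, this step just turns a newline-separated text file into a list of URLs. A minimal sketch of that parsing, assuming trivial trimming behavior (the helper name and sample data are illustrative, not taken from the actual script, which also writes the result to data/filelist.js):

```javascript
// Parse the newline-separated contents of a file list (like xml_files.txt)
// into an array of URLs, dropping blank lines.
function parseFileList(text) {
  return text
    .split('\n')
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; });
}

// Hypothetical sample input:
var sample = 'http://example.com/a.xml\nhttp://example.com/b.xml\n';
console.log(parseFileList(sample).length); // 2
```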
03 Fetch XML
$ scripts/03_fetch_xml.js
Downloads the latest versions of the XML files listed in data/filelist.js.
04 Convert
$ scripts/04_convert.js
Converts XML files to Lens JSON using the Lens converter.
Note: we needed to port the converter to run server-side. Since this code is experimental and not in sync with the official Lens converter, the resulting JSON files may differ slightly.
05 Seed Index
$ scripts/05_seed_index.js
This is the step where the ES index is actually updated. If you are seeding a lot of documents, make sure your ElasticSearch instance has enough memory; we ran into memory issues several times during our testing phase.
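One common way to keep memory bounded when seeding many documents is to send them to ElasticSearch in fixed-size bulk batches rather than all at once. A sketch of the batching logic only (the helper name and batch size are illustrative assumptions, not taken from the seeding script):

```javascript
// Split a list of documents into fixed-size batches, so each bulk request
// to ElasticSearch stays small. Illustrative only.
function chunk(docs, size) {
  var batches = [];
  for (var i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

// Five documents in batches of two -> three bulk requests.
console.log(chunk([1, 2, 3, 4, 5], 2).length); // 3
```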
After seeding you can run the indexer API.
$ PORT=4002 node server.js
Point your browser to the following URL to test:
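For a sense of what such a search request might carry, here is a sketch of an ElasticSearch query body for matching fragment content. The field name content comes from the fragment mapping, but the exact queries the indexer API issues are an assumption:

```javascript
// Build a query body that matches fragments whose analyzed content
// contains the given term. Illustrative sketch only.
function buildFragmentQuery(term) {
  return {
    query: {
      match: { content: term }
    },
    size: 10
  };
}

console.log(JSON.stringify(buildFragmentQuery('mitochondria')));
// {"query":{"match":{"content":"mitochondria"}},"size":10}
```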
The index has the following structure:
{
  "settings": {
    "analysis": {
      "filter": {
        "trigrams_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "html_content": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ],
          "filter": [ "classic" ]
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        // title and intro are indexed for fuzzy full-text search
        "title": { "type": "string", "index": "analyzed", "analyzer": "html_content", "search_analyzer": "snowball", "language": "English" },
        "intro": { "type": "string", "index": "analyzed", "analyzer": "html_content", "search_analyzer": "snowball", "language": "English" },
        // authors are indexed for exact full-text search (no partial matches)
        "authors": { "type": "string", "index": "analyzed", "analyzer": "standard" },
        // the rest are facets, used for strict match queries or filtering only
        "published_on": { "type": "string", "index": "not_analyzed" },
        "article_type": { "type": "string", "index": "not_analyzed" },
        "subjects": { "type": "string", "index": "not_analyzed" },
        "organisms": { "type": "string", "index": "not_analyzed" }
      }
    },
    "fragment": {
      "_parent": { "type": "article" },
      "properties": {
        "id": { "type": "string", "index": "not_analyzed" },
        "type": { "type": "string", "index": "not_analyzed" },
        "content": { "type": "string", "index": "analyzed", "analyzer": "html_content", "search_analyzer": "snowball", "language": "English" },
        "position": { "type": "integer", "index": "not_analyzed" }
      }
    }
  }
}
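The trigrams_filter defined in the settings is an ngram filter with min_gram and max_gram both set to 3, i.e. it emits overlapping 3-character grams that enable partial matching. A simplified illustration of what such a filter produces for a single token (the real filter operates on the analyzer's token stream, not raw strings):

```javascript
// Approximate what a 3-gram token filter emits for one token.
function trigrams(token) {
  var grams = [];
  for (var i = 0; i + 3 <= token.length; i++) {
    grams.push(token.slice(i, i + 3));
  }
  return grams;
}

console.log(trigrams('search')); // [ 'sea', 'ear', 'arc', 'rch' ]
```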
Note: there is one index called articles, holding two types of entities, article and fragment, where a fragment is modelled as a child of an article.