Solr - Search Facets
The following facets are included as part of “core” search results.
| PLOS Search Field | Description | Note |
| --- | --- | --- |
| Specialized indexes for faceting (see notes below) | | |
| doc_partial_parent_id | DOI | The ID value of the parent document |
| doc_type | Document Type | Two possible values: full or partial |
| doc_partial_type | The type of article section | introduction, abstract, etc. |
| doc_partial_body | The text of the article section | |
| Facet fields | | |
| affiliate_facet | Affiliate (facet) | Don’t search against |
| author_facet | Author (facet) | Don’t search against |
| subject_facet | Subject (facet) | Don’t search against |
| editor_facet | Editor (facet) | Don’t search against |
| article_type_facet | Article Type (facet) | Don’t search against |
However, two of these facets deserve special attention: cross_published_journal_key and doc_partial_type. To build the result sets for these two fields, two additional SOLR queries are required.
The “cross_published_journal_key” facet shows how many documents, across all journals, match the terms you entered. It is queried separately because the “core” search is, by default, journal-specific.
For details on building SOLR queries, see Apache’s SOLR website. Here are a few sample queries against our schema to get you started. Results are returned in XML, Solr’s default format.
Simple search for the term “test”. An article is included in this result set if the word “test” appears anywhere in the article.
Search for the term “test” with facets. This query also searches for the term “test”, but the results include facets for subjects, authors, editors, article types, and affiliates.
Get the journals facet (cross_published_journal_key). This shows all of the journals that have been indexed in Solr.
Get the “Where my keywords appear” facet (doc_partial_type). This is a list of all the sections of an article (e.g., Body, Materials and Methods, Introduction) in which keywords can be specifically sought. It is useful, for instance, if you want to know whether a name appears only in the References section.
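The four sample queries described above could be assembled along the following lines. This is a minimal sketch, not the canonical form of these queries: it assumes the http://api.plos.org/search endpoint used elsewhere on this page accepts the standard Solr parameters `q`, `fq`, `rows`, `facet` and `facet.field`, and that the match-all query `*:*` is accepted. The helper names are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.plos.org/search"  # endpoint used elsewhere on this page

# Facet fields listed in the table at the top of this page.
CORE_FACETS = ["subject_facet", "author_facet", "editor_facet",
               "article_type_facet", "affiliate_facet"]


def build_url(params):
    """Encode (name, value) pairs; a list keeps repeated facet.field entries."""
    return BASE_URL + "?" + urlencode(params)


def simple_search(term):
    # Restrict to whole-article documents, as recommended for normal searches.
    return build_url([("q", term), ("fq", "doc_type:full")])


def search_with_facets(term, facets=CORE_FACETS):
    params = [("q", term), ("fq", "doc_type:full"), ("facet", "true")]
    params += [("facet.field", name) for name in facets]
    return build_url(params)


def single_facet(field, doc_type, query="*:*"):
    # rows=0 asks Solr for facet counts only, with no documents returned.
    # "*:*" (match everything) is an assumption about what the endpoint accepts.
    return build_url([("q", query), ("fq", "doc_type:" + doc_type),
                      ("rows", "0"), ("facet", "true"), ("facet.field", field)])


if __name__ == "__main__":
    print(simple_search("test"))
    print(search_with_facets("test"))
    print(single_facet("cross_published_journal_key", "full"))
    print(single_facet("doc_partial_type", "partial"))
```

Passing the parameters as a list of pairs rather than a dict is what allows `facet.field` to appear once per requested facet.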
When thinking of documents stored in SOLR, it’s important to think of each document as a collection of fields. Fields have different storage mechanisms optimized for searching and faceting. It’s important at this point to have a good understanding of what a facet is.
“Faceted search is the dynamic clustering of items or search results into categories that let users drill into search results (or even skip searching entirely) by any value in any field. Each facet displayed also shows the number of hits within the search that match that category. Users can then “drill down” by applying specific constraints to the search results. Faceted search is also called faceted browsing, faceted navigation, guided navigation and sometimes parametric search.” (Lucid Imagination)
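To make the definition concrete, here is a small illustration of facet counting and drill-down done in plain Python over an in-memory result set. It does not show how Solr computes facets internally; the three documents and their values are invented for the example, and only the field names come from the table above.

```python
from collections import Counter

# Hypothetical miniature result set; the values are illustrative only.
RESULTS = [
    {"id": "a", "article_type_facet": "Research Article",
     "subject_facet": ["Genetics"]},
    {"id": "b", "article_type_facet": "Research Article",
     "subject_facet": ["Genetics", "Ecology"]},
    {"id": "c", "article_type_facet": "Editorial",
     "subject_facet": ["Ecology"]},
]


def field_values(doc, field):
    """Return the field's values as a list, whether single- or multi-valued."""
    values = doc.get(field, [])
    return [values] if isinstance(values, str) else list(values)


def facet_counts(docs, field):
    """Count how many documents fall under each value of the field."""
    counts = Counter()
    for doc in docs:
        counts.update(set(field_values(doc, field)))
    return counts


def drill_down(docs, field, value):
    """Keep only the documents that carry the chosen facet value."""
    return [doc for doc in docs if value in field_values(doc, field)]


if __name__ == "__main__":
    print(facet_counts(RESULTS, "subject_facet"))
    print(drill_down(RESULTS, "article_type_facet", "Editorial"))
```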
Not all stored fields should be searched against. Fields ending with _facet are stored in a way that generates facets accurately and are not designed to be searched against. Some fields are indexed but not stored, and therefore cannot be part of the search results. We also store two types of documents, as defined by the doc_type field: “full” and “partial”. The latter is used to compute the “Where my keywords appear” facet.
For normal search queries, “doc_type:full” should always be used as a filter (the fq URL query parameter).
Note that we use two types of searches: dismax and standard. Dismax searches are used for simple searches where no fields are specified. In that case the title, author and everything fields are searched, with the highest priority given to title and author. For more details on dismax, see the SOLR configuration file and the SOLR documentation. For 90% of our searching, dismax is what should be used. Standard searches can be used for more specific results against specific fields; for these searches, one or more fields must be specified to search against.
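The difference between the two search types can be sketched as two URL builders. This is an illustration under assumptions: `defType` is the stock Solr parameter for choosing a query parser, but whether this endpoint honours it (or simply defaults to dismax) is not stated here. The function names are hypothetical, and the field names `title` and `author` are taken from the paragraph above.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.plos.org/search"  # endpoint used elsewhere on this page


def dismax_query(terms):
    """Simple search: no fields named, the parser decides where to look."""
    # defType is a standard Solr parameter; its effect on this endpoint
    # is an assumption.
    params = [("q", terms), ("defType", "dismax"), ("fq", "doc_type:full")]
    return BASE_URL + "?" + urlencode(params)


def standard_query(field_terms):
    """Fielded search: every clause names the field it searches against."""
    if not field_terms:
        raise ValueError("a standard search needs at least one field")
    clauses = []
    for field, term in field_terms.items():
        escaped = term.replace('"', '\\"')  # keep embedded quotes literal
        clauses.append('{}:"{}"'.format(field, escaped))
    params = [("q", " AND ".join(clauses)), ("fq", "doc_type:full")]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(dismax_query("test"))
    print(standard_query({"title": "malaria", "author": "Smith"}))
```

Both builders apply the `doc_type:full` filter, in line with the recommendation for normal search queries.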
“Where my Keywords Appear” Facet
The logic gets a little tricky here. Out of the box, SOLR does not provide a way for us to tell our users which areas of the document the search terms appeared in. In fact, this runs somewhat counter to the way SOLR is designed. But we determined that this was a powerful bit of knowledge, worth the effort of putting together a system that makes it possible. To do this, a number of SOLR documents are created when an article is ingested into the system. First, a document of doc_type “full” is created that contains the whole body of the research article. For most searches this is all you’ll want to search against, by using the filter query fq=doc_type:full.
http://api.plos.org/search?q=id:10.1371/journal.pcbi.1000048
In addition to this first document, a number of document parts are created:
You’ll notice that each of these partial documents contains a number of fields duplicated from the original article’s document. We do this so that most search terms applied to the search for the parent document can also be applied to the document parts.
http://api.plos.org/search?q=id:10.1371/journal.pcbi.1000048/title&fq=doc_type%3Apartial&fl=*
The difference between the full document and a partial one is that the partial has no fields representing “everything”; instead it has a “doc_partial_body” field and a “doc_partial_type” field. Note that “doc_partial_type” is not stored and can only be retrieved as a facet.
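The relationship between the full document and its parts can be pictured with a rough sketch of the ingest step. This is not the actual ingest code: the function name is hypothetical, the `parent DOI + "/" + section` ID scheme is inferred from the example URL above, and building an `everything` field by joining the section texts is an assumption made only for illustration.

```python
def build_solr_documents(doi, shared_fields, sections):
    """Return one "full" document plus one "partial" document per section.

    shared_fields: fields copied onto every document (e.g. author, journal)
    sections: mapping of section type (e.g. "title", "abstract") to its text
    """
    full = dict(shared_fields)
    full.update({
        "id": doi,
        "doc_type": "full",
        # Assumption: the catch-all field holds the text of every section.
        "everything": " ".join(sections.values()),
    })
    docs = [full]
    for section_type, text in sections.items():
        partial = dict(shared_fields)  # duplicated so the same filters apply
        partial.update({
            "id": doi + "/" + section_type,  # inferred ID scheme
            "doc_type": "partial",
            "doc_partial_parent_id": doi,
            "doc_partial_type": section_type,
            "doc_partial_body": text,
        })
        docs.append(partial)
    return docs


if __name__ == "__main__":
    for doc in build_solr_documents(
            "10.1371/journal.pcbi.1000048",
            {"author": ["Example Author"]},
            {"title": "Example title", "abstract": "Example abstract text"}):
        print(doc["id"], doc["doc_type"])
```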
So if we want to find all partial documents with terms that match our search query:
http://api.plos.org/search?q=doc_partial_body:test&fl=&fq=doc_type%3Apartial
If we want to generate a facet telling us what document parts contain the terms entered:
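One way such a facet request could be assembled and read back is sketched below. It assumes the endpoint accepts the standard Solr parameters `facet`, `facet.field`, `rows` and `wt=json`, and that the JSON response uses Solr's default flat layout for facet fields (value, count, value, count, ...). The function names are hypothetical and the counts in the sample response are made up.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.plos.org/search"  # endpoint used elsewhere on this page


def keyword_location_url(term):
    """Build a facet-only query over the partial documents."""
    params = [
        ("q", "doc_partial_body:" + term),
        ("fq", "doc_type:partial"),
        ("rows", "0"),                        # facet counts only, no documents
        ("facet", "true"),
        ("facet.field", "doc_partial_type"),
        ("wt", "json"),                       # assumption: JSON output is available
    ]
    return BASE_URL + "?" + urlencode(params)


def parse_facet(response, field="doc_partial_type"):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    print(keyword_location_url("test"))
    # Hand-written sample in the shape Solr returns; the counts are invented.
    sample = {"facet_counts": {"facet_fields": {
        "doc_partial_type": ["Introduction", 3, "References", 1]}}}
    print(parse_facet(sample))
```

Each key in the resulting dict is a document part, and its value is the number of partial documents of that type matching the term.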