Solr 6 builds on the innovations of Solr 5.
First, let’s take a look at what was done in Solr 5.
There were improvements to “bin/solr” and “bin/post” that made it easier to start Solr and add new documents, and more APIs were introduced.
The user interface was rewritten with a modern framework (AngularJS) to allow for more innovation and enhancements in the near future.
Security had been requested for a long time, so it was introduced in Solr 5. Plugins were written for Kerberos and for basic authentication and authorization, and there are plugin examples for customization.
Solr 5.4 introduced basic authentication.
In Solr 5.5, rule-based authorization was expanded and became more flexible. APIs such as the ConfigSet API and the Collections API were also expanded to manage collections more flexibly.
There is a new option in “bin/solr” for importing and exporting ZooKeeper configs. There are also performance optimizations for faceting on DocValues fields.
Solr 6 has quite a few new features, but let’s focus on a few of them. The big ones are Parallel SQL, Cross Data Center Replication, Graph Traversal, modern APIs, and the new Jetty 9.3 with improved performance and support for HTTP/2.
Parallel SQL was introduced to support relational algebra in a scalable manner. It seamlessly combines SQL with Solr’s full-text capabilities.
Parallel SQL has two aggregation modes: MapReduce and Facet. MapReduce mode is meant for high-cardinality fields and handles aggregations and distributed joins. It uses the concept of shuffling, much like MapReduce frameworks, which partition the data for greater scalability, so the partitioning key is a very important piece of the data there. Facet mode pushes the aggregation down to the nodes, and only the aggregated data is returned. So if you have a lot of data but low cardinality, this option performs very well.
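To make the two modes concrete, here is a minimal sketch of how a request to the SQL handler might be assembled. It assumes the collection exposes the “/sql” handler with the “stmt” and “aggregationMode” parameters; the host, the collection name “products” and the field “category” are hypothetical and only serve the illustration.

```python
from urllib.parse import urlencode


def build_sql_request(base_url, collection, stmt, aggregation_mode="map_reduce"):
    """Return (url, body) for a POST to a collection's /sql handler.

    aggregation_mode selects how GROUP BY is executed:
    "map_reduce" shuffles tuples to workers, "facet" pushes the
    aggregation down to the nodes.
    """
    if aggregation_mode not in ("map_reduce", "facet"):
        raise ValueError("aggregation_mode must be 'map_reduce' or 'facet'")
    url = f"{base_url}/{collection}/sql?" + urlencode({"aggregationMode": aggregation_mode})
    body = urlencode({"stmt": stmt})  # form-encoded SQL statement
    return url, body


# Hypothetical collection and field names, low-cardinality grouping.
url, body = build_sql_request(
    "http://localhost:8983/solr",
    "products",
    "SELECT category, count(*) FROM products GROUP BY category",
    aggregation_mode="facet",
)
print(url)
print(body)
```

The returned body would be sent as a form-encoded POST; switching the last argument to “map_reduce” is the only change needed to try the other mode on a high-cardinality field.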
Parallel SQL builds on two capabilities that already existed in previous versions of Solr: the Export request handler and the Streaming API.
The Export request handler provides the capability of streaming an entire result set. It can be used even with large result sets to export them out of Solr.
Search is not the only function available in the Streaming API. Its functions fall into two groups: stream sources, which define how data is retrieved, and stream decorators, which define any aggregation or transformation performed on it. They are designed to work with entire result sets, and they can be compounded or wrapped to perform several operations at the same time.
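The wrapping of a stream source in a stream decorator can be sketched as plain string composition. The helper below is hypothetical (it is not part of Solr or SolrJ) and only renders the expression text; the collection “products” and field “category” are assumptions, while “search”, “unique” and the “/export” handler are taken from Solr’s streaming expressions.

```python
def expr(name, *positional, **named):
    """Render a streaming-expression function call as text.

    Positional arguments are emitted as-is (a collection name or a
    nested expression); named arguments are emitted as key="value".
    """
    args = [str(p) for p in positional]
    args += [f'{key}="{value}"' for key, value in named.items()]
    return f"{name}({', '.join(args)})"


# Stream source: read the whole sorted result set through /export.
source = expr(
    "search", "products",
    q="*:*", fl="id,category", sort="category asc", qt="/export",
)

# Stream decorator: wrap the source and keep one tuple per category.
decorated = expr("unique", source, over="category")
print(decorated)
```

The resulting text would be passed as the “expr” parameter to the collection’s “/stream” handler. Further decorators can wrap “decorated” in the same way.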
Solr 6.x supports graph queries for finding interconnected data. The graph query parser is invoked through local params and follows edges from node to node. Graph queries allow optional filters to be applied during the traversal. For example, you can find which of your friends on social media like “Honda Civic R 2017”, or which airplanes your friends have flown on.
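As an illustration of the local-params syntax, the following sketch builds a graph query string with an optional traversal filter. The function itself is a hypothetical helper, and the field names “in_edge”, “out_edge” and “type” are assumptions; “from”, “to”, “maxDepth” and “traversalFilter” are parameters of the graph query parser.

```python
def graph_query(root_query, from_field, to_field, max_depth=None, traversal_filter=None):
    """Build a {!graph ...} query string.

    root_query selects the starting nodes; from_field and to_field
    name the fields that hold the incoming and outgoing edge ids.
    """
    local_params = [f"from={from_field}", f"to={to_field}"]
    if max_depth is not None:
        local_params.append(f"maxDepth={max_depth}")
    if traversal_filter is not None:
        # Applied to every document visited during the traversal.
        local_params.append(f'traversalFilter="{traversal_filter}"')
    return "{!graph " + " ".join(local_params) + "}" + root_query


q = graph_query(
    "id:alice", "in_edge", "out_edge",
    max_depth=2, traversal_filter="type:friend",
)
print(q)
```

The string would be used as the “q” parameter of an ordinary search request.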
Solr 6.x APIs are more consistent and versioned, endpoint names are friendlier, and JSON is the default output format, although “wt” is still supported.
Lucene recently switched its default text scoring from TF*IDF to BM25.
Solr 6.x is built on the latest Lucene, so it inherits the same scoring algorithm. BM25 is a probabilistic model, unlike the TF*IDF scoring used previously.
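For readers who want to see what BM25 computes, here is a sketch of the textbook per-term score. It assumes the commonly used defaults k1 = 1.2 and b = 0.75 and the idf form log(1 + (N - n + 0.5) / (n + 0.5)); Lucene’s actual implementation also stores document lengths in a compressed form, so real scores can differ slightly from this formula.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency: rarer terms get a higher weight."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score of one query term in one document.

    k1 controls how quickly term frequency saturates; b controls how
    strongly the score is normalized by document length.
    """
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * tf * (k1 + 1) / (tf + length_norm)


# Term frequency saturates: ten occurrences score far less than
# ten times a single occurrence.
one = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, doc_count=1000, doc_freq=10)
ten = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100, doc_count=1000, doc_freq=10)
print(one, ten)
```

The saturation is the practical difference from TF*IDF: repeating a term many times cannot push the score beyond idf * (k1 + 1).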
There is a new API to perform Backups and Restores.
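As a sketch of what such calls might look like, the example below assumes the backup and restore operations are exposed as the BACKUP and RESTORE actions of the Collections API, with “name”, “collection” and “location” parameters. The host, the backup name, the collection names and the path are hypothetical.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, name, collection, location):
    """Build a Collections API URL for a BACKUP or RESTORE action."""
    if action not in ("BACKUP", "RESTORE"):
        raise ValueError("action must be 'BACKUP' or 'RESTORE'")
    params = {
        "action": action,
        "name": name,              # name of the backup
        "collection": collection,  # source (BACKUP) or target (RESTORE) collection
        "location": location,      # shared path reachable from all nodes
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


backup_url = collections_api_url(
    "http://localhost:8983/solr", "BACKUP", "nightly", "products", "/backups/solr"
)
restore_url = collections_api_url(
    "http://localhost:8983/solr", "RESTORE", "nightly", "products_restored", "/backups/solr"
)
print(backup_url)
print(restore_url)
```

Restoring into a differently named collection, as shown here, leaves the original collection untouched while the restored copy is checked.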
Moving to Solr 6.x
First of all, Solr 6.x requires Java 8 or higher to be installed on the host computer.
If no schemaFactory is declared, ManagedIndexSchemaFactory is now used by default instead of ClassicIndexSchemaFactory. The schema then lives in managed-schema rather than schema.xml.
If no SimilarityFactory is defined, it defaults to SchemaSimilarityFactory. If a fieldType does not declare a similarity, it defaults to BM25.