About Scalding

Scalding is a Scala library that makes it easy to work with and reason about data in distributed systems such as Hadoop.
It presents the data as a collection and lets you compute on it in a manner similar to the Scala collections API, so to the developer
the data looks like an ordinary collection that supports simple operations such as filter and map. The Map and Reduce steps used in Hadoop stem from functional
programming, so they are a natural fit for Scalding, which is based on Scala, a functional programming language.

Scala is generally easier to read than Java, is much more compact, and expresses business logic in a straightforward way.

Scalding is built on top of Cascading – an abstraction layer for Hadoop, written in Java.
The advantage of using Scalding is that it hides the complexity that Hadoop and MapReduce present.

Scalding operates by letting you think of your data as flowing through a series of pipes.
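To make the pipe metaphor concrete, here is a small sketch in plain Java. It is only an analogy: it uses `java.util.stream` on a single machine, not the Scalding API, and the class and method names are made up for illustration. Each stage takes what flows out of the previous one, which is roughly how a Scalding job reads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Analogy only: plain Java streams on one machine, not the Scalding API.
final class PipelineSketch {

    // Splits lines into words, drops empty tokens, and counts each word.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // "map" stage: one line in, many lower-cased words out
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                // filter stage: discard empty tokens
                .filter(word -> !word.isEmpty())
                // "reduce" stage: group identical words and count them
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not to be", "be quick")));
    }
}
```

In a distributed setting the same three stages would run across many machines, but the code would still read as a chain of collection operations.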

Scalding compares favorably with Pig for complex jobs. Pig is very good at solving simple, quick tasks.
However, complex tasks in Pig require UDFs written in other programming languages, and Pig scripts are hard to unit test.

Scalding comes with three APIs:

Fields API
Typed API – promotes type safety
Matrix API – deals with matrix operations such as matrix multiplication

Scala collections live in memory on a single host. Scalding uses Scala and Cascading to operate
on collections of data distributed across a number of commodity servers, yet it gives you the feeling of operating on a normal in-memory collection.

If you’ve seen Spark’s RDDs (resilient distributed datasets), this is a very similar concept.

Top 10 Big Data Trends for 2017

Tableau published a paper on Top 10 Big Data Trends for 2017 that you can find here: http://tabsoft.co/2jXCXar

We disagree that speeding up Hadoop is the number 1 trend. The author takes an evolutionary approach rather than a revolutionary one. What is needed more and more is event-driven processing. The machinery should react to new incoming data, which is easier to process. Even the 2nd trend acknowledges to some degree that Big Data is not just about using Hadoop-like systems. Hadoop was written for batch processing.

Utilizing reactive streams:

When we were consulting for a publisher of electronic books and academic journals, the vendor used Amazon Elastic MapReduce. The vendor paid around 1.5 million annually for a very large Amazon EMR cluster that re-processed books and journals on a daily basis. In a few instances an error occurred and the vendor had to re-run the job. We reengineered the system so that only new content is processed, as soon as it goes “live”. The new system saved the company 90% of its payments to Amazon for EMR resources, and the newest books and journals now appear online a few seconds after the content is delivered upstream to the vendor.
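The idea of processing each new item as it arrives, rather than re-running a batch, can be sketched with the reactive-streams interfaces that ship with Java 9 and later (`java.util.concurrent.Flow`). This is not the system described above. The class names, the item names and the "indexed:" prefix are hypothetical and only show the shape of the approach.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: react to each new piece of content instead of reprocessing everything.
final class NewContentSketch {

    // Subscriber that "processes" every item as soon as it is published.
    static final class Indexer implements Flow.Subscriber<String> {
        final List<String> processed = new CopyOnWriteArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1); // back-pressure: ask for one item at a time
        }
        @Override public void onNext(String item) {
            processed.add("indexed:" + item); // stand-in for the real work
            subscription.request(1);
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    // Publishes the given items and waits until the subscriber has seen them all.
    static List<String> processAsPublished(List<String> newItems) {
        Indexer indexer = new Indexer();
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(indexer);
            newItems.forEach(publisher::submit);
        } // close() signals onComplete once pending items are delivered
        try {
            indexer.done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexer.processed;
    }

    public static void main(String[] args) {
        System.out.println(processAsPublished(List.of("book-1", "journal-7")));
    }
}
```

In a real deployment the publisher would be fed by whatever delivers new content, and it would stay open instead of being closed after a fixed list.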

The 5th trend on the list talks about the variety of data. But that’s not new: Gartner published the traits of Big Data in 2001. What the paper does not elaborate on is HOW to tackle the variety.

Scala, the “Scalable Language”, borrowed the best concepts from various programming languages, then enhanced and extended them. Creating a DSL (Domain Specific Language) on top of Scala has become trivial. For example, to parse JSON of any kind, a programmer can write a few lines of code that match the JSON structure in an object-oriented way.

The paper prepares the reader for the new trends of 2017 but focuses too much on Hadoop infrastructure and not at all on the new trends that aid Big Data, such as reactive streams, functional programming, elastic systems, non-blocking I/O, and asynchronous processing of “live” data. We still recommend that you read Tableau’s research.

Code Musing

“I am sorry I have had to write you such a long letter, but I did not have time to write you a short one”
Pascal, Blaise (1623 – 1662) – French philosopher and mathematician.

At the age of 18 he invented the first calculating machine.

So I wonder why we make the same mistake. Let’s review a few code examples.

Take the following code:

private boolean isItemPutEligible(final SolrDocument doc) {
    String putEligibility = "N";
    Object obj = doc.getFieldValue(IS_PUT_ELIGIBLE);
    if (obj != null) {
        putEligibility = obj.toString();
    }
    if ("Y".equalsIgnoreCase(putEligibility)) {
        return true;
    }
    return false;
}

It could be shortened to:

private boolean isItemPutEligible(final SolrDocument doc) {
    Object obj = doc.getFieldValue(IS_PUT_ELIGIBLE);
    // Null-safe, case-insensitive comparison; no temporary String is needed.
    return obj != null && "Y".equalsIgnoreCase(obj.toString());
}

Or take this code:

Object obj = doc.getFieldValue(PROD_ID);
if (obj != null) {
    if (obj instanceof ArrayList<?>) {
        ArrayList<?> al = (ArrayList<?>) obj;
        if (!al.isEmpty()) {
            Object o = al.get(0);
            if (o != null) {
                result = o.toString();
            }
        }
    } else {
        result = obj.toString();
    }
}

It could be shortened to:

// getFirstValue returns the first element of a multi-valued field, the single value, or null.
Object obj = doc.getFirstValue(PROD_ID);
return obj != null ? obj.toString() : null;

I think both shortened versions are much easier to read and understand.

In the first instance we just avoid an unnecessary variable:

putEligibility

In the second instance we use a method that returns the very first value of the List, or the Object itself, found in the document for the given field name, or null:

getFirstValue
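To see what such a "first value" helper does without pulling in SolrJ, here is a self-contained sketch that mimics the behavior described above on a plain `Map`. The class name, the method names and the `prod_id` field are stand-ins for illustration and are not part of the Solr API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Stand-in for a document: a plain Map, so the sketch runs without SolrJ.
final class FirstValueSketch {

    // Returns the first element if the field holds a collection,
    // the value itself if it is single-valued, or null if there is nothing.
    static Object firstValue(Map<String, Object> doc, String field) {
        Object value = doc.get(field);
        if (value instanceof Collection<?>) {
            Collection<?> values = (Collection<?>) value;
            return values.isEmpty() ? null : values.iterator().next();
        }
        return value;
    }

    // The shortened calling code from the text, expressed against the stand-in.
    static String firstValueAsString(Map<String, Object> doc, String field) {
        Object first = firstValue(doc, field);
        return first != null ? first.toString() : null;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("prod_id", Arrays.asList(42, 43));
        System.out.println(firstValueAsString(doc, "prod_id"));
        doc.put("prod_id", Collections.emptyList());
        System.out.println(firstValueAsString(doc, "prod_id"));
    }
}
```

All the branching from the long version now lives in one helper, so every caller stays at two or three lines.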

I think that just reviewing such examples will inspire you to write less code and drink more of your favorite drink!

What is new in SOLR 6.x

Solr 6 builds on the innovation of Solr 5.
First of all, let’s take a look at what was done in Solr 5.
There were improvements to “bin/solr” and “bin/post” that make it easy to start Solr and add new documents, and more APIs were introduced.
The user interface was rewritten with a modern framework (AngularJS) to allow for more innovation and enhancements in the near future.
Security had been requested for a long time, and it was introduced in Solr 5. Plugins were written for Kerberos and for basic authentication and authorization, and there are plugin examples for customization.
Solr 5.3 introduced basic authentication.
In Solr 5.5 rule-based authorization was expanded and became more flexible. APIs such as the ConfigSet API and the Collections API were expanded to manage collections more flexibly.
There is a new script in “bin/solr” for importing and exporting ZooKeeper configs, and there are performance optimizations for faceting on DocValues fields.

There are quite a few features in Solr 6, but let’s focus on a few of them. The big ones are Parallel SQL, Cross Data Center Replication, Graph Traversal, modern APIs, and the new Jetty 9.3 with improved performance and support for HTTP/2.

Parallel SQL was introduced to support relational algebra in a scalable manner. It seamlessly combines SQL with Solr’s full-text capabilities.

Parallel SQL has two modes: a realtime MapReduce mode and a Facet aggregation mode. MapReduce mode is for high-cardinality fields and handles aggregations and distributed joins. It uses the concept of shuffling, very much like MapReduce frameworks that partition the data for greater scalability, so the partitioning key is a very important piece of the data there. Facet mode pushes the aggregation down to the nodes, and only the aggregated data comes back. So if you have a lot of data but low-cardinality fields, this option is quite performant.
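As a rough illustration, the mode is chosen per request through the `aggregationMode` parameter of the `/sql` request handler, with the SQL text passed in `stmt`. The sketch below only builds the request URL and sends nothing. The host, the `techproducts` collection, the `manu` field and the class name are assumptions for the example, and `URLEncoder.encode(String, Charset)` needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds (but does not send) a request URL for Solr's /sql handler.
final class ParallelSqlRequestSketch {

    // aggregationMode is expected to be "map_reduce" or "facet".
    static String sqlUrl(String solrBaseUrl, String collection, String stmt, String aggregationMode) {
        return solrBaseUrl + "/" + collection + "/sql"
                + "?aggregationMode=" + URLEncoder.encode(aggregationMode, StandardCharsets.UTF_8)
                + "&stmt=" + URLEncoder.encode(stmt, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr"; // assumed local Solr instance
        String stmt = "SELECT manu, count(*) FROM techproducts GROUP BY manu";
        // Low-cardinality grouping field: push the aggregation down to the nodes.
        System.out.println(sqlUrl(base, "techproducts", stmt, "facet"));
        // High-cardinality grouping field: shuffle to worker nodes instead.
        System.out.println(sqlUrl(base, "techproducts", stmt, "map_reduce"));
    }
}
```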

Parallel SQL builds on two capabilities that already existed in previous versions of Solr: the Export request handler and the Streaming API.

The Export request handler can stream an entire result set. It can be used even with large result sets to export them out of Solr.

Search is not the only function available in the Streaming API. Its functions fall into stream sources and stream decorators. They define how data is retrieved and how any aggregation is performed, and they are designed to work with entire result sets. They can be compounded or wrapped to perform several operations at the same time.

Solr 6.x supports graph queries for finding interconnected data. They are implemented as a local-params query parser that can follow edges from node to node. Graph queries allow optional filters to be applied during the traversal. For example, you can find which of your friends on social media like the “Honda Civic R 2017”, or which airplanes your friends have flown on.
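A graph query is written with the `{!graph ...}` local-params syntax, using `from` and `to` for the edge fields and optionally `traversalFilter` and `maxDepth`. The helper below is a minimal sketch that only assembles such a query string. The field names (`friend_ids`, `id`, `likes`) and the class name are hypothetical, and a real schema would need matching fields.

```java
// Assembles a Solr graph query string; it does not talk to Solr.
final class GraphQuerySketch {

    // traversalFilter may be null, and a negative maxDepth means "no limit given".
    // The filter must not contain double quotes, since it is wrapped in them.
    static String graphQuery(String fromField, String toField,
                             String traversalFilter, int maxDepth, String rootQuery) {
        StringBuilder q = new StringBuilder("{!graph from=")
                .append(fromField).append(" to=").append(toField);
        if (traversalFilter != null && !traversalFilter.isEmpty()) {
            q.append(" traversalFilter=\"").append(traversalFilter).append('"');
        }
        if (maxDepth >= 0) {
            q.append(" maxDepth=").append(maxDepth);
        }
        return q.append('}').append(rootQuery).toString();
    }

    public static void main(String[] args) {
        // Start at one person and follow friendship edges up to two hops,
        // keeping only documents that match the optional filter.
        System.out.println(graphQuery("friend_ids", "id", "likes:civic", 2, "id:alice"));
    }
}
```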

Solr 6.x APIs are more consistent and versioned, endpoint names are friendlier, and output is JSON by default, although “wt” is still supported.

Lucene recently switched its default text scoring from TF*IDF to BM25.
Solr 6.x relies on the latest Lucene trunk, so it inherited the same scoring algorithm. BM25 is a probabilistic model, in contrast to the TF*IDF scoring used previously.
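For intuition, the textbook form of BM25 for a single term can be sketched as follows. This is not Lucene's actual implementation, which differs in details such as how document lengths are encoded. The values k1 = 1.2 and b = 0.75 are the commonly used defaults, and the class name is made up.

```java
// Textbook BM25 for one query term; a sketch, not Lucene's code.
final class Bm25Sketch {
    static final double K1 = 1.2;  // controls term-frequency saturation
    static final double B = 0.75;  // controls document-length normalization

    // Rare terms (small docFreq) get a higher weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Grows with tf but saturates below K1 + 1; longer documents are penalized.
    static double tfPart(double tf, double docLen, double avgDocLen) {
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return tf * (K1 + 1.0) / (tf + norm);
    }

    static double score(double tf, double docLen, double avgDocLen, long docCount, long docFreq) {
        return idf(docCount, docFreq) * tfPart(tf, docLen, avgDocLen);
    }

    public static void main(String[] args) {
        // Same term frequency, but the second document is four times longer than average.
        System.out.println(score(3, 100, 100, 1000, 10));
        System.out.println(score(3, 400, 100, 1000, 10));
    }
}
```

Unlike a raw term-frequency weight, the contribution of repeated terms levels off, so a document cannot win just by repeating a word many times.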

There is a new API to perform Backups and Restores.

Moving to Solr 6.x

First of all, Solr 6.x expects Java 8 or higher to be installed on the host computer.
When no schemaFactory is specified, ManagedIndexSchemaFactory is now used by default, so the schema lives in managed-schema rather than schema.xml.
If no SimilarityFactory is defined, it defaults to SchemaSimilarityFactory. If a fieldType does not declare a similarity, it defaults to BM25.

How to configure SQL Server 2012 Express to allow remote connections

INTRODUCTION

When you try to connect to an instance of Microsoft SQL Server 2012 Express from a remote computer, you might receive an error message. This problem might occur when you use any program to connect to SQL Server.

For example, you receive the following error message when you use the SQLCMD utility to connect to SQL Server:

Sqlcmd: Error: Microsoft SQL Native Client: An error has occurred while establishing a connection to the server. When connecting to SQL Server 2012, this failure may be caused by the fact that under the default settings SQL Server does not allow remote connections.

This problem might occur when SQL Server 2012 Express is not configured to accept remote connections. By default, SQL Server 2012 Express Edition does not allow remote connections.

To configure SQL Server 2012 Express to allow remote connections, you must complete these steps:
• Enable remote connections on the instance of SQL Server that you want to connect to from a remote computer.
• Turn on the SQL Server Browser service.
• Configure the firewall to allow network traffic that is related to SQL Server and to the SQL Server Browser service.

To enable remote connections on the instance of SQL Server 2012 Express and to turn on the SQL Server Browser service, use the SQL Server 2012 Surface Area Configuration tool. The Surface Area Configuration tool is installed when you install SQL Server 2012.

Enable remote connections for SQL Server 2012 Express

You have to enable remote connections for each instance of SQL Server 2012 Express that you want to connect to from a remote computer. To do this, follow these steps:
1. Click Start, point to Programs, point to Microsoft SQL Server 2012 Express, point to Configuration Tools, and then click SQL Server Surface Area Configuration.
2. On the SQL Server 2012 Surface Area Configuration page, click Surface Area Configuration for Services and Connections.
3. On the Surface Area Configuration for Services and Connections page, expand Database Engine, click Remote Connections, click Local and remote connections, click the appropriate protocol to enable for your environment, and then click Apply.

Note: Click OK when you receive the following message:
Changes to Connection Settings will not take effect until you restart the Database Engine service.

4. On the Surface Area Configuration for Services and Connections page, expand Database Engine, click Service, click Stop, wait until the MSSQLSERVER service stops, and then click Start to restart the MSSQLSERVER service.

Enable the SQL Server Browser service

If you are running SQL Server 2012 Express by using an instance name and you are not using a specific TCP/IP port number in your connection string, you have to enable the SQL Server Browser service to allow for remote connections. For example, SQL Server 2012 Express is installed with a default instance name of Computer Name\SQLEXPRESS. You only have to enable the SQL Server Browser service one time, regardless of how many instances of SQL Server 2012 Express you are running. To enable the SQL Server Browser service, follow these steps.

Important: These steps may increase your security risk. These steps may also make your computer or your network more vulnerable to attack by malicious users or by malicious software such as viruses. We recommend the process that this article describes to enable programs to operate as they are designed to, or to implement specific program capabilities. Before you make these changes, we recommend that you evaluate the risks that are associated with implementing this process in your particular environment. If you choose to implement this process, take any appropriate additional steps to help protect your system. We recommend that you use this process only if you really require this process.

1. Click Start, point to Programs, point to Microsoft SQL Server 2012 Express, point to Configuration Tools, and then click SQL Server Surface Area Configuration.
2. On the SQL Server 2012 Express Surface Area Configuration page, click Surface Area Configuration for Services and Connections.
3. On the Surface Area Configuration for Services and Connections page, click SQL Server Browser, click Automatic for Startup type, and then click Apply.

Note: When you click the Automatic option, the SQL Server Browser service starts automatically every time that you start Microsoft Windows.

4. Click Start, and then click OK.

Note: When you run the SQL Server Browser service on a computer, the computer displays the instance names and the connection information for each instance of SQL Server that is running on the computer. This risk can be reduced by not enabling the SQL Server Browser service and by connecting to the instance of SQL Server directly through an assigned TCP port. Connecting directly to an instance of SQL Server through a TCP port is beyond the scope of this article. For more information about the SQL Server Browser service and connecting to an instance of SQL Server, see the following topics in SQL Server Books Online:

• SQL Server Browser Service
• Connecting to the SQL Server Database Engine
• Client Network Configuration

Create exceptions in Windows Firewall

These steps apply to the version of Windows Firewall that is included in the Windows operating system. If you are using a different firewall, see your firewall documentation for more information.

If you are running a firewall on the computer that is running SQL Server 2012 Express, external connections to SQL Server 2012 are blocked unless SQL Server 2012 Express and the SQL Server Browser service can communicate through the firewall. You must create an exception for each instance of SQL Server 2012 Express that you want to accept remote connections and an exception for the SQL Server Browser service.

SQL Server 2012 Express uses an instance ID as part of the path when you install its program files. To create an exception for each instance of SQL Server, you have to identify the correct instance ID. To obtain an instance ID, follow these steps:
1. Click Start, point to Programs, point to Microsoft SQL Server 2012 Express, point to Configuration Tools, and then click SQL Server Configuration Manager.
2. In SQL Server Configuration Manager, click the SQL Server Browser service in the right pane, right-click the instance name in the main window, and then click Properties.
3. On the SQL Server Browser Properties page, click the Advanced tab, locate the instance ID in the property list, and then click OK.

To open Windows Firewall, click Start, click Run, type firewall.cpl, and then click OK.

Create an exception for SQL Server 2012 Express in Windows Firewall

To create an exception for SQL Server 2012 in Windows Firewall, follow these steps:
1. In Windows Firewall, click the Exceptions tab, and then click Add Program.
2. In the Add a Program window, click Browse.
3. Click C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn\sqlservr.exe, click Open, and then click OK.
4. Repeat steps 1 through 3 for each instance of SQL Server 2012 Express that needs an exception.

Create an exception for the SQL Server Browser service in Windows Firewall

To create an exception for the SQL Server Browser service in Windows Firewall, follow these steps:
1. In Windows Firewall, click the Exceptions tab, and then click Add Program.
2. In the Add a Program window, click Browse.
3. Click the C:\Program Files\Microsoft SQL Server\90\Shared\sqlbrowser.exe executable program, click Open, and then click OK.

Note: The path might be different, depending on where SQL Server 2012 Express is installed.
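Once the steps above are complete, a quick way to check from the remote computer that the firewall lets traffic through is to try opening a TCP connection to the SQL Server port. The sketch below uses only the Java standard library. The host name is hypothetical, and port 1433 is only the usual default for a default instance; a named instance such as SQLEXPRESS may listen on a different, dynamically assigned port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Checks whether a TCP port accepts connections; it does not log in to SQL Server.
final class PortCheckSketch {

    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat all as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        String host = "sqlhost.example.com"; // hypothetical server name
        int port = 1433;                      // assumed port, adjust for your instance
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
    }
}
```

A result of `false` usually points at the firewall exception, a stopped service, or a disabled TCP/IP protocol, while `true` only shows that something is listening on that port.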