About Scalding

Scalding is a Scala library that makes it easy to work with, and reason about, data in distributed systems like Hadoop.
It presents data as a collection and lets you compute on it in a manner similar to the Scala collections API, so to the developer
the data appears to be an ordinary collection supporting simple operations like filter and map. The Map and Reduce operations used in Hadoop stem from functional
programming, so they are a natural fit for Scalding, which is built on Scala, a functional programming language.
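
To make this concrete, here is the canonical word-count job written with Scalding's Typed API; a minimal sketch in which the --input and --output paths are supplied by the user:

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  // Read lines from the file given as --input (HDFS or local disk)
  TypedPipe.from(TextLine(args("input")))
    .flatMap(line => line.split("""\s+""").filter(_.nonEmpty)) // split each line into words
    .groupBy(word => word) // group the occurrences of each word
    .size                  // count each group, yielding (word, count)
    .write(TypedTsv[(String, Long)](args("output"))) // write tab-separated results
}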

Scala is easier to read than Java; it is also much more compact and expresses business logic in a straightforward way.

Scalding is built on top of Cascading, an abstraction layer for Hadoop written in Java.
The advantage of using Scalding is that it hides the complexity that Hadoop and MapReduce present.

Scalding operates by letting you think of your data as a flow through a series of pipes.

Scalding is better than Pig. Pig is very good at solving simple, quick tasks.
However, for complex tasks it needs other programming languages to build UDFs, and Pig scripts are hard to unit test.

Scalding comes with three APIs:

Fields API
Typed API, which promotes type safety
Matrix API, which handles matrix operations such as matrix multiplication.

Scala collections live in memory on a single host. Scalding uses Scala and Cascading to operate
on collections of data distributed across a number of commodity servers, yet it gives you the feeling of operating on a normal in-memory collection.

If you’ve seen Spark’s RDD (resilient distributed datasets) – then it’s a very similar concept here.

Top 10 Big Data Trends for 2017

Tableau published a paper on Top 10 Big Data Trends for 2017 that you can find here: http://tabsoft.co/2jXCXar

We disagree that speeding up Hadoop is the number one trend. The author takes an evolutionary approach rather than a revolutionary one. What is needed more and more is event-driven processing: the machinery should react to new incoming data, which is easier to process as it arrives. Even the second trend concedes to some degree that Big Data is not just about utilizing Hadoop-like systems; Hadoop was written for batch processing.

Utilizing reactive streams:

When we were consulting for an electronic book and academic journal publisher, the vendor used Amazon Elastic MapReduce (EMR). The vendor paid around $1.5 million annually for a very large EMR cluster that re-processed all books and journals on a daily basis. In a few instances an error would occur and the vendor had to re-run the whole job. We re-engineered the system so that only new content was processed, as soon as it became "live". The new system saved the company 90% of its Amazon EMR bill, and the newest books and journals now appear online within a few seconds of the content being upstreamed to the vendor.

The fifth trend on the list talks about the variety of data. But that's not new; Gartner published the traits of Big Data back in 2001. What the paper does not elaborate upon is HOW to tackle that variety.

Scala stands for Scalable Language: a language that borrowed the best concepts from various programming languages and then enhanced and extended them. This makes it trivial to create a DSL (Domain Specific Language) on top of Scala. For example, to parse JSON of any kind, a programmer can write a few lines of code that match such a JSON structure in an object-oriented way.
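
As a minimal sketch of the idea (the tiny JSON model below is hypothetical, standing in for a real library such as json4s):

// A hypothetical, minimal JSON model; real projects would use a library such as json4s.
sealed trait Json
case class JObj(fields: Map[String, Json]) extends Json
case class JArr(items: List[Json]) extends Json
case class JStr(value: String) extends Json
case class JNum(value: Double) extends Json

// Pattern matching reads like a description of the JSON shape we expect.
def title(doc: Json): Option[String] = doc match {
  case JObj(fields) => fields.get("title").collect { case JStr(t) => t }
  case _            => None
}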

The paper prepares the reader for the new trends of 2017, but it focuses too much on Hadoop infrastructure and not at all on newer trends that aid Big Data, such as reactive streams, functional programming, elastic systems, non-blocking I/O, and asynchronous processing of "live" data. But we still recommend that you read Tableau's research.

Code Musing

“I am sorry I have had to write you such a long letter, but I did not have time to write you a short one”
Blaise Pascal (1623–1662), French philosopher and mathematician.

At the age of 18 he invented the first calculating machine.


So I wonder: why do we make the same mistake in our code? Let's review a few examples.


Instead of the following code:

private boolean isItemPutEligible(final SolrDocument doc) {
    String putEligibility = "N";
    Object obj = doc.getFieldValue(IS_PUT_ELIGIBLE);
    if (obj != null) {
        putEligibility = obj.toString();
    }
    if ("Y".equalsIgnoreCase(putEligibility)) {
        return true;
    }
    return false;
}

Could be shortened to:

private boolean isItemPutEligible(final SolrDocument doc) {
    Object obj = doc.getFieldValue(IS_PUT_ELIGIBLE);
    return obj != null && "Y".equalsIgnoreCase(obj.toString());
}

Instead of this code:

String result = null;
Object obj = doc.getFieldValue(PROD_ID);
if (obj != null) {
    if (obj instanceof ArrayList<?>) {
        ArrayList<?> al = (ArrayList<?>) obj;
        if (!al.isEmpty()) {
            Object o = al.get(0);
            if (o != null) {
                result = o.toString();
            }
        }
    } else {
        result = obj.toString();
    }
}

Could be shortened to:

Object obj = doc.getFirstValue(PROD_ID);
if (obj != null) {
    return obj.toString();
}
return null;

Both shortened versions are, I think, much easier to read and understand.

In the first instance we simply avoid an unnecessary intermediate variable.

In the second instance we use getFirstValue(), which returns the first value from the list (or the single object) stored in the document under the given field name, or null if there is none.

Just reviewing such examples should inspire you to write less code and drink more of your favorite drink!

What is new in Solr 6.x

Solr 6 builds on the innovation of Solr 5, so first let's take a look at what was done in Solr 5.
There were improvements to the "bin/solr" and "bin/post" scripts that make it easy to start up Solr and add new documents, and more APIs were introduced.
The user interface was rewritten in a modern framework (AngularJS) to allow for more innovation and enhancement in the near future.
Security had been requested for a long time, and it was introduced in Solr 5: plugins were written for Kerberos and for basic authentication and authorization, and there are plugin examples for customization.
Solr 5.4 introduced basic authentication.
In Solr 5.5 rule-based authorization was expanded and became more flexible, and APIs such as the ConfigSet API and the Collections API were expanded to manage collections more flexibly (elegantly).
There is a new script under "bin/solr" for importing and exporting ZooKeeper configs, and there are performance optimizations for faceting on DocValues fields.

There are quite a few new features in Solr 6, but let's focus on a few of them. The big ones are Parallel SQL, Cross Data Center Replication, graph traversal, more modern APIs, and a new Jetty 9.3 with improved performance and support for HTTP/2.

Parallel SQL was introduced to support relational algebra in a scalable manner. It seamlessly combines SQL with Solr's full-text capabilities.

Parallel SQL has two modes: real-time MapReduce and facet aggregation. MapReduce mode is for high-cardinality fields and performs aggregations and distributed joins over the full data set. It uses the concept of shuffling, much like MapReduce frameworks that partition data for greater scalability, so the partitioning key is a very important piece of the data there. The other mode, facet aggregation, pushes the aggregation down to the nodes so that only aggregated data is returned; if you have a lot of data but low cardinality, this option is quite performant.
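
As a hedged sketch of how this can be queried (Solr 6 ships a thin JDBC driver over the Parallel SQL interface; the ZooKeeper address, "books" collection, and field names below are assumptions for the example):

import java.sql.DriverManager

object ParallelSqlSketch extends App {
  // Requires solr-solrj on the classpath; aggregationMode selects facet vs. map_reduce.
  val conn = DriverManager.getConnection(
    "jdbc:solr://localhost:9983?collection=books&aggregationMode=facet")
  val stmt = conn.createStatement()
  // A plain SQL query executed against the hypothetical books collection
  val rs = stmt.executeQuery("SELECT title, author FROM books WHERE author = 'Tolkien' LIMIT 10")
  while (rs.next()) {
    println(rs.getString("title"))
  }
  rs.close(); stmt.close(); conn.close()
}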

Parallel SQL builds on two capabilities that already existed in previous versions of Solr: the export request handler and the Streaming API.

The export request handler provides the ability to stream an entire result set, so even very large result sets can be exported out of Solr.

The search function is not the only function available in the Streaming API. There are also stream sources and stream decorators, which define how data is retrieved and how any aggregation is performed; they are designed to work with the entire result set, and they can be composed, or wrapped, to perform several operations at once.

Solr 6.x supports graph queries for finding interconnected data. This is a local-param query parser that is able to follow nodes along edges, and graph queries allow optional filters to be applied during the traversal. For example, you can find which of your friends on social media like the "Honda Civic R 2017", or which airplanes your friends used to fly.
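
A hedged SolrJ sketch of such a traversal (the "social" collection, its user_id/friend_id fields, and the filter are hypothetical):

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object GraphQuerySketch extends App {
  val client = new HttpSolrClient.Builder("http://localhost:8983/solr/social").build()
  // Start from my node and follow friend edges one level out
  val query = new SolrQuery("{!graph from=friend_id to=user_id maxDepth=1}user_id:me")
  // Filter the traversed documents to those that like this car
  query.addFilterQuery("likes:\"Honda Civic R 2017\"")
  val response = client.query(query)
  println("matches: " + response.getResults.getNumFound)
  client.close()
}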

The Solr 6.x APIs are more consistent and versioned, endpoint names are friendlier, and JSON output is the default, though the "wt" parameter is still supported.

Lucene recently switched to a new text-scoring model: instead of TF*IDF it now uses BM25.
Solr 6.x relies on the latest Lucene trunk, so it inherits the same scoring algorithm. BM25 is a probabilistic model, versus the term-frequency model that was used previously.

There is a new API to perform Backups and Restores.

Moving to Solr 6.x

First of all, Solr 6.x expects Java 8 or higher to be installed on the host computer.
There is no default schemaFactory any more; ManagedIndexSchemaFactory is used instead, so there will be no schema.xml but a managed-schema file.
If no similarityFactory is defined, it defaults to SchemaSimilarityFactory, and if a fieldType is missing a similarity definition, it will default to BM25.

Jewish Calendrical calculations

A well-known mnemonic for calculating days of the week is the Calendar Atbash. An Atbash is a simple cypher where the first letter of the alphabet is replaced by the last, the second by the next to last, and so on. Thus Aleph is replaced by Tof, Beth by Shin and so on; this gives the acronym Atbash.

Applying the Atbash to the first seven days of Pesach, we get

Aleph – Tof – Tisha B’Av
Beth – Shin – Shavuot
Gimel – Resh – Rosh Hashana
Daled – Kuf – Keriat Hatorah, i.e. Simchat Torah, a day devoted to Keriat (“reading of”) the Torah
He – Tzadi – Yom Tzom Kippur, the Day of the Fast of Atonement
Vav – Pe – Purim
Zayin – Ayin – Yom ha-Atzmaut, Israel Independence Day
This is to be read "The first day of Pesach falls on the same day of the week as the date beginning with Tof, i.e. Tisha b'Av", and so on. (The first line is spoilt if that day is Shabbat, so that the fast has to be postponed to Sunday.) Israel Independence Day may also be moved. Note that the Atbash remained incomplete until the creation of the State of Israel introduced this new festival.

About C++ 11

C++ is a very popular language, and after 30 years it is still widely considered one of the best choices for many types of projects, including large-scale systems and application coding.
If you consider the layers of technology in a computer system as a stack, C++ is used to write code at all levels except firmware, with its most common usage at the application level. Today, vast numbers of medium to large scale applications are written in C++. The list is huge and includes Microsoft Office, Adobe Photoshop, Illustrator, InDesign, Firefox, Google Chrome, and the provisioning and billing systems of major phone networks; even major web sites like Amazon, Facebook, and Google are either written in C++ or have significant backend resources written in C++.

Ratified in August 2011, C++11 is the first real extension of the C++ standard. It provides a number of new features, including a range-based for loop, type inference, lambda functions, an unambiguous null pointer constant, and most of TR1.

Technical Report 1 or TR1 is mostly a set of library extensions, including regular expressions, smart pointers, hash tables, and random number generators.

Best practices in designing RESTful APIs

An affordance is a quality of an object, or an environment, which allows a user to perform an action.

You can follow these steps:

  • Identify stakeholders
  • Identify activities
  • Break activities into steps
  • Create API definitions
  • Validate API

The following questions should be asked:

  • Can resources exist one without the other?
  • Does one resource exist when another one exists?
  • Does the relationship between resources require more information than just the links between them?

For example, if an order line cannot exist without its order, it is naturally modeled as a sub-resource, e.g. /orders/{orderId}/lines/{lineId}.



How to set AWS Command Line Interface

$ pip install awscli


The AWS Command Line Interface User Guide walks you through installing and configuring the tool.
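
For example, running "aws configure" prompts for your credentials and defaults (the values shown are the example placeholders from the AWS documentation):

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json

After that, you can begin making calls to your AWS services from the command line.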
$ aws ec2 describe-instances

$ aws ec2 start-instances --instance-ids i-1348636c

$ aws sns publish --topic-arn arn:aws:sns:us-east-1:546419318123:OperationsError --message "Script Failure"

$ aws sqs receive-message --queue-url https://queue.amazonaws.com/546419318123/Test

You can get help on the command line to see the supported services,
$ aws help
the operations for a service,
$ aws autoscaling help
and the parameters for a service operation.
$ aws autoscaling create-auto-scaling-group help

Using Knockout.js with PHP: Best Practices

I recently had a project that made me temporarily shift from my more native C#/ASP.NET environment and use PHP on the back end instead. Like more and more ASP.NET developers these days, I have become accustomed to doing just about everything on the front end with the JavaScript library Knockout. Of course Knockout.js is completely compatible with PHP, since Knockout lives on the front end while PHP lives on the back end, but in combining PHP with Knockout there are a few things I have found that make the match just a bit smoother.

Use json_encode() to pass PHP arrays to Knockout

function getOrders() {
    include_once 'mysql_connect.php';
    $email = $_SESSION['Email'];

    $query = sprintf("SELECT * FROM `Order` WHERE `Email` = '%s' ORDER BY id DESC",
        mysqli_real_escape_string($con, $email));
    $result = mysqli_query($con, $query);
    $data = array();
    while ($row = mysqli_fetch_array($result, MYSQLI_ASSOC)) {
        $data[] = $row;
    }
    return json_encode($data); // json_encode() is the key
}

Then on the front end:

$(document).ready(function () {
    // Pass the JSON-encoded data directly into a JavaScript variable
    var data = <?php echo getOrders(); ?>;
    var vm = new ViewModel(data);
    ko.applyBindings(vm);
});

function ViewModel(data) {
    var self = this;
    self.Orders = ko.mapping.fromJS(data);
}

Use ko.toJS() to send data from your ViewModel to PHP

function ViewModel() {
    var self = this;
    self.Order = {
        FirstName: ko.observable(),
        LastName: ko.observable(),
        URL: ko.observable(),
        Comments: ko.observable()
    };
    self.CreateOrder = function () {
        // Here is where you convert the data to something PHP can swallow
        var data = ko.toJS({ "Data": self.Order });
        $.ajax({
            url: "CreateOrder.php",
            type: 'post',
            data: data,
            success: function (result) {
                // handle the server's response here
            }
        });
    };
}

And then on the back end:

include_once 'mysql_connect.php';
// receive the raw data into a PHP variable
$data = $_POST['Data'];

// extract each field from the raw data
$email = $data['Email'];
$firstName = $data['FirstName'];
$lastName = $data['LastName'];
$comments = $data['Comments'];