Scalding is a Scala library that makes it easy to work with, and reason about, data in distributed systems like Hadoop.
It presents the data as a collection and lets you compute on it in a manner similar to the Scala collections API, so to the developer
a job looks like simple operations such as filter and map applied to a collection. The Map and Reduce steps used in Hadoop stem from functional
programming, which makes them a natural fit for Scalding (Scala being a functional programming language).
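To make the collection style concrete, here is a minimal sketch using only plain, in-memory Scala collections (no Hadoop or Scalding involved). The object name and the sample log lines are made up for illustration; the point is only to show the filter, map and group-and-sum shape that Scalding mirrors.

```scala
// Plain Scala collections only; nothing here is distributed.
object CollectionStyle {
  // Hypothetical sample data: a few log lines held in memory.
  val lines: List[String] = List(
    "error disk full",
    "info job started",
    "error disk slow"
  )

  // filter: keep only the lines that start with "error".
  val errors: List[String] = lines.filter(_.startsWith("error"))

  // flatMap + map: split each line into words and pair every word with 1
  // (this plays the role of the "map" step).
  val pairs: List[(String, Int)] =
    errors.flatMap(_.split("\\s+")).map(word => (word, 1))

  // groupBy + sum: add up the 1s for each word
  // (this plays the role of the "reduce" step).
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
}
```

Reading `CollectionStyle.counts` gives the number of occurrences of each word across the "error" lines only.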
Scala code is typically much more compact than the equivalent Java, which many developers find easier to read, and it expresses business logic in a straightforward way.
Scalding is built on top of Cascading – an abstraction layer for Hadoop, written in Java.
The advantage of using Scalding is that it hides the complexity that Hadoop and MapReduce present.
Scalding operates by allowing you to think of your data as flowing through a series of pipes.
Scalding compares favorably with Pig for complex jobs. Pig is very good at solving simple, quick tasks.
However, complex tasks require UDFs written in other programming languages, and Pig scripts are also hard to unit test.
Scalding comes with three APIs:
Fields API
Typed API – promotes type safety
Matrix API – deals with matrix operations such as matrix multiplication.
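As an illustration of the Typed API, below is a rough word-count sketch. It assumes the scalding-core library is on the classpath; the object and method names (TypedWordCount, countWords, runLocally) are hypothetical, and running through an Execution in local mode is just one convenient way to try a pipeline without a cluster, not the only way to structure a job.

```scala
import com.twitter.scalding._

object TypedWordCount {
  // Describes the pipeline; nothing is computed until it is executed.
  // The element types (String, then (String, Long)) are checked by the compiler.
  def countWords(lines: TypedPipe[String]): TypedPipe[(String, Long)] =
    lines
      .flatMap(line => line.toLowerCase.split("\\s+").toList) // one record per word
      .filter(_.nonEmpty)                                      // drop empty tokens
      .groupBy(identity)                                       // use each word as the key
      .size                                                    // count records per key
      .toTypedPipe

  // Hypothetical helper: runs the pipeline in Scalding's local mode on an
  // in-memory list and collects the result back into a Map.
  def runLocally(lines: List[String]): Map[String, Long] =
    countWords(TypedPipe.from(lines))
      .toIterableExecution
      .waitFor(Config.default, Local(true))
      .get
      .toMap
}
```

The same countWords function could in principle be fed from a file-backed source on Hadoop instead of an in-memory list; only the source and the mode would change, not the pipeline itself.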
Scala collections live in memory on a single host. Scalding uses Scala and Cascading to operate
on collections of data distributed across a number of commodity servers, yet it gives you the feeling of operating on a normal in-memory collection.
If you’ve seen Spark’s RDDs (resilient distributed datasets), the concept here is very similar.