
Scala Functions


Functions are groups of statements that perform a task. In Scala, a function definition looks a lot like a variable definition. The following code shows how a function can be declared.


A simple example of a function in Scala
def add(a: Int, b: Int): Int = a + b
Inside the parentheses are the parameters, each followed by its type; after the parentheses comes the return type, and after the equals sign "=" comes the body of the function. Braces let the body span multiple statements. The following variant prints the sum instead of returning it, so its return type is Unit:

def add(a: Int, b: Int): Unit = {
  println(a + b)
}

A multi-statement function returns the value of its last expression, so no explicit return statement is needed:

def sum(a:Int, b:Int):Int={
  a+b
}


println(sum(10, 20)) // prints 30
add(20, 30)          // prints 50

Points to note
  1. Scala function definitions start with the def keyword.
  2. Scala permits nested function definitions (see the sketch after this list).
  3. Braces {} are optional. For clarity, one can enclose multi-statement functions in braces.
  4. The return statement is optional; a function returns the value of its last expression.
  5. A function that doesn't return any value can declare Unit as its return type, the equivalent of void in other languages.
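
To illustrate point 2, here is a minimal sketch of a nested function. The names factorial and go are hypothetical, chosen just for this example:

def factorial(n: Int): Int = {
  // go is defined inside factorial and is visible only within its body
  def go(i: Int, acc: Int): Int =
    if (i <= 1) acc else go(i - 1, acc * i)
  go(n, 1)
}

println(factorial(5)) // prints 120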

What is Hadoop



Hadoop is a framework for processing huge amounts of data across clusters of computers, using commodity hardware in a distributed computing environment. It can scale from a single server to thousands of machines, each with its own storage. It is therefore a massively parallel execution environment that brings the power of supercomputing to commodity hardware. Hadoop is primarily used for big data analytics.
Hadoop is best understood as an ecosystem of many components, ranging from data storage and data integration to data processing and specialized tools for data analysts.

Hadoop Components


HDFS is the main component of Hadoop: a distributed file system designed to run on commodity hardware. This is where the data is stored, and it provides the foundation for other tools, such as HBase.
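
As a quick illustration, here is a minimal sketch of talking to HDFS from Scala through Hadoop's FileSystem API. It assumes the hadoop-client library is on the classpath and that the configuration points at a running cluster; the /user/demo path is hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration() // picks up core-site.xml if it is available
val fs = FileSystem.get(conf)  // handle to the configured file system

val dir = new Path("/user/demo") // hypothetical HDFS directory
fs.mkdirs(dir)                   // create the directory in HDFS

// list what is stored under the directory
fs.listStatus(dir).foreach(status => println(status.getPath))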

  1. MapReduce: Hadoop’s main execution framework is MapReduce, a programming model for distributed, parallel data processing that breaks jobs into map phases and reduce phases (hence the name). It enables resilient, distributed processing of massive unstructured data sets across commodity clusters, in which each node includes its own storage. A conceptual sketch follows this list.
  2. HBase: A column-oriented NoSQL database. Simply put, HBase is the data store for Hadoop and big data.
  3. Zookeeper: A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. Zookeeper is Hadoop’s distributed coordination service, and many components of Hadoop depend on it.
  4. Oozie: Hadoop’s workflow scheduler. It schedules Hadoop jobs and is integrated with the rest of the Hadoop stack.
  5. Pig: A platform for analyzing large data sets. It has its own scripting language, Pig Latin, which a compiler translates into sequences of MapReduce jobs.
  6. Hive: A high-level, SQL-like language. It works like Pig but translates SQL-like queries into sequences of MapReduce jobs.
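
To make the MapReduce model concrete, here is a minimal sketch of the classic word count written with plain Scala collections. It only models the programming pattern; in a real Hadoop job the map and reduce phases run in parallel across the cluster's nodes, and the input lines below are made up for the example:

val lines = Seq("big data on hadoop", "hadoop stores big data")

// Map phase: emit a (word, 1) pair for every word in every line
val mapped = lines.flatMap(_.split(" ").map(word => (word, 1)))

// Shuffle: group the pairs by key (the word)
val grouped = mapped.groupBy(_._1)

// Reduce phase: sum the counts for each word
val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(counts) // e.g. Map(hadoop -> 2, big -> 2, data -> 2, on -> 1, stores -> 1)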
The Hadoop ecosystem also contains several other frameworks:
  1. Sqoop: A tool to transfer data between Hadoop and relational databases.
  2. Flume: A tool to move data from individual machines into HDFS.
