
Scala Functions


Functions are groups of statements that perform a task. In Scala, a function definition looks a lot like a variable definition. The following code shows how a function can be declared.


A simple example of a function in Scala
def add(a: Int, b: Int): Int = a + b
Inside the parentheses are the parameters, each followed by its type; after the parentheses comes the return type, and after the equals sign "=" comes the body of the function. Braces let the body span multiple statements. The following variant prints the sum instead of returning it, so its return type is Unit:

def add(a: Int, b: Int): Unit = {
  println(a + b)
}

A multi-statement function returns the value of its last expression, so no explicit return statement is needed:

def sum(a:Int, b:Int):Int={
  a+b
}


println(sum(10, 20)) // prints 30
add(20, 30)          // prints 50

Points to note
  1. Scala function definitions start with the def keyword.
  2. Scala permits nested function definitions (see the sketch after this list).
  3. Braces {} are optional. For clarity, one can enclose multi-statement functions in braces.
  4. The return statement is optional; a function returns the value of its last expression.
  5. A function that doesn't return any value can declare Unit as its return type, the equivalent of void in other languages.
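
To illustrate point 2, here is a minimal sketch of a nested function. The names factorial and go are hypothetical, chosen just for this example:

def factorial(n: Int): Int = {
  // go is defined inside factorial and is visible only within its body
  def go(i: Int, acc: Int): Int =
    if (i <= 1) acc else go(i - 1, acc * i)
  go(n, 1)
}

println(factorial(5)) // prints 120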

What is Hadoop



Hadoop is a framework for processing huge amounts of data across clusters of computers, using commodity hardware in a distributed computing environment. It can scale from a single server to thousands of machines, each with its own storage. It is therefore a massively parallel execution environment that brings the power of supercomputing to commodity hardware. Hadoop is primarily used for big data analytics.
Hadoop is best understood as an ecosystem of many components, ranging from data storage and data integration to data processing and specialized tools for data analysts.

Hadoop Components


HDFS is the main component of Hadoop: a distributed file system designed to run on commodity hardware. This is where the data is stored, and it provides the foundation for other tools, such as HBase.
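
As a quick illustration, here is a minimal sketch of talking to HDFS from Scala through Hadoop's FileSystem API. It assumes the hadoop-client library is on the classpath and that the configuration points at a running cluster; the /user/demo path is hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration() // picks up core-site.xml if it is available
val fs = FileSystem.get(conf)  // handle to the configured file system

val dir = new Path("/user/demo") // hypothetical HDFS directory
fs.mkdirs(dir)                   // create the directory in HDFS

// list what is stored under the directory
fs.listStatus(dir).foreach(status => println(status.getPath))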

  1. MapReduce: Hadoop’s main execution framework is MapReduce, a programming model for distributed, parallel data processing that breaks jobs into map phases and reduce phases (hence the name). It enables resilient, distributed processing of massive unstructured data sets across commodity clusters, in which each node includes its own storage. A conceptual sketch follows this list.
  2. HBase: A column-oriented NoSQL database. Simply put, HBase is the data store for Hadoop and big data.
  3. Zookeeper: A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. Zookeeper is Hadoop’s distributed coordination service, and many components of Hadoop depend on it.
  4. Oozie: Hadoop’s workflow scheduler. It schedules Hadoop jobs and is integrated with the rest of the Hadoop stack.
  5. Pig: A platform for analyzing large data sets. It has its own scripting language, Pig Latin, which a compiler translates into sequences of MapReduce jobs.
  6. Hive: A high-level, SQL-like language. It works like Pig but translates SQL-like queries into sequences of MapReduce jobs.
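
To make the MapReduce model concrete, here is a minimal sketch of the classic word count written with plain Scala collections. It only models the programming pattern; in a real Hadoop job the map and reduce phases run in parallel across the cluster's nodes, and the input lines below are made up for the example:

val lines = Seq("big data on hadoop", "hadoop stores big data")

// Map phase: emit a (word, 1) pair for every word in every line
val mapped = lines.flatMap(_.split(" ").map(word => (word, 1)))

// Shuffle: group the pairs by key (the word)
val grouped = mapped.groupBy(_._1)

// Reduce phase: sum the counts for each word
val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(counts) // e.g. Map(hadoop -> 2, big -> 2, data -> 2, on -> 1, stores -> 1)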
The Hadoop ecosystem also contains several other frameworks:
  1. Sqoop: A tool to transfer data between Hadoop and relational databases.
  2. Flume: A tool to move data from individual machines into HDFS.
