Automating infrastructure is not a new thing, but doing it with the ease of setting up a PaaS is still in its infancy. As an interesting solution to this, Canonical has recently come out with Juju, a tool that can easily set up, configure, and provision services, add and change master/slave roles, and provide a continuous integration environment, among other things. This tool has not come out of the wild but is already in production use: Juju and MAAS are the default deployers for OpenStack, so all large OpenStack deployments currently happening on Ubuntu use Juju and MAAS. Additionally, Ubuntu One and other Ubuntu cloud services all run on top of Juju.
Juju might sound similar to Puppet or Chef, but rather than just provisioning a server, it provides entire services. Each service in Juju is backed by a charm: a bundle of scripts that installs and configures the underlying software. The Charm Store contains charms for a wide range of services that can be used to solve a particular task. There are currently hundreds of different charms available, from databases to PHP/Ruby/Python application servers to NoSQL and Hadoop instances, along with a growing community of developers who customize and release their own charms.
One feature worth mentioning is the ability to run a traditional PaaS such as Cloud Foundry as a charm on top of Juju.
For instance, here's the procedure to boot a Rails server:
juju deploy rails myapp --config sample-app.yml
(This yml file contains the URL of the GitHub repository containing the charm.)
juju deploy haproxy
juju add-relation haproxy myapp
Now the database can be created:
juju deploy postgresql
juju add-relation postgresql:db myapp
Then migrate the database:
juju ssh myapp/0 run rake db:migrate
Finally, expose the proxy:
juju expose haproxy
and view the status, which should show you the public URL:
juju status haproxy
The coolest feature is the ability to scale servers horizontally, adding or removing them through the GUI or with commands:
juju add-unit myapp
juju remove-unit myapp
Currently, there is also a charm championship taking place that encourages developers to build customized charms around various themes.
So what are you waiting for?
Jump right in at https://jujucharms.com to see it in action.
Friday, September 27, 2013
Thursday, June 6, 2013
MapReduce 2.0: Next-Generation MapReduce Using YARN
There have been a lot of long-overdue changes in Apache Hadoop, the most popular framework for performing MapReduce over big data.
As users of Hadoop may recall, one of the vulnerabilities of MapReduce was the presence of a single node, the JobTracker, which performed job scheduling as well as resource management and error handling for the map and reduce operations run by those jobs. Until now there had been workarounds addressing the failure of this job-management node, but the conflation of responsibilities inside the JobTracker was never addressed; in particular, several aspects suffered: scalability, cluster utilization, upgradability, and support for workloads other than MapReduce itself.
YARN (Yet Another Resource Negotiator) tackles this problem by splitting resource management and job scheduling into separate daemons that operate per application, where an application is either a single job or a graph of interrelated jobs. Resource management is done by a global ResourceManager (RM), which resembles the old JobTracker but exists independently of any job or application. The work of each application is driven by an ApplicationMaster (AM), which acts as an intermediary between the RM and the nodes; the nodes themselves are managed through per-node NodeManagers (NM), which relay node health and resource usage back to the RM as needed.
A key advantage of YARN is that it supports processing models other than MapReduce. Some problems, such as graph processing, need intermediate or real-time results while processing and traditionally cannot be expressed in MapReduce. Simply put, in the context of Hadoop: if one node's results are required by another during or immediately after processing a record, then plain batch processing is not the answer.
People often confuse the newer version of Hadoop (the top-level project) with MRv2, since Hadoop supports both MRv1 and MRv2. The change is purely architectural, though it can determine the performance of the application in use (especially after the map stage, where older MapReduce implementations often left nodes underutilized during the reduce phase). One key thing to note is that YARN is a framework in itself; Hadoop simply runs MapReduce as an application on top of YARN. Thus Hadoop preserves its existing API while reaping the benefits of the dynamic node allocation done internally by YARN.
[Figure: The YARN architecture]
Labels: apache hadoop, bigdata, cloud, hadoop, mapreduce, mapreduce2.0, YARN
Saturday, September 22, 2012
Going to classes, the online way
Recently, there have been a lot of announcements about free online classrooms for instructor-led e-learning. I was a bit skeptical at first, but now I am glad that I enrolled.
Some of the advantages that online courses offer over conventional courses:
- All the comforts of an online/distance-learning course, for free.
- Vibrant discussion forums that help clear up doubts.
- High-quality material: these courses are taught as-is in real classes at top universities.
- Hands-on learning: the learner gets to write programs and solve practical problems related to the topic.
- Staying focused on a specific topic: working as a professional developer, it is natural to get stuck in the mundane aspects of programming and forget the big picture, since there is no longer the freedom to hack on different topics without time constraints, as in college.
My first course is a 10-week course on Big Data. I had an inkling that it would be about Hadoop and other high-level aspects, but what surprised me is that three weeks in, the instructor is still tirelessly helping us understand the fundamentals: the importance of big data and its key challenges (intelligently broken down into look, listen, learn, connect, predict, correct). This being an IIT postgraduate-level course lends further credibility to its undivided focus on the background aspects of big data.
Another course that has just started is Programming in Scala, taught by none other than the language's inventor himself, Martin Odersky. What is pleasant about the course is that right from day one it uses test-driven development and sbt for building and submitting programming exercises.
I also have high hopes for the upcoming SaaS course offered by MIT, which seems promising and offers me the chance to build a cloud-based solution after a long time.
These myriad courses do take up time, especially after-work hours, but that is a better use of my time than just browsing around. A lot of effort, and the occasional headache, goes into the process, but it is worth a try, whether you are a student, a developer, or a curious bystander touched by some of these subjects in one way or another.
Tuesday, July 31, 2012
VM management with Vagrant
Setting up a virtualized environment on your own machine has always been a headache-inducing and risky task. With VirtualBox, one can run VMs without changing OS internals (as in Xen). The chief disadvantage of using VirtualBox directly is that a virtual machine quickly grows huge, and if you are cloning or distributing it, managing both the VM and its virtual hard disk easily becomes a hassle.
With Vagrant, however, one can quickly create, configure, and delete virtual machines, much like in a professional cloud environment such as Amazon's.
According to its documentation, 'Vagrant gives you the tools to build unique development environments for each project once and then easily tear them down and rebuild them only when they're needed so you can save time and frustration.' The documentation starts with a 5-minute tutorial that demonstrates how easy it is to set up, connect to, and configure different VM instances.
The configuration is handled by a Vagrantfile, which is a ruby DSL. For those who prefer not to write ruby, Vagrant can also hand configuration off to provisioners like Puppet or Chef. Thus, developers and project managers can quickly configure and set up environments for different projects.
By enabling the developer to directly configure the VM, the development process can be streamlined, in line with the DevOps movement.
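To give a feel for the ruby DSL mentioned above, here is a minimal sketch of a Vagrantfile in the Vagrant 1.x syntax of the time; the box name and port numbers are illustrative, not from any particular project:

```ruby
# Vagrantfile -- minimal sketch; box name and ports are illustrative.
Vagrant::Config.run do |config|
  # Base box to build the VM from (added beforehand with `vagrant box add`).
  config.vm.box = "lucid32"

  # Forward guest port 80 to host port 8080 so a web app inside
  # the VM is reachable from the host browser.
  config.vm.forward_port 80, 8080
end
```

Running `vagrant up` in the directory containing this file boots and configures the VM; `vagrant destroy` tears it down again.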
Labels: cloud, cloud deployment, DevOps, Vagrant, virtualization
Sunday, February 20, 2011
Increasing usage of Functional Programming in driving scalable architecture
One of the most exciting technologies I've been pursuing lately is nothing new; like old wine in a new bottle, it provides an alternative way to address emerging performance issues. I am referring to functional programming, which has been growing in importance over the past few years, driven by the huge data-processing applications used in today's enterprise environments. This otherwise academic paradigm is increasingly being used, or at least seriously considered, in production.
Features
Functional programming provides various features that set it apart as a distinct programming paradigm ( http://en.wikipedia.org/wiki/Functional_programming ). I'll explain the relevant features in the context of this post.
Higher-order functions: FP revolves around functions! Think of a function as a small version of a class in OOP. If you've worked with, say, anonymous or inner classes in the past, you've unknowingly been doing FP in OOP, and needless to say, it was tedious to write and maintain. Generics attack a similar problem to first-class functions, but the resulting implementations leave a lot to be desired.
Recursion: Again, a handy feature that provides a lot of bang for a small amount of buck, or code! If you know how to use your functions, you can do a lot without writing boilerplate or plumbing code in your algorithmic implementations.
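The two points above can be sketched in pre-lambda Java; the class and method names here are mine, purely for illustration. The anonymous inner class is the verbose 'FP in OOP' workaround described earlier, and the recursive helper replaces the usual loop plumbing:

```java
import java.util.Arrays;
import java.util.Comparator;

public class FpInOop {
    // An anonymous inner class acting as a "function object": the verbose,
    // pre-lambda way Java programmers did higher-order programming.
    static final Comparator<String> BY_LENGTH = new Comparator<String>() {
        public int compare(String a, String b) {
            return a.length() - b.length();
        }
    };

    // Recursion replacing loop plumbing: sum of the integers 1..n.
    static int sum(int n) {
        return n == 0 ? 0 : n + sum(n - 1);
    }

    public static void main(String[] args) {
        String[] words = {"scala", "fp", "java"};
        Arrays.sort(words, BY_LENGTH); // the comparator is passed like a function
        System.out.println(Arrays.toString(words)); // [fp, java, scala]
        System.out.println(sum(10));                // 55
    }
}
```

In a functional language the comparator collapses to a one-line function literal; the Java version shows how much ceremony the OOP encoding adds.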
Immutability: Now this is what I'm talking about! We are all familiar with multi-threaded programming, yet its tools are hardly ever used, because multi-threading involves sharing mutable resources, which is not a good thing. In most other programming paradigms, we are familiar with the following construct:
int a = 0;  // initialization: a memory location is allocated and set
a = 1;      // the memory address pointed to by a gets 'updated'
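To make the thread-safety payoff concrete (class and variable names below are mine, for illustration only), here is a sketch of an immutable value shared across many threads: because nothing is ever 'updated', any number of concurrent readers get consistent results with no locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ImmutableShare {
    // Assigned once, never updated: safe for any number of reader threads.
    static final int BASE = 5;

    // A pure function of immutable input: same answer from any thread, any time.
    static int compute(int x) {
        return BASE + x;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> results = new ArrayList<Future<Integer>>();
        for (int i = 0; i < 1000; i++) {
            final int x = i; // must be final to be captured by the task
            results.add(pool.submit(new Callable<Integer>() {
                public Integer call() { return compute(x); }
            }));
        }
        int sum = 0;
        for (Future<Integer> f : results) sum += f.get();
        pool.shutdown();
        // 1000 concurrent reads of BASE, no locks, no lost updates:
        // sum of (5 + i) for i in 0..999 = 504500
        System.out.println(sum);
    }
}
```

Had BASE been a mutable field written by the tasks, the result would depend on thread interleaving; with immutable data the outcome is deterministic no matter how many threads run.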
The problem lies with the second statement: we don't know when, and more importantly from which thread, it will be invoked. But if you recall your mathematics, we used notation like the following there:
let a = 0 (initialize a)
a = 5 (bind a to something new; the previous state is ignored, never updated in place)
As the value of a is immutable, we can safely have as many threads read it as we need. This means that, since our code is thread safe, the same code can be executed concurrently or in parallel by 2 or 200,000 invocations without adverse effects.
Applicability in meeting the challenges in scaling and performance
Today's applications need to fully utilize the underlying hardware. Moore's law no longer holds for raw single-core performance, but it still applies if we consider the overall increase in processing capability through hyper-threaded and multi-core hardware. The volume of information generated, and the need to process it, has grown too, leading to the use of parallel computation. This is where FP comes in handy. It is true that infrastructure can be abstracted away from the application (as with cloud platforms), but if we handle parallelism ourselves, we gain greater flexibility and scalability in our solutions.
One feature of the 'newer' breed of hybrid FP languages like F# or Scala that I haven't discussed is the ease with which they let you construct domain-specific languages (DSLs). From a business perspective, DSLs are fast becoming key to bridging the technology-business divide while keeping applications agile and reusable.
My exposure these days is to Scala, a hybrid FP language that runs on the JVM and is interoperable with Java. It is attracting the same kind of curiosity from Java developers right now that Ruby (largely because of Rails) did a few years back. This language aims at scalability within the JVM itself (the name Scala comes from 'scalable language'), making it an ideal candidate for solving middleware performance and scalability issues. The fact that it replaced the Ruby-based message processing at Twitter speaks for itself. I am also working on a research paper on using Scala (and FP in general) to process massive amounts of data in distributed/cloud environments, and will post more information here as it gets finalized.