Saturday, February 15, 2014

Optical Character Recognition with Node.js

Today, I was prototyping an OCR tool to use as a web-based API. My first intention was to develop a desktop version in Python and expose it through a Flask application, but using Node proved to be a lot easier.
Node.js has a library binding for Tesseract, which proved to be quite handy.
I simply installed the library first using npm:

npm install nodecr

Next, in a simple Node application, I processed a user-uploaded image:

ncr.process(filePath, function(error, text){ ....

The process call runs the OCR on the image at filePath and then invokes the callback with either an error or the recognized text.
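To make the callback flow a little more concrete, here is a minimal sketch of how the result could be handled. It is not the code from my repository: recognizeImage and fakeOcr are hypothetical names made up for illustration, and the only thing assumed about the OCR object is the process(path, callback(error, text)) shape shown above. The stub stands in for the real library so the snippet runs without Tesseract installed.

```javascript
// Hypothetical helper (not part of nodecr or the linked repository):
// runs OCR on one file and returns trimmed text via a Node-style callback.
// `ocr` is assumed to be any object exposing process(path, callback(error, text)).
function recognizeImage(ocr, filePath, done) {
  ocr.process(filePath, function (error, text) {
    if (error) {
      // Hand failures (for example an unreadable image) back to the caller.
      return done(error);
    }
    // Strip leading and trailing whitespace from the recognized text.
    done(null, String(text).trim());
  });
}

// Stub OCR engine, an assumption used only so the sketch is runnable as-is.
var fakeOcr = {
  process: function (path, callback) {
    callback(null, '  Hello from ' + path + '\n');
  }
};

recognizeImage(fakeOcr, 'upload.png', function (error, text) {
  console.log(error ? 'OCR failed: ' + error.message : text);
});
```

In the real application the first argument would be the OCR library object instead of the stub (ncr in the line above, assuming it is what require('nodecr') returns), and filePath would point at the uploaded file on disk.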

I have published it as a sample application at https://github.com/SumitBisht/node-ocr and hope you will find it helpful.
Note that this is a very basic form of OCR: images need to be cleaned up before they are passed to it, which is something I am still working on.

Sunday, February 9, 2014

Book Review: Thinking With Data

Thinking With Data: How to Turn Information into Insights

During the past week, I read this book by Max Shron, which addresses the big data challenge from a different perspective: what questions to ask of the data to get the most beneficial answers for its owners. The book is relevant not just to people working in some niche segment, but to anyone associated with big data in general, from students and researchers to data analysts and developers. If you are expecting a discussion of some big data technology or implementation, you'll be disappointed. Instead, the book focuses on the problem of deciding what to find in the data. It breaks the scope of this problem into four parts: context, needs, vision, and outcome.

I am currently working full time on a big data project where similar problems crop up. For instance, we know how to run a predictive analysis algorithm, but the main challenge is choosing the specific algorithm and fields that produce results which actually solve business problems. As the author works in data strategy consultancy and is a former data scientist, the tone of the book is quite practical, and it uses real-world examples to explain its concepts where needed.

One area where I felt the book was weak is its assumption that the problem will be a greenfield one, and that legacy or existing systems will not influence the decision-making process for proposed solutions. In practice, big data strategies that are already in place can act as a guideline for future work. The other thing I missed was a discussion of anti-patterns in formulating a big data strategy, such as what not to do while tinkering with data and algorithms to extract intelligence.

In spite of being shorter than other books on the same topic, this book does a good job overall of discussing what to extract in big data analysis. It is definitely a must-read and a reference for anyone dealing with that problem who wants to avoid presenting unnecessary noise instead of meaningful data.

Disclaimer: This book was provided to me by O'Reilly under their Blogger Review program.