Using Factory Design Patterns in MapReduce Design for Big Data Analytics

White Paper

www.hcltech.com

TABLE OF CONTENTS

Abstract
Abbreviations
Market Trends and Challenges
Solution
Case Study
  Revenue Benchmarking
  MR Latency Benchmarking
    Word Count with Combiner
    Word Count without Combiner
Best Practices
Conclusion
Author Info

Adaptive MapReduce


Abstract

This paper explores the various MapReduce design patterns and arrives at a unified working solution in the form of a library. The library can adapt itself to any data processing need that MapReduce can serve. This would not only save HCL and its clients many man-hours, but would also enforce the good practices of MapReduce design patterns in the code by default.

HCL Technologies has been working with multiple clients over the last couple of years in verticals such as ISPs, Aero, Banking & Finance, and Media & Entertainment, delivering services and solutions in the Big Data/Data Analytics domain. One fundamental problem that all of these companies brought to HCLT was processing big data that arrives in different formats, is spread across multiple sources, and has one or more correlational mapping parameters. This creates a need for a unified library that can act as a bridge for solving these varied cross-domain problems while applying the good practices of MapReduce.

Abbreviations

Sl. No.   Acronym   Full form
1         AMR       Adaptive MapReduce

Market Trends and Challenges

Hadoop efficiently solved the Volume and Velocity problems of Big Data; however, there is still a gap that calls for a solution which uses existing frameworks to solve the Variety problem. Solving the third V (Variety) boils down to handling data processing seamlessly even when the data type or the processing algorithm changes. Clients generally come up with ad hoc data source, processing, and mapping problems, and we have to implement the appropriate MR programs. However, because the problems and data sources are treated in isolation, standalone programs are written, which results in redundant effort within and across teams and projects. Clients often lack clear visibility of the full requirements at the start and may ask midway to include another data source. In most cases this involves a lot of rework, which becomes a scope change from a project management perspective, and clients generally do not want to reschedule much.

The project we are currently implementing for the largest aerospace company is a pre-production application that will expand into a full-time production environment in the near future. We currently have visibility into only 3 data sources, and in production the number of data sources would be at least 5 times more. The client has asked that this expansion require minimal code changes and no change at all to the architecture. This challenge is in line with the problem described above.

Fig. 1. MPP report highlighting the effort in man-days. Tasks 49 to 52, which include data processor/MR job development and the unit test with representative data, are planned at 5 days each. Report & Dashboard development (35 days) covers tool evaluation for reports and dashboard (5 days), development of 3 reports (15 days), and dashboard development (10 days).

As the report above shows, supporting each data processing algorithm takes about 5 man-days of development effort alone. With AMR, the need for such cycles can be eliminated.

As in any programming paradigm, MR has its own set of design patterns. These patterns are generally based on good practices that have evolved out of years of research and implementation in the industry. At present, MR programs are not always written using these patterns, yet a considerable improvement in performance has been observed when they are used. Introducing a library/framework would make projects follow the good practices of MR by default. It would also let projects map their processing logic to a pattern quickly, without much research, and would ease the development effort considerably.

The HCLT Analytics group has many customizable off-the-shelf solutions for Data Ingestion, Data Persistence, and Multi-Tenancy; however, we do not have a framework/library for core data processing on Hadoop.

Fig. 2 shows that the degree to which software is customized plays an important role in project acquisitions. Hence a highly customizable solution for the Big Data processing module can add great value to HCLT as a company. It will enable us to pursue project acquisitions with complete solutions for every aspect of Data Analytics.

Fig. 2. Major variables affecting software acquisition. (a) Degree to which acquired software is customized: entirely off-the-shelf software; off-the-shelf software, partly customized; entirely custom software. (b) Scale of acquisition, or degree to which the overall acquisition is acquired as separate components: full system; several components; single component.

Solution

We decided to approach this problem by first analysing the MapReduce design patterns. There are 23 patterns as of now, grouped into six categories:

- Summarization: Numerical Summarization, Inverted Index Summarization, Counting with Counters
- Filtering: Filtering, Bloom Filtering, Top Ten Items, Distinct
- Data Organization: Structured to Hierarchical, Partitioning, Binning, Total Order Sorting, Shuffling
- Join: Reduce Side Join, Replicated Join, Composite Join, Cartesian Products
- Meta Patterns: Job Chaining, Chain Folding, Job Merging
- Input and Output: Generating Data, External Source Data, External Source Input, Partition Pruning

The idea was to identify the commonality across these patterns and to understand how the implementation details of each pattern depend on one another. We found that each pattern requires at least the following:

- Input and Output Paths: Which dataset should be processed? Where should the output be written?
- Class of Action required: for example, Filtering, Aggregation, etc.
- Processing Details: Which set of fields is required, and how should they be processed?
- Input and Output Data Types: What is to be processed?
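The four items above can be captured in a single job description object. The following is only a minimal sketch: the class name `JobSpec`, its field names, and the example paths are hypothetical and are not taken from the AMR library.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical job description; the names are illustrative, not AMR's API.
@dataclass(frozen=True)
class JobSpec:
    input_path: str             # which dataset to process
    output_path: str            # where the output is written
    action: str                 # class of action, e.g. "filtering", "aggregation"
    fields: List[str]           # which fields the processing needs
    input_type: str = "text"    # data type read by the job
    output_type: str = "text"   # data type written by the job

    def missing(self) -> List[str]:
        """Return the names of required items that were left empty."""
        required = {
            "input_path": self.input_path,
            "output_path": self.output_path,
            "action": self.action,
            "fields": self.fields,
        }
        return [name for name, value in required.items() if not value]


spec = JobSpec(
    input_path="/data/in/articles",       # assumed example path
    output_path="/data/out/wordcount",    # assumed example path
    action="aggregation",
    fields=["word"],
)
print(spec.missing())  # an empty list means the description is complete
```

Keeping all four items in one immutable object means a job can be validated before it is handed to any pattern-specific code.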

The question we asked ourselves was how to create a library/framework that can instantiate the MR Job objects required to serve any MR pattern. The well-established Factory design pattern was used for this purpose.

In the diagram, different shapes are created using the Factory pattern. Each shape is implemented as a concrete class. The Factory is passed the information needed to create an object, instantiates the matching concrete class, and returns a shape object.
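The shape example can be sketched in a few lines of Python. The specific shapes (`Circle`, `Square`, `Rectangle`) and the string keys are assumptions chosen for illustration; the paper only states that shapes are built from concrete classes by a factory.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    """Common interface that every concrete shape implements."""

    @abstractmethod
    def draw(self) -> str:
        ...


class Circle(Shape):
    def draw(self) -> str:
        return "Circle.draw()"


class Square(Shape):
    def draw(self) -> str:
        return "Square.draw()"


class Rectangle(Shape):
    def draw(self) -> str:
        return "Rectangle.draw()"


class ShapeFactory:
    """Maps a name to a concrete class and instantiates it on request."""

    _registry = {"circle": Circle, "square": Square, "rectangle": Rectangle}

    def create(self, kind: str) -> Shape:
        try:
            return self._registry[kind.lower()]()
        except KeyError:
            raise ValueError(f"unknown shape: {kind!r}") from None


factory = ShapeFactory()
print(factory.create("Circle").draw())
```

The caller depends only on `Shape` and on the name it passes in, so adding a new shape means registering one more class rather than changing the calling code.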


In AMR we created concrete MR classes for every MR design pattern. The information about which class to instantiate is passed to the Factory through an XML configuration file, as shown in the diagram above. When data comes into the system, the appropriate object is instantiated according to the rules set for the source/algorithm, and the MR Job is started. The design is in its nascent stages: although we currently use the Factory pattern, we can gradually evolve towards a Builder pattern when we want finer granularity in the data processing. As of now, the generic version of the library is a work in progress.

* We cannot currently reveal the original class diagrams or the full config file details due to an NDA.
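Since the real class diagrams and configuration format are not public, the following is only a sketch of how a configuration-driven factory of this kind could look. The XML element and attribute names, the class names, and the paths are all invented for illustration, and the job classes merely stand in for real Hadoop jobs.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: element and attribute names are invented;
# the actual AMR config format is under NDA.
CONFIG_XML = """
<amr>
  <job source="weblogs" pattern="filtering"
       input="/data/in/weblogs" output="/data/out/errors"/>
  <job source="sales" pattern="numerical_summarization"
       input="/data/in/sales" output="/data/out/totals"/>
</amr>
"""


class MRJob:
    """Stand-in for a configured MapReduce job (no Hadoop calls are made)."""

    def __init__(self, input_path: str, output_path: str):
        self.input_path = input_path
        self.output_path = output_path

    def describe(self) -> str:
        return f"{type(self).__name__}: {self.input_path} -> {self.output_path}"


class FilteringJob(MRJob):
    pass


class NumericalSummarizationJob(MRJob):
    pass


class MRJobFactory:
    """Chooses the concrete job class for a data source from the XML rules."""

    _patterns = {
        "filtering": FilteringJob,
        "numerical_summarization": NumericalSummarizationJob,
    }

    def __init__(self, config_xml: str):
        root = ET.fromstring(config_xml.strip())
        # One rule per data source, keyed by the source name.
        self._rules = {job.get("source"): dict(job.attrib) for job in root.iter("job")}

    def create(self, source: str) -> MRJob:
        rule = self._rules.get(source)
        if rule is None:
            raise KeyError(f"no rule configured for source {source!r}")
        job_class = self._patterns[rule["pattern"]]
        return job_class(rule["input"], rule["output"])


job = MRJobFactory(CONFIG_XML).create("sales")
print(job.describe())
```

In this sketch, onboarding a new data source that uses an existing pattern requires only a new `<job>` entry, which mirrors the goal of minimal code changes and no architectural change.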

Case Study

The quantitative benefits of AMR are mostly measurable; in addition, the framework/library has the potential to win us project acquisitions. We have not yet taken the solution to our sales teams, who would be able to provide those figures. Through latency and cost benchmarking we can illustrate the measurable parameters as follows.

Revenue Benchmarking

Let us assume an average effort of 5 man-days for onboarding a data source. If AMR reduces this to an average of 4 days per data source, we can claim a 20% reduction in the development effort needed to onboard a new data source.

MR Latency Benchmarking

The showcased example is the simplest one in MR, Word Count, but the benchmarks clearly highlight the advantages of using a design pattern.

Data Set: NY Times news articles
Source: ldc.upenn.edu
Documents = 300000
No. of Words = 102660
Size of Data = 1 GB

Word Count with Combiner

The MR Job with the Combiner takes about 40 min to complete, as evident in the screenshot. The CPU time taken is about 1964120 ms. One can notice that the Combine Input/Output Records counters are populated in the screenshot.
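The role of the Combiner in Word Count can be illustrated outside Hadoop with a small in-memory simulation. This is a sketch under simplifying assumptions: each inner list stands for one mapper's input split, the function names are hypothetical, and the number of shuffled records is used as a rough proxy for the data moved between the map and reduce phases.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for line in split for word in line.split()]


def combine(pairs):
    """Pre-aggregate one mapper's output locally, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_phase(shuffled):
    """Sum the counts received for each word."""
    totals = defaultdict(int)
    for word, count in shuffled:
        totals[word] += count
    return dict(totals)


def word_count(splits, use_combiner):
    """Return the final counts and the number of records sent to the reducer."""
    shuffled = []
    for split in splits:
        pairs = map_phase(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_phase(shuffled), len(shuffled)


splits = [["to be or not to be"], ["to see or not to see"]]
counts_with, shuffled_with = word_count(splits, use_combiner=True)
counts_without, shuffled_without = word_count(splits, use_combiner=False)
print(shuffled_with, shuffled_without)
```

Both runs produce the same counts; the run with the combiner simply hands fewer records to the reducer, at the cost of extra work on the map side.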

Now, as a control measure, we comment out the Combiner class and run the program again.

Word Count without Combiner

The MR Job without the Combiner takes about 42.5 min to complete. The CPU time taken is about 1853760 ms. One can notice that the Combine Input/Output Records are 0 in the screenshot.

We can deduce the following from the above:

- With the Combiner there is a gain of about 2.5 min in processing latency.
- With the Combiner there is an increase of about 6% in CPU time utilization and 2% in physical memory utilization, which indicates greater use of the machine resources. Higher utilization of the available machine resources is generally desirable in a distributed environment.
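The figures quoted in this case study can be checked with simple arithmetic. The sketch below uses only the numbers stated in the text (5 and 4 man-days, 40 and 42.5 minutes, 1964120 and 1853760 ms of CPU time); the helper name `percent_change` is an assumption made for illustration.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Revenue benchmarking: onboarding effort per data source, in man-days.
effort_reduction = -percent_change(5, 4)

# Latency benchmarking: Word Count with and without the Combiner.
minutes_with, minutes_without = 40.0, 42.5
cpu_ms_with, cpu_ms_without = 1964120, 1853760

latency_gain_min = minutes_without - minutes_with
cpu_increase = percent_change(cpu_ms_without, cpu_ms_with)

print(f"effort reduction: {effort_reduction:.0f}%")
print(f"latency gain: {latency_gain_min} min")
print(f"CPU time increase with Combiner: {cpu_increase:.1f}%")
```

The CPU figure works out to just under 6%, which matches the rounded value quoted in the text.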

Best Practices

We are applying the industry's best practices and bringing them together under one umbrella. This would result in substantial qualitative benefits in terms of program code and processes.

The quality objective of HCL as an organization is: “We shall satisfy our customers by delivering quality products and services that meet their requirements on time, every time”. AMR as a framework supports a high level of quality in the products and services we develop for Big Data processing.

We also believe that “The quality of a product is largely determined by the quality of the process that is used to develop and maintain it”. By introducing AMR we would be able to enforce a standardized MR process across the organization, based on the industry's best practices in terms of design patterns, thus ensuring a high level of quality in the process itself. “On time Delivery, Cost Control, Enhance Customer Satisfaction and Continual Service Improvement” are the key quality objectives of HCLT, and AMR would allow us to realize most of these goals effectively. One of the core principles of quality is reuse, which AMR promotes by reusing MR code.

The tools used for developing the library are free, open-source tools, none of which is proprietary to the client or to any company. However, it may be noted that the AMR concept and the library developed are proprietary to HCLT as a whole.

Conclusion

Key domains where Big Data is in use today include Aero, Auto, Manufacturing, Public Sector, Governance, Health Care, and Media, and the list goes on. Each of these domains has unique processing needs for its data sources and algorithms, which AMR can address. The solution is also domain independent: the only modification required is to the configuration file used to run the program. The solution can be used as-is, as a library, in any scenario where MR is used for processing data.

The solution does not depend on a particular library version or tool. It can support upgrades or modifications in the supporting libraries as long as there is no major change in the implementation of MapReduce itself. We are currently using it with Cloudera Hadoop 4/5 releases as well as vanilla Apache Hadoop.


Author Info

Kinnar Kumar Sen
HCL Engineering and R&D Services
