Ramu - Hadoop/Spark Developer with 6 years of experience
Ramu | Mobile: +91-9866407641
Email: [email protected]
SUMMARY:
Over 6 years of professional IT experience, including 3+ years in the Big Data and Hadoop ecosystem along with Spark.
Experience in requirement gathering, designing, developing, testing, implementing and maintaining systems.
Experience in all phases of the Software Development Life Cycle (SDLC).
Expertise in Hadoop architecture and its various components: Hadoop Distributed File System (HDFS), MapReduce, NameNode, DataNode, JobTracker, TaskTracker, Secondary NameNode and YARN.
Expertise in developing and implementing big data solutions and data mining applications on Hadoop using Hive/Hive2, Pig, Spark, Sqoop, Impala, HUE and Oozie workflows.
Expertise in working with Hue (the Hadoop user interface), used for project development and Hadoop testing.
POC experience in Spark with single RDDs, pair RDDs and DStreams.
Extensive expertise in extracting data from, and loading data to, various sources including Oracle, MS SQL Server, Teradata, flat files and XML files.
Expertise in the Talend ETL data integration tool.
Extensive expertise in developing XSDs and XSLTs, and preparing XML files conforming to the XSD, to parse the XML data into flat files for processing into HDFS.
Developed Avro schemas to create Avro and Parquet tables in Hive using the Avro schema URL.
Good experience working with SerDes for Avro- and Parquet-format data.
Good experience in developing reports using Hive queries and Hive UDFs, and in preparing Pig scripts and Pig UDFs for analytics per client requirements.
Good experience in writing UNIX shell scripts.
Extensive expertise in Hadoop support and maintenance.
Experience with data analysis; able to implement complex and sophisticated SQL logic.
Develops and reviews project plans, identifies and resolves issues, and communicates the status of assigned projects to users and managers.
Expertise in Microsoft Azure HDInsight, including creating Hadoop clusters dynamically on top of Microsoft Azure and developing projects per client requirements.
Knowledge of IBM BigInsights: analytic applications, the IBM big data platform, accelerators and information integration.
Excellent understanding of the Hadoop MapReduce programming paradigm.
Good knowledge of Hadoop cluster administration, and of monitoring and managing Hadoop clusters and infrastructure using Cloudera Manager.
Experience in understanding and managing Hadoop log files.
Experience in importing and exporting data using Sqoop and Flume.
Good knowledge of job workflow scheduling and monitoring tools such as Oozie.
Experience in developing Hadoop integrations for data ingestion, data mapping and data processing capabilities.
Experience in performing offline analysis of large data sets using components from the Hadoop ecosystem.
Sound knowledge of the NoSQL database HBase and its architecture.
Good knowledge of Impala.
Experience in retrieving data from Business Objects universes, personal data files, stored procedures and RDBMSs, and in creating complex, sophisticated reports with multiple data providers and drill-down and slice-and-dice features using Business Objects.
Excellent problem-solving and communication skills.
Experience in integrating various data sources such as SQL Server, Oracle, Teradata, flat files, DB2 and mainframes.
Developed multiple proofs of concept to justify the viability of the ETL solution, including performance and compliance with non-functional requirements.
Conducted Hadoop training workshops for development teams as well as directors and management to increase awareness.
Prepared presentations of solutions to Big Data/Hadoop business cases and presented them to company directors to get the go-ahead on implementation.
Designed the end-to-end ETL flow for a feed with millions of records flowing in daily, using the Apache tools/frameworks Hive, Pig and Sqoop for the entire ETL workflow.
Set up the Hadoop cluster; built Hadoop expertise across the development, production support and testing teams; enabled production support functions; and optimized Hadoop cluster performance in isolation as well as in the context of production workloads/jobs.
Highly motivated to work independently, take responsibility and prioritize multiple tasks.
TECHNICAL SKILLS:
Big Data Ecosystems: Hadoop (MapReduce, HDFS), YARN, Zookeeper, Pig, Hive, Sqoop, Flume, Spark, Hue, Oozie
ETL Tools: Informatica
Databases: Oracle, SQL Server, MySQL, Teradata, NoSQL, DB2
Software Tools: SQL*Plus, Toad, SQL*Loader
Programming Languages: XML, REXX, COBOL, JCL, PL/I
Operating Systems: Linux, Windows
Reporting Tools: SAP BO
Experience:
Organization Designation Duration
Infosys India Pvt Ltd    Technology Analyst    (April 2014 to date)
IBM India Pvt Ltd    System Engineer    (July 2010 to March 2014)
Projects Profile:
1. Project Name: DDSW (Dealer Data Staging Warehouse)
Client Caterpillar
Role Hadoop Developer
Organization Infosys Pvt Ltd, India
Duration Sep 2014 to date
Environment Distribution: Cloudera Hadoop Distribution
Components: HDFS, MapReduce, YARN
Ecosystem: Hive, Pig, Impala
Middleware: Sqoop
Scripting: UNIX shell scripting and Ruby scripting
Workflow: Oozie
HDFS user interface: Hue (Hadoop User Experience)
Database: Oracle
ETL: DataStage
Source file: XML data
Project Description:
Caterpillar’s business model originates from a guide, issued in the 1920s, that established territory relationships with a number of Dealer affiliates. These largely autonomous relationships allowed the Dealers to develop their own models for tracking important data, such as customers and inventory that relate to local market conditions, including government regulation and customary business practices.
The Dealer Data Staging Warehouse (DDSW) platform stages the data received from Caterpillar’s Dealers and prepares them for consumption for a wide variety of uses, such as customer portal services, analytics for equipment monitoring, parts pricing, and customer lead generation, and other emerging applications.
The DDSW project is an ETL pipeline between per-dealer inbound data and a per-domain dataset. DDSW is charged with accepting, validating, transforming, securing, and exposing Dealer data for consumption by various Caterpillar consumers.
Consumer access to all dealer data is constrained by a View, which omits data to which a Consumer should not have access. Access rules are maintained by a matrix of Domains and Dealers permitted to each Consumer, which in turn informs the configuration of Views.
Contribution:
Understanding business needs, analysing functional specifications and mapping those to develop HQLs and Pig Latin scripts.
Designed XSDs and XSLs to parse the XML structure file into pipe-delimited format, to facilitate effective querying of the data.
Hands-on experience with Pig and Hive user-defined functions (UDFs).
Executed Hadoop ecosystem jobs and applications through Apache Hue.
Prepared the Avro schema structure to create the Hive tables, and also created the Parquet-format tables to process the pipe-delimited data.
Feasibility analysis (for the deliverables): evaluating the feasibility of the requirements against complexity and timelines.
Involved in creating folders for the code, lib and data, including the Avro schema, to execute the project in a properly structured manner.
Wrote Pig scripts for data analysis; the end result is processed to HDFS.
Implemented Hive tables and HQL queries for the reports.
Wrote and used complex data types in Hive; stored and retrieved data using HQL in Hive.
Developed Hive queries to analyse reducer output data.
Developed Pig Latin scripts to extract data from the source system.
Involved in extracting data from Hive and loading it into an RDBMS using Sqoop.
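The XSD/XSLT-driven XML-to-pipe-delimited parsing described above can be sketched in plain Python using only the standard library; the element and field names here are invented for illustration and are not the project's actual schema.

```python
# Minimal sketch of flattening XML records into pipe-delimited lines for HDFS.
# Element names (dealer, record, part_no, qty) are hypothetical placeholders.
import xml.etree.ElementTree as ET

SAMPLE = """
<dealer>
  <record><part_no>AB-100</part_no><qty>4</qty></record>
  <record><part_no>CD-200</part_no><qty>7</qty></record>
</dealer>
"""

def xml_to_pipe_delimited(xml_text, fields=("part_no", "qty")):
    """Flatten each <record> element into one pipe-delimited line."""
    root = ET.fromstring(xml_text)
    lines = []
    for rec in root.iter("record"):
        # findtext returns "" for a missing field, keeping columns aligned
        lines.append("|".join(rec.findtext(f, default="") for f in fields))
    return "\n".join(lines)

print(xml_to_pipe_delimited(SAMPLE))
# Each output line ("AB-100|4", "CD-200|7") is then ready to land in HDFS
```

In the project itself this transformation was done with XSD/XSLT; the sketch only shows the shape of the output the Hive and Pig jobs would consume.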
Spark Mini project (POC):
2. Project Name: DDSW (Dealer Data Staging Warehouse)
Client Caterpillar
Role Hadoop Developer
Organization Infosys Pvt Ltd, India
Duration Feb 2015– till date
Environment Distribution: Cloudera Hadoop Distribution
Components: HDFS, MapReduce, YARN
API: Scala
Ecosystem: Hive, Impala and Spark
Middleware: Sqoop
Scripting: Shell, Python and Ruby
Workflow: Oozie
Scheduling: Hue (Hadoop User Experience)
Database: Oracle
ETL: DataStage
Source file: XML data
Project Description:
The Dealer Data Staging Warehouse (DDSW) platform stages the data received from Caterpillar’s Dealers and prepares them for consumption for a wide variety of uses, such as customer portal services, analytics for equipment monitoring, parts pricing, and customer lead generation, and other emerging applications.
The DDSW project is an ETL pipeline between per-dealer inbound data and a per-domain dataset. DDSW is charged with accepting, validating, transforming, securing, and exposing Dealer data for consumption by various Caterpillar consumers.
Consumer access to all dealer data is constrained by a View, which omits data to which a Consumer should not have access. Access rules are maintained by a matrix of Domains and Dealers permitted to each Consumer, which in turn informs the configuration of Views.
Contribution:
Understanding business needs, analysing functional specifications and mapping those to develop Spark SQL.
Designed XSDs and XSLs to parse the XML structure file into a pipe-delimited text file, to facilitate effective querying of the data.
Created file-based RDDs from that text file after parsing the XML, and also converted the same RDD to DataFrames to compare the performance of the existing MapReduce methodology against the Spark methodology.
Created Avro- and Parquet-format tables and stored the data in Avro and Parquet format.
Developed and implemented Spark SQL queries by connecting to the Hive tables to process the data, and created views used to build reports in Tableau.
Executed Hadoop ecosystem jobs and applications through Apache Hue.
Feasibility analysis (for the deliverables): evaluating the feasibility of the requirements against complexity and timelines.
Involved in creating folders for the code, lib and data, including the Avro schema, to execute the project in a properly structured manner.
Wrote Pig scripts for data analysis; the end result is processed to HDFS.
Developed RDDs that processed the data in Spark, and wrote Pig scripts to analyse the data per client requirements.
Involved in extracting data from Hive and loading it into the Oracle RDBMS using Sqoop.
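The Avro/Parquet table pattern described above can be sketched in HiveQL; the table names, HDFS location, and schema URL below are hypothetical placeholders, not the project's actual definitions.

```sql
-- Hypothetical sketch of an Avro-backed Hive table driven by a schema URL,
-- plus a Parquet copy for faster columnar scans. All names are illustrative.
CREATE EXTERNAL TABLE dealer_inventory_avro
STORED AS AVRO
LOCATION '/data/ddsw/dealer_inventory'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/dealer_inventory.avsc');

-- Materialize the same data in Parquet format
CREATE TABLE dealer_inventory_parquet STORED AS PARQUET
AS SELECT * FROM dealer_inventory_avro;
```

Keeping the Avro schema in a `.avsc` file referenced by `avro.schema.url` lets the table definition evolve with the schema file rather than with DDL changes.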
3. Project Name: aEDW Enhanced Capabilities Project
Client Toyota Financial Services
Role Hadoop Developer
Organization Infosys Pvt. Ltd, India
Duration April 2014 - August 2014
Environment Distribution: Cloudera Hadoop Distribution
Components: HDFS, MapReduce
Ecosystem: Hive, Pig
Middleware: Sqoop
Scripting: Shell
Workflow: Oozie
Scheduling: Autosys
Database: Teradata
Project Description
The Active Enterprise Data Warehouse (aEDW) is being developed as a central repository for all lines of business, from which reports are generated. The data is extracted from different legacy systems such as SQL Server, Oracle, Sybase and MySQL using Informatica. This data is transformed and transported to the centralized repository using Teradata utilities. Summary tables are derived, and subject-specific data marts are designed for faster and more accurate analysis. These data marts are used for current as well as future analytic and reporting needs. Reports are developed using Cognos. The objective of the aEDW customer account profile is to provide a 360° view of the Toyota Financial Services customer life cycle nationwide.
Contribution
Understanding business needs, analysing functional specifications and mapping those to develop HQLs and Pig Latin scripts.
The project runs in a Teradata environment and is gradually being moved to a Hadoop system.
Created Hive tables with structures similar to the Teradata tables, connected to Teradata via JDBC through Sqoop, and imported the tables into Hive.
Feasibility analysis (for the deliverables): evaluating the feasibility of the requirements against complexity and timelines.
Involved in creating code, lib and data folders to execute the project in a properly structured manner.
Performed data migration from legacy RDBMS databases to HDFS using Sqoop.
Wrote Pig scripts for data analysis; the end result is processed to HDFS.
Implemented Hive tables and HQL queries for the reports.
Wrote and used complex data types in Hive; stored and retrieved data using HQL in Hive.
Developed Hive queries to analyse reducer output data.
Highly involved in designing the next-generation data architecture for unstructured data.
Developed Pig Latin scripts to extract data from the source system.
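A Sqoop import from Teradata into Hive along the lines described above might look like the following; the host, credentials, database, and table names are placeholders, not project values.

```shell
# Hypothetical sketch of a JDBC import from Teradata into a Hive table.
# All connection details and names below are invented for illustration.
sqoop import \
  --connect jdbc:teradata://td-prod/DATABASE=finance \
  --driver com.teradata.jdbc.TeraDriver \
  --username etl_user -P \
  --table ACCOUNT_PROFILE \
  --hive-import --hive-table aedw.account_profile \
  -m 4                      # four parallel mappers
```

With `--hive-import`, Sqoop creates the Hive table (mirroring the source structure) and loads the imported data in one step, matching the migration flow described above.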
POC 1 (Proof of Concept) on Hadoop:
Project Name: Acquisition and Statistical Knowledge Made Easy (ASKME)
Client AT&T
Role Hadoop Team Member and SAP BO Developer
Organization IBM India Pvt Ltd, Chennai
Duration Sep 2013 – Feb 2014
Environment Distribution: Cloudera Hadoop Distribution
Components: HDFS, MapReduce
Ecosystem: Hive, Pig
Middleware: Sqoop
Workflow: Oozie
Scheduling: Hue (Hadoop User Experience)
Database: Teradata
Source file: .csv files and Mainframe sequence files.
Project description:
ASKME is a data warehouse containing detailed information on historical provisioning and maintenance activity for AT&T. It is the primary source of information for closed trouble tickets for the West, Southwest, Midwest, and East regions. It also provides information related to closed service orders.
Contribution
Understanding business needs, analysing functional specifications and mapping those to develop HQLs and Pig Latin scripts.
The project runs in a Teradata environment and is gradually being moved to a Hadoop system.
Imported the mainframe sequence files from the Teradata system to HDFS through NDM (Network Data Mover).
Feasibility analysis (for the deliverables): evaluating the feasibility of the requirements against complexity and timelines.
Involved in creating code, lib and data folders to execute the project in a properly structured manner.
Performed data migration from legacy RDBMS databases to HDFS using Sqoop.
Wrote Pig scripts for data analysis; the end result is processed to HDFS.
Implemented Hive tables and HQL queries for the reports.
Wrote and used complex data types in Hive; stored and retrieved data using HQL in Hive.
Developed Hive queries to analyse reducer output data.
Highly involved in designing the next-generation data architecture for unstructured data.
Developed Pig Latin scripts to extract data from the source system.
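The Hive complex data types mentioned above (arrays, maps, structs) can be illustrated with a hypothetical table; the table and column names are invented, not taken from the ASKME project.

```sql
-- Hypothetical illustration of Hive complex data types; all names invented.
CREATE TABLE trouble_tickets (
  ticket_id   STRING,
  events      ARRAY<STRING>,                        -- ordered status history
  attrs       MAP<STRING, STRING>,                  -- free-form attributes
  site        STRUCT<region:STRING, state:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';

-- Complex fields are addressed with [], ['key'] and dot syntax:
SELECT ticket_id, events[0] AS first_event, attrs['severity'], site.region
FROM trouble_tickets;
```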
4. Project Name: Acquisition and Statistical Knowledge Made Easy (ASKME)
Client AT&T
Role SAP BO Developer
Organization IBM India Pvt Ltd, Chennai
Duration Aug 2010 – Feb 2014
Environment Reporting Tool: SAP BO
Database: Mainframe Teradata
Project description:
ASKME is a data warehouse containing detailed information on historical provisioning and maintenance activity for AT&T. It is the primary source of information for closed trouble tickets for the West, Southwest, Midwest, and East regions. It also provides information related to closed service orders.
Contribution:
Involved in understanding the business requirements specified in the BRD.
Involved in understanding the client's business environment and database for reporting.
Involved in gathering report requirements by coordinating with onsite SPOCs.
Involved in generating reports from scratch.
Extensive experience in designing universes using Designer.
Involved in universe design: designing the schema and resolving join path problems such as loops and traps (chasm trap, fan trap) using contexts or aliases.
Extensively used hierarchies, derived tables, @functions and cascading prompts.
Extensively used universe tuning techniques such as aggregate tables, indexes, shortcut joins and conditional objects.
Well experienced in generating complex reports using WebI to meet customer requirements.
Created complex reports using merged dimensions, combined queries, prompts, filters, conditional variables, alerts, hyperlinks and charts.
Involved in scheduling and publishing reports directly to the customer.
Involved in the complete reporting phase: gathering requirements, developing and deploying.
Certifications:
1. InfoSphere BigInsights Essentials using Apache Hadoop (SPVC) 2W602
2. DB2 9 Fundamentals certification, as per client requirement.
Trainings:
1. InfoSphere BigInsights Essentials using Apache Hadoop (SPVC) 2W602
2. BAO (Teradata, BO)
3. QlikView reporting tool