
Processing and content analysis of various document types using MapReduce and InfoSphere BigInsights

Sajad Izadi, Partner Enablement Engineer, IBM
Benjamin G. Leonhardi, Software Engineer, IBM
Piotr Pruski, Partner Enablement Engineer, IBM

29 July 2014

Businesses often need to analyze large numbers of documents of various file types. Apache Tika is a free open source library that extracts text contents from a variety of document formats, such as Microsoft® Word, RTF, and PDF. Learn how to run Tika in a MapReduce job within InfoSphere® BigInsights™ to analyze a large set of binary documents in parallel. Explore how to optimize MapReduce for the analysis of a large number of smaller files. Learn to create a Jaql module that makes MapReduce technology available to non-Java programmers to run scalable MapReduce jobs to process, analyze, and convert data within Hadoop.

This article describes how to analyze large numbers of documents of various types with IBM InfoSphere BigInsights. For industries that receive data in different formats (for example, legal documents, emails, and scientific articles), InfoSphere BigInsights can provide sophisticated text analytical capabilities that can aid in sentiment prediction, fraud detection, and other advanced data analysis.

Learn how to integrate Apache Tika, an open source library that can extract the text contents of documents, with InfoSphere BigInsights, which is built on the Hadoop platform and can scale to thousands of nodes to analyze billions of documents. Typically, Hadoop works on large files, so this article explains how to efficiently run jobs on a large number of small documents. Use the steps here to create a module in Jaql that creates the integration. Jaql is a flexible language for working with data in Hadoop. Essentially, Jaql is a layer on top of MapReduce that enables easy analysis and manipulation of data in Hadoop. Combining a Jaql module with Tika makes it easy to read various documents and use the analytical capabilities of InfoSphere BigInsights, such as text analytics and data mining, in a single step, without requiring deep programming expertise.

This article assumes a basic understanding of the Java programming language, Hadoop, MapReduce, and Jaql. Details about these technologies are outside the scope of the article, which focuses instead on the sections of code that must be updated to accommodate custom code. Download the sample data used in this article.

Overview: InfoSphere BigInsights, Tika, Jaql, and MapReduce classes

InfoSphere BigInsights is built on Apache Hadoop and enhances it with enterprise features, analytical capabilities, and management features. Apache Hadoop is an open source project that uses clusters of commodity servers to enable processing on large data volumes. It can scale from one to thousands of nodes with fault-tolerance capabilities. Hadoop can be thought of as an umbrella term. It includes two main components:

- A distributed file system (HDFS) to store the data
- The MapReduce framework to process data

MapReduce is a programming paradigm that enables parallel processing and massive scalability across the Hadoop cluster. Data in Hadoop is first broken into smaller pieces, such as blocks, and distributed on the cluster. MapReduce can then analyze these blocks in a parallel fashion.
Apache Tika

The Apache Tika toolkit is a free open source project used to read and extract text and other metadata from various types of digital documents, such as Word documents, PDF files, or files in rich text format. To see a basic example of how the API works, create an instance of the Tika class and open a stream by using the instance.

Listing 1. Example of Tika

import org.apache.tika.Tika;
...
private String read()
{
    Tika tika = new Tika();
    FileInputStream stream = new FileInputStream("/path_to_input_file.PDF");
    String output = tika.parseToString(stream);
    return output;
}

If your document format is not supported by Tika (Outlook PST files are not supported, for example), you can substitute a different Java library in the previous code listing. Tika does support the ability to extract metadata, but that is outside the scope of this article. It is relatively simple to add that function to the code.
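If you do need the metadata, one possible extension of Listing 1 (shown here only as a sketch; the readWithMetadata method is not part of this article's sample code) is to pass a Tika Metadata object into the parse call and read the properties that Tika collects while parsing:

import java.io.FileInputStream;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
...
private String readWithMetadata() throws Exception
{
    Tika tika = new Tika();
    Metadata metadata = new Metadata();
    FileInputStream stream = new FileInputStream("/path_to_input_file.PDF");
    // Tika fills the Metadata object while it extracts the text
    String text = tika.parseToString(stream, metadata);
    for (String name : metadata.names())
    {
        // properties such as the content type and document-specific fields
        System.out.println(name + " = " + metadata.get(name));
    }
    return text;
}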
Jaql

Jaql is primarily a query language for JSON, but it supports more than just JSON. It enables you to process structured and non-traditional data. Using Jaql, you can select, join, group, and filter data stored in HDFS in a manner similar to a blend of Pig and Hive. The Jaql query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is a functional, declarative query language designed to process large data sets. For parallelism, Jaql rewrites high-level queries, when appropriate, into low-level queries consisting of Java MapReduce jobs. This article demonstrates how to create a Jaql I/O adapter over Apache Tika to read various document formats, and to analyze and transform them all within this one language.

MapReduce classes used to analyze small files

Typically, MapReduce works on large files stored on HDFS. When writing to HDFS, files are broken into smaller pieces (blocks) according to the configuration of your Hadoop cluster. These blocks reside on this distributed file system. But what if you need to efficiently process a large number of small files (specifically, binary files such as PDF or RTF files) using Hadoop?

Several options are available. In many cases, you can merge the small files into a big file by creating a sequence file, which is the native storage format for Hadoop. However, creating sequence files in a single thread can be a bottleneck, and you risk losing the original files. This article offers a different approach: manipulating a few of the Java classes used in MapReduce. Traditional classes require each individual file to have a dedicated mapper. But this process is inefficient when there are many small files.

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition now.

As an alternative to traditional classes, process small files in Hadoop by creating a set of custom classes to notify the task that the files are small enough to be treated in a different way from the traditional approach.

At the mapping stage, logical containers called splits are defined, and a map processing task takes place at each split. Use custom classes to define a fixed-sized split, which is filled with as many small files as it can accommodate. When the split is full, the job creates a new split and fills that one as well, until it's full. Then each split is assigned to one mapper.

MapReduce classes for reading files

Three main MapReduce Java classes are used to define splits and read data during a MapReduce job: InputSplit, InputFormat, and RecordReader.

When you transfer a file from a local file system to HDFS, it is converted to blocks of 128 MB. (This default value can be changed in InfoSphere BigInsights.) Consider a file big enough to consume 10 blocks. When you read that file from HDFS as an input for a MapReduce job, the same blocks are usually mapped, one by one, to splits. In this case, the file is divided into 10 splits (which implies 10 map tasks) for processing. By default, the block size and the split size are equal, but the sizes are dependent on the configuration settings for the InputSplit class.

From a Java programming perspective, the class that holds the responsibility of this conversion is called an InputFormat, which is the main entry point into reading data from HDFS. From the blocks of the files, it creates a list of InputSplits. For each split, one mapper is created. Then each InputSplit is divided into records by using the RecordReader class. Each record represents a key-value pair.
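To make the key-value contract concrete: with the default TextInputFormat, for example, the RecordReader hands each mapper the byte offset of a line as the key and the line itself as the value. The minimal mapper below (an illustration only, not part of this article's solution) simply echoes those pairs:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrates the record types produced by the default TextInputFormat.
public class EchoMapper extends Mapper<LongWritable, Text, LongWritable, Text>
{
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException
    {
        // key = byte offset of the line in the split, value = the line of text
        context.write(key, value);
    }
}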
FileInputFormat vs. CombineFileInputFormat

Before a MapReduce job is run, you can specify the InputFormat class to be used. The implementation of FileInputFormat requires you to create an instance of the RecordReader, and as mentioned previously, the RecordReader creates the key-value pairs for the mappers.

FileInputFormat is an abstract class that is the basis for a majority of the implementations of InputFormat. It contains the location of the input files and an implementation of how splits must be produced from these files. How the splits are converted into key-value pairs is defined in the subclasses. Some examples of its subclasses are TextInputFormat, KeyValueTextInputFormat, and CombineFileInputFormat.

Hadoop works more efficiently with large files (files that occupy more than one block). FileInputFormat converts each large file into splits, and each split is created in a way that contains part of a single file. As mentioned, one mapper is generated for each split. Figure 1 depicts how a file is treated using FileInputFormat and RecordReader in the mapping stage.

Figure 1. FileInputFormat with a large file

However, when the input files are smaller than the default block size, many splits (and therefore, many mappers) are created. This arrangement makes the job inefficient. Figure 2 shows how too many mappers are created when FileInputFormat is used for many small files.

Figure 2. FileInputFormat with many small files

To avoid this situation, CombineFileInputFormat is introduced. This InputFormat works well with small files, because it packs many of them into one split so there are fewer mappers, and each mapper has more data to process. Unlike other subclasses of FileInputFormat, CombineFileInputFormat is an abstract class that requires additional changes before it can be used. In addition to these changes, you must ensure that you prevent splitting the input. Figure 3 shows how CombineFileInputFormat treats the small files so that fewer mappers are created.

Figure 3. CombineFileInputFormat with many small files

MapReduce classes used for writing files

You need to save the text content of the documents in files that are easy to process in Hadoop. You can use sequence files, but in this example, you create delimited text files that contain the contents of each file in one record. This method makes the content easy to read and easy to use in downstream MapReduce jobs. The Java classes used for writing files in MapReduce are OutputFormat and RecordWriter. These classes are similar to InputFormat and RecordReader, except that they are used for output. The FileOutputFormat implements OutputFormat. It contains the path of the output files and directory and includes instructions for how the write job must be run.

RecordWriter, which is created within the OutputFormat class, defines the way each record passed from the mappers is to be written in the output path.

Implementing custom MapReduce classes

In the lab scenario used in this article, you want to process and archive a large number of small binary files in Hadoop. For example, you might need to have Hadoop analyze several research papers in PDF format. Using the traditional MapReduce techniques, it would take a relatively long time for the job to complete, simply because you have too many small files as your input. Moreover, the PDF format of your files isn't natively readable by MapReduce. In addition to these limitations, storing many small files in the Hadoop distributed file system can consume a significant amount of memory on the NameNode. Roughly 1 GB for every million files or blocks is required. Therefore, files smaller than a block are inefficiently processed with traditional MapReduce techniques. It's more efficient to develop a program that has the following characteristics:

- Is optimized to work with a large number of small files
- Can read binary files
- Generates fewer, larger files as the output

A better approach is to use Apache Tika to read the text within any kind of supported document format, to develop a TikaInputFormat class to read and process small files by using a MapReduce task, and to use TikaOutputFormat to show the result. Use InputFormat, RecordReader, and RecordWriter to create the solution. The goal is to read many small PDF files and generate output that has a delimited format that looks similar to the code below.

Listing 2. Desired output

<file name 1>|<text content of file 1>
<file name 2>|<text content of file 2>
<file name 3>|<text content of file 3>
...

This output can be used later for downstream analysis. The following sections explain the details of each class.

TikaHelper to convert binary data to text

The purpose of this helper class is to convert a stream of binary data to text format. It receives a Java I/O stream as an input and returns the string equivalent of that stream.

If you are familiar with MapReduce, you know that all tasks contain some configuration parameters set at runtime. With these parameters, you can define how the job is supposed to be run (the location where the output is to reside, for example). You can also add parameters that the classes are to use.
In this application, assume you want to output a delimited file. Therefore, you need a way to replace the chosen delimiter character in the original text field with a different character and a way to replace new lines in the text with the same replacement character. For this purpose, add two parameters: com.ibm.imte.tika.delimiter and com.ibm.imte.tika.replaceCharacterWith. As shown in Listing 3, in the TikaHelper class, read those parameters from an instance of Configuration to get the replacement options. Configuration is passed from RecordReader, which creates the TikaHelper instance, described in a following section of this article.

Listing 3. TikaHelper.java constructor

public TikaHelper(Configuration conf)
{
    tika = new Tika();
    String confDelimiter = conf.get("com.ibm.imte.tika.delimiter");
    String confReplaceChar = conf.get("com.ibm.imte.tika.replaceCharacterWith");
    if (confDelimiter != null)
        this.delimiter = "[" + confDelimiter + "]";
    if (confReplaceChar != null)
        this.replaceWith = confReplaceChar;
    logger.info("Delimiter: " + delimiter);
    logger.info("Replace With character: " + replaceWith);
}

After preparing the options, call the readPath method to get a stream of data to be converted to text. After replacing all the desired characters from the configuration, return the string representation of the file contents.

The replaceAll method is called on a string object and replaces all occurrences of the matched characters with the replacement specified in the argument. Because it takes a regular expression as input, surround the delimiter with the character class brackets [ and ]. In the solution, indicate that if com.ibm.imte.tika.replaceCharacterWith is not specified, all matched characters are to be replaced with an empty string.

In this article, the output is saved as delimited files. This makes them easy to read and process. However, you do need to remove newline and delimiter characters in the original text. In use cases such as sentiment analysis or fraud detection, these characters are not important. If you need to preserve the original text 100 percent, you can output the results as binary Hadoop sequence files instead.

Listing 4. TikaHelper.java readPath method

public String readPath(InputStream stream)
{
    try
    {
        String content = tika.parseToString(stream);
        content = content.replaceAll(delimiter, replaceWith);
        content = content.replaceAll(endLine, replaceWith);
        return content;
    }
    catch (Exception e)
    {
        logger.error("Malformed PDF for Tika: " + e.getMessage());
    }
    return "Malformed PDF";
}
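To see what this replacement does to the extracted text, the small standalone snippet below (not part of the article's classes) mimics the two replaceAll calls in readPath, using | as the delimiter and a space as the replacement character:

public class ReplaceDemo
{
    public static void main(String[] args)
    {
        String delimiter = "[|]";   // the delimiter wrapped in a regular expression character class
        String replaceWith = " ";   // stands in for com.ibm.imte.tika.replaceCharacterWith
        String endLine = "\n";

        String extracted = "First line | contains the delimiter\nSecond line";
        String cleaned = extracted.replaceAll(delimiter, replaceWith)
                                  .replaceAll(endLine, replaceWith);

        // Prints the text on a single line, with the pipe and the newline replaced by spaces
        System.out.println(cleaned);
    }
}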
TikaInputFormat to define the job

Every MapReduce task must have an InputFormat. TikaInputFormat is the InputFormat developed in this solution. It is extended from the CombineFileInputFormat class, with key and value type parameters of Text. Text is a Writable, Hadoop's serialization format for key-value pairs. TikaInputFormat is used to validate the configuration of the job, split the input blocks, and create a proper RecordReader. As shown in Listing 5, in the createRecordReader method you return an instance of RecordReader. As described, you don't need to split the files in TikaInputFormat because the files are assumed to be small. Regardless, TikaHelper cannot read parts of a file. Therefore, the return value of the isSplitable method must be set to false.

Listing 5. TikaInputFormat.java

public class TikaInputFormat extends CombineFileInputFormat<Text, Text>
{
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
        TaskAttemptContext context) throws IOException
    {
        return new TikaRecordReader((CombineFileSplit) split, context);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file)
    {
        return false;
    }
}

TikaRecordReader to generate key-value pairs

TikaRecordReader uses the data given to the TikaInputFormat to generate key-value pairs. This class is derived from the abstract RecordReader class. This section describes the constructor and the nextKeyValue methods.

In the constructor shown in Listing 6, store the required information to carry out the job delivered from TikaInputFormat. Path[] paths stores the path of each file, FileSystem fs represents a file system in Hadoop, and CombineFileSplit split contains the criteria of the splits. Notice that you also create an instance of TikaHelper with the Configuration to parse the files in the TikaRecordReader class.

Listing 6. TikaRecordReader.java constructor

public TikaRecordReader(CombineFileSplit split, TaskAttemptContext context)
    throws IOException
{
    this.paths = split.getPaths();
    this.fs = FileSystem.get(context.getConfiguration());
    this.split = split;
    this.tikaHelper = new TikaHelper(context.getConfiguration());
}

In the nextKeyValue method shown in Listing 7, you go through each file in the Path[] and return a key and value of type Text, which contain the file path and the content of each file, respectively. To do this, first determine whether you are already at the end of the files array. If not, you move on to the next available file in the array. Then you open an FSDataInputStream stream to the file. In this case, the key is the path of the file and the value is the text content. You pass the stream to the TikaHelper to read the contents for the value. (The currentStream field always points to the current file in the iteration.) Next, close the used-up stream.

Explore HadoopDev

Find resources you need to get started with Hadoop powered by InfoSphere BigInsights, brought to you by the extended BigInsights development team. Doc, product downloads, labs, code examples, help, events, and expert blogs are all there. Plus a direct line to the developers. Engage with the team now.

This method is run once for every file in the input. Each file generates a key-value pair. As explained, when the split has been read, the next split is opened to get the records, and so on. This process also happens in parallel on other splits. In the end, by returning the value false, you stop the loop.

In addition to the following code, you must also override some default functions, as shown in the full code, available for download.
Listing 7. TikaRecordReader.java nextKeyValue

@Override
public boolean nextKeyValue() throws IOException, InterruptedException
{
    if (count >= split.getNumPaths())
    {
        done = true;
        return false; // we have no more data to parse
    }
    Path path = null;
    key = new Text();
    value = new Text();
    try
    {
        path = this.paths[count];
    }
    catch (Exception e)
    {
        return false;
    }
    currentStream = null;
    currentStream = fs.open(path);
    key.set(path.getName());
    value.set(tikaHelper.readPath(currentStream));
    currentStream.close();
    count++;
    return true; // we have more data to parse
}
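The remaining overrides are largely boilerplate. The sketch below suggests what they might look like if they reuse the key, value, done, count, and split fields from the listings above; the downloadable project contains the authoritative versions.

// Inside TikaRecordReader (extends RecordReader<Text, Text>); sketch only.
@Override
public void initialize(InputSplit inputSplit, TaskAttemptContext context)
    throws IOException, InterruptedException
{
    // All state is prepared in the constructor, so nothing to do here.
}

@Override
public Text getCurrentKey() throws IOException, InterruptedException
{
    return key;    // the file name set in nextKeyValue()
}

@Override
public Text getCurrentValue() throws IOException, InterruptedException
{
    return value;  // the extracted text set in nextKeyValue()
}

@Override
public float getProgress() throws IOException, InterruptedException
{
    // Fraction of the files in this split that have been processed so far
    return done ? 1.0f : (float) count / split.getNumPaths();
}

@Override
public void close() throws IOException
{
    // Streams are closed in nextKeyValue(), so nothing is left open here.
}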
TikaOutputFormat to specify output details

This class determines where and how the output of the job is stored. It must be extended from an OutputFormat class. In this case, it is extended from FileOutputFormat. As shown in Listing 8, you first allocate the path for the output, then create an instance of TikaRecordWriter to generate the output files. Just like TikaInputFormat, this class must be specified in the main method to be used as the OutputFormat class.

Listing 8. TikaOutputFormat.java

public class TikaOutputFormat extends FileOutputFormat<Text, Text>
{
    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context)
        throws IOException, InterruptedException
    {
        // to get output files in part-r-00000 format
        Path path = getDefaultWorkFile(context, "");
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        FSDataOutputStream output = fs.create(path, context);
        return new TikaRecordWriter(output, context);
    }
}

TikaRecordWriter to create the output

This class is used to create the output. It must be extended from the abstract RecordWriter class.

In the constructor shown in Listing 9, you get the output stream, the context, and the custom configuration parameter, which serves as the delimiter between the file name and its content. This parameter can be set at runtime (in the main method). If it is not specified, | is picked by default.

Listing 9. TikaRecordWriter.java constructor

public TikaRecordWriter(DataOutputStream output, TaskAttemptContext context)
{
    this.out = output;
    String cDel = context.getConfiguration().get("com.ibm.imte.tika.delimiter");
    if (cDel != null)
        delimiter = cDel;
    logger.info("Delimiter character: " + delimiter);
}

In the write method shown in Listing 10, use the key and value of type Text created in the mapper to be written in the output stream. The key contains the file name, and the value contains the text content of the file. When writing these two in the output, separate them with the delimiter and then separate each row with a new line character.

Listing 10. TikaRecordWriter.java write

@Override
public void write(Text key, Text value) throws IOException, InterruptedException
{
    out.writeBytes(key.toString());
    out.writeBytes(delimiter);
    out.writeBytes(value.toString());
    out.writeBytes("\n");
}

TikaDriver to use the application

To run a MapReduce job, you need to define a driver class, TikaDriver, which contains the main method, as shown in Listing 11. You can set the TikaInputFormat as the custom InputFormat, and similarly, you can set the TikaOutputFormat as the custom OutputFormat for the job.

Listing 11. Main method

public static void main(String[] args) throws Exception
{
    int exit = ToolRunner.run(new Configuration(), new TikaDriver(), args);
    System.exit(exit);
}

@Override
public int run(String[] args) throws Exception
{
    Configuration conf = new Configuration();
    // setting the input split size: 64 MB or 128 MB are good
    conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 67108864);
    Job job = new Job(conf, "TikaMapreduce");
    conf.setStrings("com.ibm.imte.tika.delimiter", "|");
    conf.setStrings("com.ibm.imte.tika.replaceCharacterWith", "");
    job.setJarByClass(getClass());
    job.setJobName("TikaRead");
    job.setInputFormatClass(TikaInputFormat.class);
    job.setOutputFormatClass(TikaOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
}

Tika and Log4j API attachment

Remember to attach the Tika and Log4j APIs when running the task. To do this in Eclipse, go to the job configuration by clicking Run > Run Configurations, and in the Java MapReduce section, click the JAR Settings tab and add the API JAR files to the Additional JAR Files section.

Pay attention to the first configuration line in Listing 11. If the max split size is not defined, the job assigns all of the input files to only one split, so there is only one map task. To prevent this, define the max split size. This value can be changed by defining a value for the mapreduce.input.fileinputformat.split.maxsize configuration parameter. This way, each split has a configurable size (64 MB in this case).
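As an alternative (a sketch, not taken from the article's sample code), the same limit can be set through the static helper on the new-API FileInputFormat, the class already used for addInputPath() in Listing 11, which avoids hard-coding the property name:

// Equivalent to conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 67108864);
// call it in run() after the Job object has been created.
FileInputFormat.setMaxInputSplitSize(job, 67108864L);   // 64 MB per split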
You have now finished the MapReduce job. It reads all files in the HDFS input folder and transcodes them into a delimited output file. You can then conveniently continue analyzing the data with text analytical tools, such as IBM Annotation Query Language (AQL). If you want a different output format or you want to directly transform the data, you must modify the code appropriately. Because many people are not comfortable programming Java code, this article explains how to use the same technology in a Jaql module.

Using a Jaql module rather than Java classes

This section describes how to create a Jaql module using the same technology as in the previous section and how to use this module to transform documents, load them from external file systems, and directly analyze them. A Jaql module enables you to do all of this processing, without writing any Java code, using a straightforward syntax.

The InputFormat, OutputFormat, RecordReader, and RecordWriter classes described previously reside in the org.apache.hadoop.mapreduce and org.apache.hadoop.mapreduce.lib.output packages, which are known as the new Hadoop APIs. To use the same approach with Jaql, you need to implement classes in the org.apache.hadoop.mapred package, which is an older version of the MapReduce APIs. First, learn how to apply the same methods to the older package.

TikaJaqlInputFormat to validate input

This class is used to validate the input configuration for the job, split the input blocks, and create the RecordReader. It is extended from the org.apache.hadoop.mapred.MultiFileInputFormat class and contains two methods.

As shown in Listing 12, the getRecordReader method creates an instance of TikaJaqlRecordReader, and the isSplitable method is set to return false to override the default behavior and stop the InputFormat from splitting the files. To be able to manipulate the input after loading in Jaql, use the generic type JsonHolder.

Listing 12. TikaJaqlInputFormat.java

public class TikaJaqlInputFormat extends MultiFileInputFormat<JsonHolder, JsonHolder>
{
    @Override
    public RecordReader<JsonHolder, JsonHolder> getRecordReader(InputSplit split,
        JobConf job, Reporter reporter) throws IOException
    {
        return new TikaJaqlRecordReader(job, (MultiFileSplit) split);
    }

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename)
    {
        return false;
    }
}

TikaJaqlRecordReader to generate key-value pairs

This class is used to generate the key-value pairs used in MapReduce. It implements the org.apache.hadoop.mapred.RecordReader interface to maintain compatibility with Jaql. This section describes the constructor and the next methods.

In the constructor, as shown in Listing 13, initialize the needed class variables. Get the split, which contains information about the files, and create a new instance of TikaHelper to read the binary files.

Listing 13. TikaJaqlRecordReader constructor

public TikaJaqlRecordReader(Configuration conf, MultiFileSplit split)
    throws IOException
{
    this.split = split;
    this.conf = conf;
    this.paths = split.getPaths();
    this.tikaHelper = new TikaHelper(conf);
}

What about OutputFormat and RecordWriter?

You don't need to implement the output part of the task because after loading the data with Jaql, you can use existing pre-defined Jaql modules to manipulate the data and write it out in various formats.

In the next method, as shown in Listing 14, iterate through all the files in the split, one after the other. After opening a stream to each file, assign the name and the contents as the elements of a new instance of BufferedJsonRecord. BufferedJsonRecord helps you keep items in an appropriate format. Jaql internally runs on JSON documents, so all data needs to be translated into valid JSON objects by the I/O adapters. The BufferedJsonRecord is then assigned as the value of the record. The key, however, remains empty.

Listing 14. TikaJaqlRecordReader next method

public boolean next(JsonHolder key, JsonHolder value) throws IOException
{
    if (count >= split.getNumPaths())
    {
        done = true;
        return false;
    }
    Path file = paths[count];
    fs = file.getFileSystem(conf);
    InputStream stream = fs.open(file);
    BufferedJsonRecord bjr = new BufferedJsonRecord();
    bjr.setNotSorted();
    bjr.add(new JsonString("path"), new JsonString(file.getName()));
    bjr.add(new JsonString("content"),
        new JsonString(this.tikaHelper.readPath(stream)));
    value.setValue(bjr);
    stream.close();
    count++;
    return true;
}
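As with the new-API reader, a few additional methods of the old RecordReader interface must be implemented. The following is only a sketch of what they might look like; it assumes JsonHolder can be instantiated with its no-argument constructor and reuses the done, count, and split fields from Listing 14. The downloadable project contains the authoritative versions.

// Inside TikaJaqlRecordReader (implements RecordReader<JsonHolder, JsonHolder>); sketch only.
public JsonHolder createKey()
{
    return new JsonHolder();     // assumes the no-argument JsonHolder constructor
}

public JsonHolder createValue()
{
    return new JsonHolder();
}

public long getPos() throws IOException
{
    return count;                // rough position: number of files consumed so far
}

public float getProgress() throws IOException
{
    return done ? 1.0f : (float) count / split.getNumPaths();
}

public void close() throws IOException
{
    // Streams are closed in next(), so nothing is left open here.
}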
Creating the Jaql module

Jaql modules enable users to create packages of reusable Jaql functions and resources. Create a tika module that contains an I/O adapter. I/O adapters are passed to I/O functions and allow Jaql to read or write from various source types, such as delimited files, sequence files, AVRO files, HBase and Hive tables, and much more. This tika module enables users to read binary files supported by Apache Tika (such as Word files or PDF documents) to extract the file name and the text content. To create the tika module, export the TikaJaql classes developed previously as a JAR file. Jaql can dynamically load Java resources and add them to the class path by using the function addRelativeClassPath() to register such additional libraries.

Creating and referencing modules is straightforward in Jaql. Every Jaql script can be added as a module by adding it to the search path of Jaql. The easiest way to do this is by creating a new folder in the $JAQL_HOME/modules directory and including your files there. In this case, the module is named tika, so you need to create a folder $JAQL_HOME/modules/tika. You can then create functions within Jaql scripts and include them in this folder.

Create a custom function named tikaRead() that uses com.ibm.imte.tika.jaql.TikaJaqlInputFormat for the input format component. This function is to be used for reading, so change only the inoptions (and not the outoptions). Based on the implemented classes developed in the previous section, calling the tikaRead() function as an input for read produces one record for every input file with two fields: path, which is the full file name, and content, which is the text content of the file. Calling the tikaRead() function is similar to calling any other Jaql input I/O adapter, such as lines() or del(). Usage examples are included in a subsequent section.

Create the file tika.jaql, as shown in Listing 15, and put it in the $JAQL_HOME/modules/tika directory so it can be easily imported into other Jaql scripts. The name of the Jaql file is not relevant, but the name of the folder you created under the modules folder is important. You can also add modules dynamically using command-line options from a Jaql-supported terminal.

This code looks for the generated JAR files in /home/biadmin/. You need to copy the Tika JAR file into this folder and export your created class files as TikaJaql.jar to this folder, as well. In Eclipse, you can create a JAR file from a project with the Export command.

Listing 15. tika.jaql

addRelativeClassPath(getSystemSearchPath(), '/home/biadmin/tika-app-1.5.jar,
/home/biadmin/TikaJaql.jar');

// creating the function
tikaRead = fn(
    location    : string,
    inoptions   : {*}? = null,
    outoptions  : {*}? = null
)
{
    location,
    "inoptions": {
        "adapter": "com.ibm.jaql.io.hadoop.DefaultHadoopInputAdapter",
        "format": "com.ibm.imte.tika.jaql.TikaJaqlInputFormat",
        "configurator": "com.ibm.jaql.io.hadoop.FileInputConfigurator"
    }
};
Using Jaql

Now that the module has been created, use the following examples to help you see some possible uses of this function.

Jaql is quite flexible and can be used to transform and analyze data. It has connectors to analytical tools, such as data mining and text analytics (AQL). It has connectors to various file formats (such as line, sequence, and Avro) and to external sources (such as Hive and HBase). You can also use it to read files from the local file system or even directly from the web.

The following section demonstrates three examples for the use of the tika module in Jaql. The first example shows a basic transformation of binary documents on HDFS into a delimited file containing their text content. This example illustrates the fundamental capabilities of the module; it is equivalent to the tasks you carried out with the MapReduce job in the previous sections. The second example shows how to use Jaql to load and transform binary documents directly from an external file system source into HDFS. This example can prove to be a useful procedure if you do not want to store the binary documents in HDFS, but rather to store only the contents in a text or sequence file format, for instance. The load is single-threaded in this case, so it does not have the same throughput as the first approach. The third example shows how to do text analysis directly within Jaql after reading the files, without first having to extract and persist the text contents.

Using the code in Listing 17, read files inside a directory from HDFS and write the results back into HDFS. This method closely mirrors what you have done in the MapReduce job in the first section. You must import the tika module you created to be able to use the tikaRead() functionality. You then read the files in the specified folder using the read() function, and write the file names and text contents to a file in HDFS in delimited file format. You can find additional information on Jaql in the InfoSphere BigInsights Knowledge Center.

The demo input is a set of customer reviews in Word format in a folder, as shown in Listing 16. Of the 10 reviews, some are positive and some are negative. Assume you want to extract the text and store it in delimited format. Later, you might want to perform text analytics on it. You want to keep the file name because it tells you who created the review. Normally, that relationship is documented in a separate table.

Listing 16. The input files in hdfs:/tmp/reviews/

review1.doc
review2.doc
review3.doc
...

As shown in Listing 17, run the Jaql command to read all the supported documents of this folder, extract the text, and save it into a single delimited file that has one line per original document.

Listing 17. HDFS to HDFS using Jaql

import tika(*);

read(tikaRead("/tmp/reviews"))
    // You could put data transformations here
    -> write(del("/tmp/output", {schema: schema{path, content}, delimiter: "|", quoted: true}));

You can now find the output in the /tmp/output folder. This folder contains the text content of the Word documents originally in /tmp/reviews in the format shown below.

Listing 18. Output of Jaql Tika transformation

::::::::::::::
part-00000
::::::::::::::
"review1.doc"|"I do not care for the camera."
"review10.doc"|"It was very reliable "
"review2.doc"|"The product was simply bad. "
"review3.doc"|"The user interface was simply atrocious. "
"review4.doc"|"The product interface is simply broken. "
...
::::::::::::::
part-00001
::::::::::::::
"review5.doc"|"The Windows client is simply crappy. "
"review6.doc"|"I liked the camera. It is a good product. "
"review7.doc"|"It is a phenomenal camera. "
"review8.doc"|"Just an awesome product. "
"review9.doc"|"I really liked the Camera. It is excellent. "
...

You can now easily analyze the document contents with other tools like Hive, Pig, MapReduce, or Jaql. You have one part file for each map task.
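For example, a downstream MapReduce job could consume these part files with the ordinary TextInputFormat and split each record on the delimiter. The mapper below is a minimal sketch (not part of the article's sample code) that assumes the quoted, pipe-delimited layout shown in Listing 18:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a follow-on mapper that reads the delimited output produced above.
public class ReviewMapper extends Mapper<LongWritable, Text, Text, Text>
{
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException
    {
        // Splits "review1.doc"|"I do not care for the camera." into file name and content
        String[] parts = line.toString().split("\\|", 2);
        if (parts.length == 2)
        {
            String fileName = parts[0].replace("\"", "");
            String content = parts[1].replace("\"", "");
            context.write(new Text(fileName), new Text(content));
        }
    }
}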
Using Jaql, you are not constrained to reading files exclusively from HDFS. By replacing the input path with one that points to a local disk (of the Jaql instance), you can read files from the local file system and use the write() method to copy them into HDFS, as shown in Listing 19. This approach makes it possible to load documents into InfoSphere BigInsights and transform them in a single step. The transformation is not done in parallel (because the data was not read in parallel to begin with), but if the data volumes are not so high, this method can be convenient.

If your operation is CPU-constrained, you can also use a normal read operation that runs in MapReduce. However, this method requires you to put the files on a network file system and mount it on all data nodes. The localRead command in Listing 19 runs the transformation in a local task.

Listing 19. Loading data into HDFS using Jaql

import tika(*);

localRead(tikaRead("file:///home/biadmin/Tika/CameraReviews"))
    -> write(seq("/tmp/output"));

As you can see, the only difference here is the local file path. Jaql is flexible and can dynamically change from running in MapReduce to local mode. You can continue to perform all of the data transformations and analytics in one step. However, Jaql does not run these tasks in parallel because the local file system is not parallel. Note that in the previous example, the output format is changed to a Jaql sequence file. This approach is binary and it is faster, so you don't need to replace characters in the original text. However, the disadvantage is that the output files aren't human readable anymore. This format is great for efficient, temporary storage of intermediate files.

The last example, in Listing 20, shows how to run a sentiment detection algorithm on a set of binary input documents. (The steps on how to create the AQL text analytics code for this are omitted because other comprehensive articles and references go into more detail. In particular, see the developerWorks article "Integrate PureData System for Analytics and InfoSphere BigInsights for email analysis" and the InfoSphere BigInsights Knowledge Center.)

Listing 20. Text analysis using Jaql

import tika(*);
import systemT;

read(tikaRead("/tmp/reviews"))
    -> transform { label: $.path, text: $.content }
    -> transform { label: $.label,
        sentiments: systemT::annotateDocument($, ["EmotiveTone"],
            ["file:///home/biadmin/Tika/"],
            tokenizer = "multilingual",
            outputViews = ["EmotiveTone.AllClues"])
    };

In a nutshell, the commands in the previous sections can read the binary input documents, extract the text content from them, and apply a simple emotive tone detection annotator using AQL. The resulting output is similar to Listing 21.

Listing 21. Jaql output

[
    {
        "label": "review1.doc",
        "sentiments": {
            "EmotiveTone.AllClues": [
                {
                    "clueType": "dislike",
                    "match": "not care for"
                }
            ],
            "label": "review1.doc",
            "text": "I do not care for the camera."
        }
    },
    {
        "label": "review10.doc",
        "sentiments": {
            "EmotiveTone.AllClues": [
                {
                    "clueType": "positive",
                    "match": "reliable"
                }
            ],
            "label": "review10.doc",
            "text": "It was very reliable "
        }
    },
    ...
]

You can now use Jaql to further aggregate the results, such as counting the positive and negative sentiments by product, and directly upload the results to a database for deeper analytical queries. For more details on how to create your own AQL files or use them within Jaql, see the developerWorks article "Integrate PureData System for Analytics and InfoSphere BigInsights for email analysis" and the InfoSphere BigInsights Knowledge Center.
Archiving the files

As mentioned, HDFS is not efficient at storing many small files. Every block stored in HDFS requires some small amount of memory in the HDFS NameNode (roughly 100 bytes). Therefore, an excessive number of small files can increase the amount of memory consumed on the NameNode. Because you have already implemented a solution to read small binary files and convert them to larger files as the output, you can now get rid of the original small files. However, you might want to reanalyze your binary files later by using different methods. Use Hadoop Archive (HAR) to reduce the memory usage on the NameNode by packing the chosen small files into bigger files. It's essentially equivalent to the Linux TAR format, or Windows CAB files, but on HDFS.

Run the archive command using the template below.

Listing 22. Archive command

hadoop archive -archiveName archive_name.har -p /path_to_input_files /path_to_output_directory

The first argument specifies the output file name, and the second designates the source directory. This example includes only one source directory, but this tool can accept multiple directories.

After the archive has been created, you can browse the content files.

Listing 23. List HAR files

hadoop fs -lsr har:///path_to_output_directory/archive_name.har

Because you now have the input files in HAR format, you can delete the original small files to fulfill the purpose of this process.

It is good to note that HAR files can be used as input for MapReduce. However, processing many small files, even in a HAR, is still inefficient because there is no archive-aware InputFormat that can convert a HAR file containing multiple small files to a single MapReduce split. This limitation means that HAR files are good as a backup method and as a way to reduce memory consumption on the NameNode, but they are not ideal as input for analytic tasks. For this reason, you need to extract the text contents of the original files before creating the HAR backup.

Conclusion

This article describes one approach to analyzing a large set of small binary documents with Hadoop using Apache Tika. This method is definitely not the only way to implement such functionality. You can also create sequence files out of the binary files or use another storage method, such as Avro. However, the method described in this article offers a convenient way to analyze a vast number of files of various types. Combining this method with Jaql technology, you have the ability to extract contents directly while reading files from various sources.

Apache Tika is one of the most useful examples, but you can replicate the same approach with essentially any other Java library. For example, you can extract binary documents not currently supported by Apache Tika, such as Outlook PST files.

You can implement everything described in this article by using only Java MapReduce. However, the Jaql module created in the second part of this article is a convenient way to load and transform data in Hadoop without the need for Java programming skills. The Jaql module enables you to do the conversion process during load and to use analytical capabilities, such as text or statistical analysis, which can be completed within a single job.
Downloads

Description                                 Name            Size
Project and sample files for this article   SampleCode.zip  26MB

Resources

Learn

- Hadoop: The Definitive Guide is a great way to learn about Hadoop, MapReduce programming, and the Hadoop classes used in this article.
- Engage with the HadoopDev team.
- Read "Integrate PureData System for Analytics and InfoSphere BigInsights for email analysis" in combination with this article to understand an end-to-end solution with InfoSphere BigInsights, IBM PureData System for Analytics, and IBM Cognos for Email Analysis.
- Learn more about Apache Tika.
- The InfoSphere BigInsights Knowledge Center product documentation includes the full reference for Jaql and AQL.
- Self-paced tutorials (PDF): Learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
- Technical introduction to InfoSphere BigInsights: Learn more on Slideshare.

Get products and technologies

- InfoSphere BigInsights Quick Start Edition: Download this no-charge version, available as a native software installation or as a VMware image.

Discuss

- InfoSphere BigInsights forum: Ask questions and get answers.

About the authors

Sajad Izadi

Sajad Izadi is a student at York University in Toronto focusing on information technology. He is completing an internship at IBM's Toronto software development lab as a member of the Toronto Information Management Business Partner team. His main responsibilities include technical verification of ReadyFor DB2 applications for business partners and aiding the big data team in partner enablement activities by developing demos used in POCs. His interests include databases, data warehousing, and application development. He is a certified IBM DB2 10.1 Administrator and a CCNA.

Benjamin G. Leonhardi

Benjamin Leonhardi is the team lead for the big data/warehousing partner enablement team. Before that, he was a software developer for InfoSphere Warehouse at the IBM R&D Lab Boeblingen in Germany. He was a developer on the data mining, text mining, and mining reporting solutions.

Piotr Pruski

Piotr Pruski is a partner enablement engineer within the Information Management Business Partner Ecosystem team in IBM. His main focus is to help accelerate sales and partner success by reaching out to and engaging business partners, enabling them to work with products within the IM portfolio, namely InfoSphere BigInsights and InfoSphere Streams.

© Copyright IBM Corporation 2014 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)