
[IEEE 2012 7th Colombian Computing Congress (CCC) - Medellin, Colombia (2012.10.1-2012.10.5)] 2012 7th Colombian Computing Congress (CCC) - A framework for high performance image analysis


A Framework for High Performance Image Analysis Pipelines

Raúl Ramos-Pollán, Angel Cruz-Roa, Fabio A. González

Abstract—This paper describes the software framework being developed to enable the execution of large-scale image analysis pipelines. Images are analyzed through algorithms (feature extraction, annotation, classification, etc.) assembled into processing pipelines and managed by the framework to be run on the available computing resources, whether cloud, opportunistic or dedicated clusters. Underneath, we use Google’s Big Table storage model for image sources and metadata, showing both flexibility and performance. Additionally, our architecture provides a clear separation between framework developers, providers of algorithms and experimenters, enabling the organization of teams and software repositories that ensures its organizational sustainability in the long term. Altogether, we integrate best practices in pattern recognition, software engineering and high performance computing to enable large-scale experiments in image analysis. We herewith describe the framework and present preliminary results demonstrating its scalability and ease of use.

Index Terms—Image analysis, pattern recognition, high performance computing, cloud computing, opportunistic computing.

I. INTRODUCTION

Every day, around 2.5 quintillion bytes of data are created; in fact, 90% of the data in the world today has been created in the last two years alone1. Big data are datasets that grow in several aspects: variety, velocity and volume. This phenomenon is due to the fast advance of technology, which has produced devices and tools that make it very easy to acquire, store and share huge amounts of data. Visual information is an important part of this data deluge, as there is an ever increasing number of publicly available image collections. Examples of these technological devices and tools include smartphones, high-resolution cameras, storage services such as Flickr and Picasa, and social networking sites such as Facebook or MySpace. These sites host billions of data items, including pictures, text and video, with associated information such as geographical location and tags. This huge amount of data is an important source of information and knowledge, which is changing the methodological approach to different computational problems that range from computer vision to automatic translation. Particular examples include multimedia information retrieval [18], [20], [4], [17], [19], [12], [10] and automatic translation [1], where learning from data [3] has been, so far, the most successful approach, since it takes advantage of the large amount of available information.

The authors are with the Bioingenium Research Group at Universidad Nacional de Colombia, Bogotá, Colombia, e-mails: {aacruzr, rramosp, fagonzalezo}@unal.edu.co

1 http://www-01.ibm.com/software/data/bigdata/

Another important source of visual information is scientific research. For instance, medical practice and biomedical research, thanks to the advances in digital image acquisition and processing, are generating large collections of images, some of them publicly available [11], [13], [15], [6], [16]. These large biomedical databases have a great potential, not yet exploited, as a source of information and knowledge that could impact biomedical research in different application fields such as diagnosis, prognosis and theragnosis [8], [14], [9].

Effective extraction of all the information and knowledge in these image collections requires the application of different types of algorithms from image processing, machine learning and pattern recognition. These processes are usually organized in pipelines that involve multiple tasks, some of them highly demanding in terms of computational resources.

There are several applications and frameworks that provide a set of feature extractors, but most of them are not designed to efficiently exploit high-performance computing infrastructures for large-scale image processing and analysis, much less over computing resources of different natures (opportunistic, cloud, etc.). For example, LIRe (Lucene Image Retrieval) [7] is a well-known Java-based library for content-based image retrieval which provides a broad set of feature extractors. ImageJ2 is an open-source Java-based library for image analysis, segmentation and processing, with several plugins specialized for different applications. IMMI (RapidMiner 5 Image Mining Extension3) is an open-source plugin for the well-known machine learning framework RapidMiner, aimed at image mining workflows. It is one of the most complete frameworks for designing typical workflows for image processing, mining and classification. Among its features one can find feature extraction (local, segment and global features), face/object detection, image classification and segmentation, similarity measures for images, point-of-interest detectors, and many others. However, comprehensive use of large amounts of computing resources remains an artisanal and ad-hoc task (for instance, RapidMiner requires its agents to be manually distributed through SSH). In the long run, our framework is conceived to support a wide range of image analysis and processing algorithms, but also machine learning and data mining ones.

This paper presents a general framework for high-performance image analysis that allows the specification of

2 http://rsbweb.nih.gov/ij/index.html
3 http://splab.cz/en/research/data-mining/articles

978-1-4673-1476-3/12/$31.00 © 2012 IEEE



pipelines involving different types of image processing and analysis tasks. The framework provides an abstraction layer that encapsulates different types of high-performance architectures, including grid, opportunistic and cloud infrastructures, as well as conventional servers and desktops. Additionally, the framework encourages a clear distinction of roles to enable framework developers, algorithm designers and experimenters to work in a coordinated way. The framework is evaluated on a publicly available histology image database, performing a visual feature extraction task.

The rest of the paper is organized as follows: Section II describes image processing pipelines and how they are modeled in the proposed framework; Section III presents the overall framework architecture; Section IV shows the experimental evaluation and results; finally, Section V concludes and presents future work.

II. IMAGE PROCESSING PIPELINES

A. Introducing image processing pipelines

We consider an image processing pipeline as made of a set of consecutive stages, each one defining an operation (algorithm) to be executed over input data (images or datasets) and producing output data (datasets). Figure 1 shows a simple image analysis pipeline consisting of (1) a feature extraction algorithm that is fed raw image data and produces a vector of features (such as a gray histogram), (2) a model training algorithm (such as a support vector machine) feeding on the feature vectors and producing a trained model, and (3) a prediction stage that outputs predictions on the extracted features using the trained model.

Figure 1. Simple image analysis pipeline
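As a rough illustration of the stage chain just described, the following sketch implements a toy gray-histogram feature extractor behind a generic stage interface. The names (`Stage`, `GrayHistogram`) are hypothetical, chosen for this example; they are not the framework's actual API.

```java
// Hypothetical stage abstraction: each stage maps an input to an output,
// so a pipeline is a chain extract -> train -> predict.
interface Stage<I, O> {
    O run(I input);
}

// Toy feature extractor: a 4-bin normalized gray histogram
// over pixel intensities in the range 0..255.
class GrayHistogram implements Stage<int[], double[]> {
    public double[] run(int[] pixels) {
        double[] hist = new double[4];
        for (int p : pixels) hist[p / 64]++;                 // bin by intensity
        for (int i = 0; i < 4; i++) hist[i] /= pixels.length; // normalize
        return hist;
    }
}
```

A model training stage would analogously implement `Stage<double[][], Model>`, consuming the feature vectors this stage produces.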

Furthermore, each algorithm in each stage is usually controlled by a set of configuration parameters and undergoes several runs (for cross-validation, bootstrapping, etc.) on different subsets of the available data. Each execution of an algorithm (with a given set of configuration parameters, or subset of the data) is prone to be run concurrently with others and, thus, there is an inherent possibility to exploit parallelism. Figure 2 shows further detail of the feature extraction and model training stages of the previous example pipeline.

In this case, the pipeline does the following: (1) splits the initial image dataset in two subsets to allow parallel processing of each subset, (2) includes two configurations for the feature extraction stage and (3) uses three-fold cross-validation in the model training stage. As can be seen, the need for computing power starts to be a crucial issue, but the possibilities for exploiting parallelization also become evident. In this case, the same feature extraction algorithm can run four instances in parallel, one for each configuration and split of the initial image collection. After that, the model training stage can be run in parallel for each one of

Figure 2. Detailed image analysis pipeline

the three folds for each resulting dataset from each feature extraction configuration. Also, there are intermediate datasets that need to be managed from one stage to the next. As experiments include more splits of the data, configurations for the algorithms (both for feature extraction and model training) and validation methods, the amount and complexity of the artifacts increases, and efficiently managing data and computation becomes essential. Thus, the need for a framework.

B. Pipeline model

We use the pipeline model illustrated in figure 3 as the foundation for our architecture and for the scheduling algorithms for computation and data distribution. Dotted lines represent one-to-many relationships. Therefore, we consider a pipeline as composed of many repeats; each repeat includes several configurations made of different stages.

Figure 3. Components of a pipeline

The intuition behind each component of this pipeline model is as follows. A configuration is a combination of parameter values of the algorithms of the stages of the pipeline. Thus, a pipeline contains many configurations, corresponding to the different parameter values for the algorithms that the experimenter wishes to explore. The notion of a stage was described in section II-A. Synchronization among stages is straightforward, since a stage cannot start if its preceding stages have not finished. Also, any stage may have several runs and/or splits. A run corresponds to the notion of having the same algorithm configuration run several times over related data to gain statistical confidence in its results. For example, when using cross-validation, each fold corresponds to one run in our framework; when using bootstrapping, each resampling of the data corresponds to one run as well. Finally, a split corresponds to the idea of having the input data divided to allow algorithms to be executed in parallel over different subsets of the data. Of course, this depends on the capability of particular algorithms to be parallelized to exploit data partitioning. Algorithm providers express this capability through the API (Application Programming Interface) offered by the framework for them to implement or adapt their code (see section III below). Finally, having a pipeline with several



repeats allows the experimenter to run the same pipeline several times. This might be desirable in cases where algorithms are stochastic in nature and one might need to run the same pipeline several times for statistical smoothing.
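As a rough sketch of the hierarchy just described (a pipeline containing repeats, which contain configurations made of stages with runs and splits), the one-to-many relationships could be modeled as plain Java classes. All names here are illustrative, not the framework's actual classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the hierarchical pipeline model of figure 3.
// The dotted (one-to-many) relationships become List fields.
class Pipeline {
    List<Repeat> repeats = new ArrayList<>();            // for statistical smoothing
}
class Repeat {
    List<Configuration> configurations = new ArrayList<>();
}
class Configuration {
    Map<String, String> parameterValues = new HashMap<>(); // one parameter combination
    List<PipelineStage> stages = new ArrayList<>();        // executed in order
}
class PipelineStage {
    String algorithmClass;  // Java class implementing the algorithm
    int runs;               // e.g. cross-validation folds or bootstrap resamplings
    int splits;             // data partitions that may execute in parallel
}
```

The schedule described in section III can then be derived by walking this tree and emitting one task per (configuration, stage, run, split) combination.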

Figure 4. Example configuration file defining a one-stage pipeline

C. Defining pipelines

Pipelines are defined in configuration files that are prepared by the experimenter. Then, they are handed over to the framework, which generates the schedule for computation and data distribution; the schedule is later effectively performed by worker modules, as described in section III below.

Figure 4 shows a simple configuration file defining a single-stage pipeline. Note how, for this single stage, (1) the algorithm to run is set by specifying in the first line the Java class implementing it, (2) the algorithm has two parameters (lowPass and highPass) containing three values each, which makes 9 configurations for this pipeline, (3) the location of the input data to feed the algorithm (origin) and the location to store the data produced by the algorithm (destination) are specified separately; and (4) input data is divided into 20 splits which, if there are enough worker modules available at the same time, will run in parallel, one worker per split.
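Figure 4 itself is not reproduced in this transcript. Based on the description above, a configuration of that shape might look roughly like the following sketch; the key names and values are purely illustrative, not the framework's actual syntax:

```properties
# Hypothetical sketch, not the actual Figure 4 contents.
stage1.algorithm   = org.example.LowHighPassExtractor  # Java class implementing it
stage1.lowPass     = 10, 20, 30                        # 3 values
stage1.highPass    = 100, 150, 200                     # x 3 values = 9 configurations
stage1.origin      = images_table                      # where input data is read from
stage1.destination = features_table                    # where output data is stored
stage1.splits      = 20                                # data partitions run in parallel
```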

Input and output data are stored in database tables using Google’s Big Table model [2], with predefined column families for data content and metadata. The framework integrates algorithm and data source implementations that developers produce following a well-defined API within the framework and a simple Java-based deployment model. This allows experimenters to integrate new algorithms very easily while preserving the independence of their code and work.

A pipeline is then executed by the framework which, following this model, schedules its components for execution in the appropriate order and handles data as it is needed or produced. For this, the framework provides worker modules that can be deployed onto available computing resources by the experimenter. These resources can be idle desktop computers in their lab (implementing an opportunistic computing model), virtual machines on the Amazon cloud, or computing nodes of a dedicated cluster.

The framework is available to experimenters through a command line interface that can be used to load pipelines into the system, produce their execution schedule, and retrieve pipeline progress and status. It also allows experimenters to load the databases with their files (images to be processed) and start worker modules at their convenience.

The following section describes the architecture of the framework and the scheduling mechanism through which workers cooperate to execute a pipeline in an entirely distributed mode.

III. FRAMEWORK ARCHITECTURE

A. Pipeline execution model

The execution model for pipelines follows two steps. First, when given a pipeline, the framework generates the ordered schedule of pending computing tasks and stores it in a database. Then, worker modules look up the schedule and take over the next free task to execute. Tasks can be in one of three states: pending, in-progress or done.

The schedule is simply a list of tasks: for instance, run the algorithm SimpleFeaturesExtractor over split number 2 of the input data; or summarize the results of a stage when all runs are done (such as when all cross-validation folds have been evaluated). The hierarchical pipeline model in figure 3 implicitly defines the execution order of tasks, allowing us to group and aggregate results as they are produced. That is, no worker will take over the task of summarizing a stage until all its runs are done; in that case, when a worker starts or becomes free, it will start other runs in other stages or configurations.

B. Task distribution and coordination

There is no a-priori distribution of tasks among workers. Instead, whenever a worker becomes available, it looks up the schedule in the database, takes over the next free task and tags it as in-progress so that other workers know it is not free anymore. As illustrated in figure 5, there is no central intelligent node that workers contact to retrieve tasks; rather, a database with scheduling information is shared among workers, who contain the logic to decide what to do next based on the information in the scheduling database, without interfering with the rest of the workers. Therefore, when an experimenter starts the execution of a pipeline (that is, when they ask the framework to create the schedule for a pipeline), they are not required to know or set in advance how many workers will contribute to the execution of the pipeline. Workers can join a pipeline execution at any time.

Figure 5. Only a shared database, no central coordination node

This empowers the framework with the flexibility to adapt to different computing models, whether opportunistic (desktop computers can start workers anytime), on the cloud (with



workers encapsulated in virtual machines), or over dedicated clusters (with workers available at each computing node).

In this model, coordination among workers is done indirectly through the state of the tasks in the shared scheduling database. Coordination basically requires that the operation of checking a task's state and modifying it be atomic (that is, within the same transaction). In addition, when workers take over a task, they periodically report (ping) in the database that they are active on that task. When another worker is looking for what to do next, it might find tasks in-progress with no recent ping; in such a case, the new worker will assume that the worker previously executing that task has died, and will therefore take over and restart that task. If the previous worker comes back to life and sees that somebody else has pinged its task, it will abandon it and look for something else to do. Synchronization among workers to perform a list of tasks is thus achieved with simple mechanisms and without the need for a central node. All logic is within each worker, and it is very simple.
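A minimal sketch of this claim-by-atomic-update logic follows, with an in-memory `ConcurrentHashMap` standing in for the shared scheduling table; its atomic `replace(key, expected, new)` plays the role of the check-and-put operation described in section III-E. The class and method names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the shared scheduling table: task id -> state.
class ScheduleTable {
    static final String PENDING = "pending", IN_PROGRESS = "in-progress", DONE = "done";
    final Map<String, String> tasks = new ConcurrentHashMap<>();

    // A worker scans the schedule and atomically claims the first free task.
    // Only one worker can win the replace() for a given task, so no two
    // workers ever execute the same task.
    String claimNextTask() {
        for (String taskId : tasks.keySet()) {
            if (PENDING.equals(tasks.get(taskId))
                    && tasks.replace(taskId, PENDING, IN_PROGRESS)) {
                return taskId;  // claimed: other workers now see it as in-progress
            }
        }
        return null;            // nothing free right now
    }

    void finish(String taskId) {
        tasks.put(taskId, DONE);
    }
}
```

A real worker would additionally record a periodic ping timestamp alongside the state, so that stale in-progress tasks can be detected and restarted as described above.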

C. Workers deployment

The only condition for a worker to participate in a pipeline is connectivity to the shared scheduling database and to the input and output databases. A worker is a process that looks up the next task to do, retrieves data as required, launches the algorithm specified, and stores the data produced. As such, it can be encapsulated as required to enable its deployment over available computing resources. The whole framework is programmed in Java, which favors its portability and, in addition, provides the mechanisms to embed native binary code for different hardware architectures (for native algorithm implementations).

Currently, workers have been encapsulated as a command line tool, within Java Web Start and within an Amazon virtual machine. Starting workers as a command line tool allows any machine with our framework installed to participate in the execution of a pipeline, simply by launching it from the command line or integrating it within the machine boot sequence. Java Web Start allows any user wanting to contribute their desktop machine to launch a worker using their browser. Finally, the command line tool has been encapsulated within an Amazon machine image (AMI) that can be cloned and launched to reach a desired number of workers on Amazon's cloud infrastructure.

In all cases, regardless of how workers are deployed and started, they will contribute to the pipelines scheduled in the shared database they are given access to.

D. APIs and contributing roles

A key issue in designing our framework was to ensure its extensibility. Although it already offers a set of feature extraction algorithms and an implementation to use HBase [5] for data storage, we foresee its real utility based on (1) its possibility to integrate new algorithms and data sources and (2) its capacity to allow experimenters, developers of algorithms and framework designers to contribute asynchronously to its evolution in a coherent manner.

http://docs.oracle.com/javase/6/docs/technotes/guides/javaws/

Figure 6 shows an overall view of the components of the framework. The command line tool allows experimenters to load pipeline definitions, load files, etc. The worker module effectively runs tasks and can be started by any of its launchers.

Figure 6. Framework components

Notice that the framework core provides neither concrete algorithms nor a storage layer. Rather, it defines two APIs, for algorithms and storage, and integrates whatever implementations of those APIs might be available. In some sense, the framework core acts as glue between the pipeline definitions made by the experimenter, the execution of the pipelines carried out by workers, and concrete implementations of specific algorithms and underlying storage.

This way, it supports and separates four contributing roles in its long-term evolution: the framework provider, who develops the framework core, workers and associated tools (launchers and command line); the algorithm provider, who develops image processing algorithms implementing the framework algorithm API; the storage provider, who develops interfaces with storage systems following the framework storage API; and, finally, the experimenter, who uses the whole framework to define and execute image processing pipelines.

The framework uses Java annotations and introspection to enable a seamless integration of artifacts (packaged implementations) from algorithm and storage providers. Annotations enable providers to tag the fields they use for parameters. This allows the framework to gather values for these fields from pipeline configuration files (such as the one in figure 4) and, through introspection, find the appropriate place in the implementations to set the values as the algorithm provider expects. Figure 7 shows an example implementation of an image processing algorithm. It conveys a sense of the nature of the algorithm API and the way annotations are used to tell the framework which ones are user-definable parameters.
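Figure 7 is not reproduced in this transcript. As a hedged sketch of how such an annotation-plus-introspection mechanism typically works (the annotation name `@Param`, the injector, and the example class are all illustrative, not the framework's actual API):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;

// Hypothetical parameter annotation: providers tag the fields that
// experimenters may set from a pipeline configuration file.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Param {}

// A provider's algorithm exposes its parameters as annotated fields.
class SimpleFeaturesExtractor {
    @Param int lowPass;
    @Param int highPass;
}

class ParamInjector {
    // The framework reads values from the configuration file and, via
    // introspection, sets each annotated field whose name matches a key.
    static void configure(Object algorithm, Map<String, String> config) {
        try {
            for (Field f : algorithm.getClass().getDeclaredFields()) {
                if (f.isAnnotationPresent(Param.class) && config.containsKey(f.getName())) {
                    f.setAccessible(true);
                    f.setInt(algorithm, Integer.parseInt(config.get(f.getName())));
                }
            }
        } catch (ReflectiveOperationException ex) {
            throw new RuntimeException(ex);
        }
    }
}
```

The key design benefit is that the algorithm provider never touches configuration parsing: tagging a field is enough for the framework to locate and populate it.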

Then, the algorithm provider distributes the implementation as a jar file or includes it in the framework source tree. The framework is distributed as a compressed file that users unzip in their preferred location; they can start using it right away to load images and pipelines and to start workers. The framework also includes an implementation of the storage API for using an existing HBase installation out of the box, and 16 feature extraction algorithms based mostly on the ImageJ library.



Figure 7. Example implementation following the Algorithm API

E. Google’s Big Table model

The framework’s storage API follows Google’s Big Table model, which encourages tables indexed by a single key value, used mostly through sequential scans on specific columns. Columns can be added and removed dynamically within a set of column families established when creating a table.

Scheduling, input and output data are managed by the framework in tables using this model, through concrete implementations of the storage API (HBase, in this case). Apart from performance and scalability qualities, two features of the Big Table model have been particularly useful for our framework. First, the column family model has allowed us to create columns as required whenever we needed to tag input data to distribute it into splits or runs for each pipeline; workers then scan input data tables and filter the rows required for the split or run assigned to them. Second, although Big Table (and HBase) are essentially non-transactional, they provide one single compound transaction named check-and-put, which offers, in an atomic operation, the capability to modify a row only if one of its columns has a certain value; it ensures the modification takes place only if the row had that value during the call. This is exactly the single transactional operation that our workers require to ensure their synchronization indirectly through the scheduling database, as explained above.

IV. EXPERIMENTAL EVALUATION

For real uses of this framework, a research group may have many options depending on their computational resources. For example, they could have a small set of desktop computers for preliminary experiments and then access a larger cloud resource for production or massive processing. In image processing, an important decision is usually whether source data is to be stored in raw format or compressed. In this sense, two different scenarios were defined in order to evaluate the framework on a typical image analysis pipeline over an image database in terms of distributed processing. In the first scenario we use an image dataset with one hundred images in JPG format (compressed), whereas in the second scenario we have the same images in BMP format (raw). The average sizes of the compressed and raw images are 500KB and 2.3MB respectively. Both scenarios were evaluated on a small and heterogeneous set of computational resources, typically available in a common research lab: two laptops (on a Wi-Fi connection) and three desktop computers (on a LAN connection)

with different configurations of memory and CPUs, running a maximum of two workers per computer.

These experiments are defined in a simple pipeline which extracts a typical feature (the MPEG-7 dominant color) from the mentioned image datasets using from 1 to 10 workers. The results in figures 8 and 9 show the total elapsed and compute times for each number of workers. The elapsed time is measured from the start of the experiment until the end, including all data transfer times and other delays inherent to the distribution of tasks. Compute time is measured at each worker just before and after calling the image processing operation and, therefore, excludes any transfer time. Elapsed times for the dataset with compressed images range from 273 seconds using one worker to 51 seconds using ten workers. In the case of the dataset with raw images, elapsed times range from 364 seconds (one worker) to 75 seconds (ten workers). This is clearly related to the fact that transfer time is greater for raw images than for compressed images. However, compute time is higher on average for compressed images (~340 seconds) than for raw images (~260 seconds), due to the fact that they require additional processing (decompressing the images) and the extra libraries that must be loaded for this purpose.

Figure 8. Computing time vs. number of workers for the compressed image collection (JPG format).

Figure 9. Computing time vs. number of workers for the raw image collection (BMP format).

In a different view, figure 10 shows the speedup obtained in both experiments as a function of the number of workers, with respect to the sequential case (thick black and red plots). A speedup of 2x means that the execution time is halved. First, it shows how, as the number of workers increases, we get slightly greater speedups in the case of the compressed images (JPG format), probably due to the fact that JPGs are smaller and less costly to transfer. Thinner lines show the theoretical speedup given by Amdahl's law when the parallelizable fraction of the code is between 50% and 95%. It can be seen that the speedup obtained in our experiment is comparable to the theoretical case of having between 80% and 90% of the code parallelizable. Since our experiments include the time to transfer images, load libraries, etc., we consider this to be the expected behavior.

Figure 10. Theoretical and experimental speed-up.
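The theoretical curves in figure 10 follow Amdahl's law directly; a short numeric check (plain Java, independent of the framework) shows how the observed speedups map onto the parallelizable fraction:

```java
public class Amdahl {

    // Amdahl's law: with parallelizable fraction p and n workers,
    // speedup S(n) = 1 / ((1 - p) + p / n).
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // At n = 10 workers:
        //   p = 0.80 -> S ~ 3.57
        //   p = 0.85 -> S ~ 4.26
        //   p = 0.90 -> S ~ 5.26
        for (double p : new double[] {0.80, 0.85, 0.90}) {
            System.out.printf("p=%.2f, n=10: S=%.2f%n", p, speedup(p, 10));
        }
        // Observed JPG-dataset speedup at 10 workers: 273 s / 51 s ~ 5.35,
        // which falls inside the 80%-90% band discussed above.
        System.out.printf("observed: %.2f%n", 273.0 / 51.0);
    }
}
```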

V. CONCLUSION AND FUTURE WORK

In this work we described our approach to tackling the problem of executing image analysis pipelines at large scale on a set of available heterogeneous computing resources. In this endeavor we pursued both agility for the experimenter when harnessing computing power and adaptability of our framework to resources available through different computing models (opportunistic, cloud, clusters). As shown, this is achieved by: (1) assuming a hierarchical model for a pipeline, (2) generating a task schedule associated with a pipeline and stored on a shared database, and (3) integrating a simple logic within computing workers based on querying the shared schedule. The underlying storage model also proved appropriate and scalable together with this computation model.
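The worker logic summarized in point (3) can be sketched as follows. This is a minimal illustration, assuming hypothetical names; in the framework the schedule lives in the shared Bigtable-style database, for which a concurrent in-memory queue stands in here:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal sketch of the shared-schedule model: each worker repeatedly
// claims pending tasks from a shared schedule and processes them.
// All names are illustrative, not the framework's real API.
public class WorkerSketch {

    record Task(String imageId) {}

    // Stand-in for the shared schedule stored in the shared database.
    static final Queue<Task> schedule = new ConcurrentLinkedQueue<>();

    // A worker polls the shared schedule until no pending tasks remain.
    static void workerLoop() {
        Task t;
        while ((t = schedule.poll()) != null) {
            process(t);   // e.g. feature extraction on one image
        }
    }

    static void process(Task t) {
        System.out.println("processed " + t.imageId());
    }

    public static void main(String[] args) {
        schedule.add(new Task("img-001"));
        schedule.add(new Task("img-002"));
        workerLoop();
    }
}
```

Because workers hold no state beyond this loop, adding a worker on any resource (laptop, cluster node, cloud instance) only requires that it can reach the shared schedule.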

In addition, our framework allows for a clear separation of roles, contributing to its evolution through APIs that delimit the work of each developer and by exploiting Java annotations and introspection capabilities. The experiments have also allowed us to test and understand its robustness, configurability and deployability, since a significant number of workers, datasets and experimental configurations have been deployed, undeployed, reset, etc. with great success.

Through experimentation, our framework has shown behavior comparable to code that is 85% parallelizable according to Amdahl's law, which is roughly what would be expected under the different data transfer and computing load conditions in which the experiments were carried out. Future work focuses on (1) fully supporting machine learning stages to enable more complex pipelines and (2) adding facilities to integrate MATLAB implementations of our algorithms so that earlier experimental phases in our research teams can also exploit the framework. Additionally, we are performing further experiments to gain confidence in using the framework on different platforms, especially on the cloud.

VI. ACKNOWLEDGEMENTS

This work was partially funded by the project “Automatic Annotation and Retrieval of Radiology Images Using Latent Semantic” (number 110152128803) and the project “Medical Image Retrieval System Based On Multimodal Indexing” (number 110152128767), through Colciencias call number 521, in 2010. Cruz-Roa also thanks Colciencias for its support through a doctoral grant in call 528, 2011.

REFERENCES

[1] Y. Bar-Hillel. The present status of automatic translation of languages. Readings in Machine Translation. MIT Press: Boston, pages 45–77, 2003.

[2] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7, OSDI ’06, pages 15–15, Berkeley, CA, USA, 2006. USENIX Association.

[3] J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM SIGGRAPH 2007 papers, 2007.

[4] J. Z. Wang, J. Li, and G. Wiederhold. Simplicity: semantics-sensitive integrated matching for picture libraries. IEEE Trans. Pattern Anal. Mach. Intell., 23(9):947–963, 2001.

[5] A. Khetrapal and V. Ganesh. HBase and Hypertable for large scale distributed storage systems. Dept. of Computer Science, . . . , 2006.

[6] E. Lein, M. Hawrylycz, and N. Ao. Genome-wide atlas of gene expression in the adult mouse brain. Nature, 2006.

[7] M. Lux and S. A. Chatzichristofis. Lire: lucene image retrieval: an extensible java cbir library. In Proceedings of the 16th ACM International Conference on Multimedia, MM ’08, pages 1085–1088, New York, NY, USA, 2008. ACM.

[8] A. Madabhushi. Digital pathology image analysis: opportunities and challenges (editorial). Imaging In Medicine, 1(1):7–10, October 2009.

[9] A. Madabhushi, A. Basavanhally, S. Doyle, S. Agner, and G. Lee. Computer-aided prognosis: predicting patient and disease outcome via multi-modal image analysis. In Proceedings of the 2010 IEEE International Conference on Biomedical Imaging: From Nano to Macro, ISBI’10, pages 1415–1418, Piscataway, NJ, USA, 2010. IEEE Press.

[10] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[11] M. Martone, S. Zhang, A. Gupta, and X. Qian. The cell-centered database. Neuroinformatics, 2003.

[12] C. Meadow, B. Boyce, D. Kraft, and C. Barry. Text information retrieval systems. 2007.

[13] H. Muller, N. Michoux, D. Bandon, and A. Geissbuhler. A review of content-based image retrieval systems in medical applications - clinical benefits and future directions. International Journal of Medical Informatics, 73:1–23, Feb. 2004. PMID: 15036075.

[14] F. Pene, E. Courtine, A. Cariou, and J. Mira. Toward theragnostics. Critical Care Medicine, 37(1 Suppl):S50–58, Jan. 2009. PMID: 19104225.

[15] A. Persson, S. Hober, and M. Uhlen. A human protein atlas based on antibody proteomics. Current Opinion in Molecular Therapeutics, 8(3):185, 2006.

[16] F. Pontén, K. Jirström, and M. Uhlen. The Human Protein Atlas: a tool for pathology. The Journal of Pathology, 216(4):387–393, 2008.

[17] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1470–1477 vol. 2, 2003.

[18] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22:1349–1380, 2000.

[19] J. Vogel and B. Schiele. Semantic modeling of natural scenes for content-based image retrieval. International Journal of Computer Vision, 72(2):133–157, 2006.

[20] J. Wang and Y. Du. RF*IPF: a weighting scheme for multimedia information retrieval. In Image Analysis and Processing, 2001. Proceedings. 11th International Conference on, pages 380–385, 2001.