The Role of Open Source in Data Integration
Third Nature, January 2009
TABLE OF CONTENTS
Introduction
  Open Source and the Future of Data Integration
    Spending Priorities Emphasize Need for Data Integration
    The Drive Toward Open Source Data Integration
Understanding Data Integration
  The Difference Between Application Integration and Data Integration
  Operational Data Integration vs. Analytic Data Integration
  Three Approaches for Data Integration
    Consolidation
    Propagation
    Federation
  Creating Solutions for Operational Data Integration Problems
    The Most Common Practice: Custom Coding
    The Standard Option: Buy a Data Integration Product
    The Third Alternative: Open Source
The Benefits of Open Source for Data Integration
  Flexibility
  Vendor Independence
  Optimal Price
Recommendations
Introduction
Open Source and the Future of Data Integration
Data integration (DI) has seen minimal automation over the past decade despite many technology advances. Most companies still hand-code data integration between applications (operational data integration) using techniques that would be familiar to a programmer from the 1980s. In business intelligence, 40% of extract, transform and load (ETL) processes are still hand-coded.
In the next few years it's likely that there will be stable-to-declining investment in new applications due to economic factors, but an increased need for data integration technology. Integration consumes a significant portion of the IT budget and is coming under heavier scrutiny as a cost to control.
New vendors are addressing this gap in data integration, but face adoption challenges in IT. Data integration is a developer task and an infrastructure item, making new tools hard to justify. Fortunately, there are open source products in this market capable of supplying the much-needed automation.
Open source data integration tools can provide the cost advantages of hand coding with the productivity advantages of traditional data integration software. They are established in the developer tools market, which has been the traditional stronghold of open source software. Expect open source to be a key component of data integration (and especially of operational data integration) in the near future, similar to the way it is a key component of application development environments today.
Spending Priorities Emphasize Need for Data Integration
Business intelligence appears consistently as the top item in surveys of business and IT management priorities. More business intelligence means greater need for data integration tools. BI market surveys show that roughly 40% of companies are hand coding their ETL processes, leaving room for growth.
A CIO Insight IT spending survey shows that standardizing and consolidating IT infrastructure is the number one priority in the coming year for large firms, and number two in medium and small firms. Across all firms, improving information quality shows up as the number three priority.
Results from a survey by Oracle on data integration showed that 30% of their customers are buying tools for operational data integration today. The top needs of these customers were low-latency data access, doing migrations, and creating data services.
These statistics clearly highlight the new focus on data integration. Only 60% of the business intelligence market is using data integration tools, so there is still room for growth. With 70% to 85% of companies still hand coding operational data integration, it's clear that this is an area ready for automation, with early adopters already using these tools.
The Drive Toward Open Source Data Integration
Open source has become a standard part of the infrastructure in IT organizations. Most are using Linux and open source development tools, and many are running open source databases. The majority of enterprise web infrastructure is built using open source.
This growing familiarity with open source led to increased adoption rates across all categories of tools and applications. Venture capital flooded into open source startups over the past several years, resulting in an explosion of enterprise-ready tools and applications.
Open source data integration vendors are creating challenges for both traditional vendors in the DI market who are trying to introduce new tools, and for new non-open-source vendors of operational data integration tools. The existence of open source tools in a market raises barriers to entry that are hard for vendors to address. This is the scenario that played out in the web server, application server, and Java development tools markets.
Enterprise customers are demanding project-sized data integration tools that can be scaled up to enterprise use. They don't want complex, expensive DI products that are not a fit with the distributed nature of the application environment. With such a large market need, the future direction of data integration is sure to have a large open source component.
Understanding Data Integration
The Difference Between Application Integration and Data Integration
Data integration (DI) and enterprise application integration (EAI) are not the same thing, though vendors sometimes obscure the difference to broaden the appeal of their tools. Application integration focuses on managing the flow of events (transactions or messages) between applications. Data integration focuses on managing the flow of data and providing standardized ways to access the information.
Application integration addresses transaction programming problems, allowing one to directly link one application to another at a functional level. The functions are exposed to external applications via the tool's API, thus hiding all applications behind a common interface.
Data integration addresses a different set of problems. DI standardizes the data rather than the transaction or service call, providing a better abstraction for dealing with information that is common across systems. DI tools abstract the connectors, the transport and, more importantly, the manipulation of data, not just the system endpoints. When done properly, DI ensures the quality of data as it is being integrated across applications.
The type and level of abstraction are what differentiate the two classes of integration. EAI tools are a transport technology that requires the developer to write code at the endpoints to access and transform data. These tools treat data as a byproduct. This makes functions reusable at the expense of common data representations.
Data integration tools use a higher level of abstraction, hiding the physical data representation and manipulation as well as the access and transport. The tools provide data portability and reusability by focusing on data and ignoring transaction semantics. Because they are working at the data layer there is no need to write code at individual endpoints, and all data transformation and validation is done within the tool.
The key point in differentiating DI and EAI is to know that there are two distinct types of integration with separate approaches, methods and tools. Each has its role, one for managing transactions and one for managing the data those transactions operate on.
Operational Data Integration vs. Analytic Data Integration
There are two different ways of using data integration tools based on the type of systems being integrated: transactional applications or business intelligence systems. These uses affect the approach, methods and tools that are best for the job. Extract, transform and load, or ETL, is the term used in analytic systems. The industry is settling on the term operational data integration, or OpDI, when referring to data integration for applications.
Business intelligence has been the primary driver of data integration products for the past decade. BI systems are most often loaded in batch cycles according to a fixed schedule, bringing data from many systems to one central repository. They have relatively large volumes of data to process in a short time, but have little concurrent loading activity. Most products were originally designed to meet the specific needs of the analytic data integration market.
The nature of operational data integration problems is different. Data integration is a small element of an application project, unlike a data warehouse where DI may consume 80% of the project budget and timeline.
Most application integration projects need data from one or two other systems, not the many sources and tables feeding a data warehouse. The scope is usually smaller, with lower data volumes and narrower sets of data being transferred with minimal transformation.
A key challenge for OpDI is that the data is usually needed more frequently than one batch per night, unlike most analytic environments. Traditional ETL products for the data warehouse market don't handle low-latency requirements as well as other integration tools. This makes ETL a poorer fit for some types of operational data integration.
The differences in frequency of execution, data volume, latency and scope are technical elements that differentiate operational and analytic data integration. The other characteristic that separates them is usage scenarios. How people integrate data in operational environments is different.
Three Approaches for Data Integration
The data integration scenarios commonly encountered in projects can be mapped to one of three underlying approaches: consolidation, propagation or federation. Consolidation implies moving data to a single central repository where it can be accessed. With propagation the data is copied from the sources to the application's local data store. Federation leaves data in place while centralizing the access mechanisms so the data appears to consuming applications as if it were consolidated.
[Figure: the three approaches (Consolidation, Propagation, Federation)]
Consolidation
The concept of consolidation is to move the data wholesale from one or more systems to another. All integration and transformation is done before it is loaded in the target system. This is most often seen in business intelligence, where ETL is used to centralize data from many systems into a single data warehouse or operational data store. Outside of analytic environments, a single centrally accessed repository is most likely to be found in master data management and CRM projects.
In the world of operational data integration there are several other scenarios that fit within a consolidation approach. System migrations, upgrades and consolidations all require large-scale movement of data from one system to another.
Consider a merger or acquisition where there are redundant systems between the two companies. If the companies are running multiple instances of the same software they can reduce the cost of software maintenance and operations by consolidating these into one instance.
Merging the data from several instances of an ERP system is not a trivial task. There can be thousands of tables to copy and merge, and that's the simple part. Data quality issues are usually discovered in the process. The solution may require deduplicating customer records, merging vendors, or reassigning and cross-referencing product numbers.
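The deduplication and cross-referencing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Customer` fields, the `erp_a`/`erp_b` source labels, the choice of a normalized email address as the match key, and the `C0001`-style consolidated ids are all hypothetical and do not come from any particular ERP product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    source: str    # hypothetical label for the ERP instance the record came from
    local_id: str  # identifier inside that instance
    name: str
    email: str


def match_key(c: Customer) -> str:
    """Normalize the field used to decide two records are the same customer.

    Matching on email alone is an assumption made to keep the sketch short.
    """
    return c.email.strip().lower()


def consolidate(records):
    """Merge records from several instances.

    Returns (merged, xref): merged maps a new consolidated id to the surviving
    record, and xref maps (source, local_id) to that consolidated id so old
    identifiers can still be traced after the merge.
    """
    merged = {}
    xref = {}
    key_to_id = {}
    for rec in records:
        key = match_key(rec)
        if key not in key_to_id:
            new_id = f"C{len(key_to_id) + 1:04d}"
            key_to_id[key] = new_id
            merged[new_id] = rec  # the first record seen survives
        xref[(rec.source, rec.local_id)] = key_to_id[key]
    return merged, xref


if __name__ == "__main__":
    sample = [
        Customer("erp_a", "100", "Acme Corp", "ap@acme.example"),
        Customer("erp_b", "77", "ACME Corporation", " AP@Acme.example "),
        Customer("erp_b", "78", "Globex", "billing@globex.example"),
    ]
    merged, xref = consolidate(sample)
    print(len(merged), xref[("erp_b", "77")])
```

Even this toy version has to keep a cross-reference table alongside the merged data, which hints at why the same job across thousands of tables is far from trivial.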
The advantage of not physically copying data is that there are no databases to create or tables to manage, speeding development.
The same tasks and problems occur in a single company when migrating from one vendor's application to another, for example when moving from an internal CRM system to a hosted application. Even packaged application upgrades can involve a level of data migration. Deploying new applications almost always involves importing data and setting up data feeds to and from other systems.
Propagation
Unlike the one-time job of an upgrade, migration, or consolidation, propagation is an ongoing activity. Propagation is the most popular approach used for repetitive data integration because it's the simplest to implement. When an application needs data from another system, an automated program or database tool is used to copy the data. Data transformation, if any, is done as part of the process before loading the data into the target.
Depending on the tools, propagation can be scheduled as a batch activity or triggered by events. Most of the time it is done as a push model from the source to the target, but it can also be implemented as a pull model driven by the application.
The data movement may be one-way or bidirectional. One-way data movement is common in scenarios where an application needs periodic data feeds or refreshes of reference data. For example, a product pricing system needs to send price updates to a website, an order entry system and a customer service system.
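A one-way push of price updates to several targets might look like the following Python sketch. The `propagate` function, the shape of an update row, and the idea of representing each target as a callable are assumptions made for illustration; a real feed would write to databases, queues or web services instead of in-memory lists.

```python
from typing import Callable, Dict, List

PriceUpdate = Dict[str, object]          # e.g. {"sku": "A1", "price_cents": 1999}
Target = Callable[[PriceUpdate], None]   # anything that accepts one update


def identity(update: PriceUpdate) -> PriceUpdate:
    """Default transformation: pass the row through unchanged."""
    return update


def propagate(updates: List[PriceUpdate],
              targets: Dict[str, Target],
              transform: Callable[[PriceUpdate], PriceUpdate] = identity) -> Dict[str, int]:
    """Push every update, after transformation, to every target.

    Returns the number of rows delivered per target name.
    """
    delivered = {name: 0 for name in targets}
    for update in updates:
        row = transform(dict(update))    # copy so the source row is never mutated
        for name, send in targets.items():
            send(row)
            delivered[name] += 1
    return delivered


if __name__ == "__main__":
    website, order_entry, customer_service = [], [], []

    def add_currency(row: PriceUpdate) -> PriceUpdate:
        row["currency"] = "USD"          # hypothetical transformation step
        return row

    counts = propagate(
        [{"sku": "A1", "price_cents": 1999}],
        {"website": website.append,
         "order_entry": order_entry.append,
         "customer_service": customer_service.append},
        add_currency,
    )
    print(counts)
```

The copy loop is the easy part, which is why propagation is so often written by hand. Scheduling, error handling and monitoring are what the sketch leaves out.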
Synchronizing data between systems is more challenging because it is bidirectional and can involve more than two systems. As the number of systems goes up, the number of possible connections explodes. Customer data is a common case where synchronization is used.
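The growth in connections can be made concrete with a short calculation. Assuming every system may need to exchange data directly with every other system, n systems have n(n-1)/2 possible pairs, and twice that many one-way flows if each direction is built separately. The function names below are illustrative.

```python
def point_to_point_links(n_systems: int) -> int:
    """Number of distinct system pairs among n systems: n * (n - 1) / 2."""
    if n_systems < 0:
        raise ValueError("n_systems must be non-negative")
    return n_systems * (n_systems - 1) // 2


def directed_flows(n_systems: int) -> int:
    """Number of one-way flows if each direction is implemented separately."""
    return 2 * point_to_point_links(n_systems)


if __name__ == "__main__":
    for n in (2, 4, 10):
        print(n, point_to_point_links(n), directed_flows(n))
```

Under this assumption, moving from four systems to ten raises the possible pairs from 6 to 45.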
Many applications can touch customer data, for example order entry, accounts payable, CRM and SFA systems. Some changes, like credit status, customer contacts or refunds, should be represented across all the systems when they occur. Because the systems are independent, it isn't possible to centralize the data. Instead, the data needs to be synchronized so changes in one location are reflected in other locations.
Propagation often leads to the need for synchronization because data is being copied and later changed in downstream systems. Data multiplies and discrepancies appear, leading to disagreements about which information is correct.
Dealing with these problems at enterprise scale can be overwhelming because of the tangle of hand-coded integration that evolved over the years with the applications. Propagation is an easy and expedient solution without tools, but creates data management problems.
Data integration tools can help solve these problems. The common toolset and information collected in the tool metadata make it easier to understand and manage the flow of data. This in turn simplifies maintenance tasks and speeds both changes and new projects that require access to existing data.
Federation
Federation is a method for centralizing data without physically consolidating it first. This can be thought of as centrally mediated access or on-demand data integration. The data access and integration are defined as part of a model, and that model is invoked when an application requests the data.
Federated data appears to an application as if it were physically integrated in one place as a table, file or web service call. In the background a process accesses the source data in the remote systems, applies any required transformations and presents the results, much like a SQL query but without the restriction that all of the data originate in a relational database.
Because federation is a view imposed on top of external sources, it's generally a one-way flow of information. It can't be used to synchronize or migrate data between two systems. This makes federation appropriate for a different class of problems, such as making data from multiple systems appear as if it came from a single source, or providing access to data that shouldn't be copied for security or privacy reasons.
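A minimal Python sketch of the idea follows: one function presents a unified customer view by reading a relational source and a non-relational source at request time, without storing a combined copy. The in-memory SQLite table, the CSV text, and all table, column and function names are stand-ins invented for the example, not part of any federation product.

```python
import csv
import io
import sqlite3


def open_orders_db() -> sqlite3.Connection:
    """Stand-in for a remote relational source (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id TEXT, order_no TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("c1", "o-1", 40.0), ("c1", "o-2", 15.5), ("c2", "o-3", 99.0)],
    )
    return conn


# Stand-in for a second, non-relational source such as a file export.
TICKETS_CSV = "customer_id,ticket,status\nc1,t-9,open\nc2,t-10,closed\n"


def customer_view(conn: sqlite3.Connection, tickets_csv: str, customer_id: str) -> dict:
    """Assemble one customer's data on demand; nothing is copied or persisted."""
    orders = [
        {"order_no": order_no, "total": total}
        for order_no, total in conn.execute(
            "SELECT order_no, total FROM orders "
            "WHERE customer_id = ? ORDER BY order_no",
            (customer_id,),
        )
    ]
    tickets = [
        {"ticket": row["ticket"], "status": row["status"]}
        for row in csv.DictReader(io.StringIO(tickets_csv))
        if row["customer_id"] == customer_id
    ]
    return {"customer_id": customer_id, "orders": orders, "tickets": tickets}


if __name__ == "__main__":
    print(customer_view(open_orders_db(), TICKETS_CSV, "c1"))
```

The view only reads from its sources, which matches the one-way nature of federation described above: a change has to be made in the source system itself.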
Federation is a useful approach in scenarios where it would be too costly to create and manage a database for the integrated data. For example, in a customer self-service portal there might be a dozen possible sources of data the customer could access.
Pulling the required data from many systems into a single database is possible in this scenario. The challenge is providing real-time delivery of this information. A change in any of a dozen systems must be immediately replicated to this database, a challenging and expensive task. By federating access or constructing a data service layer, the application developers can build the portal against a unified model without the need to copy data. The data is accessed directly from the source so there is no problem with delivering out-of-date or incorrect information.
Creating Solutions for Operational Data Integration Problems
Regardless of the data integration model, the final decision is usually governed by the project budget, timeline and what the developers are familiar with. IT focus on budget is at an all-time high, making it hard to justify the investment needed for data integration tools. This is an area where open source can help.
The Most Common Practice: Custom Coding
Industry surveys show that operational data integration is built by hand for more than three quarters of the application projects in production today.
Products get better over time. Hand-written code gets worse.
Hand coding is common because DI is not thought of in terms of infrastructure and data management, but in terms of glue for applications. While copying data from one place to another isn't optimal, it's still workable in the context of a single application. The price is paid in the overall complexity of integration spread throughout the enterprise.
Hand-coded integration is going to change due in large part to the new emphasis on external integration, for example with automated business processes involving outside companies or with the increasing use of SaaS applications.
Database administrators have no easy way to move data outside the company. The standard DBA tools do not allow them to exchange data with the web service interfaces used by most SaaS applications, nor do DBAs have the expertise to program to these interfaces.
Application developers have the skills to send and receive remote data and to program to web services. The problem is that operational data integration is more than the core tasks of extracting and moving data. Reliable production support means creating components to deal gracefully with exceptions, handle errors, and tie into scheduling, monitoring and notification systems. The additional work is enough to constitute its own project.
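To give a sense of that additional work, here is a small Python sketch of just one such component: a wrapper that retries a failing job, logs each error and calls a notification hook when every attempt has failed. The function name, its parameters and the `notify` callback are hypothetical; scheduling and monitoring integration are not shown at all.

```python
import logging
import time

log = logging.getLogger("integration")


def run_with_retries(job, attempts=3, delay_seconds=0.0, notify=None):
    """Run job(); retry on failure, log each error, notify if all attempts fail.

    job is any zero-argument callable. notify, if given, receives the last
    exception before it is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:  # a real job would catch narrower exception types
            last_error = exc
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(delay_seconds)
    if notify is not None:
        notify(last_error)
    raise last_error


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_extract():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("source unavailable")
        return "42 rows copied"

    print(run_with_retries(flaky_extract))
```

This covers only retries and alerting for a single job. Restartability, partial-load cleanup and operational dashboards would each add comparable amounts of code.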
Migrations, upgrades and consolidations are a slightly different problem. The complexity and scale of mapping hundreds to thousands of tables makes the labor of hand coding a poor choice. Beyond the amount of work, problems are hard to debug and there is no traceability for the data. The lack of easy traceability can create compliance and audit headaches after the new system is in production.
Hand coding for operational data integration is a dead-end investment. Products usually improve over time, but hand-written code rarely does: extending this code or fixing minor problems is a low priority relative to other IT needs. Since the code is written for a specific project it can rarely be reused on other projects the way a tool can be reused.
The Standard Option: Buy a Data Integration Product
Integration code is single-purpose, tools are multi-purpose. You should always go with tools when you can afford them.
Companies are recognizing the problems associated with hand-coded integration and are starting to evaluate and use data integration products. Coding requires proficiency with the operating system, data formats and language for every platform being accessed. DI tools improve productivity by abstracting work away from the underlying platforms. This allows the developer to focus on the logic rather than unimportant platform details.
There are a number of different tools available that can work for operational DI problems. Companies with a data warehouse are extending their use of ETL tools into this space. Tasks involving consolidation are particularly well suited to ETL tools because the problem domain matches their capabilities for large batch movement of data. The use is one-time so there is little danger of needing to pay for more licenses.
The large ETL vendors are shifting their product strategies to address operational DI needs and now call themselves data integration vendors. Their initial focus has been migrations and consolidations, although all have been reworking the tools to function better in propagation and synchronization scenarios where low-latency data access is more important.
ETL tools are still a poor fit for propagation and synchronization because of their inability to address high-concurrency, low-latency needs. Other problems with many of the products are their complexity, deployment architecture, and cost.
Most are designed as centralized servers. This forces all integration jobs onto a single server or cluster which must then be shared with other users. It is possible to run smaller independent servers for different applications, but the cost of doing this is prohibitive because of the server-based licensing model.
Companies need tools that can be deployed in a distributed manner at the point of use, and that can be given to any application developer who needs them. Enterprise server licensing for ETL, DI and SOA tools often prevents this.
The Third Alternative: Open Source
Open source offers a third alternative to the traditional buy versus build decision. When looking for tools directed at developers, the first step should always be to look for open source software. Assuming there is an acceptable solution, it's clear that you will save time and money over custom development.
Open source avoids the pitfalls of coding and gains the advantages of using tools.
Given the three quarters of companies hand coding integration, it's time to revisit the buy versus build decision. Open source data integration tools can address the shortcomings of hand coding. As full-featured tools, they offer the error handling, operational support and availability features that must be built in manual coding environments.
An advantage not often discussed is the productivity these tools bring to application developers. Aside from the standard integration tasks, they offer a significant improvement when dealing with heterogeneous systems and databases. DI tools expand the ability to do data integration to a developer audience who would otherwise lack the necessary platform skills.
Open source also has advantages over the current crop of DI tools on the market when it comes to operational data integration. The ability to expand traditional ETL tools for use in operational DI is limited because their centralized architecture and costly licensing models are a mismatch for distributed OpDI needs.
The budget for application projects can't absorb the high cost of enterprise DI software, which makes it hard to justify the purchase of a tool. Spending on infrastructure goes against project-based budgeting models and the ROI is hard to measure.
Data integration is, and will continue to be, viewed as application glue, so organizations need an alternative. If IT can't afford to fund an enterprise DI tool as an infrastructure item then the alternative is either hand coding or open source.
Open source data integration tools provide the cost advantages of hand coding with the productivity advantages of traditional data integration software. This is the real reward for using open source development tools.
The Benefits of Open Source for Data Integration
People often misunderstand or misrepresent the benefits open source provides. The same could be said of packaged software in general.
The sad truth of most software is that it is non-differentiating. It does not confer any competitive benefit to the company because a competitor can acquire identical software. Likewise, data integration tools are themselves not a differentiator. The difference is that these tools allow developers the freedom to do better customized integration. They are an enabling technology that allows a company to differentiate how it configures systems and the flow of information. This is the point where differentiation occurs.
For this reason, most development tools have been taken over by open source. No company outcompetes another by developing its own data integration software any more than it would by building its own general ledger. With development tools, everyone wins by pooling their collective resources. The part they keep to themselves is what gets done with those tools because that's where the value is.
Data integration software is squarely in the crosshairs of open source vendors and venture capitalists because it fits the same profile as compilers, languages, and other development tools. In all of these cases the shared development and distribution model removed cost, improved tool quality, and benefited everyone.
There are two open source models for sharing development and distribution costs. One is project-based or community-based open source. The other is commercial open source software, or COSS for short.
Most people are familiar with project-based open source. This model typically involves some sort of nonprofit foundation or corporation to own the copyright, and people contribute their efforts to development and maintenance. They may even be full-time employees, but the project does not operate in the same way a traditional software company does.
Commercial open source evolved with recognition that companies are willing to pay for support, service, and other less tangible items like indemnification or certifying interoperability. A commercial open source vendor operates just like a traditional software vendor, except that the source code is not shrouded in secrecy. This enables more and deeper interaction between the community of customers and developers, making the open source model more user-focused than the traditional model.
In contrast to the majority of projects, commercial open source vendors employ most of the core developers for their project and expect to make a profit while doing so. They provide the same services and support that traditional vendors do, and frequently with more flexibility and lower cost. Some COSS vendors borrow elements of the proprietary vendors, like building non-open-source add-on components or features that can be purchased in place of, or in addition to, the free open source version of the software.
The difference between COSS vendors and traditional vendors is as much about business practices as it is about the code. Proprietary vendors can't open the doors and invite bug fixes, design suggestions or feature additions, nor should they. The key force driving many open source projects is not innovative intellectual property, but the commodity nature of development software.
Studies on open source adoption have noted that other benefits can outweigh the cost advantages of open source. Not all projects are justified based on financial benefit. While advantages may be translated into financial terms, value can come from solving a particular problem sooner, enabling work that was previously not possible, or providing efficiency that allows people to be deployed to other tasks.
According to several market surveys over the past few years of companies adopting open source, the following three benefits rise to the top of the list.
Flexibility
The challenge with flexibility is defining it. Respondents use this term to mean a number of different elements of flexibility. These three are the most frequently mentioned:
Evaluation. Organizations can try open source tools at their own pace according to their own timeline. Some companies evaluate all tools in a proof of concept and allot the same amount of time to each. Others try open source first and run extended trials which evolve into prototypes or production use. Unlike traditional software, there are no non-disclosure agreements or trial licenses that limit the duration or extent of use, nor is there a presales consultant breathing down the neck of the evaluator.
Deployment. As noted above, a successful trial installation can be easily put into production. There are few, if any, limitations regarding deployment. For example, one firm may choose to centralize data integration while another chooses to distribute it closer to applications. Scaling up by adding servers is not usually limited with open source the way it is with traditional software models where more licenses must be purchased. The unbundling of license, support, and service means decisions about these items can be made later.
Adaptability. Open source tools may be used in unrestricted ways, for purposes that the project might never have intended. Buying a six-figure ETL tool for a small integration problem or a one-time migration is overkill, but an open source ETL tool can be easily adapted for use. Another side of adaptability is customization. Most companies will rarely, if ever, look at a project's source code. It's still nice to know that the software can be tailored to fit a situation if the need arises, for example with the addition of customized connectors.
Vendor Independence
A benefit of open source mentioned by many customers is vendor independence. There are two aspects to vendor dependence. One is being beholden to a given vendor for the use and support of the software. The other is the problem of technology lock-in.
The open source license is the key difference for open source software. Even if a customer contracts with a COSS vendor in order to get support or other services, there is no requirement to continue with that vendor. This opens up the possibility of using third parties for the same services, or forgoing those services but continuing to use the software.
The problem of technology lock-in is much less likely to happen with open source software. Open source projects tend to adhere to open standards. There is more motivation to use existing open standards and reuse other open source code than to try to create new standards. The fact that the code is visible to everyone is an additional incentive to the developers to write better code. Studies have shown that many open source projects have lower defect rates than comparable proprietary offerings.
Proprietary vendors sometimes avoid open standards because proprietary standards ensure control over their working environment and the customer base. Some products are closely tied to vendor technology stacks, an obvious example being database-supplied ETL tools. Open source tools are much less likely to be tied to a specific platform or technology stack, partly because of how the software is developed and partly due to the diversity of the developer and user communities who quickly port useful code to their platform of choice.
Optimal Price
It's important to distinguish between cost savings and paying the right price. While the open source production and distribution model has cost advantages that translate directly into lower license price, this does not guarantee that a company will save money by using open source.
It's hard to justify even the lowest cost tools for a system migration because they become shelfware at the end of the project.
More important than trying to evaluate cost savings is looking at paying the right price at the right time. Open source gives a company the option to pay nothing, pay incrementally, or pay up front. The choice depends on factors like budget for initial project startup, how important support is during development and anticipated growth once in production.
The greatest savings opportunities will come from new projects where the high cost of data integration tools favors open source. Startup costs for a project using proprietary DI tools can be exceptionally high and you can't defer purchase or support costs with traditional software.
The next biggest savings comes when scaling for growth. As the number of servers, data sources and targets, or CPUs grows, the license cost of proprietary tools keeps pace. Scaling up can quickly become cost-prohibitive.
Open source is delivered in ways that allow for low-cost or even zero-cost scaling. Some COSS vendors charge for support based on fixed attributes or simple subscription pricing. Others charge per developer rather than on a per-server basis.
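The difference between the two pricing shapes can be illustrated with a short Python sketch. Every figure used here is a made-up placeholder chosen only to show how the totals behave as servers are added; none of them are prices from the paper or from any vendor, and the function names are hypothetical.

```python
def per_server_cost(servers: int, license_per_server: int) -> int:
    """Total cost when every deployed server needs its own license."""
    return servers * license_per_server


def per_developer_cost(developers: int, subscription_per_developer: int) -> int:
    """Total cost when pricing depends on developers, not on servers."""
    return developers * subscription_per_developer


def cost_table(server_counts, developers, license_per_server, subscription_per_developer):
    """Return (servers, per-server total, per-developer total) for each server count."""
    flat = per_developer_cost(developers, subscription_per_developer)
    return [(n, per_server_cost(n, license_per_server), flat) for n in server_counts]


if __name__ == "__main__":
    # Placeholder inputs for illustration only, not real prices.
    for row in cost_table([1, 5, 20], developers=4,
                          license_per_server=50_000,
                          subscription_per_developer=5_000):
        print(row)
```

Under these assumptions the per-server total grows linearly with the number of deployed servers while the per-developer total stays flat, which is the scaling behavior the paragraph above describes.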
For operational data integration, this translates into a significant advantage for open source. Most operational DI software is distributed across the enterprise, not in a few centralized servers. This poses serious cost obstacles for proprietary vendors.
The cost benefits of open source tools can be even higher for data consolidation tasks where the challenge is to justify the purchase of a tool that will be used once. Traditional enterprise data integration tools are not priced for one-time use, putting them out of reach for most projects.
It's hard for a manager to justify even the lowest-cost tools because at the end of the project they become shelfware. For cases like this when expensive mainstream IT software is out of reach, open source can save the day.
Recommendations
The way organizations plan and budget for data integration is not going to change any time soon. Most operational data integration will continue to be paid for as part of individual projects, continuing the largely ad hoc DI infrastructure. This means the single high-cost enterprise licensing model for new operational DI tools isn't likely to fit most IT organizations.
Operational DI is not the same as ETL or analytic DI. Keep this in mind when evaluating tools.
IT managers and developers need a way to make the integration job easier, repeatable and more productive. Open source is one way to accomplish these goals. People responsible for selecting and maintaining tools for data integration can benefit from the following guidelines.
Differentiate between analytic data integration and operational data integration. Business intelligence environments have specific needs like large batch volumes, many-to-one consolidation and specialized table constructs. While applicable to consolidation projects, ETL tools designed for the data warehouse market won't provide a complete set of features for operational data integration.
Discourage hand-coded data integration. There are many different tools which can be used to solve data integration problems, and newer tools specifically designed for operational data integration. Encourage developers on application development and package implementation projects to look at these tools. The benefits over manual coding are obvious.
Use the right data integration model for the problem. Determine whether the integration problem requires consolidation, federation or propagation. Each of these is different in both approach and required tools or features. Select the technology that best fits with the approach to avoid mismatches that will lead to problems during implementation.
Make open source the default option for data integration tools. When in an environment with few or no tools, open source should be the first alternative. It is the simplest, fastest and likely the least expensive route to solve the problem. It's the logical next step after manual coding. Look to proprietary tools only when open source tools can't do the job, or when you have them in-house already and the licensing issues are not an obstruction.
Augment existing data integration infrastructure with open source. There will be many cases where it is not effective to extend current data integration tools to a new project. This may be due to lack of specific features, poor fit with the application architecture, or extended cost due to licensing or the need for additional components. Many proprietary data integration tools will charge extra for options like application connectors, data profiling or data cleansing. In these cases, open source can be used to augment the existing infrastructure.
About the Author
MARK MADSEN is president of Third Nature, a consulting and technology research firm focused on information management. Mark is an award-winning architect and former CTO whose work has been featured in numerous industry publications. He is an international speaker, a contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://ThirdNature.net.
About the Sponsor
Talend is the recognized market leader in open source data integration. Hundreds of paying customers around the globe use the Talend Integration Suite of products and services to optimize the costs of data integration, ETL and data quality. With over 3.3 million lifetime downloads and 700,000 core product downloads, Talend's solutions are the most widely used and deployed data integration solutions in the world. The company has major offices in North America, Europe and Asia, and a global network of technical and services partners. For more information and to download Talend's products, please visit http://www.talend.com.
About Third Nature
Third Nature is a research and consulting firm focused on new practices and emerging technology for business intelligence, data integration and information management. Our goal is to help companies learn how to take advantage of new information-driven management practices and applications. We offer consulting, education and research services to support business and IT organizations as well as technology vendors.