Email Archiving Systems Interoperability
Citation Simpson, Joel. 2016. Email Archiving Systems Interoperability. Harvard Library Report.
Accessed May 13, 2018 2:54:24 PM EDT
Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:28682572
Terms of Use This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Harvard Library Report, July 2016
Prepared by Joel Simpson
The Harvard Library Report Email Archiving Stewardship Tools Workshop is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0) <https://creativecommons.org/licenses/by/4.0/>
Prepared by Joel Simpson, Artefactual Systems, Inc.
Reviewed by Wendy Marcus Gogel, Harvard Library and Grainne Reilly, Library Technology Services, Harvard University
Citation: Simpson, Joel. 2016. Email Archiving Systems Interoperability. Harvard Library Report. http://nrs.harvard.edu/urn-3:HUL.InstRepos:28682572.
Table of Contents
Executive Summary
Background and Context
Project Objectives
Project Approach
Project Results
1. Assessment of the Email Tools Data Sharing Framework
2. Analysis Framework: Requirements for Interoperability
3. Analysis of Tools using the Requirements for Interoperability Framework
4. Key Findings: Analysis of Tools and Email Tools Data Sharing Framework
5. Opportunities to Improve the Interoperability of Email Tools
Acknowledgements
Executive Summary
Earlier this year, Harvard Library convened the Harvard EAST (Email Archiving Stewardship Tools) workshop to foster the expanding email archiving community, share best practices and identify directions for future work.
One of the main conclusions of the workshop was that there is no standard workflow that can be uniformly applied in every situation, but that all archives have similar functional needs for email archiving, and that given the need for flexibility, current processes could be improved by using the unique strengths of different tools together.
Harvard Library engaged Artefactual Systems Inc. to better understand how the tools can exchange data today and carry out analysis to identify opportunities for the community to further support comprehensive preservation workflows for email.
Community members have been invited to contribute to an Email Tools Data Sharing Framework. The intention is to provide a high-level view of how email content or metadata can be input or output to each of the different tools, using a common framework to support comparison and analysis. This work is ongoing, but enough detail has been collected to enable analysis and identification of some clear opportunities for improving the interoperability of these tools.
A set of “requirements for interoperability” was identified to set out the different aspects or concerns involved in using multiple tools in an email archiving, processing or preservation workflow. Analysis was carried out to understand how each of the tools supports these different requirements. Key findings were then identified in each of these areas.
Finally, a set of 7 draft recommendations has been proposed for the wider community to consider. These are high-level recommendations without detailed next steps or any suggestion for priority. We feel they are useful in decomposing this complex problem space into discrete and well-defined opportunities that will be easier to tackle in a fast-changing environment.
Background and Context
Earlier this year, Harvard Library convened the Harvard EAST (Email Archiving Stewardship Tools) workshop to foster the expanding email archiving community, share best practices and identify directions for future work. The workshop involved stakeholders from different institutions, including subject matter experts, users and developers of several email archiving or preservation tools.
The workshop concluded that the community is very interested in working together to solve shared problems. Several directions for future work were identified, including “the need for an exchange standard that enables interoperable ways to extract, package and transfer data between tools”. This conclusion was based on the consensus that there is no one uniform workflow for email archiving, but that current processes could be improved if archives were able to harness the unique strengths of each tool selectively (using only the functionality needed in whatever order is needed).
Harvard Library Preservation Services engaged Artefactual Systems Inc. to carry out a short consulting project to build on these findings and identify opportunities for the community to further support comprehensive preservation workflows for email.
Project Objectives
The goals of this consulting project are to:
1. identify gaps or opportunities to improve the interoperability of the numerous email tools by showing the type, format and structure of data which can be input or output from each tool
2. inform email stewards about the options and considerations involved in defining email archiving workflows using multiple tools
This project has not attempted to provide a functional description or comparison of the various tools under consideration. A very brief overview of the tools, with links for further detailed information available from the providers, is provided below in section 3. A useful comparison of Email Archiving tools (including many not considered in this project) can be found at the Lifecycle Tools for Archival Email Chart: https://docs.google.com/spreadsheets/d/1V1N22xnr5e0EbDlZWx58bjYO6rkrMrYH9wGX9-CK8c4/edit#gid=986222267.
Project Approach
This project is producing two deliverables to meet the objectives defined above.
The first deliverable is an Email Tools Data Sharing Framework that sets out the content objects (i.e. email) and metadata that each email or preservation tool can input or output. Representatives from each tool provider were asked to complete the descriptions of these inputs and outputs using a generic framework (with associated glossary) to enable common understanding of terms and make comparison between tools easier.
A more detailed description and assessment of the framework is provided below in section 1 of the Project Results.
The second deliverable of this project is this Consulting Report which
1. assesses the completion and usefulness of the Email Tools Data Sharing Framework
2. proposes a generic set of requirements for interoperability to use as an analysis framework
3. analyzes/summarizes how each tool satisfies those requirements for interoperability
4. sets out several recommendations for improving interoperability of the tools and further establishing best practices for the community
Please note that throughout this report when we refer to ‘digital objects’ we mean any type of digital objects, including emails themselves, related content like attachments, or any associated metadata. We use ‘data’ interchangeably with ‘digital objects’ simply because it is shorter. (We have not seen the need to distinguish these concepts with more precise definitions.)
Project Results
1. Assessment of the Email Tools Data Sharing Framework
1.1. About the Email Tools Data Sharing Framework
The email tools data sharing framework includes information on 6 different email or preservation tools. The intention is to provide a high-level view of how email content or metadata can be input or output to each of the different tools.
The framework is set out in a spreadsheet, with one sheet to describe inputs and another to describe outputs. Each sheet is organized to first describe the actual (or "physical") data objects (or input/output mechanisms, as in some cases they are programmatic), followed by a description of the kinds of data or metadata found in those objects.
Separate rows distinguish between the level of obligation demanded to be able to use each tool:
● mandatory content or data (system will not accept or work properly without this)
● useful content or data (is optional, but enables functionality within the system - e.g. a sensitivity flag that can be used when filtering)
● additional content or data (can be consumed, but is not used in any way by consuming system -- e.g. attachments are included in MBOX, but the particular system may not allow users to do anything with them)
The goal is to describe in each of these columns:
● the type or extent of data provided (e.g. specific fields used as reference IDs, or a more general description such as 'preservation events')
● format of data (is a 'local' schema defined, or is a standard schema used, such as PREMIS)
● location/structure of data (where in the input/output is this information -- e.g. PREMIS events are recorded in METS.xml file; folder information stored in pathname in MBOX etc.)
In some cases this information needs to be broken down into different levels of granularity, for instance to indicate information stored at individual email level vs. collection level.
1.2. Assessment of the Email Tools Data Sharing Framework
At the time of this writing, completion of the spreadsheet is in progress. We invite comments or thoughts from all participants on:
● ability to complete the spreadsheet consistently (or key differences in interpretation)
● anything learned while filling it in
● whether it is complete enough, or needs further work; wishlist additions/amendments (e.g. suggestions for adding more detail)
● initial views on value of the exercise
● intent to use the tool moving forward
Data gathering work is ongoing and will be refined as needed by the community to support their collaborative efforts to improve these tools and establish best practices for email archiving and preservation.
Initial feedback and observations from Artefactual:
● It is interesting to see this particular perspective from the different tools, and enables interesting analysis of similarities and differences (which will be explored further in the rest of this report).
● The spreadsheet emphasizes two dimensions (data types in columns and systems in rows), but there are in fact numerous dimensions of interest (including granularity of grouping of data, levels of obligation, type of data vs. formats or standards employed, etc.). This makes fitting in all of the relevant information a challenge.
● Given the space, it does not seem possible to include enough detailed information for this to be a very hands on ‘how to’ tool -- but it may well be a useful analytic or decision support tool, to determine if there is enough compatibility between a particular selection of tools for a desired workflow.
2. Analysis Framework: Requirements for Interoperability
The data sharing framework is primarily focused on the inputs and outputs of each of the tools under consideration. Given the broader intent to enable email stewards to determine whether and how they might craft workflows using multiple tools, this report proposes a set of generic ‘requirements for interoperability’. This provides a more holistic view of the different aspects of using multiple tools that operate together to enable a comprehensive workflow for email processing or preservation.
These requirements are more an analytical framework than a concrete set of requirements. They are focused on the level of business processes and workflows, and do not represent a particular effort to elicit requirements from end users.
The requirements and their rationale are described below. In the following section, each of the 6 tools is assessed against each requirement. This allows us to compare similarities and differences in specific areas of concern and use this as the basis for recommendations for future work later in the report.
2.1. Support for data transmission
The most basic requirement for a workflow that uses multiple tools working on a common set of data is to enable those tools to access that data.
This functionality can be provided in many forms: user interfaces for selection of data for ingest from a particular location; automated jobs that ingest data; direct system to system connectivity; or published APIs. The goal here is simply to articulate how each system supports this, rather than to judge one method over another. This will allow us to see which tools can share data (and how), at a physical level, with other tools.
2.2. Support for standard data formats
Once we have determined a particular tool can access a set of data physically, we need to ensure it can interpret and process that data. At a minimum, the data format must be ‘standard’ between the tools being considered.
It is well established in the preservation community that open, non-proprietary and widely used standards are preferable for preservation formats. While not all data to be exchanged needs to be (or even can be) in a preservation format, the same principles will improve the odds that any particular tool will be interoperable with others.
Support for standard data formats applies to email content, metadata and the packaging of both email and metadata.
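As a rough illustration of why an open, widely supported format helps, the sketch below reads an MBOX file using only the `mailbox` module from the Python standard library and lists a few header fields per message. It is not taken from any of the tools assessed in this report; the function name `summarize_mbox` and the choice of fields are assumptions made for the example.

```python
import mailbox


def summarize_mbox(path):
    """Return one dict of basic header metadata per message in an MBOX file.

    Illustrative only: the selected fields are an assumption, not a schema
    used by any tool discussed in this report.
    """
    box = mailbox.mbox(str(path), create=False)
    try:
        return [
            {
                "message_id": msg.get("Message-ID"),
                "from": msg.get("From"),
                "subject": msg.get("Subject"),
                # Count MIME parts that carry a filename, i.e. likely attachments.
                "attachments": sum(
                    1 for part in msg.walk() if part.get_filename()
                ),
            }
            for msg in box
        ]
    finally:
        box.close()
```

Because MBOX is documented and widely implemented, a general-purpose library can read it with no tool-specific code; a locally defined format would need its own reader before any other tool could consume the same data.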
2.3. Support for appropriate scope of exchangeable data
Email content and metadata can exist or be grouped at various levels of granularity. Different processing tools may accept data with an entirely arbitrary definition of scope (using a generic term such as a ‘transfer’ or ‘packet’), or they may require data or metadata to conform to a specific definition (such as clearly grouping data by ‘account’).
Scope of data also refers to the type and extent of data in any particular dataset. For example, Archivematica has functionality to verify hashes/checksums; if checksums have been created in another tool (e.g. BitCurator), then ideally Archivematica should allow checksums to be imported so that verification can occur on those checksums, not just on checksums created by Archivematica. This concept is clearly tied closely with the level of granularity - a checksum may be made for a folder or collection of emails, or it may be created at the individual email level.
Email stewards will need to understand what scope of data is required or possible using any particular tool. Similarly any decision to use a particular data standard needs to consider the scope of data that format allows for or requires.
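The point about granularity can be made concrete with a small sketch. Assuming an MBOX file as input, the hypothetical function below computes one SHA-256 digest for the whole file (collection level) and one per message (item level). The function name and the choice of SHA-256 are illustrative assumptions; individual tools may use different algorithms and different units.

```python
import hashlib
import mailbox


def checksums_by_granularity(mbox_path):
    """Compute SHA-256 at two levels: the whole MBOX file and each message.

    Returns (collection_digest, {message_key: message_digest}).
    """
    # Collection level: one digest over the bytes of the container file.
    with open(mbox_path, "rb") as handle:
        collection_digest = hashlib.sha256(handle.read()).hexdigest()

    # Item level: one digest per message, over the raw message bytes.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        message_digests = {
            key: hashlib.sha256(box.get_bytes(key)).hexdigest()
            for key in box.keys()
        }
    finally:
        box.close()
    return collection_digest, message_digests
```

A tool that records only the collection-level digest cannot verify a single message extracted by another tool, and a tool that records only item-level digests cannot confirm that the container file arrived intact. This is why the scope of exchanged checksums needs to be agreed before tools are chained together.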
2.4. Ability to track processing history and provenance
The ability to establish and maintain the provenance (including processing history) of content is a well understood requirement in the archival and preservation communities. While this may not be a requirement for everyone looking to process emails, it is a fundamental requirement for the core user groups of many of the 6 tools we are evaluating.
Email stewards who do need to record and capture provenance will generally need a mechanism to do this whenever they are processing, creating or changing data. This means that either the tools they use for processing need to capture processing history directly, or they need some ability to track processing history manually and store it appropriately.
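Where a tool does not capture processing history itself, a steward could keep a simple log alongside the data. The sketch below appends one JSON record per event to a text file. The field names are loosely modelled on the kind of information a PREMIS event carries (what happened, to which object, when, and by which agent), but this is not a PREMIS serialization, and every name in it is an assumption made for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def record_event(log_path, object_id, event_type, tool, detail=""):
    """Append one processing event to a JSON Lines log and return it."""
    event = {
        "event_id": str(uuid.uuid4()),
        "object_id": object_id,          # identifier of the processed object
        "event_type": event_type,        # e.g. "format conversion"
        "event_datetime": datetime.now(timezone.utc).isoformat(),
        "agent": tool,                   # the tool or person responsible
        "detail": detail,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")
    return event


def read_history(log_path, object_id):
    """Return all logged events for one object, in the order recorded."""
    with open(log_path, encoding="utf-8") as log:
        events = [json.loads(line) for line in log if line.strip()]
    return [e for e in events if e["object_id"] == object_id]
```

Keeping the object identifier in every record is what allows histories written at different stages of a workflow to be read side by side later, even when no tool merges them automatically.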
2.5. Support for maintaining the identity and integrity of data
As data is moved, migrated or processed by different tools, email stewards need to be able to ensure that the identity and integrity of the data they are processing is not compromised.
Maintaining the identity of the dataset depends in large part upon using identifiers to link it to its descriptive and administrative metadata, and ensuring that this link cannot be broken. Most tools generate unique identifiers, but these are usually local (assigned, stored and maintained within the tool itself). External identifiers may be supported, either informally (e.g. by recording an accession number as part of a directory structure or filename) or more formally (as in having a field with a declared data type that aligns to the identifier used by another system). Some systems also support identifiers that refer explicitly to external resources or authorities (a concept underpinning linked data).
Maintaining the integrity of digital objects is often achieved using hashes or checksums, with regular verification, to ensure that the content of the ingested data has not been altered over time. The hashes or checksums can be assigned to both the original ingested content and to any normalized or otherwise modified versions that may be generated from that content. Hashes or checksums may also be assigned to associated metadata.
Another common practice to safeguard the integrity of data is to package content and metadata ‘together’ for transfer, reducing the risk of corruption or loss (i.e. links between the two breaking at some point).
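The two practices described above, checksums with later verification and packaging content together with its metadata, can be sketched in a few lines. The example writes a manifest of SHA-256 digests for every file in a package directory, using the `checksum  relative/path` line layout familiar from BagIt payload manifests, and later reports any file that is missing or has changed. It is a simplified illustration under those assumptions, not a BagIt implementation, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def write_manifest(package_dir, manifest_name="manifest-sha256.txt"):
    """Record a SHA-256 digest for every file in the package directory."""
    package = Path(package_dir)
    lines = []
    for path in sorted(p for p in package.rglob("*") if p.is_file()):
        if path.name == manifest_name:
            continue  # the manifest does not describe itself
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(package).as_posix()}")
    (package / manifest_name).write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(package_dir, manifest_name="manifest-sha256.txt"):
    """Return the relative paths that are missing or no longer match."""
    package = Path(package_dir)
    failures = []
    manifest = (package / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = package / rel_path
        if (not target.is_file()
                or hashlib.sha256(target.read_bytes()).hexdigest() != expected):
            failures.append(rel_path)
    return failures
```

Because the manifest travels inside the package and covers metadata files as well as email content, a receiving tool can detect both altered content and a broken link between content and metadata before processing begins.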
2.6. System access and documentation to support interoperability
A basic requirement is the ability to access and use the software, both technically and with appropriate permissions or licensing.
All of the capabilities mentioned above are less useful in practice if knowledge to use them is not captured well. Technical and user documentation, training materials and training resources (i.e. trainers for hire) all add to the ability to use the tool as part of an integrated workflow. The starting minimum is documentation on how to use the tool at all. Ideally a knowledge base would address the exchange of data, interoperability with other systems and any license requirements.
3. Analysis of Tools using the Requirements for Interoperability Framework
3.1. Archivematica
Archivematica is an integrated suite of open-source software tools that allows users to process digital objects from ingest to access and to implement preservation plans. Users monitor and control ingest and preservation micro-services via a web-based dashboard. Archivematica uses METS, PREMIS, Dublin Core, the Library of Congress BagIt specification and other recognized standards to generate Archival Information Packages (AIPs) for storage in external repositories.
Requirement | Supporting Functionality | Observations
Support for data transmission
Digital objects need to reside in a locally accessible file system for ingest. Archivematica is provided with an accompanying application called Storage Services that can be used to configure access to sources of data for ingest. There is an API to assign accession numbers, but no direct support for moving data across hardware, networks etc.
There are numerous external tools available for moving data.
Support for standard formats
Any digital object can be ingested, so any email format can be processed with core functionality. Email input in MBOX format can be processed using additional functionality (extracting attachments and metadata). Email input in maildir can be normalized and output as MBOX. The BagIt file packaging standard is supported for input and output. Metadata input in csv or json formats can be processed. Additional metadata (in other formats) can be included but not processed. Metadata outputs are well supported by widely adopted standards (METS, Dublin Core, PREMIS, Bag)
No support to normalize to EML format (widely used email format).
Support for appropriate scope of data
Transfer, Submission, Archival and Dissemination packages can be structured and described using any definition the user chooses. For example, an email account or accounts can be ingested as one or more SIPs, and multiple SIPs can be combined into one or more AIPs. Some key metadata, such as rights metadata, can only be input or assigned during processing at the package level.
Provides complete flexibility but no native support for common email groupings (e.g. account, folder etc.). Rights metadata can’t be assigned to individual emails, so users would have to manually structure inputs and outputs to reflect different rights (e.g. create one AIP or DIP for restricted emails, and one for non-restricted emails).
Ability to track processing history and provenance
Provides extensive functionality to track processing history and record using PREMIS. Processing history from external sources could “travel with” any datasets, but currently no ability to merge or consolidate processing history from multiple systems.
Email stewards could create manual processes to maintain multiple processing history files.
Support for maintaining the identity and integrity of data
Archivematica assigns UUIDs to all ingested objects and uses the UUIDs and ID attributes in the METS files to maintain links between digital objects and their metadata. Archivematica also supports a wide range of external metadata, so there are several ways external identifiers (i.e. from other tools) can be maintained. However there is no direct support for typed/declared external identifiers (e.g. automatically adding identifiers when importing from an external system). Fixity verification is supported using either internally or externally created hashes.
Email stewards could create manual processes for aligning and maintaining referential integrity across systems (but may need to plan this - e.g. aligning package structure to external identification systems)
System Access and Documentation
Documentation available, community support website/groups, as well as for hire services for consultancy, training etc. Source code and technical info available on GitHub. Documentation can be quite technical.
3.2. ArchivesSpace
ArchivesSpace is an open source, web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities (agents and subjects) and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. The application also functions as a metadata authoring tool, enabling the generation of EAD, MARCXML, MODS, Dublin Core, and METS formatted data.
(summary taken from: https://archivesspace.atlassian.net/wiki/display/ADC/ArchivesSpace)
ArchivesSpace is not a digital asset or document management system and cannot manage digital files or digitization workflows. The digital objects module can be used to describe digital objects and link to digital files stored elsewhere. The metadata created can be exported to other systems as MODS, METS, or Dublin Core or made publicly accessible through the built-in public interface, though the viewers in the public interface are more limited in their functionality than those of a digital asset management system or digital repository.
(detail on digital objects taken from FAQ: http://www.archivesspace.org/faq)
Requirement | Supporting Functionality | Observations
Support for data transmission
ArchivesSpace does not provide a means of moving or storing email content. Metadata can be exchanged as files or through a set of APIs.
Support for standard formats
ArchivesSpace supports a range of well established standards for describing archival records - EAD, MARCXML, MODS, Dublin Core, and METS formatted data. ArchivesSpace does not support functionality or processing of email content (i.e. normalisation, search or identification of authorities etc.)
Support for appropriate scope of data
ArchivesSpace provides functionality for describing the arrangement and relationships of digital objects. It does not support email specific concepts directly (e.g. the notion of an email account)
It could be useful to establish conventions or best practices for describing email accounts and their potential relationships to collections, agents etc.
Ability to track processing history and provenance
Support for maintaining the identity and integrity of data
Support for identifiers and integrity internally within a repository. The system supports structured capture of agents and subjects which will improve consistency and accuracy of description
System Access and Documentation
ArchivesSpace is an open source project with considerable documentation available. It is supported by the Lyrasis organisation with full time staff who are developers and subject matter experts.
3.3. BitCurator
The BitCurator Environment is built on a stack of free and open source digital forensics tools and associated software libraries, modified and packaged for increased accessibility and functionality for collecting institutions. The BitCurator software is freely distributed under an open source license. It can be installed as a Linux environment; run as a virtual machine on top of most contemporary operating systems; or run as individual software tools, packages, support scripts, and documentation.
Key features of BitCurator include:
● Pre-imaging data triage
● Forensic disk imaging
● File system analysis and reporting
● Identification of private and individually identifying information
● Export of technical and other metadata
(summary taken from: http://www.bitcurator.net/bitcurator/)
Requirement | Supporting Functionality | Observations
Support for data transmission
BitCurator does provide support for migrating data without altering it in any way, starting with the concept of creating forensic images before further transmitting or processing data. Uniquely among the tools considered here, BitCurator provides software write-blocking functionality to ensure the integrity of source objects.
As this is an area not well supported by other tools, it could use some elaboration/detail.
Support for standard formats
Supports DFXML (Digital Forensics XML) that enables the exchange of structured forensic information. BitCurator generates PREMIS metadata when the user runs several of its core data forensics tools, providing a record of key processing events. Provides some processing support for email - e.g. using readpst to convert PST email objects into MBOX. Also supports BAG format for output.
Support for appropriate scope of data
The BitCurator environment includes numerous applications to be used for different purposes, to be run against individual items or collections of terms. One of the most commonly used tools is bulk_extractor, which can be used to identify potentially sensitive information on disks, disk images or directories. Other core tools, including fiwalk and other specialized reporting tools, are designed to be run against entire disk images. When run against a disk or disk image, bulk_extractor reports on the location of patterns based on a byte offset into the disk. Other reporting tools, including fiwalk, generate metadata based on the file system (files and folders). In the case of email, the files would likely be in formats such as .pst or mbox. Those wishing to generate metadata associated with specific messages within those container files could use readpst and pipe its output to other command-line tools. BitCurator is primarily concerned with identification and description of digital objects rather than arrangement.
Ability to track processing history and provenance
BitCurator generates PREMIS metadata when the user runs several of its core data forensics tools, providing a record of key processing events.
Email stewards could create manual processes to maintain multiple processing history files.
Support for maintaining the identity and integrity of data
BitCurator provides support for indexing, characterizing and uniquely identifying all content on a disk or disk image. BitCurator supports creation and validation of hashes/checksums.
System Access and Documentation
BitCurator is an open source project with considerable documentation available.
3.4. DArcMail
DArcMail (for Digital Archive Mail System) was created by the Smithsonian Institution Archives. DArcMail provides normalization, item level and bulk processing, intellectual arrangement, search capability, packaging and access functionality for email.
Requirement | Supporting Functionality | Observations
Support for data transmission
Digital objects need to reside in an accessible file system for ingest.
Support for standard formats
Email input requires MBOX as the original format or as an interim normalization format. Email input in MBOX format can be processed with all core functionality including exporting preserved emails, email collections or email accounts in the EMail Account XML (EMA). EMA is a comprehensive XML schema designed for RFC 5322 compliant preservation purposes applied to the full range of email objects, i.e., single message to whole email account. All elements of the original email are retained in the preservation EMA XML output. User-defined subsets of email messages can be created and exported in MBOX or EMA XML formats.
No support to normalize to EML. The EMA XML schema is not widely adopted. It is fully implemented in two other email archiving tools, and in limited fashion in a couple of other applications.
Support for appropriate scope of data
DArcMail allows users to interact with emails on an individual, group or account basis. Complex searching, filtering and message thread tracking. Attachments can be searched, viewed and separated from email.
Ability to track processing history and provenance
The DArcMail tool is designed to be used for initial appraisal and then for preservation (AIP) and access (DIP). It natively retains the logical arrangement of the original account in both the AIP and DIP packages. Its flexibility allows for creation of custom subsets of email for creation of specialized AIPs and DIPs.
Transfer and accessioning of email digital objects occur outside of the DArcMail workflow. Non-technical metadata such as rights metadata must be captured and maintained in a separate system or manually.
Support for maintaining the identity and integrity of data
DArcMail maintains all UIDs present in the original emails. It generates SHA-1 checksums for each message and for email accounts as a whole which are embedded in the EMA preservation format. DArcMail also produces external metadata including the checksum for each message preserved.
The internal message and account checksums are retained even if the preserved email account is moved from one repository to another.
System Access and Documentation
DArcMail is not currently available outside of the Smithsonian. Limited documentation is publicly available. The Smithsonian intends to release it as open source when time/effort allows.
Making the tool publicly available is a precondition for any other community users.
3.5. Electronic Archiving System (EAS)
Harvard developed the EAS tool to enable archival processing of email messages and attachments and automate the process of making deposits to Harvard's preservation repository. Key features include:
● Normalization to EML -- an open standard for preservation (an extension of IMF RFC 5322) -- for long term preservation.
● Summary views of the metadata associated with email or attachments within a result set.
● Batch and item level processing options for archivists.
● Long term preservation of email and attachments in a secure environment approved for sensitive data is supported by automated packaging and transfer to the preservation repository, the Digital Repository Service (DRS).
● Capture of essential rights management information using PREMIS.
● Capture of significant events tracking to document deletions of email and attachments and format transformations such as the conversion of the native mail format to EML.
(feature list taken from: http://hul.harvard.edu/ois/systems/eas/)
Requirement | Supporting Functionality | Observations
Support for data transmission
Data need to be moved to a ‘dropbox’ (directory space in Harvard systems). EAS documentation describes how to use a secure FTP client to move the data but this is not part of the EAS solution.
There are numerous external tools available for moving data.
Support for standard formats
Email content can be input in MBOX or PST format (which covers the majority of email client standards for output of email). Attachment objects of any type (e.g. .ppt, .doc) can be embedded in the emails or provided separately. It is not possible to input metadata (beyond that contained directly in MBOX/PST or attachment formats). Email is output to EML format, with attachments extracted. Overall metadata is captured and output using well established standard formats (e.g. METS and MODS) and both rights and processing history are captured in PREMIS. Some reference metadata is in local format defined by Harvard (for packets, collections etc.), as is metadata relating to security (access) and sensitivity (using locally defined ‘flags’).
Email content formats well supported. While EML format for output is a well established standard it is not accepted by all other tools for input. Security and sensitivity metadata could potentially be captured using a more widely used standard. Referencing metadata geared towards Harvard integration with DRS system. May not be any need to standardize this, but support for external IDs would enable better interoperability with other tools.
Support for appropriate scope of data
Submission packets can be structured and described using any definition the user chooses. It is not possible to input additional metadata or content beyond email/attachments. Processing work can be completed at individual item level (email or attachment) or at various levels of grouping (folder, collection etc.). Additional groupings can be added (collections or series). Outputs will always contain the same packet structure as the associated input. Output contains normalized/processed content; does NOT contain original input files (i.e. in MBOX or PST format)
Provides support for grouping (in collections etc.). Inability to input additional metadata or content suggests this tool may work best at ‘start’ of a workflow. Stewards will need to think through manual processes for managing metadata created using other tools.
Ability to track processing history and provenance
Provides functionality to track processing history and record using PREMIS. No ability to merge processing history with that from other tools.
Email stewards could create manual processes to maintain multiple processing history files.
Support for maintaining the identity and integrity of data
Identifiers are internal (e.g. EAS message ID) or local to Harvard (e.g. DRS codes are for Harvard repository). Integration with ‘Wordshack’ application ensures some descriptive or identification information is based on controlled vocabularies used in Harvard (i.e. also integrated with Harvard DRS repository). This improves consistency in use of admin categories and topics, and improves identification quality for persons or organisations.
Support for external referencing systems would better enable multi-tool workflows. Use of controlled vocabularies limited to Harvard currently - could be several approaches to extend this - e.g. publishing those vocabularies as open data, or enabling use/integration of other (e.g. linked open data) vocabularies as alternatives
System Access and Documentation
User documentation available and support for Harvard users. System is not currently available beyond Harvard users.
A project has been proposed to release system as Open Source project; but some technical work required to make ready for more generic use.
3.6. ePADD
ePADD is a software package developed by Stanford University's Special Collections and University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives. The user guide (https://docs.google.com/document/d/1joUmI8yZEOnFzuWaVN1A5gAEA8UawC-UnKycdcuG5Xc/edit#) provides the following description of the major modules in the system:
Appraisal: Allows donors, dealers, and curators to easily gather and review email archives prior to transferring those files to an archival repository.
Processing: Provides archivists with the means to arrange and describe email archives.
Discovery: Provides the tools for repositories to remotely share a redacted view of email archives with users through a web server discovery environment.
Delivery: Enables archival repositories to provide moderated full-text access to unrestricted email archives within a reading room environment.
Requirement: Support for data transmission
Supporting functionality: The appraisal module will accept email files directly (from a local file system) and also has the ability to connect directly to email servers to download email using IMAP. Other modules rely on outputs (files/directories) from other ePADD modules (i.e. appraisal output is input to processing module, processing module output is input to discovery module etc.).
Observations: There are numerous external tools available for moving data. The ability to connect directly to an email server is unique and simple if only transporting email content (i.e. no additional content/metadata).

Requirement: Support for standard formats
Supporting functionality: Email content can be input in MBOX or by directly connecting to an email server (therefore excellent support if only interested in ingesting email content). It is not possible to input other content (attachments) or metadata (beyond that contained directly in MBOX format). Email is output to MBOX format. Attachments are NOT extracted separately. Metadata that links correspondents, people, organisations or locations to external authorities (e.g. LC Subject Headings) can be output with URIs that represent the entity by the external authority.
Observations: While the format for wrapping metadata appears to be non-standard, the process for assigning the metadata for many descriptive elements (correspondent, location etc.) uses external authorities (linked data) which are well established standards for those specific vocabularies.

Requirement: Support for appropriate scope of data
Supporting functionality: ePADD ingests material structured around a particular person who may have more than one email account. It does not appear to offer the wider flexibility of allowing users to enter their own arbitrarily defined ‘packets’. It is not possible to input additional metadata or content beyond email/attachments. Processing work can be completed at individual item level (email or attachment) or at various levels of grouping (folder, collection etc.). Additional groupings, such as collections or series, can be added. Scope of outputs can vary as users can select individual emails to include or exclude. Only descriptive metadata can be output (but nothing for rights, sensitivity, processing history etc.). ePADD allows for the re-use or sharing of lexicon files for entity analysis. Lexicon files enable full text searching on a range of different terms, enabling stewards to conduct complex tiered searches.
Observations: Metadata can’t be input with email content. Metadata can’t be output explicitly, but is used in processing so stewards could define workflows that enable them to align to these manually. For example, the cart functionality can be used to select only emails with a certain rights value for output; then repeat for other values, creating an MBOX output file for each metadata value.

Requirement: Ability to track processing history and provenance
Supporting functionality: Not available currently.
Observations: As noted above, there could be some scope for manually outputting data that is grouped around a particular processing ‘event’, but there is no direct support for maintaining, much less merging, processing history.

Requirement: Support for maintaining the identity and integrity of data
Supporting functionality: Identifiers are internal (e.g. ePADD message ID). Integration with external authorities such as LC Subject Headings (FAST) ensures consistency and improves accuracy in applying descriptive metadata.
Observations: Support for external referencing systems would better enable multi-tool workflows. The linked open data approach for descriptive metadata is unique to ePADD but could be helpful if adopted by other tools.

Requirement: System Access and Documentation
Supporting functionality: User documentation available; technical documentation and code available on GitHub.
4. Key Findings: Analysis of Tools and Email Tools Data Sharing Framework
This section sets out analysis and findings for each of the ‘requirements for interoperability’ based on our understanding of the capabilities available across all of the tools today. With the exception of some specific integrations (e.g. Archivematica and ArchivesSpace), these tools were not designed to interoperate with each other, and so there are naturally a number of challenges or risks in trying to do that as the tools stand today.
4.1. Current state of data transmission
● Data transmission is, in general, considered out of scope by these tools.
● There is a risk to the chain of custody inherent in any attempt to chain tools together. The primary risk is to metadata that is part of the digital object itself (e.g. created on, created by, modified on, modified by etc.) which can easily be changed or lost as part of ‘moving’ data from one file system to another.
● Many of these tools attempt to minimize this risk internally, e.g., Archivematica, BitCurator, DArcMail and EAS all bundle several tools internally and manage data transmission between processing steps.
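The file-level risk described in this section can be checked at each hand-off between tools. The following is a minimal sketch, not a feature of any tool assessed here, assuming a single file is copied between two locally mounted file systems. The function names `sha256_of` and `transfer_with_checks` are hypothetical.

```python
import hashlib
import os
import shutil


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_with_checks(source, destination):
    """Copy one file, then report whether content and modified time survived."""
    before = {"sha256": sha256_of(source), "mtime": os.stat(source).st_mtime}
    # copy2 copies the bytes and also attempts to carry over timestamps.
    shutil.copy2(source, destination)
    after = {"sha256": sha256_of(destination), "mtime": os.stat(destination).st_mtime}
    return {
        "content_intact": before["sha256"] == after["sha256"],
        # Allow 2 seconds of drift for file systems with coarse timestamps.
        "mtime_preserved": abs(before["mtime"] - after["mtime"]) <= 2,
        "before": before,
        "after": after,
    }
```

A steward could keep the returned report alongside the transferred file as a simple record of the hand-off. Other file system metadata (such as creator or ownership) is not covered by this sketch.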
4.2. Use of standard formats
● Email content for most systems is based on well-established formats, particularly MBOX and EML. So far all systems can input MBOX.
○ EAS outputs only EML and not all tools support this as an input.
● Some systems support only very limited email-specific processing (e.g. Archivematica) and some do not at all (ArchivesSpace), but as these systems are designed to take in virtually any digital objects this is not a barrier for their more generic processing capabilities.
● In several cases, identification or referencing metadata is expected in a nonstandard ‘format’. Message IDs, repository ID, collection ID are often tied to specific external systems (EAS with DRS, DArcMail with CMS).
● PREMIS is the standard used to capture provenance or processing history metadata and rights metadata (for those systems that record this metadata).
● The Library of Congress BagIt standard is a file packaging format used by at least two of the tools (Archivematica and BitCurator).
4.3. Scope of email data or metadata exchange
● There are no significant barriers to exchanging any particular scope of email content, with the exception that some systems (e.g. ePADD) assume that email is dealt with or managed on an account basis, where an account is the email associated with only one individual. In other words, the user could not input all emails for an entire organisation and process them together at once (while maintaining all individual account level metadata).
● Several tools have limitations on the scope of metadata that can be input or accepted:
○ EAS, ePADD and DArcMail do not accept any metadata as an input
● Several tools have limitations on the scope of metadata that can be output:
○ ePADD does not allow for many types of metadata to be output
4.4. Capabilities for recording provenance and/or processing history
● If maintaining a full processing history is necessary, then it may not be feasible to use systems that don’t support this (ePADD, DArcMail).
4.5. Capabilities for maintaining identity and integrity of data
Use of unique identifiers:
● Most tools generate unique identifiers for data at various levels of granularity (some for individual email, virtually all for aggregations of some type such as folder, account, collection etc.).
● Most tools do not accept or store ‘external’ identifiers (i.e. unique IDs created by other systems). This may present challenges when using multiple tools because there are limited ways of ensuring that a particular data item or group of data is correctly identified (for instance, if looking at a particular email in one tool, is there a way of confidently finding and processing the same exact email in another tool).
● Some tools do provide some means of capturing external identifiers (e.g. in Archivematica by providing IDs within a metadata csv file at the point of transfer). However none of the tools appear to support this at the level of individual emails.
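Where tools do not carry external identifiers for individual emails, a steward may still be able to match messages across tools from the messages themselves. The sketch below is one possible workaround and not a feature of any tool assessed here. It assumes each tool can export raw messages keyed by its own local ID, and it relies on the Message-ID header of RFC 5322. The function names and the header-hash fallback are assumptions made for illustration.

```python
import hashlib
from email import message_from_bytes


def message_fingerprint(raw_bytes):
    """Derive a tool-independent key for one email message."""
    message = message_from_bytes(raw_bytes)
    message_id = message.get("Message-ID")
    if message_id:
        # Preferred key: the Message-ID header carried by the message itself.
        return "msgid:" + str(message_id).strip()
    # Fallback (an assumption of this sketch): hash a few stable headers.
    digest = hashlib.sha256()
    for name in ("From", "To", "Date", "Subject"):
        value = str(message.get(name, "")).strip()
        digest.update(f"{name}:{value}\n".encode("utf-8", errors="replace"))
    return "sha256:" + digest.hexdigest()


def match_exports(export_a, export_b):
    """Map local IDs in export_a to local IDs in export_b for the same message.

    Each export is a dict of {local_id: raw message bytes}.
    """
    index = {message_fingerprint(raw): local_id for local_id, raw in export_a.items()}
    matches = {}
    for local_id, raw in export_b.items():
        key = message_fingerprint(raw)
        if key in index:
            matches[index[key]] = local_id
    return matches
```

Message-ID values are not guaranteed to be present or unique in real collections, so any match produced this way should be treated as a candidate for review and not as proof of identity.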
Definition of key elements and aggregations:
● Many of the tools allow users to define the elements or aggregations that suit them best. This flexibility is a strength but could lead to some confusion if elements or aggregations are not defined consistently between systems.
● The definition of an Email Account is probably the most significant concern as it appears to be defined differently in different systems. An email account in one tool may appear to be the same email account when viewed or processed in another tool, but the risk is that it isn’t because the definitions are not consistent. There is also the risk that the data models are not compatible, for instance if one system only allows one email address per account where another allows multiple addresses.
4.6. System access and documentation
● All of the open source systems have publicly available documentation or knowledge resources; however, access to developers or subject matter experts may not be publicly available.
● Neither EAS nor DArcMail is currently available beyond its institution. Both project teams intend to release them with open source licenses, but work is required to do this and make the software available to the community.
5. Opportunities to Improve the Interoperability of Email Tools
Several draft recommendations are suggested below for discussion. At this stage no effort has been made to prioritize these or set out concrete next steps. We have kept the scope of these to areas that we feel address the interoperability of the specific tools assessed in this report.
We have not made any specific recommendations regarding the challenges of transmitting data between systems. While there are some clear risks, as described in the first part of section 4.1 (such as chain of custody and file integrity), we feel that a) these are very broad and apply to all forms of preservation using multiple tools and b) the extent of the problem is not well defined or agreed on; for example, some institutions may not see any problems with data transmission protocols that happen before formal accession. While we feel this area warrants further consideration, that may be outside the scope of concern for this report.
5.1. Enhance tools to support external reference identifiers
At the very least, tools need to be able to accept and maintain external identifiers so that email stewards can keep track (at multiple levels of granularity) of what data is being processed throughout a workflow.
In general, email stewards should be able to use the identifiers for individual items, folders or other groupings from one system when exporting data and carrying out further processing in another system.
Ideally external identifiers would also be captured when capturing processing history so that it is possible to clearly track the chain of custody (for example by associating the identifier with the PREMIS agent involved in processing).
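Until tools accept external identifiers directly, stewards could maintain the mapping themselves. The following is a minimal sketch of such a crosswalk, kept outside the tools, assuming each tool exposes a stable local ID for the item or grouping concerned. The class name `IdentifierCrosswalk`, the tool labels and the use of UUID-based URNs are all hypothetical choices made for illustration.

```python
import uuid


class IdentifierCrosswalk:
    """Map each tool's local identifier to one shared external identifier."""

    def __init__(self):
        self._by_local = {}     # (tool, local_id) -> external_id
        self._by_external = {}  # external_id -> {tool: local_id}

    def register(self, tool, local_id, external_id=None):
        """Record a local ID, minting a new external ID when none is supplied."""
        key = (tool, local_id)
        if key in self._by_local:
            return self._by_local[key]
        if external_id is None:
            external_id = f"urn:uuid:{uuid.uuid4()}"
        self._by_local[key] = external_id
        self._by_external.setdefault(external_id, {})[tool] = local_id
        return external_id

    def local_id(self, external_id, tool):
        """Return the local ID a given tool uses for an external ID, if known."""
        return self._by_external.get(external_id, {}).get(tool)
```

For example, an item first registered from one tool receives an external identifier, and the same identifier can then be passed to `register` when the item is loaded into a second tool, so that either local ID can be looked up from the shared one.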
5.2. Adopt standard approaches to capturing and respecting rights and sensitivity metadata
Given that email collections often contain content with a variety of different rights, and that there is a wide spectrum of privacy and confidentiality issues that can be involved, email archiving tools should support standard ways of capturing rights or sensitivity metadata.
Many systems already use standards for rights (for instance using PREMIS rights entities); however, there doesn’t appear to be an equivalent approach for recording sensitivity or privacy information.
5.3. Establish MBOX as minimum standard for input and output of email content
MBOX is the most widely used standard amongst the tools considered here. EML is also a widely used standard and supported by a majority of email clients. The EAXS standard used in DArcMail may be more comprehensive but has so far not been widely adopted and there are no tools for discovery and access in that format.
We therefore recommend that tool providers consider adding MBOX, complying with RFC 4155 (The application/mbox Media Type) and RFC 5322 (Internet Message Format), as a standard for both input and output (where that doesn’t already exist). This doesn’t necessarily mean obsoleting the use of EML or EAXS, but simply providing additional support to enable maximum interoperability between tools.
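Conversion between EML and MBOX is possible with widely available libraries, which suggests the cost of adding MBOX support need not be high. The following is a minimal sketch using the `mailbox` and `email` modules from the Python standard library. It assumes a directory of `.eml` files, one message per file; the function names are hypothetical, and file locking and error handling are omitted for brevity.

```python
import mailbox
from email import message_from_bytes
from pathlib import Path


def eml_files_to_mbox(eml_dir, mbox_path):
    """Append every .eml file in a directory to an MBOX file; return the count."""
    box = mailbox.mbox(str(mbox_path))
    count = 0
    try:
        for eml_file in sorted(Path(eml_dir).glob("*.eml")):
            box.add(message_from_bytes(eml_file.read_bytes()))
            count += 1
        box.flush()
    finally:
        box.close()
    return count


def mbox_to_eml_files(mbox_path, eml_dir):
    """Write each message in an MBOX file to its own numbered .eml file."""
    out_dir = Path(eml_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for position, message in enumerate(box, start=1):
            target = out_dir / f"{position:06d}.eml"
            # as_bytes() omits the MBOX "From " separator line by default.
            target.write_bytes(message.as_bytes())
            written.append(target)
    finally:
        box.close()
    return written
```

A round trip through a parser of this kind is not guaranteed to be byte-for-byte identical to the input, so stewards who need to preserve original files should keep them alongside any converted copies.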
5.4. Establish a common exchange standard for packaging email with metadata
A standard for packaging digital content, describing the contents of the package and ensuring integrity of the package using hashes will greatly improve the ability to transfer data safely between systems. The Library of Congress BagIt standard is well-established and is already used by at least two of the tools here (Archivematica and BitCurator).
The BagIt standard may not be enough in itself however. While recommendation 5.3 would ensure that email content can be transferred using the MBOX standard, additional structural and metadata standards may be needed to define minimum expectations for what content or metadata is required, optional or acceptable. For example, they could clarify whether it is acceptable to package multiple email accounts together.
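To illustrate what BagIt packaging involves, the following is a minimal sketch that builds and checks a bag by hand using only the Python standard library. It writes the `bagit.txt` declaration, a `data/` payload directory and a `manifest-sha256.txt` file. The function names, the version number declared and the choice of SHA-256 are assumptions of this sketch; production workflows would more likely use an established BagIt library, and optional tag files such as `bag-info.txt` are left out.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(payload_files, bag_dir):
    """Copy files into bag_dir/data and write the declaration and manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True)  # fails if the bag already exists
    manifest_lines = []
    for source in map(Path, payload_files):
        target = data / source.name  # name collisions are not handled here
        shutil.copy2(source, target)
        digest = hashlib.sha256(target.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{source.name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Return True when every payload file matches its manifest checksum."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        digest, relative_path = line.split(None, 1)
        payload = bag / relative_path
        if not payload.is_file():
            return False
        if hashlib.sha256(payload.read_bytes()).hexdigest() != digest:
            return False
    return True
```

The manifest covers file integrity only. It says nothing about what the payload means, which is why the additional structural and metadata expectations discussed above would still be needed.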
5.5. Support capture of processing history
Several tools record processing history using the well established PREMIS standard.
Ideally all tools would provide this capability so that comprehensive processing history can be captured throughout a workflow using multiple tools.
Further consideration should be given to the consolidation of processing history files from different systems, or the ability to manually add processing history (to fill any gaps where a tool does not yet record it automatically).
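One way to picture such consolidation is sketched below. It is a minimal illustration, not an existing feature of any tool, assuming each tool's history has already been read into simple dictionaries with the hypothetical keys `event_id`, `event_type`, `date_time` (ISO 8601 with a UTC offset) and `tool`. Manually recorded events use the same shape. The XML uses PREMIS 3 element names but has not been validated against the PREMIS schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

PREMIS_NS = "http://www.loc.gov/premis/v3"


def merge_histories(*histories):
    """Combine event lists from several sources into one chronological list."""
    combined = [event for history in histories for event in history]
    return sorted(combined, key=lambda e: datetime.fromisoformat(e["date_time"]))


def _child(parent, name, text=None):
    """Add a namespaced child element, optionally with text content."""
    element = ET.SubElement(parent, f"{{{PREMIS_NS}}}{name}")
    if text is not None:
        element.text = text
    return element


def to_premis_xml(events):
    """Serialize merged events as a PREMIS-style XML string."""
    ET.register_namespace("premis", PREMIS_NS)
    root = ET.Element(f"{{{PREMIS_NS}}}premis")
    for event in events:
        node = _child(root, "event")
        identifier = _child(node, "eventIdentifier")
        _child(identifier, "eventIdentifierType", "local")
        _child(identifier, "eventIdentifierValue", event["event_id"])
        _child(node, "eventType", event["event_type"])
        _child(node, "eventDateTime", event["date_time"])
        # Record which tool (or manual entry) produced the event.
        agent = _child(node, "linkingAgentIdentifier")
        _child(agent, "linkingAgentIdentifierType", "tool name")
        _child(agent, "linkingAgentIdentifierValue", event["tool"])
    return ET.tostring(root, encoding="unicode")
```

Sorting by timestamp is the simplest possible merge rule. A real consolidation would also need to reconcile the object identifiers each tool uses, which links this recommendation to recommendation 5.1.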
5.6. Establish standard definition and description of email collections
It isn’t clear that the definition of what constitutes an email account (including the relationship with email addresses, or people) is consistent between tools. Establishing a common definition will enable alignment of different data models used and reduce the risk of confusion or mis-identification of email collections at this fundamental level.
With a consistent and standard definition, it will then be possible to develop a common standard for describing email accounts. This would help improve the precision of search and discovery and better enable the exchange of descriptive metadata between tools.
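The data model mismatch can be made concrete with a small example. The sketch below is purely illustrative and does not reflect the data model of any tool assessed here: it assumes one hypothetical shared definition in which an account belongs to one owner and may hold several addresses, and shows how such an account might be mapped onto a tool that allows only one address per account.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EmailAccount:
    """Hypothetical shared definition: one owner, one or more addresses."""
    account_id: str
    owner: str
    addresses: Tuple[str, ...]


def to_single_address_accounts(account):
    """Split an account into one account per address, keeping the owner."""
    return [
        EmailAccount(f"{account.account_id}/{position}", account.owner, (address,))
        for position, address in enumerate(account.addresses, start=1)
    ]


def same_holdings(accounts_a, accounts_b):
    """True when both groups cover exactly the same owner/address pairs."""
    def pairs(accounts):
        return {
            (account.owner, address.lower())
            for account in accounts
            for address in account.addresses
        }
    return pairs(accounts_a) == pairs(accounts_b)
```

A check such as `same_holdings` gives stewards a way to confirm that nothing was lost or merged incorrectly when the same collection is described under two different account models.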
5.7. Make local tools publicly available with an open source license
Tools that are only usable by one institution are not useful to the wider email archiving community. While there are clearly costs to making a tool more widely available and trying to create and maintain an active community around it, we feel there are many benefits that can offset those costs in the long run, including opening up the project to a wider base of developers, testers and potential funders.
Acknowledgements
This project built on the great work started at the Harvard Email Archiving Stewardship Tools (EAST) workshop in March 2016. We would like to thank the original participants and acknowledge the many contributions received since.
In particular we would like to thank the contributors to the Email Data Sharing Framework: Glynn Edwards, Josh Schneider and Peter Chan (Stanford University), Andrea Goethals, Grainne Reilly and Skip Kendall (Harvard University), Sarah Romkey and Justin Simpson (Artefactual Systems Inc.) and Cal Lee (University of North Carolina Chapel Hill).
Numerous reviewers provided helpful contributions and suggestions for this report. We would like to thank Evelyn McLellan, Justin Simpson and Sarah Romkey (Artefactual Systems Inc.), Anthony Moulen, Andrea Goethals and Grainne Reilly (Harvard University), Chris Prom (University of Illinois at Urbana-Champaign), Cal Lee (University of North Carolina Chapel Hill) and Riccardo Ferrante (Smithsonian Institution Archives).
We would like to thank Harvard Library for the opportunity to engage in this work and for providing support and direction throughout.
Finally, thanks to Wendy Gogel (Harvard University) for contributions on many fronts and providing leadership for the project.