storage 2.0 (unstructured data)
Vikas Deolaliker, 2008
Executive Summary – I
Opportunity
Fixed content mining is a computationally intensive operation. A purpose-built appliance with adequate integration hooks to back-end data warehousing systems, and with add-ons/plug-ins for the most popular BI clients, will meet the requirements of departments and of small and medium-sized businesses.
The key value multiplier for such a product is its ability to integrate seamlessly with existing enterprise systems and to generate reports that can be printed or imported (into Excel, Access, etc.).
The market for such an appliance is expected to reach $200M in 2010 (not counting the storage/server pull-through).
Industry
Unstructured data, or “Fixed Content”, refers to digital content that is generated outside a business context, i.e. the data does not have a schema and is not stored in databases. Normally, all non-transactional data such as email, IM, media, Web content, metadata and customer-generated content is considered unstructured.
Businesses increasingly want to add information from “unstructured data” to their library of information sources to improve business decision making. The Business Intelligence industry offers numerous products that enable discovery, intercept, metadata extraction, semantic analysis, storage and information lifecycle management (ILM) of unstructured data. Companies such as Interwoven, Vignette, Informatica, Manugistics, IBM and Oracle have products that enable warehousing of content that is considered “unstructured”. BusObj recently acquired Inxight Software for text analytics.
Storage system vendors such as EMC and NetApp have created a new category of storage systems called “Content Addressable Storage”, or CAS. IBM recently acquired XIV to offer Nextra, a competitor to EMC's products.
SNIA, a storage industry standards body, has an initiative called XAM (eXtensible Access Method). XAM-compliant products offer a programmable interface for archiving applications to query (search), retrieve and control access to fixed content. XAM also allows compliance software to access fixed content for SOX and other regulatory compliance tests.
Executive Summary – II
Market
Fixed content is stored on NAS appliances and clusters. The market for rich media NAS is growing fast and is expected to surpass $1B in 2009. EMC is the current leader in NAS with 36% of the market and is expected to lead the NAS market for fixed content as well.
BI on fixed content is an emerging market. The market for fixed content BI software is still in its infancy, with annual revenues under $20M. This market came into focus with the acquisition of Inxight Software by BusObj. This software mostly runs on Windows and accesses storage using iSCSI.
Content Service Providers (CSPs) like Google and Yahoo store fixed content in clusters of servers that run their own proprietary filesystems, a.k.a. Storage 2.0. This market is proprietary because those filesystems are a source of differentiation for the CSPs.
The Trend: According to IDC, transactional data is growing at 32.3% while fixed content (unstructured data) is growing at 63.7%. Replicated or back-up data is growing at 43% p.a.
IDC, Dec. 2007
- Unstructured data growth means growth in file services over LAN/WAN.
- Replicated data growth is lower than expected.
- Structured data growth is lowest because most of the data generated today is outside a transactional context.
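To make the gap between these growth rates concrete, the following minimal sketch converts each annual rate into a doubling time and a multi-year growth multiple. The three rates are the IDC figures quoted above; the function names and the five-year horizon are illustrative assumptions, not part of the IDC data.

```python
import math

# Annual growth rates quoted from IDC in the text above.
GROWTH = {
    "transactional": 0.323,
    "fixed content": 0.637,
    "replicated/back-up": 0.43,
}


def doubling_time_years(rate):
    """Years needed for a quantity growing at `rate` per year to double."""
    return math.log(2) / math.log(1 + rate)


def growth_multiple(rate, years):
    """Factor by which the quantity grows after `years` of compounding."""
    return (1 + rate) ** years


for name, rate in GROWTH.items():
    # The 5-year horizon is an arbitrary choice for illustration.
    print(f"{name}: doubles in {doubling_time_years(rate):.1f} years, "
          f"x{growth_multiple(rate, 5):.1f} after 5 years")
```

At these rates fixed content doubles in roughly 1.4 years, against roughly 2.5 years for transactional data.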
The Opportunity: Client can take a three-pronged strategy to enter the fixed content market: (a) develop asset management software, (b) develop Storage 2.0 infrastructure, (c) develop a purpose-built BI appliance for fixed content.
                         2005  2006  2007  2008  2009  2010   2011   2012  2007 Share (%)  2007-2012 CAGR (%)  2012 Share (%)
Unix                       82    97   113   131   153   179    210    250            22.6                17.2            20.9
Linux/other open source     6     7     9    11    14    17     22     26             1.8                23.9             2.1
Windows 32 and 64         197   238   285   341   409   492    590    703            57.3                19.8            58.8
Other                      64    77    91   108   129   153    182    218            18.3                19              18.2
Total                     349   418   498   591   705   841  1,004  1,196           100                  19.2           100
Digital Asset Management Software, IDC, 2008, $M
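The CAGR column can be checked from the 2007 and 2012 columns with the standard compound annual growth rate formula. This is a minimal sketch: the function name `cagr` and the `SERIES` dictionary are illustrative, and because the table values are rounded to whole $M, a recomputed rate can differ slightly from the published one.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# 2007 and 2012 values ($M) copied from the IDC table above.
SERIES = {
    "Unix": (113, 250),
    "Windows 32 and 64": (285, 703),
    "Total": (498, 1196),
}

for name, (v2007, v2012) in SERIES.items():
    print(f"{name}: {cagr(v2007, v2012, 5) * 100:.1f}% per year")
```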
Client can enter the market with media asset management software. The software pulls through server and storage infrastructure.
a) Digital Asset Management
- Workflow and ILM of fixed content
- Content intelligence functionality such as text analytics, search, visualization
b) File System Infrastructure (a.k.a. Storage 2.0)
- Enhanced NAS storage for low-latency, API-based access to media content in filesystems
c) Business can add modules to the BI pipeline and make a purpose-built appliance for fixed content.
The Market: The fixed content market totals approximately $1B. The majority of the market is storage hardware and software. NAS is the dominant storage organization for fixed content. Emerging products such as Storage 2.0, fixed media asset management software and BI appliances are currently at a nascent stage.
Hardware (~$1B)
- EMC, NetApp, HDS, IBM, HP
- Sell directly to CSPs like Yahoo, Google, eBay
Software (~$20M)
- Asset Management: HP, IBM
- Business Intelligence: BusObj (Inxight)
- Fixed Content Search/Retrieval: Endeca, Lucene, Microsoft (FAST)
- Content Editing: Adobe, Microsoft, Apple
Storage 2.0: Focus has shifted from IO in Storage 1.0 to higher-level file services. The driver is no longer access protocol but content preference.
Feature              Storage 1.0                                Storage 2.0
Management           Local Application                          Web Application
Access               SCSI over FC, GbE or Bus                   SCSI over IP
Provisioning         LUN-level granularity                      Filesystem size
Virtualization       LUN, Volume Mgr.                           Object level
Application Profile  Write Many Read One                        WORM
Oversubscription     1:1 (provisioning is equal to allocation)  N:1 (provisioning can be N times allocation)
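The oversubscription row can be illustrated with a toy thin-provisioned pool in which provisioning only records a promise and physical space is consumed on write. This is a sketch under stated assumptions: the `ThinPool` class, its method names and the sizes are hypothetical and do not describe any vendor's product.

```python
class ThinPool:
    """Toy storage pool that allocates physical space only when data is written."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.provisioned = {}  # volume name -> size promised to the application (GB)
        self.allocated = {}    # volume name -> space actually consumed (GB)

    def provision(self, name, size_gb):
        # Provisioning records a promise; no physical space is reserved yet.
        self.provisioned[name] = size_gb
        self.allocated[name] = 0

    def write(self, name, size_gb):
        if self.allocated[name] + size_gb > self.provisioned[name]:
            raise ValueError("write exceeds the provisioned size of the volume")
        if sum(self.allocated.values()) + size_gb > self.physical_gb:
            raise RuntimeError("pool is out of physical space")
        self.allocated[name] += size_gb

    def oversubscription(self):
        """Ratio of provisioned capacity to physical capacity (the N in N:1)."""
        return sum(self.provisioned.values()) / self.physical_gb


pool = ThinPool(physical_gb=100)
for volume in ("a", "b", "c"):
    pool.provision(volume, 100)   # 300 GB promised against 100 GB of disk
pool.write("a", 40)
pool.write("b", 50)
print(pool.oversubscription())
```

With 1:1 provisioning the third `provision` call would already fail; with N:1 the pool only fails when actual writes exceed the physical capacity.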
The Infrastructure Play: Filesystems are turning into a platform with programming interfaces, data routers, load balancers, autonomic functions, analyzers and parsers.
[Figure: side-by-side block diagrams of Filesystem 1.0 and Filesystem 2.0, each split into user space and kernel space. Surviving labels: Process/Socket, VFS, FileSys, iNode Cache, Object Cache, Buffer, Block Driver, Disk; Tools, ClusterFS, Client, Name Server, MetaData Server, File Driver, Block Driver, Disk.]
Filesystem 1.0: Most functional blocks are in kernel space. Data and control have the same path. Block-level semantics are exposed to applications.
Filesystem 2.0: Most functional blocks are in user space. Data and control have separate paths. Block-level semantics are hidden from applications.
Filesystem 2.0: Content-centric filesystem where the primitive is no longer a block but a file.
Content Addressable Storage
- Next generation of NAS devices that implement a CAS filesystem. The focus shifts from a block device driver to a file driver, which manages the “chunks” of a file on multiple underlying block devices.
- NAS with CAS does not use FC disks; it prefers SAS/SATA disks for their low cost. It replaces the backplane with a high-performance interconnect with RDMA.
- Provides an API for applications. When an application queries the metadata server for a file, it is given a fileID instead of an iNode with block-level addresses. It uses file addresses to locate a file. This creates a need for a file data router.
- The actual data can be stored on an existing node-level filesystem or a cluster filesystem.
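The flow described above (a fileID from the metadata server, chunks spread over several store nodes, and a file data router on the data path) can be sketched as follows. Everything here is an illustrative assumption rather than a description of any shipping CAS product: the class names, the SHA-256 content hash, the tiny chunk size and the hash-modulo placement rule are all hypothetical choices.

```python
import hashlib

CHUNK_SIZE = 8  # bytes; deliberately tiny so the example is easy to trace


def content_id(data: bytes) -> str:
    """Address derived from the content itself (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


class StoreNode:
    """Hypothetical store node holding chunks keyed by their content hash."""

    def __init__(self):
        self.chunks = {}

    def put(self, data: bytes) -> str:
        chunk_id = content_id(data)
        self.chunks[chunk_id] = data  # identical chunks collapse into one entry
        return chunk_id

    def get(self, chunk_id: str) -> bytes:
        return self.chunks[chunk_id]


class MetadataServer:
    """Control path only: maps a fileID to (node index, chunk id) pairs."""

    def __init__(self):
        self.files = {}

    def register(self, file_id, locations):
        self.files[file_id] = locations

    def lookup(self, file_id):
        return self.files[file_id]


class FileRouter:
    """Data path: splits files into chunks and routes them to store nodes."""

    def __init__(self, metadata, nodes):
        self.metadata = metadata
        self.nodes = nodes

    def write(self, data: bytes) -> str:
        locations = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            # Placement by hash modulo node count is an assumed, simplistic rule.
            node_index = int(content_id(chunk), 16) % len(self.nodes)
            chunk_id = self.nodes[node_index].put(chunk)
            locations.append((node_index, chunk_id))
        file_id = content_id(data)
        self.metadata.register(file_id, locations)
        return file_id

    def read(self, file_id: str) -> bytes:
        locations = self.metadata.lookup(file_id)
        return b"".join(self.nodes[n].get(c) for n, c in locations)
```

The application only ever handles the fileID; block addresses and chunk placement stay hidden behind the router, and writing the same content twice stores no additional chunks.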
BI Play: Fixed content is being warehoused in the enterprise, and information from this content is being analyzed and delivered in many ways to the client. Client’s opportunity lies in adding support for fixed content across the BI pipeline.
[Figure: BI pipeline diagram. Stage labels: Mining, Tools, Analytics, Delivery. Other surviving labels: Supply Chain, Customer Relationship, Financials, Sales Force, Human Resources, Image/Video; Router, ETL, Warehousing, Metadata Extraction, Scatter/Gather, Data Storage, Workflow Integration, Fixed Content Support; Query, Search, Media Search, Report Generators, OLAP; Visualization, Web Service, Reports, Real Time, Portal.]
BI Modules I: The BI pipeline needs to be enhanced to support the 54% of enterprise data that is fixed content.
Warehouse
- Existing warehousing techniques depend upon the ETL (Extract, Transform & Load) methodology, i.e. changing the format of the data and making it amenable to further processing.
- Fixed content warehousing needs tools such as transcoders, metadata annotation tools, caching, Variable Bit Rate (VBR) encoding, etc.
Tools
- Fixed content needs tools such as search that covers both content and metadata. Semantic analysis is going to end up in this domain once it is well specified by industry bodies.
- Mashup is a tool that will probably end up in the BI domain once it is standardized. Current XHR-based mashup is for web browsers only.
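A search tool that covers both content and metadata can be sketched with a small inverted index. This is a minimal illustration, not a description of any product named in this deck: the document layout (a `content` string plus a `metadata` dictionary), the whitespace tokenizer and the function names are all assumptions.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase token to the ids of documents containing it,
    whether the token occurs in the body text or in a metadata value."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        fields = [doc["content"], *doc["metadata"].values()]
        for field in fields:
            for token in str(field).lower().split():
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical fixed content items with extracted metadata.
DOCUMENTS = {
    "a": {"content": "quarterly sales review", "metadata": {"type": "email", "author": "kim"}},
    "b": {"content": "sales demo", "metadata": {"type": "video"}},
}
INDEX = build_index(DOCUMENTS)
print(sorted(search(INDEX, "sales")))
```

A query such as "sales video" matches item "b" even though "video" appears only in its metadata, which is the point of indexing both.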
BI Modules II
Analytics
- Fixed content analytics requires recognition and analysis of all media types: text, voice and video. Streaming video is a challenge.
- The visualization backend is going to end up in this domain. Visualization transcodes data based on the access device.
Delivery
- The end user can ask for data delivery as a real-time stream, a downloadable file, a visual image or a web service.
Content Management Play: Digitization and automation are redefining the upstream portion of the value chain. SOA is increasingly being used as the integration technology to drive the C&P framework.
[Figure: content value chain diagram with the stages Repository, Processing and Distribution, tied together by Workflow Orchestration. Surviving labels: Film, DVD, Camera, File, Music, Broadcast, Cinema, Cable/Sat, Tape, Internet; Storage; Distribution Channels; Content (Static/Dynamic, Structured/Un, Format(s), DRM, Batch/Auto); Metadata Tagging/Cataloging, Domain Specifics, Ontologies; iSCSI Block, File Storage, Search; Adaptation for devices, Integration with User Interface; Streaming, Download, Repackaging; Formats, Encryption.]
The Product: Enhancing existing storage products to make them metadata-aware and global-namespace-aware will enable Business to enter the fixed content market.
Target CSPs with an enhanced FS for unstructured data.
- FS expects nodes to be shared memory. Enable support of COTS servers.
- Separate name server, head node and store nodes.
Target the enterprise unstructured ILM market with xFrame.
- Enhance xFrame to include unstructured data ILM by moving time-sensitive data into near storage and tagging stale data for archival.
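The ILM behaviour described in the last bullet can be sketched as a simple age-based policy. This is a hypothetical illustration and not xFrame's actual logic: it assumes that "time-sensitive" means recently accessed, and the 30-day and 365-day thresholds, the tier names and the function names are invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; a real policy would be configured per data class.
NEAR_STORAGE_WINDOW = timedelta(days=30)
STALE_AFTER = timedelta(days=365)


def classify(last_access, now):
    """Pick an ILM action from how long ago a file was last accessed."""
    age = now - last_access
    if age <= NEAR_STORAGE_WINDOW:
        return "move-to-near-storage"
    if age >= STALE_AFTER:
        return "tag-for-archival"
    return "leave-in-place"


def plan(files, now):
    """files maps a file name to its last access time; returns name -> action."""
    return {name: classify(last_access, now) for name, last_access in files.items()}


NOW = datetime(2008, 6, 1)
FILES = {
    "q2_forecast.ppt": datetime(2008, 5, 20),
    "jan_memo.doc": datetime(2008, 1, 15),
    "2006_audit.pdf": datetime(2006, 12, 31),
}
for name, action in plan(FILES, NOW).items():
    print(name, "->", action)
```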
GTM Strategy: Enterprise, CSPs and SMBs form the three segments of the fixed content market.
Enterprises
- Make alliances with and/or make enhancements to: fixed content asset management software companies
- With the following offering: enhanced FS on cluster with enhanced xFrame
CSPs
- Make alliances with and/or make enhancements to: integrate with CSPs' proprietary metadata servers and tools
- With the following offering: server and storage
SMBs
- Make alliances with and/or make enhancements to: create a CSP-In-A-Box solution
- With the following offering: CSP-In-A-Box