Persistence of Memory: In-Memory Is Not Often the Answer
On the Persistence of Memory (in Database Systems) i
© 2012 Hired Brains Inc. All Rights Reserved
On the Persistence of Memory…
In Database Systems
By Neil Raden Hired Brains, Inc.
December, 2012
Table of Contents
Executive Summary
The Basics
Database Memory and Processing Models
In-Memory Database
Why is in-memory, a fairly old concept, interesting again?
Limitations of iMDB
  Cost
  Persistence
  Volume
  Dual-Purpose OLTP and Analytics
  Not so "green"
The Hybrid DBMS
Compare and Contrast
Conclusion
About the Author
Executive Summary
A recent drop in computer memory prices, together with the introduction of early in-memory database solutions, has raised the level of interest in in-memory databases, but the topic is not new. In fact, there are literally dozens of in-memory database products, some in production for decades, but due to the prohibitive cost differential between memory-based and disk-based systems, none found a place beyond certain niche markets. The drastic drop in the cost of memory, combined with an equally remarkable growth in density and capacity, is now driving the discussion into the mainstream of computing architectures.
For the purposes of discussion, we refer to in-memory database systems as iMDB and to current relational database systems incorporating large memory models with attached storage (including traditional magnetic disk and solid-state devices) as hybrid-DBMS. Though the discussion is occasionally technical, our conclusions are that:
• iMDB leverage lower-cost RAM for storage but still lack persistence and data scalability, which limits the types of solutions the iMDB architecture can support.
• Hybrid-DBMS is a proven technology that provides high performance and a flexible architecture to support a variety of analytics applications.
The Basics
All database management systems (DBMS), in fact virtually all programs in conventional computing environments, behave exactly the same way. A central processing unit (CPU) performs a single, very low-level instruction on a single piece of data. While complex application programs like a DBMS have many layers of functionality and can be described logically as a set of higher-level interworking pieces, the CPU has utterly no insight into this; it just chugs along one instruction at a time. If you were to sit inside a CPU and watch its stream of sequential operations, you would be unable to determine what the controlling program was doing. Database software, or really any software, is just a logical structure that encapsulates all of the smaller steps. When things get calculated, they bear no resemblance to the whole; a CPU does not know what a join or an index is. How those bits of work are presented to the CPU is the heart of application design. In other words, though there is no difference in how CPUs execute from one application to another, the order of those instructions is the key to performance.
Each step in execution is composed of a single instruction and a single piece of data (though today's CPUs are composed of multiple "cores," essentially multiple CPUs on a single chip). The instruction and the data have to be presented to the CPU through memory, either system RAM or a memory cache on the CPU itself. It makes no difference whether the application is "in-memory" or disk-based; the CPU still has to be presented with the instruction (actually, the instruction set is burned into the CPU; what is presented to it is a directive for which instruction to execute). For this reason, an in-memory architecture, where all instructions and data are in RAM, should in theory provide superior performance compared to a DBMS that must fetch data from remote mechanical disk drives.
Solid-state drives (SSD), mentioned above, use solid-state memory chips, typically flash memory (NAND), instead of spinning magnetic disks. Flash memory is less expensive and slower than RAM/SRAM, but it is non-volatile, meaning it retains data
without power. It does not lose data in the case of a system shutdown. RAM is volatile: it must be powered continuously and requires backup, typically to conventional disk drives, for reliability.
One could say that a DBMS with SSD instead of traditional disks is an in-memory device, but there is a fundamental difference: the memory chips of an SSD are part of a disk drive "card" or assembly that uses the same block addressing as the disks it replaces. In other words, even though the seek time for finding data on an SSD is at least an order of magnitude shorter than on a spinning magnetic disk (this is a generalization), there is still a call for external data, handled by the disk controller and passed to RAM. An interesting arrangement, typically used for add-on accelerators rather than primary database operations, is SSDs constructed from SRAM, which further reduces seek time. This is a special-purpose, very expensive architecture and is not considered further here.
Database Memory and Processing Models
To clear up confusion among the various models for memory in databases, it is useful to describe the predominant versions. The two predominant memory models for the most common database systems are shared memory and shared nothing. In both, memory is used only for processing, not for persistent storage. This is the essential difference between today's iMDBs and more conventional on-disk or hybrid systems.
In the shared memory model, all database operations use the same single aggregation of memory, and the system allocates its memory and processing tasks. All memory is available to every processor. In a shared nothing system, each separate node of processors and memory does its own work in parallel and is, typically, controlled by a
master node (which can be physical or virtual). In reality, nodes in a shared nothing environment may themselves operate as independent shared memory nodes. But in neither case is data stored in memory until it is called for. The exception is when data is cached (frequently used data is "pinned" in memory), but even cached data is volatile and can be flushed at any time.
iMDB operate more or less like shared memory systems, but everything, including the operating system, software programs (executables), workspace, indexes, and data, is stored in RAM. When these systems are scaled out with multiple nodes connected by a network, they operate more like a grid or distributed network than like a true MPP-engineered system. However, the concepts of shared memory versus shared nothing are a little obsolete now, as CPUs themselves are multi-core, meaning the processors are capable of parallel processing, provided the software (the DBMS) has been designed to take advantage of it.
This description is a simplification and there are many exceptions, but in general, no
database management system stores data persistently in memory except, of course,
iMDB. The difference between the various memory models described above is how
memory is used for processing data.
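The shared-nothing pattern described above can be illustrated with a toy scatter-gather sketch (the names, the four-node layout, and the hash-partitioning scheme are invented for illustration; real MPP systems add query optimizers, interconnects, and data redistribution):

```python
# Toy shared-nothing sketch: rows are hash-partitioned across nodes,
# each node scans only its own partition, and a master gathers results.
NODES = 4

partitions = [[] for _ in range(NODES)]

def insert(row):
    # Hash-partition on the row key so each node owns a disjoint slice.
    partitions[hash(row["id"]) % NODES].append(row)

def parallel_count(predicate):
    # Each "node" works only on its own partition (no shared memory);
    # the master simply sums the partial results.
    return sum(
        sum(1 for row in part if predicate(row))
        for part in partitions
    )

for i in range(100):
    insert({"id": i, "amount": i * 10})

print(parallel_count(lambda r: r["amount"] >= 500))  # 50 rows match
```

The point of the sketch is only that no node ever reads another node's memory; all coordination happens through the gathering step.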
In-Memory Database
It is an unassailable truth that processing data from memory is orders of magnitude faster than retrieving it from a disk drive, but that is only a small part of the story. Historically, CPUs have been "I/O bound," meaning they spent a significant amount of time waiting for requested data to arrive, requiring extreme countermeasures in software design to minimize the latency. With data streaming to processors at the speed of random-access memory (RAM), just the opposite situation
can occur: the CPUs may become flooded with data and unable to process it as quickly as it is presented. The point cannot be stressed enough: merely boosting the available RAM does not guarantee smooth, faster execution of existing programs. This turn of events calls for careful engineering and balance. In other words, the performance of complex applications is rarely resolved by changing one thing; it usually requires rethinking the whole approach. The result is that software migration to in-memory usually requires a great deal of rework; it is not just move and drop.
Even the notion of iMDB is a bit of a misnomer, as there is still a requirement for separate conventional storage devices to mirror everything for persistence and to keep the iMDB refreshed and reliable. Systems can fail, which means in-memory systems still have to maintain multiple copies of the data and perform a complete reload if the system fails. Adding all of these factors together can make the effort quite expensive despite the seemingly reasonable price of memory today (though at multiple terabytes, you will feel the pinch). In addition, to make maximum use of RAM, all database systems use compression of data to one degree or another. iMDBs typically employ aggressive compression algorithms to maximize the amount of data that can be put in working memory. The backup of an iMDB is usually lightly compressed or uncompressed so it can be read by other processes, among other reasons. Assuming a realistic 3.5x compression for an iMDB (not all RAM is available for the data), the backup drives will need to be 5x the size of RAM, and there may be multiple archives, and the backups themselves will likely be mirrored. With even average-sized analytical data warehouses today running about 50 terabytes (there are, of course, much larger ones), an iMDB to accommodate those
would need 75-100 TB of separate disk drives to handle backups, snapshots, logs, and staging areas.
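The sizing arithmetic above can be checked with a quick back-of-the-envelope calculation (the RAM overhead factor and the snapshot allowance are illustrative assumptions, not figures from any vendor):

```python
# Back-of-the-envelope backup sizing for an iMDB (illustrative figures).
warehouse_tb = 50        # uncompressed warehouse size, per the text
compression = 3.5        # assumed in-memory compression ratio
ram_overhead = 1.4       # assumed factor for OS, workspace, indexes in RAM

# RAM needed to hold the compressed data plus working overhead:
ram_tb = warehouse_tb / compression * ram_overhead    # ~20 TB of RAM

# Backups are lightly compressed or uncompressed, so the disk footprint
# starts at the full uncompressed size, then grows with snapshots,
# logs, and staging areas:
snapshot_factor = 1.5    # assumed allowance for snapshots, logs, staging
backup_tb = warehouse_tb * snapshot_factor            # ~75 TB of disk

print(round(ram_tb), round(backup_tb))
```

Under these assumptions a 50 TB warehouse already lands at the low end of the 75-100 TB disk estimate before mirroring the backups themselves.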
Another thing to consider is that a database still has to perform all of the database functions, from loading data to presenting it as the result of a query. Conventional relational database technology, including platforms that are designed specifically for data warehousing and analytical work (as opposed to transactional processing), must employ a host of services to be useful to an enterprise, including:
• Workload management for efficient use of resources
• Security
• Reliability
• High availability
• Use of performance statistics for query optimization.
They must also support, in addition to traditional row-based schema, columnar organization of the data, which is particularly effective for wide tables with many attributes but less effective with more normalized schema, and which has some serious drawbacks for updating the database in real time. But columnar orientation is not a feature limited to iMDBs; most analytical database systems incorporate columnar mode or even operate solely in it.
Why is in-memory, a fairly old concept, interesting again?
iMDBs have been used for quite some time, but they have always been limited by three factors: the cost of memory, the size of the database, and the persistence of data. Today, a dollar buys 500 to 1,000 times as much memory as it did in 1995, and the capacity per square inch of the chips has increased by a similar factor. Memory speeds have increased as well, though not as dramatically. If the amount of data that could
be stored in early in-‐memory systems was too small for most applications, 1000 times
more memory might be enough for in-‐memory to be feasible.
This extremely simplified diagram depicts the essential (but certainly not all) differences between an iMDB and a hybrid-DBMS. The iMDB maximizes the use of RAM but uses essentially the same hardware architecture of two CPUs with levels of on-board cache, with RAM holding the entire database, the database software, working space, caches, and embedded functionality. The only difference in the hybrid-DBMS is less reliance on RAM and the ability to address vastly greater amounts of data from the storage subsystem. The hybrid-DBMS has documented databases of greater than a petabyte.
iMDB typically scale out to 16 servers with up to 1 terabyte of RAM each, but with a significant amount of that RAM taken up by the operating system, working memory, and so on. Therefore, even with 5x compression, the maximum amount of uncompressed data per
system is no more than 40 TB. Given the expense of these large iMDB systems, scaling out to the sizes that are needed today is difficult.
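That ceiling follows from simple arithmetic (the 50% usable-RAM share is an assumption for illustration; the text says only that a "significant amount" of RAM goes to the operating system and working memory):

```python
# Capacity ceiling for a scaled-out iMDB cluster (illustrative).
servers = 16            # typical maximum scale-out, per the text
ram_per_server_tb = 1.0
usable_fraction = 0.5   # assumed share of RAM left for data after overhead
compression = 5.0       # optimistic in-memory compression ratio

uncompressed_capacity_tb = (
    servers * ram_per_server_tb * usable_fraction * compression
)
print(uncompressed_capacity_tb)  # 40.0 TB for the whole system
```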
Limitations of iMDB
In-memory databases are constrained by several key limitations:
• No matter how inexpensive RAM is today compared to its historical cost, it is still considerably more expensive than its alternatives, limiting its usefulness for enterprise-level systems.
• Data cannot persist in memory indefinitely. It is inevitable that something will fail, which requires mechanisms to protect the data that can erode the value proposition.
• With today's data volumes, it is still not practical to use an in-memory approach for a data warehouse.
• iMDB rely on the system being up 24/7.
Cost
Though RAM is 10,000 times faster to read than a mechanical disk drive, data volumes today are enormous and growing. A petabyte-sized in-memory database would cost more than $5 million, perhaps twice that. SSD of that capacity would cost 1/5 to 1/10 the price, and a hybrid-DBMS with a hot/warm/cold hierarchical storage architecture would cost far less than that.
Persistence
In-memory architecture still requires conventional storage. RAM is volatile, and if something fails, or even just hiccups, there can be data loss. Therefore, everything in memory has to have a copy on less volatile storage devices. Updating the memory requires log files, "snapshots," and "checkpoints," which can slow down processing.
Volume
In-memory cannot economically, or even practically, scale to the volumes of today's data warehouses. Ten years ago, a terabyte-sized data warehouse was remarkable, but today there are dozens, perhaps even more than a hundred, greater than a petabyte, one thousand times larger. Projections are that this growth rate is not diminishing.
Dual-Purpose OLTP and Analytics
Some iMDB products promise the ability to perform OLTP and analytical processing on the same platform, with the same data. This would be a real advantage, as it would alleviate the need to extract and transform data from operational systems and provide analytical support without additional infrastructure. Unfortunately, this is currently impossible.
iMDB platforms generally cannot support OLTP because they have to wait for a transaction to complete on disk to be ACID compliant. When data is updated in memory, it is held in log files, usually stored on SSD drives. iMDB platforms use this disk-based "persistent" layer to weather a node failure, which, in a narrow sense, suggests they have ACID properties. When the iMDB node comes back up (after the failed part is replaced or a cold standby node takes over), the data resident on the disk "persistent" layer is reloaded back into memory. This can be done in one of two ways: "lazy," where the data is reloaded as queries enter the system and request a specific table (which doesn't really make sense, since the iMDB appears in memory as a single dimensional table), or "full," where queries must wait until all the data is reloaded. In both cases, the log files stored on disk or flash have to be read and applied.
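The two recovery strategies can be sketched abstractly (a toy model, not any vendor's implementation; the table names and row counts are invented):

```python
# Toy model of "full" vs. "lazy" reload after an iMDB node restart.
# The persistent layer is modeled as a dict of table -> rows on disk.
persistent_layer = {
    "orders":    [{"id": 1}, {"id": 2}],
    "customers": [{"id": 10}],
}

class Node:
    def __init__(self, strategy):
        self.strategy = strategy
        self.memory = {}          # tables currently loaded in RAM
        if strategy == "full":
            # Full reload: queries wait until everything is back in RAM.
            for table, rows in persistent_layer.items():
                self.memory[table] = list(rows)

    def query(self, table):
        if table not in self.memory:
            if self.strategy != "lazy":
                raise RuntimeError("table not loaded")
            # Lazy reload: fault the table in on first access.
            self.memory[table] = list(persistent_layer[table])
        return self.memory[table]

full_node = Node("full")
lazy_node = Node("lazy")
print(len(full_node.memory))          # 2: everything preloaded
print(len(lazy_node.memory))          # 0: nothing loaded yet
print(len(lazy_node.query("orders"))) # 2: faulted in on demand
```

In either case, the real systems must also replay the logs against the reloaded data, which the toy model omits.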
There are features to handle different kinds of failure, though. Both the SSD area and the disk persistent layer have RAID capability to cover for a disk failure. So if a node has a problem but keeps power, then all may be okay; it is an error-dependent issue. If there is a problem with a memory chip, it is unlikely the data will survive, requiring a
total reload. If a node loses power, then a total reload of all the data that was on that node is required.
Not so "green"
At a time when most vendors are formulating a "green" message, it turns out that iMDBs require a lot of power, considerably more than spinning drives and significantly more than solid-state drives. RAM is volatile and must be powered 24/7 if the data is to persist.
The Hybrid DBMS
iMDB vendors often portray disk-based systems as dinosaurs that have outlived their usefulness, but in fact they are the result of 30 years of research and development by some of the most brilliant minds in the technology industry, and they have hardly been standing still. In the same way that relational database technology gradually gained new hardware capabilities and evolved to become the hybrid-DBMS, it seems likely that the major database vendors will continue to evolve to leverage the advantages of more memory over disk drives. The dramatic cost reductions of memory have benefits that accrue to hybrid-DBMSs too: solid-state drives are replacing traditional magnetic drives, with improvements in I/O speed. Teradata Virtual Storage, for example, automatically manages the movement of hot and cold data. Large memory models are common, too, even if the persistent data remains on attached storage instead of completely in memory.
Another consideration is that for most database applications there is a clear difference between hot and cold data; in other words, data that is in use at the moment as opposed to data that is used less frequently. This tilts the decision between disk-only and in-memory toward an in-between alternative: a hybrid scheme with large memory, SSD drives, and less expensive, slower HDD for warm or cold data. Hybrid-DBMS leverage the speed of SSD to reduce query response times by cutting the painful delays introduced by lengthy I/O queues in HDD storage. A query requires many I/O
operations to complete, so the time spent by I/O requests in storage queues has a direct impact. Not only does the speed and parallel-channel capability of SSD result in 40x faster I/O completions, but the queues in the HDD are shortened by aiming 80% of I/O at the SSD; this can result in up to a 60x improvement in average response times.
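One way to see why the improvement can exceed the raw 40x device speedup is a toy M/M/1 queueing model (the arrival rate and the 80/20 split are illustrative assumptions, not vendor measurements):

```python
# Toy M/M/1 queueing sketch of why routing hot I/O to SSD can improve
# average response time by more than the raw device speedup.

def mm1_response_time(service_rate, arrival_rate):
    """Mean response time W = 1 / (mu - lambda) for an M/M/1 queue."""
    assert arrival_rate < service_rate, "queue must be stable"
    return 1.0 / (service_rate - arrival_rate)

arrival = 0.95   # total I/O arrival rate (HDD service time = 1 unit)

# Baseline: every I/O queues for the HDD (service rate 1).
baseline = mm1_response_time(1.0, arrival)

# Hybrid: 80% of I/O goes to SSD (40x faster); the HDD queue shrinks.
ssd_part = mm1_response_time(40.0, 0.8 * arrival)
hdd_part = mm1_response_time(1.0, 0.2 * arrival)
hybrid = 0.8 * ssd_part + 0.2 * hdd_part

print(round(baseline / hybrid, 1))  # well above the raw 40x device speedup
```

Because queueing delay grows sharply as a device nears saturation, relieving the HDD of 80% of its load shortens its queue, and the blended improvement lands above the device speedup alone.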
A hybrid scheme requires not only a physical assemblage of devices but also an intelligent data manager that continually and transparently optimizes the architecture by moving data to its best location. The figure below represents Teradata's version of such a system.2
Notice that in this scheme, each node is balanced with a combination of CPUs and their characteristics, the amount of RAM, and the storage devices. This provides an optimum balance between processing, memory, and addressable storage, which leads to optimal performance. It does, however, somewhat limit configuration flexibility, as the drives and CPUs are fixed.
2 Teradata is working on extending this data management to the memory layer.
Compare and Contrast
Today there are two ways to store data electronically: on arrays of solid-state memory chips (on either a memory bus or an SSD) or on magnetic disk drives. Solid-state chips are obviously faster than magnetic drives (although in some cases the differential can be overcome with good platform design and workload management). Solid-state chips are considerably more expensive than magnetic drives, and volatile RAM chips are considerably more expensive (and faster) than non-volatile flash. We can't see the future with perfect clarity, but it is likely that for the foreseeable future this stratification of memory and storage will not change, even as the price/performance of each continues to improve. The faster RAM chips will remain volatile, making full in-memory databases impractical for most uses.
iMDB lack the balance of CPU and storage, which can lead to flooding of the CPUs. iMDB trade the potential for I/O latency for the very real possibility of RAM out-performing the processors: without an I/O bottleneck, processors can become saturated. This is something that software developers should be aware of, and design for, but given the relative recency of certain iMDBs, these features may not be well developed. It may be the case that client applications need to be rewritten not only to take advantage of the memory resources but to keep them from bogging down.
iMDB rely on large banks of very fast, expensive RAM, but also on other types of memory and storage for high availability and backup. Hybrid-DBMS rely on the same collection of memory and storage types, but in different proportions. A hybrid system uses solid-state memory judiciously and attempts to keep as much data pinned in memory as possible for active work, but relies on only one mechanism for persistent storage.
Conclusion
iMDB vendors claim that in-memory will replace the traditional hybrid-DBMS, but unless there are new laws of physics, holding persistent data for months or years simply isn't feasible without resorting to a hybrid in-memory and disk-based system. In a way, one can think of an iMDB as merely an accelerator for a conventional database, because it cannot meet the requirements of durability on its own.
On the other hand, hybrid-DBMS are based on proven data warehousing technologies, offer flexible architectures, and deliver high performance with automatic storage management.
It would be easy to predict that iMDBs, including DBMS with all-SSD drives, will eventually overtake disk-based systems. However, the cost of memory, whatever it becomes, will still be greater than that of disk drives, and though it is impossible to predict, the amount of data captured and analyzed will likely continue to grow at a rate faster than the price per GB of memory falls.
ABOUT THE AUTHOR
Neil Raden, based in Santa Fe, NM, is an industry analyst and active consultant, a widely published author and speaker, and the founder of Hired Brains, Inc., http://www.hiredbrains.com. Hired Brains provides consulting, systems integration, and implementation services in Data Warehousing, Business Intelligence, "big data," Decision Automation, and Advanced Analytics for clients worldwide. Hired Brains Research provides consulting, market research, product marketing, and advisory services to the software industry.
Neil was a contributing author of one of the first (1995) books on designing data warehouses, and he is more recently the co-author of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions (Prentice-Hall, 2007). He welcomes your comments at [email protected] or at his blog, Competing on Decisions.