Hyperstream Server Disk Space Management
Scope
Many users of Time Navigator with Hyperstream destinations have expressed the need for a better
explanation of how disk space is consumed, reported, and released in HSS, and how they can cope
when they have run out of space and need to free it up on an emergency basis.
This document explains the mechanisms used and provides a general methodology for frequently
seen situations. It complements the product documentation, but users are cautioned to consult the
documentation, as it will evolve with the product faster than this document; in case of
conflicting information, the product documentation should be considered accurate.
ATN 4.2 and 4.3SP2 HSS destinations
In the above Time Navigator versions, backup destinations are presented as tape libraries and
handled as such from the ATN perspective.
Hence you select the destination by selecting a media pool whose definition contains the tape
drives that can be used and the media that will be used for your backup/restore operation.
Looking at the representation of an HSS destination tape drive shows us:
General tab
Defines:
Hyperstream server (hostname, port, HSS username and password)
The size of the cartridges.
Note that the size of the cartridges is used at the start of a backup to decide whether to create a
new tapefile (tile) on the same tape, if the amount of data already written is less than the size, or to use a
new tape to perform the backup. However, an HSS backup is always a single stream, so
even if, as in the screenshot, the size is set to 10 MB you will be able to back up several GB
or TB on it, because ATN will NOT break the backup into pieces.
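The decision above can be sketched in a few lines (a hypothetical simplification with invented names, not ATN's actual code):

```python
def choose_destination(bytes_on_tape, cartridge_size):
    """Start-of-backup decision (hypothetical simplification of ATN's logic).

    The cartridge size is only consulted when the backup STARTS: if the tape
    already holds at least the configured size, a new tape is used; otherwise
    a new tile is appended to the same tape. The size of the backup itself is
    never checked, because an HSS backup is a single stream that ATN will not
    split, so a 10 MB "cartridge" can still receive a multi-TB tile.
    """
    if bytes_on_tape < cartridge_size:
        return "new tile on same tape"
    return "new tape"
```

For example, a tape holding 5 MB against a 10 MB cartridge size receives a new tile, even if the backup about to start is several TB.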
Compression Algorithm: this tells the agent to compress using the algorithm
selected. Note: we recommend you compress on the agent during the backup for the
following reasons. Compression is CPU intensive, so by spreading this operation
over the different agents you avoid overloading the HSS server CPU. When data is sent
across the network, compression reduces the volume, so over slow
networks you gain a real benefit in performance and network load by compressing at the
source. You should only use RAW on agents that are so limited in CPU power that
compression would impact the backup operation, hence only on very old CPUs.
Specific variables
Refer to the Hyperstream documentation for these variables, which define the behavior
when restoring with replication, debug levels, and overwriting compression.
Hyperstream Server tab
Shows you information on the HSS server and the volumes of data movement as shown
below.
Note: Do NOT make any changes in this applet unless the drives are empty (eject media first).
Representation of an HSS cartridge
For Time Navigator a cartridge or tape consists of a LABEL and TILES.
Each of these elements generates a STREAM in HSS.
The label is typically 1024 bytes; the tiles have the size of your backup.
The command line utility tina_dedup located in the bin directory under TINA_HOME can show
you the mappings.
tina_dedup -catalog catalog -listcart
In the above example you see that one tape, def000010, contains 4 tiles + 1 label.
UUID 0 is the label stream.
You can also run the full report in tina_dedup; this will give you all the information from the
catalog on streams and from HSS on its streams. The end of the report will show you any
discrepancies found between the two views.
A shortened view of the report illustrating the different sections is shown below.
tina_dedup -catalog catalog -report -verbose
The report starts with a listing of the streams found on the HSS server (this is the same as doing a
stream or ls command in hyperstream_admin on HSS)
Note the Stream Status: it can be OK, meaning the stream is usable by ATN; TO DEL, which
indicates the stream is marked for deletion, typically when you spare or recycle the tape; or
DELETED, meaning it is no longer available because the maintenance task "garbage
collector" has processed it. The DELETED streams are kept for statistics only; they do NOT take
space in the streams directory but are just entries in the database. In 3.1SP3 these entries are
archived to a separate table every 15 days, so they will no longer appear in this report.
The maintenance tasks will be described further in this document.
The next section compares the information from HSS with the ATN catalog
The list of UUIDs (streams) only in HSS indicates some space is wasted for this catalog, because
the streams exist in HSS but no use is made of them by ATN. This typically occurs when the user
restores an older catalog: any of the backups made since this catalog version will result in what
we commonly refer to as Orphan Streams. They can easily be removed using an option on the
tina_dedup command that we will discuss in the how-to-correct section.
The next section shows the information the catalog has on the remaining streams; you will notice
the JobID, the backup host or application and strategy, the duplication status, the restorable
volume, the number of objects, the stream UUID, the size of the tile, and the size stored on disk.
The last section is the summary of all issues found between the catalog view and the actual HSS
repository. In our sample you see the 11a Medium errors we discussed previously as Orphan
Streams.
Media life-cycle
If we look at the ATN view of the above, we see the following picture.
This, as we saw above, results in 40 streams.
In ATN, when a tape expires no immediate action is taken. This comes from the behavior on
magnetic tape media, where there is no valid reason to waste time destroying the content of the
tape before there is an actual need to use the media; the tape will be recycled during a future
backup if additional media is required. In HSS this behavior does not make any sense, so until
a future version of ATN changes it, you can manually SPARE the expired tapes to recover
the space periodically. Note you MUST USE SPARE and NOT Delete. This is important, as the
delete action only removes the tape from the catalog without notifying HSS and as such creates
Orphan Streams. The SPARE operation, on the other hand, will mark the streams with the status TO
DEL in HSS and remove the tape from the catalog, allowing HSS to clean up the streams.
To do this, select the expired tapes and hit Spare as shown below.
For details on Data Integrity refer to the Time Navigator documentation
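The SPARE versus Delete behavior described above can be sketched as a toy model (all structures and names here are invented for illustration; this is not product code):

```python
# Toy model of the catalog/HSS relationship (all names invented for illustration).
hss_streams = {"uuid-0": "OK", "uuid-1": "OK",
               "uuid-2": "OK", "uuid-3": "OK"}   # stream UUID -> status in HSS
catalog = {"def000010": ["uuid-0", "uuid-1"],
           "def000011": ["uuid-2", "uuid-3"]}    # tape label -> its streams

def spare(tape):
    """SPARE: HSS is notified (streams go TO DEL) AND the tape leaves the catalog."""
    for uuid in catalog.pop(tape):
        hss_streams[uuid] = "TO DEL"   # the garbage collector can now clean them

def delete(tape):
    """Delete: the tape leaves the catalog only; HSS is never told."""
    catalog.pop(tape)                  # streams stay 'OK' in HSS: orphan streams

def orphan_streams():
    """Streams still 'OK' in HSS but referenced by no catalog entry."""
    referenced = {u for uuids in catalog.values() for u in uuids}
    return [u for u, status in hss_streams.items()
            if status == "OK" and u not in referenced]
```

Sparing a tape leaves no orphans because HSS learns the streams are disposable; deleting a tape leaves its streams stranded in HSS, which is exactly what the tina_dedup report flags.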
Notice the cartridges will be loaded in the drives. In some cases, if you attempt to spare a tape
that no longer exists in HSS (after a restore of a catalog) you will get an error, but the operation of
removing it from the catalog will have succeeded. Unfortunately, you can only have 1 error per
drive before the operation fails and needs to be restarted.
Garbage collector
Now that ATN has marked a number of streams TO DEL the background maintenance tasks can
start doing their work.
The first one to be considered is the garbage collector, a background service within HSS that
has to process the stream. To understand this, a short detour into how HSS operates: the tina_dedup
program on the Tina agent chops the data into blocks, calculates a digest for each, and sends the digest
to the HSS server, which puts it in the streamfile and then looks up the digest to see if the data
chunk is already present. If it is, the usage or reference count is increased; if not, the server requests
the data chunk (block) from the agent and adds the block to its collection with a reference count
of 1. A stream is thus a succession of digests, and the garbage collector reads the streamfile
one digest after the other, decrementing the reference counter of each digest encountered; when
completed it deletes the streamfile and marks the stream as DELETED. As you can see, this
does NOT free up any space in the block repository yet; only the stream repository gains some
space once the streamfile is deleted.
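The write path and the garbage collector's decrement pass can be sketched as a simplified model (the dictionaries below stand in for HSS's database and repositories; real HSS is far more involved):

```python
import hashlib

blocks = {}       # digest -> [refcount, data]: toy block repository + database
streamfiles = {}  # stream UUID -> list of digests: toy stream repository

def backup_stream(uuid, chunks):
    """Write path: the agent chops data into chunks and sends digests; the
    server appends each digest to the streamfile and only pulls the chunk's
    data when the digest is new (refcount starts at 1), otherwise it just
    increments the reference count of the existing block."""
    digests = []
    for chunk in chunks:
        d = hashlib.sha1(chunk).hexdigest()
        if d in blocks:
            blocks[d][0] += 1          # chunk already known: bump refcount
        else:
            blocks[d] = [1, chunk]     # new chunk: fetch data from the agent
        digests.append(d)
    streamfiles[uuid] = digests

def garbage_collect(uuid):
    """GC pass: walk the streamfile digest by digest, decrement each refcount,
    then delete the streamfile (the stream becomes DELETED). Blocks whose
    refcount reaches 0 still occupy space; only the compactor reclaims them."""
    for d in streamfiles.pop(uuid):
        blocks[d][0] -= 1
```

Note that after a GC pass the block dictionary is unchanged in size, mirroring the point above: deleting streams frees stream-repository space only, not block-repository space.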
To see the work queue of the garbage collector you can use the hyperstream_admin command gc.
This will list the streams waiting to be processed by the garbage collector, as shown below.
To check if the gc is running you can use the service list command in hyperstream_admin as
shown.
You can also see the status of the services in the GUI
Valid status for the garbage collector is Ready (if nothing to do) or Running when processing
streams. Make sure it is NOT paused or stopped.
The garbage collector makes extensive use of the database, and as such it will have an impact on
backup performance. To avoid this, an environment variable is available that controls the garbage
collector, compactor, and compressor services and prevents them from starting while an HSS client
is connected. Remember, as seen before, tapes get recycled when media is needed, hence during
the backup window. The variable cpr.during_bck shown below is set to 0 by default; if your
system does NOT have idle time (backups, restores, or duplications running all the time) you
may want to change it to 1 to make sure the maintenance tasks can run for the time they require.
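The gating rule amounts to a one-line check (a hypothetical helper mirroring the documented behavior; the real check happens inside the HSS services):

```python
def maintenance_may_run(cpr_during_bck, client_connected):
    """Start rule for the gc/compactor/compressor services (hypothetical
    helper mirroring the documented behavior): with cpr.during_bck = 0, the
    default, maintenance only starts while no HSS client is connected; set
    to 1, it may also run during backups, trading backup performance for
    guaranteed maintenance time."""
    return cpr_during_bck == 1 or not client_connected
```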
Compactor
We discussed how the gc decrements the reference counters of the digests no longer used by the
streams; as a result, some of the unique data chunks or blocks will eventually end up
with a reference counter of 0, hence no longer used by anyone.
It is the compactor service that then frees up the space these blocks occupy and
recovers the space in the block repository.
To understand the way the compactor functions, a quick overview of how the data is organized in
HSS: the blocks, typically either 32k or 256k maximum in size, are stored in blockfiles of either
16 GB in HSS 2.x or 32 GB in HSS 3.x. At any given time only 1 blockfile is used for writing or
appending; the others are read-only files. (In 2.x this is one blockfile of each extension,
compressed or raw.) The file used for appending is called the current blockfile.
You can see the blockfiles either via the hyperstream_admin command file or in the GUI as
shown below.
The rightmost columns shown below are of particular interest as they show the amount of
recyclable space both absolute and in percentage.
You can order the display by clicking on a column header; in the picture above we ordered by
% Recyclable. Notice the highest % on the system is below 50. This is not a coincidence: the
compactor will take action based on the percentage of recyclable space, and by default the
variable is set to 50%.
It is easy to change as shown above, but we recommend you keep it at 50 under normal
circumstances; if you need space on an emergency basis you can lower it temporarily.
If a blockfile exceeds the threshold, the compactor will process it.
Compacting a file consists of copying all blocks with a reference count > 0 into the current
blockfile and updating the digest entries accordingly in the database.
When done, we end up with a blockfile that has no referenced blocks anymore, so the garbage
collector is notified to delete the file.
From this you will notice that, in order to free up space by compacting, the compactor needs space
to copy the blocks before the file can be deleted; this is the reason why HSS will stop all activity
if less than 16 or 32 GB is available in the block repository.
This value can also be adjusted as shown above.
The compactor is also controlled by the cpr.during_bck setting discussed earlier.
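Combining the threshold check and the copy step, the compactor can be sketched as follows (a toy model: each blockfile is just a list of (refcount, data) pairs, and all names are invented):

```python
def recyclable_pct(blockfile):
    """Percentage of a blockfile's bytes held by blocks with refcount 0."""
    dead = sum(len(data) for ref, data in blockfile if ref == 0)
    total = sum(len(data) for _, data in blockfile)
    return 100.0 * dead / total if total else 0.0

def compact(blockfiles, current, threshold=50.0):
    """Process every read-only blockfile above the recyclable threshold:
    live blocks (refcount > 0) are copied into the current blockfile, and
    the now-unreferenced file is left for the garbage collector to delete.
    Note the copy itself consumes space first, which is why HSS stops all
    activity when less than hss.min_space_left remains."""
    kept = []
    for bf in blockfiles:
        if recyclable_pct(bf) > threshold:
            current.extend(b for b in bf if b[0] > 0)   # copy live blocks
        else:
            kept.append(bf)                             # untouched
    return kept, current
```

A file that is two-thirds dead bytes is compacted; one with no dead bytes is left alone, which is why in steady state every file sits below the threshold.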
In version 3.x the compactor also performs compression during the compact operation; in 2.x a
separate service existed to compress the data blocks on the server side. Remember we
discussed that it is preferred to compress on the agents to spread the CPU load, but if you are in a
situation where this is not possible, the compactor will do it at compact time.
As can be seen in the screenshot below, the 50% threshold gives an average of 16% wasted space.
Lowering the threshold will not let you gain much, and since compacting is very labor
intensive, both in terms of CPU and database (disk) access, too frequent compacting will impact
backup/restore and duplication performance significantly.
Summarised
When a backup retention period expires:
- ATN should recycle the HSS tape
  - Data Integrity might cause this not to happen
  - Depending on your schedules, recycling might be delayed until big backups occur
  - Cartridges closed on error by ATN are not recycled
  - Best Practice: periodically inspect media and SPARE manually
- HSS background tasks must free the space
  - Garbage collector should delete the streams
  - Compactor must compact the files
  - Variable cpr.during_bck might not allow enough time
  - Services might be stopped or paused
  - Threshold might require adjustment
  - Best Practice: periodically verify the gc queue and the files' wasted space %
When doing catalog restores:
- HSS and ATN might get out of sync, creating orphan streams
- HSS and ATN might get out of sync, showing streams TO DEL or DELETED in the catalog
- Best Practice: run tina_dedup and correct catalog and/or HSS
Things that disrupt gc and compactor:
- License expired
- Not enough space available
- Replication issues (excessive backlog on resync)
- System in HOT BACKUP mode during the app-list hss backup by tina or snapshots
Things that disrupt ATN:
- Tape label streams missing or in the wrong state
Commonly seen situations
Backups fail with HSS drive errors and the HSS log shows entries like these:
HSS1|WARN|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_file.cpp:295 @
DEDUP_FILE_T::open| Filesystem is near to be full (14549 MB available). Path is
"K:\Block03\dedup_00007.raw"|
HSS1|WARN|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_file.cpp:297 @
DEDUP_FILE_T::open| Try to create a new blocks file on another repository ...|
HSS1|ERR|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_server.cpp:2498 @
Dedup::add_new_block_file| Not enought disk space to create a new block file in repository
(space left is 43940 MB)|
HSS1|ERR|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_file.cpp:884 @
DEDUP_FILE_T::checkpoint_no_lock| ASM_ERR_FS_FULL => [39] - Filesystem is full|
HSS1|ERR|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_file.cpp:299 @
DEDUP_FILE_T::open| ASM_ERR_FS_FULL => [39] - Filesystem is full|
HSS1|ERR|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_cnx.cpp:2019 @ Cnx::Run|
ASM_ERR_FS_FULL => [39] - Filesystem is full|
In this case you have already run out of space.
If the default thresholds were in place, you can simply do the following:
Stop ALL Tina activity by putting the tape drives on HSS in maintenance mode.
If you are in a replication or mirror configuration, pause the replication service on the secondary
system.
On the primary system:
Make sure you are NOT in HOT backup mode (stop the snapshot service and disable the app-list
HSS backup in Tina). Then in hyperstream_admin perform the following commands:
db_backup_end
end
exit
Reduce the threshold hss.min_space_left to 1 GB, or if possible add an additional directory for
storage on the repository path, separating the path from the previous one with ;
Set cpr.during_bck to 1.
Set compact.threshold to 25%.
Monitor the garbage collector using gc in hyperstream_admin and check that the number of streams
is decreasing over time (note: if you do NOT have an SSD disk hosting the database this might
take a significant amount of time).
Check that the garbage collector and compactor are Ready or Running.
Verify the % recyclable as shown earlier; files with more than 25% should show up with an orange
background, hence are eligible for compacting.
Wait while the gc and compactor do their work until you have recovered at least 60 GB of free space,
then re-enable Tina access, restore replication, and set hss.min_space_left back to 32 GB.
On some systems, typically Linux systems, you might need to stop the HSS service and restart it
for the OS to become aware of the space freed up. Only recycle the service once the compactor
and garbage collector have become idle again (READY status).
Once you have recovered space and restored replication services:
Run tina_dedup -report -verbose -repair -catalog catalogname to eliminate orphan streams.
In Tina, re-enable the tape drives but do not start backups yet.
Check expired media and SPARE the tapes.
Unless you have identified an exceptional situation, running out of space indicates your backups
exceed the capacity, so you will have to provision more space to avoid the problem repeating
later.
The hyperstream_server and ADE_server command sets include commands that allow you to move
blockfiles to a different location; you can use these commands to move any files from temporarily
provided storage back to the original storage. DO NOT use operating system commands to move
any files in the HSS repositories. ADE_server and hyperstream_server commands require the
HSS / ADE service to be stopped prior to running the commands.
If the added repository has been emptied, you can remove it from the GUI.
In extreme cases the compactor and/or garbage collector will not run or will not manage to clean the
streams/blocks; in this case you will have to resort to using the repair option available in
ADE_server and hyperstream_server.
The repair operation is a non-interruptible process that can take a significant amount of time to
complete (several days). It will back up your existing database to a location specified on the
command line and then rebuild the database from scratch, eliminating any defective streams or
blockfiles. You will need HSS 2.2SP9 or 3.x to be able to run the repair.
Note that the repair itself will not free up any space, but it will correct whatever situation
prevented the GC or compactor from doing their job; after the repair you still need to let them
perform the cleanup before resuming normal operations.
Refer to the documentation on how to use the -repair command.
Monitoring the system
The GUI conveniently gives you graphs to allow you to monitor space usage
Depending on your retention periods you should see a recurring pattern on the restorable volume;
if it keeps on growing, you might want to revisit the backup setup or provision space accordingly.
The orange (stored) should indicate what needs to be provisioned.
If the green (recyclable) grows, then you potentially have an issue with the maintenance tasks.
The streams information in the GUI also gives you the deduplication ratio of each stream, hence of
each backup. If you have different types of media (duplication to tapes as an example), you might want
to consider sending backups that do not yield decent deduplication ratios directly to tape to save
space and be more efficient.
The Properties tab also provides you with a forecast indicating your available space usage, the
last week's growth, and an estimate of when you might run out of space.
And as with many things, preventing is better than repairing.