Hyperstream Server Disk Space Management
Scope
Many users of Time Navigator with Hyperstream destinations have expressed the need for a better
explanation of how disk space is consumed, reported, and released in HSS, and how they can cope
when they have run out of space and need to free it up on an emergency basis.
This document explains the mechanisms used and provides a general methodology for frequently
seen situations. It complements the product documentation, but users are cautioned to consult the
documentation, as it will evolve with the product faster than this document; in case of
conflicting information, the product documentation should be considered accurate.
ATN 4.2 and 4.3SP2 HSS destinations
In the above Time Navigator versions, backup destinations are presented as tape libraries and
handled as such from the ATN perspective.
Hence you select the destination by selecting a media pool whose definition contains the tape
drives that can be used and the media that will be used for your backup/restore operation.
Looking at the representation of an HSS destination tape drive shows us:
General tab
Defines:
Hyperstream server (hostname, port, HSS username and password)
The size of the cartridges.
Note that the size of the cartridges is used at the start of a backup to decide whether to create a
new tapefile (tile) on the same tape, if the amount of data already written is less than the size, or to use a
new tape to perform the backup. However, an HSS backup is always a single stream, so
even if, as in the screenshot, the size is set to 10 MB you will be able to back up several GB
or TB on it, because ATN will NOT break the backup into pieces.
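The decision above can be sketched in a few lines (a hypothetical simplification with invented names, not ATN's actual code):

```python
def choose_destination(bytes_on_tape, cartridge_size):
    """Start-of-backup decision (hypothetical simplification of ATN's logic).

    The cartridge size is only consulted when the backup STARTS: if the tape
    already holds at least the configured size, a new tape is used; otherwise
    a new tile is appended to the same tape. The size of the backup itself is
    never checked, because an HSS backup is a single stream that ATN will not
    split, so a 10 MB "cartridge" can still receive a multi-TB tile.
    """
    if bytes_on_tape < cartridge_size:
        return "new tile on same tape"
    return "new tape"
```

For example, a tape holding 5 MB against a 10 MB cartridge size receives a new tile, even if the backup about to start is several TB.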
Compression Algorithm: this tells the agent to compress using the algorithm
selected. Note: we recommend you compress on the agent during the backup for the
following reasons. Compression is CPU intensive, so by spreading this operation
over the different agents you avoid overloading the HSS server CPU. When data is sent
across the network, compression reduces the volume, so over slow
networks you gain a real benefit in performance and network load by compressing at the
source. You should only use RAW on agents that are so limited in CPU power that
compression would impact the backup operation, hence only on very old CPUs.
Specific variables
Refer to the Hyperstream documentation for these variables, which define the behavior
when restoring with replication, debug levels, and overwriting compression.
Hyperstream Server tab
Shows you information on the HSS server and the volumes of data movement as shown
below.
Note: Do NOT make any changes in this applet unless the drives are empty (eject media first).
Representation of an HSS cartridge
For Time Navigator a cartridge or tape consists of a LABEL and TILES.
Each of these elements generates a STREAM in HSS.
The label is typically 1024 bytes; the tiles have the size of your backup.
The command line utility tina_dedup located in the bin directory under TINA_HOME can show
you the mappings.
tina_dedup -catalog catalog -listcart
In the above example you see that one tape, def000010, contains 4 tiles + 1 label.
UUID 0 is the label stream.
You can also run the full report in tina_dedup; this will give you all the information from the
catalog on streams and from HSS on its streams. The end of the report will show you any
discrepancies found between the two views.
A shortened view of the report illustrating the different sections is shown below.
tina_dedup -catalog catalog -report -verbose
The report starts with a listing of the streams found on the HSS server (this is the same as doing a
stream or ls command in hyperstream_admin on HSS)
Note the Stream Status: it can be OK, meaning the stream is usable by ATN; TO DEL, which
indicates the stream is marked for deletion, typically when you spare or recycle the tape; or
DELETED, meaning it is no longer available because the maintenance task "garbage
collector" has processed it. The DELETED streams are kept for statistics only; they do NOT take
space in the streams directory but are just entries in the database. In 3.1SP3 these entries are
archived to a separate table every 15 days, so they will no longer appear in this report.
The maintenance tasks will be described further in this document.
The next section compares the information from HSS with the ATN catalog
The list of UUIDs (streams) only in HSS indicates some space is wasted for this catalog, because
the streams exist in HSS but no use is made of them by ATN. This typically occurs when the user
restores an older catalog: any of the backups made since this catalog version will result in what
we commonly refer to as Orphan Streams. They can easily be removed using an option on the
tina_dedup command that we will discuss in the how-to-correct section.
The next section shows the information the catalog has on the remaining streams; you will notice
the JobID, the backup host or application and strategy, the duplication status, the restorable
volume, the number of objects, the stream UUID, the size of the tile, and the size stored on disk.
The last section is the summary of all issues found between the catalog view and the actual HSS
repository. In our sample you see the 11a Medium errors we discussed previously as Orphan
Streams.
Media life-cycle
If we look at the ATN view of the above, we see the following picture.
This, as we saw above, results in 40 streams.
In ATN, when a tape expires no immediate action is taken. This comes from the behavior on
magnetic tape media, where there is no valid reason to waste time destroying the content of the
tape before there is an actual need to use the media; the tape will be recycled during a future
backup if additional media is required. In HSS this behavior does not make any sense, so until
a future version of ATN changes it, you can manually SPARE the expired tapes to recover
the space periodically. Note you MUST USE SPARE and NOT Delete. This is important, as the
delete action only removes the tape from the catalog without notifying HSS and as such creates
Orphan Streams. The SPARE operation, on the other hand, will mark the streams with the status TO
DEL in HSS and remove the tape from the catalog, allowing HSS to clean up the streams.
To do this, select the expired tapes and hit Spare as shown below.
For details on Data Integrity refer to the Time Navigator documentation
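The SPARE versus Delete behavior described above can be sketched as a toy model (all structures and names here are invented for illustration; this is not product code):

```python
# Toy model of the catalog/HSS relationship (all names invented for illustration).
hss_streams = {"uuid-0": "OK", "uuid-1": "OK",
               "uuid-2": "OK", "uuid-3": "OK"}   # stream UUID -> status in HSS
catalog = {"def000010": ["uuid-0", "uuid-1"],
           "def000011": ["uuid-2", "uuid-3"]}    # tape label -> its streams

def spare(tape):
    """SPARE: HSS is notified (streams go TO DEL) AND the tape leaves the catalog."""
    for uuid in catalog.pop(tape):
        hss_streams[uuid] = "TO DEL"   # the garbage collector can now clean them

def delete(tape):
    """Delete: the tape leaves the catalog only; HSS is never told."""
    catalog.pop(tape)                  # streams stay 'OK' in HSS: orphan streams

def orphan_streams():
    """Streams still 'OK' in HSS but referenced by no catalog entry."""
    referenced = {u for uuids in catalog.values() for u in uuids}
    return [u for u, status in hss_streams.items()
            if status == "OK" and u not in referenced]
```

Sparing a tape leaves no orphans because HSS learns the streams are disposable; deleting a tape leaves its streams stranded in HSS, which is exactly what the tina_dedup report flags.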
Notice the cartridges will be loaded in the drives. In some cases, if you attempt to spare a tape
that no longer exists in HSS (after a restore of a catalog) you will get an error, but the operation of
removing it from the catalog will have succeeded. Unfortunately, you can only have 1 error per
drive before the operation fails and needs to be restarted.
Garbage collector
Now that ATN has marked a number of streams TO DEL the background maintenance tasks can
start doing their work.
The first one to be considered is the garbage collector, a background service within HSS that
has to process the stream. To understand this, a short detour into how HSS operates: the tina_dedup
program on the Tina agent chops the data into blocks, calculates a digest for each, and sends the digest
to the HSS server, which puts it in the streamfile and then looks up the digest to see if the data
chunk is already present. If it is, the usage or reference count is increased; if not, the server requests
the data chunk (block) from the agent and adds the block to its collection with a reference count
of 1. A stream is thus a succession of digests, and the garbage collector reads the streamfile
one digest after the other, decrementing the reference counter of each digest encountered; when
completed it deletes the streamfile and marks the stream as DELETED. As you can see, this
does NOT free up any space in the block repository yet; only the stream repository gains some
space once the streamfile is deleted.
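The write path and the garbage collector's decrement pass can be sketched as a simplified model (the dictionaries below stand in for HSS's database and repositories; real HSS is far more involved):

```python
import hashlib

blocks = {}       # digest -> [refcount, data]: toy block repository + database
streamfiles = {}  # stream UUID -> list of digests: toy stream repository

def backup_stream(uuid, chunks):
    """Write path: the agent chops data into chunks and sends digests; the
    server appends each digest to the streamfile and only pulls the chunk's
    data when the digest is new (refcount starts at 1), otherwise it just
    increments the reference count of the existing block."""
    digests = []
    for chunk in chunks:
        d = hashlib.sha1(chunk).hexdigest()
        if d in blocks:
            blocks[d][0] += 1          # chunk already known: bump refcount
        else:
            blocks[d] = [1, chunk]     # new chunk: fetch data from the agent
        digests.append(d)
    streamfiles[uuid] = digests

def garbage_collect(uuid):
    """GC pass: walk the streamfile digest by digest, decrement each refcount,
    then delete the streamfile (the stream becomes DELETED). Blocks whose
    refcount reaches 0 still occupy space; only the compactor reclaims them."""
    for d in streamfiles.pop(uuid):
        blocks[d][0] -= 1
```

Note that after a GC pass the block dictionary is unchanged in size, mirroring the point above: deleting streams frees stream-repository space only, not block-repository space.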
To see the work queue of the garbage collector you can use the hyperstream_admin command gc.
This will list the streams waiting to be processed by the garbage collector, as shown below.
To check if the gc is running you can use the service list command in hyperstream_admin as
shown.
You can also see the status of the services in the GUI
Valid status for the garbage collector is Ready (if nothing to do) or Running when processing
streams. Make sure it is NOT paused or stopped.
The garbage collector makes extensive use of the database, and as such it will have an impact on
backup performance. To avoid this, an environment variable is available that controls the garbage
collector, compactor, and compressor services and prevents them from starting while an HSS client
is connected. Remember, as seen before, tapes get recycled when media is needed, hence during
the backup window. The variable cpr.during_bck shown below is set to 0 by default; if your
system does NOT have idle time (backups, restores, or duplications running all the time) you
may want to change it to 1 to make sure the maintenance tasks can run for the time they require.
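The gating rule amounts to a one-line check (a hypothetical helper mirroring the documented behavior; the real check happens inside the HSS services):

```python
def maintenance_may_run(cpr_during_bck, client_connected):
    """Start rule for the gc/compactor/compressor services (hypothetical
    helper mirroring the documented behavior): with cpr.during_bck = 0, the
    default, maintenance only starts while no HSS client is connected; set
    to 1, it may also run during backups, trading backup performance for
    guaranteed maintenance time."""
    return cpr_during_bck == 1 or not client_connected
```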
Compactor
We discussed how the gc decrements the reference counters of the digests no longer used by the
streams; as a result, some of the unique data chunks or blocks will eventually end up
with a reference counter of 0, hence no longer used by anyone.
It is the compactor service that then frees up the space these blocks occupy and
recovers the space in the block repository.
To understand the way the compactor functions, a quick overview of how the data is organized in
HSS: the blocks, typically either 32k or 256k maximum in size, are stored in blockfiles of either
16 GB in HSS 2.x or 32 GB in HSS 3.x. At any given time only 1 blockfile is used for writing or
appending; the others are read-only files. (In 2.x this is one blockfile of each extension,
compressed or raw.) The file used for appending is called the current blockfile.
You can see the blockfiles either via the hyperstream_admin command file or in the GUI as
shown below.
The rightmost columns shown below are of particular interest as they show the amount of
recyclable space both absolute and in percentage.
You can order the display by clicking on a column header; in the picture above we ordered by
% Recyclable. Notice the highest % on the system is below 50. This is not a coincidence: the
compactor will take action based on the percentage of recyclable space, and by default the
variable is set to 50%.
It is easy to change as shown above, but we recommend you keep it at 50 under normal
circumstances; if you need space on an emergency basis you can lower it temporarily.
If a blockfile exceeds the threshold, the compactor will process it.
Compacting a file consists of copying all blocks with a reference count > 0 into the current
blockfile and updating the digest entries accordingly in the database.
When done, we end up with a blockfile that has no referenced blocks anymore, so the garbage
collector is notified to delete the file.
From this you will notice that, in order to free up space by compacting, the compactor needs space
to copy the blocks before the file can be deleted; this is the reason why HSS will stop all activity
if less than 16 or 32 GB is available in the block repository.
This value can also be adjusted as shown above.
The compactor is also controlled by the cpr.during_bck setting discussed earlier.
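Combining the threshold check and the copy step, the compactor can be sketched as follows (a toy model: each blockfile is just a list of (refcount, data) pairs, and all names are invented):

```python
def recyclable_pct(blockfile):
    """Percentage of a blockfile's bytes held by blocks with refcount 0."""
    dead = sum(len(data) for ref, data in blockfile if ref == 0)
    total = sum(len(data) for _, data in blockfile)
    return 100.0 * dead / total if total else 0.0

def compact(blockfiles, current, threshold=50.0):
    """Process every read-only blockfile above the recyclable threshold:
    live blocks (refcount > 0) are copied into the current blockfile, and
    the now-unreferenced file is left for the garbage collector to delete.
    Note the copy itself consumes space first, which is why HSS stops all
    activity when less than hss.min_space_left remains."""
    kept = []
    for bf in blockfiles:
        if recyclable_pct(bf) > threshold:
            current.extend(b for b in bf if b[0] > 0)   # copy live blocks
        else:
            kept.append(bf)                             # untouched
    return kept, current
```

A file that is two-thirds dead bytes is compacted; one with no dead bytes is left alone, which is why in steady state every file sits below the threshold.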
In version 3.x the compactor also performs compression during the compact operation; in 2.x a
separate service existed to compress the data blocks on the server side. Remember we
discussed that it is preferred to compress on the agents to spread the CPU load, but if you are in a
situation where this is not possible, the compactor will do it at compact time.
As can be seen in the screenshot below, the 50% threshold gives an average of 16% wasted space.
Lowering the threshold will not let you gain much, and since compacting is very labor
intensive, both in terms of CPU and database (disk) access, too frequent compacting will impact
backup/restore and duplication performance significantly.
Summarised
When a backup retention period expires:
- ATN should recycle the HSS tape
  - Data Integrity might cause this not to happen
  - Depending on your schedules, recycling might be delayed until big backups occur
  - Cartridges closed on error by ATN are not recycled
  - Best Practice: periodically inspect media and SPARE manually
- HSS background tasks must free the space
  - Garbage collector should delete the streams
  - Compactor must compact the files
  - Variable cpr.during_bck might not allow enough time
  - Services might be stopped or paused
  - Threshold might require adjustment
  - Best Practice: periodically verify the gc queue and the files' wasted space %
When doing catalog restores:
- HSS and ATN might get out of sync, creating orphan streams
- HSS and ATN might get out of sync, showing streams TO DEL or DELETED in the catalog
- Best Practice: run tina_dedup and correct catalog and/or HSS
Things that disrupt gc and compactor:
- License expired
- Not enough space available
- Replication issues (excessive backlog on resync)
- System in HOT BACKUP mode during the app-list hss backup by tina or snapshots
Things that disrupt ATN:
- Tape label streams missing or in the wrong state
Commonly seen situations
Backups fail with HSS drive errors and the HSS log shows entries like these:
HSS1|WARN|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_file.cpp:295 @
DEDUP_FILE_T::open| Filesystem is near to be full (14549 MB available). Path is
"K:\Block03\dedup_00007.raw"|
HSS1|WARN|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_file.cpp:297 @
DEDUP_FILE_T::open| Try to create a new blocks file on another repository ...|
HSS1|ERR|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_server.cpp:2498 @
Dedup::add_new_block_file| Not enought disk space to create a new block file in repository
(space left is 43940 MB)|
HSS1|ERR|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_file.cpp:884 @
DEDUP_FILE_T::checkpoint_no_lock| ASM_ERR_FS_FULL => [39] - Filesystem is full|
HSS1|ERR|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_file.cpp:299 @
DEDUP_FILE_T::open| ASM_ERR_FS_FULL => [39] - Filesystem is full|
HSS1|ERR|Tue Jul 30 15:02:46 2013|1375189366|1480|dedup_cnx.cpp:2019 @ Cnx::Run|
ASM_ERR_FS_FULL => [39] - Filesystem is full|
In this case you have already run out of space.
If the default thresholds were in place, you can simply do the following:
Stop ALL Tina activity by putting the tape drives on HSS in maintenance mode.
If you are in a replication or mirror configuration, pause the replication service on the secondary
system.
On the primary system:
Make sure you are NOT in HOT backup mode (stop the snapshot service and disable the app-list
HSS backup in Tina). Then in hyperstream_admin perform the following commands:
db_backup_end
end
exit
Reduce the threshold hss.min_space_left to 1 GB, or if possible add an additional directory for
storage on the repository path, separating the path from the previous one with ;
Set cpr.during_bck to 1.
Set compact.threshold to 25%.
Monitor the garbage collector using gc in hyperstream_admin and check that the number of streams
is decreasing over time (note: if you do NOT have an SSD disk hosting the database this might
take a significant amount of time).
Check that the garbage collector and compactor are Ready or Running.
Verify the % recyclable as shown earlier; files with more than 25% should show up with an orange
background, hence are eligible for compacting.
Wait while the gc and compactor do their work until you have recovered at least 60 GB of free space,
then re-enable Tina access, restore replication, and set hss.min_space_left back to 32 GB.
On some systems, typically Linux systems, you might need to stop the HSS service and restart it
for the OS to become aware of the space freed up. Only recycle the service once the compactor
and garbage collector have become idle again (READY status).
Once you have recovered space and restored replication services:
Run tina_dedup -report -verbose -repair -catalog catalogname to eliminate orphan streams.
In Tina, re-enable the tape drives but do not start backups yet.
Check expired media and SPARE the tapes.
Unless you have identified an exceptional situation, running out of space indicates your backups
exceed the capacity, so you will have to provision more space to avoid the problem repeating
later.
The hyperstream_server and ADE_server command sets include commands that allow you to move
blockfiles to a different location; you can use these commands to move any files from temporarily
provided storage back to the original storage. DO NOT use operating system commands to move
any files in the HSS repositories. ADE_server and hyperstream_server commands require the
HSS / ADE service to be stopped prior to running the commands.
If the added repository has been emptied, you can remove it from the GUI.
In extreme cases the compactor and/or garbage collector will not run or will not manage to clean the
streams/blocks; in this case you will have to resort to using the repair option available in
ADE_server and hyperstream_server.
The repair operation is a non-interruptible process that can take a significant amount of time to
complete (several days). It will back up your existing database to a location specified on the
command line and then rebuild the database from scratch, eliminating any defective streams or
blockfiles. You will need HSS 2.2SP9 or 3.x to be able to run the repair.
Note that the repair itself will not free up any space, but it will correct whatever situation
prevented the GC or compactor from doing their job; after the repair you still need to let them
perform the cleanup before resuming normal operations.
Refer to the documentation on how to use the -repair command.
Monitoring the system
The GUI conveniently gives you graphs to allow you to monitor space usage
Depending on your retention periods you should see a recurring pattern on the restorable volume;
if it keeps on growing, you might want to revisit the backup setup or provision space accordingly.
The orange (stored) should indicate what needs to be provisioned.
If the green (recyclable) grows, then you potentially have an issue with the maintenance tasks.
The streams information in the GUI also gives you the deduplication ratio of each stream, hence of
each backup. If you have different types of media (duplication to tapes as an example), you might want
to consider sending backups that do not yield decent deduplication ratios directly to tape to save
space and be more efficient.
The Properties tab also provides you with a forecast indicating your available space usage, the
last week's growth, and an estimate of when you might run out of space.
And as with many things, preventing is better than repairing.