The rest of this chapter is arranged loosely in the form of a decision tree. The material covers a wide range of scenarios and possible actions.

As you begin the decision-making process, follow these general guidelines from this point:

IF THE SYMPTOM IS A FAILURE TO PROCESS, refer to section “H1–Process Hangs”.

IF THE SYMPTOM IS A MUPIP INTEG ERROR REPORT, refer to section “I1–MUPIP INTEG Errors”. If you are investigating a particular error message, refer to the "MUPIP INTEG errors" table in that section.

IF THE SYMPTOM IS A GT.M RUN-TIME ERROR REPORT, refer to section R1. Remember the GT.M Message and Recovery Procedures Manual may provide insights.

To facilitate use of the material as a troubleshooting guide, the text in these sections refers to other sections with alphanumeric designators. Each alphanumeric section describes suggested actions to employ in handling a particular situation.

The term "hang" refers to a failure to process. Processes may hang for a variety of reasons that have nothing to do with GT.M. However, hanging GT.M processes may indicate that a database has become inaccessible. When you suspect a hang, first determine the extent of the problem.

Your tools include knowledge of the application and how it is used, communication with users, and operating system utilities such as ps.

WHEN MANY PROCESSES ON A SYSTEM ARE HANGING, determine if the hangs are confined to a particular application. If all applications are affected or if processes not using GT.M databases are affected, the problem is not a database-specific problem, but something more general, such as a UNIX problem. Refer to section H6.

WHEN ONLY ONE PROCESS IS HANGING, find out whether that process is the only one using a particular GT.M application. If it is the only process, start some appropriate second process and determine whether the second process is also affected.

IF A PROCESS HANGS WHILE OTHER PROCESSES ACCESSING THE SAME DATABASE CONTINUE TO PROCESS, the problem is not a database problem. Refer to section H3 and then to section H8.

WHEN ONLY GT.M PROCESSES RUNNING A PARTICULAR APPLICATION HANG, the problem may be a database problem. Refer to section H2.

Is the system "hung?" If so, consider the following additional questions:

The following is another way of testing the cache: If CRIT is cleared and DSE BUFFER hangs, management of the global buffer pool is not working properly. Use MUPIP STOP and/or CRIT -INIT -RESET to get everyone out of the segment, then use DSE WCINIT. After a WCINIT, make sure that you can successfully exit from DSE. Use MUPIP INTEG (-FAST) to check for damage which can be induced by WCINIT.

Use the following diagnostic steps and references to determine an appropriate course of action for database access problems.

IF THE DATA CAN BE BOTH READ AND WRITTEN, the problem is not a database problem. Refer to section H8.

IF DATA CANNOT BE READ OR WRITTEN, some process is unable to release full ownership of the database critical section. Determine the process identification number (PID) of the process using the DSE command CRITICAL. If the process exists, refer to section H4. If the process is non-existent, use DSE CRITICAL -REMOVE to emulate a release and re-examine the entire situation.

Example:

 Set pipe="pipe"
 Open pipe:(command="/bin/csh")::pipe
 Use pipe
 Set reg="",cmd=$ztrnlnm("gtm_dist")_"/mupip dumpfhead "
 For  Set reg=$View("GVNEXT",reg) Quit:""=reg  Do
 . Set reg(reg)="",file=$view("GVFILE",reg)
 . Write cmd,file,!
 . For i=1:1 read x(i):1 Quit:(x(i)["sgmnt_data.freeze")!$ZEOF!'$Test
 . Set pid=+$Piece(x(i),"=",2)
 . Set:pid frozen(reg)=pid
 Close pipe
 Set g="^%",$etrap="Write $ZStatus Set $ecode="""" Quit"
 Write !,"Attempting read access"
 If $Data(^%) Set reg=$View("REGION",g) Do read1
 For  Set g=$Order(@g) Quit:""=g  Set reg=$View("REGION",g)  Do:""=(reg(reg)) read1
 Set reg=""
 Write !!,"Attempting write access"
 For  Set reg=$Order(reg(reg)) Quit:""=reg  Do write1
 Write !
 Quit
read1
 Write !,"Read in region: ",reg," of ",g," successful"
 If ($Data(@g)#2) Set reg(reg)=g
 Else  Set reg(reg)=$Query(@g)
 Quit
write1
 If $Data(frozen(reg)) Write !,"Region ",reg," Frozen by PID ",frozen(reg) Quit
 If ""=reg(reg) Write !,"Region ",reg," has no data" Quit
 Write !,"Write to region: ",reg
 Set x=$Get(@reg(reg),"Yndef")
 Set @reg(reg)=1,@reg(reg)=x
 If "Yndef"=x ZKill @ref(ref); assumption that a value of Yndef is very unlikely
 Write " of ",reg(reg)," successful"
 Quit

This routine provides a generalized approach to automating some of the tasks described in this section. The routine issues a report if a region is frozen or completely empty of data. It may hang reading or writing a database. However, unless the region(s) holding ^% and the next global after ^% have a problem, it displays the name of the region that it is about to try. If this routine runs to completion, the databases in the current Global Directory are completely accessible. The routine can be extended by having it cycle through a set of Global Directories should that be appropriate.

Note:

If you have a Global Directory mapping globals to multiple files, you may create an alternative Global Directory using different mappings to those same files. Such mapping prevents the test program(s) from touching the "real" data.

Example:

  Mapping      Production region   Test region
-----------------------------------------------
A   to   M     $DEFAULT            SCRATCH
N   to   Z     SCRATCH             $DEFAULT

To increase the access speed, GT.M buffers data exchanged between processes and database files in the shared memory cache. If information in the memory cache is damaged, it can block the transfer of data to the disk.

IF A PROCESS HAS BEEN DETERMINED (FROM SECTION H3) TO NEVER RELEASE FULL OWNERSHIP OF THE DATABASE CRITICAL SECTION, there may be a problem with the database cache. To determine where the problem is occurring, terminate the process. If this clears the hang, the problem was not in the database but in the process, which was somehow damaged. Refer to section P1. Otherwise, another process showing the same symptoms takes the place of the terminated process. In this case, the cache is damaged.

IF THE CACHE IS DAMAGED, it must be reinitialized. It is crucial to stop all other database activity during cache initialization. Refer to section Q1 before continuing with this section.

To minimize database damage due to cache reinitialization, and to confirm that the problem is due to a damaged cache, use the DSE command CRITICAL SEIZE followed by BUFFER_FLUSH. The DSE command BUFFER_FLUSH attempts to flush the database cache, which is a benign operation. Wait at least one minute for this operation to complete.

IF THE BUFFER_FLUSH DOES NOT HANG, the cache is not damaged, and you should review all previous steps starting with section H1.

IF THE BUFFER_FLUSH DOES HANG, use the DSE command WCINIT to reinitialize the cache. This command requires confirmation. Never use WCINIT on a properly operating database. After a WCINIT always perform at least a MUPIP INTEG FAST to detect any induced damage that has a danger of spreading. If the WCINIT command hangs, clear the critical section as described in section H5 and reissue the WCINIT.
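
A possible shape for this check, shown as an illustrative sketch (the region name DEFAULT is an assumption; adjust it to the affected region):

  $ $gtm_dist/dse
  DSE> find -region=DEFAULT
  DSE> critical -seize
  DSE> buffer_flush
  DSE> critical -release
  DSE> exit
  $ mupip integ -fast -region DEFAULT

If the BUFFER_FLUSH hangs, issue WCINIT (and confirm it) before exiting DSE, and then run the MUPIP INTEG FAST shown above to look for induced damage.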

The concurrency control mechanism allows only one process at a time to execute code within a "critical section." To gain access to the database, a process must first gain ownership of the critical section. The errors described in this section occur when there is a problem with ownership of the critical section.

IF YOU HAVE DETERMINED WHICH PROCESS IS HOLDING THE CRITICAL SECTION (from section H2 using system utilities), try terminating that process. If this corrects the problem, the damage was to the process, rather than the critical section. Refer to section P1.

IF YOU CANNOT IDENTIFY THE PROCESS, or if terminating such a process causes other processes to exhibit the same problem(s), the critical section is damaged and must be reinitialized. Restrict database activity during the reinitialization. Refer to section Q1 before continuing with this section.

TO REINITIALIZE THE DATABASE CRITICAL SECTION: Reinitializing a critical section on an active database file carries some risk of causing database damage. You can minimize this risk by restricting database activity during the reinitialization. Refer to section Q1 before continuing with this section.

The DSE command CRITICAL INITIALIZE RESET re-establishes the database critical section and induces errors for all processes currently accessing the database in question. You can avoid the induced errors in other processes by dropping the RESET qualifier. However, this technique may result in other processes attempting to use partially created critical section structures, possibly corrupting them or the database contents.

After the CRITICAL INITIALIZE, use the DSE commands CRITICAL SEIZE and CRITICAL RELEASE to verify operation of the critical section. Actions such as those described in section H3 test more thoroughly for proper operation.
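
For example, a sketch of such a sequence (the region name DEFAULT is an assumption):

  $ $gtm_dist/dse
  DSE> find -region=DEFAULT
  DSE> critical -initialize -reset
  DSE> critical -seize
  DSE> critical -release
  DSE> exit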

Application problems may be caused by conflicting M LOCKs or OPEN commands in more than one process, or by a process waiting for completion of an M READ or JOB command that depends on an asynchronous event.

First, determine if processes are waiting, without relief, for M LOCKs using the LKE command SHOW ALL WAITING. M routines use LOCK commands to create mutual exclusion semaphores.
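
For example, one way to run this check from the shell (a sketch; the output depends entirely on the application's LOCK names):

  $ $gtm_dist/lke
  LKE> show -all -wait
  LKE> exit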

IF THE SHOW COMMAND HANGS, you have a cache or critical section problem. Restart your evaluation in section H5.

IF THE SHOW COMMAND DISPLAYS NO LOCKS WAITING, the problem is not a LOCK problem. Likewise, if repeated use of SHOW does not display one or more LOCKs that persist every time, the problem is not a LOCK problem. However, even if the problem is not a LOCK problem, continue with this section because it discusses the M commands JOB, OPEN, and READ, which may also produce hangs.

A LOCK identified as belonging to a non-existent process results from an abnormal process termination. GT.M automatically clears such LOCKs when some other process requests a conflicting LOCK.

You can prevent deadlocks by using timeouts on the LOCK commands. Timeouts allow the program to recognize a deadlock. Once a routine detects a deadlock, it should release its LOCKs and restart execution from the beginning of the code that accumulates LOCKs. Without timeouts, there is no way in M to break a deadlock. You must use outside intervention to terminate at least one deadlocked process, or use LKE to strip a LOCK from such a process.

Example:

 for  quit:$$NEW
 quit
NEW()  lock ^X(0)
 set ^X(0)=^X(0)+1
 quit $$STORE(^X(0))
STORE(x)
 lock +^X(x):10 if  set ^X(x)=name_"^"_bal
 lock
 quit $TEST

This uses a timeout on the LOCK of ^X(x) to cause a retry of NEW.

In addition to the LOCK command, the M JOB, OPEN, and READ commands can contribute to deadlocks.

Example:

Process 1         Process 2
---------         --------- 
LOCK ^A
                  OPEN "MSA0:"
                  OPEN "/dev/nrst0"
OPEN "MSA0:"
OPEN "/dev/nrst0"
                  LOCK +^A

This shows a sequence in which Process 1 owns ^A and Process 2 owns device /dev/nrst0. Again, each is trying to get the resource held by the other. Notice that the LOCK commands could be replaced by OPEN commands specifying some non-shared device other than /dev/nrst0.

An application may combine the technique of timeouts on "long" commands to protect the current process, with the technique of minimizing LOCK and OPEN durations, to minimize conflicts with other processes.

Another type of application hanging occurs when a process acquires ownership of a resource and then starts an operation that does not complete for a long period of time. Other processes that need the unavailable resource(s) then hang.

Example:

Process 1         Process 2
---------         --------- 
LOCK ^A
READ x
                  LOCK ^A

If the READ by Process 1 is to an interactive terminal, and the operator has abandoned that device, the READ may take what seems, at least to Process 2, forever. The M commands OPEN and JOB, as well as READ, can produce this problem. When this situation arises, take action to get long-running commands completed or to terminate the process performing those commands.

There are two programming solutions that help avoid these situations. You can either limit the duration of those commands with timeouts, or defer resource ownership until any long operations are complete.

Example:

 for  quit:$$UPD
 quit
UPD()  set x=^ACCT(acct)
 do EDITACCT
 lock ^ACCT(acct) 
 if x=^ACCT(acct) set ^ACCT(acct)=y
 else  write !,"Update conflict–Please Reenter"
 lock
 QUIT $TEST

This stores the contents of ^ACCT(acct) in local variable x, before the interactive editing performed by subroutine EDITACCT (not shown). When the interaction is complete, it LOCKs the resource name and tests whether ^ACCT(acct) has been changed by some other process. If not, it updates the global variable. Otherwise, it informs the user and restarts UPD. This technique eliminates the "open update" problem, but it introduces the possibility that the user may have to re-enter work. An application that needs to minimize the possibility of re-entry may extend this technique by testing individual fields (pieces) for conflicting changes.

Database errors reported by MUPIP INTEG differ in impact and severity. Some require an immediate action to prevent extending the damage. Action on other less severe errors may be delayed.

The next section provides general guidelines for determining your next course of action and a table with information related to the error messages you may encounter.

If you encounter an anomaly in your database or its operations, the following list may offer some help in determining your next course of action. The heading of each section indicates the level of urgency FIS attributes to those items listed below it.

The following list of INTEG messages classifies error severity using the "nature" codes, and refers you to a section identifying an appropriate follow-up action.

INTEG reports these codes for many of the errors; an Index error reported against an index block is treated as Dangerous, and otherwise as Data, meaning the issue is confined to a level 0 data block and so is very localized. However, when an index block is damaged, GT.M cannot correctly navigate the tree, and if operations continue, subsequent updates can go where they do not belong, causing increasing damage. When a data block is damaged, the worst that can happen from a GT.M standpoint is an indefinite loop. More commonly, some confined set of nodes becomes inaccessible, which may or may not be important from an application perspective; most commonly the application gets an error when it tries to use the data in question. A single issue can cause multiple reports; in such a case, focus first on the most serious report.

Repair Dangerous and Access errors immediately. You may assess the benefits of deferring correction of less severe errors until normally scheduled down-time.

MUPIP INTEG Error Messages

NATURE  MNEMONIC         SECTION  ERROR MESSAGE
----------------------------------------------------------------------------------------------------
B       BSIZTOOLARGE     O6       ffff Block larger than specified maximum size.
T       BUFFLUFAILED     I7       Error flushing buffers from rrrr for database file ffff.
B       DBBADFREEBLKCTR  I3       Free blocks counter in file header: nnnn appears incorrect, should be mmmm.
I       DBBADKYNM        K1       Bad key name.
I       DBBADNSUB        K1       Bad numeric subscript.
D       DBBADPNTR        K4       Bad pointer value in directory.
I       DBBDBALLOC       K3       Block doubly allocated.
D       DBBFSTAT         M1       Block busy/free status unknown (local bitmap corrupted).
D       DBBNPNTR         K4       Bit map block number as pointer.
D       DBBPLMGT2K       I3       Blocks per local map is greater than 2K.
D       DBBPLMLT512      I3       Blocks per local map is less than 512.
D       DBBPLNOT512      I3       Blocks per local map is not a multiple of 512.
I       DBBSIZMN         O1       Block too small.
I       DBBSIZMX         O1       Block larger than file block size.
A       DBBSIZZRO        I3       Block size equals zero.
T       DBBTUWRNG        H2       The blocks-to-upgrade file-header field is incorrect. Expected nnnn, found mmmm.
I       DBCMPBAD         K6       Compression count not maximal.
I       DBCMPNZRO        O1       First record of block has nonzero compression count.
I       DBCOMPTOOLRG     O2       Record has too large compression count.
A       DBCREINCOMP      I3       Header indicates database file creation was interrupted before completion.
I       DBDATAMX         O2       Record too large.
T       DBFGTBC          I4       File size larger than block count would indicate.
A       DBFLCORRP        I8       Header indicates database file is corrupt.
D       DBFSTBC          I4       File size smaller than block count would indicate.
A       DBFSTHEAD        I3       File smaller than database header.
I       DBGTDBMAX        K7       Key larger than database maximum.
A       DBHEADINV        I3       Header size not valid for database.
D       DBINCLVL         O1       Block at incorrect level.
A       DBINCRVER        I2       Incorrect version of GT.M database.
I       DBINVGBL         K3       Invalid mixing of global names.
I       DBKEYGTIND       K2       Key greater than index key.
I       DBKEYMN          K1       Key too short.
I       DBKEYMX          K1       Key too long.
I       DBKEYORD         K2       Keys out of order.
I       DBKGTALLW        K1       Key larger than maximum allowed length.
B       DBLOCMBINC       M2       Local bit map incorrect.
D       DBLRCINVSZ       K5       Last record of block has invalid size.
I       DBLTSIBL         K2       Keys less than sibling's index key.
B       DBLVLINC         M1       Local bitmap block level incorrect.
I       DBMAXNRSUBS      K1       Maximum number of subscripts exceeded.
B       DBMBMINCFRE      M1       Master bit map incorrectly asserts this local map has free space.
B       DBMBPFLDIS       M1       Master bit map shows this map full, in disagreement with both disk and INTEG result.
B       DBMBPFLDLBM      M1       Master bit map shows this map full, agreeing with disk local map.
B       DBMBPFLINT       M1       Master bit map shows this map full, agreeing with MUPIP INTEG.
B       DBMBPFRDLBM      M1       Master bit map shows this map has space, agreeing with disk local map.
B       DBMBPFRINT       M1       Master bit map shows this map has space, agreeing with MUPIP INTEG.
B       DBMBPINCFL       M1       Master bit map incorrectly marks this local map full.
B       DBMBSIZMN        M2       Map block too small.
B       DBMBSIZMX        M2       Map block too large.
T       DBMBTNSIZMX      I6       Map block transaction number too large.
B       DBMRKBUSY        M1       Block incorrectly marked busy.
D       DBMRKFREE        M1       Block incorrectly marked free.
B       DBNONUMSUBS      K1       Key contains a numeric form of subscript in a global defined to collate all subscripts as strings.
A       DBNOREGION       I6       None of the database regions accessible.
A       DBNOTGDS         I3       Unrecognized database file format.
A       DBNOTMLTP        K1       Block size not a multiple of 512 bytes.
I       DBNULCOL         K1       NULL collation representation differs from the database file header setting.
D       DBPTRMX          K4       Block pointer larger than file maximum.
D       DBPTRNOTPOS      K4       Block pointer negative.
D       DBRBNLBMN        K4       Root block number is a local bit map number.
D       DBRBNNEG         K4       Root block number negative.
D       DBRBNTOOLRG      K4       Root block number greater than last block number in file.
T       DBRDONLY         I6       Database file ffff read only.
D       DBREADBM         H7       Read error on bit map.
D       DBRLEVLTONE      O1       Root level less than one.
D       DBRLEVTOOHI      O1       Root level higher than maximum.
I       DBRSIZMN         O2       Physical record too small.
I       DBRSIZMX         O2       Physical record too large.
S       DBSPANCHUNKORD   O5       Chunk of nnnn blocks is out of order.
S       DBSPANGLOINCMP   O5       Spanning node is missing. Block no nnnn of spanning node is missing.
D       DBSTARCMP        K5       Last record of block has nonzero compression count.
A       DBSVBNMIN        I3       Start VBN smaller than possible.
A       DBSZGT64K        I4       Block size is greater than 64K.
T       DBTNLTCTN        I6       Current tn and early tn are not equal.
T       DBTNNEQ          I4       Cannot reset transaction number for this region.
T       DBTNRESET        I6       Transaction numbers greater than the current transaction were found.
A       DBTNTOOLG        I6       Total blocks equal zero.
T       DBTTLBLK0        I6       Block transaction number too large.
A       DBUNDACCMT       I6       Cannot determine access method; trying with BG.
A       FREEZE           I6       Database for region rrrr is already frozen, not INTEGing.
A       MUSTANDALONE     I6       Could not get exclusive access to rrrr.
B       NOGTCMDB         I5       INTEG does not support operation on GT.CM database region.
I       NULSUBSC         K1       Null subscripts are not allowed for database file: rrrr.
A       REGFILENOTFOUND  I6       Database file ffff corresponding to region rrrr cannot be found.

These error messages reflect failures to find, open, or access a database file. Examine any secondary error messages to obtain additional information about the problem.

Use printenv to check gtmgbldir or use the M command WRITE $ZGBLDIR to verify that the "pointer" identifies the proper Global Directory. If the pointer is not appropriate, reset gtmgbldir or use the M command SET $ZGBLDIR= to name the proper file.
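
For example (the path shown is purely illustrative):

  $ printenv gtmgbldir
  /home/appuser/prod.gld
  $ export gtmgbldir=/home/appuser/prod.gld

  GTM>write $zgbldir
  /home/appuser/prod.gld
  GTM>set $zgbldir="/home/appuser/prod.gld"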

Examine the Global Directory using GDE. If the Global Directory is not appropriate, replace it, correct or recreate it with GDE. For more information on the use of GDE, refer to the "Global Directory Editor" chapter.

IF THE GLOBAL DIRECTORY IS DAMAGED BUT ACCESSIBLE WITH GDE, investigate who may have used GDE to perform the modifications. If the Global Directory is damaged and not accessible with GDE, investigate what program, other than GT.M and its utilities, might have written to the file. Except for GDE, all GT.M components treat the Global Directory as static and read-only.

IF THE GLOBAL DIRECTORY APPEARS CORRECT, use printenv to verify that any environment variables that it uses are properly defined for the process experiencing the problem. If the process has an environment to which you do not have access, you may have to carefully read the shell scripts used to establish that environment.

IF THE ENVIRONMENT VARIABLES APPEAR CORRECT, use ls -l to examine the file protection. Remember to examine not only the file, but also all directories accessed in locating the file.

IF THE FILES APPEAR TO BE PROPERLY MAPPED by the Global Directory, properly placed given all environment variables, and properly protected to permit appropriate access, use the od or cat utility to verify access to the files, independent of GT.M.
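
For example, assuming the database file is /home/appuser/mumps.dat (an illustrative path):

  $ ls -l /home/appuser/mumps.dat
  $ ls -ld /home /home/appuser
  $ od -c /home/appuser/mumps.dat | head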

IF YOU SUSPECT A VERSION MISMATCH PROBLEM, refer to section I2.

IF YOU SUSPECT A DISK HARDWARE PROBLEM, refer to section H7.

GT.M corrects certain errors automatically. If you find that any of these errors persist, contact your GT.M support channel.

"Block transaction number too large" indicates that the file header has a smaller transaction number than the database block.

If you are not running TP or incremental backup this is a benign error (from the database's point of view; application data consistency should be verified). GT.M automatically self-corrects these errors as soon as it performs sufficient updates to get the current transaction number of the database higher than any block's transaction number. If this error persists, perform the following steps:

"Current tn and early tn are not equal" indicates that the critical section has been damaged. "Reference count is not zero" indicates an improper file close. The first access that references a questionable database should correct these errors. Generally, these errors indicate that the file was not closed normally. This problem is typically caused by an unscheduled shutdown of the system. Review your institution's startup and shutdown procedures to ensure a controlled shutdown. Startup procedures should use journaling and/or replication to recover from an unscheduled shutdown.

"Cannot determine access method..." indicates that the fileheader has been damaged. When INTEG detects this error, it forces the access method to BG and continues. If there is no other damage to the file header, no other action may be required.

However, if the access method should be MM, use MUPIP SET ACCESS_METHOD=MM to correct the database.
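
For example (DEFAULT is an assumed region name):

  $ mupip set -access_method=MM -region DEFAULT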

This section describes appropriate actions when the error message indicates a damaged key. GDS transforms subscripted or unsubscripted global variable names into keys, which are part of the database record used to index the corresponding global variable data values. The keys are stored in a compressed form which omits that part of the prefix held in common with the previous key in the block. The compression count is the number of common characters. Except in the Directory Tree, all records after the first one have a non-zero count. The first record in a block always has a compression count of zero (0).

IF THE BLOCK IS A DATA BLOCK, that is, level zero (0), refer to section O3.

IF THE BLOCK HAS A LEVEL GREATER THAN ZERO (0), examine the record with the DSE command DUMP BLOCK= OFFSET where the block and offset values are provided by the INTEG error report. If the record appears to have a valid block pointer, note the pointer. Otherwise, refer to section O2.

After noting the pointer, SPAWN and use MUPIP INTEG BLOCK=pointer (if you have time constraints, you may use the FAST qualifier) to check the structure.
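
For instance, assuming the INTEG report flagged block 1A3 at offset 10 and the record there appears to point to block 1F4 (all block numbers, offsets, and the file name here are placeholders), these two steps might look like:

  DSE> dump -block=1A3 -offset=10
  DSE> spawn "mupip integ -fast -block=1F4 mumps.dat"

Depending on the result, continue with the REMOVE and, if appropriate, the DUMP RECORD=9999, FIND KEY=, and ADD KEY= POINTER steps described below.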

IF THE SUB-TREE IS INVALID, according to the MUPIP INTEG, use DSE to REMOVE the record containing the reported bad key, run INTEG again, and refer to section O4.

Otherwise use the DSE command DUMP BLOCK= RECORD=9999 to find the last record in the block and examine it using the DUMP RECORD= command. Continue using DSE to follow the pointer(s) down to level 0, always choosing the right-hand branch. Note the largest key at the data level. REMOVE the record containing the reported bad key. Determine the proper placement for the noted key using FIND KEY= and ADD KEY= POINTER where the key and the pointer are those noted in the preceding actions.

A doubly allocated block is dangerous because it causes data to be inappropriately mingled. As long as no KILLs occur, double allocation might not cause permanent loss of additional data. However, it may cause the application programs to generate errors and/or inappropriate results. When a block is doubly allocated, a KILL may remove data outside its proper scope.

A doubly allocated index block may also cause increasing numbers of blocks to become corrupted. Use the following process to correct the problem.

First, identify all pointers to the block, using FIND EXHAUSTIVE and/or information reported by MUPIP INTEG. If the error report identifies the block as containing inappropriate keys or a bad level, INTEG has identified all paths that include the block. In that case, INTEG reports all paths after the first with the doubly allocated error, and the first path with some other error, for example, "Keys out of order."

IF THE INTEG REPORT DOES NOT MENTION THE BLOCK PRIOR TO THE DOUBLY ALLOCATED ERROR, use FIND EXHAUSTIVE to identify all pointers to that block.

IF THE BLOCK IS A DATA BLOCK, that is, level zero (0), DUMP it GLO, REMOVE the records that point to it, MAP it FREE, and MUPIP LOAD the output of the DUMP GLO.

IF THE BLOCK HAS A LEVEL GREATER THAN ZERO (0), you may sort through the block and its descendants to disentangle intermixed data. If the block has a level of more than one (1), this may be worth a try. The salvage strategy (discussed in section O4) may be time consuming and there may be only one misplaced node. However, in general, the salvage strategy is less demanding and less dangerous.

IF YOU CHOOSE THE SALVAGE STRATEGY, REMOVE the records that point to the block, MAP it FREE, and refer to section O4.

IF YOU DECIDE TO WORK WITH THE BLOCK, choose the path to retain, REMOVE the other pointer record, and relocate any misplaced descendants with DSE ADD and REMOVE.
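
A sketch of the level-zero case described above (block numbers, offsets, and the output file name are placeholders):

  DSE> open -file=dbldata.glo
  DSE> dump -block=2C4 -glo
  DSE> close
  DSE> remove -block=1A3 -offset=10
  DSE> map -block=2C4 -free
  DSE> exit
  $ mupip load -format=go dbldata.glo

Here 2C4 stands for the doubly allocated data block and 1A3:10 for one of the records that points to it; repeat the REMOVE for each such pointer record.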

Every block in the file has a corresponding bit in a bitmap. All blocks with valid data are marked busy in their maps; all blocks that are unused or no longer hold data are marked free. GDS uses bitmaps to locate free blocks efficiently. The errors discussed in this section indicate problems with bitmaps.

"Block incorrectly marked free" is the only potentially dangerous bitmap error. This error means that the block is within the B-tree structure, but that the bitmap shows it available for use (i.e., it is a "Block doubly allocated" waiting to happen). Immediately use DSE to MAP such blocks BUSY.

Bitmap information is redundant (i.e., bitmaps can be recreated by scanning the B-tree); however, the majority of bitmap errors reflect secondary errors emanating from flaws in the B-tree, which are often reported as key errors by MUPIP INTEG.

When INTEG encounters an error, it stops processing that leaf of the tree. When it subsequently compares its generated bitmaps to those in the database, it reports the blocks belonging in the tree that it could not find as "Block incorrectly marked busy." This error type can be viewed as a flag, marking the location of a block of lost data whose index is disrupted.

INTEG reports each block that it concludes is incorrectly marked, and also the local map that holds the "bad" bits. Furthermore, if the local map "errors" affect whether the local map should be marked full or not full in the master map, INTEG also reports the (potential) problem with the master map. Therefore, a single error in a level one (1) index block will generate, in addition to itself, one or more "Block incorrectly marked busy", one or more "Local bitmap incorrect", and possibly one or more "Master bitmap shows..." Errors in higher level index blocks can induce very large numbers of bitmap error reports.

Because bitmap errors are typically secondary to other errors, correcting the primary errors usually also cures the bitmap errors. For this reason and, more importantly, because bitmap errors tend to locate "lost" data, they should always be corrected at, or close to, the end of a repair session.

The DSE command MAP provides a way to switch bits in local maps with FREE and BUSY, propagate the status of a local map to the master map with MASTER, and completely rebuild all maps from the B-tree with RESTORE. Before beginning any MAP -MASTER operation, first ensure that the database has no active updaters and that there are no non-bitmap errors to resolve.
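
For example (block numbers are placeholders):

  DSE> map -block=3F -busy
  DSE> map -block=40 -free
  DSE> map -block=40 -master
  DSE> map -restore

Remember that MAP RESTORE rebuilds all the maps from the B-tree and, like MAP MASTER, should only be used when the database is quiescent and free of non-bitmap errors.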

This strategy uses bitmap errors to locate data blocks containing information that belongs in the B-tree, but are no longer indexed because of errors and/or repairs to defective indices.

The algorithm is based on the fact that most bitmap errors are secondary to index errors. Therefore, it is optimistic about bitmaps and pessimistic about indices, and tends to err on the side of restoring more rather than less data to the B-tree. After using this technique, you should always check to see if obsolete, deleted data was restored. If obsolete data was restored, and GDS integrity has been restored, you can safely KILL the "extra" data.

IF THE INDICES HAVE BEEN DAMAGED FOR SOME TIME AND THE DAMAGE CAUSED DUPLICATE KEYS TO BE CREATED, this strategy raises the issue of which value is the "correct" value. Because most applications either form new nodes or update existing nodes rather than simply overlaying them, this issue seldom arises. Usually the application will fail in an attempt to update any "misplaced" node. If the problem does arise, the issue may not be determining the "correct" value, but the best available value.

IF YOU HAVE A DUPLICATE NODE PROBLEM, you can load the sequential file produced in DSE with an M program that detects and reports duplicate nodes. You can also use the block transaction numbers as clues to the order in which blocks were updated. However, remember that you generally cannot know which record was modified on the last update, and that DSE repair actions modify the block transaction number.

If the duplicate node problem poses a significant problem, you should probably not use DSE to repair the database, but instead, use journals to recover or restore from backups.

This strategy works well when the missing indices are level one (1). However, the time required increases dramatically as the level of the missing index increases. If you have a problem with a level four (4) or level five (5) index, and you have developed skill with DSE, you may wish to try the more technically demanding approach of repairing the indices.

Once you have corrected all errors except bitmap errors, SPAWN and use MUPIP INTEG FAST REGION NOMAP to get a list of all remaining bitmap errors. If the report includes any "Blocks incorrectly marked free", MAP them BUSY. Then use DUMP HEADER BLOCK= to examine each "Block incorrectly marked busy." If the level is zero (0), DUMP the block ZWR. In any case, MAP it FREE. Once all blocks have been collected in a sequential file in this fashion, use MUPIP LOAD to reclaim the data from the sequential file.
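
A sketch of that sequence (block numbers and the file name are placeholders, and DEFAULT is an assumed region name):

  $ mupip integ -fast -nomap -region DEFAULT
  $ $gtm_dist/dse
  DSE> map -block=4A -busy
  DSE> dump -block=4B -header
  DSE> open -file=lost_nodes.zwr
  DSE> dump -block=4B -zwr
  DSE> close
  DSE> map -block=4B -free
  DSE> exit
  $ mupip load -format=zwr lost_nodes.zwr

In this sketch, 4A stands for a block reported incorrectly marked free and 4B for a level-zero block reported incorrectly marked busy.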

salvage.m is a utility that removes all incorrectly marked busy blocks from the specified region. During execution, it displays the DSE commands that it will execute and aborts execution when it encounters an error. It dumps the ZWRITE-formatted content of blocks incorrectly marked busy to a file called <region>_db.zwr. Upon completion, it sets the abandoned_kills and kill_in_prog flags in the database file header to false. You can download salvage.m from http://tinco.pair.com/bhaskar/gtm/doc/books/ao/UNIX_manual/downloadables/salvage.m.

Steps to run the salvage utility are as follows:

  1. Perform an argumentless MUPIP RUNDOWN before running this utility.

  2. Ensure that there are no INTEG errors other than the incorrectly marked busy block errors.

  3. Run $gtm_dist/mumps -r ^salvage.

  4. Specify the region name. If no region is specified, the utility assumes DEFAULT.

  5. If the utility reports a DSE error, fix that error and run the salvage utility again.

After completing repairs with the salvage utility, open the <REGION>_db.zwr file and examine its contents. If there is a need to recover the data from the incorrectly marked busy blocks, perform a MUPIP LOAD <REGION>_db.zwr to load that data back to the database.
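
For example, a typical run might use the following commands (DEFAULT is an assumed region name, and the .zwr file name follows the <REGION>_db.zwr convention described above):

  $ mupip rundown
  $ mupip integ -region DEFAULT
  $ $gtm_dist/mumps -r ^salvage
  $ mupip load -format=zwr DEFAULT_db.zwr

The final MUPIP LOAD applies only if you decide the dumped blocks contain data that needs to be restored.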

The following example shows how to salvage a damaged spanning node in ^mypoem.

  1. Run MUPIP INTEG to find the location of the damaged spanning node. A MUPIP INTEG report of a region that has damaged spanning nodes might look something like the following:

    Integ of region DEFAULT
    Block:Offset Level
    %GTM-E-DBSPANGLOINCMP, 
           7:10     0  Spanning node is missing. Block no 3 of spanning node is missing
                       Directory Path:  1:10, 2:10
                       Path:  4:31, 7:10
    Spanning Node ^mypoem(#SPAN1) is suspect.
    %GTM-E-DBKEYGTIND, 
           7:10     0  Key greater than index key
                       Directory Path:  1:10, 2:10
                       Path:  4:31, 7:10
    Keys from ^mypoem(#SPAN48) to ^mypoem(#SPAN3*) are suspect.
    %GTM-E-DBSPANCHUNKORD, 
           3:10     0  Chunk of 1 blocks is out of order
                       Directory Path:  1:10, 2:10
                       Path:  4:3D, 3:10
    Spanning Node Chunk ^mypoem(#SPAN4) is suspect.
    Total error count from integ:        3
    Type           Blocks         Records          % Used      Adjacent
    Directory           2               2           5.468            NA
    Index               1               4          13.476             1
    Data                4               5          76.562             4
    Free               93              NA              NA            NA
    Total             100              11              NA             5
    [Spanning Nodes:2 ; Blocks:3]
    %GTM-E-INTEGERRS, Database integrity errors

    Notice the lines that contain "Block no 3 of spanning node is missing", "Key greater than index key", and ^mypoem(#SPAN48), and that there is an extra chunk that is not connected to ^mypoem(#SPAN4).

  2. Confirm whether you have determined the spanning range of the node:

    Clearly, GT.M did not find block 3 and ^mypoem(#SPAN4) terminated the spanning node, so ^mypoem(#SPAN4) might be the last node. So, the parts of a spanning node that contain the value are ^mypoem(#SPAN2) through ^mypoem(#SPAN4).

  3. Use DSE to find the spanned nodes:

    DSE> find -key=^mypoem(#SPAN2)
    Key found in block  6.
        Directory path
        Path--blk:off
        1:10,    2:10,
        Global tree path
        Path--blk:off
        4:25,    6:10,
    DSE> find -key=^mypoem(#SPAN3)
    Key not found, would be in block  7.
        Directory path
        Path--blk:off
        1:10,    2:10,
        Global tree path
        Path--blk:off
        4:31,    7:10,
    DSE> find -key=^mypoem(#SPAN4) 
    Key found in block  3.
        Directory path
        Path--blk:off
        1:10,    2:10,
        Global tree path
        Path--blk:off
        4:3D,    3:10,
    DSE> f -k=^mypoem(#SPAN5)
    Key not found, would be in block  3.
        Directory path
        Path--blk:off
        1:10,    2:10,
        Global tree path
        Path--blk:off
        4:3D,    3:10,

    Notice that there are #SPAN2 and #SPAN4 but no #SPAN5. Therefore, #SPAN4 is the last piece. #SPAN3 was not found and is most likely the damaged node.

  4. Dump all the blocks in ZWRITE format to see what can be salvaged.

    DSE> open -file=mypoem.txt
    DSE> dump -block=6 -zwr
    1 ZWR records written.
    DSE> dump -block=7 -zwr
    1 ZWR records written.
    DSE> dump -block=3 -zwr
    1 ZWR records written.
    DSE> close
    Closing output file:  mypoem.txt
    $ cat mypoem.txt
    ; DSE EXTRACT
    ; ZWR
    $ze(^mypoem,0,480)="Half a league, half a league,Half a league onward,All in the valley of Death Rode the six hundred.  Forward, the Light Brigade!  Charge for the guns he said: Into the valley of Death Rode the six hundred.  Forward, the Light Brigade! Was there a man dismayed?  Not tho the soldiers knew Some one had blundered: Theirs not to make reply, Theirs not to reason why, Theirs but to do and die: Into the valley of Death Rode the six hundred.  Cannon to right of them, Cannon to left of "
    $ze(^mypoem,22080,480)="them, Cannon in front of them Volleyed and thundered; Stormed at with shot and shell, Boldly they rode and well, Into the jaws of Death, Into the mouth of Hell Rode the six hundred.  Flashed all their sabres bare, Flashed as they turned in air Sabring the gunners there, Charging an army while All the world wondered: Plunged in the battery-smoke Right thro the line they broke; Cossack and Russian Reeled from the sabre-stroke Shattered and sundered.  Then they rode back, but no"
    $ze(^mypoem,960,468)="t Not the six hundred.  Cannon to right of them, Cannon to left of them, Cannon behind them Volleyed and thundered; Stormed at with shot and shell, While horse and hero fell, They that had fought so well Came thro the jaws of Death, Back from the mouth of Hell, All that was left of them, Left of six hundred.  When can their glory fade?  O the wild charge they made!  All the world wondered.  Honour the charge they made!  Honour the Light Brigade, Noble six hundred!"

    Notice that the second record above (dumped from block 7, because you started with block 6) has the correct value, but its internal subscript (the starting position) must have been damaged.

  5. Fix the starting position in the $ZEXTRACT statement for the damaged record (change 22080 to 480) in mypoem.txt:

    $ze(^mypoem,480,480)="them, Cannon in front of them Volleyed and thundered; Stormed at with shot and shell, Boldly they rode and well, Into the jaws of Death, Into the mouth of Hell Rode the six hundred.  Flashed all their sabres bare, Flashed as they turned in air Sabring the gunners there, Charging an army while All the world wondered: Plunged in the battery-smoke Right thro the line they broke; Cossack and Russian Reeled from the sabre-stroke Shattered and sundered.  Then they rode back, but no"

    Verify the value for correctness if you have knowledge of the type of data in this global. This completes data recovery (whatever was possible).

  6. Kill the existing global:

    GTM>kill ^mypoem
    GTM>write ^mypoem
    %GTM-E-GVUNDEF, Global variable undefined: ^mypoem

  7. Load the salvaged global:

    $ mupip load -format=zwr mypoem.txt
    ; DSE EXTRACT
    ; ZWR
    Beginning LOAD at record number: 3
    LOAD TOTAL        Key Cnt: 3  Max Subsc Len: 8  Max Data Len: 480
    Last LOAD record number: 5
    $ gtm
    GTM>w ^mypoem
    Half a league, half a league,Half a league onward,All in the valley of Death Rode the six hundred.  Forward, the Light Brigade!  Charge for the guns he said: Into the valley of Death Rode the six hundred.  Forward, the Light Brigade! Was there a man dismayed?  Not tho the soldiers knew Some one had blundered: Theirs not to make reply, Theirs not to reason why, Theirs but to do and die: Into the valley of Death Rode the six hundred.  Cannon to right of them, Cannon to left of them, Cannon in front of them Volleyed and thundered; Stormed at with shot and shell, Boldly they rode and well, Into the jaws of Death, Into the mouth of Hell Rode the six hundred.  Flashed all their sabres bare, Flashed as they turned in air Sabring the gunners there, Charging an army while All the world wondered: Plunged in the battery-smoke Right thro the line they broke; Cossack and Russian Reeled from the sabre-stroke Shattered and sundered.  Then they rode back, but not Not the six hundred.  Cannon to right of them, Cannon to left of them, Cannon behind them Volleyed and thundered; Stormed at with shot and shell, While horse and hero fell, They that had fought so well Came thro the jaws of Death, Back from the mouth of Hell, All that was left of them, Left of six hundred.  When can their glory fade?  O the wild charge they made!  All the world wondered.  Honour the charge they made!  Honour the Light Brigade, Noble six hundred!

GT.M processes may detect errors at run-time. These errors trigger the GT.M error handling mechanism, which generally places the process in direct mode, or triggers the application programs to transcribe an error context to a sequential file or to a global. For more information on error handling, refer to the "Error Processing" chapter of the GT.M Programmer's Guide.

Most run-time errors are related to the application and its environment. However, some errors reflect the inability of a process to properly deal with a database. Some errors of this type are also, or only, generated by the GT.M utility programs.

For descriptions of individual errors, refer to the GT.M Message and Recovery Procedure Reference Manual.

IF YOU CANNOT REPRODUCE SUCH ERRORS WITH ANOTHER PROCESS PERFORMING THE SAME TASK, or with an appropriately directed MUPIP INTEG, they were most likely reported by a damaged process. In this case, refer to section P1.

The following table lists run-time errors, alphabetically by mnemonic, each with a section reference for further information.

Run-Time Error Messages Identifying Potential System Problems

ERROR MNEMONIC  SECTION  ERROR MESSAGE TEXT
------------------------------------------------------------------------------------------
BADDVER         I2       Incorrect database version vvv
BITMAPSBAD      M1       Database bitmaps are incorrect
BTFAIL          R3       The database block table is corrupt
CCPINTQUE       R7       Interlock failure accessing Cluster Control Program queue
CRITRESET       I8       The critical section crash count for rrr region has been incremented
DBCCERR         R7       Interlock instruction failure in critical mechanism for region rrr
DBCRPT          I8       Database is flagged corrupt
DBFILERR        I5       Error with database file
DBNOFILEP       I5       No database file has been successfully opened
DBNOTGDS        I5       Unrecognized database file format
DBOPNERR        I5       Error opening database file
DBRDERR         I5       Cannot read database file after opening
FORCEDHALT      R4       Image HALTed by MUPIP STOP
GBLDIRACC       I5       Global Directory access failed, cannot perform database functions
GBLOFLOW        R5       Database segment is full
GDINVALID       I5       Unrecognized Global Directory format: fff
GTMCHECK        R6       Internal GT.M error; report to FIS
GVDATAFAIL      R2       Global variable $DATA function failed. Failure code: cccc
GVDIRECT        I5       Global variable name could not be found in global directory
GVGETFAIL       R2       Global variable retrieval failed. Failure code: cccc
GVKILLFAIL      R2       Global variable KILL failed. Failure code: cccc
GVORDERFAIL     R2       Global variable $ORDER or $NEXT function failed. Failure code: cccc
GVPUTFAIL       R2       Global variable put failed. Failure code: cccc
GVQUERYFAIL     R2       Global variable $QUERY function failed. Failure code: cccc
GVRUNDOWN       I5       Error during global database rundown
GVZPREVFAIL     R2       Global variable $ZPREVIOUS function failed. Failure code: cccc
MUFILRNDWNFL    I5       File rundown failed
TOTALBLKMAX     R5       Extension exceeds maximum total blocks, not extending
UNKNOWNFOREX    R4       Process halted by a forced exit from a source other than MUPIP
WCFAIL          R3       The database cache is corrupt

The following table lists the failure codes, whether or not they require a MUPIP INTEG, a brief description of the code's meaning, and a section reference for locating more information.

Run-Time Database Restart Codes

FAIL CODE  RUN INTEG  SECTION  DESCRIPTION
---------------------------------------------------------------------------------------------------------
A          x          O2       Special case of code C.
B          x          K1       Key too large to be correct.
C          x          O2       Record unaligned: properly formatted record header did not appear where expected.
D          x          O2       Record too small to be correct.
E          -          R3       History overrun prevents validation of a block.
F          -          -        Not currently used.
G          x          R3       Cache record modified while in use by the transaction.
H          x          R3       Development of a new version of a block encountered a likely concurrency conflict.
I          x          O2       Level on a child does not show it to be a direct descendent of its parent.
J          x          O2       Block requested outside of file.
K          -          R3       Cache control problem encountered or suspected.
L          -          R3       Conflicting update of a block took priority.
M          x          P1       Error during commit that the database logic does not handle.
N          -          -        Not currently used.
O          -          R3       Before image was lost prior to its transfer to the journal buffer.
P          -          -        Not currently used.
Q          x          R7       Shared memory interlock failed.
R          x          R5       Critical section reset (probably by DSE).
S          -          R8       Attempt to increase the level beyond current maximum.
T          -          R3       Commit blocked by flush.
U          x          R3       Cache record unstable while in use by the transaction.
V          -          R9       Read-only process could not find room to work.
W          -          -        Not currently used.
X          x          M2       Bitmap block header invalid.
Y          x          O2       Record offset outside of block bounds.
Z          x          O2       Block did not contain record predicted by the index.
a          -          R3       Predicted bitmap preempted by another update.
b          -          R3       History overrun prevents validation of a bitmap.
c          -          R3       Bitmap cache record modified while in use by the transaction.
d          -          -        Not currently used.
e          x          R3       Attempt to read a block outside the bounds of the database.
f          -          R3       Conflicting update took priority on a non-isolated global and a block split requires a TP_RESTART.
g          -          R3       The number of conflicting updates on non-isolated global nodes exceed an acceptable level and requires a TP_RESTART.
h          -          R3       Journal state or before image changed.
i          -          -        Not currently used.
j          -          R3       Backup or integ lost before image block.
k          -          R3       Cache control state mismatch.
l          -          P1       Phase two commit held off block acquisition.
m          -          P1       Inhibit KILLs held off commit.
n          -          P1       Trigger definition changed during trigger processing.
o          -          R3       Online rollback requires a restart.
p          -          R3       Online rollback requires a restart.
q          -          R3       Tentative block removed by MUPIP REORG -TRUNCATE.
r          -          R3       Root block moved by MUPIP REORG.
s          -          H3       Instance freeze blocked commit.
y          x          K4       Root block unreliable.
z          -          -        Currently not used.

* In the last retry may indicate a process problem.

IF THE DATABASE FILLS UP AND CANNOT EXPAND, processes that try to add new information to the database experience run-time errors. The following conditions prevent automatic database expansion.

You can handle the first two cases by using the MUPIP EXTEND command. MUPIP EXTEND may also help in dealing with the third case by permitting an extension smaller than that specified in the file header. Note that the extension size in the file header, or the -BLOCKS= qualifier to MUPIP EXTEND, is in GDS blocks and does not include overhead for bitmaps.
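
For example, to extend a region by a specific number of GDS blocks (the region name and block count are illustrative):

  $ mupip extend -blocks=10000 DEFAULT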

IF THERE IS NO MORE SPACE ON A VOLUME, you may use the M command KILL to delete data from the database. To KILL an entire global, the database file must contain one free GDS block. You may acquire these by KILLing a series of subscripted nodes or by doing a small extension.

You may also use UNIX utilities such as tar, cp, and rm to remove files from the volume and place them on another volume.

You can also add a new disk. If you change the placement of the files, be sure to also adjust the Global Directory and/or the environment variables to match the new environment.
