Monday, June 27, 2011

Companies Standing in Queue to Get Exadata

"Computer is the Network" seems to be the mantra now as the server side technologies cross another level of advancement. Oracle Exadata is the premier database appliance winning the show by offering a truly working magic of servers, storage, networking and software enclosed in one box.

Such is the success of this fabulous piece of engineering that companies are flocking to queue up and buy it, much like people queue up to buy an iPad or iPod. Oracle has announced that more than 1,000 Oracle® Exadata Database Machines are installed at customer sites globally.

Now that is some doing.

Mapping of Physical Disks, LUNs, Cell Disks, and Grid Disks


An Exadata machine comprises compute nodes and cell (storage) nodes. The compute nodes host the RDBMS and ASM instances along with the Oracle homes, whereas the cell nodes hold the data. This data is stored under several layers of abstraction, from physical disks to LUNs, to cell disks, to grid disks, to ASM disks, to ASM disk groups.

From a cell node, using the CellCLI utility, you can manage the physical disks, LUNs, cell disks, and grid disks.
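
For example, you can start CellCLI on a cell from the operating system and list each of these object types (the host and user names below are just placeholders):

[celladmin@cel01 ~]$ cellcli
CellCLI> list physicaldisk
CellCLI> list lun
CellCLI> list celldisk
CellCLI> list griddisk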

From a compute node, you can manage the ASM disks and the ASM disk groups.
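
From there, for instance, you can list the ASM disks together with the grid disks they map to; on Exadata the PATH column of V$ASM_DISK typically shows the cell's IP address and the grid disk name (this is a generic query of mine, not taken from any note):

SQL> select g.name diskgroup, d.name asm_disk, d.path
     from v$asm_diskgroup g, v$asm_disk d
     where g.group_number = d.group_number
     order by 1, 2;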

The following is the mapping of the storage layers in Exadata.

First comes the physical disk, then the LUN, then the cell disk, then the grid disk, then the ASM disk, and finally the ASM disk group. There is a one-to-one mapping between physical disk, LUN, and cell disk: for each physical disk there is one LUN, and for that LUN there is one cell disk. Next comes the grid disk; a cell disk is typically carved into multiple grid disks. A grid disk and an ASM disk are synonymous in the Exadata context; they are the same thing. Finally, an ASM disk group comprises multiple grid disks. Here we have three disk groups: SYSTEMDG, DATA, and RECO.

List the griddisk:

CellCLI> list griddisk DATA_TT09_test1test0 detail
         name:                   DATA_TT09_test1test0
         availableTo:
         cellDisk:               TT09_test1test0
         comment:
         creationTime:           0000-07-14T02:15:39+00:00
         diskType:               HardDisk
         errorCount:             1
         id:                     00000129-000c-6341-0000-000000000000
         offset:                 32M
         size:                   430G
         status:                 active


List the celldisk of the griddisk:


CellCLI> list celldisk TT09_test1test0 detail
         name:                   TT09_test1test0
         comment:
         creationTime:           0000-07-14T02:12:14+00:00
         deviceName:             /dev/test
         devicePartition:        /dev/test
         diskType:               HardDisk
         errorCount:             2
         freeSpace:              0
         id:                     00000129-0009-4275-0000-000000000000
         interleaving:           none
         lun:                    0_9
         raidLevel:              0
         size:                   557.859375G
         status:                 normal


List the LUN of the celldisk:

CellCLI> list lun 0_9 detail
         name:                   0_9
         cellDisk:               TT09_test1test0
         deviceName:             /dev/test
         diskType:               HardDisk
         id:                     0_9
         isSystemLun:            FALSE
         lunAutoCreate:          FALSE
         lunSize:                557.861328125G
         lunUID:                 0_9
         physicalDrives:         20:9
         raidLevel:              0
         lunWriteCacheMode:      "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
         status:                 normal

List the Physical disk of the LUN:


CellCLI> list physicaldisk 20:9 detail
         name:                   20:9
         deviceId:               17
         diskType:               HardDisk
         enclosureDeviceId:      20
         errMediaCount:          23
         errOtherCount:          0
         foreignState:           false
         luns:                   0_9
         makeModel:              "TEST ST360057SSUN600G"
         physicalFirmware:       0805
         physicalInsertTime:     0000-03-24T22:10:19+00:00
         physicalInterface:      sas
         physicalSerial:         E08XLW
         physicalSize:           558.9109999993816G
         slotNumber:             9
         status:                 normal
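
Instead of walking the chain one object at a time, you can also pull just the linking attributes for every object in a single pass with the ATTRIBUTES form of LIST; the attribute names are the same ones shown in the detail listings above:

CellCLI> list griddisk attributes name, cellDisk, offset, size
CellCLI> list celldisk attributes name, lun, deviceName, size
CellCLI> list lun attributes name, physicalDrives, lunSize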


In general, one cell disk can contain multiple grid disks, while each grid disk is carved from a single cell disk. In the following, there is a 1:M relationship between the cell disk and its grid disks.

List griddisks in one celldisk:

CellCLI> list griddisk where cellDisk like 'TT09_test1test0'
         DATA_TT09_test1test0         active
         RECO_TT09_test1test0         active
         SYSTEMDG_TT09_test1test0     active

List griddisks in one celldisk (in detail, with a wildcard):

CellCLI> list griddisk where cellDisk like 'TT09.*' detail
         name:                   DATA_TT09_test1test0
         availableTo:
         cellDisk:               TT09_test1test0
         comment:
         creationTime:           0000-07-14T02:15:39+00:00
         diskType:               HardDisk
         errorCount:             1
         id:                     00000129-000c-6341-0000-000000000000
         offset:                 32M
         size:                   430G
         status:                 active

         name:                   RECO_TT09_test1test0
         availableTo:
         cellDisk:               TT09_test1test0
         comment:
         creationTime:           0000-07-14T02:15:55+00:00
         diskType:               HardDisk
         errorCount:             1
         id:                     00000129-000c-a1d6-0000-000000000000
         offset:                 430.046875G
         size:                   98.6875G
         status:                 active

         name:                   SYSTEMDG_TT09_test1test0
         availableTo:
         cellDisk:               TT09_test1test0
         comment:
         creationTime:           0000-07-14T02:13:02+00:00
         diskType:               HardDisk
         errorCount:             0
         id:                     00000129-0009-fd52-0000-000000000000
         offset:                 528.734375G
         size:                   29.125G
         status:                 active
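
Note how the offsets and sizes show these three grid disks tiling the cell disk end to end; each grid disk starts roughly where the previous one ends, and the last one ends exactly at the cell disk size reported earlier (557.859375G):

         DATA      offset 32M,           size 430G
         RECO      offset 430.046875G,   size 98.6875G    (430.046875 + 98.6875 = 528.734375)
         SYSTEMDG  offset 528.734375G,   size 29.125G     (528.734375 + 29.125  = 557.859375)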



Listing that each of these grid disks belongs to exactly one cell disk:


CellCLI> list griddisk DATA_TT09_test1test0 detail
         name:                   DATA_TT09_test1test0
         availableTo:
         cellDisk:               TT09_test1test0
         comment:
         creationTime:           0000-07-14T02:15:39+00:00
         diskType:               HardDisk
         errorCount:             1
         id:                     00000129-000c-6341-0000-000000000000
         offset:                 32M
         size:                   430G
         status:                 active

CellCLI> list griddisk RECO_TT09_test1test0 detail
         name:                   RECO_TT09_test1test0
         availableTo:
         cellDisk:               TT09_test1test0
         comment:
         creationTime:           0000-07-14T02:15:55+00:00
         diskType:               HardDisk
         errorCount:             1
         id:                     00000129-000c-a1d6-0000-000000000000
         offset:                 430.046875G
         size:                   98.6875G
         status:                 active

CellCLI> list griddisk SYSTEMDG_TT09_test1test0 detail
         name:                   SYSTEMDG_TT09_test1test0
         availableTo:
         cellDisk:               TT09_test1test0
         comment:
         creationTime:           0000-07-14T02:13:02+00:00
         diskType:               HardDisk
         errorCount:             0
         id:                     00000129-0009-fd52-0000-000000000000
         offset:                 528.734375G
         size:                   29.125G
         status:                 active

Thursday, June 23, 2011

Hang Analysis in Exadata

On a half rack Exadata machine, one of the database nodes reported the following error:


PMON failed to acquire latch, see PMON dump



First, I checked whether the database was hung or not.

The database was up, and all four instances were up too, along with the ASM instances.

The database was also accepting logins.

The above error occurred just once, so that was a relief; it seemed to be a temporary glitch.



I checked the PMON dump, and it showed that a process had temporarily blocked PMON because it held a child latch of the shared pool:

PMON unable to acquire latch 60106398 Child shared pool level=7 child#=2
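
Had the latch contention persisted, a generic way to see which process currently holds a latch is V$LATCHHOLDER (this query is my addition, not something taken from the dump or the note):

SQL> select pid, sid, laddr, name, gets from v$latchholder;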



But I wanted to make sure that things were fine, that the blocking really was temporary, and that the database was not hung or facing a serious problem, so I used the oradebug utility to do the hang analysis.

In summary, the hang analysis showed that there were no cycles (true blocking at the internal kernel level). There was no session blocking many (10 or more) other sessions. There were also no open chains (many sessions waiting on one another). There was some mention of sessions waiting on a mutex.



I did the hang analysis as follows.

I opened two PuTTY sessions to the node and started SQL*Plus in both of them:



In Session 1: (gathering hang analysis)

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug hanganalyze 3



The hang analysis is written to /u01/app/oracle/diag/rdbms/test/test4/trace/test4_ora_20189.trc
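
If you are not sure which trace file the dump went to, oradebug itself can report it (an extra step, not part of the note's procedure):

SQL> oradebug tracefile_name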

In Session 2: (gathering system state)

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266

Then wait for at least one minute, and do the following again for comparison:

In Session 1: (gathering hang analysis again)

SQL> oradebug hanganalyze 3

The hang analysis is again written to /u01/app/oracle/diag/rdbms/test/test4/trace/test4_ora_20189.trc


In Session 2: (gathering system state again)



SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266

The trace file /u01/app/oracle/diag/rdbms/test/test4/trace/test4_ora_20189.trc contains the hang analysis; check it for cycles, blockers, and open chains.
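
As a quick cross-check from SQL (assuming 11g, where the V$WAIT_CHAINS view is available; this is my addition rather than part of the note's procedure), you can also look at the wait chains the database itself has detected:

SQL> select chain_id, chain_is_cycle, sid, blocker_sid, wait_event_text, num_waiters
     from v$wait_chains
     order by chain_id;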



Reference: MOS [ID 452358.1]

Friday, June 17, 2011

ORA-00494: enqueue [CF] held for too long

There was a high load on the Linux (RHAS4) server because backups of three 11.1.0.6 databases were running simultaneously. The load was so high that the server became unresponsive, and then one of the instances crashed with the following error in the alert log file:

Fri Jun 17 02:20:26 2011
System State dumped to trace file

/u01/app/oracle/admin/orcl/bdump/orcl_lgwr_1.trc
Killing enqueue blocker (pid=191) on resource

CF-00000000-00000000 by killing session 56.1

I found a Metalink note: ID 779552.1


This note describes the kill blocker interface, a mechanism through which Oracle kills a blocking process during periods of high load. If that blocking process is a background process, then the instance crashes too; Oracle prefers crashing the instance to leaving it in a hang situation. This behavior can be overridden in two ways, both of which rely on hidden parameters: in one case the kill blocker interface is disabled altogether, and in the other it is prevented from killing background processes.
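
As best I recall, the two approaches in the note map to underscore parameters along the following lines. Treat this strictly as a sketch: verify the exact names and values against MOS 779552.1, and change hidden parameters only under the guidance of Oracle Support.

-- Disable the kill blocker interface for the CF enqueue altogether
-- (parameter name and value as I recall them from the note):
SQL> alter system set "_kill_controlfile_enqueue_blocker" = false scope=spfile;

-- Or leave the interface enabled but stop it from killing background processes
-- (value per my recollection; confirm before use):
SQL> alter system set "_kill_enqueue_blocker" = 1 scope=spfile;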

So at the time of this instance crash there was a very high load on the server due to the RMAN backups, which is why the kill blocker interface killed the background process and the instance crashed.

Sunday, June 5, 2011

Please Vote for Oracle OpenWorld Suggest-a-Session on Oracle Mix

Back by popular demand: Oracle OpenWorld Suggest-a-Session on Oracle Mix.

Please vote for my session (Administration of Automatic Diagnostic Repository) for Oracle OpenWorld 2011.

To vote, click here.