16 August 2010, 21:04

 

 

I'm still surprised to see Solaris servers that are not configured to take a crash dump. The next few lines explain how to set this up, especially on Solaris x86, where the dump is triggered with an NMI (non-maskable interrupt).

 

 

To achieve this, several elements must be configured on the Solaris system:

  • Booting the system in debug mode
  • Correct crash dump settings
  • System configuration for NMI interrupts

 

How do you boot Solaris 10 x86 in debug mode? It's really simple: add the "kadb" parameter at the end of the "kernel$" line that loads "multiboot" in the file "menu.lst", then reboot.

 

As you can see below:

 

# pwd
/rpool/boot/grub

# cat menu.lst
[...]
title s10x_u9wos_14a
bootfs rpool/ROOT/s10x_u9wos_14a
findroot (pool_rpool,0,a)
kernel$ /platform/i86pc/multiboot -B console=ttyb,$ZFS-BOOTFS kadb
module /platform/i86pc/boot_archive

[...] 
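Before rebooting, I find it handy to double-check which boot entries actually request the debugger. Here is a minimal sketch, not part of the original setup: the function name check_debug_boot is hypothetical, it assumes a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris), and it treats both "kadb" and "kmdb" as debugger keywords.

```shell
# Hypothetical helper: print every kernel$ line of a GRUB menu.lst,
# prefixed with "debugger" or "no-debugger".
check_debug_boot() {
    menu="${1:-/rpool/boot/grub/menu.lst}"
    awk '/^[[:space:]]*kernel\$/ {
        flag = "no-debugger"
        # Look at every boot argument after the "kernel$" keyword.
        for (i = 2; i <= NF; i++)
            if ($i == "kadb" || $i == "kmdb") flag = "debugger"
        print flag ": " $0
    }' "$menu"
}
```

Called without an argument, it reads /rpool/boot/grub/menu.lst, the path used in the example above.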

 

How do you configure the correct dump settings? Just use the "dumpadm" command. Two things to check: that the dump device and the savecore directory exist, and that they are large enough. The required size depends on the amount of RAM, and there are two different policies on this subject: either the dump device is the same size as the RAM, or it is smaller.

 

For example:

 

# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/zlap
  Savecore enabled: no
   Save compressed: on

# prtconf | grep -i memory
Memory size: 16384 Megabytes

# zfs get volsize rpool/dump
NAME           PROPERTY  VALUE    SOURCE
rpool/dump  volsize       512M     -

# df -h /var/crash/zlap
Filesystem            Size  Used Avail Use% Mounted on
rpool/ROOT/solaris     54G  9.9G   16G  39% / 
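To compare the dump device with the RAM at a glance, the two figures can be turned into a percentage. The sketch below is only an illustration: size_to_mb and dump_device_pct are hypothetical names, and the suffix handling (K, M, G, T, or plain bytes) is my assumption about the size format printed by "zfs get".

```shell
# Hypothetical helper: convert a zfs-style size (512M, 1.50G, ...) to whole MB.
size_to_mb() {
    echo "$1" | awk '{
        unit = toupper(substr($0, length($0), 1))
        num  = substr($0, 1, length($0) - 1) + 0
        if      (unit == "K") mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else if (unit == "T") mb = num * 1024 * 1024
        else                  mb = ($0 + 0) / 1048576   # no suffix: assume bytes
        printf "%d\n", mb
    }'
}

# Hypothetical helper: dump device size as an integer percentage of RAM.
# usage: dump_device_pct <volsize> <ram_in_mb>
dump_device_pct() {
    vol_mb=$(size_to_mb "$1")
    echo $(( vol_mb * 100 / $2 ))
}
```

On a live system the inputs could come from "zfs get -H -o value volsize rpool/dump" and from the "Memory size" line of prtconf. With the values shown above, dump_device_pct 512M 16384 gives 3, that is, a dump device of about 3% of the RAM.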

 

How do you configure the system for NMI interrupts? Just add the following lines to the file /etc/system, then reboot.

 

For example:

 

# egrep apic /etc/system
set pcplusmp:apic_kmdb_on_nmi=1
set pcplusmp:apic_panic_on_nmi=1
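A quick way to verify that both tunables are really present (and not commented out) is a small check like the one below. It is only a sketch: check_nmi_tunables is a hypothetical name, and the tolerance for spaces around "=" is my assumption about what you may find in the file.

```shell
# Hypothetical helper: report whether both NMI tunables are set to 1.
# Returns 0 when both are present, 1 otherwise.
check_nmi_tunables() {
    sysfile="${1:-/etc/system}"
    rc=0
    for var in apic_kmdb_on_nmi apic_panic_on_nmi; do
        # Match active "set" lines only; comment lines start with "*".
        if grep -E "^[[:space:]]*set[[:space:]]+pcplusmp:${var}[[:space:]]*=[[:space:]]*1([[:space:]]|\$)" "$sysfile" >/dev/null; then
            echo "ok: pcplusmp:${var}=1"
        else
            echo "missing: pcplusmp:${var}=1"
            rc=1
        fi
    done
    return $rc
}
```

Called without an argument, it reads /etc/system.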

 

Now, if the system hangs, you can send an NMI and thus take a dump. Either use an IPMI command such as "ipmitool" (if IPMI is available on the server's ILOM), or use the web interface of the server's ILOM to generate the NMI.

 

For example (ipmi command):

 

# ipmitool -I lanplus -H server-rsc -U root chassis power diag
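Since this command interrupts the target immediately, a small wrapper with a dry-run default can avoid hitting the wrong server. This is a sketch under my own assumptions: send_nmi and the DRY_RUN variable are hypothetical, and the default user "root" simply mirrors the example above.

```shell
# Hypothetical wrapper: print the ipmitool command by default,
# and only execute it when DRY_RUN=0 is set explicitly.
# usage: send_nmi <bmc-host> [user]
send_nmi() {
    [ -n "$1" ] || { echo "usage: send_nmi <bmc-host> [user]" >&2; return 2; }
    host="$1"
    user="${2:-root}"
    set -- ipmitool -I lanplus -H "$host" -U "$user" chassis power diag
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"    # dry run: show what would be sent
    else
        "$@"         # really send the diagnostic interrupt (NMI)
    fi
}
```

Run "send_nmi server-rsc" first to read the command back, then "DRY_RUN=0 send_nmi server-rsc" to actually send the NMI.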

 

 

A simple demonstration...

 

On the server:

 

$ ssh zlap

 

# uname -a
SunOS zlap 5.10 Generic_142910-17 i86pc i386 i86pc

# prtdiag
System Configuration: HP ProLiant DL360 G5
BIOS Configuration: HP P58 05/18/2009
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)
[...]

 

 

On the server's iLO:

 

$ ssh admin@zlap-rsc

root@zlap-rsc's password:
User:root logged-in to ILOLD75MU6996.(XXX.XXX.XXX.XX)
iLO 2 Advanced 2.05 at 15:38:15 Dec 17 2009
Server Name: DL360G5P-34-13
Server Power: On

</>hpiLO->
</>hpiLO-> nmi server

</>hpiLO-> vsp

 

Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.

</>hpiLO-> Virtual Serial Port active: IO=0x02F8 INT=3

[1]> 
[1]> ::showrev
Hostname: zlap
Release: 5.10
Kernel architecture: i86pc
Application architecture: amd64
Kernel version: SunOS 5.10 i86pc Generic_142910-17
Platform: i86pc

[1]> $<systemdump
nopanicdebug:   0               =       0x1

panic[cpu1]/thread=fffffe80005e0c60: BAD TRAP: type=e (#pf Page fault) rp=fffffe80005e0980 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference

sched: #pf Page fault
Bad kernel fault at addr=0x0
pid=0, pc=0x0, sp=0xfffffe80005e0a78, eflags=0x10002
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
cr2: 0 cr3: 14717000 cr8: c
        rdi: fffffffffbc7eab0 rsi:              2f8 rdx:              2f8
        rcx:                a  r8:                0  r9: ffffffffa4f71c90
        rax: fffffffffbcecbe0 rbx: ffffffffef8f1250 rbp: fffffe80005e0a80
        r10: fffffe80005e09c0 r11:                0 r12: fffffe80005e0af0
        r13:                1 r14: fffffffffbc561c0 r15:                1
        fsb:                0 gsb: ffffffffa44fb800  ds:               43
         es:               43  fs:                0  gs:              1c3
        trp:                e err:                0 rip:                0
         cs:               28 rfl:            10002 rsp: fffffe80005e0a78
         ss:               30

fffffe80005e0890 unix:die+da ()
fffffe80005e0970 unix:trap+5e6 ()
fffffe80005e0980 unix:cmntrap+140 ()
fffffe80005e0a80 0 ()
fffffe80005e0a90 genunix:kdi_dvec_enter+d ()
fffffe80005e0ab0 unix:debug_enter+66 ()
fffffe80005e0ac0 pcplusmp:apic_nmi_intr+94 ()
fffffe80005e0ae0 unix:av_dispatch_nmivect+1f ()
fffffe80005e0af0 unix:nmiint+17e ()
fffffe80005e0be0 unix:i86_mwait+d ()
fffffe80005e0c20 unix:cpu_idle_mwait+125 ()
fffffe80005e0c40 unix:idle+89 ()
fffffe80005e0c50 unix:thread_start+8 ()

syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
 0:01 100% done
100% done: 320938 pages dumped, dump succeeded
rebooting...
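The "pages dumped" figure in the last lines can be converted into a size, which helps when sizing the dump device and the savecore directory. A minimal sketch, assuming 4 KiB pages (the "pagesize" command reports the real value on a given machine); pages_to_mb is a hypothetical name.

```shell
# Hypothetical helper: convert a number of dumped pages to whole megabytes.
# usage: pages_to_mb <pages> [page_size_bytes]
pages_to_mb() {
    pages="$1"
    page_size="${2:-4096}"   # assumption: 4 KiB base pages
    echo $(( pages * page_size / 1048576 ))
}
```

For the dump above, pages_to_mb 320938 gives 1253, so roughly 1.2 GB of kernel pages before any compression.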

 

 

It's very simple, isn't it?

 

 

For your general knowledge, here are some links on the topic:

 

 

 

 

 

Published by gloumps in kernel