Starting MRP reboots Linux Server

Hi.

I ran into a quite unusual situation: starting Oracle Managed Recovery caused the Linux server to reboot.

It was version 10.2.0.4 and the results were consistent: a few seconds after issuing RECOVER MANAGED STANDBY DATABASE DISCONNECT, the Linux server simply went down for a reboot…

Wed Oct 14 15:35:58 2011
ALTER DATABASE RECOVER  managed standby database parallel 1 disconnect
Wed Oct 14 15:35:58 2011
Attempt to start background Managed Standby Recovery process (XXX)
MRP0 started with pid=16, OS id=21890
Wed Oct 14 15:35:58 2011
MRP0: Background Managed Standby Recovery process started (XXX)

After the server rebooted, the next lines in ALERT.LOG were:

Wed Oct 14 15:47:17 2011
Starting ORACLE instance (normal)

It looks like an ordinary instance startup, but with no shutdown before it, and, as I said, it came after the physical server was rebooted…
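The pattern above (a startup with no shutdown before it) can be spotted automatically. Here is a minimal Python sketch, not an Oracle tool. It assumes a clean shutdown leaves a line containing "Shutting down instance" in the alert log; that marker and the function name are my assumptions and should be checked against your own alert log.

```python
# Sketch: flag "Starting ORACLE instance" lines that were not preceded
# by a shutdown marker since the previous startup.
STARTUP_MARKER = "Starting ORACLE instance"
SHUTDOWN_MARKER = "Shutting down instance"  # assumption: text written on a clean shutdown


def find_unclean_restarts(alert_lines):
    """Return 1-based line numbers of startups that follow no shutdown marker."""
    unclean = []
    seen_activity = False   # any non-empty line seen so far
    seen_shutdown = False   # shutdown marker seen since the last startup
    for number, line in enumerate(alert_lines, start=1):
        text = line.strip()
        if not text:
            continue
        if text.startswith(STARTUP_MARKER):
            if seen_activity and not seen_shutdown:
                unclean.append(number)
            seen_shutdown = False
        elif SHUTDOWN_MARKER in text:
            seen_shutdown = True
        seen_activity = True
    return unclean


# The alert log excerpt from this post.
SAMPLE_ALERT_LOG = [
    "Wed Oct 14 15:35:58 2011",
    "ALTER DATABASE RECOVER  managed standby database parallel 1 disconnect",
    "Wed Oct 14 15:35:58 2011",
    "Attempt to start background Managed Standby Recovery process (XXX)",
    "MRP0 started with pid=16, OS id=21890",
    "Wed Oct 14 15:35:58 2011",
    "MRP0: Background Managed Standby Recovery process started (XXX)",
    "Wed Oct 14 15:47:17 2011",
    "Starting ORACLE instance (normal)",
]

if __name__ == "__main__":
    for line_number in find_unclean_restarts(SAMPLE_ALERT_LOG):
        print("startup without prior shutdown at line", line_number)
```

A startup at the very beginning of the file is not flagged, because nothing is known about what happened before it.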

It took a crazy amount of time to diagnose this unusual issue. I just didn’t believe that Oracle’s MRP was responsible for the server reboots, but the results were consistent. What drove me even crazier was that the system logs didn’t contain any useful information to answer the question ‘Why was the server rebooted?’

Later I found some interesting lines in the system log that pointed me in the right direction:

messages:Sep 30 15:53:46 oraclehost cmaidad[6892]: Accelerator Board Battery Failed: Slot 0.
messages:Sep 30 17:44:31 oraclehost cmaidad[6870]: Accelerator Board Status Change: Slot 0. Status is now Temporarily Disabled.
messages:Sep 30 17:44:32 oraclehost cmaidad[6870]: Accelerator Board Bad Data: Slot 0. Accelerator cache board has lost battery power. Data loss possible.

So the RAID controller was reporting ‘Battery Failed’, ‘Accelerator cache board has lost battery power’ and ‘Data loss possible’.
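Since these messages were easy to miss, a small filter can help pull them out of the system log. This is only a sketch in Python: the regular expression is built from the three cmaidad messages quoted above, and the function name and the extra non-matching sample line are made up for illustration. Other controllers and monitoring agents will log different text.

```python
import re

# Pattern derived from the cmaidad messages quoted in this post;
# treat it as an assumption for any other controller or agent.
CONTROLLER_WARNING = re.compile(
    r"Accelerator .*(Battery Failed|Temporarily Disabled|lost battery power)",
    re.IGNORECASE,
)


def find_controller_warnings(lines):
    """Return the log lines that look like RAID cache or battery warnings."""
    return [line.rstrip("\n") for line in lines if CONTROLLER_WARNING.search(line)]


SAMPLE_SYSLOG = [
    "Sep 30 15:53:46 oraclehost cmaidad[6892]: Accelerator Board Battery Failed: Slot 0.",
    "Sep 30 17:44:31 oraclehost cmaidad[6870]: Accelerator Board Status Change: Slot 0. "
    "Status is now Temporarily Disabled.",
    "Sep 30 17:44:32 oraclehost cmaidad[6870]: Accelerator Board Bad Data: Slot 0. "
    "Accelerator cache board has lost battery power. Data loss possible.",
    # Hypothetical unrelated line, added only to show that it is skipped.
    "Sep 30 17:45:00 oraclehost crond[1234]: (root) CMD (run-parts /etc/cron.hourly)",
]

if __name__ == "__main__":
    for hit in find_controller_warnings(SAMPLE_SYSLOG):
        print(hit)
```

To run it against a real log, pass an open file object instead of the sample list, for example `find_controller_warnings(open("/var/log/messages"))`, assuming that is where your system keeps its messages file.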

So I speculated that the data loss in question happened while the ARCHiver was writing a particular archived log to disk, and that this archived log BECAME BROKEN!

I deleted ALL archived logs on the standby site and started the MANAGED RECOVERY process, which DIDN’T FAIL: it successfully fetched all required logs from the primary via FAL and applied them.

PS. Unresolved questions:

  • I still don’t know how it’s possible for MRP to reboot a server. Definitely there is some BUG in Linux or in the driver software that made it possible.
  • Why didn’t Oracle just say ‘this archive log is broken’? By default DB_BLOCK_CHECKSUM=TRUE and Oracle writes a checksum in every redo block:

“Oracle Database uses the checksum to detect corruption in a redo log block. The database verifies the redo log block when the block is read from an archived log during recovery and when it writes the block to an archive log file. An error is raised and written to the alert log if corruption is detected.”
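To show what I expected to happen, here is a toy Python illustration of per-block checksum verification. It is NOT Oracle's redo block format or checksum algorithm: CRC32 from zlib, the 4-byte prefix layout and the function names are all stand-ins chosen for the sketch.

```python
import zlib


def write_block(payload: bytes) -> bytes:
    """Store a 4-byte CRC32 in front of the payload (illustrative layout only)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def verify_block(block: bytes) -> bool:
    """Recompute the checksum over the payload and compare it with the stored one."""
    stored = int.from_bytes(block[:4], "big")
    return zlib.crc32(block[4:]) == stored


def corrupt(block: bytes, offset: int) -> bytes:
    """Flip all bits of one payload byte, as a stand-in for a lost cached write."""
    damaged = bytearray(block)
    damaged[4 + offset] ^= 0xFF
    return bytes(damaged)


if __name__ == "__main__":
    good = write_block(b"redo change vector payload")
    bad = corrupt(good, 3)
    print("good block verifies:", verify_block(good))
    print("damaged block verifies:", verify_block(bad))
```

With a check like this, a reader of a damaged block gets a clean error instead of garbage, which is why I expected an error in the alert log and not a reboot.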
