Exadata Database UNKNOWN issue after Grid Upgrade PRCR-1133 PRCR-1132 CRS-2680 CRS-5807 CRS-2680 CRS-5807

In this post ,we are going to discuss about a recent issue we had after completing the exadata grid upgrade from 12.1.0.2 to 12.2.0.1. After completing the grid upgrade, we started seeing some databases status in UNKNOWN status with crsctl status command. The srvctl command show the status of the Database as not running though they are actually running. So let’s see the step by step troubleshooting of this issue.

NOTE : When you are troubleshooting any issue it is very important to first understand the issue and all its symptoms. We can’t just search for error string in google or in Oracle Support portal and start applying the solutions. This is not the right approach. We always suggest my blog followers and my students to first understand the issue and all its error messages. Let’s start discussing our issue….

After completing the GRID upgrade, the following symptoms are identified ….

srvctl shows the database status as “not running” though the database is actually up and running
Can’t manage the database(start or stop ) using the srvctl utility
SQLPLUS is the only utility to start and stop any instance of the RAC database
CRSCTL shows the database resource status as UNKNOWN.

With above mentioned symptoms of the issue, any experienced RAC should be able to tell you that the database is not properly talking to clusterware to update its status. That being said, the clusterware is unable to manage the database. These kind of situations arise when any cluster rac nodes gets rebooted abruptly. In our case, It may or may not be the grid upgrade that caused this issue, but there could be some previous occurances where the rac node got rebooted and there was a mismatch of Oracle database status in Oracle Clusterware. So how do we let the clusterware tell to reset the database status and start managing the database? Simple , just disable the resource and re-enable the resource using srvctl utility. But many DBAs are not convinced with the solution provided by the experience, they need a step by step troubleshooting steps …

Check 1:

Validate all the symptoms of the issue specified above. This confirms the status mismatch with Oracle Clusterware.

I am considering one database to show you the commands, but this issue is with many database on Exadata.

The following output shows thath the database is not running on any of the clusterware exadata compute node.

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl status database -d ASKMDB
Instance ASKMDB1 is not running on node exaaskmdb01
Instance ASKMDB2 is not running on node exaaskmdb02
Instance ASKMDB3 is not running on node exaaskmdb03
Instance ASKMDB4 is not running on node exaaskmdb04
Instance ASKMDB5 is not running on node exaaskmdb05
Instance ASKMDB6 is not running on node exaaskmdb06
[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$

But the above output is not correct. The database instance is actually running on node2 exaaskmdb02.

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ sqlplus '/as sysdba'

SQL*Plus: Release 11.2.0.4.0 Production on Fri Sep 14 00:36:16 2018

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL> sho parameter spfile

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
spfile string +ASKM_DG/ASKMDB/spfileASKMDB.ora

If I try to start this instance using srvctl , it gives me the following error messages …..

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl start instance -d ASKMDB -i ASKMDB2
PRCR-1013 : Failed to start resource ora.askmdb.db
PRCR-1064 : Failed to start resource ora.askmdb.db on node exaaskmdb02
CRS-2680: Clean of 'ora.askmdb.db' on 'exaaskmdb02' failed
CRS-5802: Unable to start the agent process

But the same instance can be started and stopped using sqlplus.

The CRSCTL status of this database resource shows the following UNKNOWN status ….

[grid@exaaskmdb01 ~]$ crsctl status res ora.askmdb.db -t
--------------------------------------------------------------------------------
Name           Target State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.askmdb.db
       1        ONLINE ONLINE       exaaskmdb02                 Open,HOME=/u01/app/o
                                                              racle/product/11.2.0
                                                              .4/dbhome_1,STABLE
       2        OFFLINE OFFLINE                               STABLE
       3        OFFLINE OFFLINE                               STABLE
       4        ONLINE UNKNOWN      exaaskmdb02                 STABLE
       5        ONLINE OFFLINE                               STABLE
       6        ONLINE OFFLINE                               STABLE
--------------------------------------------------------------------------------
[grid@exaaskmdb01 ~]$

Check 2 :

With Check1, we confirmed that all the symptoms we specified are met. Now lets see what is being populated in logs and what logs files we need to review …….

All cluster related issues are logged to clusteware logs. These logs are located in /u01/app/grid/diag/crs/exaaskmdb02/crs/trace. Check the logs alert.log , crsd.trc and crsd_oraagent_oracle.trc files to check the error messages that states the cluster resource status changed to UNKNOWN status with the reason for this status change. As a DBA we should be able to understand the log formatting messages of the clusterware trace files.

Solution

The following steps are listed to let any DBA understand the sequence of steps to execute to correct the database UNKNOWN status.

1. List the database status and check all the instances status

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl status database -d ASKMDB
Instance ASKMDB1 is not running on node exaaskmdb01
Instance ASKMDB2 is not running on node exaaskmdb02
Instance ASKMDB3 is not running on node exaaskmdb03
Instance ASKMDB4 is not running on node exaaskmdb04
Instance ASKMDB5 is not running on node exaaskmdb05
Instance ASKMDB6 is not running on node exaaskmdb06

2. List the services status and check where the services are running.

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl status service -d ASKMDB
Service ASKMDBSC is running on instance(s) ASKMDB2

3. If the database is running a single instance,
3.a. Enable new database instance on a different node using srvctl

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl enable instance -d ASKMDB -i ASKMDB3

3.b. Start this new instance using srvctl

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl start instance -d ASKMDB -i ASKMDB3

3.c. Modify the database services configuration to make the new instance as the preferred instance

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl modify service -d ASKMDB -s ASKMDBSC -n -i ASKMDB3 -a ASKMDB1,ASKMDB2,ASKMDB4,ASKMDB5,ASKMDB6

3.d. Verify the database configuration that the new instance is listed as the preferred instance.

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl config service -d ASKMDB -s ASKMDBSC
Service name: ASKMDBSC
Service is enabled
Server pool: ASKMDB_ASKMDBSC
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Preferred instances: ASKMDB3
Available instances: ASKMDB1,ASKMDB2,ASKMDB4,ASKMDB5,ASKMDB6
[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$

4. If the database is running multiple instances,
4.a. List the database configuration and check if the services have multiple instances as preferred instances.
4.b. Modify if we need to adjust the service configuration, to have atleast one running instance(Other than problematic instance) as preferred instance.
5. Relocate the database services to the new instance.

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl status service -d ASKMDB -s ASKMDBSC
Service ASKMDBSC is running on instance(s) ASKMDB3

6. Stop the problematic instance using srvctl. <<< If there are any errors during this step, it's expected. Don't worry about the error messages, but proceed to the next step.

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl stop instance -d ASKMDB -i ASKMDB2
PRCR-1133 : Failed to stop database ASKMDB and its running services
PRCR-1132 : Failed to stop resources using a filter
CRS-2680: Clean of 'ora.askmdb.db' on 'exaaskmdb02' failed
CRS-5807: Agent failed to process the message
CRS-2680: Clean of 'ora.askmdb.db' on 'exaaskmdb02' failed
CRS-5807: Agent failed to process the message
[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$

7. Disable the instance using srvctl

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl disable instance -d ASKMDB -i ASKMDB2

8. Enable the instance using srvctl

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl enable instance -d ASKMDB -i ASKMDB2

9. Start the instance using srvctl

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl status database -d ASKMDB
Instance ASKMDB1 is not running on node exaaskmdb01
Instance ASKMDB2 is not running on node exaaskmdb02
Instance ASKMDB3 is running on node exaaskmdb03
Instance ASKMDB4 is not running on node exaaskmdb04
Instance ASKMDB5 is not running on node exaaskmdb05
Instance ASKMDB6 is not running on node exaaskmdb06
[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ ps -ef | grep -i pmon | grep -i ASKMDB
[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl start instance -d ASKMDB -i ASKMDB2
[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl status database -d ASKMDB
Instance ASKMDB1 is not running on node exaaskmdb01
Instance ASKMDB2 is running on node exaaskmdb02
Instance ASKMDB3 is running on node exaaskmdb03
Instance ASKMDB4 is not running on node exaaskmdb04
Instance ASKMDB5 is not running on node exaaskmdb05
Instance ASKMDB6 is not running on node exaaskmdb06
[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$

10. Now check the status of the database resource using crsctl

[grid@exaaskmdb02 ~]$ crsctl status res ora.askmdb.db -t
--------------------------------------------------------------------------------
Name           Target State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.askmdb.db
       1        OFFLINE OFFLINE                               Instance Shutdown,ST
                                                              ABLE
       2        OFFLINE OFFLINE                               Instance Shutdown,ST
                                                              ABLE
       3        OFFLINE OFFLINE                               STABLE
       4        ONLINE ONLINE       exaaskmdb02                 Open,HOME=/u01/app/o
                                                              racle/product/11.2.0
                                                              .4/dbhome_1,STABLE
       5        ONLINE OFFLINE                               STABLE
       6        ONLINE OFFLINE                               STABLE
--------------------------------------------------------------------------------
[grid@exaaskmdb01 ~]$

11. Once it is confirmed that the database shows the correct status using crsctl, proceed to bring down the temporary instance that we started.

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl stop instance -d ASKMDB -i ASKMDB3

12. Relocate the services back to the original instance once the problem is resolved with that instance.
13. Stop the temporary instance that we started.

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl status database -d ASKMDB
Instance ASKMDB1 is not running on node exaaskmdb01
Instance ASKMDB2 is running on node exaaskmdb02
Instance ASKMDB3 is not running on node exaaskmdb03
Instance ASKMDB4 is not running on node exaaskmdb04
Instance ASKMDB5 is not running on node exaaskmdb05
Instance ASKMDB6 is not running on node exaaskmdb06
[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl status service -d ASKMDB
Service ASKMDBSC is running on instance(s) ASKMDB2
[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$

14. Disable the temporary instance that we started.

[oracle@exaaskmdb02.askmlabs.com:ASKMDB2]$ srvctl disable instance -d ASKMDB -i ASKMDB3

Hope this helps …

askMLabs