Once I was contacted by a customer complaining of poor RAC performance, hangs, etc…
It was a 4-node RAC with ASM as the storage solution, running about 10 databases.
Alert.logs were full of messages like:
ORA-00240: control file enqueue held for more than 120 seconds
ORACLE Instance XXX - Can not allocate log, archival required
Thread 2 cannot allocate new log, sequence NNNN
All online logs needed archiving
One note about the customer’s system: there was one special 5.6 TB diskgroup built from 3 LUNs, each of them a RAID-5 array of SATA disks. Don’t laugh…:-) It worked quite well as an archive log destination and as a place for disk backups, especially compressed ones. It was not intended for any other types of database files. This is quite important!
A quick look revealed that this diskgroup contained some other types of files as well: the primary copies of controlfiles, the second members of redo logs, spfiles…
The first recommendation was to move all these files out of this diskgroup into the diskgroups intended for such file types.
It was a quick solution and it helped for some time, but… after several days the customer contacted me again with the same issue.
After some investigation I found that the diskgroup contained more than 100 thousand files – most of them archivelogs – and aggregation queries on V$ASM_FILE took almost 5 minutes to complete! Some work with Metalink followed – the bug was identified and a backport for the customer’s platform requested. But let’s have a look at the issue step by step.
We are interested in ASM performance with a HUGE number of files.
The Test Case.
Some details about my demo system: it’s an HP rx2620 single-CPU server with one 10K RPM 36GB SCSI disk dedicated to the ASM diskgroup. It’s a single-node RAC cluster, but RAC itself is not the issue – except that in a multi-node RAC environment the situation gets even worse…
A known ASM fact (feature): ASM is out of the database I/O path, but when we need to create a new ASM file or extend an existing one, the database has to ask ASM (synchronously) to create the file, allocate new allocation units and return the extent map.
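For example, creating a new tablespace on ASM is exactly such a synchronous request – the DDL does not return until ASM has created the file, allocated the allocation units and passed the extent map back to the database instance (the diskgroup and tablespace names below are just for illustration):

```sql
--on DB instance: this DDL returns only after ASM has created the file,
--allocated the allocation units and handed the extent map back
CREATE TABLESPACE demo_ts DATAFILE '+DATA' SIZE 100M
  AUTOEXTEND ON NEXT 10M;
```

Every subsequent autoextend of that datafile is a synchronous request to ASM as well.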
Let’s see how fast we can query ASM for the number of files grouped by type
--on ASM instance
SET TIMING ON
SELECT group_number, type, COUNT(*)
  FROM v$asm_file
 GROUP BY group_number, type
/

GROUP_NUMBER TYPE              COUNT(*)
------------ --------------- ----------
           1 DATAFILE                 4
           1 TEMPFILE                 1
           1 BACKUPSET                5
           1 ONLINELOG               10
           1 ARCHIVELOG              15
           1 CONTROLFILE              2
           1 PARAMETERFILE            1

7 rows selected.

Elapsed: 00:00:00.02
Notice elapsed time.
So, let’s add some more files to our single-disk ASM diskgroup.
Remember – it will take some time, maybe hours…
--on DB instance
BEGIN
  FOR i IN 1..25000 LOOP
    EXECUTE IMMEDIATE 'alter system switch logfile';
  END LOOP;
END;
/
Let’s query again for the number of files grouped by type
--on ASM instance
SELECT group_number, type, COUNT(*)
  FROM v$asm_file
 GROUP BY group_number, type
/

GROUP_NUMBER TYPE              COUNT(*)
------------ --------------- ----------
           1 DATAFILE                 4
           1 TEMPFILE                 1
           1 BACKUPSET                5
           1 ONLINELOG               10
           1 ARCHIVELOG           25015
           1 CONTROLFILE              2
           1 PARAMETERFILE            1

7 rows selected.

Elapsed: 00:00:05.12
Look at the elapsed time – more than 5 seconds(!!!) – 256 times slower…
The bad news is that you will never get back to the “good old” performance until you drop and recreate this diskgroup!
So, you may ask – what’s the problem with those 5 seconds?
- 5 seconds was measured on an idle system!
- 5 seconds is not the limit! On our customer’s system I have seen almost 5 minutes on a SATA-disk-based ASM diskgroup(!)
- queries on V$ASM_FILE, V$ASM_ALIAS, … are executed quite often by Database Console & Enterprise Manager Grid Control agents from every node of the cluster…
- because of BUG:6270137, queries on V$ASM_FILE, V$ASM_ALIAS, … block some ASM operations, such as file extension or new file creation…
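The blocking is easy to reproduce with two concurrent sessions (a sketch on an unpatched system, using the same diskgroup as above):

```sql
--session 1, on ASM instance: a long-running aggregation over ASM metadata
SELECT group_number, type, COUNT(*)
  FROM v$asm_file
 GROUP BY group_number, type;

--session 2, on DB instance, started while session 1 is still running:
--the log switch needs a new-file request to ASM, and on an unpatched
--system it stalls until the query above completes
ALTER SYSTEM SWITCH LOGFILE;
```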
I performed some tests to measure the impact on system performance with and without the patch for BUG:6270137 applied, and with and without activity on V$ASM_FILE:
- 4400 archive log switches per hour without any activity on the V$ASM views
- 164 archive log switches per hour without the fix for BUG:6270137 installed and with constant activity on the V$ASM_FILE view. Notice: 26.8 times worse(!), with constant alert.log messages “All online logs needed archiving“. And I believe that with more files and/or SATA disks performance would be even worse!
- 2223 archive log switches per hour with the fix for BUG:6270137 installed and with constant activity on the V$ASM_FILE view. Notice: almost 2 times worse than with no activity on the V$ASM_FILE view, but 13.5 times better than without the patch for BUG:6270137.
- the situation is not at all far-fetched – a 5TB SATA-based diskgroup is quite cheap, and with the default redo log file size of 50MB it can accumulate more than 100 thousand files
- once a diskgroup has been filled with a large number of files, the only way to revert to good performance at this time is to drop and recreate the diskgroup. I think an Enhancement Request should be filed to clean up old unused file entries in ASM (for example after a REBALANCE operation).
- high activity on the V$ASM views may be generated by Database Control or Enterprise Manager Grid Control agents from every node of the cluster, so a temporary workaround is to disable gathering of ASM information in Database Control or Grid Control.
- the patch for BUG:6270137 fixes the issue, but it conflicts with some other important ASM patches, so an MLR may need to be requested. The fix for this bug is included in 10.2.0.5, 188.8.131.52, 11.2
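The ratios quoted in the test results are simple arithmetic on the measured switch rates (4400, 164 and 2223 switches per hour):

```sql
--relative slowdown/speedup, computed from the test figures above
SELECT ROUND(4400/164, 1)  AS unpatched_slowdown,    -- 26.8x worse without the fix
       ROUND(4400/2223, 1) AS patched_slowdown,      -- almost 2x worse even with the fix
       ROUND(2223/164, 1)  AS patched_vs_unpatched   -- ~13.5x better with the fix
  FROM dual;
```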