Troubleshooting ‘Log File Sync’ Waits

I have been contacted by one of our customers to provide reference information on troubleshooting Oracle Log File Sync waits.

I think this information is worth a short blog post.

Reasons:

  • Log File Sync waits occur when sessions wait for redo data to be written to disk
  • typically this is caused by slow writes (I/O subsystem saturation, …) – see the quick check after this list
  • spikes in Log File Parallel Write, as shown by James Morle
  • or the application is committing too frequently
  • improper operating system configuration (check MOS note 169706.1)
  • CPU saturation (very high CPU demand => LGWR waiting on the run queue; see Kevin Closson’s post)
  • high log parallelism, which saturates the filesystem/OS, as investigated by Nikolay Savvinov
  • bugs in Oracle (especially with the RAC option) and 3rd-party software (like ODM/DISM)
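
A quick way to see whether the waits come from slow redo writes or from something else (commit rate, CPU starvation) is to compare the average times of ‘log file sync’ and ‘log file parallel write’. A minimal sketch, assuming access to V$SYSTEM_EVENT (cumulative values since instance startup):

  -- If both events show high average waits, the redo I/O path is slow;
  -- if only 'log file sync' is high, look at commit frequency, CPU starvation or post/wait issues.
  SELECT event,
         total_waits,
         ROUND(time_waited_micro / 1000000) AS time_waited_s,
         ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 1) AS avg_wait_ms
  FROM   v$system_event
  WHERE  event IN ('log file sync', 'log file parallel write');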

Recommendations:

  • tune the LGWR process for good throughput, especially when ‘log file parallel write’ is high too:
    • do not put redo logs on RAID 5 without a good write cache
    • do not put redo logs on Solid State Disk (SSD)

    It looks like the last recommendation was based on old experience with SSD disks; it is obsolete now, and even Oracle recommends using SSDs for redo logs (1566935.1, Implementing Oracle E-Business Suite 12.1 Databases on Oracle Database Appliance):

“Move REDO log files to +REDO diskgroup on Solid State Disks (SSDs).”

  • if CPUs are saturated (check the run queue with vmstat):
    • check for non-Oracle system activity, like gzip or bzip2 running during business hours…
    • lower the instance’s CPU usage (for example, tune SQL to reduce LIOs)
    • increase LGWR priority (renice or _high_priority_processes; see the parameter sketch after this list)
  • decrease the number of COMMITs in applications with many short transactions
  • use asynchronous commits – COMMIT WRITE [BATCH] NOWAIT (10g+) – when possible; see the first sketch after this list
  • do some processing with NOLOGGING (or maybe even with _disable_logging=TRUE, but only when measuring performance impact on a test/benchmark system), and think about database recoverability
  • lower the system’s CPU usage or increase LGWR priority
  • if you see spikes in ‘log file sync’, try disabling Adaptive Log File Sync (_use_adaptive_log_file_sync=FALSE; see the parameter sketch after this list)
  • check whether 3rd-party software or utilities like RMAN are active on the same disks where the redo logs are placed, or whether trace/systemstate dump files are written there, etc.
  • if you are on a multi-CPU/core system, try restricting log parallelism (_log_parallelism_max=1)
  • trace LGWR as a last resort when troubleshooting OS/3rd-party issues 😉
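
To make the commit-frequency and NOLOGGING items concrete, here is a minimal sketch (the “first sketch” referenced above); the table, sequence and object names are hypothetical examples:

  -- Asynchronous, batched commit (10g+): the session does not wait for LGWR
  -- on every COMMIT, which removes most of the 'log file sync' time.
  -- Acceptable only where losing the last few commits on a crash is tolerable.
  INSERT INTO audit_log (id, msg) VALUES (audit_seq.NEXTVAL, 'row processed');
  COMMIT WRITE BATCH NOWAIT;

  -- NOLOGGING helps only direct-path operations, and the loaded data cannot be
  -- recovered from the redo stream – take a backup afterwards.
  ALTER TABLE stage_data NOLOGGING;
  INSERT /*+ APPEND */ INTO stage_data SELECT * FROM external_stage;
  COMMIT;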
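
And a sketch of the parameter-level changes mentioned above (the “parameter sketch”). These are underscore parameters, so the values below are assumptions on my side – test them first and/or agree them with Oracle Support:

  -- Disable adaptive switching between post/wait and polling (11.2+),
  -- useful when 'log file sync' shows periodic spikes.
  ALTER SYSTEM SET "_use_adaptive_log_file_sync" = FALSE;

  -- Restrict redo generation parallelism on multi-CPU/core systems (restart needed).
  ALTER SYSTEM SET "_log_parallelism_max" = 1 SCOPE = SPFILE;

  -- Run LGWR at elevated priority so it does not sit on the run queue (restart needed).
  ALTER SYSTEM SET "_high_priority_processes" = 'LMS*|LGWR' SCOPE = SPFILE;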

9 thoughts on “Troubleshooting ‘Log File Sync’ Waits”

  1. Interesting post, Oleksandr, thanks

    But why “do not put redo logs on Solid State Disk (SSD)”?
    Oracle itself uses SSD very actively in Exadata (for redo), and we have some positive experience with SSD for redo as well
    What can be wrong with SSD?

    • Igor,
      It’s because in Exadata SSD IS NOT even USED as THE ONLY STORAGE option for redo logs.
      It’s used as a COMPLEMENT to traditional HDD (to overcome another issue –
      not having enough disk spindles, and low IOPS as a result):
      DB nodes get their WRITE CONFIRMATION once the redo data has been successfully
      written to either HDD or SSD (whichever is first), so SSD and HDD COMPLEMENT EACH OTHER,
      because neither one is perfect…

      It’s also because of the SSD Write Penalty that comes from Garbage Collection.

      SSD is especially GOOD for SMALL READ (and even WRITE) IOs,
      which is not what we see in the REDO write profile.

      Additionally, you may be interested in reading:
      Gwen Shapira: “De-Confusing SSD (for Oracle Databases)”

      • I’ve read the recommendations you pointed to, and some others from various MOS notes too, and I’m sure this theory is a bit outdated

        Look at practical results from AWR for a 1-hour period –

        1) ASM-iSCSI-SAS configuration:

                                                                       Avg                
                                                  %Time Total Wait    wait    Waits   % DB/bg
          Event                             Waits -outs   Time (s)    (ms)     /txn   time
          -------------------------- ------------ ----- ---------- ------- -------- ------
        ...
          log file sync                    43,605     0      2,963      68      0.1   10.1
        ...
          log file parallel write         193,830     0      2,067      11      0.3   31.2

        2) ASM-iSCSI-SSD configuration:

          log file sync                    55,095     0        574      10      0.1    2.7
        ...
          log file parallel write         302,177     0        622       2      0.4   21.5

        And for another heavy-loaded OLTP system with direct-attached SSD and ASM:

        Event                            Waits %Time-outs Total Wait Time (s) Avg wait (ms) Waits/txn % DB time
        ...
        log file sync                4,741,351          0              12,006             3      0.31     12.88
        ...
        log file parallel write      3,367,407          0               1,047             0      0.22     10.74

        The same recommendations may be found in Steve Shaw’s Improve Database Performance: Redo and Transaction Logs on Solid State Disks (SSDs)

        Maybe the “do not put redo logs on Solid State Disk (SSD)” advice is a bit of a tricky tip from Oracle as a hardware company, with some marketing influence? 🙂

  2. Igor,
    I support your thoughts about “this is a bit outdated theory” in the sense that
    SSD manufacturers constantly improve their Garbage Collection techniques, making them asynchronous, etc.,
    and I suppose that enterprise-level SSDs should be good for redo too,
    but my personal opinion is that an SSD is just a tool and has to be used in the right place,
    where we can get the most out of it.
    BTW:
    “ASM-iSCSI-SSD” and “another heavy-loaded OLTP system with direct-attached SSD”
    may use RAM-based SSD 😉

      • Igor,
        it was just a kind of joke to show
        that I really don’t know your environments
        and that there are some other attributes that
        need to be taken into account to make a correct judgement,
        like:
        – direct-attached SSD is not the same as network-attached SSD
        – what “network” means in the particular case
        – what SSD means in the particular case
        – …
        so there are more questions to ask than conclusions to draw

  3. Nice summary of a complex and sometimes confusing topic, Oleksander. And thank you for linking to my blog.

  4. Igor, getting 2ms “log file parallel write” time on SSD, vs. 11ms on iSCSI mostly means you solved a problem unrelated to spinning disks by using SSD… writing to a disk without seeks should not take 11ms, which means either something went wrong with the storage config (buffers? saturation?) or you are sharing the disks and therefore performing seeks. On the other hand, 2ms is SLOW for SSD, and you may want to check why you are not getting the write speed you clearly paid for…
