48

The default journal mode for Ext4 is data=ordered, which, per the documentation, means that

"All data are forced directly out to the main file system prior to its metadata being committed to the journal."

However, there is also the data=journal option, which means that

"All data are committed into the journal prior to being written into the main file system. Enabling this mode will disable delayed allocation and O_DIRECT support."

My understanding is that data=journal journals all data as well as metadata, which on the face of it makes it the safest option for data integrity and reliability, though probably not the fastest.

Should I go with this option if reliability is of the utmost concern, but performance much less so? Are there any caveats to using this option?

For background, the system in question is on a UPS and write caching is disabled on the drives.

Tim

2 Answers

43

Yes, data=journal is the safest way of writing data to disk. Since all data and metadata are written to the journal before being written to the main file system, interrupted writes can be replayed from the journal after a crash. It also disables delayed allocation, a feature that can itself lead to data loss.

The three modes are listed in the manual in decreasing order of safety:

  1. data=journal
  2. data=ordered
  3. data=writeback

There's also another option which may interest you:

commit=nrsec    (*) Ext4 can be told to sync all its data and metadata
                    every 'nrsec' seconds. The default value is 5 seconds.
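
To check which of these settings a file system is actually mounted with, the options string can be parsed. The following is a minimal sketch, not an official tool: the function names are made up for illustration, and the fallback values (data=ordered, commit=5) are assumptions taken from the documentation quoted in this thread. The kernel may omit default options from /proc/mounts, which is why the fallbacks are needed.

```python
def parse_ext4_options(options: str) -> dict:
    """Return the journalling mode and commit interval from a mount-options string."""
    # Assumed defaults (from the documentation quoted in this thread):
    # data=ordered and commit=5 when the option is not listed explicitly.
    result = {"data": "ordered", "commit": 5}
    for opt in options.split(","):
        key, _, value = opt.strip().partition("=")
        if key == "data" and value:
            result["data"] = value
        elif key == "commit" and value:
            result["commit"] = int(value)
    return result


def ext4_mounts(path: str = "/proc/mounts"):
    """Yield (mount_point, parsed_options) for every ext4 entry in `path`."""
    with open(path) as mounts:
        for line in mounts:
            fields = line.split()
            # Fields: device, mount point, fs type, options, dump, pass
            if len(fields) >= 4 and fields[2] == "ext4":
                yield fields[1], parse_ext4_options(fields[3])
```

On a Linux host, `for mount_point, opts in ext4_mounts(): print(mount_point, opts)` lists every mounted ext4 file system with its effective mode under these assumptions.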

The main known caveat is that data=journal can be terribly slow, since all data is written twice: once to the journal and once to the main file system. You can reduce the performance impact by disabling access-time updates with the noatime mount option.

Coren
  • You point out that disabling delayed allocation is safer. However, I cannot find a case where data=journal gives a safer result than data=ordered + nodelalloc. Do you have one? – Jérôme Pouiller Dec 13 '16 at 14:45
  • It is not disabling the delayed allocation that can lead to data loss. – ctrl-alt-delor Feb 11 '19 at 10:55
  • According to https://www.kernel.org/doc/Documentation/filesystems/ext4.txt, enabling data=journal disables delayed allocation and O_DIRECT support. – Alex Nov 12 '20 at 21:52
  • It might be safer in general, but if you follow the pattern from the Wiki link, fd=open("file.new"); write(fd, data); close(fd); rename("file.new", "file"); (also mentioned in ext4(5), BTW), and leave the auto_da_alloc option enabled, then data=ordered (even without fsync()) should be equivalent and no less safe than data=journal, right? – Machta Feb 28 '21 at 11:04
  • That is my take-away from the documentation as well, but this is the exact scenario in which I frequently see data loss with data=ordered (e.g. a power loss shortly after rename results in "file.new" existing as an empty file). – jdizzle Jan 12 '24 at 15:04
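
The replace-via-rename pattern discussed in these comments can be made less dependent on mount options by calling fsync() explicitly before the rename. The sketch below is one way to write it, assuming a POSIX system; the helper name atomic_write and the ".new" suffix are illustrative only, and it makes no claim about what any particular journal mode guarantees without the fsync() calls.

```python
import os


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` using the write, fsync, rename pattern."""
    tmp_path = path + ".new"  # hypothetical temp name, mirrors "file.new" above
    directory = os.path.dirname(os.path.abspath(path))

    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()             # push Python's buffer to the kernel
        os.fsync(tmp.fileno())  # ask the kernel to flush the file to the device

    os.replace(tmp_path, path)  # rename: atomic within a single file system

    # fsync the containing directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is two extra fsync() calls per update, which is usually acceptable when reliability matters more than throughput.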
5

This thread is super old, but still relevant.

We wanted to merge many tiny writes on a MySQL database, running as a VM under KVM using Ceph RBD images.

Guest: CentOS 6 VM's /etc/fstab:

/dev/sda1               /                       ext4    defaults,usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,noatime,nodiratime,commit=60,data=journal,discard 1 1

The '/dev/sda' device (1 TiB) is in a compressed erasure coded NVMe pool, with a relatively tiny (128 MiB) dedicated journal device in a triple replicated NVMe pool.

These are the commands we ran from a rescue environment:

Detach the journal:

tune2fs -O ^has_journal /dev/sda1;

Check the file system for inconsistencies:

fsck.ext4 -f -C 0 /dev/sda1;

Obtain block size:

tune2fs -l /dev/sda1;

Format the dedicated journal device (WARNING: this erases whatever is on /dev/sdb1):

The journal must be at least 1024 file system blocks, i.e. 1024 * block size (we use 128 MiB to be safe).

Set the block size to match that of /dev/sda1:

mke2fs -O journal_dev -L root_journal /dev/sdb1 -b 4096;
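
As a rough check of the sizing rule above, the block size can be pulled out of the `tune2fs -l` output and turned into a minimum journal size. This is only a sketch: the function names are invented for illustration, and the 1024-block minimum is the figure stated in this answer.

```python
import re

MIN_JOURNAL_BLOCKS = 1024  # minimum from the answer: 1024 * block size


def block_size_from_tune2fs(output: str) -> int:
    """Extract the 'Block size:' value from `tune2fs -l` output."""
    match = re.search(r"^Block size:\s+(\d+)\s*$", output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Block size:' line found")
    return int(match.group(1))


def min_journal_bytes(block_size: int) -> int:
    """Smallest allowed journal, in bytes, for the given block size."""
    return MIN_JOURNAL_BLOCKS * block_size


def journal_blocks(journal_bytes: int, block_size: int) -> int:
    """Number of file system blocks a journal of `journal_bytes` occupies."""
    return journal_bytes // block_size


# Hypothetical one-line excerpt of `tune2fs -l /dev/sda1` output.
sample = "Block size:               4096\n"
bs = block_size_from_tune2fs(sample)
print(min_journal_bytes(bs))                   # 4194304 bytes (4 MiB)
print(journal_blocks(128 * 1024 * 1024, bs))   # 32768 blocks
```

With 4096-byte blocks the minimum is 4 MiB, so the 128 MiB journal device used here is 32 times larger than strictly required.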

Attach the dedicated journal device to the file system:

tune2fs -j -J device=LABEL=root_journal /dev/sda1;

MySQL settings:

[mysqld]
innodb_old_blocks_time = 1000           # Prevent buffer pool pollution. Default as of MySQL 5.6
innodb_buffer_pool_size = 24576M        # MySQL Cache
innodb_log_buffer_size = 128M           # 25% of log_file_size
innodb_log_file_size = 512M             # 25% of the buffer_pool (no, not really)
query_cache_size = 128M                 # Query Cache
table_cache = 512                       # Make it large enough for: show global status like 'open%';
#mysqltuner.pl:
innodb_flush_method = O_DSYNC           # Don't validate writes. MySQL 5.6+ should use O_DIRECT
innodb_flush_log_at_trx_commit = 2      # Flush MySQL transactions to operating system cache
join_buffer_size = 256K
thread_cache_size = 4
innodb_buffer_pool_instances = 16
skip-innodb_doublewrite