
Update on fsync Performance: How to Achieve Constant Throughput with Unbuffered IO

  • cherrylweiner00553
  • Aug 14, 2023
  • 7 min read










A few years ago, Intel introduced a new type of storage device based on 3D XPoint technology and sold under the Optane brand. These devices outperform regular flash devices and have higher endurance. In the context of this post, I found they are also very good at handling the fsync call, something many flash devices are not great at.


The above results are pretty amazing. The fsync performance is on par with a RAID controller with a write cache, for which I got a rate of 23,000/s, and is much better than a regular NAND-based NVMe card like the Intel PC-3700, which delivers an fsync rate of about 7,300/s. Even with the full ext4 journal enabled, the rate is still excellent, although, as expected, it is cut by about half.


If you have a large dataset, you can still use the Optane card as a read/write cache and improve fsync performance significantly. I did some tests with two easily available solutions, dm-cache and bcache. In both cases, the Optane card was put in front of an external USB SATA disk and the cache layer was set to writeback.


The first and most obvious type of IO is page reads and writes from the tablespaces. The pages are most often read one at a time, as 16KB random read operations. Writes to the tablespaces are also typically 16KB random operations, but they are done in batches. After every batch, fsync is called on the tablespace file handle.
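As a sketch of this access pattern (not InnoDB's actual code), the batched-writes-plus-single-fsync idea looks like this in Python; the file, page count, and batch size here are arbitrary stand-ins:

```python
import os
import random
import tempfile

PAGE_SIZE = 16 * 1024   # InnoDB's default page size
N_PAGES = 64            # a tiny stand-in "tablespace"
BATCH = 8               # pages flushed per batch

fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * (PAGE_SIZE * N_PAGES))   # pre-size the file

def write_batch(fd, page_numbers):
    """Write a batch of 16KB pages at random page offsets, then call
    fsync once for the whole batch, as described above."""
    for page_no in page_numbers:
        os.lseek(fd, page_no * PAGE_SIZE, os.SEEK_SET)
        os.write(fd, os.urandom(PAGE_SIZE))
    os.fsync(fd)   # one durability point per batch, not per page

batch = random.sample(range(N_PAGES), BATCH)
write_batch(fd, batch)
os.close(fd)
```

The one fsync per batch, rather than one per page, is what keeps the number of expensive flushes proportional to batches, not pages.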


To avoid partially written pages in the tablespaces (a source of data corruption), InnoDB performs a doublewrite. During a doublewrite operation, a batch of dirty pages, from 1 to about 100 pages, is first written sequentially to the doublewrite buffer and fsynced. The doublewrite buffer is a fixed area of the ibdata1 file or a specific file with the latest Percona Server for MySQL 5.7. Only then do the writes to the tablespaces of the previous paragraph occur.
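The ordering described above can be sketched as follows. This is a toy illustration of the idea, not InnoDB's implementation; the two temporary files stand in for the doublewrite buffer and a tablespace:

```python
import os
import tempfile

PAGE_SIZE = 16 * 1024

def doublewrite(dblwr_fd, space_fd, dirty_pages):
    """First write the whole batch sequentially to the doublewrite area
    and fsync it; only then do the random writes to the tablespace.
    A torn tablespace page can then be repaired from the intact copy."""
    os.lseek(dblwr_fd, 0, os.SEEK_SET)
    for _, data in dirty_pages:          # step 1: sequential writes + fsync
        os.write(dblwr_fd, data)
    os.fsync(dblwr_fd)
    for page_no, data in dirty_pages:    # step 2: random writes + fsync
        os.lseek(space_fd, page_no * PAGE_SIZE, os.SEEK_SET)
        os.write(space_fd, data)
    os.fsync(space_fd)

dblwr_fd, dblwr_path = tempfile.mkstemp()   # stands in for the dblwr buffer
space_fd, space_path = tempfile.mkstemp()   # stands in for a tablespace
pages = [(n, bytes([65 + n]) * PAGE_SIZE) for n in (3, 0, 7)]
doublewrite(dblwr_fd, space_fd, pages)
os.close(dblwr_fd)
os.close(space_fd)
```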


Because the fsync call takes time, it greatly affects the performance of MySQL; this is why there are so many status variables related to fsyncs. To overcome the inherent limitations of the storage devices, group commit allows multiple simultaneous transactions to share a single fsync of the log file: there is no need for a transaction to call fsync for a write operation that another transaction has already forced to disk. Note that a series of write transactions sent over a single database connection cannot benefit from group commit.
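A minimal sketch of the group-commit idea, assuming a simple leader/follower scheme (an illustration of the concept, not MySQL's actual implementation):

```python
import os
import tempfile
import threading

class GroupCommitLog:
    """Toy group commit: each committer appends its record under a lock;
    one waiter at a time becomes the "leader" and fsyncs everything
    appended so far, so a single fsync covers a whole group of commits."""

    def __init__(self, fd):
        self.fd = fd
        self.lock = threading.Lock()
        self.cond = threading.Condition(self.lock)
        self.written = 0      # bytes appended so far
        self.synced = 0       # bytes known to be durable
        self.syncing = False  # is a leader currently fsyncing?
        self.fsync_calls = 0

    def commit(self, record):
        with self.lock:
            os.write(self.fd, record)
            self.written += len(record)
            my_end = self.written
            while self.synced < my_end:
                if self.syncing:
                    self.cond.wait()     # another transaction syncs for us
                    continue
                self.syncing = True
                target = self.written    # cover everything appended so far
                self.lock.release()
                try:
                    os.fsync(self.fd)    # one fsync for the whole group
                finally:
                    self.lock.acquire()
                self.fsync_calls += 1
                self.synced = max(self.synced, target)
                self.syncing = False
                self.cond.notify_all()

fd, path = tempfile.mkstemp()
log = GroupCommitLog(fd)
workers = [threading.Thread(target=log.commit, args=(b"txn\n",))
           for _ in range(50)]
for t in workers:
    t.start()
for t in workers:
    t.join()
os.close(fd)
os.remove(path)
```

With concurrent committers, `fsync_calls` typically ends up well below the number of transactions; a single connection issuing commits serially would get no such sharing, which is the limitation noted above.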


With the above numbers, the possible transaction rates in fully ACID mode are pretty depressing. But those drives were rotating ones; what about SSDs? SSDs are memory devices and are much faster for random IO operations. They are extremely fast for reads and good for writes but, as you will see below, not that great at fsyncs.


fsync is not the only system call that persists data to disk; there is also fdatasync. fdatasync persists the data but does not update metadata such as the file size and last modification time. Said otherwise, it performs one write operation instead of two. In the Python script, if I replace os.fsync with os.fdatasync, here are the results for a subset of devices:
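The script itself is not reproduced above, but a minimal sketch of the comparison, assuming a Unix system where os.fdatasync is available, might look like:

```python
import os
import tempfile
import time

def sync_rate(sync_fn, n=200):
    """Append a small record n times, calling sync_fn after each write;
    return synced writes per second."""
    fd, path = tempfile.mkstemp()
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, b"x" * 512)
        sync_fn(fd)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.remove(path)
    return n / elapsed

fsync_rate = sync_rate(os.fsync)
fdatasync_rate = sync_rate(os.fdatasync)  # skips the metadata write
```

On devices where the metadata write is the bottleneck, the fdatasync rate can be substantially higher, which matches the "more than doubled" observation below.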


In all cases, the resulting rates more than doubled. The fdatasync call has a troubled history, as there were issues with it many years ago. Because of those issues, InnoDB never uses fdatasync, only fsync. You can find the following comments in the InnoDB source file os/os0file.cc:


So, even with fdatasync, operations like extending an InnoDB tablespace will update the metadata correctly. This appears to be an interesting low-hanging fruit in terms of MySQL performance. In fact, WebScaleSQL already has fdatasync available.


Why do we need an fsync or fdatasync with O_DIRECT? With O_DIRECT, the OS is not buffering anything along the way, so the data should be persisted, right? Actually, the OS is not buffering, but the device very likely is. Here are a few results to highlight the point, using a 7,200 rpm SATA drive:
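The point can be illustrated with a small Python sketch: even when a file is opened with O_DIRECT (which requires page-aligned buffers, hence the mmap), an explicit fsync is still what asks the device to flush its volatile write cache. The fallback path exists because some filesystems, tmpfs for example, reject O_DIRECT:

```python
import mmap
import os
import tempfile

BLOCK = 4096   # typical logical block size; O_DIRECT I/O must be aligned
path = os.path.join(tempfile.mkdtemp(), "direct.dat")

flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DIRECT", 0)
try:
    fd = os.open(path, flags, 0o600)
except OSError:  # some filesystems (tmpfs, for one) reject O_DIRECT
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

buf = mmap.mmap(-1, BLOCK)     # anonymous mmap gives a page-aligned buffer
buf.write(b"A" * BLOCK)
os.write(fd, buf)              # bypasses the OS page cache under O_DIRECT
os.fsync(fd)                   # still required: flushes the *device* cache
os.close(fd)
```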




I tested your fsync.py using an Intel Optane SSD (900P 280GB, ext4). It took 0.054s to complete, but it appears that some of that time was spent outside the for loop. 10,000 iterations took 0.43s, or 0.043 ms per fsync (about 23,250 fsyncs/second).


I tested your fsync.py using our SAN, an HPE 3PAR StoreServ 8400, a relatively high-end flash-based storage device. 10,000 iterations took 19.303s, or 1.930 ms per fsync (about 518 fsyncs/second).


So now I want to apply some techniques to bring the performance of the specs on par with SQLite, with no code modifications (ideally just by setting the connection options, which is probably not possible).
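For comparison, SQLite itself exposes this trade-off through connection-level pragmas. A small sketch using the standard sqlite3 module (the pragma values are illustrative, not recommendations):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bench.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit; explicit BEGIN

# PRAGMA synchronous controls how eagerly SQLite fsyncs:
#   FULL (default) syncs at every commit, NORMAL syncs less often in WAL
#   mode, OFF never syncs (fast, but unsafe on power loss).
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("PRAGMA synchronous = NORMAL")

conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.execute("BEGIN")
for i in range(1000):
    conn.execute("INSERT INTO t (v) VALUES (?)", ("row%d" % i,))
conn.execute("COMMIT")  # batching amortizes the per-commit fsync too
count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
conn.close()
```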


This is one of the only acceptable uses for the fsync=off setting in PostgreSQL. This setting pretty much tells PostgreSQL not to bother with ordered writes or any of that other nasty data-integrity-protection and crash-safety stuff, giving it permission to totally trash your data if you lose power or have an OS crash.


Needless to say, you should never enable fsync=off in production unless you're using Pg as a temporary database for data you can re-generate from elsewhere. If and only if you're going to turn fsync off, you can also turn full_page_writes off, as it no longer does any good then. Beware that fsync=off and full_page_writes apply at the cluster level, so they affect all databases in your PostgreSQL instance.


For production use you can possibly use synchronous_commit=off and set a commit_delay, as you'll get many of the same benefits as fsync=off without the giant data corruption risk. You do have a small window of possible loss of recent data if you enable async commit, but that's it.
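Collected as a postgresql.conf fragment, the settings discussed in this thread might look like this; the values are illustrative, not recommendations:

```ini
# postgresql.conf sketch; values are illustrative, not recommendations
synchronous_commit = off   # commits return before the WAL flush; a crash
                           # loses only the most recent commits
commit_delay = 100         # microseconds to wait so concurrent commits can
                           # share one WAL flush
fsync = on                 # leave on outside throwaway test clusters
full_page_writes = on      # only safe to disable when fsync is also off
```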


As another poster here said, it's wise to put the xlog and the main tables/indexes on separate HDDs if possible. Separate partitions are pretty pointless; you really want separate drives. This separation has much less benefit if you're running with fsync=off, and almost none if you're using UNLOGGED tables.


Finally, tune your queries. Make sure that your random_page_cost and seq_page_cost reflect your system's performance, ensure your effective_cache_size is correct, etc. Use EXPLAIN (BUFFERS, ANALYZE) to examine individual query plans, and turn the auto_explain module on to report all slow queries. You can often improve query performance dramatically just by creating an appropriate index or tweaking the cost parameters.


On newer kernels, you may wish to ensure that vm.zone_reclaim_mode is set to zero, as it can cause severe performance issues with NUMA systems (most systems these days) due to interactions with how PostgreSQL manages shared_buffers.
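A quick way to check the current value on Linux is to read it from /proc; this sketch only reports the setting (changing it requires sysctl and root):

```python
import pathlib

# vm.zone_reclaim_mode = 0 disables zone reclaim; non-zero values can
# interact badly with PostgreSQL's shared_buffers on NUMA machines.
param = pathlib.Path("/proc/sys/vm/zone_reclaim_mode")
value = int(param.read_text()) if param.exists() else None  # None: non-Linux
print("vm.zone_reclaim_mode =", value)
```

To make a zero setting persistent, `vm.zone_reclaim_mode = 0` can be added to /etc/sysctl.conf.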


Whenever possible use temporary tables. They don't generate WAL traffic, so they're lots faster for inserts and updates. Sometimes it's worth slurping a bunch of data into a temp table, manipulating it however you need to, then doing an INSERT INTO ... SELECT ... to copy it to the final table. Note that temporary tables are per-session; if your session ends or you lose your connection then the temp table goes away, and no other connection can see the contents of a session's temp table(s).
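The staging pattern described above can be sketched like this; SQLite is used here only because it ships with Python and supports the same CREATE TEMP TABLE / INSERT INTO ... SELECT idiom (in PostgreSQL, the temp table is what avoids the WAL traffic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE final (id INTEGER, v TEXT)")

# Stage rows in a temp table, massage them there, then copy the result
# into the final table with a single INSERT INTO ... SELECT.
conn.execute("CREATE TEMP TABLE staging (id INTEGER, v TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(i, "raw%d" % i) for i in range(5)])
conn.execute("UPDATE staging SET v = upper(v)")   # the manipulation step
conn.execute("INSERT INTO final SELECT id, v FROM staging")
rows = conn.execute("SELECT v FROM final ORDER BY id").fetchall()
conn.close()
```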


From information on ensuring data is on disk ( -write-caching-part-2-an-overview-for-application-developers/), it appears that on Windows you need to rely on FlushFileBuffers, its "fsync" equivalent, to have the best guarantee that buffers are actually flushed from disk device caches onto the storage medium itself, even in the case of e.g. a power outage. The combination of FILE_FLAG_NO_BUFFERING with FILE_FLAG_WRITE_THROUGH does not ensure flushing of the device cache but merely has an effect on the file system cache, if this information is correct.


Given that I will be working with rather large files that need to be updated "transactionally", this means performing an "fsync" at the end of a transaction commit. So I created a tiny app to test the performance of doing so. It basically performs sequential writes of a batch of 8 memory-page-sized blocks of random bytes using 8 writes, and then flushes. The batch is repeated in a loop, and after every so many written pages it records the performance. Additionally, it has two configurable options: whether to fsync on a flush, and whether to write a byte to the last position of the file before beginning the page writes.
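A rough reconstruction of that little benchmark in Python (not the original Windows app; on Windows, Python's os.fsync goes through _commit) could look like:

```python
import os
import tempfile
import time

PAGE = 4096        # memory page size assumed here
BATCH_PAGES = 8    # the app writes 8 pages per batch, then flushes

def run(n_batches, do_fsync):
    """Sequentially append batches of 8 random 4KB pages, optionally
    fsyncing after each batch; return batches per second."""
    fd, path = tempfile.mkstemp()
    start = time.perf_counter()
    for _ in range(n_batches):
        for _ in range(BATCH_PAGES):
            os.write(fd, os.urandom(PAGE))
        if do_fsync:
            os.fsync(fd)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.remove(path)
    return n_batches / elapsed

buffered = run(50, do_fsync=False)
synced = run(50, do_fsync=True)
```

Comparing the two rates as the file grows is the essence of the experiment whose results are described below.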


The performance results I am obtaining (64-bit Win 7, slow spindle disk) are not very encouraging. It appears that "fsync" performance depends very much on the size of the file being flushed, such that file size, and not the amount of "dirty" data to be flushed, dominates the time. The graph below shows the results for the 4 different settings of the little benchmark app.


As you can see, the performance of "fsync" drops off sharply as the file grows (until, at a few GB, it really grinds to a halt). Also, the disk itself does not seem to be doing a whole lot: resource monitor shows its active time at just a few percent, and its disk queue mostly empty most of the time.


I had obviously expected "fsync" performance to be quite a bit worse than normal buffered flushes, but I had expected it to be more or less constant and independent of file size. As it is, this would seem to suggest that it is not usable in combination with a single large file.


 
 
 
