
Mysticial 2019-10-30 18:33

Parallel Disk I/O
Parallel Disk I/O
Parallel disk I/O has been a recurring topic and has just come up again, so I figured I'd create a thread to keep everything together.

First of all, there are multiple forms of parallel disk I/O:
[LIST=1]
[*]Parallel access to different files and physical drives.
[*]Parallel access to different parts of the same file.
[*]Parallelizing a large sequential access to a single file.
[/LIST]
#1 has been supported since antiquity. The approach is fairly straightforward. Given a large object, stripe it across multiple drives as separate files and file handles. Then for each logical access (offset + bytes), determine which portion lies on which file and perform the accesses to each file in parallel on separate threads.
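The striping arithmetic described above can be sketched as follows. This is an illustrative sketch, not y-cruncher's actual code; the 1 MiB stripe size and all names are assumptions:

```c
#include <stdint.h>

#define STRIPE_SIZE (1 << 20)  /* 1 MiB stripe unit (assumed, not y-cruncher's) */

typedef struct {
    int      drive;        /* which file/drive this piece lands on */
    uint64_t file_offset;  /* offset within that drive's file      */
    uint64_t bytes;        /* length of this piece                 */
} StripePiece;

/* Translate one logical (offset, bytes) access into per-drive pieces
   for a round-robin stripe across num_drives files.
   Returns the number of pieces written to out[]. */
int split_access(uint64_t offset, uint64_t bytes, int num_drives,
                 StripePiece *out, int max_pieces)
{
    int n = 0;
    while (bytes > 0 && n < max_pieces) {
        uint64_t stripe    = offset / STRIPE_SIZE;     /* global stripe index */
        uint64_t in_stripe = offset % STRIPE_SIZE;     /* offset inside it    */
        uint64_t chunk     = STRIPE_SIZE - in_stripe;  /* bytes left in unit  */
        if (chunk > bytes) chunk = bytes;

        out[n].drive       = (int)(stripe % (uint64_t)num_drives);
        out[n].file_offset = (stripe / (uint64_t)num_drives) * STRIPE_SIZE + in_stripe;
        out[n].bytes       = chunk;
        n++;

        offset += chunk;
        bytes  -= chunk;
    }
    return n;
}
```

Each resulting piece can then be issued to its drive's file handle on its own thread, which is where the near-linear scaling comes from.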

Performance scales almost linearly provided that all the drives have the same performance and there are no hardware bottlenecks.

#2 is a topic that came up earlier this year during Google's 31 trillion digit Pi world record. Right now, y-cruncher's only disk I/O parallelism is built into the RAID striping. But computationally, all disk access is still serial.

The network topology used by Google's computation had separate channels for ingress and egress. This means that there is benefit to performing reads and writes in parallel. In order to utilize this (in a more general sense), the computation itself needs to be able to issue parallel accesses to different parts of the swap file.

Those who have played with v0.7.8's new swap mode menu may have noticed the new "parallel disk access" option which is currently locked to "no parallelism". This is precisely what the option is for. Parallel disk access has been added to the internal API.

But neither the computation nor the far memory implementations support it yet. The computation still issues disk access serially and the implementations have a global lock to force any concurrent requests to run serially.

#3 is an even newer topic that has come up in the context of NVMe RAID.

As mentioned above, y-cruncher's only disk I/O parallelism is built into its RAID striping. Thus if you don't use the built-in RAID, you get only one thread to perform disk access. Historically this hasn't been a problem since it was virtually impossible for any amount of disk access to saturate a single worker thread. But this isn't the case anymore with modern high-end systems - namely RAID of multiple NVMe SSDs.

If you RAID0 a bunch of NVMe SSDs and expose it as a single path to y-cruncher, it will only get one worker thread with its default 64 MB buffer. In short, a single thread with such a small buffer cannot keep up with 10+ GB/s of data. Increasing the buffer size may help, but it doesn't solve the problem.

Right now, the work-around is to not do RAID yourself. Instead, expose the drives individually to y-cruncher so it can micromanage them and parallelize the work across multiple threads.


Thus we now raise the question of parallelizing disk access within the same file (and same file handle), as this is required to support both #2 and #3.

The low-level disk access APIs are:
[LIST]
[*]Windows: ReadFile()/WriteFile()
[*]Linux: read()/write()
[/LIST]
These are all single-threaded API calls. On Windows, parallel access to the same file can be achieved by opening the file with FILE_FLAG_OVERLAPPED and passing the file offset in an OVERLAPPED structure with each call. Thus #2 is supported. By extension, #3 can be supported by breaking a large disk access into smaller segments split across different threads.

What about Linux? I haven't really looked into this yet.

In any case, there are some loose ends which need to be examined.
[LIST]
[*]Is there any real performance gain to parallelizing within the same file, or even across different files on the same physical device? Or will the OS itself serialize everything?
[*]How safe is parallel access to the same file? On Windows with overlapped I/O, parallel access is supposedly safe as long as the regions don't overlap. But what if the regions are adjacent and land on the same sector?
[*]The sector alignment issue above is moot when raw I/O is enabled, since unaligned accesses are impossible there. But that defers the problem to y-cruncher's own sector alignment code. So do I need to keep a synchronized sparse map of sector locks? Ugh... (granted, I'm already doing nastier things for the checksums)
[/LIST]
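On that last point, one cheap alternative to a true sparse map is a fixed table of striped locks where each sector number hashes to a slot: two accesses touching the same sector always serialize on the same mutex, while unrelated sectors rarely contend. This is a hypothetical sketch under assumed names and sizes, not a proposal for y-cruncher's actual design:

```c
#include <pthread.h>
#include <stdint.h>

#define SECTOR_SIZE 4096  /* assumed sector size */
#define LOCK_SLOTS  256   /* assumed table size; tune for contention */

static pthread_mutex_t sector_locks[LOCK_SLOTS];

void sector_locks_init(void)
{
    for (int i = 0; i < LOCK_SLOTS; i++)
        pthread_mutex_init(&sector_locks[i], NULL);
}

/* Map a byte offset to the mutex guarding its sector. */
static pthread_mutex_t *lock_for_sector(uint64_t byte_offset)
{
    uint64_t sector = byte_offset / SECTOR_SIZE;
    return &sector_locks[sector % LOCK_SLOTS];
}

/* Guard only the boundary sectors of an access (offset, bytes) -
   the interior sectors belong to this access alone if regions are
   disjoint. Lock in address order to avoid deadlock. */
void lock_boundaries(uint64_t offset, uint64_t bytes)
{
    pthread_mutex_t *a = lock_for_sector(offset);
    pthread_mutex_t *b = lock_for_sector(offset + bytes - 1);
    if (a == b) { pthread_mutex_lock(a); return; }
    if (a > b)  { pthread_mutex_t *t = a; a = b; b = t; }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

void unlock_boundaries(uint64_t offset, uint64_t bytes)
{
    pthread_mutex_t *a = lock_for_sector(offset);
    pthread_mutex_t *b = lock_for_sector(offset + bytes - 1);
    pthread_mutex_unlock(a);
    if (b != a) pthread_mutex_unlock(b);
}
```

The trade-off versus a synchronized sparse map is false sharing: unrelated sectors that hash to the same slot will serialize needlessly, but there is no allocation and no map to keep consistent.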
This is all thinking out loud for now. In reality, I'm not going to have any time to implement anything for a while.
