WinFsp Design
This document presents the design of WinFsp, the Windows File System Proxy. WinFsp is a set of software components for Windows computers that allows the creation of user mode file systems. In this sense it is similar to FUSE (Filesystem in Userspace), which provides the same functionality on UNIX-like computers.
Overview
Developing file systems is a challenging proposition. It requires the correct and efficient design and implementation of a software component that manages files, and it also requires the creation of an interface that adheres to historical standards and expectations that are at times difficult to get right or perhaps even ill-conceived (e.g. atime in POSIX).
Developing file systems for Windows is an order of magnitude more difficult. Not only does the operating system impose some very hard and at times conflicting limitations on what a file system driver can do, it is also not well documented and requires a lot of trial and error and long debugging sessions to get right. On top of this, kernel mode API’s and driver entry points are heavily overloaded and can be invoked in "unexpected" ways, and the system can crash or deadlock for a variety of reasons.
Compounding all this, there is a single book that describes how to write file systems for Windows and it was published in 1997! While the book is invaluable, it is out of date and contains many errors and omissions. Still, if you are thinking about writing file systems, get a copy of "Windows NT File System Internals: A Developer’s Guide" by Rajeev Nagar. Also study the Microsoft FastFat source!
WinFsp attempts to ease the task of writing a new file system for Windows in the same way that FUSE has done for UNIX. File system writers only need to understand the general semantics of a file in Windows and the simple WinFsp interface. They do not need to understand the intricacies of kernel-mode file system programming or the myriad of details surrounding access control, file sharing, conflicts between open and delete/rename, etc. In this sense WinFsp may save 6-12 months of development cost (and pain) for a file system writer. It also allows developers who have no inclination to understand kernel mode programming to write their own file systems.
Some of the benefits and features of using WinFsp are listed below:
- Allows for easy development of file systems in user mode. There are no restrictions on what a process can do in order to implement a file system (other than respond in a timely manner to file system requests).
- Support for disk and network based file systems.
- Support for NTFS level security and access control.
- Support for memory mapped files, cached files and the NT cache manager.
- Support for file change notifications.
- Support for file locking.
- Correct NT semantics with respect to file sharing, file deletion and renaming.
WinFsp Components and NTOS
WinFsp consists of a kernel mode FSD (File System Driver) and a user mode DLL (Dynamic Link Library). The FSD interfaces with NTOS (the Windows kernel) and handles all interactions necessary to present itself as a file system driver to NTOS. The DLL interfaces with the FSD and presents an easy to use API for creating user mode file systems.
When a familiar Windows file API call such as CreateFile, ReadFile, WriteFile, etc. is invoked by an application, NTOS packages the call into an IRP (I/O Request Packet) that is used to describe and track the call. NTOS forwards this call to the appropriate system component or driver. When the WinFsp FSD receives such an IRP it processes it and determines whether any further processing is required by the user mode file system. If that is the case it posts the IRP to a special queue from which the user mode file system can pull it and process it further. When the user mode file system finishes its processing it returns the IRP back to the queue, where the FSD picks it up, does any post processing and then "completes" it. Completing an IRP tells NTOS that the IRP processing is now finished, any associated side effects have taken place and any results have been computed and can be delivered to the original API caller.
As mentioned NTOS must determine the appropriate system component or driver to forward an IRP to. For this purpose NTOS maintains a special namespace called the Object Manager Namespace. The Object Manager Namespace is a hierarchical namespace that can be used to store path → object associations. There also exist special container objects which can contain other objects (directories) and pointer objects that can point to other objects (symbolic links). In this respect the Object Manager Namespace is similar to the file system namespace. On the other hand files in a file system are not objects within the Object Manager Namespace (at least not directly).
A special type of object within the Object Manager Namespace is the Device Object. Device Objects have the important ability to implement their own namespace, which basically allows NTOS to provide file system like functionality on top of the Object Manager Namespace. This works roughly as follows: when an application opens a file named X:\Path\File, NTOS looks up the X: name in a special directory within the Object Manager Namespace. Under normal circumstances the X: name will point to a symbolic link which will likely point to a Device Object with path \Device\Volume{GUID} (or \Device\HarddiskVolumeX). NTOS will open the device at \Device\Volume{GUID} and will then send it a "CREATE" IRP with a file name of "\Path\File". It is now the responsibility of the device driver behind the \Device\Volume{GUID} Device Object to handle the IRP and "open" the file.
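The lookup described above can be sketched as a small Python toy model. It is not NTOS code: the `Device` and `SymbolicLink` classes, the `open_file` function and the use of `\??` as the name of the special directory are all assumptions made for illustration, and the `{GUID}` placeholder is kept literally as it appears in the text.

```python
class Device:
    """Stands in for a Device Object that implements its own namespace."""
    def __init__(self, name):
        self.name = name
        self.create_requests = []   # file names received in "CREATE" requests

    def create(self, remaining_path):
        # The device driver, not the Object Manager, interprets this name.
        self.create_requests.append(remaining_path)
        return (self.name, remaining_path)


class SymbolicLink:
    """Stands in for a symbolic link object that points to another path."""
    def __init__(self, target):
        self.target = target


def open_file(namespace, drive_directory, win32_path):
    """Resolve a path like X:\\Path\\File to (device name, remaining path)."""
    drive, remaining = win32_path[:2], win32_path[2:]
    path = drive_directory + "\\" + drive
    obj = namespace[path]
    # Follow symbolic links until a non-link object is reached.
    while isinstance(obj, SymbolicLink):
        path = obj.target
        obj = namespace[path]
    if not isinstance(obj, Device):
        raise LookupError(path + " is not a device")
    return obj.create(remaining)


# Hypothetical namespace contents for the X: example in the text.
volume = Device("\\Device\\Volume{GUID}")
namespace = {
    "\\??\\X:": SymbolicLink("\\Device\\Volume{GUID}"),
    "\\Device\\Volume{GUID}": volume,
}
result = open_file(namespace, "\\??", "X:\\Path\\File")
```

In this model `result` pairs the device path with the part of the name that the Object Manager did not consume, which is the name the device sees in its "CREATE" request.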
Finally note that not all kernel objects have names and in fact unnamed Device Objects are important in the implementation of NTOS file systems.
WinFsp Device Namespace
When the WinFsp FSD starts up it registers two devices:
- \Device\WinFsp.Disk
- \Device\WinFsp.Net
The first device (WinFsp.Disk) is used to create disk-like file systems (i.e. file systems that present themselves to Windows as disk based file systems). The second device (WinFsp.Net) is used to create network-like file systems (i.e. file systems that present themselves to Windows as network based file systems). Devices of this kind are called Fsctl by WinFsp internally.
These devices can be considered as "constructor" or "factory" devices, because they can be used to create additional devices that act as file systems. There are some differences in behavior when opening the Disk vs Net devices, so we will describe them separately below.
Upon opening the WinFsp.Disk device with the right parameters the following actions will be performed:
- An unnamed "Volume Device Object" will be created. This device will act as "the file system", which means that NTOS will send it file system related IRP’s. Devices of this kind are called Fsvol by WinFsp internally.
- A named "Virtual Volume Device Object" will be created. This device will have a \Device\Volume{GUID} name and will act as the "disk" on which the file system is housed. Devices of this kind are called Fsvrt by WinFsp internally. The Fsvrt device contains a special structure called a VPB (Volume Parameter Block) which points to the Fsvol device. This architectural requirement is mandated by NTOS.
- A Windows API HANDLE will be created. This HANDLE can be used to interact with the newly created Fsvol device using the DeviceIoControl API. Closing the HANDLE will delete the created Fsvol and Fsvrt devices.
Upon opening the WinFsp.Net device the actions performed are similar with one important exception:
- The Fsvol device will be created.
- The Fsvol device will be registered with the MUP (Multiple UNC Provider). This system component is responsible for handling UNC paths (\\Server\Share\Path). No Fsvrt device will be created in this case. However a \Device\Volume{GUID} symbolic link pointing to the MUP device will be created.
- A Windows API HANDLE will be created. As before the HANDLE can be used to interact with the Fsvol device via the DeviceIoControl API. Closing the HANDLE will unregister the Fsvol device from the MUP (and delete the corresponding symbolic link) and then delete the Fsvol device.
It is important to note here that when a process terminates under NTOS (normally or otherwise), NTOS will close all its handles including any WinFsp handles. This ensures that the Fsvol and Fsvrt devices will get deleted even if the user-mode file system suddenly crashes.
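The following Python sketch models this life cycle: opening a factory device creates the volume devices, and closing the returned handle (explicitly or because the process terminated) removes them. It is a toy model under stated assumptions, not the FSD implementation. All class and method names (`ToyKernel`, `ToyProcess`, `open_disk`, `open_net` and so on) are hypothetical, and the randomly generated GUID only stands in for whatever name the real driver assigns.

```python
import uuid


class Device:
    """A toy device object; kind is "Fsvol" or "Fsvrt" as in the text."""
    def __init__(self, kind, vpb=None):
        self.kind = kind
        self.vpb = vpb          # only the Fsvrt carries a VPB in this model


class Handle:
    """Stands in for a Windows HANDLE; closing it runs a cleanup callback."""
    def __init__(self, on_close):
        self._on_close = on_close
        self.closed = False

    def close(self):
        if not self.closed:
            self.closed = True
            self._on_close()


class ToyKernel:
    def __init__(self):
        self.named = {}    # path -> object (models the Object Manager Namespace)
        self.fsvols = []   # unnamed Fsvol devices currently alive
        self.mup = []      # Fsvol devices registered with the MUP

    def _volume_name(self):
        return "\\Device\\Volume{%s}" % str(uuid.uuid4()).upper()

    def open_disk(self):
        """Models opening WinFsp.Disk: creates an Fsvol and a named Fsvrt."""
        fsvol = Device("Fsvol")
        fsvrt = Device("Fsvrt", vpb=fsvol)   # the VPB points to the Fsvol
        name = self._volume_name()
        self.fsvols.append(fsvol)
        self.named[name] = fsvrt

        def cleanup():
            del self.named[name]
            self.fsvols.remove(fsvol)
        return Handle(cleanup)

    def open_net(self):
        """Models opening WinFsp.Net: creates an Fsvol and registers it with the MUP."""
        fsvol = Device("Fsvol")
        name = self._volume_name()
        self.fsvols.append(fsvol)
        self.mup.append(fsvol)
        self.named[name] = "symbolic link to the MUP device"

        def cleanup():
            self.mup.remove(fsvol)       # unregister from the MUP first
            del self.named[name]         # then delete the symbolic link
            self.fsvols.remove(fsvol)    # and finally the Fsvol itself
        return Handle(cleanup)


class ToyProcess:
    def __init__(self):
        self.handles = []

    def terminate(self):
        # On termination every handle is closed, whatever the reason.
        for handle in self.handles:
            handle.close()


kernel = ToyKernel()
file_system = ToyProcess()
file_system.handles.append(kernel.open_disk())
file_system.handles.append(kernel.open_net())
devices_while_running = (len(kernel.named), len(kernel.fsvols), len(kernel.mup))
file_system.terminate()   # e.g. the user mode file system crashed
```

Tying device lifetime to a handle means no separate "unmount" step is needed for cleanup in this model: once `terminate` has run, no names, Fsvol devices or MUP registrations remain.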
I/O Queues
As mentioned, IRP’s are the primary means that NTOS uses to describe and track I/O. The WinFsp FSD receives IRP’s on its Fsctl, Fsvrt and Fsvol devices. IRP’s sent to Fsctl devices (WinFsp.Disk, WinFsp.Net), have to do with creating and managing volume (file system) devices and are handled within WinFsp. IRP’s sent to Fsvrt devices (virtual volume devices) are mostly ignored as WinFsp does not implement a real disk device (it is a file system driver, not a disk driver). Finally IRP’s sent to Fsvol devices (volume devices) are the ones used to implement file API’s such as CreateFile, ReadFile, WriteFile.
When an IRP arrives at an Fsvol device, the FSD performs preprocessing such as checking parameters, allocating memory, preparing buffers, etc. In some cases the FSD can complete the IRP without any help from the user-mode file system (consider for example a ReadFile on a file that has already been cached). In other cases the FSD needs to forward the request to the user mode file system (consider for example that when opening a file the user mode file system must be contacted to perform access checks and allocate resources).
The I/O queue (internal name FSP_IOQ) is the main WinFsp mechanism for handling this situation. An I/O queue consists in reality of two queues and one table:
- The Pending queue where newly arrived IRP’s are placed and marked pending.
- The Process table where IRP’s are placed after they have been retrieved by the user-mode file system. This structure is a dictionary (hash table) keyed by the integer value of the IRP pointer. This allows IRP’s to be completed by the user mode file system in any order.
- The Retried queue where IRP’s are placed whenever their completion needs to be retried (a rare circumstance).
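The three structures can be sketched as a toy Python class. This is not the FSP_IOQ implementation. The class name `ToyIoq` and its method names are hypothetical, IRP’s are modelled as plain dictionaries, and Python's `id()` stands in for the integer value of the IRP pointer.

```python
from collections import deque


class ToyIoq:
    """Toy model of an I/O queue: two queues and one table."""
    def __init__(self):
        self.pending = deque()   # newly arrived IRPs, in arrival order
        self.process = {}        # IRPs being processed, keyed by an integer
        self.retried = deque()   # IRPs whose completion must be retried

    def post(self, irp):
        irp["pending"] = True    # mark pending, then queue
        self.pending.append(irp)

    def next_pending(self):
        return self.pending.popleft() if self.pending else None

    def start_processing(self, irp):
        key = id(irp)            # stands in for the IRP pointer value
        self.process[key] = irp
        return key

    def end_processing(self, key):
        # Keyed lookup lets IRPs finish in any order.
        return self.process.pop(key, None)

    def retry(self, irp):
        self.retried.append(irp)


ioq = ToyIoq()
read_irp = {"major": "IRP_MJ_READ"}
write_irp = {"major": "IRP_MJ_WRITE"}
ioq.post(read_irp)
ioq.post(write_irp)

# The file system retrieves both IRPs in arrival order...
first_key = ioq.start_processing(ioq.next_pending())
second_key = ioq.start_processing(ioq.next_pending())

# ...but finishes the second one first.
finished_first = ioq.end_processing(second_key)
finished_second = ioq.end_processing(first_key)
```

A queue would force completions into arrival order. The keyed table is what allows the write in this example to finish before the read that was retrieved ahead of it.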
Let us now follow the lifetime of an IRP from the moment it arrives at the Fsvol device up to the moment it is completed. Suppose an IRP_MJ_READ IRP arrives and the FSD determines that it needs to post it to the user mode file system for further processing (for example, it is a non-overlapped non-cached ReadFile from a user mode application). In order to do so the FSD may have to perform preparatory tasks such as preparing buffers for zero copy (in the case of IRP_MJ_READ) or capturing process security state or copying buffers, etc. (in other cases). This processing happens in the thread and process context in which the IRP_MJ_READ was received (for example the thread and process context of the application that performed the ReadFile). The FSD then posts the IRP to the Pending queue of the corresponding Fsvol device and returns. However NTOS does not immediately return to the application, as the ReadFile call is not completed yet; instead it waits on an event for the IRP to complete (recall that the ReadFile was non-overlapped).
The user mode file system has a thread pool where each thread attempts to get the next IRP from the Pending queue by executing a special DeviceIoControl (FSP_FSCTL_TRANSACT). This DeviceIoControl blocks the user mode file system thread (with a timeout) until there is an IRP available. The FSP_FSCTL_TRANSACT operation combines a send of any IRP responses that the user mode file system has already processed and a receive of any new IRP’s that require processing. Upon receipt of the FSP_FSCTL_TRANSACT code the FSD pulls the next IRP from the Pending queue and then enters the Prepare phase for the IRP. In this phase tasks that must be performed in the context of the user mode file system process are performed (for example, in the case of an IRP_MJ_READ IRP the read buffers are mapped into the address space of the user mode file system to allow for zero copy). Once the Prepare phase is complete the IRP is placed into the Process table and the user mode version of the IRP called a "Request" (type FSP_FSCTL_TRANSACT_REQ) is marshalled to the file system process. The Request includes a "Hint" that enables the FSD to quickly locate the IRP corresponding to the Request once user mode processing is complete.
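The combined send and receive can be sketched in Python as follows. This is a toy model, not the WinFsp API: `ToyFsd`, `transact` and `serve` are hypothetical names, requests and responses are plain dictionaries rather than FSP_FSCTL_TRANSACT_REQ and FSP_FSCTL_TRANSACT_RSP structures, `id()` stands in for the Hint, and a single loop replaces the thread pool to keep the example deterministic.

```python
import queue


class ToyFsd:
    """Toy FSD side of the transact operation."""
    def __init__(self):
        self.pending = queue.Queue()   # Pending queue
        self.process = {}              # Process table, keyed by Hint
        self.completed = []            # IRPs that have been completed

    def post(self, irp):
        self.pending.put(irp)

    def transact(self, responses, timeout):
        # Send half: match each response to its IRP through the Hint.
        for rsp in responses:
            irp = self.process.pop(rsp["hint"])
            irp["status"] = rsp["status"]
            self.completed.append(irp)
        # Receive half: block (with a timeout) until an IRP is available.
        try:
            irp = self.pending.get(timeout=timeout)
        except queue.Empty:
            return None
        hint = id(irp)                 # the Prepare phase would run here
        self.process[hint] = irp
        return {"hint": hint, "kind": irp["kind"], "file": irp["file"]}


def serve(fsd, handler, timeout=0.01):
    """Toy user mode loop: each call sends old responses and fetches new work."""
    responses = []
    while True:
        request = fsd.transact(responses, timeout)
        responses = []
        if request is None:
            break                      # a real file system would keep looping
        status = handler(request)
        responses.append({"hint": request["hint"], "status": status})


fsd = ToyFsd()
fsd.post({"kind": "read", "file": "\\a.txt"})
fsd.post({"kind": "read", "file": "\\b.txt"})
serve(fsd, handler=lambda request: "SUCCESS")
completed_files = [irp["file"] for irp in fsd.completed]
```

Folding the response into the next receive call means each request costs the loop one transition into the FSD rather than two. In this sketch the last response is delivered by the final `transact` call, the one that then times out waiting for more work.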
The user mode file system now processes the newly arrived Read Request. Assuming that the Read succeeds, the file system places the results of the Read operation into the passed buffer (which, recall, is mapped in the address spaces of both the calling application and the file system process) and eventually performs another FSP_FSCTL_TRANSACT with the Response (type FSP_FSCTL_TRANSACT_RSP). This Response also includes the Request Hint.
Upon receipt of the FSP_FSCTL_TRANSACT operation the FSD uses the Hint to locate (and remove) the corresponding IRP in the Process table. The IRP now enters the Complete phase. In this phase the effects of tasks performed in the Prepare phase are reversed (for example, in the case of an IRP_MJ_READ IRP the read buffers are unmapped from the address space of the user mode file system process). The Complete phase usually results in IRP completion, which signals to NTOS that it is now free to complete the original ReadFile call.
In some rare cases (e.g. because of pending internal locks) the IRP cannot exit the Complete phase immediately. In this case the IRP is placed in the Retried queue so that its completion can be retried at a later FSP_FSCTL_TRANSACT time. Note that the Prepare, Complete and Retried phases always execute in the context of the user-mode file system process.
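The Prepare, Complete and Retried phases can be sketched as a small Python state model. This is an illustration under assumptions, not the FSD logic. The names `ToyIrp`, `prepare`, `complete` and `transact_complete` are hypothetical, a single boolean stands in for "pending internal locks", and two choices are made only to keep the toy simple: retried IRP’s are handled before newly finished ones, and the buffers stay mapped while a completion is waiting to be retried.

```python
from collections import deque


class ToyIrp:
    def __init__(self, name):
        self.name = name
        self.mapped = False      # read buffers mapped into the file system process
        self.completed = False


def prepare(irp):
    """Prepare phase: set up state needed by the user mode file system."""
    irp.mapped = True


def complete(irp, lock_held):
    """Complete phase: undo Prepare and complete the IRP, if possible."""
    if lock_held:
        return False             # cannot exit the Complete phase yet
    irp.mapped = False           # reverse the effect of the Prepare phase
    irp.completed = True
    return True


def transact_complete(retried, finished, lock_held):
    """At transact time, attempt completion of retried and finished IRPs."""
    work = list(retried) + list(finished)
    retried.clear()
    for irp in work:
        if not complete(irp, lock_held):
            retried.append(irp)  # try again at a later transact


retried = deque()
irp = ToyIrp("IRP_MJ_READ")
prepare(irp)

# First transact: an internal lock is busy, so completion is deferred.
transact_complete(retried, [irp], lock_held=True)
state_after_first = (irp.completed, irp.mapped, len(retried))

# Later transact: the lock is free, so the retried IRP completes.
transact_complete(retried, [], lock_held=False)
state_after_second = (irp.completed, irp.mapped, len(retried))
```

Because retries are driven by later transact calls, they run on a thread of the user mode file system process, which matches the note above about the context in which these phases execute.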