flipcode - Programming a Virtual File System

Programming a Virtual File System - Part II
by (22 August 2001)

Introduction

First, I want to say "thank you" for your great feedback. I received about 50 e-mails containing lots of helpful hints, comments, criticisms and ideas. I also want to express a special thank you to John Rush, the guy who not only gave me lot of ideas, but who also implemented half of the filter system and showed me how to use Structured Storage. Thanks.

Winds Of Change

Because of your great comments, the interface of the VFS has changed enormously. So the first subpart of this tutorial will be to show you all the interface changes. The second part will be to introduce you to the real implementation of the VFS. I will show two possible implementations for the VFS, one more and one less platform independent.

For those of you that actually want CODE now *hehe*, I've added the code of my old VFS here (which is less well-structured than the VFS, but still worth checking out).

Before we dive into the parts of VFS that have changed, have a look at the header file.

Major Changes

One major change, is that the the interface is now completely platform independent. To achieve this it uses an ANSI C++ compliant header file for the interface and a separate header file for any data types that are needed. This is a common approach to platform-independence and is used by OpenGL, among others. The data types area defined in VFS_TYPES.H and are prefixed with VFS_ so as to be easily recognizable. This file may need to be modified for use on your platform. VFS_LONGLONG for example is defined as __int64 which is Microsoft specific. You will need to replace that with an appropriate 64-bit integer for your platform. Additionally, I changed the error handling behavior to use an enumeration VFS_ErrorCode which contains all possible error codes. I'll probably add more error codes to this enumeration in the future. Notice that the error code isn't directly returned by the functions, but through the VFS_GetLastError() function. Here's a small table containing the error return values for the return value types:

Data type	Error return value
VFS_BOOL	VFS_FALSE
VFS_DWORD	VFS_ERROR = 0xFFFFFFFF
VFS_Handle	VFS_INVALID_HANDLE
(Pointer Type)	>NULL

This means if a function VFS_DWORD VFS_GetNumFilters() returns VFS_ERROR, this indicates an error. The error code and a string containing the description of the error code can be retrieved like this:


VFS_ErrorCode eError = VFS_GetLastError();
VFS_String strError;
VFS_GetErrorString( eError, strError ); 
cerr << "Error" << strError << endl;

So this is pretty easy stuff.

The last change I made (which you'll probably have noticed), is that I am now using Unicode. Core components such as this are intended to be used throughout the world. To best support this, all the interfaces are now exclusively Unicode. This will also allow maximum file-naming flexibility without breaking file-format compatibility between different locals. Although many Win9x users may not be aware of it, many file systems already support Unicode filenames. Your code should not have to change much, though. You only have to use the VFS_TEXT macro for your hard-coded strings and use some ANSI<->UNICODE conversion routines if necessary.

Filter System Changes

According to a suggestion by John Rush, I changed the Filter Structure to a class and added two member functions LoadData() and SaveData() to the VFS_Filter, allowing the filter to store filter-dependent settings (one setting entry per archive!!!). So, a Filter can store a compression table, encryption keys, etc. in the archive and load that if it needs to do so. These two functions need not be implemented by every Filter -- the base class provides stubbed functionality so just don't override them unless you need this feature.

The filter's LoadData() method is called whenever an Archive is loaded, and the SaveData() is called when an archive is created or closed.

Another interesting change is that I don't use LPCBYTE pointers anymore for the filter callback functions, but Reader/Writer function callbacks. This is better since the Filters don't need to allocate memory anymore, they just stream from/into the file.

Here's the up-to-date version of my ONEADD filter called MULTIADD, which now uses the SaveData/LoadData functions (and which is still as useful as in the last issue ;-):


// The Filter Class.
class MultiAddFilter : public VFS_Filter
{
    signed char m_cAdd;
public:
    MultiAddFilter()
     : VFS_Filter( "MultiAdd", "Adds a definable number to each byte" )
    {}

    VFS_BOOL Encode( VFS_FilterReadProc Reader, VFS_FilterWriteProc Writer )
    {
        if( !Reader || !Writer )
            return VFS_FALSE;

        VFS_BYTE bByte;
        while( Reader( &bByte, 1 ) )
        {
            bByte += m_cAdd;
            Writer( &bByte, 1 ); 
        }
    }

    VFS_BOOL Decode( VFS_FilterReadProc Reader, VFS_FilterWriteProc Writer )
    {
        if( !Reader || !Writer )
            return VFS_FALSE;

        VFS_BYTE bByte;
        while( Reader( &bByte, 1 ) )
        {
            bByte -= m_cAdd;
            Writer( &bByte, 1 ); 
        }
    }

    VFS_BOOL LoadData( VFS_FilterReadProc Reader )
    {
        return Reader( &m_cAdd, 1 );
    }

    VFS_BOOL SaveData( VFS_FilterWriteProc Writer )
    {
        return Writer( &m_cAdd, 1 );
    }

    void SetAddCount( signed char cAdd )
    {
        m_cAdd = cAdd;
    }

    signed char GetAddCount()
    {
        return m_cAdd;
    }

} g_MultiAddFilter;

You just have to register the filter using VFS_Register( &g_MultiAddFilter ) before opening any Archive and that's it ;-) Neat, isn't it ???

Streaming

For the streaming stuff some of you guys mentioned: My implementation of the VFS will only load the entire file into memory if any filters are applied to a file. Otherwise it will stream per default, so we don't need any special streaming functions.

Feeling Safe

I changed some #define-itions to enumerations (just for type safety). For the same reason, I'm using


enum VFS_Handle
{ VFS_HANDLE_FORCE_VFS_DWORD = 0xFFFFFFFF };

now instead of


typedef DWORD VFS_Handle;

The Utility Interface

In the Utility Interface you can now find these functions:


VFS_BOOL VFS_Util_GetPath( const VFS_String& strFileName, VFS_String& pszPath );
VFS_BOOL VFS_Util_GetName( const VFS_String& strFileName, VFS_String& pszName );
VFS_BOOL VFS_Util_GetBaseName( const VFS_String& strFileName, VFS_String& pszBaseName );
VFS_BOOL VFS_Util_GetExtension( const VFS_String& strFileName, VFS_String& pszExtension );

These are small utility functions used to split a file name into its component parts. Consider the path name "Textures/Sarge/AlphaTex.bmp". The following table shows the return values (i.e. the value pszPath/pszName/pszExtension contains after the function call) for each of the utility functions:

Utility Function	Return Value
VFS_Util_GetPath	Textures/Sarge
VFS_Util_GetName	AlphaTex.Bmp
VFS_Util_GetBaseName	AlphaText
VFS_Util_GetExtension	Bmp

So, that's it for the interface changes. There are many more small changes that you'll notice when looking at the header. But these are easily recognizable and not important enough to be covered here.

This section of the Tutorial is fairly important, so please tell me if I made any mistakes or if you want to show me any additional implementation possibilities (that make sense ;-) or if you have ideas on improving the implementation.

Implementing the VFS - Approach 1

The first approach, let's call it VFS1, is the approach I used when I created my old VFS (you can download this badly-structured, but (I think) well-commented piece of code here).

In VFS1, archive files have the following layout:

HEADER	The Archive Header.
FILTERS	A series of Filter Entries.
DIRS	A series of Dir Entries.
FILES	A series of File Entries.
CONFIGURATIONDATA	The plain configuration data for the filters.
DATA	The plain data.

The Header has the following structure:

ID	A 4-byte ID like 'VFS1', used to distinguish an Archive file from other files.
Version	A 4-byte Version (HIWORD = Major Version, LOWORD = Minor Version).
NumFilters	The Number of Filters this Archive knows. To update this field, call VFS_Archive_UpdateFilterList().
DefaultFilters	Each bit B in this bit set defines if the Filter at the index B is used.
NumDirs	The Number of Directories contained in the Archive.
NumFiles	The Number of Files contained in the Archive.
NumConfigurationData Bytes	The Number of Bytes used by Filter Configuration Data.
NumDataBytes	The Number of Bytes used by File Data.

The Filter Structure has the following structure:

Name	A maximal-255-chars-long name for the Filter.
DataOffset	The Offset of the Configuration Data relative to the CONFIGURATIONDATA Section.
DataSize	The Size of the Configuration Data used by this Filter.

The Dir Structure has the following structure:

Name	A maximal-255-chars-long name for the Directory.
ParentDirIndex	The Index of the Parent Directory, or -1 if this is a directory in the root directory.
DefaultFilters	Each bit Bin this bit set defines if the Filter at the index B is used.
NumChildren	The Number of Children of this Directory (stored to speed up VFS_Dir_GetInfo() calls).

The File Structure has the following structure.

Name	A maximal-255-chars-long name for the File.
ParentDirIndex	The Index of the Parent Directory, or -1 if this is a directory in the root directory.
Filters	Each bit Bin this bit set defines if the Filter at the index B is used.
DataOffset	The Offset of the File Data in the Archive file relative to the DATA section.
DataSize	The Size of the File Data.
DecodedDataSize	The Encoded Data Size (stored to speed up VFS_File_GetInfo() calls).

So what do you think about this structure? It seems relatively reasonable to me ;-) But it has the following disadvantages:

- Archive Manipulation Speed: The Archive Manipulation Speed is very slow. We'd better use some kind of cluster algorithm (like the one approach 2 uses).

And, of course, since VFS1 (Approach 1) is a platform-independent approach (and so the implementation will be), we've the next disadvantages:

- Lack of speed & Memory consumption: If you wanted to implement VFS1 totally platform-independent, you can't use file-mapping and so on. You could though, if you only want to hold the archive format platform-compatible.

And of course here are the advantages:

- Easy file format.

- Platform compatible.

Implementing the VFS - Approach 2

The second approach, is not platform-independent at all. It uses Microsoft's Structured Storage API. Here's a short overview about Microsoft's Structured Storage (extracted from the Platform SDK February 2001 Edition):

"...Structured Storage provides file and data persistence in COM by treating a single file as a structured collection of objects known as storages and streams.

The purpose of Structured Storage is to reduce the performance penalties and overhead associated with storing separate objects in a single file. Structured Storage provides a solution by defining how to treat a single file entity as a structured collection of two types of objects � storages and streams � through a standard implementation called Compound Files. This lets the user interact with and manage a compound file as if it were a single file rather than a nested hierarchy of separate objects..."

and

"...Traditional file systems face challenges when they try to store efficiently multiple kinds of objects in one document. COM provides a solution: a file system within a file. COM structured storage defines how to treat a single file entity as a structured collection of two types of objects � storages and streams � that act like directories and files. This scheme is called structured storage. The purpose of structured storage is to reduce the performance penalties and overhead associated with storing separate objects in a flat file...."

Benefits mentioned by the Platform SDK Documentation (and I agree to the Platform SDK ;-) are:

"Incremental access. If a user needs access to an object within a compound file, the user can load and save only that object, rather than the entire file.
Multiple use. More than one end user or application can concurrently read and write information in the same compound file.
Low-memory saves. Structured storage provides facilities for saving files in low-memory situations."

Another benefit: Microsoft uses the "cluster" approach mentioned above, so creating/deleting a "stream" (their name for what we call "files" ;-) is faster than in our method above. So, IMHO this looks better than VFS1. The Structured Storage has an API to manipulate the so-called Compound Files / DocFiles. These are files that can contain many subfiles and subdirs (exactly what we need ;-). Here's a screenshot of the DocFile Viewer application provided by the platform SDK. I created a simple Document File with one Directory, and here's the screenie:

You see the hierarchical structure ;-)? And did you notice that Microsoft Word uses DocFiles, too? Another benefit of using Structured Storage is simply that we don't have to write all the code by ourselves, but that we can simply use the highly-optimized Microsoft Structured Storage API. The filters and their configuration data could be stored in a separate subfile like "settings.ini" in the archive file and every file in the filter will have one DWORD bitset or something like this at the beginning. This bitset which indicates which filters are used, won't be accessible using the VFS_File_Read()/VFS_File_Write() functions, it simply doesn't "exist" for the outer world ;-)

So, we all agree that the DocFile File Format is good, but the major disadvantage is that it's absolutely not platform-compatible. That's bad. Really bad.

The Microsoft DocFile Format

Luckily, there is a library called "libole" for Linux which contains functions to access DocFiles. So the Linux implementation of the VFS could use "libole" to implement the archive functions.

Other Implementation Details

If we're going to use VFS2, we'll use the Microsoft concept of streams, since IStreams and IStorages are the basic blocks of Structured Storage. For these who don't know streams very well, here's a little extract from the Platform SDK Documentation:

"The IStream interface lets you read and write data to stream objects. Stream objects contain the data in a structured storage object, where storages provide the structure. Simple data can be written directly to a stream but, most frequently, streams are elements nested within a storage object. They are similar to standard files."

Streams have similar methods to the Standard Library's fopen()/fclose()/fseek()/f... functions, for instance the IStream's Seek() function acts like fseek(). The advantage of the IStream concept is IStreams are applicable for both Compound Files and Standard Files. So once we've created a IStream interface, we don't have to care about the file type anymore, we can just use the IStream methods.

Which Approach Are We Going To Use???

My personal favorite is using Structured Storage (VFS2, the 2nd approach), simply because it saves me a lot of work ;-) To be serious, I like using the Structured Storage for other reasons: Since Microsoft's programmers are usually good and the Structured Storage API is supposed to be highly optimized for Win32, it'll be much faster using Structured Storage instead of writing our own Archive management classes for the general case. But I'd like to hear your opinion on this point. What do you think? Just tell me what you think and vote for your favorite. Thanks ;-)

Conclusion

I hope this article wasn't too boring for you. I covered a lot of stuff - the new interface and how the implementation is going to work.

Another idea that came into my mind (I don't think mentioned it in the last issue, did I?) is to create a WinZip-like tool for managing Archive Files. What do you think? It would simplify archive management a lot, show you how to use the VFS and, of course, it would also be cool to show your friends, "Hey guys, wanna try my WinZip???" ;-)

I wish you the best for your nightly coding sessions,

Michael

You can grab all sources for this issue here.

Article Series:

Programming a Virtual File System - Part I
Programming a Virtual File System - Part II
Programming a Virtual File System - Part III