Old 10-31-2004, 07:58 AM
excaliber
Senior Contributor

* Expert *
Join Date: Nov 2002
Location: Ohio, USA
Posts: 1,828
Asynchronous File I/O in VB.Net

Today I'm going to talk about Asynchronous file I/O. The .Net platform gives developers a set of Asynchronous methods that can be used to gain performance in certain situations (file I/O, networking, queue management, etc). This tutorial will cover using Asynchronous methods on basic file I/O to gain a performance increase on large files.

What is Asynchronous?
Consider this situation. Let's assume you have 1000 files that you need to open, read into byte arrays, then write to a different location. Each file is well over 4 megabytes of data. What could you do?

First, you could use a single thread to process each file synchronously. This could take a loooong time.

You could use a new thread for each file, and process them all at the same time. On the surface, this seems like the best and most elegant method. But this has three huge problems. First, the memory overhead would be enormous. Having 1000 different files (each over 4 megabytes) opened and read into memory at the same time means roughly 4GB of data in memory at once.

The second problem is more subtle. The OS (Windows NT, 2000, XP) manages how much time each thread gets. This slice of time is called the quantum. The OS divides processor time up evenly between all threads of the same priority. So what is the problem? When the CPU switches from one thread to another, it has to perform a "context switch". This is a fancy way of saying the CPU saves the state of one thread and loads the state of another. Context switching happens every time the OS moves to a different thread, and it usually has a very minimal impact on the system. But when you add a lot of threads (in this case 1000), the system spends more and more of its time context switching, while each thread's share of the processor shrinks. The result is that much of the CPU time is burned on switching and little is actually used for processing.

The last problem deals with the limitations of the OS itself. Each thread reserves 1MB of stack space by default. On 32-bit Windows (NT, 2000, XP), each process is limited by default to 2GB of user address space. This means a process can never have more than roughly 2000 threads (2GB / 1MB = 2048), and in practice fewer, because code and heap data need address space too.

So, what to do? Well, you could use a threadpool to handle the requests. This is actually exactly what Asynchronous programming does, but it does so "behind the scenes" and makes life a lot easier for the programmer. So let's discuss Asynchronous now. We are going to hit a bit of theory before jumping into code. Asynchronous methods do some interesting things that are not usually described in tutorials. I'd like to describe what is actually going on in the background, as it took me quite a while to scrape together the details.
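
For comparison, here is a rough sketch of handling the files with a threadpool by hand, using ThreadPool.QueueUserWorkItem. The module name, the ProcessFile routine and the pending counter are my own names for illustration, and the worker only reports each file's size as a stand-in for the real work. Treat it as a sketch under those assumptions, not a finished implementation.

```vbnet
Imports System
Imports System.IO
Imports System.Threading

Module ThreadPoolSketch
    ' Hypothetical bookkeeping: number of work items still running.
    Private pending As Integer
    ' Signalled by the last work item so Main knows when to exit.
    Private allDone As New ManualResetEvent(False)

    ' Runs on a pool thread. "state" is the file name passed to QueueUserWorkItem.
    Private Sub ProcessFile(ByVal state As Object)
        Dim fileName As String = CStr(state)
        Try
            ' Stand-in for the real work (read the file, write it elsewhere).
            Dim info As New FileInfo(fileName)
            Console.WriteLine("{0}: {1} bytes", fileName, info.Length)
        Catch ex As Exception
            Console.WriteLine("{0}: {1}", fileName, ex.Message)
        Finally
            ' The last item to finish wakes up Main.
            If Interlocked.Decrement(pending) = 0 Then allDone.Set()
        End Try
    End Sub

    Sub Main()
        ' Assumption for the sketch: process every file in the current directory.
        Dim files() As String = Directory.GetFiles(".")
        If files.Length = 0 Then Return

        pending = files.Length
        For Each f As String In files
            ' The pool decides how many threads actually run at once,
            ' so queuing 1000 items does not create 1000 threads.
            ThreadPool.QueueUserWorkItem(AddressOf ProcessFile, f)
        Next

        allDone.WaitOne()
    End Sub
End Module
```

The pool reuses a small set of threads for all the queued items, which avoids the stack-space and context-switching problems above. The downside is that you have to manage the bookkeeping yourself.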

Asynchronous methods use two very important things to gain performance: completion ports and a threadpool. Completion ports are a way to manage multiple requests at once in a non-blocking manner. A single thread makes an I/O request through the completion port, which passes the request on to the OS. Control then returns to the caller right away, so it can continue processing or go do other work. When the OS is done with the I/O, it notifies the caller through a callback. At this point, an idle thread from the threadpool (or, preferably, a thread that is already executing, to avoid a context switch) is taken and used to receive the callback. The callback retrieves the data and exits, returning the thread to the threadpool.
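
To make the request/callback flow concrete, here is a minimal sketch using FileStream.BeginRead and EndRead. The ReadState class, the OnReadDone callback and the temp file contents are assumptions I made up for the example. The completion port and threadpool work is done by the framework and never shows up in the code.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Threading

Module AsyncReadSketch
    ' Hypothetical state object; it reaches the callback via IAsyncResult.AsyncState.
    Private Class ReadState
        Public Source As FileStream
        Public Data() As Byte
    End Class

    ' Signalled by the callback so Main does not exit too early.
    Private finished As New ManualResetEvent(False)

    ' Called by the framework once the read has completed.
    Private Sub OnReadDone(ByVal ar As IAsyncResult)
        Dim state As ReadState = CType(ar.AsyncState, ReadState)
        Try
            ' EndRead returns how many bytes actually landed in state.Data.
            Dim count As Integer = state.Source.EndRead(ar)
            Console.WriteLine("Callback: read {0} bytes", count)
        Finally
            state.Source.Close()
            finished.Set()
        End Try
    End Sub

    Sub Main()
        ' Assumption for the sketch: create a small temp file to read back.
        Dim fileName As String = Path.GetTempFileName()
        Dim sample() As Byte = Encoding.ASCII.GetBytes("hello async")
        Dim writer As New FileStream(fileName, FileMode.Create, FileAccess.Write)
        writer.Write(sample, 0, sample.Length)
        writer.Close()

        Dim state As New ReadState()
        state.Data = New Byte(4095) {}
        ' The final True asks for asynchronous (overlapped) I/O on the file handle.
        state.Source = New FileStream(fileName, FileMode.Open, FileAccess.Read, _
                                      FileShare.Read, 4096, True)

        ' BeginRead hands the request off and returns without waiting for the disk.
        state.Source.BeginRead(state.Data, 0, state.Data.Length, _
                               AddressOf OnReadDone, state)
        Console.WriteLine("Main: read requested, free to do other work.")

        finished.WaitOne()
        File.Delete(fileName)
    End Sub
End Module
```

The order of the two messages can vary from run to run, since the callback fires whenever the OS finishes the read. A real program would usually issue another BeginRead from inside the callback until EndRead returns 0, which signals the end of the file.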

This method allows us to make anywhere from a few to a great many calls and have them processed in the most efficient manner. Here is some more background information, and the reason I said that only some tasks benefit from Asynchronous programming. Suppose instead of opening a file and reading from it (where the disk access speed is the bottleneck), you were factoring large numbers. In this scenario, the bottleneck is the CPU, and no performance gain would come from Asynchronous programming. Asynchronous programming works well when threads would spend a significant portion of their quantum doing little CPU work (because they are waiting on the disk, or a network connection, etc). In these situations, Asynchronous programming is perfect because it can start many operations at once and process them as they finish.