View Single Post
Old 08-08-2009, 11:50 PM
dilettante's Avatar
dilettante dilettante is offline
Underclocked lifestyle

Forum Leader
* Guru *
Join Date: Feb 2005
Location: Michigan, USA
Posts: 4,533
Default VB6 HugeFixedFile

For certain classes of large-scale tasks even the fanciest RDBMS may not be the answer. One of these involves the logging of large volumes of transactions.

An RDBMS like Jet or even SQL Server can add significant storage overhead to such logs. When you get into the 10s of gigabytes range even 50% overhead gets costly to store and back up, and 50% is often a low estimate!

You will probably want to digest such captured logs and use an RDBMS to store the aggregate data for decision support. But first you'll have to capture it and then process it to produce those aggregate figures.

You might start by using a Jet MDB in its lightest mode possible. This means a single table, no keys or a primary key, opening it exclusive, and tweaking a few other parameters. But you'll still hit the 2GB ceiling quickly.

MOving to SQL Server Express might get you a 4GB ceiling, which is still far too low. SQL Server storage overhead can be even higher than Jet's. You'll also need to turn off logging and tweak many other parameters for decent performance. SQL Server Standard removes the 4GB limit but does nothing to assist with the other issues.

There are other types of DBMS and even lightweight ISAM systems. None of these are common today and some have APIs that would make a VB6 programmer lose a lot of hair. They can still be good solutions though once you license them and get over the learning curve. A few even offer an OLE DB Provider which makes them easier to use.

The standby in the DOS days was the MS-ISAM that came with Pro editions of the Basic PDS and later VBDOS 1.0, but this isn't an option for the VB6 programmer.

One other holdover from DOS Basic that we still have in VB6 is the User Defined Type (record). VB6 native I/O statements still support fixed-length record files, including random I/O on them.


I was faced with a challenge like this recently.

Transactions had to be captured that arrive 24x7 from an untold number of sources for offline processing. These transactions had an arrival rate averaging 20 per second most of the day and as high as 200 per second during the two peak hours each 24 hour period. The "two hump" distribution seems to involve U.S. East and West coast sources.

At least one calendar year had to be stored in one datastore. This came to about 1.1 billion records per file. The records contained 120 bytes of data fields, so the total came to about 132.5GB per file - if you assumed zero overhead.

Processing of the resultant data involved several aggregations done monthly, quarterly, and annually, as well as "month to date." These were mostly sums, averages, and other simple calculations - grouped by hour and for some aggregations also by item and/or source category.

What the aggregated data was for and how it was to be reported and presented wasn't my problem. We just needed the results posted to tables in two "mirror image" SQL Server databases in two hosting centers.

What To Do?

I've decided not to describe the customer's first attempt as a logging subsystem. It is just too painful.

I replaced it with a VB6 service that accepts SOAP calls using my own HTTP server component to avoid IIS overhead. Then I logged the data to disk using a more complete version of the HugeFixedFile class which does I/O in blocks or pages to optimize disk throughput. We run two copies of the service on a 4-processor server to log to two separately located remote file shares (fiber Ethernet is plenty fast), and the application makes paired SOAP logging requests.

Then I turned the HugeFixedFile class, some helper routines, and documentation over to two ex-Cobol guys who knew some VB6. I spent about two days going over a sample "aggregator" with them and turned them loose on writing the (5) aggregator programs they needed. Then we did code reviews, cleaned things up, and ran a few tests. They enjoyed it: it reminded them of their old tape Cobol programming days!

So How Does it Work?

Basically it is nothing more than plain old VB6 UDT I/O - extended past the 2GB barrier.

It takes a little sashaying back and forth and a hatful of API calls, and it has a few wrinkles VB6 UDT I/O doesn't have. Not bad though really. Add a few helper routines, and a binary search to locate "start" positions for doing the nightly aggregation runs.

It should scale to hundreds of Terabytes... if you have the time and space!


The example attached is a slightly stripped version. Some of the basic optimizations aren't present in this version. They don't seem to make a lot of difference using a 5400 or 7200rpm drive anyway, at least not like they did over the fiber MAN to fast RAID arrays.

I more concerned about finding some tricks to reduce CPU consumption. I'm doing more copying back and forth than I'd like for one thing. Part of this is to help make it practical to use UDTs, which give easy access to individual fields.

So here (attached) I have a simplified example of HugeFixedFile's use. It is free-standing, just compile and run.

The program accepts a record count for creating the initial file. This file is filled with dummy data, including ascending timestamp values.

The one small challenge here is locating a start point for sequential processing based on a timestamp value and we don't have any indexes or keys. So a binary search routine and a "step back" (Find First) routine are included for demonstration.

A fuller example might accept a start and end timestamp, then run through the record range aggregating the transaction data. Perhaps this might be followed by a display of the results using MSChart.

Once you create a file you can retreive and display the contents of a record by record number (base 0) you enter or by Find First of a timestamp value you enter. It's a "dumb" example just to show what can be done.

My laptop drive is sort of a slow 5400rpm unit, and I am "pacing" the process of creating and writing the records to avoid UI lockup. Even so I can peg one core of my dual core machine, a 2.26GHz P8400. Write rate approaches 70,000 records/sec. so 120,000,000 records take about 26 minutes to write. That's about 8GB of data for these 72 byte records, so I set it as the limit the demo accepts.

What I haven't done yet is to profile this stripped version to find the bottlenecks. I suspect the data copying using LSet might account for a chunk of the CPU being expended though. And a 64-bit VB6 might do wonders in terms of eliminating the need to mess around with Currency variables too!

Maybe someone will have some useful suggestions.
Attached Files
File Type: zip HugeFixed (10.1 KB, 13 views)
Reply With Quote