Out of memory error at reDim of array + Creating Several Arrays with same Name [VBA]

faromic
09-02-2009, 10:29 AM
I have a sub routine where I am reading the lines of a text file and using the split function to store each string in the textline to an element of the array. The current textfile has 3,382,752 rows although it can and will vary. I am expecting textfiles having up to 25,000,000 rows (maybe even more). Anyways, with the current 3382752 row textfile, I get an out of memory error. When I manually change the number of rows being read in to 2,000,000 rows, I don't get the error. I then tried a different approach; create a 3d array and limit the number of rows to 1,000,000 in each "plane of the array". So for my case I would have 4 "planes" of arrays with the last "plane" not fully populated. I had defined 1,000,000 rows and 8 columns of data in each "plane of data". However, when I did this, I still received out of memory errors when the ReDim command was used to size the array to the number of planes and number of rows per plane. I'm not understanding exactly what the limitation here is. It seems like it's not the specified array size, but the number of lines being read in (or stored) in the arrray, even if it is a 3D array. I am running into this problem at only about 3 million lines of data; what will happen when I get to over 20 million?

Now I am thinking about using several (as many as needed) 2D arrays with 1 or 2 million rows and 8 columns to store the data. Before embarking on this coding, I wanted to get people's opinions on how to do this more efficiently and also avoiding out of memory errors. I'm not sure how to write code that would have the same array name with as many "indices" as required. That's my main question here. Let's say the name of the array is "Reactions_Array", and that I have 1,000,000 rows per array. In this case, that's 4 arrays with the 4th one not fully populated. So, I'm trying to have the arrays named:

Reaction_Array_1 or Reaction_Array(1)
Reaction_Array_2 or Reaction_Array(3)
Reaction_Array_3 or Reaction_Array(3)
Reaction_Array_4 or Reaction_Array(4)

Is there a way of doing this? I'm having trouble thinking this one through.

I know it's also possible to use a database for this purpose, but I have absolutely no experience using Access and think it would take too much time to learn it as I have time constraints setup by the work schedule.

Here is a portion of the text file I am reading in. These are not the first lines of the file; they are taken from the middle of the file. Actually I'm reading in 2 text files (both have equal number of rows). The only difference is that one doesn't have the Max/Min column (not shown). There are two text files because there are two different cases run and both need to be added together.

Here is some of the text file. There are 12 "columns" (although the 3rd is not always populated) in the text file.


1 C134 4350 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C134 4351 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C135 4162 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C135 4163 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C135 4350 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C135 4351 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C136 4162 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C136 4163 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C136 4350 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C136 4351 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000
1 C201 Max 4162 1321.49073 232.61665 211.25225 196.93580 45.58798 35.28102 17.53743 6.70972
1 C201 Max 4163 1320.16471 204.60263 353.79847 117.08680 18.13365 7.50298 17.53743 1.57042
1 C201 Max 4350 9.15908 95.46492 98.87438 178.43364 29.45386 25.21007 11.93670 6.70972
1 C201 Max 4351 17.30500 22.50205 241.45986 91.27807 7.81025 11.96793 11.93670 1.57042
1 C201 Min 4162 -1315.30679 -230.32392 -209.78770 -195.01775 -45.37177 -35.19049 -17.47265 -6.72238
1 C201 Min 4163 -1314.18721 -203.52428 -351.24880 -115.63272 -18.01703 -7.45453 -17.47265 -1.56354
1 C201 Min 4350 -9.40017 -94.26445 -98.32249 -176.56687 -29.21808 -25.13913 -11.84978 -6.72238
1 C201 Min 4351 -17.75252 -22.51595 -239.82286 -89.87744 -7.67309 -11.93906 -11.84978 -1.56354
1 C203 Max 4162 1321.49073 232.61665 211.25225 196.93580 45.58798 35.28102 17.53743 6.70972

Thanks,

kassyopeia
09-02-2009, 11:36 AM
You should re-examine if you really need to load the entire text file into memory. Your sample has about 60 characters per line. Multiplying that by 2 bytes per character and 25 million lines, that's a memory consumption of at least 3 gigabyte. Even if you overcome the technical problem that's causing the error message, that doesn't sound like a good idea.
If you want to rearrange the data and then write it back to disk, you don't need to load it all at once. If you want to do statistics, you can usually use some manner of running totals, which again means you don't need to keep the data in memory.

faromic
09-02-2009, 12:13 PM
I do need all the lines in the text file. You said you can write code to actually open the text file, rearrange the data as you like, and then close the file. The file can then be opened and read line by line as needed. One problem I have is that the data is not arranged in the order I need. So I wanted to read it into an array, and then use a series of For and If statements to iterate through every row in the array and use the lines that I need as they come along. As you said, I guess I can rearrange the text file as I need and read it directly. That's how I'm understanding it; let me know if I'm understand you incorrectly. How would I go about opening and rearranging the text file? I don't see how you could use a "sort" type of command if reading in line by line?

kassyopeia
09-02-2009, 12:58 PM
What I'm trying to say is that there is no need to transfer the entire file to memory at once. You can run loops and conditionals almost as easily against the file itself as against an array that contains the file contents.

Incidentally, this does sound more and more like it would be best addressed with a database. :)

faromic
09-02-2009, 01:54 PM
What I'm thinking of doing is separating the textfile by the value in the first column. So if there are 4000 different numbers in the first column of the text file, a new text file for each of the 4000 elements will be created. Each text file will contain all the lines that pertain to that element. The file can then be easily accessed later on and then read into an array since the rows are not in order. Opening the text file and iterating through each row while using the split function is very time consuming and doing this several thousand times would take too long I think. The thing that take the most time is that the columns are separated by more than 10 spaces. So I am using the InStr function and replacing double spaces with one space until only one space exists between strings.
How would I Create a new folder on the hard disk and also create a new text file and then copy lines to the new text file? I can't find anything useful on google.
Thanks,

kassyopeia
09-02-2009, 02:57 PM
Hmmm... interesting optimization problem. I don't think any of the straightforward approaches are feasible. You can't read the input file all at once, you can't open all the output files at once, and opening one output file at a time and reading the input again and again will take forever.

I see two possibilities: Either, do it line by line, and open the output files as needed (pseudocode):

Do Until EOF(input file)
read line from input file
construct output file name from line
open output file
write line to output file
close output file
Loop

Or, do both in chunks. If we keep the array below 100 MB and open 100 output files at a time, say:

Do Until EOF(input file)
For i = 1 to 500,000
read line from input file
construct output file name from line
put line into lines(i)
put name into names(i)
Next i
For j = 1 to (4000/100)
For i = 1 to 100
construct output file name from i, j
open output file
Next i
For i = 1 to 500,000
if output file names(i) is open then write lines(i) to output file
Next i
For i = 1 to 100
close output file
Next i
Next i
Loop

If you don't know the file names beforehand, this approach will require a first pass, during which you make a list of file names, and a second pass, as shown above. "construct output file name from i, j" would then mean "look up output file name in the list".

Which of the two is faster depends on how long it takes to open and close a file. I have no idea about that, so I'd try out both with a smaller sample first.

kassyopeia
09-02-2009, 03:01 PM
Oh, sorry, I forgot:

MkDir Statement
Creates a new directory or folder.
Syntax: MkDir path


Open Statement
Enables input/output (I/O) to a file.
Syntax: Open pathname For mode As filenumber
[...]
If the file specified by pathname doesn't exist, it is created when a file is opened for Append, Binary, Output, or Random modes.

EZ Archive Ads Plugin for vBulletin Copyright 2006 Computer Help Forum