Greene's Programming Pages
C++

File I/O Efficiency — C++'s fstream or C's FILE*?

Hi, fellow programmers!

I learned C++ from the very start of my C/C++ studies, and both books I learned from steered me away from using C's FILE*, fgets, fread, fwrite, and so on. (In fact, neither Bruce Eckel, whose C++ teaching I have come to love very dearly, nor John T. Berry even mentions how to do file i/o with C's FILE*.) Liking the iostream and fstream of C++, I never had any occasion to look "backward" at the C way of doing it. Until now.

I've been using Borland's C++ compiler 5.02 for the last few years. It generates pretty efficient code. Over the last several months (Jan-Sep, 2000), I have begun using the relatively new STL structures (nice stuff, that!), and as I use them more and more, I have noticed that the Borland 5.02 compiler sometimes generates slow-running EXEs compared to the Borland 5.5 compiler. I have shied away from the Borland 5.5 compiler specifically because its fstream file i/o is much less efficient than the 5.02 compiler's fstream file i/o. This prompted me to give Microsoft's Visual C++ 6.0 compiler a try. Across the various programs I've written, I have noticed that for programs that use a lot of STL the VC++ compiler generated faster EXEs, but for programs with little or no STL the Borland 5.02 compiler was better.

Then I tried out one program of mine that uses an STL map structure to count the occurring values in a field in a fixed-record text file, and discovered that the Borland 5.02 compiler (using fstream) made a much more efficient EXE than the VC++ compiler. This meant that the EXE's time was being spent mostly on just reading the file, and very little on counting the occurring values in the map structure.

This motivated me to try a little test. I changed the file reading (I was using the fstream getline function) to C's FILE* and fgets.

WOW!!!

The difference is truly amazing! Under all compilers the EXE is substantially more efficient. And I found that VC++ generated an EXE that was about 15%-20% faster than the one from the Borland 5.02 compiler. And all of this simply by dumping fstream and using FILE*. What a lesson for me!
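As a rough sketch of the kind of swap described above (this is not the actual data processing program, and the helper names are made up for illustration), here are two line-counting functions that differ only in the read call. Both assume every line fits in the 4096-byte buffer.

```cpp
#include <cstdio>
#include <fstream>

// Hypothetical helper: count lines with the C++ stream interface.
// Returns 0 if the file cannot be opened.
unsigned long count_lines_fstream(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) { return 0; }
    char buf[4096];
    unsigned long n = 0;
    while (in.getline(buf, sizeof buf)) { ++n; }  // getline drops the LF
    return n;
}

// Hypothetical helper: count lines with C's FILE* and fgets.
// Returns 0 if the file cannot be opened.
unsigned long count_lines_stdio(const char* path)
{
    std::FILE* in = std::fopen(path, "rb");
    if (!in) { return 0; }
    char buf[4096];
    unsigned long n = 0;
    while (std::fgets(buf, sizeof buf, in)) { ++n; }  // fgets keeps the LF
    std::fclose(in);
    return n;
}
```

Both functions should report the same count for the same file, so timing one against the other isolates the cost of the read call itself.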

My data processing program users are sometimes processing hundreds of thousands, or occasionally even millions, of records, and these substantial time differences can really add up over time, which is why I dug into this. So if you write programs that need to perform a lot of file i/o, and you're using C++'s fstream, you might want to check into going "backward" a bit. It could be worth it.

Here is some detailed information on what I've tested showing the distinct differences in file i/o efficiency between C++'s fstream and C's FILE*.

The stripped down file i/o test program that I used is below.

This program simply reads each line of the input file using either the fstream "getline" function or the FILE* "fgets" function (each of which by default reads characters up to the next hex 0A [linefeed] into the buffer). It then checks whether there is a hex 0D [carriage return] at the end of the line in the buffer (possible because the file is read in binary mode) and decrements "len_buf" if so. Finally, it writes "len_buf" characters from the buffer to the output file and adds a hex 0A [linefeed] on the end of the line. (This could actually be used to convert a PC text file to a UNIX text file, because it converts the CRLF line delimiter to a plain LF line delimiter.)
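The length adjustment described here can be sketched on its own, assuming the line sits in the buffer the way fgets leaves it (linefeed still attached). The function name is hypothetical and does not appear in the test program.

```cpp
#include <cstddef>

// Hypothetical helper: given a line of `len` bytes as fgets() would leave
// it in `buf`, return how many bytes remain once a trailing LF and an
// optional CR just before it are dropped.
std::size_t length_without_line_ending(const char* buf, std::size_t len)
{
    if (len > 0 && buf[len - 1] == '\x0A') { --len; }  // drop linefeed
    if (len > 0 && buf[len - 1] == '\x0D') { --len; }  // drop carriage return
    return len;
}
```

Writing that many bytes followed by a single hex 0A turns a CRLF-delimited line into an LF-delimited one. The `len > 0` checks keep an empty line from reading the byte before the start of the buffer.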

As you can see, the vast majority of the work is simple file i/o, with very little else going on.

My system happens to have a Pentium II chip at 266 MHz. For generating times, I used an input file consisting of exactly 99,696 fixed-length CRLF-delimited records of 217 characters (219 bytes/record counting the CRLF on the end), which is a 21,833,424 byte file. I have two physically separate hard drives on my PC, and when I ran the program to "clock" the times, I had my source file on one drive and my destination file going to the other drive.
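The file size quoted above follows directly from the record layout. A minimal sketch of that arithmetic, with a hypothetical helper name:

```cpp
// Hypothetical helper: size in bytes of a file made of fixed-length
// records, each followed by a line delimiter of `delimiter_bytes` bytes.
unsigned long fixed_record_file_size(unsigned long records,
                                     unsigned long chars_per_record,
                                     unsigned long delimiter_bytes)
{
    return records * (chars_per_record + delimiter_bytes);
}
```

Calling `fixed_record_file_size(99696, 217, 2)` gives 21,833,424, which matches the size of the input file used for the timings.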

The three compilers I have are Borland 5.02 (B50), Borland 5.5 (B55), and (Microsoft) Visual C++ 6.0 (VC). Here are the times I recorded (in seconds):

C++ (fstream):
      B50  B55   VC
      ---  ---  ---
       14   34   28
       13   33   29
       13   34   28
       13   33   28
       14   33   28

C (FILE*):
      B50  B55   VC
      ---  ---  ---
       13   12   11
       12   13   13
       13   12   11
       13   12   11
       13   13   12
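To put the two tables above in relative terms, here is a small sketch that averages the five runs per compiler and divides the fstream average by the FILE* average. The timings are copied from the tables; the function and array names are made up for this illustration.

```cpp
#include <cassert>

// Hypothetical helper: arithmetic mean of n timings in whole seconds.
double mean_seconds(const int* timings, int n)
{
    assert(n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) { sum += timings[i]; }
    return sum / n;
}

// Hypothetical helper: how many times longer the fstream build ran
// than the FILE* build, comparing the means of the n runs.
double slowdown(const int* cpp_times, const int* c_times, int n)
{
    return mean_seconds(cpp_times, n) / mean_seconds(c_times, n);
}

// Timings copied from the tables above (seconds per run).
const int cpp_b50[5] = {14, 13, 13, 13, 14};
const int cpp_b55[5] = {34, 33, 34, 33, 33};
const int cpp_vc[5]  = {28, 29, 28, 28, 28};
const int c_b50[5]   = {13, 12, 13, 13, 13};
const int c_b55[5]   = {12, 13, 12, 12, 13};
const int c_vc[5]    = {11, 13, 11, 11, 12};
```

The fstream means work out to 13.4, 33.4, and 28.2 seconds for B50, B55, and VC, against 12.8, 12.4, and 11.6 seconds with FILE*. That makes fstream roughly 1.05 times as slow under B50, 2.69 times under B55, and 2.43 times under VC.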

Now, MFC's CStdioFile actually wraps C's FILE* (plain CFile goes straight to the operating system's file handles), so if you're using MFC's file classes you are not paying the fstream penalty. I could understand some small "performance hit" with using the C++ fstream (say, 10% or so), but it is the more than 100% performance hit that strikes me as being a bit ridiculous. Borland 5.02's fstream, as you can see from my results, had a very tiny performance hit (if any at all, since the difference in time shown here is so small that it could in fact be attributable to simple fluctuation). So what happened with Borland 5.5 that Borland killed the efficient fstream?

And for those of you who might be thinking, "Yeah, yeah, who cares about 10 or 25 seconds?" I want to point out that if you happen to be running, say, 10 MILLION records instead of just 100,000, then you are talking about the difference between, say, 20 minutes using FILE* and 50 minutes using fstream, meaning that if you use fstream you could be adding 30 minutes of completely unnecessary time for your program to run just performing basic file i/o. That's a big difference. These things add up substantially.
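The projection above is a straight scale-up. A minimal sketch, assuming run time grows in direct proportion to the record count (an assumption, not something measured here), with a hypothetical helper name:

```cpp
#include <cassert>

// Hypothetical helper: scale a measured run time to a larger record
// count, assuming time is directly proportional to the number of records.
double projected_minutes(double seconds_measured,
                         double records_measured,
                         double records_target)
{
    assert(records_measured > 0.0);
    double seconds = seconds_measured * (records_target / records_measured);
    return seconds / 60.0;
}
```

Taking about 12 seconds for the 99,696-record file with FILE* and about 30 seconds with fstream, `projected_minutes(12, 99696, 10000000)` comes to roughly 20 minutes and `projected_minutes(30, 99696, 10000000)` to roughly 50 minutes.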

I would be curious to hear from others regarding their own results with this test program, especially in regard to relative times using other compilers as well, such as a GNU C++ compiler (with the libraries to make Win32 executables) or Watcom. It would also be interesting to hear from those of you who compile on UNIX platforms, to see if there are any substantial differences in file i/o times with compilers there between fstream and FILE*.

(By the way, I'm running a Windows NT 4.0 platform. The majority of the programs I write are for processing large amounts of "record type" data (like what you might do with COBOL on a mainframe), so processing time with large numbers of records is a priority for me. As you can see, the program below is strictly a "console app" that would run in a console window on Windows NT or Windows 9x. If you're on a UNIX platform (say, Linux or AIX or anything like that), the code below should be compatible, though you'll probably have to alter the "include" files a bit. Even though I'm on Windows, I try to stay with compatible, more-or-less standard C++ source code.)

A C++ programmer,
Todd S. Greene
<tgreene@usxchange.net>
(Sep. 13, 2000)


//==============================================================================
// datetime: 9/12/00 08:40:12 am EST
// Copyright 2000 Todd S. Greene
// https://members.tripod.com/toddsgreene/
//
// This code is hereby released to the public domain.
//
// Your actual use of this code is your responsibility, not mine. (In other
// words, if you choose to use this code, then you are liable for results,
// not me. No one has the right to sue me for anything, based merely on using
// some information that I have provided on a purely non-commercial basis.)
//==============================================================================
#define COMPILE_BORLAND50
//#define COMPILE_BORLAND55
//#define COMPILE_MICROSOFT

//#define FILEIO_CPP
#define FILEIO_C

//==============================
// Borland 5.0 headers:
#ifdef COMPILE_BORLAND50
#include <iostream>
#include <fstream>
//using namespace std;
#include <stdlib.h>
#include <stdio.h>
#include <string.h>   // strlen, strcpy
#include <time.h>
#endif
//==============================

//==============================
// Borland 5.5 headers:
#ifdef COMPILE_BORLAND55
#include <iostream>
#include <fstream>
using namespace std;
#include <stdlib.h>
#include <stdio.h>
#include <string.h>   // strlen, strcpy
#include <time.h>
#endif
//==============================

//==============================
// VC++ headers:
#ifdef COMPILE_MICROSOFT
#include <cstdio>     // FILE, fopen, fgets, fwrite
#include <cstdlib>
#include <cstring>    // strlen, strcpy
#include <ctime>
#include <iostream>
#include <fstream>
using namespace std;
#endif
//==============================


//##############################################################
//## START OF MAIN #############################################
//##############################################################

int main(int argc, char** argv)
   {
   if (argc != 3)
      {
      cerr << "\n   usage: test1 {input file} {output file}"
           << endl;
      return EXIT_FAILURE;
      }

   char* fs_name_in = new char[strlen(argv[1])+1];
   strcpy(fs_name_in, argv[1]);
   char* fs_name_out = new char[strlen(argv[2])+1];
   strcpy(fs_name_out, argv[2]);


   //===========================================================
   // input file:
#ifdef FILEIO_CPP
   ifstream fs_in;
   fs_in.open(fs_name_in, ios::binary);
#endif
#ifdef FILEIO_C
   FILE* fs_in;
   fs_in = fopen(fs_name_in, "rb");  // read only, binary
#endif
   if (!fs_in)
      {
      cerr << "\nfatal error: could not open input file \""
           << fs_name_in << "\"\n" << endl;
      return EXIT_FAILURE;
      }
   //===========================================================

   //===========================================================
   // output file:
#ifdef FILEIO_CPP
   ofstream fs_out;
   fs_out.open(fs_name_out, ios::binary);
#endif
#ifdef FILEIO_C
   FILE* fs_out;
   fs_out = fopen(fs_name_out, "wb");  // write only, binary
#endif
   if (!fs_out)
      {
      cerr << "\nfatal error: could not open output file \""
           << fs_name_out << "\"\n" << endl;
      return EXIT_FAILURE;
      }
   //===========================================================


   //--------------------------------------------------
   // set processing start time:
   time_t timv = time(0);
   char time_start[49];
   strcpy(time_start, ctime(&timv));
   if (strlen(time_start) > 0)
      {
      if (*(time_start+strlen(time_start)-1) == '\n')
         { *(time_start+strlen(time_start)-1) = '\0'; }
      }
   //--------------------------------------------------

   //===========================================================
   // setup variables for primary loop:
   //-----------------------------------------------------------
   const unsigned int sz_buf = 4096;
   char* buf = new char[sz_buf];
   unsigned long len_buf;

   unsigned long count_recs_in = 0;
   unsigned long count_recs_out = 0;

   long filepos = 0;
   //===========================================================

   cout << "\n   Processing data..." << flush;

   //===========================================================
   // primary input loop starts here:
   //-----------------------------------------------------------
#ifdef FILEIO_CPP
   while(fs_in.getline(buf, sz_buf))
#endif
#ifdef FILEIO_C
   while(fgets(buf, sz_buf, fs_in))
#endif
      {
      count_recs_in++;
#ifdef FILEIO_CPP
      len_buf = fs_in.gcount();
      if (!fs_in.eof())
         { len_buf--; } // cuz gcount() includes dropped delimiter
#endif
#ifdef FILEIO_C
      // len_buf = strlen(buf); // using file position is faster
      len_buf = ftell(fs_in);
      len_buf -= filepos;
      filepos += len_buf;
      //filepos = ftell(fs_in);
      if (!feof(fs_in))
         { len_buf--; } // cuz fgets() keeps the linefeed in buf
#endif
      // since using binary read (the len_buf check guards empty lines)
      if (len_buf > 0 && *(buf+len_buf-1) == '\x0D')
         { len_buf--; }

#ifdef FILEIO_CPP
      fs_out.write(buf, len_buf);
      fs_out.put('\x0A');
#endif
#ifdef FILEIO_C
      fwrite(buf, len_buf, 1, fs_out);
      fputc('\x0A', fs_out);
#endif
      count_recs_out++;
      }  // *** END OF PRIMARY INPUT LOOP ***
   //===========================================================

#ifdef FILEIO_CPP
   fs_in.close();
   fs_out.close();
#endif
#ifdef FILEIO_C
   fclose(fs_in);
   fclose(fs_out);
#endif
   delete[] buf;

   cout << "   ...finished processing data" << endl;

   //--------------------------------------------------
   // set processing stop time:
   timv = time(0);
   char time_stop[49];
   strcpy(time_stop, ctime(&timv));
   if (strlen(time_stop) > 0)
      {
      if (*(time_stop+strlen(time_stop)-1) == '\n')
         { *(time_stop+strlen(time_stop)-1) = '\0'; }
      }
   //--------------------------------------------------


   cout << "\n        input file: " << fs_name_in << endl;
   cout << "      record count: " << count_recs_in << endl;
   cout << "       output file: " << fs_name_out << endl;
   cout << "      record count: " << count_recs_out << endl;

   cout << "\n    start time: " << time_start << endl;
   cout << "   finish time: " << time_stop << endl;

   return EXIT_SUCCESS;
   }  // *** END OF MAIN ***

//##############################################################
//## END OF MAIN ###############################################
//##############################################################

This page: Created 10/13/00. Last updated 10/13/00.
Copyright © 2000  Todd S. Greene.