Binary Files

A "better" way to store data is in binary format.  This approach is used by most commercial database software. While characters (from strings) are still stored as their ASCII representations, numbers (ints and doubles) are stored in their bitwise form -- the same way they are stored in the computer's memory.  Storing numeric data in such a way usually means that the data takes up less space on disk.

Our first example stores a structure to a file and retrieves it.  The C++ iostream library has two functions that we can call to store and retrieve data in binary format:

	ostream & write( const char * buf, int length );
	istream & read( char * buf, int max );

Since character (char) data is exactly one byte in size (8 bits on all common platforms), a char pointer can address any object one byte at a time.  This is why we can call these functions to store and retrieve both strings and numbers.

To confirm, let's look at the relative sizes of standard data types.  We can use the sizeof operator to tell us:

Listing 1a.  Program to report size of certain types.

#include <iostream>

using namespace std;

struct myStruct
{
	char * s;
	int i;
	double d;
};

int
main( )
{
	myStruct * p;	/* pointer to a single structure (4 bytes
			   on a 32-bit platform to store the
			   address of a structure) */

	/* on the platform in Listing 1b the output of the fifth line
	   equals the sum of the outputs of the second, third and fourth
	   lines; a compiler may add padding, so this is not guaranteed
	 */
	cout << "sizeof(char) = " << sizeof(char) << endl;
	cout << "sizeof(char *) = " << sizeof(char *) << endl;
	cout << "sizeof(int) = " << sizeof(int) << endl;
	cout << "sizeof(double) = " << sizeof(double) << endl;
	cout << "sizeof(myStruct) = " << sizeof(myStruct) << endl;
	cout << "sizeof(myStruct *) = " << sizeof(p) << endl;

	return 0;
}

Listing 1b.  sizeof program output (Windows 95, MSVC++ 5).

sizeof(char) = 1
sizeof(char *) = 4
sizeof(int) = 4
sizeof(double) = 8
sizeof(myStruct) = 16
sizeof(myStruct *) = 4

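The program output shows that, on this platform, char data is one byte in size, int data requires four bytes, double data requires eight bytes, and pointers require four bytes.  Here the size of the structure happens to equal the sum of the sizes of its members (4 + 4 + 8 = 16).  This is not true in general: the compiler may insert padding bytes between or after members to satisfy alignment requirements, so sizeof can report more than that sum.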

Listing 2a.  Program to store/retrieve a struct in binary form.

#include <fstream>
#include <iostream>
#include <cstring>

using namespace std;

struct somestruct
{
	char * ss_string;
	int ss_int;
	double ss_float;
};

// Write a somestruct to an output stream in binary form

ostream &
write( ostream & ostr, const somestruct & ss )
{
	//  Conditional operator.  Used in an assignment statement as
	//  shorthand for if..else.  The first part (before the ?)
	//  is a Boolean expression that gets evaluated as true (1) or
	//  false (0).  If true, the entire expression takes the value of
	//  the second part (after the ?); if false, the entire expression
	//  takes the value of the third part (after the :).
	//
	//  The expression below is equivalent to the following:
	//
	//	int length;
	//	if ( ss.ss_string != 0 )
	//	{
	//		length = strlen( ss.ss_string );
	//	}
	//	else
	//	{
	//		length = 0;
	//	}
	//
	int length = ( ss.ss_string != 0 ) ? strlen( ss.ss_string ) : 0;

	//  Write the length of the string to the file in 
	//  binary form.  To do this we are using a "member function"
	//  of the ostream class, whose prototype is:
	//
	//	ostream & ostream::write( const char * buf, int size );
	//
	//  To satisfy the compiler, we have to make a "type cast", which
	//  temporarily converts the "int *" created by using the
	//  "address-of" operator (&) on length, to a "const char *".
	//  Without the cast the call does not compile, because an
	//  "int *" is not converted to a "const char *" implicitly.
	//	
	ostr.write( (const char *) &length, sizeof(length) );

	//  only write the string member if one exists and has 1 or more
	//  characters in it
	if ( length > 0 )
	{
		ostr.write( ss.ss_string, length );
	}
	ostr.write( (const char *) &ss.ss_int, sizeof(ss.ss_int) );
	ostr.write( (const char *) &ss.ss_float, sizeof(ss.ss_float) );

	return ostr;
}

// Read the binary form of a somestruct from an input stream

istream &
read( istream & istr, somestruct & ss )
{
	//  free any memory associated with the string member and
	//  make the pointer null (0)
	delete [] ss.ss_string;
	ss.ss_string = 0;

	//  Try to read the length of the string member.  The cast is
	//  needed because the member function of the istream class
	//  called "read" expects a "char *", not an "int *".  Its
	//  prototype is:
	//
	//	istream & istream::read(char * buf, int max);
	//
	int length = 0;
	istr.read( (char *) &length, sizeof(length) );

	//  continue only if the length was read successfully
	if ( istr.good() )
	{
		//  allocate and read in the string if the read-in length
		//  is positive
		if ( length > 0 )
		{
			ss.ss_string = new char [ length + 1 ];
			istr.read( ss.ss_string, length );
			ss.ss_string[length] = '\0';
		}
		istr.read( (char *) &ss.ss_int, sizeof(ss.ss_int) );
		istr.read( (char *) &ss.ss_float, sizeof(ss.ss_float) );
	}
	return istr;
}

// Write the contents of a somestruct to an output stream in text form.

inline ostream &
operator<<( ostream & ostr, const somestruct & ss )
{
	//  The string member is checked for a null pointer, which
	//  will cause the string "(null)" to be substituted and
	//  written to the stream.
	//
	ostr << '"' << ( ( ss.ss_string != 0 ) ? ss.ss_string : "(null)" )
	     << '"' << ' ' << ss.ss_int << ' ' << ss.ss_float;

	return ostr;
}

int
main( )
{
	const char * const FILENAME = "myStruct.bin";

	somestruct myStruct;

	myStruct.ss_string = new char [32];
	strcpy( myStruct.ss_string, "Hello, hello, hello, how low?" );
	myStruct.ss_int = 32;
	myStruct.ss_float = 3.14159265;

	cout << "Initially:"  << endl << myStruct << endl;

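	// Write the struct in binary form to a file, discarding any
	//  previous contents.  ios::binary stops the library from
	//  translating newline bytes (as text mode does on Windows).

	ofstream out( FILENAME, ios::out | ios::trunc | ios::binary );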
	if ( ! out )
	{
		cerr << "error saving" << endl;
	}
	else
	{
		write( out, myStruct );
	}
	out.close( );

	// ^Close the file so that we can open it for read...

	// ... then restore the binary form of the struct.

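	somestruct anotherStruct;
	anotherStruct.ss_string = 0;	// read( ) deletes the old string,
					//  so the pointer must start null
	ifstream in( FILENAME, ios::in | ios::binary );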
	if ( ! in )
	{
		cerr << "error restoring" << endl;
	}
	else
	{
		read( in, anotherStruct );
		cout << "Restored:"  << endl << anotherStruct << endl;
	}

	return 0;
}

We call the write and read functions and, where necessary, cast the address of numeric data to a byte pointer, to store and retrieve data from the file.  If we look at a hex dump of the file:

Listing 2b.  Hex dump of binary file containing structure data.

(Pentium II, Windows 95, MSVC++ 5)
00000000:  1d 00 00 00 48 65 6c 6c 6f 2c 20 68 65 6c 6c 6f   ....Hello, hello
00000010:  2c 20 68 65 6c 6c 6f 2c 20 68 6f 77 20 6c 6f 77   , hello, how low
00000020:  3f 20 00 00 00 f1 d4 c8 53 fb 21 09 40            ? ......S.!.@
0000002d

we can pick out the individual components of the structure.  The bytes 1d 00 00 00 are the length of the string member -- 29 (1DH).  The three zero bytes carry no information for a value this small, but they are stored because an int requires four bytes.  (Since this example uses Windows 95 on an Intel x86 architecture machine, the number is stored in Little Endian format -- the lowest-order byte first and the highest-order byte last.  Some other architectures store numbers in the reverse order, called Big Endian.)

Next we have the characters of the string -- each takes up only one byte; followed by the int member (20 00 00 00 -- the number 32 in Little Endian format).  Finally, we have the double member, stored as 8 bytes (f1 d4 c8 53 fb 21 09 40) in IEEE floating-point format in Little Endian order.

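The size savings of binary format files come from two places.  Firstly, there are no field or row separators.  Secondly, every number occupies a fixed number of bytes regardless of its value.  In this small example the ints would have required only two bytes each in ASCII format (instead of four), but the double would have required 10 bytes (as opposed to eight).  The advantage grows with larger values: an int such as 1234567890 takes 10 bytes as text and still only four in binary.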

