Binary Files – Stephen Marz

Learning Objectives

Understand what a binary file means.
Be able to open a file as a raw, binary file.
Be able to read a given number of bytes from a binary file.
Be able to write a given number of bytes to a binary file.
Understand and be able to use the bytes data type.
Understand and be able to use the bytearray data type.

What is a Binary File?

All files are technically binary, meaning that they are made up of a bunch of 0s and 1s. However, when we read files, we read 8 bits (eight 0s and 1s in a row) at a time. A group of 8 bits is known as a byte, and a plain ASCII text file stores one character in one byte. So, if I have a 30-byte ASCII text file, that means that the file contains 30 characters. (Other text encodings, such as UTF-8, can use more than one byte for some characters.)
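To make the byte-per-character idea concrete, here is a minimal sketch that counts characters and bytes. The variable names are mine, and the accented letter is only there to show that the one-to-one rule is a property of ASCII rather than of text in general.

```python
# 30 plain ASCII characters take exactly 30 bytes.
ascii_text = "A" * 30
ascii_bytes = ascii_text.encode("ascii")
print(len(ascii_text), len(ascii_bytes))   # 30 30

# A single accented character (U+00E9) needs two bytes in UTF-8.
accented = "\u00e9"
accented_bytes = accented.encode("utf-8")
print(len(accented), len(accented_bytes))  # 1 2
```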

When we talk about binary files in the context of this lecture, we’re talking about files that do not follow the 1-byte, 1-character format. Instead, we have a file format, such as a JPEG or MP3 file. These have a format that engineers designed. For example, a JPEG must have a width, a height, and a list of pixel colors to display an image. There are many other things that go into the JPEG file, but I think you get the point.

Binary File Structures

As I mentioned above, binary files are just sequences of 0s and 1s. There is no inherent structure. For example, if I rename my JPEG file so that it has a text file extension, Windows will try to open it as a text file. This is because Windows only sees 0s and 1s in the file itself; it is the extension (.jpg) that tells Windows to open it as a JPEG. Once Windows knows this, it can run a program that is specifically designed to interpret those 0s and 1s in the order the JPEG format defines.

Binary File Example

One example that I use often to show how binary files work is the bitmap file. These are picture files, much like JPEG files, but there is no complicated decompression.

Recall that a byte is 8 bits, which is 8 0s and 1s put together. Generally, but unfortunately not always, the smallest addressable unit binary files use is a byte.

The following is what is known as a header, which is a structure imposed on the 0s and 1s so that we know what those 0s and 1s mean. Here's the bitmap file header structure:

type (16 bits, 2 bytes)
size (32 bits, 4 bytes)
reserved (32 bits, 4 bytes)
offset (32 bits, 4 bytes)

You can see above that not all of the sizes are the same. So, in the file header, we can see what type of bitmap file this is by looking at the first 16 bits (2 bytes) of the file.
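As a rough sketch of how this header maps onto bytes, the code below builds a 14-byte header by hand and pulls the four fields back out by position. The values 70 and 54 are made up for illustration, and the code assumes the fields are stored little end first, which is how bitmap files store them. The to_bytes and from_bytes functions used here are explained later in this lecture.

```python
# A hand-built, hypothetical 14-byte bitmap file header (not read from a real file).
header = (
    b"BM"                                      # type: 2 bytes
    + (70).to_bytes(4, byteorder="little")     # size: 4 bytes (made-up value)
    + (0).to_bytes(4, byteorder="little")      # reserved: 4 bytes
    + (54).to_bytes(4, byteorder="little")     # offset: 4 bytes (made-up value)
)

# Each field lives at a fixed position, so slicing recovers it.
file_type = header[0:2]
size = int.from_bytes(header[2:6], byteorder="little")
reserved = int.from_bytes(header[6:10], byteorder="little")
offset = int.from_bytes(header[10:14], byteorder="little")

print(len(header))                        # 14
print(file_type, size, reserved, offset)  # b'BM' 70 0 54
```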

Reading Binary Files

We can read binary files by adding a b into the mode given to open. Recall that we can open a file, such as f = open("myfile.txt", "r"). In this case, it looks for myfile.txt and opens it for reading, which is what “r” is for. When we read from the file, Python will give us strings since it thinks this is a text file. Recall that a string is just a sequence of characters.

If we want to open the file as a sequence of 0s and 1s (binary) instead of a sequence of characters (text), we can add a “b” to the mode, which stands for binary. For example, f = open("myfile.bin", "rb") will open the file myfile.bin for reading as a binary file. Now that we’ve added the “b”, Python will give us the bytes data type instead of a string.
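Here is a small sketch comparing the two modes on the same file. The file name example.bin and the temporary directory are my own choices so that the example cleans up after itself.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")  # hypothetical file name

    # Create a three-byte file to experiment with.
    with open(path, "wb") as f:
        f.write(b"Hi!")

    # Text mode ("r") gives us a string.
    with open(path, "r") as f:
        text = f.read()

    # Binary mode ("rb") gives us bytes; read(n) reads up to n bytes.
    with open(path, "rb") as f:
        first_two = f.read(2)
        rest = f.read()

print(type(text), text)   # <class 'str'> Hi!
print(type(first_two))    # <class 'bytes'>
print(first_two, rest)    # b'Hi' b'!'
```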

Bytes Data Type

Since we’re working with bytes and not strings, Python added a bytes data type. This was not the case in the older Python versions, but now this data type makes reading and writing bytes from and to a file much easier.

Bytes can be created using a literal, which looks much like a string but begins with a b, such as: b"Bytes". Even though it looks like a string, when we assign this to a variable, the variable holds a bytes object. Remember that when we index or slice a string, such as mystring[0] or mystring[:-1], we get a character or sequence of characters. Bytes work much the same way, with one difference: slicing, such as mybytes[:-1], gives a sequence of bytes, but indexing a single position, such as mybytes[0], gives the value of that byte (8 bits) as an integer from 0 to 255.
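One detail worth trying in the interpreter: in Python 3, indexing a bytes object returns an integer, while slicing returns another bytes object. A short sketch, with variable names of my choosing:

```python
data = b"Bytes"

single = data[0]       # indexing gives an int
one_slice = data[0:1]  # slicing gives bytes, even for one position
all_but_last = data[:-1]

print(type(data))      # <class 'bytes'>
print(single)          # 66
print(one_slice)       # b'B'
print(all_but_last)    # b'Byte'
print(list(data))      # [66, 121, 116, 101, 115]
```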

Getting Bytes from Numbers

Numbers can be represented by a sequence of bits, but the length of that sequence is somewhat arbitrary. For example, 00000000023 is 23 just as much as 023 is 23. However, if I only have two digits, I cannot represent 123. So, I say "arbitrary" because we can use more digits than are necessary, but not fewer. This is why Python allows us to convert an integer into a sequence of bytes, but in order to do so, we need to give it some parameters.

When we have an integer, we can call a member function called to_bytes() to convert it into a sequence of bytes. However, remember what I said: we need to give it some parameters, namely two: size and endianness.

Size

Most sequences of bytes in a computer come in powers of 2: 1, 2, 4, or 8 bytes. The exact number of bytes per field depends on the file structure, which you would need to know before you write to it or read from it.

Endianness

Endianness isn’t that difficult to understand, but it is different than what we’ve talked about before. Endianness means what end of the number will come first. It comes in two flavors: big or little, meaning do we store the little end (the rightmost digits) first or the big end (the leftmost digits). The following figure shows how there are two ways we can store a number: big or little:

Figure: Example of big and little endian byte orders, using the number 0x1A2B3C4D5E6F7080.

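Humans read in big-endian if you read from left-to-right. You can see that the number we're storing is 1a_2b_3c_4d_5e_6f_70_80. This is a base 16 (hexadecimal) number, and each pair of hexadecimal digits is one byte. In little endian, the bytes are stored in the reverse order, 80_70_6f_5e_4d_3c_2b_1a, since the little end (rightmost byte) is stored first. Note that only the order of the bytes changes; the digits inside each byte stay as they are.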
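The same number can be checked in Python. This is a small sketch that uses the to_bytes function (covered in the next section) and the hex() function on bytes to print each byte as two hexadecimal digits:

```python
value = 0x1A2B3C4D5E6F7080

big = value.to_bytes(length=8, byteorder="big")
little = value.to_bytes(length=8, byteorder="little")

print(big.hex())      # 1a2b3c4d5e6f7080
print(little.hex())   # 80706f5e4d3c2b1a

# Little endian is the same bytes in reverse order.
print(little == big[::-1])  # True
```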

Converting Integers to Bytes

Again, we can use to_bytes to create a sequence of bytes. For example:

value = int(10).to_bytes(length=4, byteorder='little')
print(type(value))

In the code above, I use int(10) because writing 10.to_bytes(...) directly is a syntax error: Python reads "10." as the start of a floating-point number. Wrapping the literal in parentheses, (10).to_bytes(...), or storing the number in a variable first works too. Just like everything else in Python, there are a myriad of ways to perform this operation; however, I find this one to be the easiest.

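There are two parameters for to_bytes: the length, which is the number of bytes, and the byteorder, which specifies whether this will be stored little end or big end first. The byteorder parameter takes a string that is either 'big' or 'little'. Both parameters are required before Python 3.11; from 3.11 on, they default to length=1 and byteorder='big', but spelling them out keeps the file layout explicit.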

When we print the type of variable we get back, it will print the following:

<class 'bytes'>
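As a side sketch, here are a few equivalent ways to make the same call, plus what the resulting bytes look like. The variable names are mine. Notice that Python prints byte 10 as \n, since 10 is the value of the newline character.

```python
from_int_call = int(10).to_bytes(length=4, byteorder="little")
from_parens = (10).to_bytes(length=4, byteorder="little")

number = 10
from_variable = number.to_bytes(length=4, byteorder="little")

print(from_int_call)   # b'\n\x00\x00\x00'
print(from_int_call == from_parens == from_variable)  # True

# The same value stored big end first.
big_first = number.to_bytes(length=4, byteorder="big")
print(big_first)       # b'\x00\x00\x00\n'
```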

So, we legitimately got an object of class bytes back from the integer's to_bytes() function. We can write to a file using the same write function as before, except now we pass it a bytes object. For example,

i = 100
j = 200
k = 0xdeadbeef
i_bytes = i.to_bytes(length=1, byteorder='little')
j_bytes = j.to_bytes(length=2, byteorder='little')
k_bytes = k.to_bytes(length=4, byteorder='little')
f = open("myfile.bin", "wb")
f.write(i_bytes)
f.write(j_bytes)
f.write(k_bytes)
f.close()

Notice that when I write, the number of bytes written to the file comes from the length of each XX_bytes variable and not from the write function itself. We convert our integers into bytes by using to_bytes, and we choose how many bytes each one takes by specifying the length.
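To see exactly what ends up in the file, here is a sketch that writes the same three values and then reads the whole file back as raw bytes. It writes to a temporary directory (my choice, so nothing is left behind) and relies on write returning the number of bytes it wrote.

```python
import os
import tempfile

i_bytes = (100).to_bytes(length=1, byteorder="little")
j_bytes = (200).to_bytes(length=2, byteorder="little")
k_bytes = (0xdeadbeef).to_bytes(length=4, byteorder="little")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.bin")
    with open(path, "wb") as f:
        # Add up the byte counts that each write reports.
        written = f.write(i_bytes) + f.write(j_bytes) + f.write(k_bytes)
    with open(path, "rb") as f:
        contents = f.read()

print(written)          # 7
print(contents.hex())   # 64c800efbeadde
```

Reading the hexadecimal output from left to right: 64 is 100 in one byte, c8 00 is 200 in two bytes with the little end first, and ef be ad de is 0xdeadbeef with its four bytes reversed.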

If you call to_bytes with a length that is too small to hold the integer, Python will raise an OverflowError, as you can see below:

>>> int(2000).to_bytes(length=1, byteorder='little')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: int too big to convert

When I specify length=1, I'm telling to_bytes that I want the integer to fit within one byte. However, one byte (8 bits) maxes out at \(2^{8}-1=255\) for unsigned numbers. This isn't the whole story: when we store negative numbers, we use a bit to store the sign, which changes the range (a signed byte holds -128 to 127, and to_bytes needs signed=True before it will store a negative value). I don't want to complicate things, so I'm just showing this as an example that there is a problem when we try to store big numbers in smaller lengths.
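For the curious, here is a brief sketch of the signed case using the signed keyword that to_bytes and from_bytes both accept. The variable names are mine; the point is only that the same byte can mean 255 or -1 depending on how we agree to interpret it.

```python
unsigned_max = (255).to_bytes(length=1, byteorder="little")
minus_one = (-1).to_bytes(length=1, byteorder="little", signed=True)
print(unsigned_max, minus_one)   # b'\xff' b'\xff'

as_unsigned = int.from_bytes(b"\xff", byteorder="little")
as_signed = int.from_bytes(b"\xff", byteorder="little", signed=True)
print(as_unsigned, as_signed)    # 255 -1

# 128 fits in an unsigned byte, but not in a signed one (maximum 127).
overflowed = False
try:
    (128).to_bytes(length=1, byteorder="little", signed=True)
except OverflowError:
    overflowed = True
print(overflowed)                # True
```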

From Bytes

The previous section showed how to convert from an integer into the bytes data type. However, this is really only useful when we’re writing bytes. What happens when we want to read bytes from a binary file? We can use the int.from_bytes() function to convert from bytes into an integer.

f = open("myfile.bin", "rb")
# Read the fields in the same order, and with the same sizes, that they were written.
one_byte = f.read(1)
two_bytes = f.read(2)
four_bytes = f.read(4)
f.close()
# Convert each group of bytes back into an integer.
print("One byte is:", int.from_bytes(one_byte, byteorder='little'))
print("Two bytes is:", int.from_bytes(two_bytes, byteorder='little'))
print("Four bytes is:", int.from_bytes(four_bytes, byteorder='little'))

You can see above, we have to specify the byteorder once again when converting from bytes into an integer. We don’t specify the length since we know the length from the bytes object itself.
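The byteorder matters just as much when reading as when writing. In this small sketch, the same two bytes turn into two different integers depending on which end we say comes first:

```python
raw = (200).to_bytes(length=2, byteorder="little")
print(raw.hex())   # c800

right_order = int.from_bytes(raw, byteorder="little")
wrong_order = int.from_bytes(raw, byteorder="big")

print(right_order)  # 200
print(wrong_order)  # 51200
```

Python does not raise an error for the mismatched byteorder; it simply produces a different number, so the byteorder has to come from knowing the file format.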

Byte Arrays

A bytes object is immutable, meaning that we cannot modify it in place, much like a tuple. We can create a new bytes object that makes some transformation, but what if we want to change just a single piece of a bytes object? We can use a bytearray object to do this. These objects are very similar to bytes objects, however they can be changed in place, much like a list.

We can change a bytes object into a bytearray object by using bytearray(bytes_object) and vice-versa:

bytes_object = b"Mars"
ba = bytearray(bytes_object)
ba[3] = ord("z")
print(ba)
bytes_object = bytes(ba)
print(bytes_object)

The code above takes a bytes object, copies it into a byte array, changes the 's' to a 'z', and copies it back. Notice that I had to use ord("z"), which is a built-in function that gives me the integer representation (122) of a lowercase z. Each element of a bytearray is an integer from 0 to 255, so we assign an integer and not a character.
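To see the difference between the two types side by side, here is a sketch that first tries the same assignment on a bytes object, which fails, and then makes a couple of in-place changes to a bytearray. The variable name frozen is mine.

```python
frozen = b"Mars"

# bytes objects do not support item assignment.
assignment_failed = False
try:
    frozen[3] = ord("z")
except TypeError:
    assignment_failed = True
print(assignment_failed)   # True

# A bytearray copy can be changed and grown in place.
ba = bytearray(frozen)
ba[3] = ord("z")
ba.extend(b"!")

print(ba)          # bytearray(b'Marz!')
print(bytes(ba))   # b'Marz!'
print(frozen)      # b'Mars'
```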

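NOTE: Reading from a binary file with read() gives you the immutable bytes object. Writing accepts a bytes object, and write() will also take a bytearray directly. As you can see above, you can convert between the two whenever you need to.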