Handling binary data in PHP with pack() and unpack()
Tagged with: [ PHP ] [ pack ] [ unpack ]
Nowadays most low-level functionality, like reading or writing graphics, is taken care of by third-party libraries, and that’s OK. It’s way too complicated to do things right, and you probably want to focus on outputting or sending a PNG instead of constructing one from scratch. While reading and writing this kind of binary data was normally done in languages like C or even assembler, most higher-level languages still have these capabilities and yes, even PHP… Meet pack() and unpack().
Most people don’t want to know how things are done internally: what a tarball looks like, or how a PNG file stores its color palettes. However, if you are just like me, then you are curious enough and want to know. So today I’m going to show you how to read a PNG file directly from disk and display the info that is hidden behind the image. I might even tell you an optimization trick or two in the meantime :-)
First things first: pack() and unpack().
When dealing with binary data in PHP there are two main functions that you cannot live without: pack() and unpack(). Both work more or less the same way and are driven by a format string that describes the binary layout. pack() takes a format string plus a list of values and stores them in a binary string, while unpack() does the opposite: it takes a format string and a binary string and converts the bytes back into an array.
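A minimal sketch of pack() in action. The format string and the payload are assumptions, chosen to match the layout described next (a 32-bit length, three ASCII characters, CR and LF):

```php
<?php
// Hypothetical payload; not taken from a real file format.
$payload = 'aBc';

// N  = unsigned 32-bit integer, big-endian (the length)
// A3 = 3-character string, space-padded if shorter
// C  = unsigned 8-bit integer (used twice: CR and LF)
$binarystring = pack('NA3CC', strlen($payload), $payload, 0x0D, 0x0A);

echo strlen($binarystring), "\n";  // 9
echo bin2hex($binarystring), "\n"; // 000000036142630d0a
```

bin2hex() is only there to make the bytes visible; the string itself is raw binary.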
If you were to write $binarystring to a file, it would be 9 bytes long: 4 bytes for the length (since it’s a 32-bit value), 3 bytes for the ASCII ‘aBc’, 1 byte for CR and 1 byte for LF.
With unpack() you have to add a key to each format code, since the output is an associative array. Take a look around in the PHP manual for more info about pack()/unpack().
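Here is roughly what the keys look like in practice. The key names (length, text, cr, lf) are made up for this sketch; any name works as long as each format code gets one:

```php
<?php
// The 9-byte layout from before: length, 3 characters, CR, LF.
$binarystring = pack('NA3CC', 3, 'aBc', 0x0D, 0x0A);

// Each format code is followed directly by its key; codes are separated by "/".
$fields = unpack('Nlength/A3text/Ccr/Clf', $binarystring);

print_r($fields);
// length => 3, text => aBc, cr => 13, lf => 10
```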
PNG Format:
The binary format of PNG files is documented on the internet. When viewing a PNG file in a hex viewer or editor, you will see that the first 8 bytes are always the same.
- The first byte is always 0x89.
- The 2nd to 4th bytes are the letters ‘PNG’ (or in hex: 0x50 0x4E 0x47)
- The 5th and 6th bytes are 0x0d and 0x0a, which together form a DOS line ending (CR LF)
- The 7th and 8th bytes are 0x1a (the DOS end-of-file character) and 0x0a (a Unix line ending)
So in order to check if a file is a valid PNG, we need to do the following:
- open the file (as binary)
- read the first 8 bytes
- unpack the bytes
- check if all entries are what we expect
open the file (as binary):
Notice the “b” in the file mode. This makes sure that the file is opened in binary mode.
read the first 8 bytes: $data will contain a binary string. You cannot really read it, so we have to unpack the data from it:
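The three steps so far (open, read, unpack) could look something like this. The temporary file is just a stand-in so the sketch runs on its own, and apart from highbit and signature the key names are my own choice:

```php
<?php
// Stand-in for a real image: a temporary file holding only the
// 8 signature bytes. With a real PNG you would skip these two lines.
$filename = tempnam(sys_get_temp_dir(), 'png');
file_put_contents($filename, "\x89PNG\r\n\x1a\n");

// "rb" = read-only, binary mode (no newline translation on Windows).
$fp = fopen($filename, 'rb');
if ($fp === false) {
    throw new RuntimeException("Cannot open $filename");
}

// Read the signature as a raw binary string.
$data = fread($fp, 8);
fclose($fp);
unlink($filename);

// C = unsigned byte, A3 = 3-character string, C2 = two unsigned bytes.
$header = unpack('Chighbit/A3signature/C2lineending/Ceof/Ceol', $data);
```

A repeat count such as C2 produces numbered keys, so the two line-ending bytes end up under lineending1 and lineending2.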
This would create a $header array with the following info:
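To make the contents concrete, this sketch unpacks the signature bytes directly and lists what ends up in the array. The format string is an assumption, so the key names other than highbit and signature may differ from yours:

```php
<?php
$header = unpack(
    'Chighbit/A3signature/C2lineending/Ceof/Ceol',
    "\x89PNG\r\n\x1a\n"
);

var_export($header);
// highbit     => 137   (0x89)
// signature   => 'PNG'
// lineending1 => 13    (0x0D)
// lineending2 => 10    (0x0A)
// eof         => 26    (0x1A)
// eol         => 10    (0x0A)
```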
As you can see, the first entry (highbit) is 137, which is the same as 0x89. The signature is a normal string containing ‘PNG’, and the other bytes should be the same as listed above.
As an example, the check only verifies that the highbit is actually 0x89 and that the signature is ‘PNG’. You should check the other bytes as well.
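A sketch of such a check, extended to the remaining bytes as suggested. isPngSignature() is a hypothetical helper name, not something PHP ships with:

```php
<?php
/**
 * Hypothetical helper: true when $data starts with the 8-byte PNG signature.
 */
function isPngSignature(string $data): bool
{
    if (strlen($data) < 8) {
        return false; // too short to hold a signature at all
    }
    $header = unpack('Chighbit/A3signature/C2lineending/Ceof/Ceol', $data);

    return $header['highbit'] === 0x89
        && $header['signature'] === 'PNG'
        && $header['lineending1'] === 0x0D
        && $header['lineending2'] === 0x0A
        && $header['eof'] === 0x1A
        && $header['eol'] === 0x0A;
}

var_dump(isPngSignature("\x89PNG\r\n\x1a\n")); // bool(true)
var_dump(isPngSignature("GIF89a\x01\x00"));     // bool(false)
```

If you only need a yes or no, comparing the first 8 bytes against the literal string "\x89PNG\r\n\x1a\n" does the same job; the unpacked version is here because it shows which byte is off.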
After the PNG header, you get blocks of data called “chunks”. Each chunk is formatted the same way:
- 4 bytes : chunk length
- 4 bytes : chunk type
- N bytes : chunk data
- 4 bytes : chunk CRC
Before reading the chunk data, we must read the chunk length. So the first thing we do is read the first 8 bytes of the chunk (2 dwords, actually): the length and the type.
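Reading those two dwords might look like this. The in-memory stream and the hand-built IHDR chunk are stand-ins so the sketch runs without a file, and the key names size and type are my own:

```php
<?php
// Stand-in input: the PNG signature followed by one IHDR chunk
// (1x1 pixels, 8-bit grayscale), held in a memory stream.
$ihdr = pack('NNC5', 1, 1, 8, 0, 0, 0, 0);
$png  = "\x89PNG\r\n\x1a\n"
      . pack('N', strlen($ihdr)) . 'IHDR' . $ihdr
      . pack('N', crc32('IHDR' . $ihdr));

$fp = fopen('php://memory', 'r+b');
fwrite($fp, $png);
fseek($fp, 8); // skip the 8-byte signature

// N  = unsigned 32-bit big-endian integer (PNG stores integers big-endian)
// a4 = 4-byte string, taken as-is
$chunk = unpack('Nsize/a4type', fread($fp, 8));
fclose($fp);

echo $chunk['type'], ' ', $chunk['size'], "\n"; // IHDR 13
```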
Now that the number of data bytes is known, we can read them as well:
Finally, we read the CRC and add it to the chunk array:
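Put together, reading one complete chunk could be sketched as below. readChunk() and the valid flag are hypothetical additions; the CRC comparison relies on PNG using the same CRC-32 that PHP's crc32() computes, calculated over the type and data fields:

```php
<?php
/**
 * Hypothetical helper: reads one chunk from $fp, or returns null at end of file.
 */
function readChunk($fp): ?array
{
    $head = fread($fp, 8);
    if ($head === false || strlen($head) < 8) {
        return null;
    }
    $chunk = unpack('Nsize/a4type', $head);

    // fread() rejects a length of 0, so empty chunks (such as IEND) are special-cased.
    $chunk['data'] = $chunk['size'] > 0 ? fread($fp, $chunk['size']) : '';

    // The CRC is one more big-endian dword; it covers type + data, not the length.
    $chunk['crc']   = unpack('Ncrc', fread($fp, 4))['crc'];
    $chunk['valid'] = $chunk['crc'] === crc32($chunk['type'] . $chunk['data']);

    return $chunk;
}

// Demo on a hand-built tEXt chunk held in memory.
$body = "Comment\0hello";
$fp   = fopen('php://memory', 'r+b');
fwrite($fp, pack('N', strlen($body)) . 'tEXt' . $body . pack('N', crc32('tEXt' . $body)));
rewind($fp);

$chunk = readChunk($fp);
$atEnd = readChunk($fp); // nothing left, so this is null
fclose($fp);

echo $chunk['type'], ' ', $chunk['size'], ' ', $chunk['valid'] ? 'ok' : 'bad', "\n"; // tEXt 13 ok
```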
Reading all chunks
Once you can read one chunk, you can read them all. Depending on the chunk type, you can actually unpack the data and display or use that information as well.
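As a rough sketch, a loop over all chunks that also unpacks the IHDR chunk could look like this. makeChunk() and the in-memory PNG are made up to give the loop something to read (there is no pixel data, so it is not a viewable image); the IHDR field layout follows the PNG specification:

```php
<?php
// Hypothetical helper, used only to build test input.
function makeChunk(string $type, string $data): string
{
    return pack('N', strlen($data)) . $type . $data . pack('N', crc32($type . $data));
}

// Stand-in for a real file: signature, IHDR (2x3 pixels, 8-bit RGB), a text chunk, IEND.
$png = "\x89PNG\r\n\x1a\n"
     . makeChunk('IHDR', pack('NNC5', 2, 3, 8, 2, 0, 0, 0))
     . makeChunk('tEXt', "Software\0example")
     . makeChunk('IEND', '');

$fp = fopen('php://memory', 'r+b');
fwrite($fp, $png);
fseek($fp, 8); // skip the signature

$types = [];
$ihdr  = null;
while (strlen($head = (string) fread($fp, 8)) === 8) {
    $chunk = unpack('Nsize/a4type', $head);
    $data  = $chunk['size'] > 0 ? fread($fp, $chunk['size']) : '';
    fseek($fp, 4, SEEK_CUR); // skip the CRC in this sketch

    $types[] = $chunk['type'];
    if ($chunk['type'] === 'IHDR') {
        $ihdr = unpack('Nwidth/Nheight/Cbitdepth/Ccolortype/Ccompression/Cfilter/Cinterlace', $data);
    }
    if ($chunk['type'] === 'IEND') {
        break; // IEND is always the last chunk
    }
}
fclose($fp);

echo implode(' ', $types), "\n";                 // IHDR tEXt IEND
echo $ihdr['width'], 'x', $ihdr['height'], "\n"; // 2x3
```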
Optimization
I did tell you at the beginning that I would share an optimization trick. So here it is:
As you might have seen, PNGs carry a lot of additional chunks with them, including things like the last time the file was written and a lot of text chunks. Since these chunks are not needed for displaying the PNG correctly, and they only take up space, you could write a program that removes these chunks from the PNG. Especially with small PNGs, this can save up to 50% of the file size! This is a trick that most image compressors use to get smaller files without changing even 1 byte of the actual image. Neat huh?
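A minimal sketch of such a program, assuming the whole PNG fits in memory. stripChunks() and makeChunk() are hypothetical helpers, and the default drop list is deliberately limited to the text and timestamp chunks (tEXt, zTXt, iTXt, tIME), because other optional chunks such as transparency or gamma can change how the image is displayed:

```php
<?php
/**
 * Hypothetical helper: returns $png without the chunk types listed in $drop.
 * All other chunks are copied byte for byte, CRC included.
 */
function stripChunks(string $png, array $drop = ['tEXt', 'zTXt', 'iTXt', 'tIME']): string
{
    $out    = substr($png, 0, 8); // keep the signature
    $offset = 8;
    $total  = strlen($png);

    while ($offset + 12 <= $total) {
        $head   = unpack('Nsize/a4type', substr($png, $offset, 8));
        $length = 12 + $head['size']; // length + type + data + CRC
        if (!in_array($head['type'], $drop, true)) {
            $out .= substr($png, $offset, $length);
        }
        $offset += $length;
    }
    return $out;
}

// Hypothetical helper, used only to build demo input.
function makeChunk(string $type, string $data): string
{
    return pack('N', strlen($data)) . $type . $data . pack('N', crc32($type . $data));
}

// Demo input built by hand: IHDR, one text chunk and IEND (no pixel data).
$png = "\x89PNG\r\n\x1a\n"
     . makeChunk('IHDR', pack('NNC5', 1, 1, 8, 0, 0, 0, 0))
     . makeChunk('tEXt', "Comment\0made by hand")
     . makeChunk('IEND', '');

$small = stripChunks($png);
echo strlen($png), ' -> ', strlen($small), "\n"; // 77 -> 45
```

For a real file you would feed it file_get_contents() of the image and write the result back with file_put_contents(); keep the original around until you have checked the output.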
Catches
Binary data is handled differently depending on your CPU. Most CPUs nowadays are little-endian (Intel’s x86 for instance), but there are still big-endian CPUs out there (68000, PowerPC etc). When reading a word or dword from binary data, make sure you know in which byte order the data was written, otherwise you might end up with incorrect data. PNG, for example, stores all its multi-byte integers big-endian, no matter which CPU wrote the file.
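A small demonstration of what goes wrong when you pick the wrong byte order. N (big-endian) and V (little-endian) are fixed-order format codes from the PHP manual; codes like L use whatever order the machine has, which is why they are avoided here:

```php
<?php
$value = 0x12345678;

echo bin2hex(pack('N', $value)), "\n"; // 12345678 (big-endian, "network" order)
echo bin2hex(pack('V', $value)), "\n"; // 78563412 (little-endian)

// The number 13 as a PNG file stores it: big-endian.
$bytes = "\x00\x00\x00\x0d";

$right = unpack('Nvalue', $bytes)['value'];
$wrong = unpack('Vvalue', $bytes)['value'];

echo $right, "\n"; // 13
echo $wrong, "\n"; // 218103808
```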
Especially when you want to write binary data, make sure you think of everything. Things can get very complicated and miswriting a single byte will corrupt your whole image. There are a lot of libraries out there that can do these things way better than you ever will, but that should not stop you from trying anyway :-)