You want to rewrite the contents of your GZIP files to remove the optional filename metadata. Some GZIP files contain filenames in their headers, which increase the file size with no benefit. Here we look at how you can rewrite some GZIP files using the C# programming language.
After researching this problem, I found the GZIP File Format Specification, which outlines the contents of every valid GZIP file's headers. From the specification, each file begins with specific bytes.
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
Illustration notes. The FLG byte above, the fourth byte, contains 8 bits that can each be set to 1 or 0 depending on what optional information is in the file header. If a bit is set to 1, the data for that bit is stored after this fixed 10-byte header, so the optional data begins at the 11th byte of the GZIP file (offset 10).
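To make the fixed layout concrete, here is a minimal sketch that checks the first bytes of a buffer and reads MTIME. The class and method names (GzipFixedHeader, LooksLikeGzip, ReadMTime) are hypothetical and exist only for this illustration. The magic values 0x1f and 0x8b and the deflate method value 8 come from the GZIP specification.

```csharp
using System;

static class GzipFixedHeader
{
    // Length of the fixed part of every GZIP header (ID1 through OS).
    public const int FixedLength = 10;

    // Returns true if the buffer starts with the GZIP magic bytes
    // and uses the deflate compression method.
    public static bool LooksLikeGzip(byte[] data)
    {
        return data != null
            && data.Length >= FixedLength
            && data[0] == 0x1f   // ID1
            && data[1] == 0x8b   // ID2
            && data[2] == 8;     // CM = 8 (deflate)
    }

    // Reads MTIME from bytes 4-7 (little-endian, seconds since the Unix epoch).
    // Assumes the caller has already checked the length with LooksLikeGzip.
    public static uint ReadMTime(byte[] data)
    {
        return (uint)data[4]
            | ((uint)data[5] << 8)
            | ((uint)data[6] << 16)
            | ((uint)data[7] << 24);
    }
}
```

Checking these bytes before touching the flag byte is a cheap way to avoid rewriting a file that is not GZIP at all.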
FLG (FLaGs)
This flag byte is divided into individual bits as follows:
bit 0 FTEXT
bit 1 FHCRC
bit 2 FEXTRA
bit 3 FNAME
bit 4 FCOMMENT
bit 5 reserved
bit 6 reserved
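bit 7 reserved
Bit descriptions. The flag byte, shown above, contains the FNAME bit at the fourth position, bit 3, which has the value 8. Therefore, if the FNAME bit is set to 1 and no other optional field precedes the name, we can remove the filename data starting at the 11th byte (offset 10) through the terminating null byte.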
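The bit list from the specification maps naturally onto a flags enum. This is only a sketch: the names GzipFlags and GzipFlagReader are made up for this example, and the code assumes the buffer already holds at least the first four bytes of the file.

```csharp
using System;

// Bit positions follow the FLG table in the GZIP specification.
[Flags]
enum GzipFlags : byte
{
    None = 0,
    FText = 1 << 0,     // bit 0, value 1
    FHcrc = 1 << 1,     // bit 1, value 2
    FExtra = 1 << 2,    // bit 2, value 4
    FName = 1 << 3,     // bit 3, value 8
    FComment = 1 << 4   // bit 4, value 16
}

static class GzipFlagReader
{
    // The flag byte is the fourth byte of the file (index 3).
    public static GzipFlags Read(byte[] data)
    {
        return (GzipFlags)data[3];
    }

    // True when the FNAME bit is set, whatever the other bits are.
    public static bool HasFileName(byte[] data)
    {
        return (Read(data) & GzipFlags.FName) != 0;
    }
}
```

A flag byte of exactly 8 means FName and nothing else, which is the simple case where the name starts directly at offset 10.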
/// <summary>
/// Rewrite the specified GZIP file and remove the file name bytes from its header.
/// These are not needed.
/// Namespaces: System, System.IO, System.Collections.Generic
/// </summary>
static void GZipRemoveFileName(string fn)
{
    // Read in the GZIP file.
    byte[] b = File.ReadAllBytes(fn);
    // Make sure this looks like GZIP (magic bytes 0x1f 0x8b) and
    // see if the file name flag (FNAME, value 8) is set.
    // We don't deal with the file if any other flags are set.
    if (b.Length > 10 && b[0] == 0x1f && b[1] == 0x8b && b[3] == 8)
    {
        // Allocate the copy bytes.
        List<byte> copy = new List<byte>(b.Length);
        // The flag byte will be set to 0.
        b[3] = 0;
        // Add the first ten bytes.
        for (int i = 0; i < 10; i++)
        {
            copy.Add(b[i]);
        }
        // Skip all non-null name characters.
        int a = 10;
        while (b[a] != 0)
        {
            a++;
        }
        // Skip the null terminator.
        a++;
        // Add the rest of the file.
        for (int i = a; i < b.Length; i++)
        {
            copy.Add(b[i]);
        }
        // Write the new byte array.
        File.WriteAllBytes(fn, copy.ToArray());
    }
    else
    {
        // Note that we couldn't rewrite the file.
        Console.WriteLine("Flag invalid: {0}", fn);
    }
}
Description. The above C# method receives the filename of a GZIP file and reads in its bytes with File.ReadAllBytes. It then tests the flag byte for 8. If the flag byte equals 8, bit 3 (FNAME) is set to 1 and every other flag is 0. This means we can remove the bytes from offset 10 through the terminating null.
Loop usage. The loops that follow simply copy the first 10 bytes, setting the fourth byte to 0, and then skip the filename bytes. Finally, the rest of the bytes are copied unchanged.
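The method above only handles a flag byte of exactly 8. As a sketch of how the same idea could be extended, the version below works on a byte array in memory, clears only the FNAME bit, and skips an FEXTRA field if one precedes the name. The class name GzipNameStripper is hypothetical. The field order (FEXTRA before FNAME) and the 2-byte little-endian XLEN are taken from the GZIP specification. I have not run this against my own files, and it deliberately leaves the data alone when FHCRC is set, since removing the name would invalidate the header CRC.

```csharp
using System;

static class GzipNameStripper
{
    const int FHCRC = 0x02, FEXTRA = 0x04, FNAME = 0x08;

    // Returns a copy of the GZIP data without the FNAME field, or the
    // original array when there is nothing to remove or the edit is unsafe.
    public static byte[] RemoveFileName(byte[] data)
    {
        if (data == null || data.Length < 10 || data[0] != 0x1f || data[1] != 0x8b)
            return data;

        byte flags = data[3];
        // No name stored, or a header CRC16 that the edit would invalidate.
        if ((flags & FNAME) == 0 || (flags & FHCRC) != 0)
            return data;

        int nameStart = 10;
        if ((flags & FEXTRA) != 0)
        {
            // XLEN is a 2-byte little-endian length followed by XLEN bytes.
            if (data.Length < 12) return data;
            int xlen = data[10] | (data[11] << 8);
            nameStart = 12 + xlen;
        }
        if (nameStart >= data.Length) return data;

        // Find the null terminator that ends the name.
        int nameEnd = Array.IndexOf(data, (byte)0, nameStart);
        if (nameEnd < 0) return data;

        // Copy everything before the name, then everything after the null.
        int skip = nameEnd + 1 - nameStart;
        byte[] result = new byte[data.Length - skip];
        Array.Copy(data, 0, result, 0, nameStart);
        Array.Copy(data, nameEnd + 1, result, nameStart, data.Length - nameEnd - 1);
        // Clear only the FNAME bit and keep the other flags.
        result[3] = (byte)(flags & ~FNAME);
        return result;
    }
}
```

Working on a byte array rather than a path keeps the logic easy to test. A caller could wrap it with File.ReadAllBytes and File.WriteAllBytes, writing only when the returned array differs from the input.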
This algorithm can be implemented in C, C++, Java, or practically anything that lets you check individual bytes. This article isn't about C# but rather the algorithm you can use to remove the filename, as well as the general structure of GZIP.
Some GZIP programs such as 7-Zip will leave the filename even if you don't want it. I haven't found a way to remove it in these programs.
This algorithm modifies the GZIP files in my website correctly, and it saves a handful of bytes in each file. In an archive with about 300 GZIP files, as well as some other content:
Before (with filenames):
5.56 MB (5,840,304 bytes)
After (no filenames):
5.56 MB (5,835,142 bytes)
Savings per file:
17 bytes
Here we saw an algorithm that uses the GZIP specification to rewrite GZIP files, saving an average of 17 bytes per file (5,162 bytes in total for this archive). The file name in GZIP is optional and often serves no purpose. Most importantly, we saw how to read the GZIP specification, and saw an example of how to use it to manipulate the bytes in a GZIP file. This helps programmers of any language.