Monday, June 18, 2007

Duplicate Images & MD5/SHA Hashes

I was in the middle of backing up some family photos when I discovered that I had multiple copies of the same image in different folders. Not only is this a waste of space, but it can also lead to images getting deleted by accident when you believe you have a copy in another folder but actually do not.

That raises the question: how in the world do you sort through thousands of photos, making sure that you keep one, and only one, copy of each? If you haven't renamed the files, you may be able to use the file name to weed out the duplicates, but what if you renamed some of the files, or have different files that share a name? Now it starts getting a lot more complicated.

There are some commercial products you can buy that will do some of this image verification, but I had much bigger plans. In addition to just removing duplicates, I wanted to be able to add tags to images, verify that images that were backed up did not become corrupted, and a host of other things that I will write about later. To start off with, I just wanted to remove duplicates. I decided to do that by creating a checksum for each file.

Think of a checksum as verification that the file is exactly as it is supposed to be. You will often find checksums on files that are being downloaded so that you can verify that the file received was not corrupted during the transfer, or by a cracker who decided to implant a virus in your download.
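
As a rough illustration of that download check, here is a minimal sketch that hashes a file and compares the result to a checksum published alongside it. The module name ChecksumCheck and the functions Sha256Hex and MatchesChecksum are made up for this example, and SHA-256 is just an assumed choice; use whichever algorithm the publisher actually lists.

```vbnet
Imports System
Imports System.IO
Imports System.Security.Cryptography
Imports System.Text

Module ChecksumCheck
    ' Returns the SHA-256 hash of a file as a lowercase hex string.
    ' (SHA-256 is an assumption for this sketch, not a requirement.)
    Function Sha256Hex(ByVal filePath As String) As String
        Using hasher As SHA256 = SHA256.Create()
            Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
                Dim sb As New StringBuilder()
                For Each b As Byte In hasher.ComputeHash(fs)
                    sb.Append(b.ToString("x2"))
                Next
                Return sb.ToString()
            End Using
        End Using
    End Function

    ' True when the file's hash matches the published value.
    ' The comparison ignores case and surrounding whitespace.
    Function MatchesChecksum(ByVal filePath As String, ByVal expectedHex As String) As Boolean
        Return String.Equals(Sha256Hex(filePath), expectedHex.Trim(), StringComparison.OrdinalIgnoreCase)
    End Function
End Module
```

If MatchesChecksum returns False, the file on disk is not byte-for-byte the file the publisher hashed.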

There are lots of different types of checksums. Three of the most popular are CRC32, MD5, and the different SHA functions. The issue with CRC32 is that collisions are very frequent compared to MD5 or the SHA hashes. In the way I am using hashes, a collision is when 2 different files produce the same hash. Since I will be removing any files with the same hash, a collision would result in the deletion of a file that isn't actually a duplicate.

That rules out CRC32, but what about MD5 and the SHA functions? MD5 hashes have been used for years (and still are) to store data like passwords for web sites. That is a bad idea, because MD5 hashes of passwords can be broken using techniques like plain old brute forcing or rainbow tables. But how well will an MD5 or SHA1 hash do at finding duplicate files? Very well, in fact. The probability of an accidental collision for a given MD5 or SHA1 hash is extremely low, though not impossible. Here is a great article on the subject. Generating MD5 or SHA1 hashes of files is also a pretty quick operation, so both are viable options.
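
To put rough numbers on "very frequent" versus "extremely low", here is a back-of-the-envelope estimate using the standard birthday approximation. It assumes that hash values behave like uniformly random b-bit numbers and that the collection holds n = 10,000 files; both are assumptions chosen for illustration, not measurements.

```latex
\begin{align*}
P(\text{at least one collision})
  &\approx 1 - \exp\!\left(-\frac{n(n-1)}{2 \cdot 2^{b}}\right)
   \approx \frac{n^{2}}{2^{b+1}}
   \qquad (n^{2} \ll 2^{b}) \\[4pt]
\text{CRC32 } (b = 32),\ n = 10^{4}: \quad
  P &\approx \frac{10^{8}}{2^{33}} \approx 1.2 \times 10^{-2} \\
\text{MD5 } (b = 128),\ n = 10^{4}: \quad
  P &\approx \frac{10^{8}}{2^{129}} \approx 1.5 \times 10^{-31} \\
\text{SHA-1 } (b = 160),\ n = 10^{4}: \quad
  P &\approx \frac{10^{8}}{2^{161}} \approx 3.4 \times 10^{-41}
\end{align*}
```

Under those assumptions, CRC32 has about a 1% chance of producing at least one false match somewhere in the collection, while the MD5 and SHA1 figures are vanishingly small.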

When you get down to it though, I don't like living with an error rate that I can describe without using decimal notation. So since both MD5 and SHA1 checksums can be generated in a second or two, even on a slow machine, why not generate both for each file? The odds of both an MD5 and a SHA1 hash returning identical values for different files are roughly the same as every atom in your body deciding to rearrange itself on Mars. That is an error rate I can live with.

Generating all these hashes manually is a crazy proposition. Thankfully, the .NET Framework that ships with Visual Studio provides classes for creating the most popular types of hashes through the System.Security.Cryptography namespace. Using this, I was able to create a quick function that would generate an MD5 or any of the SHA hashes for a given file. That way, I could create a quick script to loop through all the image files and save their critical data (file location, name, and hashes) to a database. Then, it is a simple matter to write a query that returns only unique files, and save those off somewhere while removing the duplicates.
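
The loop-and-query idea can be sketched roughly as follows. This is not the script described above: it swaps the database for an in-memory Dictionary, uses its own small helper rather than the GetHashForFile function introduced further down, and the names DuplicateFinder, HexHash, and FindDuplicates are made up for the example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography
Imports System.Text

Module DuplicateFinder
    ' Hex-encodes the hash of a file using the supplied algorithm.
    Function HexHash(ByVal filePath As String, ByVal hasher As HashAlgorithm) As String
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            Dim sb As New StringBuilder()
            For Each b As Byte In hasher.ComputeHash(fs)
                sb.Append(b.ToString("x2"))
            Next
            Return sb.ToString()
        End Using
    End Function

    ' Groups files under rootFolder by a combined "MD5:SHA1" key and
    ' returns only the groups that contain more than one file.
    Function FindDuplicates(ByVal rootFolder As String, ByVal pattern As String) As Dictionary(Of String, List(Of String))
        Dim groups As New Dictionary(Of String, List(Of String))
        Using md5Hasher As MD5 = MD5.Create()
            Using sha1Hasher As SHA1 = SHA1.Create()
                For Each filePath As String In Directory.GetFiles(rootFolder, pattern, SearchOption.AllDirectories)
                    Dim key As String = HexHash(filePath, md5Hasher) & ":" & HexHash(filePath, sha1Hasher)
                    If Not groups.ContainsKey(key) Then groups(key) = New List(Of String)
                    groups(key).Add(filePath)
                Next
            End Using
        End Using

        ' Keep only the keys shared by two or more files.
        Dim duplicates As New Dictionary(Of String, List(Of String))
        For Each pair As KeyValuePair(Of String, List(Of String)) In groups
            If pair.Value.Count > 1 Then duplicates.Add(pair.Key, pair.Value)
        Next
        Return duplicates
    End Function
End Module
```

Calling FindDuplicates("C:\Photos", "*.jpg") (a hypothetical folder) would return one entry per set of identical files, with the list holding every path in that set. Deciding which copy to keep is left to the caller.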

The VB.NET function below, called GetHashForFile, takes a file location and an enum value that represents the hash type to generate. In order to generate both MD5 and SHA1 hashes, just call it twice with different enum values. I also added the SHA2 variants to the function in case someone wishes to generate SHA2 hashes.

The function needs 3 namespaces to be imported in order to compile. If you have not imported them already, add these lines to the top of your module or code page.

Imports System.IO
Imports System.Text
Imports System.Security.Cryptography


In order to make the function as reusable as possible, an enum was created so that you can't forget which hash types can be created. Add this enum to your code page or module.

Enum HashType
    MD5 = 1
    SHA1 = 2
    SHA256 = 3
    SHA384 = 4
    SHA512 = 5
End Enum


Then, add the function itself. The way it is currently written, if there is an error while generating the hash for a file (like trying to hash a file that doesn't exist), an empty string is returned instead of an error being raised. If you want to raise an error that can be handled in the calling code, uncomment the lines that start with "Err.Raise".


Public Function GetHashForFile(ByVal Filepath As String, ByVal DataHashStandard As HashType) As String
    'Returns the requested hash of a file as a lowercase hex string,
    'or an empty string if the hash could not be generated.

    'Check that the file exists.
    If Not File.Exists(Filepath) Then
        'Err.Raise(vbObjectError + 13131, , "File " & Filepath & " does not exist") 'uncomment this line if you want to raise an error instead of returning an empty string
        Return ""
    End If

    Try
        'Pick the hash algorithm. The Create() factory methods are used so the
        'code does not depend on a specific provider class being available.
        Dim HashProvider As HashAlgorithm
        Select Case DataHashStandard
            Case HashType.MD5 'MD5 (128 bit) hash
                HashProvider = MD5.Create()
            Case HashType.SHA1 'SHA1 (160 bit) hash
                HashProvider = SHA1.Create()
            Case HashType.SHA256 'SHA2 256 bit hash
                HashProvider = SHA256.Create()
            Case HashType.SHA384 'SHA2 384 bit hash
                HashProvider = SHA384.Create()
            Case HashType.SHA512 'SHA2 512 bit hash
                HashProvider = SHA512.Create()
            Case Else
                'Err.Raise(vbObjectError + 13132, , "Data hash standard " & DataHashStandard.ToString() & " is not valid") 'uncomment this line if you want to raise an error instead of returning an empty string
                Return ""
        End Select

        'Open the file read-only and compute the hash.
        'The Using block closes the file even if hashing fails.
        Dim hash() As Byte
        Using fs As New FileStream(Filepath, FileMode.Open, FileAccess.Read, FileShare.Read)
            hash = HashProvider.ComputeHash(fs)
        End Using

        'Turn the byte array into a hex string.
        Dim sb As New StringBuilder(hash.Length * 2)
        For Each b As Byte In hash
            sb.Append(b.ToString("x2"))
        Next
        Return sb.ToString()
    Catch ex As Exception
        'Err.Raise(vbObjectError + 13133, , "Hashing failed with error " & ex.Message) 'uncomment this line if you want to raise an error instead of returning an empty string
        Return "" 'return an empty string on error instead of raising an error
    End Try
End Function


Once all 3 sections are in a code page or module, you can use the function like this. Replace 'c:\config.sys' with the file name you want to hash.

msgbox(GetHashForFile("c:\config.sys", HashType.MD5)) 'pop messagebox with md5 hash
msgbox(GetHashForFile("c:\config.sys", HashType.SHA1)) 'pop messagebox with SHA1 hash
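
Calling the function twice reads the file from disk twice. For large photo collections, one possible alternative is to read each file once and feed every chunk to both algorithms. The sketch below shows that idea using the TransformBlock and TransformFinalBlock methods of HashAlgorithm; the names SinglePassHashes, ToHex, and GetMd5AndSha1 and the 64 KB buffer size are assumptions for this example, and unlike GetHashForFile it lets exceptions propagate to the caller.

```vbnet
Imports System
Imports System.IO
Imports System.Security.Cryptography
Imports System.Text

Module SinglePassHashes
    ' Converts a byte array to a lowercase hex string.
    Function ToHex(ByVal bytes As Byte()) As String
        Dim sb As New StringBuilder(bytes.Length * 2)
        For Each b As Byte In bytes
            sb.Append(b.ToString("x2"))
        Next
        Return sb.ToString()
    End Function

    ' Reads the file once and returns a two-element array: {MD5 hex, SHA1 hex}.
    Function GetMd5AndSha1(ByVal filePath As String) As String()
        Using md5Hasher As MD5 = MD5.Create()
            Using sha1Hasher As SHA1 = SHA1.Create()
                Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
                    Dim chunk(65535) As Byte ' 64 KB read buffer (arbitrary size)
                    Dim bytesRead As Integer = fs.Read(chunk, 0, chunk.Length)
                    Do While bytesRead > 0
                        ' Feed the same chunk to both algorithms. Passing the input
                        ' buffer as the output buffer avoids an extra copy.
                        md5Hasher.TransformBlock(chunk, 0, bytesRead, chunk, 0)
                        sha1Hasher.TransformBlock(chunk, 0, bytesRead, chunk, 0)
                        bytesRead = fs.Read(chunk, 0, chunk.Length)
                    Loop
                    ' Finish both hashes with an empty final block.
                    md5Hasher.TransformFinalBlock(chunk, 0, 0)
                    sha1Hasher.TransformFinalBlock(chunk, 0, 0)
                    Return New String() {ToHex(md5Hasher.Hash), ToHex(sha1Hasher.Hash)}
                End Using
            End Using
        End Using
    End Function
End Module
```

The two strings it returns should match what GetHashForFile produces for HashType.MD5 and HashType.SHA1 on the same file.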