Simple Photo De-duplication with PowerShell

Published on Monday, November 2, 2015

Photo by Jan Loyde Cabrera on Unsplash

After several failed attempts and false starts over the course of the last decade or so, this summer I vowed to finally get my wife's photo library and my own properly merged. We each had a large library of photos from events and vacations, and the two libraries overlapped heavily but not completely.

This meant I needed to examine two large collections of photos and copy all of the non-duplicate photos from one to the other. And I couldn't rely on file names or paths at all, because my wife is good about renaming and organizing her photos, while I am not. So I whipped up a PowerShell script to gather the SHA-1 hash of every file; by comparing the hashes, I could find and ignore the duplicates. It's not perfect: it only finds pictures which are exact, byte-for-byte duplicates. If my collection has the original and my wife's collection just has the "red-eye reduction" version, we'll end up with both in the final collection. But it considerably reduced the amount of de-duplication work we had to do by hand.

Here's the function which actually gathers all the photo data for a folder (and all its child folders):

function Get-PhotoData {
    param([string]$path)

    # -File skips any folders whose names happen to end in .jpg
    $files = Get-ChildItem $path -Recurse -File -Filter *.jpg

    $total = ($files | measure).Count

    # Emitting each object to the pipeline avoids re-copying an array on every +=
    $files | % {$i=1} {
        Write-Host "Processing $($_.FullName) ($i of $total)"

        # -LiteralPath keeps [ and ] in file names from being read as wildcards
        New-Object PSObject -Property @{
            Name = $_.Name
            Path = $_.FullName
            Size = $_.Length
            Hash = (Get-FileHash -LiteralPath $_.FullName -Algorithm SHA1).Hash
        }
        $i++
    }
}
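For anyone who would rather gather the same data in Python, a rough equivalent might look like the following. This is a sketch under assumptions, not the script used for the merge: the names `sha1_of_file` and `get_photo_data` are made up for illustration, the digest is upper-cased only so that it matches the format Get-FileHash reports, and the extension match is done case-insensitively.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the uppercase hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large photos never sit fully in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def get_photo_data(root, extension=".jpg"):
    """Walk root recursively and return one record per matching file."""
    records = []
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            if not filename.lower().endswith(extension):
                continue
            full_path = os.path.join(folder, filename)
            records.append({
                "Name": filename,
                "Path": full_path,
                "Size": os.path.getsize(full_path),
                "Hash": sha1_of_file(full_path),
            })
    return records
```

Each record carries the same four fields as the PowerShell objects, so the rest of the workflow would not need to change.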

I ran that function against my wife's photo collection (which was effectively our master collection) and my own:

$master = Get-PhotoData -path $masterPath
$toMerge = Get-PhotoData -path $toMergePath

In theory, I could then have used Compare-Object to figure out which items in my collection were safe to delete (i.e., they already existed in my wife's collection):

$safeToDelete = Compare-Object -IncludeEqual -ExcludeDifferent -ReferenceObject $toMerge -DifferenceObject $master -Property Hash -PassThru | Select-Object -ExpandProperty Path

This would give me a list of paths to photos in my collection whose SHA-1 hash matched that of a photo already in my wife's collection.
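The matching logic itself is just a set-membership test on the hashes. The Python sketch below illustrates the idea on made-up data; it is not a description of how Compare-Object works internally, and the function name `paths_already_in`, the sample paths, and the hash values are all hypothetical.

```python
def paths_already_in(to_merge, master):
    """Return paths from to_merge whose hash also appears in master."""
    # Building the set once makes each lookup roughly constant time
    master_hashes = {record["Hash"] for record in master}
    return [r["Path"] for r in to_merge if r["Hash"] in master_hashes]


# Hypothetical records; real hashes would be 40-character SHA-1 digests
master = [
    {"Path": "master/beach.jpg", "Hash": "AAA"},
    {"Path": "master/cake.jpg", "Hash": "BBB"},
]
to_merge = [
    {"Path": "mine/IMG_0001.jpg", "Hash": "AAA"},
    {"Path": "mine/IMG_0002.jpg", "Hash": "CCC"},
]

print(paths_already_in(to_merge, master))  # ['mine/IMG_0001.jpg']
```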

Or, I could find the list of items in my collection which were missing from her collection:

$toMove = Compare-Object -ReferenceObject $master -DifferenceObject $toMerge -Property Hash -PassThru | ? { $_.SideIndicator -eq '=>' } | Select-Object -ExpandProperty Path

Moving each item in that collection would then be easy:

$total = ($toMove | measure).Count

$toMove | % {$i=1} {
    Write-Host "Moving $_ ($i of $total)"
    Move-Item -LiteralPath $_ -Destination "[destination path]"
    $i++
}

For small enough collections, this works great. But if you've got a large enough photo collection you might start running into performance problems with Compare-Object. If that's the case, with a little extra effort and a little bit of Python you can figure out your $safeToDelete list much faster. First, we dump our photo data to a couple of files:

$master | Select Name, Path, Size, Hash | Export-Csv -NoTypeInformation "master.csv"
$toMerge | Select Name, Path, Size, Hash | Export-Csv -NoTypeInformation "toMerge.csv"

Now we throw together a quick Python program using pandas to read in the two data sets, merge them into a single data set by matching on the file size and hash, and dump the output to another file:

import pandas as pd

# Read in our two photo data files
master = pd.read_csv('../master.csv', header = 0)
toMerge = pd.read_csv('../toMerge.csv', header = 0)

both = pd.merge(toMerge, master, on=['Size', 'Hash'])
both.to_csv('../safeToDelete.csv')

The new .csv file will have columns 'Path_x' and 'Path_y'; since toMerge was the first argument to merge, Path_x holds the paths from that collection, all of which can be deleted. More Python-savvy folks than me can probably handle the deletion straight from the Python script, but I just did it with PowerShell:

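# A photo that matches several master files appears once per match, so de-duplicate the list
$toDelete = (Import-Csv .\safeToDelete.csv).Path_x | Sort-Object -Unique

$total = ($toDelete | measure).Count

$toDelete | % {$i=1} {
    Write-Host "Deleting $_ ($i of $total)"
    Remove-Item -LiteralPath $_
    $i++
}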
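For those who do want to handle the deletion from Python, one possible shape is sketched below using only the standard library rather than pandas. Everything here is an assumption made for illustration: the function names `load_keys` and `delete_duplicates` are hypothetical, the CSV files are assumed to have the Name, Path, Size and Hash columns exported earlier, and `utf-8-sig` is used so that a byte-order mark, if the export wrote one, is ignored. It defaults to a dry run that only reports what would be removed.

```python
import csv
import os


def load_keys(csv_path):
    """Return the set of (Size, Hash) pairs found in a photo-data CSV."""
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return {(row["Size"], row["Hash"]) for row in csv.DictReader(handle)}


def delete_duplicates(to_merge_csv, master_csv, dry_run=True):
    """Return paths in to_merge_csv whose (Size, Hash) is in master_csv.

    The matching files are removed from disk only when dry_run is False.
    """
    master_keys = load_keys(master_csv)
    matched = []
    with open(to_merge_csv, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            if (row["Size"], row["Hash"]) not in master_keys:
                continue
            # Skip files that are already gone instead of raising an error
            if not dry_run and os.path.exists(row["Path"]):
                os.remove(row["Path"])
            matched.append(row["Path"])
    return matched
```

Calling `delete_duplicates("toMerge.csv", "master.csv")` first and reading through the returned list gives a chance to spot surprises before running it again with `dry_run=False`.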

Of course, don't go running any of this code or deleting any files until you've backed your folders up somewhere safe; if you make any mistakes (or any of my code is totally broken), you'll want a safety net in place.