findimagedupes 0.1.3 Copyright 2001 Rob Kudla <webmaster@kudla.org>
This program is distributed under the GNU General Public License; see the file COPYING for details. (Note especially the part about NO WARRANTY. You run findimagedupes at your own risk.)
findimagedupes is a crude command-line utility for finding visually similar images. With it you can compare two images and report the percentage of similarity, or compare an entire tree, reporting all likely duplicates based on a percentage of similarity you select. It optionally exports GQView collection files.
It can handle all image types understood by ImageMagick, but currently is limited to those types recognized as images/bitmaps by whatever version of "file" is installed on your system. This is due to PerlMagick's desire to treat every imaginable file type as an image, with tragic results.
I have nothing to do with GQView development myself, but the collections that findimagedupes creates work properly in GQView 0.91. This functionality is provided so that there's some easy way to visually compare duplicates; using GQView you can quickly delete the poorer quality duplicate. I plan to write a GUI front-end as I did with kcdfind, which in this case will allow you to view and manage duplicates as they're found. Ultimately, this functionality should be integrated into image management programs like GQView and Pixie. (I think it's on GQView's todo list already.)
The algorithm is monkeys-with-typewriters simple: reduce images to a uniform size, shape and palette; expand the histogram as much as possible; reduce further to a 16x16x1 bitmap; compare each pair of pictures and count the number of bits in an identical state in both. The algorithm is (hopefully) commented usefully in the code.
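The final comparison step can be sketched as follows. This is only an illustration in Python, not the program's own Perl code. It assumes each fingerprint is already available as 32 raw bytes (16x16 one-bit pixels); the function name `similarity` is made up for this example, and the resizing and histogram steps are not shown.

```python
def similarity(fp_a: bytes, fp_b: bytes) -> float:
    """Return the percentage of bits that are in the same state in both fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must be the same length")
    total_bits = len(fp_a) * 8
    # XOR leaves a 1 bit wherever the two fingerprints disagree.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(fp_a, fp_b))
    return 100.0 * (total_bits - differing) / total_bits


if __name__ == "__main__":
    a = bytes(32)                  # all-zero 256-bit fingerprint
    b = bytes([0xFF]) + bytes(31)  # differs from a in exactly 8 bits
    print(similarity(a, a))        # 100.0
    print(similarity(a, b))        # 96.875
```

A result at or above the chosen threshold (90 by default) would count as a likely duplicate.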
Despite its crudeness, it seems to be something like 98% accurate on most common images (people, animals, cartoons, pr0n, etc.). It does spectacularly poorly on images with minuscule differences, like 10 different shots of a sunset over the ocean, or the two presidential candidates in the last US election.
I've recently added a "GUI mode" which should allow other programs to interface with findimagedupes. In theory it prints nothing except status info and a line for each dupe it finds. The output is formatted like this:
Status::<number of current image>::<total number of images>::<percent>
Dupe::<filename1>::<filename2>::<percent similarity>
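A front-end could read this output with something like the sketch below. It is a hypothetical Python helper, not part of findimagedupes. It assumes the numeric fields are plain numbers (a trailing "%" is tolerated just in case) and that filenames do not themselves contain "::", which this format cannot disambiguate.

```python
def parse_gui_line(line: str):
    """Parse one line of GUI-mode output into a tuple, or return None if unrecognized."""
    fields = line.rstrip("\n").split("::")
    if len(fields) != 4:
        return None
    kind = fields[0]
    if kind == "Status":
        # Status::<current image>::<total images>::<percent>
        return ("status", int(fields[1]), int(fields[2]), float(fields[3].rstrip("%")))
    if kind == "Dupe":
        # Dupe::<filename1>::<filename2>::<percent similarity>
        return ("dupe", fields[1], fields[2], float(fields[3].rstrip("%")))
    return None


if __name__ == "__main__":
    print(parse_gui_line("Status::3::120::2"))
    print(parse_gui_line("Dupe::a.jpg::b.jpg::97"))
```

Lines that do not match either shape come back as None, so a caller can simply skip them.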
This program seems stable, but should be considered a development release. In particular, the processing time rises roughly with the square of the number of images in a tree, since every pair of images is compared. I am not enough of a math or data structures guy to know how to speed up the binary comparison and bit counting of, say, 20,000 elements of 32 bytes of data, all resident in memory. (PerlMagick also seems not to let go of its memory when you undef images out of a set.) Suggestions would be welcome.
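One cheap way to cut down the pairwise work is the idea noted in the 0.1.3 history entry: two fingerprints can only fall within the threshold if their counts of 1 bits are close enough. The Python sketch below illustrates that idea and is not the program's actual code. The names `find_dupes`, `fingerprints` and `bits` are assumptions for the example, and the worst case is still quadratic.

```python
import math


def find_dupes(fingerprints, threshold=90.0, bits=256):
    """Return (name_a, name_b, percent) for every pair at or above threshold.

    fingerprints maps a name to its raw fingerprint bytes (bits // 8 bytes each).
    """
    # Largest number of differing bits that still meets the threshold.
    max_diff = math.floor(bits * (100.0 - threshold) / 100.0)
    # Convert each fingerprint to an integer once and sort by its count of 1 bits.
    entries = sorted(
        (bin(int.from_bytes(fp, "big")).count("1"), name, int.from_bytes(fp, "big"))
        for name, fp in fingerprints.items()
    )
    results = []
    for i, (ones_a, name_a, a) in enumerate(entries):
        for ones_b, name_b, b in entries[i + 1:]:
            # If the 1-bit counts differ by more than max_diff, so must the bits.
            if ones_b - ones_a > max_diff:
                break  # sorted order: every later entry is even further away
            diff = bin(a ^ b).count("1")
            if diff <= max_diff:
                results.append((name_a, name_b, 100.0 * (bits - diff) / bits))
    return results


if __name__ == "__main__":
    fps = {"a": bytes(32), "b": bytes([0xFF]) + bytes(31), "c": bytes([0xFF] * 32)}
    print(find_dupes(fps))  # [('a', 'b', 96.875)]
```

The prefilter never discards a true match, because the difference in 1-bit counts is a lower bound on the number of differing bits.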
- Requirements
perl - language this is written in
ImageMagick - library for manipulating images
PerlMagick (Image::Magick) - Perl interface to above
pwd, find, sort, tput (curses), file
(i.e. if this works right under NT I'd be surprised)
A bunch of pictures of which you've totally lost control
Non-standard packages required for this to run should be available via my perl page at:
http://www.kudla.org/raindog/perl
If you'd like, get GQview at:
Usage: findimagedupes [options] [<file1> <file2>]
Options:
-rescan = rescan fingerprints of all files in directory
-f <file> = use <file> as image fingerprint database
-d <dir> = scan <dir> instead of current directory
-t <num> = use <num> as threshold% of similarity (default 90)
-v <program> = launch <program> (in bg) to view each set of dupes
-c <file> = create GQView collection <file>.gqv of duplicates
<file1> <file2> = diff just those two files, using -v if present
(other options ignored if files are specified)
-p = only valid when files specified; prints the
hex of the actual fingerprint of each file.
-g = GUI mode: produce only machine-friendly output.
- History
0.1.3 11 February 2001
Due to a problem with ImageMagick 5.2.3 on my machine,
now uses raw 'mono' format for thumbprints instead
of 'pbm'. This returns identical data minus the
header stuff we were discarding anyway.
Applied patch from Max Stekelenburg:
Program would sometimes try to scan text files due
to a misplaced if.
Stomped on some uninitialized value warnings.
Applied patch from Paul Cassella:
Improved performance by only comparing images whose
number of 1 bits would allow them to fall within
the threshold.
Sped up bit counting routine using pack once per
image rather than a loop per comparison.
Eliminated non-regular-files from the scan.
Stomped on some uninitialized value warnings.
Bugfix: Will again rescan automatically if db file not
present or is zero-length.
Added -g option, "GUI mode", to format output for use
by GUIs and other programs.
0.1.2 30 September 2000
Changed algorithm to use an 8x8x8 thumbprint instead of
16x16x1. It failed disastrously, so changed back.
Cleaned up the code a little for public release.
Minor bug fixes I didn't bother to document.
First public release.
0.1.1 24 September 2000
Added "compare 2 files at a time".
Added GQview collection generation.
Fixed bug in binary comparison; huge jump in accuracy.
Generalized (sorta) the thumbprint generation routine.
0.1.0 17 September 2000
First working version, perhaps 75% accurate.
