
findimagedupes 0.1.3 Copyright 2001 Rob Kudla <webmaster@kudla.org>

This program is distributed under the GNU General Public License; see the file COPYING for details. (Note especially the part about NO WARRANTY. You run findimagedupes at your own risk.)

findimagedupes is a crude command-line utility for finding visually similar images. With it you can compare two images and report the percentage of similarity, or compare an entire tree, reporting all likely duplicates based on a percentage of similarity you select. It optionally exports GQView collection files.

It can handle all image types understood by ImageMagick, but currently is limited to those types recognized as images/bitmaps by whatever version of "file" is installed on your system. This is due to PerlMagick's desire to treat every imaginable file type as an image, with tragic results.

I have nothing to do with GQView development myself, but the collections that findimagedupes creates work properly in GQView 0.91. This functionality is provided so that there's some easy way to visually compare duplicates; using GQView you can quickly delete the poorer quality duplicate. I plan to write a GUI front-end as I did with kcdfind, which in this case will allow you to view and manage duplicates as they're found. Ultimately, this functionality should be integrated into image management programs like GQView and Pixie. (I think it's on GQView's todo list already.)

The algorithm is monkeys-with-typewriters simple: reduce images to a uniform size, shape and palette; expand the histogram as much as possible; reduce further to a 16x16x1 bitmap; compare each pair of pictures and count the number of bits in an identical state in both. The algorithm is (hopefully) commented usefully in the code.
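The last two steps can be sketched in a few lines. This is a minimal illustration in Python, not the program's Perl code: it assumes the image has already been reduced to 256 grayscale values (16x16, row by row), uses a simple min/max contrast stretch and a midpoint cutoff as stand-ins for whatever ImageMagick does, and the names fingerprint_from_gray and similarity_percent are made up for this example.

```python
FINGERPRINT_BITS = 16 * 16  # a 16x16x1 bitmap is 256 bits, i.e. 32 bytes


def fingerprint_from_gray(pixels):
    """Turn 256 grayscale values (0..255) into a 32-byte fingerprint.

    Assumption: contrast is stretched linearly from min..max to 0..255,
    then each pixel becomes a 1 bit if it lands in the upper half.
    """
    if len(pixels) != FINGERPRINT_BITS:
        raise ValueError("expected 256 grayscale values")
    lo, hi = min(pixels), max(pixels)
    span = hi - lo
    value = 0
    for p in pixels:
        stretched = (p - lo) * 255 // span if span else 0
        value = (value << 1) | (1 if stretched >= 128 else 0)
    return value.to_bytes(FINGERPRINT_BITS // 8, "big")


def similarity_percent(fp_a, fp_b):
    """Percentage of bit positions in the same state in both fingerprints."""
    size = FINGERPRINT_BITS // 8
    if len(fp_a) != size or len(fp_b) != size:
        raise ValueError("fingerprints must be 32 bytes each")
    # XOR leaves a 1 bit wherever the two fingerprints differ.
    diff = int.from_bytes(fp_a, "big") ^ int.from_bytes(fp_b, "big")
    differing = bin(diff).count("1")
    return 100.0 * (FINGERPRINT_BITS - differing) / FINGERPRINT_BITS
```

With this sketch, a washed-out copy of a picture (same shapes, lower contrast) produces the same fingerprint as the original, which is the point of expanding the histogram before reducing to one bit per pixel.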

Despite its crudeness, it seems to be something like 98% accurate on most common images (people, animals, cartoons, pr0n, etc.). It does spectacularly poorly on images with minuscule differences, like 10 different shots of a sunset over the ocean, or the two presidential candidates in the last US election.

I've recently added a "GUI mode" which should allow other programs to interface with findimagedupes. In theory it prints nothing except status lines and a line for each pair of dupes it finds. The lines are formatted like this:

Status::<number of current image>::<total number of images>::<percent>
Dupe::<filename1>::<filename2>::<percent similarity>
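A front-end could read these lines with something like the following Python sketch. It assumes one record per line, fields separated by "::", filenames that do not themselves contain "::", and percentages printed as plain numbers (a trailing "%" is tolerated). None of that is guaranteed by the format description above, and the names Status, Dupe and parse_gui_line are hypothetical.

```python
from typing import NamedTuple, Optional, Union


class Status(NamedTuple):
    current: int
    total: int
    percent: float


class Dupe(NamedTuple):
    file1: str
    file2: str
    similarity: float


def _to_percent(text: str) -> float:
    # Assumption: the percentage is a plain number, possibly followed by "%".
    return float(text.strip().rstrip("%"))


def parse_gui_line(line: str) -> Optional[Union[Status, Dupe]]:
    """Parse one line of GUI-mode output; return None for anything else."""
    fields = line.rstrip("\n").split("::")
    if len(fields) != 4:
        return None
    try:
        if fields[0] == "Status":
            return Status(int(fields[1]), int(fields[2]), _to_percent(fields[3]))
        if fields[0] == "Dupe":
            return Dupe(fields[1], fields[2], _to_percent(fields[3]))
    except ValueError:
        # Numbers were not in the assumed form; treat the line as unknown.
        return None
    return None
```

Returning None for unrecognized lines, rather than raising, keeps a front-end alive if the program prints something unexpected.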

This program seems stable, but should be considered a development release. In particular, the processing time rises roughly with the square of the number of images in a tree, since every pair of images is compared. I am not enough of a math or data structures guy to know how to speed up the binary comparison and bit counting of, say, 20,000 elements of 32 bytes of data, all resident in memory. (PerlMagick also seems not to let go of its memory when you undef images out of a set.) Suggestions would be welcome.
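One way to cut down the pairwise work is the idea credited to Paul Cassella in the History section: count the 1 bits of each fingerprint once, and only compare images whose counts are close enough that they could reach the threshold. If two fingerprints have p and q bits set, they differ in at least |p - q| positions, so any pair whose counts differ by more than the allowed number of differing bits can be skipped unseen. The Python below is a sketch of that idea under the same 32-byte fingerprint assumption, not a port of the actual patch; find_dupes and popcount are illustrative names.

```python
FINGERPRINT_BITS = 256  # 32-byte fingerprints


def popcount(value: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def find_dupes(fingerprints, threshold=90.0):
    """Return (name_a, name_b, percent) for pairs at or above the threshold.

    fingerprints maps a file name to its 32-byte fingerprint.
    """
    # Largest number of differing bits a pair may have and still qualify.
    max_diff = int(FINGERPRINT_BITS * (100.0 - threshold) / 100.0)

    # Count the 1 bits once per image, then sort by that count.
    entries = sorted(
        (popcount(int.from_bytes(fp, "big")), name, int.from_bytes(fp, "big"))
        for name, fp in fingerprints.items()
    )

    dupes = []
    for i, (count_a, name_a, bits_a) in enumerate(entries):
        for count_b, name_b, bits_b in entries[i + 1:]:
            # Counts only grow from here on, so no later entry can match.
            if count_b - count_a > max_diff:
                break
            differing = popcount(bits_a ^ bits_b)
            if differing <= max_diff:
                percent = 100.0 * (FINGERPRINT_BITS - differing) / FINGERPRINT_BITS
                dupes.append((name_a, name_b, percent))
    return dupes
```

The worst case is still quadratic (for example, when every image has about the same number of 1 bits), so this reduces the work on typical collections rather than changing how it scales.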

Requirements

perl - language this is written in
ImageMagick - library for manipulating images
PerlMagick (Image::Magick) - Perl interface to above
pwd, find, sort, tput (curses), file
   (i.e. if this works right under NT I'd be surprised)
A bunch of pictures of which you've totally lost control

Non-standard packages required for this to run should be available via my perl page at:

http://www.kudla.org/raindog/perl

If you'd like, get GQview at:

http://gqview.sourceforge.net

Usage: findimagedupes [options] [<file1> <file2>]

Options:

       -rescan         = rescan fingerprints of all files in directory
       -f <file>       = use <file> as image fingerprint database
       -d <dir>        = scan <dir> instead of current directory
       -t <num>        = use <num> as threshold% of similarity (default 90)
       -v <program>    = launch <program> (in bg) to view each set of dupes
       -c <file>       = create GQView collection <file>.gqv of duplicates
       <file1> <file2> = diff just those two files, using -v if present
                         (other options ignored if files are specified)
       -p              = only valid when files specified; prints the
                         hex of the actual fingerprint of each file.
       -g              = GUI mode: produce only machine-friendly output.

History

0.1.3 11 February 2001

        Due to a problem with ImageMagick 5.2.3 on my machine,
           now uses raw 'mono' format for thumbprints instead 
           of 'pbm'.  This returns identical data minus the 
           header stuff we were discarding anyway.
        Applied patch from Max Stekelenburg:
           Program would sometimes try to scan text files due
              to a misplaced if.
           Stomped on some uninitialized value warnings.
        Applied patch from Paul Cassella:
           Improved performance by only comparing images whose
              number of 1 bits would allow them to fall within
              the threshold.
           Sped up bit counting routine using pack once per
              image rather than a loop per comparison.
           Eliminated non-regular-files from the scan.
           Stomped on some uninitialized value warnings.
        Bugfix: Will again rescan automatically if db file not 
           present or is zero-length.
        Added -g option, "GUI mode", to format output for use
           by GUIs and other programs.

0.1.2 30 September 2000

        Changed algorithm to use an 8x8x8 thumbprint instead of
           16x16x1.  It failed disastrously, so changed back.
        Cleaned up the code a little for public release.
        Minor bug fixes I didn't bother to document.
        First public release.

0.1.1 24 September 2000

        Added "compare 2 files at a time".
        Added GQview collection generation.
        Fixed bug in binary comparison; huge jump in accuracy.
        Generalized (sorta) the thumbprint generation routine.

0.1.0 17 September 2000

        First working version, perhaps 75% accurate.

