What is the best duplicate file finder, that preserves my source of truth?

parkercp@alien.top · 10 months ago

What is the best duplicate file finder, that preserves my source of truth?

lilolalu@alien.top · 10 months ago

How should a duplicate finder know which is the source of the duplicate?

parkercp@alien.top · 10 months ago

I’d like to find something that has that capability, so I can say multimedia/photos/ is the source of truth and anything identical found elsewhere is a duplicate. I hoped this would be an easy thing to do, as the ask is simply to ignore any duplicates in a particular folder hierarchy…

speculatrix@alien.top · 10 months ago

Write a simple script which iterates over the files and generates a hash list, with the hash in the first column.

find . -type f -exec md5sum {} \; >> /tmp/foo

Repeat for the backup files.

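Then make a third file by concatenating the two, sort that file, and run “uniq -w 32 -d” (GNU uniq) so that only the 32-character MD5 hash at the start of each line is compared; a plain “uniq -d” compares whole lines and would miss duplicates whose paths differ. The output shows one line for each hash that occurs more than once.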

You can take the output of uniq and de-duplicate.
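For anyone less comfortable with shell, the hash-list idea above can be sketched in Python with a source-of-truth folder built in. This is a minimal sketch under assumptions, not a tested tool: the function names (`file_hash`, `iter_files`, `is_inside`, `find_duplicates`) are made up for illustration, SHA-256 is swapped in for md5sum, and it only reports duplicates, it never deletes anything.

```python
import hashlib
import os


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_files(root):
    """Yield the path of every regular (non-symlink) file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path


def is_inside(path, root):
    """Return True if path lies inside the directory tree at root."""
    root = os.path.realpath(root)
    path = os.path.realpath(path)
    try:
        return os.path.commonpath([root, path]) == root
    except ValueError:  # e.g. paths on different drives on Windows
        return False


def find_duplicates(source_of_truth, other_roots):
    """Map each duplicate found in other_roots to its copy in source_of_truth.

    Files inside source_of_truth are never reported, even if one of the
    other roots contains that folder.
    """
    known = {}
    for path in iter_files(source_of_truth):
        known.setdefault(file_hash(path), path)

    duplicates = {}
    for root in other_roots:
        for path in iter_files(root):
            if is_inside(path, source_of_truth):
                continue  # protect the source of truth
            original = known.get(file_hash(path))
            if original is not None:
                duplicates[path] = original
    return duplicates
```

Calling `find_duplicates("multimedia/photos", ["backup"])` would return a dictionary whose keys are the files under `backup` that are byte-for-byte identical to something in `multimedia/photos` (the `backup` path is a placeholder). Hashing every file is slow on a large NAS; comparing file sizes first would cut the work, but that is left out to keep the sketch short.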

parkercp@alien.top · 10 months ago

Thanks @speculatrix - I wish I had your confidence in scripting - hence I’m hoping to find something that does all that clever stuff for me… The key thing for me is to say something like multimedia/photos/ is the source of truth and anything found elsewhere is a duplicate…

Digital-Chupacabra@alien.top · 10 months ago

I wish I had your confidence in scripting

You know how you get it? By fucking around and finding out! I’d say give it a go!

Do a dry run of the de-dup to make sure you don’t delete anything you care about.
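One way to build that dry run into a script is a flag that defaults to "report only". The sketch below is an assumption about how such a script might look, with a hypothetical `remove_duplicates` function; it takes a list of paths that have already been identified as duplicates and touches nothing unless `dry_run=False` is passed explicitly.

```python
import os


def remove_duplicates(paths, dry_run=True):
    """Delete each path, or only report it when dry_run is True.

    Returns the list of paths that were (or would be) removed.
    """
    handled = []
    for path in paths:
        if dry_run:
            print(f"would delete: {path}")
        else:
            os.remove(path)
            print(f"deleted: {path}")
        handled.append(path)
    return handled
```

Run it once with the default, read through the "would delete" lines, and only then call it again with `dry_run=False`.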

Mildly_Excited@alien.top · 10 months ago

I’ve used dupeGuru on Windows for cleaning up my photos, worked great for that. Has a GUI and also works on Linux!
https://dupeguru.voltaicideas.net/

parkercp@alien.top · 10 months ago

Thanks - I think I tried that - but at the time it had no concept of a source (location) of truth to preserve / find duplicates against - has that changed? They don’t seem to reference that specific capability on that link.

CrappyTan69@alien.top · 10 months ago

Only runs on Windows, but I’ve been using Double Killer for years. Simple and does the trick.

parkercp@alien.top · 10 months ago

Thanks @CrappyTan69 - I ideally need this to run on my NAS, and if possible be opensource/free - looks like for what I’d need Double Killer for, it’s £15/$20 - maybe an option as a last resort…

Lorric71@alien.top · 10 months ago

Can’t you edit the OP and add the requirements? You haven’t even told us what NAS you have.

xewgramodius@alien.top · 10 months ago

I don’t think there is a good way to tell which of two duplicate files was “first” other than checking the creation date, but if this is Linux that attribute may not be enabled in your fs type.

The closest thing I’ve seen is a Python dedup script, but after it identifies all the dups it deletes all but one of them and then puts hard links to that remaining file where all the deleted dups were.
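The hard-link replacement described here can be sketched roughly as follows. This is not the script being referred to, just an illustration of the technique with a hypothetical `replace_with_hardlink` function. It assumes both paths are on the same filesystem (hard links cannot cross filesystems) and that the caller has already verified the two files have identical contents; the function itself does not compare them.

```python
import os


def replace_with_hardlink(keep, duplicate):
    """Replace duplicate with a hard link to keep.

    The link is first created under a temporary name, so the duplicate
    path is never missing if linking fails partway through.
    Returns False if the two paths already refer to the same file.
    """
    if os.path.samefile(keep, duplicate):
        return False  # already the same file, e.g. already hard-linked
    temp_name = duplicate + ".dedup-tmp"  # hypothetical temporary suffix
    os.link(keep, temp_name)          # raises OSError across filesystems
    os.replace(temp_name, duplicate)  # atomically swap the link into place
    return True
```

After this runs, both names point at the same data on disk, so the space is reclaimed but editing the file through one path changes it under the other as well.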

root_switch@alien.top · 10 months ago

Only YOU can tell which is the source of truth, but czkawka can easily do what you need. What issues did you have with it?

Sergiow13@alien.top · 10 months ago

czkawka can easily do this OP!

In this screenshot, for example, I added 3 folders and marked the first folder as the reference folder (the checkmark behind it). It will now look for files from this folder in the other folders and delete all identical files found in the non-reference folders (it will of course first list all of them and ask you to confirm before deleting).