Author Topic: The Deduplication Project  (Read 1077 times)


Offline @tractivo

  • Sr. Member
  • ****
  • Posts: 453
Re: The Deduplication Project
« Reply #15 on: June 18, 2017 - 09:19:15 »
yo rzil, i get what you're saying. but you know that nongood does not have a naming standard. yori yoshizuki sorts undatted/unknown stuff by system without naming the files.
maybe the mentioned super mario hacks are better named in nongood than in tosec, but the other 90% of the listed files in tosec are still better named than in nongood, at least for verification purposes.

to have the best name for each and every rom, i would need to compare each hash one by one through all sources and then choose the very best one. but i can't do that for over 4 million unique hashes :) so i have to go by source. and in general tosec has more valuable naming than nongood.
nongood and unrenamed are also not meant for verification in general. they help the other sources (tosec and goodsets) see what they can still add to their databases. after the files are added by those sources, nongood and unrenamed discard the entries from their listings.
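
to illustrate the "go by source" idea, here is a minimal Python sketch. it is not the tool used for this project. the function name, the tuple layout and the full priority order are assumptions; only "tosec before nongood" comes from the post above.

```python
# Hypothetical priority list: earlier entries win when the same hash appears
# in several sources. Only "tosec before nongood" is stated in the thread;
# the position of the other sources is an assumption for illustration.
SOURCE_PRIORITY = ["tosec", "goodsets", "nongood", "unrenamed"]


def pick_names(entries):
    """Choose one name per hash, based on the source it came from.

    entries: iterable of (sha1, source, name) tuples.
    Returns a dict mapping sha1 -> (source, name).
    """
    rank = {source: i for i, source in enumerate(SOURCE_PRIORITY)}
    best = {}
    for sha1, source, name in entries:
        # Sources missing from the list rank after all known ones.
        r = rank.get(source, len(SOURCE_PRIORITY))
        if sha1 not in best or r < best[sha1][0]:
            best[sha1] = (r, source, name)
    return {sha1: (source, name) for sha1, (r, source, name) in best.items()}
```

this makes one decision per source instead of one per hash, which is the trade-off described above: a few entries may end up with a worse name, but nobody has to review millions of hashes by hand.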

hi cannonwillow, yes, progetto-snaps, tosec pix etc. will hopefully find their way in over time :)



Offline @tractivo

  • Sr. Member
  • ****
  • Posts: 453
Re: The Deduplication Project
« Reply #16 on: June 19, 2017 - 19:09:48 »

DAT attached!

Additional Info:
each and every DAT that is listed in UDC (or that i have on my drive) is now included! (artwork, music, emulator, scene... everything)
this is the final update.
with the deduplication of all dats finished, you now have 99% of all unique hashes ever datted in one collection.
this collection can help you identify some of the more exotic files in your tosort folder that you thought would never find a match.
once you find out which dat those files belong to, i recommend using the original dat. or do whatever you like :)
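
a rough Python sketch of that lookup, in case you want to script it instead of using a rom manager. it assumes the dats are Logiqx-style XML with `<rom ... sha1="...">` entries and simply skips anything that does not parse as XML. the function names and folder paths are made up for the example.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_dats(dat_dir):
    """Map sha1 -> (dat file name, rom name) for every XML dat below dat_dir."""
    index = {}
    for dat_path in Path(dat_dir).rglob("*.dat"):
        try:
            root = ET.parse(dat_path).getroot()
        except ET.ParseError:
            continue  # assumption: non-XML dats are simply ignored here
        for rom in root.iter("rom"):
            sha1 = rom.get("sha1")
            if sha1:
                # Keep the first dat that lists a given hash.
                index.setdefault(sha1.lower(), (dat_path.name, rom.get("name")))
    return index


def identify(tosort_dir, index):
    """Yield (file path, match) pairs; match is None for unknown files."""
    for path in sorted(Path(tosort_dir).rglob("*")):
        if path.is_file():
            yield path, index.get(sha1_of(path))
```

usage would look something like `for path, hit in identify("tosort", index_dats("DATs Collection [Deduped]/New")): print(path, hit)`. every hit names the dat the file belongs to, so you can then fetch the original dat for a proper rebuild.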

"Ultimate DATs Collection" and "The Deduplication Project" have come to an end now. i hope you have enjoyed both projects.
i will be back with UEC updates, and maybe other projects. see y'all around.


Final Stats:

dats-size before deduping:     112,5 GB
dats-size after deduping:        1,6 GB

dats-count before deduping:    69621
dats-count after deduping:      9904

hashes-count before deduping:  uncountable!
hashes-count after deduping:   11.117.841 unique hashes!
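
for anyone who wants the reduction as percentages, a tiny Python sketch using only the numbers above (the helper name is made up): roughly 98.6% less data and 85.8% fewer dat files.

```python
def reduction_percent(before, after):
    """Percentage by which a quantity shrank from `before` to `after`."""
    return (1 - after / before) * 100


size_reduction = reduction_percent(112.5, 1.6)    # collection size in GB
count_reduction = reduction_percent(69621, 9904)  # number of dat files

print(f"size:  {size_reduction:.1f}% smaller")
print(f"count: {count_reduction:.1f}% fewer dats")
```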

Offline cannonwillow

  • Newbie
  • *
  • Posts: 22
Re: The Deduplication Project
« Reply #17 on: June 23, 2017 - 18:29:36 »
Quick question

are the _Old folders inside the New folder tree also not needed for a complete deduped collection?

example:   New/No-Intro/_Old/*

thanks in advance and for the amazing work.

cannonwillow


Offline @tractivo

  • Sr. Member
  • ****
  • Posts: 453
Re: The Deduplication Project
« Reply #18 on: June 23, 2017 - 19:03:21 »
what you don't need is:

DATs Collection [Deduped]\_Old

what you need for a complete collection is everything in:

DATs Collection [Deduped]\New

so yes, that includes every _Old folder inside New :)

or in other words, the 9904 dats inside New are all needed for a complete, deduped, hash-unique collection.
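
why every remaining dat matters can be shown with a small Python sketch of hash-level deduplication. this is only an illustration, not the actual process behind the collection: the processing order, the data layout and the function name are all assumptions.

```python
def dedupe(dats):
    """Keep each hash only in the first dat that lists it.

    dats: dict mapping dat name -> list of (sha1, rom name), in processing order.
    Returns a new dict; dats left without any unique entry are dropped.
    """
    seen = set()
    kept = {}
    for dat_name, entries in dats.items():
        unique = []
        for sha1, rom_name in entries:
            key = sha1.lower()  # compare hashes case-insensitively
            if key not in seen:
                seen.add(key)
                unique.append((key, rom_name))
        if unique:
            kept[dat_name] = unique
    return kept
```

after a pass like this, every dat that survives holds at least one hash that appears nowhere else in the result, so removing any single dat would lose hashes. that is the sense in which all dats inside New, including the ones in nested _Old folders, are needed.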

Offline cannonwillow

  • Newbie
  • *
  • Posts: 22
Re: The Deduplication Project
« Reply #19 on: June 23, 2017 - 19:27:48 »
thanks for clearing that up.

 8)