Topic: The Deduplication Project

Offline @tractivo

  • Sr. Member
  • ****
  • Posts: 453
The Deduplication Project
« on: June 03, 2017 - 20:19:44 »
The Deduplication Project

Source (Order)

No-Intro
Redump   
Console\Nintendo - Super Nintendo Entertainment System\SMW Hacks [Zandro]
GoodSets
Handheld\Handheld Simulators
*\Sony - PlayStation *\* [bsbt]
Computer\Microsoft - DOS\Total DOS Collection   
Computer\Microsoft - DOS\eXoDOS   
MAME, MESS
Neo Kobe
Maybe-Intro + (20170517 SNES DAT)
Console\Nintendo - Famicom Disk System\ROMs [Yori Yoshizuki]
TOSEC   
MESS SL                       
trurip
Arcade\Various Arcade [f205v]
Arcade\EMMA Italian Dumping Team
Patch\*
Computer\SPS
Computer\IBM - PC Compatibles\Floppy Disks [pablogm123, Kludge]
Computer\*\Numbered ROMs [Connie]
Computer\Commodore - 64\Non-TOSEC [Duncan Twain]
Computer\Commodore - Amiga\Non-TOSEC [Crashdisk]
Computer\*\Non-TOSEC [Yori Yoshizuki]
Computer\*\C64 Project|Perfect! C64|NonGood [Apache]
*\*\Non-TOSEC [Eggman]   
Computer\Commodore - 64\C64_#-Z [Corinthian]
Computer\Commodore [Yori Yoshizuki]
*\*\RHTP [Dominater01]
WoD Custom
Arcade\*
Artwork\*
Music\*
Emulator\*
Site\*
GameBase
*\*\[Misc]
NonGood                       
UnRenamed
Various\ALTERNATE SYSTEM ROMS [Yori Yoshizuki]           
WoD
Computer\Personal Computer (PC)   
Scene
Various\*

How does it work?


-grab the deduped dats pack and all update packs (linked above)
-build them with the latest attached DAT
-use the deduped dats collection found inside "DATs Collection [Deduped]\New\*" as the datroot of a clean romvault setup
-enjoy a deduped collection ;)


Additional Info:

-the sources are deduped in the order of the list above, so a file that appears in several sources is only listed in the highest-ranked one.
basically: no-intro > redump > smw hacks > etc...
-all dats are kept separated (not merged). just think of a huge collection of "diff-dats".
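
The ordering rule above can be sketched in a few lines of Python. This is only an illustration, not the tooling actually used for the project: the function name dedupe_in_order, the short hash strings, and the ROM names are all made up for the example, and a real DAT would be parsed from XML rather than written as a dict.

```python
def dedupe_in_order(sources):
    """Take (source_name, {hash: rom_name}) pairs in priority order.

    Returns the same sources, each reduced to the hashes that no
    higher-priority source has already listed (a "diff-dat").
    """
    seen = set()
    deduped = []
    for name, entries in sources:
        # Keep only hashes not claimed by an earlier source.
        unique = {h: rom for h, rom in entries.items() if h not in seen}
        seen.update(entries)
        deduped.append((name, unique))
    return deduped


# Hypothetical sample data: "aaa" is listed by both no-intro and goodsets.
sources = [
    ("no-intro", {"aaa": "Game A (USA).sfc", "bbb": "Game B (Europe).sfc"}),
    ("redump", {"ccc": "Game C (Japan).bin"}),
    ("goodsets", {"aaa": "Game A (U) [!].smc", "ddd": "Game D (U) [h1].smc"}),
]
result = dedupe_in_order(sources)
for name, entries in result:
    print(name, sorted(entries))
```

Each source stays its own dat (nothing is merged); the lower-ranked dat simply loses the entries that a higher-ranked dat already covers.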



« Last Edit: June 20, 2017 - 20:55:23 by @tractivo »

@tractivo
Re: The Deduplication Project
« Reply #1 on: June 07, 2017 - 10:12:17 »
>>>UPDATE<<<

-check the first post for new additions (yellow marked lines)

-download the update pack from the link above ("Deduped DATs Update-Pack") and put it in your "UDC"-ToSort directory
-replace the "Ultimate DATs Collection [Deduped]" DAT with the updated DAT from the attachment below
-build/fix the files with the newly replaced dat

-the rebuilt content of the folder "New" is your updated "deduped" DATRoot
-all obsolete dats are moved to "_Old", which you can delete if you wish


@tractivo
Re: The Deduplication Project
« Reply #2 on: June 08, 2017 - 20:13:37 »
>>>UPDATE<<<
for the DAT, get the newest UDC release!
New Sources: (yellow marked Lines in first Post)
'bsbt's PlayStation Firmware and PSN
'Yori Yoshizuki's Famicom Disk System ROMs
'Connie's and 'MADrigal's Handheld Simulators


Updated Sources: (yellow marked Dates/Version in first Post)
No-Intro, Redump, SMW Hacks

Changes:
improved layout for Maybe-Intro and TDC

Additional Info:
very nice collection ;)

Offline Nukhem

  • Newbie
  • *
  • Posts: 33
Re: The Deduplication Project
« Reply #3 on: June 09, 2017 - 22:37:58 »
Amazing work! This cleaned up my 1,4TB ToSort folder of unknown files a lot :)

@tractivo
Re: The Deduplication Project
« Reply #4 on: June 10, 2017 - 01:15:59 »
glad to hear that  :)
a lot of projects discard files with dat updates, especially no-intro. the deduped dats project tries to keep track of everything that was ever datted, without the need for extra space.
there is still a lot more to add; i am working my way through the DATs 2 folder of the UDC collection, which will hopefully shrink all those tosort folders a lot more.


@tractivo
Re: The Deduplication Project
« Reply #5 on: June 15, 2017 - 23:59:38 »
New DAT attached!
New Sources: (Yellow Sources)
'GoodSets' (a good amount of them, hopefully more to follow)
'Retroplay's snes t-en (integrated into maybe-intro)
'eXoDOS'
'Eggman's Non-TOSEC
'Dominater01's RHTP

Updated Sources: (Yellow Date/Version)
NonGood, UnRenamed

Changes:
split redump's old dats into "_Old (Files)" and "_Old (Sheets)"

Additional Info:
a new "dat-source" called "zzz_obs" has been added, in which obsolete files/hashes are listed. basically duplicates that are not really needed if you have complete main sources.
for example, such a duplicate can be a nintendo 64 rom which is datted as big endian in no-intro but as little endian or byteswapped in the wod dat. it's still a duplicate, only in a different format.
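
To illustrate why those n64 files count as duplicates, here is a rough Python sketch that converts the three common byte orders to big endian before hashing. It is not how zzz_obs is actually generated. It assumes the ROM starts with the usual header bytes 80 37 12 40 (in one of the three orders) and that the file length is a multiple of 4; the helper names are made up for the example.

```python
import hashlib


def from_byteswapped(data: bytes) -> bytes:
    """Swap every pair of bytes (byteswapped -> big endian)."""
    out = bytearray(len(data))
    out[0::2] = data[1::2]
    out[1::2] = data[0::2]
    return bytes(out)


def from_little_endian(data: bytes) -> bytes:
    """Reverse every 4-byte word (little endian -> big endian)."""
    out = bytearray(len(data))
    for i in range(4):
        out[i::4] = data[3 - i::4]
    return bytes(out)


def to_big_endian(data: bytes) -> bytes:
    """Detect the byte order from the first 4 bytes and normalize it."""
    magic = data[:4]
    if magic == b"\x80\x37\x12\x40":
        return data
    if magic == b"\x37\x80\x40\x12":
        return from_byteswapped(data)
    if magic == b"\x40\x12\x37\x80":
        return from_little_endian(data)
    raise ValueError("unrecognized N64 header bytes")


def canonical_sha1(data: bytes) -> str:
    """SHA-1 of the ROM after converting it to big endian."""
    return hashlib.sha1(to_big_endian(data)).hexdigest()


# Tiny fake "ROM": the header magic followed by 8 payload bytes.
big = bytes.fromhex("80371240") + bytes(range(8))
byteswapped = from_byteswapped(big)   # pair swap is its own inverse
little = from_little_endian(big)      # word reversal is its own inverse

print(canonical_sha1(big) == canonical_sha1(byteswapped) == canonical_sha1(little))
```

The plain hashes of the three files differ, which is why they show up as separate entries in different dats, but after normalizing they are the same dump.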

one more note: putting 'GoodSets' so high in the list will make your files move around a little.

peace

Offline dizzzy

  • Newbie
  • *
  • Posts: 12
Re: The Deduplication Project
« Reply #6 on: June 16, 2017 - 01:09:19 »
20161226        *\Sony - PlayStation *\* [bsbt]

What's this? I've been working on a companion set to PSX (basically 900 unique games not yet in redump.org). Would love more info on the source of this torrent(?)

Edit: just noticed Rom Shepherd has a tracker, I guess it's on there. Needa get on.
« Last Edit: June 16, 2017 - 01:12:44 by user7 »

@tractivo
Re: The Deduplication Project
« Reply #7 on: June 16, 2017 - 01:52:59 »
those are PSN and Firmware collections for playstation systems datted by 'bsbt' which can be found on the nointro forum site.

Offline Connie

  • Hero Member
  • *****
  • Posts: 1863
Re: The Deduplication Project
« Reply #8 on: June 16, 2017 - 01:54:30 »
Quote from dizzzy:
What's this? I've been working on a companion set to PSX (basically 900 unique games not yet in redump.org). Would love more info on the source of this torrent(?)
Redump doesn't = Scene
Your 900 games are probably a mix of bad and not-redumped.
...or something that isn't disc based and is therefore in another dat.

EDIT:
Ninja'd by @tractivo
"Get busy living or get busy dying" - Shawshank Redemption (Stephen King)


dizzzy
Re: The Deduplication Project
« Reply #9 on: June 16, 2017 - 04:36:28 »
Quote from @tractivo:
those are PSN and Firmware collections for playstation systems datted by 'bsbt' which can be found on the nointro forum site.

Gotcha, thanks for clearing that up.


Quote from Connie:
Redump doesn't = Scene
Your 900 games are probably a mix of bad and not-redumped.
...or something that isn't disc based and is therefore in another dat.

I know redump isn't scene, I dump there. My rom collection is PSX discs floating around on the internet that have not yet been redumped, mostly rare Japanese titles from Russian sites. But I was out of the loop about the bsbt no-intro set and thought that might be something similar to what I compiled (I understand now that it's not).
« Last Edit: June 16, 2017 - 04:39:51 by user7 »

@tractivo
Re: The Deduplication Project
« Reply #10 on: June 17, 2017 - 20:42:01 »

New DAT attached!
New Sources: (Yellow Sources)
a whole bunch of older (and newer) Non-TOSEC dats from [Apache, Crashdisk, Duncan Twain, Yori Yoshizuki]
Corinthian's c64 collection
Connie's 'numbered roms' for a few computer systems
Yori Yoshizuki's snes MSU-1

Updated Sources: (Yellow Date/Version)
No-Intro, Redump, GoodSets (28/34 sets completed) -> big thanks to Zandro for his help!

Changes:
-
   
Additional Info:
finished 95% of all "main" sources and computer-related dats!


Stats:
dats-size before deduping: 98,4 GB
dats-size after deduping:   0,7 GB

dats-count before deduping: 47926
dats-count after deduping:   5314
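
For a sense of scale, the stats above can be turned into percentages with a quick calculation. The helper name reduction_percent is made up for the example; the four numbers are the ones listed above.

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which a value shrank going from before to after."""
    return (1 - after / before) * 100


size_reduction = reduction_percent(98.4, 0.7)      # dats-size in GB
count_reduction = reduction_percent(47926, 5314)   # dats-count

print(f"dats-size:  {size_reduction:.1f}% smaller")
print(f"dats-count: {count_reduction:.1f}% fewer dats")
```

That works out to roughly 99.3% less dat data and roughly 88.9% fewer dat files.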

Offline rzil

  • Newbie
  • *
  • Posts: 21
Re: The Deduplication Project
« Reply #11 on: June 17, 2017 - 20:44:25 »
Wow, great project!! Thanks.

Can I ask what the "_Old" folder means? (duplicates, deprecated, something else?)

Also, I suggest moving GoodSets and TOSEC to be the very last in source order...
both are extremely large and shadowing the names of other collections,
which often have much more meaningful names.
« Last Edit: June 17, 2017 - 20:47:42 by rzil »

@tractivo
Re: The Deduplication Project
« Reply #12 on: June 18, 2017 - 01:11:15 »
_Old means leftovers, discarded stuff. for example, in no-intro's _Old folder you can find all unique hashes of older dats, i.e. stuff that's no longer listed in the newest dats, which can be found in the 'New' folder.
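
In other words, the no-intro _Old folder is basically a set difference between the older dat revisions and the newest one. A small Python sketch of that idea, with made-up hashes and names (split_old_new is a hypothetical helper, not part of any dat tool):

```python
def split_old_new(older_dats, newest_dat):
    """Split entries into (new, old).

    older_dats: list of {hash: name} dicts from earlier dat revisions.
    newest_dat: {hash: name} dict from the current dat.
    "old" holds every hash that appears in an earlier revision but is
    no longer listed in the newest dat.
    """
    old = {}
    for dat in older_dats:
        for h, name in dat.items():
            if h not in newest_dat:
                old.setdefault(h, name)  # keep the first name seen
    return dict(newest_dat), old


# Hypothetical revisions of one dat.
older = [
    {"a1": "Game (v1.0)", "b2": "Game (Beta)"},
    {"a1": "Game (v1.0)", "c3": "Game (v1.1)"},
]
newest = {"c3": "Game (v1.1)", "d4": "Game (v1.2)"}

new, old = split_old_new(older, newest)
print("New:", sorted(new))
print("_Old:", sorted(old))
```

Here "a1" and "b2" end up in _Old because the newest dat dropped them, while "c3" stays in New even though an older revision also listed it.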

the deduped dats collection also has a folder called _Old, in which you can find dats that have changed or have been fully deduped by newer updates.

for instance, after adding goodsets, a lot of dats were totally deduped. for example, a lot of wod dats are now deduped by their original sources.

to make use of the deduped dats collection, you only need a complete
'DATs Collection [Deduped]\New' folder.
the dat uses forcepacking="unzip", so you can directly copy the content of the folder 'New' into a clean romvault datroot and you are good to go.

for your suggestion about the source order, can you give me an example of how you would sort the sources?

goodsets deserve to be up that high. the tools, which i still recommend using alongside this project, do an amazing job of verifying our beloved roms. bigger databases are a plus in my opinion.

my order is the result of a few mixed criteria. the first is accuracy (no-intro, redump); the second is how trustworthy and popular the source is. specialized dats can still be up high, like smw hacks, whose naming is superior to that of the goodsnes tools.

another criterion is whether the source still sees updates.

wod is placed last because it was a similar project that tried to list all unique hashes in one place. this project of mine now unmerges the wod dats to bring the files back to their original sources, with the original names.

one last criterion is my own personal preference. i really like all of these projects that have put years into developing and cataloging the information we can use to collect and sort our beloved roms. some people may like trurip over tosec, or tosec over nointro, or the other way around.

it will never satisfy everyone at once. but i will still listen to more specific suggestions. with more details on why i should list something higher or lower i can do my best to improve the project.




rzil
Re: The Deduplication Project
« Reply #13 on: June 18, 2017 - 05:43:16 »
Thanks for the explanation.

For instance, Super Mario 64 hacks, which are listed by name and creator in Yori's NonGoods,
are all named "Super Mario 64 (1996)(Nintendo)(US)[hXXX]" in TOSEC.

I would sort by specificity/specialty (accuracy is also very specific [only the most accurate ROMs]), meaning DATs with a very specific goal will be high (first) in the order (not much will change).
For example, Maybe-Intro, which is specific to [SNES] ROM translations, will be before GoodSets, which tries to list every possible version,
and Zandro's SMW Hacks, which is specific to SMW hacks, even before Maybe-Intro.
Another criterion is meaningful naming (although I think naming and specialty often come together 8))

of course it is just a suggestion and you can do what you think is best.

Offline cannonwillow

  • Newbie
  • *
  • Posts: 22
Re: The Deduplication Project
« Reply #14 on: June 18, 2017 - 07:28:07 »
thanks for all the great work @Tractivo. Do you plan on Deduping the Dats2\Artwork dats? there's a lot of duplicate roms, especially in the \progetto-SNAPS folder.

cannonwillow