Author Topic: The Deduplication Project  (Read 1655 times)

Offline attractivo

  • Hero Member
  • *****
  • Posts: 533
Re: The Deduplication Project
« Reply #15 on: June 18, 2017 - 09:19:15 »
yo rzil, i get what you're saying. but you know that nongood does not have a naming standard. yori yoshizuki sorts undatted/unknown stuff by system without naming the files.
maybe the mentioned super mario hacks are better named in nongood than in tosec, but the other 90% of the listed files in tosec are still better named than in nongood. at least for verification purposes.

to be able to have the best name for each and every rom i would need to compare each hash one by one through all sources and then choose the very best. but i can't do that for over 4 million unique hashes :) so i have to go by sources. and in general tosec has more valuable naming than nongood.
nongood and unrenamed are also not meant for verification in general. they help the other sources (tosec and goodsets) to see what they can still add to their databases. after the files are added by the mentioned sources, nongood and unrenamed discard the entries from their listings.

hi cannonwillow, yes, progetto-snaps, tosec pix etc. will hopefully find their way in over time :)



Offline attractivo

  • Hero Member
  • *****
  • Posts: 533
Re: The Deduplication Project
« Reply #16 on: June 19, 2017 - 19:09:48 »
DAT attached!

Additional Info:
each and every DAT that is listed in UDC (or that i have on my drive) is now included! (artwork, music, emulator, scene.. everything)
this is the final update.
with the deduplication of all dats finished, you now have 99% of all unique hashes ever datted in one collection.
this collection can help to identify some of the more exotic files in your tosort folder which you thought would never find a match.
after you find out which dat those files belong to, i recommend using the original dat. or do whatever you like :)

"Ultimate DATs Collection" and "The Deduplication Project" have come to an end now. i hope you have enjoyed both projects.
i will be back with UEC updates, and maybe other projects. see y'all around.


Final Stats:

dats-size before deduping:      112,5 GB
dats-size after deduping:             1,6 GB

dats-count before deduping:     69621
dats-count after deduping:          9904

hashes-count before deduping: uncountable!
hashes-count after deduping:  11.117.841 unique hashes!
« Last Edit: July 14, 2017 - 18:05:08 by @tractivo »

Offline cannonwillow

  • Newbie
  • *
  • Posts: 42
Re: The Deduplication Project
« Reply #17 on: June 23, 2017 - 18:29:36 »
Quick question

are _old folders within the new folders tree also not needed for a complete deduped collection?

example:   New/No-Intro/_Old/*

thanks in advance and for the amazing work.

cannonwillow


Offline attractivo

  • Hero Member
  • *****
  • Posts: 533
Re: The Deduplication Project
« Reply #18 on: June 23, 2017 - 19:03:21 »
what you dont need is:

DATs Collection [Deduped]\_Old

what you need for a complete collection is everything in:

DATs Collection [Deduped]\New

so even every _Old folder inside New :)

or in other words the 9904 dats inside New are all needed for a complete deduped hash-unique collection.

Offline cannonwillow

  • Newbie
  • *
  • Posts: 42
Re: The Deduplication Project
« Reply #19 on: June 23, 2017 - 19:27:48 »
thanks for clearing that up.

 8)

Offline cannonwillow

  • Newbie
  • *
  • Posts: 42
Re: The Deduplication Project
« Reply #20 on: July 24, 2017 - 08:58:38 »
@tractivo, I have found some undeduped roms within the deduped collection (using the 9904 dat collection), such as gdl-0024.chd and gds-0031.chd, as can be seen in the attached picture. one is from \Emulator and the other from \Mame, Mess. I have seen a few others but can't recall from where. Maybe this is only .chd related.

and how could this amazing project end up on page 2, it really should be stickied...

cannonwillow
« Last Edit: July 24, 2017 - 09:09:34 by cannonwillow »

Offline attractivo

  • Hero Member
  • *****
  • Posts: 533
Re: The Deduplication Project
« Reply #21 on: July 24, 2017 - 09:27:53 »
no need for a sticky as this will not see further updates :) this was a one-time-only collection that should help verify all files that normally were unverifiable, because no rommanager is able to load 100gb of dats at once :) but that aside, thanks for your feedback.

about the undeduped dupe files that you have found: that's really only an issue with chd's, or more precisely with dir2dat chd dats. those list the chd with its external file hashes (the hashes of the .chd file itself), while the mame dat lists a different sha1 for the disk entry. like the one in your example:

DEmul - CHDs (v0.7 alpha_dir2dat_XML):
<rom name="gdl-0024.chd" size="135053965" crc="3e0f17e7" sha1="47f49359f07d6f4b3a7c05dae12271feb979ff2d"/>

MAME - CHDs (v0.187_XML):
<disk name="gdl-0024" sha1="4898b21fb1f44f34fcf1730f64cb0491e9195327"/>


see the different hashes? thats why they couldnt be deduped. my recommendation is to delete all "dir2dat" "CHD" dats.
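
a small sketch of why the two entries above never collapse into one. it assumes the dedupe simply compares the sha1 attribute of each entry (that is my simplification, not SabreTools code), and the function names are made up for the illustration:

```python
import xml.etree.ElementTree as ET

# the two entries quoted above, copied as-is
DEMUL_ENTRY = ('<rom name="gdl-0024.chd" size="135053965" crc="3e0f17e7" '
               'sha1="47f49359f07d6f4b3a7c05dae12271feb979ff2d"/>')
MAME_ENTRY = '<disk name="gdl-0024" sha1="4898b21fb1f44f34fcf1730f64cb0491e9195327"/>'


def dedupe_key(entry_xml: str) -> str:
    """Return the sha1 attribute, used here as the (assumed) dedupe key."""
    return ET.fromstring(entry_xml).attrib["sha1"].lower()


def is_duplicate(a: str, b: str) -> bool:
    """Two entries only count as duplicates when their keys are identical."""
    return dedupe_key(a) == dedupe_key(b)


print(is_duplicate(DEMUL_ENTRY, MAME_ENTRY))  # False
```

same disk, different key, so both entries survive the dedupe.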

Offline cannonwillow

  • Newbie
  • *
  • Posts: 42
Re: The Deduplication Project
« Reply #22 on: July 24, 2017 - 10:12:44 »
I thought that by using the deduped dats (9904) I could minimize the size of my rom collection, especially the random stuff that may be useful later, instead of deleting it when I give up on it or am not sure what it is. I hate having to figure out how to identify it and then start the search all over again. A dat can be pulled from the normal UDC collection or anywhere else and shoved into the datroot with the deduped collection. Then uncheck all areas in RomVault that probably do not apply (the sort order is useful here) and hunt down the rest. TBs are really getting cheap these days, but nobody wants copies of the same thing except for backup (and obviously not on the same drive), and actually organized.



thanks for the undeduped .chd explanation

Offline attractivo

  • Hero Member
  • *****
  • Posts: 533
Re: The Deduplication Project
« Reply #23 on: July 24, 2017 - 10:24:41 »
all true. there you go ;)
i am using a deduped datroot myself which i am constantly improving. saved me a lot of space so far  ;)

Offline cannonwillow

  • Newbie
  • *
  • Posts: 42
Re: The Deduplication Project
« Reply #24 on: July 24, 2017 - 10:34:45 »
had to go to microcenters store yesterday to buy more memory. when romvault (using deduped) goes into virtual memory during scan/fix might as well turn the monitors off, drink one last beer, and check it when you wake up the next day and then again the next day. extra memory was designed for romvault :D

Offline NLS

  • Full Member
  • ***
  • Posts: 169
Re: The Deduplication Project
« Reply #25 on: July 24, 2017 - 11:53:31 »
Very interesting project and amazing work by @tractivo.

I would prefer a generic solution (nonexistent so far) though, one that automatically uses dats in the order a user decides (because now we all have to agree with @tractivo's order), finds duplicates and fixes things.

Still, it is interesting as it might "clean up" the leftovers we all have.
---
NLS

Offline attractivo

  • Hero Member
  • *****
  • Posts: 533
Re: The Deduplication Project
« Reply #26 on: July 24, 2017 - 12:10:40 »
@cannonwillow
i am excluding dats that list archives (mostly scene), music collections, artwork stuff and some other bigger collections (mostly iso stuff). that leaves me with 5 of the 11 million unique hashes and around 3gb of ram used with RV loaded. performance-wise i don't have any issues using RV so far. but i remember that the complete deduped dats project would use around 7gb of ram, and navigating through the sets would also slow down a lot. opening a set would take up to 3 or 4 seconds.

i remember GordonJ telling me that he had around 4 million unique hashes loaded in his RVX build. but that would generate a cache/database file of around 100gb if not more :D

@NLS
so far there is no best solution for the overall deduped collection. but i can recommend learning how to use darksabre76's SabreTools. that's the tool i am using to generate these deduped sets.
i can help a little by tutoring you guys on how to dedupe your dats. then you should be able to generate the deduped dats in your own preferred order :)  i may write a little tutorial if you like.

Offline attractivo

  • Hero Member
  • *****
  • Posts: 533
Re: The Deduplication Project
« Reply #27 on: July 24, 2017 - 14:20:29 »
Deduping Routine as of 2017-07-24

Introduction: i will now tell you [link not available]!!!



CHAPTER 01: PreSorting the DATs! (Workdir Layout) - check [link not available]

01.00-    parent source workdirectories with additional code that presorts sources for reversed deduplication method
   
   01.01-    source names are shortened
           
            example:
               "NoIn [!]" means "No-Intro [Newest Release Dats]"
               "NoIn [ ]" means "No-Intro [Older Release Dats]"
               "TSC [! (ISO)]" means "TOSEC [Newest Release ISO DATs]"
   
   01.02-    sources are presorted with the help of the additional code
           
            example:
               "[84950] TSC [!]"
               "[89800] Rdmp [!]"
               "[89900] NoIn [!]"
            keep in mind that the sources are deduped in reversed order
            based on this example this means the order will be: NoIn [!] -> Rdmp [!] -> TSC [!]
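
a quick sketch of how the sorting code drives the order. it only assumes that "reversed" means a plain reversed alphabetical sort of the folder names, which is how the example above reads. the function name is made up:

```python
import re

# parent source folders from the example above
sources = ["[84950] TSC [!]", "[89800] Rdmp [!]", "[89900] NoIn [!]"]

SORT_CODE = re.compile(r"^\[\d+\]\s*")  # the leading "[xxxxx] " sorting code


def dedupe_order(names):
    """Sort the folder names in reversed alphabetical order, then drop the code."""
    return [SORT_CODE.sub("", name) for name in sorted(names, reverse=True)]


print(" -> ".join(dedupe_order(sources)))  # NoIn [!] -> Rdmp [!] -> TSC [!]
```

so a higher code means the source is processed earlier and wins when two sources list the same hash.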
   

02.00-   system folders inside each parentsource with additional information tags (see [link not available])
   
   02.01-   systems have additional information tags for [Year of first Release][Platform]
           
            example:
               "Nintendo - Super Nintendo Entertainment System - [1990][V]"
            the snes was first released in 1990 and it is a [V]ideo console
           
            platform codes:
               A = Arcade
               C = Computer
               H = Handheld
               M = Miscellaneous
               O = Operating System
               P = Pocket Computer (including Calculators, PDA's)
               V = Video 'Gaming' Console
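
if you want to read those tags back in a script, something like this could work. the regex and the function name are my own, and the pattern assumes every system folder ends in exactly "[year][code]" like the snes example:

```python
import re

# platform codes from the list above
PLATFORMS = {
    "A": "Arcade",
    "C": "Computer",
    "H": "Handheld",
    "M": "Miscellaneous",
    "O": "Operating System",
    "P": "Pocket Computer",
    "V": "Video 'Gaming' Console",
}

# "<system name> - [year of first release][platform code]"
SYSTEM_TAGS = re.compile(r"^(?P<name>.+?) - \[(?P<year>\d{4})\]\[(?P<code>[A-Z])\]$")


def parse_system_folder(folder: str):
    """Split a system folder name into (system name, year, platform)."""
    match = SYSTEM_TAGS.match(folder)
    if match is None:
        raise ValueError(f"no [year][platform] tags found in: {folder}")
    return match["name"], int(match["year"]), PLATFORMS[match["code"]]


print(parse_system_folder("Nintendo - Super Nintendo Entertainment System - [1990][V]"))
```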
         
03.00-    final sourcedir for each system.
      if the parent sourcedir is "[89900] NoIn [!]", then the final sourcedir inside the system is "NoIn [!]"
      basically the same as the parent sourcedir without the sorting "[xxxxx]" code
     
         full example path for No-Intro's SNES:
            "*\[89900] NoIn [!]\Nintendo - Super Nintendo Entertainment System - [1990][V]\NoIn [!]"
   
04.00-   the actual "dat-containing folder" with [link not available]:

         example for no-intro snes:
           
            20150416-225235- Sufami Turbo
            20170228-062640- Satellaview
            20170721-094446- Super Nintendo Entertainment System
           
         as you can see the dates are at the beginning of the folder name. similar to the sorting code that was used on the parent sourcedirs.
         this will help deduping the dats in reversed order. which automatically dedupes the dats from newest to oldest release
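
a small sketch of the newest-to-oldest idea, using the three snes folders above. it assumes the first 15 characters are always a "YYYYMMDD-HHMMSS" stamp. the function name is made up:

```python
from datetime import datetime

folders = [
    "20150416-225235- Sufami Turbo",
    "20170228-062640- Satellaview",
    "20170721-094446- Super Nintendo Entertainment System",
]


def release_date(folder: str) -> datetime:
    """Parse the 'YYYYMMDD-HHMMSS' stamp at the start of the folder name."""
    return datetime.strptime(folder[:15], "%Y%m%d-%H%M%S")


# reversed order puts the newest release first, so it is deduped first
for folder in sorted(folders, key=release_date, reverse=True):
    print(release_date(folder).date(), folder[17:])
```

because the stamp is zero-padded, a plain reversed alphabetical sort gives the same order. parsing the date only makes the intent visible.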
         


         
CHAPTER 02: Convert All DATs to XML! (make the DAT-Format consistent)

05.00-   now processing starts with SabreTools. a build from the newest source code is copied to "C:\ST\*"
         
   05.01-    !!backup of the dats collection is made!!
   
   05.02-    all parent sourcedirs are copied to "C:\ST\1\*"
   
   05.03-   starting either from console or batchfile the following command:

sabretools.exe -ud -ox -out="2" 1

      -ud       means update/process the dats
      -ox       means output to xml format
      -out="2"    means output the dats to "C:\ST\2"
     
   05.04-    all dats will be converted to XML from "C:\ST\1\*" to "C:\ST\2\*"
         
      05.04.01-   some dats have better names in their descriptions than the actual setnames. like "MESS SL", "MAME ROMs", "MAME CHDs" or even Gruby's adventure pack dats
                 
                  example from mess sl snes dat (setname):
                     <software name="smasup"
               
                  example from mess sl snes dat (description):
                     <description>Super Mario All-Stars (USA, Final Prototype)
                     
               for these kinds of dats i prefer the setnames replaced with the description information, which works with the following command:
         
sabretools.exe -ud -ox -dan -out="2" 1d

      -dan      means use description as setname
     
               the import folder is now called "1d", which means all the sources that want their setnames replaced with description information should be placed in "C:\ST\1d"
                 
                  example:
                  "C:\ST\1d\[82950] MMSS [! ROMs - MM]"
                  "C:\ST\1d\[83465] Gruby [!]"
                  "C:\ST\1d\[83950] MSSL [!]"
                 
               all other sources use step "05.03"
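
to make the effect of "description as setname" visible, here is a rough illustration. it is not what SabreTools does internally, just the same idea applied to a tiny sample i wrapped around the two mess sl lines quoted above (the surrounding softwarelist element and the function name are assumptions):

```python
import xml.etree.ElementTree as ET

# simplified sample built from the two lines quoted above
SAMPLE = """<softwarelist name="snes">
  <software name="smasup">
    <description>Super Mario All-Stars (USA, Final Prototype)</description>
  </software>
</softwarelist>"""


def description_as_setname(xml_text: str) -> str:
    """Overwrite each set's name attribute with the text of its description."""
    root = ET.fromstring(xml_text)
    for entry in root.iter():
        description = entry.find("description")  # direct children only
        if description is not None and description.text and "name" in entry.attrib:
            entry.set("name", description.text.strip())
    return ET.tostring(root, encoding="unicode")


print(description_as_setname(SAMPLE))
```

after this the set is called "Super Mario All-Stars (USA, Final Prototype)" instead of "smasup".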
               


CHAPTER 03: start deduping the sources

06.00-   finally it's deduping time. the first thing we need to know is that deduping dats is fastest on an SSD or RAMDisk.
      we also need to know that how many dats can be processed at once depends on how much ram there is. (a 16gb ram pc can process up to 2gb of dats at once)
     
   06.01-    check how big the converted dats are. if the total size is lower than 2gb, copy or move them to "C:\ST\3" and go to step 06.03
   
   06.02-   is the total size of the dats bigger than 2gb? then split them: pick as many sources as fit under 2gb in total and copy them to folder "C:\ST\3"
   
   06.03-   start deduping the dats in reversed alphabetic order with this command:
   
sabretools.exe -ud -m -di -rc -dd -nis="NoDump" -out="4" 3

      -m -di       means merge but diff the output
      -rc         the r means reversed the c means cascaded
      -dd         dedupe
      -nis=      excludes "NoDump" entries (entries that dont have enough information and therefore are not "buildable" with rommanagers)
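
in plain code, my reading of what this flag combination achieves looks roughly like the sketch below. it is a toy model, not SabreTools code: dats are just lists of (name, hash, status) tuples, the hashes "h1"/"h2" are placeholders, and the function name is made up:

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str, str]  # (name, hash, status)


def cascade_dedupe(dats: Dict[str, List[Entry]]) -> Dict[str, List[Entry]]:
    """Walk the dats in reversed alphabetical order and keep each hash only once."""
    seen = set()
    result = {}
    for dat_name in sorted(dats, reverse=True):
        kept = []
        for name, digest, status in dats[dat_name]:
            if status.lower() == "nodump":  # entries without usable hashes are dropped
                continue
            if digest in seen:  # already claimed by a dat processed earlier
                continue
            seen.add(digest)
            kept.append((name, digest, status))
        result[dat_name] = kept
    return result


dats = {
    "[84950] TSC [!]": [("mario (tosec name)", "h1", "good"), ("extra", "h2", "good")],
    "[89900] NoIn [!]": [("Mario (no-intro name)", "h1", "good"), ("missing", "", "nodump")],
}
for dat_name, entries in cascade_dedupe(dats).items():
    print(dat_name, [name for name, _, _ in entries])
```

NoIn is processed first and keeps "h1" under its own name, so TSC is left with only the entry nobody else had.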
     

   06.04-    if you have more than 2gb of dats, first move or delete the dats from "C:\ST\3" and repeat 06.03 with the next block of dats that is around 2gb or less, until all dats are deduped.
         if all the deduped dats together are now under 2gb, put them together in folder "3" and dedupe them all together once again. now you have a complete deduped dats pack
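
the splitting into blocks can be scripted too. this is only a sketch of one simple way to do it (greedy, in reversed alphabetical order). the sizes below are invented for the example and the function name is made up:

```python
def split_into_blocks(sizes_mb, limit_mb=2000):
    """Group sources into blocks whose total size stays under the limit."""
    blocks, current, total = [], [], 0
    for name in sorted(sizes_mb, reverse=True):
        size = sizes_mb[name]
        if current and total + size > limit_mb:
            blocks.append(current)  # block is full, start the next one
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        blocks.append(current)
    return blocks


# hypothetical sizes in MB, just for illustration
sizes = {"[84950] TSC [!]": 1200, "[89800] Rdmp [!]": 900, "[89900] NoIn [!]": 700}
for number, block in enumerate(split_into_blocks(sizes), start=1):
    print(number, block)
```

a single source that is bigger than the limit ends up alone in its own block, so you would still have to handle that one by hand.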
         
you can make these steps a lot easier by writing some batchfiles. so you dont need to manually move around the dats. but thats up to you
         
CHAPTER 04: final sorting methods

for the final sorting i use two batchfiles. one grabs all the systems out of the parent source dirs and moves them to a new destination.
and the second batchfile renames some of the [!] tagged sources datfile folders to make them consistent for future updates.

the first batchfile that moves the systems looks like:

Code:
[code not available]
the result is all systems in one folder: [link not available] and all sources that belong to one system are now together in that system, [link not available]
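
since the batchfile itself is not visible here, this is only a guess at the behaviour described: every system folder is pulled out of its parent source dir and the sources of the same system end up side by side. the function names and the "4" / "sorted" folder names in the demo are assumptions:

```python
import shutil
import tempfile
from pathlib import Path


def gather_systems(workdir: Path, dest: Path) -> None:
    """Move <workdir>/<parent source>/<system>/<source> to <dest>/<system>/<source>."""
    for parent in sorted(p for p in workdir.iterdir() if p.is_dir()):
        for system in [s for s in parent.iterdir() if s.is_dir()]:
            target = dest / system.name
            target.mkdir(parents=True, exist_ok=True)
            for source in list(system.iterdir()):
                shutil.move(str(source), str(target / source.name))


def demo() -> list:
    """Build a tiny throwaway layout and return what ends up in the snes folder."""
    system = "Nintendo - Super Nintendo Entertainment System - [1990][V]"
    with tempfile.TemporaryDirectory() as tmp:
        work, dest = Path(tmp) / "4", Path(tmp) / "sorted"
        for parent, source in (("[89900] NoIn [!]", "NoIn [!]"), ("[84950] TSC [!]", "TSC [!]")):
            (work / parent / system / source).mkdir(parents=True)
        gather_systems(work, dest)
        return sorted(p.name for p in (dest / system).iterdir())


print(demo())  # ['NoIn [!]', 'TSC [!]']
```

the demo runs in a temp folder so it cannot touch a real datroot. keep the destination outside the workdir when you point it at real folders.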

and the other is a bit longer and each source needs its own handling but one example for NoIn [!]:

Code:
[code not available]
this for example looks through all systems and when it finds a folder called "NoIn [!]", it renames the folders inside by deleting the first 17 characters.
           
remember the snes datcontaining folders:

20150416-225235- Sufami Turbo
20170228-062640- Satellaview
20170721-094446- Super Nintendo Entertainment System

deleting the first 17 characters renames them to this:

Sufami Turbo
Satellaview
Super Nintendo Entertainment System   

this will help to keep the romroot of the newest release dats consistent, so no roms move around, besides the ones that are now rollbacks. those move to the old release folders, for which the date information is not deleted.
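
the renaming step could look like this in a script. it is a sketch and not the hidden batchfile: instead of blindly cutting 17 characters it only strips the prefix when it really looks like a date stamp. the function name is made up:

```python
import re

# "YYYYMMDD-HHMMSS- " is exactly 17 characters long
DATE_PREFIX = re.compile(r"^\d{8}-\d{6}- ")


def strip_date_prefix(folder: str) -> str:
    """Drop the leading date stamp, leave other folder names untouched."""
    return DATE_PREFIX.sub("", folder, count=1)


folders = [
    "20150416-225235- Sufami Turbo",
    "20170228-062640- Satellaview",
    "20170721-094446- Super Nintendo Entertainment System",
]
for folder in folders:
    print(strip_date_prefix(folder))
```

checking the pattern first means a folder that was already renamed does not lose another 17 characters when the script runs a second time.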

i am using a parent batchfile that starts everything one after another and does the moving and renaming automatically. so right now i only need to update the sources whenever new dats are released, and then i can start the batchfile and go drink some coffee. when all is done i just replace the datroot with the new files, update them in RV, and voila, only updated stuff shows up, everything else is in place
     
     
thats it

thats how the Captain made knick knack with the megaduck!

     
   
   
           
           
           
     
« Last Edit: July 24, 2017 - 16:11:01 by @tractivo »