# [level6](https://github.com/cdelorme/level6)

This was a fun and amusing project I chose to build for a variety of reasons. I myself have roughly 6 terabytes of personal data files, much of it images, videos, and of course copious amounts of text files from project source code.

While my scale is perhaps a bit larger, my problem is no different from the one my family and friends face regularly. There are no free or simple tools out there that handle file deduplication flexibly and efficiently.

My goal with this project was to create free and open source software that could handle file deduplication on personal computers, ideally at breakneck speed and with a high degree of accuracy.

The name of the project came from the concept of destroying cloned files. As an avid anime fan I decided to name it after a particular project in a series I enjoyed, although the connotation of that project in the series is certainly darker than the purpose of this tool.


## design considerations

Originally I wanted to go with simple sha256 hash comparison. My first implementation created a hash for every file in a single-threaded loop and printed out sets of results grouped by hash.
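As a rough illustration (not the actual level6 source), that first pass might have looked something like this minimal Go sketch, which walks a directory in a single thread, hashes every file with sha256, and prints any hash shared by more than one path:

```go
// naive single-threaded duplicate finder: sha256 every file, group by hash
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	groups := map[string][]string{}
	filepath.Walk(os.Args[1], func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return nil
		}
		sum := sha256.Sum256(data)
		key := hex.EncodeToString(sum[:])
		groups[key] = append(groups[key], path)
		return nil
	})
	for hash, paths := range groups {
		if len(paths) > 1 {
			fmt.Println(hash, paths)
		}
	}
}
```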
Then, as I read more about file comparison, I realized that file size is relevant: files with different sizes cannot possibly be duplicates, so there is no need to hash them. I took the single-core file walk and grouped files by size before creating sha256 hashes. I then added options to delete or move the duplicate files, and json output so the results could be consumed by other applications more easily.
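A hedged sketch of that size-first grouping, again hypothetical rather than the project's real code and assuming the same imports as the sketch above, collects paths keyed by file size and only hashes the size groups that contain more than one file:

```go
// sizeThenHash groups candidate duplicates by size before hashing,
// so files with a unique size are never read or hashed at all.
func sizeThenHash(root string) map[string][]string {
	sizes := map[int64][]string{}
	filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err == nil && !info.IsDir() {
			sizes[info.Size()] = append(sizes[info.Size()], path)
		}
		return nil
	})

	hashes := map[string][]string{}
	for _, paths := range sizes {
		if len(paths) < 2 {
			continue // a unique size cannot have duplicates
		}
		for _, path := range paths {
			data, err := os.ReadFile(path)
			if err != nil {
				continue
			}
			sum := sha256.Sum256(data)
			key := hex.EncodeToString(sum[:])
			hashes[key] = append(hashes[key], path)
		}
	}
	return hashes
}
```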
My next stage was adding concurrency, which was my first attempt at golang's concurrency model. It was actually quite refreshing once I understood how it worked, and it improved the performance of the software quite a bit.
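The usual shape for this in Go, shown here only as a hedged sketch of the general pattern rather than level6's actual implementation (it assumes `sync` plus the imports above), is a fixed pool of worker goroutines pulling paths from one channel and sending hashes back on another:

```go
type result struct {
	path string
	hash string
}

// hashConcurrently fans the hashing work out across a fixed pool of goroutines.
func hashConcurrently(paths []string, workers int) []result {
	jobs := make(chan string)
	out := make(chan result)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for path := range jobs {
				data, err := os.ReadFile(path)
				if err != nil {
					continue
				}
				sum := sha256.Sum256(data)
				out <- result{path, hex.EncodeToString(sum[:])}
			}
		}()
	}

	// feed the workers, then close the output once they all finish
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
		wg.Wait()
		close(out)
	}()

	var results []result
	for r := range out {
		results = append(results, r)
	}
	return results
}
```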
I showed it to a few friends who gave me some suggestions, including adding summary data such as how many files were scanned, how many hashes were generated, and how long the operation took. Another suggestion was to use a lighter hashing algorithm as a first pass before generating sha256 hashes. This turned out to be another boon to performance, since sha256 is an expensive algorithm compared to something like crc32, so I added crc32 as a first measure before running sha256 hashing. I also implemented full counts of how many hashes were generated, plus how many duplicates were found.
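The idea behind the crc32 pre-pass is cheap elimination: within a size group, only files that share a crc32 checksum need the more expensive sha256 comparison. A hypothetical sketch of that filter (assuming `hash/crc32` alongside the imports above):

```go
// crc32First narrows a group of same-size files with a cheap checksum,
// then confirms the survivors with sha256.
func crc32First(paths []string) map[string][]string {
	crcs := map[uint32][]string{}
	for _, path := range paths {
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		c := crc32.ChecksumIEEE(data)
		crcs[c] = append(crcs[c], path)
	}

	confirmed := map[string][]string{}
	for _, group := range crcs {
		if len(group) < 2 {
			continue // a unique crc32 means no duplicate in this group
		}
		for _, path := range group {
			data, err := os.ReadFile(path)
			if err != nil {
				continue
			}
			sum := sha256.Sum256(data)
			key := hex.EncodeToString(sum[:])
			confirmed[key] = append(confirmed[key], path)
		}
	}
	return confirmed
}
```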
The end result is roughly 600 lines of code, which provides a very fast way to identify and manage duplicate files. There is still a lot of room for improvement, but I'm pretty happy with how quickly I was able to put it together.


## future plans

- fix dodgy windows compatibility
- byte-by-byte comparison for large files, and 100% accuracy going a step beyond sha256
- a core library with separate cli & gui repositories
- detailed multimedia comparison

Tests on 32-bit Windows Vista have crashed, while 64-bit Windows 8.1 works. However, even then it seems that resource exhaustion can occur on Windows, at which point the OS forces the program to close. This problem stems from the less-than-frugal use of ram to store file contents while generating sha256 hashes. One solution is to set a maximum file size and simply omit larger files from comparison. This has worked well in my test cases, but it's obviously not perfect.
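A hedged sketch of that workaround follows; the flag name and default are invented for illustration, and the point is simply that oversized files are never read into memory at all:

```go
// maxSize is a hypothetical limit; files above it are skipped entirely
var maxSize = flag.Int64("max-size", 1<<30, "skip files larger than this many bytes")

// include reports whether a file should take part in comparison.
func include(info os.FileInfo) bool {
	if info.IsDir() {
		return false
	}
	if info.Size() > *maxSize {
		return false // too large: never loaded, so it cannot exhaust ram
	}
	return true
}
```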
I would like to implement byte-by-byte comparison for a more detailed approach to comparing two files. While the odds of a sha256 collision are absurdly small, this would give us a 100% trust option for identifying duplicates. Similarly, it would allow us to assume a safe arbitrary max file size, reducing the necessary cli arguments while still being capable of handling very large files.
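This is not implemented yet, but a minimal sketch might stream both files in fixed-size chunks and bail out at the first difference, keeping memory use constant no matter how large the files are (it assumes `bytes`, `io`, and `os`):

```go
// sameBytes streams two files in fixed-size chunks and compares them directly,
// so memory use stays constant regardless of file size.
func sameBytes(a, b string) (bool, error) {
	fa, err := os.Open(a)
	if err != nil {
		return false, err
	}
	defer fa.Close()

	fb, err := os.Open(b)
	if err != nil {
		return false, err
	}
	defer fb.Close()

	bufA := make([]byte, 64*1024)
	bufB := make([]byte, 64*1024)

	for {
		na, errA := io.ReadFull(fa, bufA)
		nb, errB := io.ReadFull(fb, bufB)

		aDone := errA == io.EOF || errA == io.ErrUnexpectedEOF
		bDone := errB == io.EOF || errB == io.ErrUnexpectedEOF
		if errA != nil && !aDone {
			return false, errA // genuine read failure, not just end of file
		}
		if errB != nil && !bDone {
			return false, errB
		}

		// a difference in chunk length or content means the files differ
		if !bytes.Equal(bufA[:na], bufB[:nb]) {
			return false, nil
		}
		if aDone && bDone {
			return true, nil // both files ended with identical content
		}
	}
}
```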
I already worked to separate the code into its own library folder, but in the future I want to split that out as its own repository, then import and use it from both a cli and a gui implementation. This would allow something like `level6-sdl` to provide a graphical interface, stacked on top of the same source as the cli implementation.
I would like to use more detailed comparison methods for images, videos, and audio: algorithms that identify key points in similar files and can account for alterations such as changed contrast, brightness, cropping, or rotation, yet still match similar items. These small changes may be barely visible in the files themselves, but they would cause hash or byte comparison to fail.
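None of this exists in level6 today; purely as a flavor of what perceptual comparison looks like, here is a hedged sketch of one well-known technique, an average hash, which tolerates small brightness or compression changes that would break exact hashing (it assumes the image was already decoded via the standard `image` packages and imports `math/bits`):

```go
// averageHash builds a tiny perceptual fingerprint: downsample the image to
// an 8x8 grid of gray values, then set one bit per cell depending on whether
// it is brighter than the mean. Similar images yield hashes that differ in
// only a few bits.
func averageHash(img image.Image) uint64 {
	bounds := img.Bounds()
	var gray [64]uint64
	var total uint64

	for y := 0; y < 8; y++ {
		for x := 0; x < 8; x++ {
			// nearest-neighbour sample from the centre of each grid cell
			px := bounds.Min.X + (2*x+1)*bounds.Dx()/16
			py := bounds.Min.Y + (2*y+1)*bounds.Dy()/16
			r, g, b, _ := img.At(px, py).RGBA()
			v := (uint64(r) + uint64(g) + uint64(b)) / 3
			gray[y*8+x] = v
			total += v
		}
	}

	mean := total / 64
	var hash uint64
	for i, v := range gray {
		if v > mean {
			hash |= 1 << uint(i)
		}
	}
	return hash
}

// distance counts differing bits; a small count suggests similar images.
func distance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}
```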
|