This was a fun project that I chose to build for a variety of reasons.
There are few free utilities that provide file deduplication, even on open-source platforms. Some modern file systems offer deduplication features, but they are complex, generally don't resolve the underlying problem, and are not portable.
I have roughly 6 terabytes of personal data files, much of which consists of images, videos, and of course copious text files from project source code. While this may differ from the average user to some degree, I have found that many people suffer from the same problem of redundant copies, often due to the lack of a good backup strategy. The problem compounds over time and persists long after you have found a good backup strategy.
The objective was to produce a relatively simple command line tool that could handle file deduplication on personal computers. Ideally, it would run fast and with a high degree of reliability and accuracy.
The tool's name is loosely based on events during a storyline in an anime.
This project has undergone several iterations to reach its current state.
The first iteration was based almost entirely around sha256 hash comparisons, under the assumption that two files with the same sha256 hash could safely be treated as identical.
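For illustration, here is a minimal sketch of that first approach, not the tool's actual code: read each file whole, take its sha256 digest, and treat files that share a digest as duplicates. The package name `dedupe` and the helper name are hypothetical.

```go
package dedupe

import (
	"crypto/sha256"
	"fmt"
	"io/ioutil"
)

// hashWholeFile reads the entire file into memory and returns its sha256
// digest as a hex string; files sharing a digest were treated as duplicates.
func hashWholeFile(path string) (string, error) {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha256.Sum256(data)), nil
}
```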
After realizing that duplicate files must be the same size, a second iteration grouped the files by size before comparing them.
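A sketch of that grouping pass, again with hypothetical names: bucket paths by file size, and only look more closely at buckets holding two or more files.

```go
package dedupe

import (
	"os"
	"path/filepath"
)

// groupBySize walks root and buckets regular files by their size in bytes.
// Only buckets containing two or more paths can possibly hold duplicates.
func groupBySize(root string) (map[int64][]string, error) {
	groups := make(map[int64][]string)
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			groups[info.Size()] = append(groups[info.Size()], path)
		}
		return nil
	})
	return groups, err
}
```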
A third iteration added a concurrency model using channels, which was my first taste of concurrency in Go and appeared to boost the speed significantly. By comparing each group on its own goroutine, I was able to parallelize the process, which mainly sped up hash generation.
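Roughly how that channel-based fan-out might have looked; this is a sketch rather than the original code, and `compare` stands in for whatever per-group check was in use at the time (still hashing at that point).

```go
package dedupe

import "sync"

// compareGroups runs compare on every size group in its own goroutine and
// streams each group's duplicate sets back over a single results channel.
func compareGroups(groups map[int64][]string, compare func([]string) [][]string) <-chan [][]string {
	results := make(chan [][]string)
	var wg sync.WaitGroup
	for _, paths := range groups {
		if len(paths) < 2 {
			continue // a lone file cannot have a duplicate
		}
		wg.Add(1)
		go func(paths []string) {
			defer wg.Done()
			results <- compare(paths)
		}(paths)
	}
	go func() {
		wg.Wait()
		close(results) // signal the consumer once every group has reported
	}()
	return results
}
```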
A friend suggested leading with crc32, since it computes faster than sha256 and would allow non-matching files to be ruled out early. This led to the fourth iteration, where crc32 was introduced ahead of sha256.
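A sketch of that crc32 pre-filter, under the same whole-file-read model that was still in place at the time: if two files' checksums differ they cannot be duplicates, so the sha256 step is skipped for them.

```go
package dedupe

import (
	"hash/crc32"
	"io/ioutil"
)

// crc32Of returns the CRC-32 (IEEE) checksum of a file. Differing checksums
// rule a pair out early, before the more expensive sha256 comparison.
func crc32Of(path string) (uint32, error) {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return crc32.ChecksumIEEE(data), nil
}
```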
After encountering a strange and poorly documented Windows behavior in which resource exhaustion would terminate a process, I realized that loading entire files into memory was probably not a wise choice, and a fifth iteration introduced buffered hashing.
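Buffered hashing is a small change in Go: stream the file into the hasher instead of reading it all at once. A sketch, assuming sha256 was still the digest in use:

```go
package dedupe

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// hashStreamed computes a file's sha256 without holding the whole file in
// memory; io.Copy feeds the hasher through a small fixed-size buffer.
func hashStreamed(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}
```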
I began to question scenarios that could yield sha256 collisions, and decided that I would rather add byte-by-byte comparison as a final check. This led to a sixth iteration.
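A sketch of such a byte-by-byte check, using a hypothetical `sameBytes` helper: it only runs on files already known to be the same size, and it returns as soon as a chunk differs.

```go
package dedupe

import (
	"bytes"
	"io"
	"os"
)

// sameBytes compares two same-sized files in 4 KB chunks and bails out at
// the first differing chunk, so mismatches are detected cheaply.
func sameBytes(a, b string) (bool, error) {
	fa, err := os.Open(a)
	if err != nil {
		return false, err
	}
	defer fa.Close()
	fb, err := os.Open(b)
	if err != nil {
		return false, err
	}
	defer fb.Close()

	bufA := make([]byte, 4096)
	bufB := make([]byte, 4096)
	for {
		na, errA := io.ReadFull(fa, bufA)
		nb, errB := io.ReadFull(fb, bufB)
		if na != nb || !bytes.Equal(bufA[:na], bufB[:nb]) {
			return false, nil // first mismatching chunk ends the comparison
		}
		if errA == io.EOF {
			return true, nil // both files exhausted with every chunk equal
		}
		if errA != nil && errA != io.ErrUnexpectedEOF {
			return false, errA
		}
		if errB != nil && errB != io.ErrUnexpectedEOF {
			return false, errB
		}
	}
}
```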
With the buffering now in place, I tried moving towards mutex-based concurrency due to the complex appearance of the channel-based model, but during the process noticed that eliminating concurrency resulted in faster execution times. Thus the seventh iteration removed concurrency and converted the code to a pipeline system with placeholders for expected functionality.
The eighth iteration came from the desire to clean up the syntax to match other projects and the “publicly accepted standard”, add godocs, and eliminate YAGNI artifacts. I ended up renaming the repository and library, giving the command line a better name, dropping features that would have reduced the accuracy of the system or relied heavily on external tools, and consolidating the code into a single simple structure.
The current (ninth) iteration involved minimizing the behaviors exposed by the library, fixing the placement of various components, separating metrics into its own library as a dependency, and removing the hashing behavior after realizing that the cost of hashing was not only significantly greater than byte-by-byte comparison but was also thrashing the disk. The result is 232 lines of library code, 40 lines of command line code, and 109 lines of tests with over 90% coverage, omitting tests that would require mocking a file system.
Attempting to test every possible file system behavior by abstracting and overriding (or by mocking) is a fool's errand, and you can simply rely on basic error handling for the remaining 10% of coverage.
Even with buffering, hashing still requires the entire file to be read, whereas byte-by-byte comparison can terminate as early as the first 4k read. This is why you should always approach the problem with the simplest solution first.
I had originally planned to introduce alternative comparison behavior, such as content comparison to find partial-matching files, and multimedia comparison which would have relied on separate tools. Instead I chose to do one thing well.
I spent some time considering reintroducing parallelism and optimizing the comparison algorithm to reduce the total number of times file comparison is run. In theory, neither of these changes would have worked well; the file system cannot truly operate in parallel, and you cannot take advantage of the disk cache when you are bouncing the read head between 4 KB reads across a large number of files. Understanding the foundations of the problem, as well as what your code is actually doing on the system, is essential to writing a good application.
I considered a web interface to make the tool more usable, but introducing ways to cancel a scan, report or request status, and pause would require a significant amount of complexity and rework, and it isn't a priority.
updated on 2017-07-18