
added articles on my staticmd and level6 projects

Casey DeLorme 10 years ago
parent
commit
89872b9214
2 changed files with 79 additions and 0 deletions
  1. src/projects/level6.md (+39 −0)
  2. src/projects/staticmd.md (+40 −0)

src/projects/level6.md (+39 −0)

# [level6](https://github.com/cdelorme/level6)

This was a fun and amusing project that I chose to build for a variety of reasons.  I have roughly 6 terabytes of personal data files, much of which consists of images, videos, and of course copious amounts of text files from project source code.

While my scale is perhaps a bit larger, my problem is no different from the one my family and friends face regularly: there are no free, simple tools out there that handle file deduplication flexibly and efficiently.

My goal with this project was to create free and open-source software that could handle file deduplication on personal computers, ideally at break-neck speed but with a high degree of accuracy.

The name of the project came from the concept of destroying cloned files.  As an avid anime fan I named it after a particular project in a series I enjoyed, although the connotation of that project in the series is certainly darker than the purpose of this one.


## design considerations

Originally I wanted to simply go with sha256 hash comparison.  My first implementation created a list of hashes for every file in a single-threaded loop and printed out sets of results grouped by hash.

Then, as I read more about comparison, I realized that file size is relevant: files of different sizes can never be duplicates, so hashing them is wasted work.  I kept the single-core file walk but grouped files by their size before creating sha256 hashes.  I then added options to delete or move the files, and json output support so the results could be consumed by other applications more easily.

My next stage was adding concurrency, which was my first attempt at Go's concurrency model.  It was actually quite refreshing once I understood how it worked, and it improved the performance of the software considerably.
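
The worker-pool shape of that model can be sketched as below: one goroutine per CPU pulls names off a channel and hashes them. This is an illustrative pattern under my own naming, not level6's actual code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"runtime"
	"sync"
)

// result pairs an input with its hash so completion order is irrelevant.
type result struct {
	name string
	sum  string
}

// hashAll fans work out to one worker goroutine per CPU and collects the
// results on a single channel.
func hashAll(items map[string][]byte) map[string]string {
	jobs := make(chan string)
	results := make(chan result)
	var wg sync.WaitGroup

	// workers: read names, hash contents, send results
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				sum := sha256.Sum256(items[name])
				results <- result{name, hex.EncodeToString(sum[:])}
			}
		}()
	}

	// feeder: queue every item, then signal no more work
	go func() {
		for name := range items {
			jobs <- name
		}
		close(jobs)
	}()

	// closer: once all workers finish, close the results channel
	go func() {
		wg.Wait()
		close(results)
	}()

	out := map[string]string{}
	for r := range results {
		out[r.name] = r.sum
	}
	return out
}

func main() {
	sums := hashAll(map[string][]byte{"a": []byte("one"), "b": []byte("two")})
	fmt.Println(len(sums)) // 2
}
```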

I showed it to a few friends, who made some suggestions: adding summary data such as how many files were scanned, how many hashes were generated, and how long the operation took; and using a lighter hashing algorithm as a stop-gap before generating sha256 hashes.  The latter turned out to be another boon to performance, since sha256 is expensive compared to something like crc32, so I added a crc32 pass as a first measure before sha256 hashing.  I also implemented full counts of how many hashes were generated and how many duplicates were found.
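
The two-pass idea can be sketched like this, with crc32 as the cheap filter and sha256 only confirming survivors; the function names are hypothetical:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash/crc32"
)

// cheapSum is the fast first pass: crc32 costs far less than sha256, and
// files whose crc32 differs cannot be duplicates.
func cheapSum(data []byte) uint32 {
	return crc32.ChecksumIEEE(data)
}

// confirmSum is only computed for files that survived the crc32 pass.
func confirmSum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// findDuplicates groups candidates by crc32, then confirms with sha256.
func findDuplicates(files map[string][]byte) map[string][]string {
	byCRC := map[uint32][]string{}
	for name, data := range files {
		c := cheapSum(data)
		byCRC[c] = append(byCRC[c], name)
	}
	dupes := map[string][]string{}
	for _, names := range byCRC {
		if len(names) < 2 {
			continue // unique crc32: skip the expensive sha256 entirely
		}
		for _, name := range names {
			sum := confirmSum(files[name])
			dupes[sum] = append(dupes[sum], name)
		}
	}
	// keep only sha256 groups with more than one member
	for sum, names := range dupes {
		if len(names) < 2 {
			delete(dupes, sum)
		}
	}
	return dupes
}

func main() {
	files := map[string][]byte{
		"a.txt": []byte("same"),
		"b.txt": []byte("same"),
		"c.txt": []byte("different"),
	}
	fmt.Println(len(findDuplicates(files))) // 1
}
```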

The end result is roughly 600 lines of code that provide a very fast way to identify and manage duplicate files.  There is still a lot of room for improvement, but I'm pretty happy with how quickly I was able to put it together.


## future plans

- fixing dodgy windows compatibility
- byte-comparison for large files, and 100% accuracy a step beyond sha256
- separate core library, cli, and gui repositories
- detailed multimedia comparison

Tests on 32-bit Windows Vista have crashed, while 64-bit Windows 8.1 works.  However, on Windows it seems that resource exhaustion occurs, at which point the OS forces the program to close.  This problem stems from the less-than-frugal use of RAM to store file contents when generating sha256 hashes.  One solution is to set a maximum size and simply omit larger files from comparison; this has worked well in my test cases, but it's obviously not perfect.

I would like to implement byte-by-byte comparison for a more rigorous approach to comparing two files.  While the odds of a sha256 collision are absurdly low, this would give us a 100%-trust option for identifying duplicates.  It would also allow us to assume a safe arbitrary maximum file size, reducing the necessary cli arguments while still being capable of parsing very large files.

I have already separated the code into its own library folder, but in the future I want to split that out as its own repository, then import and use it from both a cli and a gui implementation.  This would allow something like `level6-sdl` to provide a graphical interface stacked on top of the same source as the cli implementation.

I would also like to use more detailed comparison methods for images, videos, and audio: algorithms that identify key points in similar files and can draw conclusions such as altered contrast, brightness, cropping, rotation, etc., and still manage to match similar items.  These small changes may be barely visible in the files themselves, but they would cause hash or byte comparison to fail.

src/projects/staticmd.md (+40 −0)

# [staticmd](https://github.com/cdelorme/staticmd)

I looked at some alternatives such as [harpjs](http://harpjs.com/) and [hugo](http://gohugo.io/), and while both are fantastic tools, I wanted something much more basic.

I feel that the moment a website's content needs configuration files, you are no longer building a basic static site but a dynamic site with a print-release system.  While there is nothing inherently wrong with that approach, it seems like overkill for a number of projects I'd like to build.

My goal with this code was not just a basic static site generator, but the ability to compile many files into a single _"book"_, ideally using a single template file containing nothing more complicated than a handful of variables.


## design considerations

The code needed two paths: one for building a single-page book, and one for building a multi-page site.  The multi-page path also needed a branch that allowed for relative paths, so the output could be used and accessed like locally generated documentation.
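
The relative-path branch can be illustrated with Go's standard library: given the page being generated and a link target, `filepath.Rel` produces an href that works when the page is opened straight from disk. `relativeLink` is a hypothetical helper, not staticmd's API:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// relativeLink computes the href needed to reach target from the
// directory containing page, so generated pages also work when browsed
// locally without a web server.
func relativeLink(page, target string) (string, error) {
	rel, err := filepath.Rel(filepath.Dir(page), target)
	if err != nil {
		return "", err
	}
	// normalize to forward slashes for use inside HTML
	return filepath.ToSlash(rel), nil
}

func main() {
	link, _ := relativeLink("build/projects/level6.html", "build/index.html")
	fmt.Println(link) // ../index.html
}
```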

I was also creating this as a [work-project](https://www.youtube.com/watch?v=gsPabU09FFM) and had a deadline to meet, so I dropped some objectives from the original design, such as multiple-index support, and changed some of the navigation logic after looking at how others were doing it, like [Steve Losh](http://stevelosh.com/).


## future plans

I have a few things I'd like to refactor, features to add, and components to rebuild:

- moving the core logic to a separate library, and making a `staticmd-cli` repository
- rewriting the index generator for single-page and multi-page output more intelligently
- revisiting and cleaning up the process of building the file list and its filters
- adding optional configuration file support

The core of the system, which currently accepts cli inputs, could be adjusted so that the cli logic is independent.  This would allow me to create a graphical interface version in the future (e.g. `staticmd-sdl`).  It's something I've experimented with in my [`level6` project](level6.html).

I have several concerns with the current implementation of index (table-of-contents) generation.  For the single-page path I never created a means to sort the contents; it assumes file-system order, which should be alphabetical.  That also means a section's title, which should point to its index, is not a link, because the listing order is not controlled and linking could drop you into the middle of that sub-section.  The multi-page index generation currently has to go through extra rounds when using relative links, and as before, ordering is only by name.  Ideally I want folders-first sorting for multi-page generation, and some alternative for single-page generation that allows more control over the order.  Fixing the single-page build so that index contents load first within sub-sections, and can be linked to, would be ideal.
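
A folders-first ordering could be sketched like this: compare paths component by component, and at the first difference let a directory component win over a file component. The comparison rule is one possible interpretation, not staticmd's implementation:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// less orders two slash-separated paths folders-first: at the first
// differing component, a folder (a component with more components after
// it) sorts before a file, otherwise components compare alphabetically.
func less(a, b string) bool {
	as := strings.Split(a, "/")
	bs := strings.Split(b, "/")
	for i := 0; i < len(as) && i < len(bs); i++ {
		if as[i] == bs[i] {
			continue
		}
		aDir := i < len(as)-1 // more components follow: this is a folder
		bDir := i < len(bs)-1
		if aDir != bDir {
			return aDir // folders before files
		}
		return as[i] < bs[i]
	}
	return len(as) < len(bs)
}

// foldersFirst sorts a flat list of generated page paths in place.
func foldersFirst(paths []string) {
	sort.Slice(paths, func(i, j int) bool { return less(paths[i], paths[j]) })
}

func main() {
	pages := []string{"index.md", "guide/intro.md", "about.md", "guide/setup.md"}
	foldersFirst(pages)
	fmt.Println(pages) // [guide/intro.md guide/setup.md about.md index.md]
}
```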

I want to revisit and clean up how I build the list of files and folders and apply filters, such as the accepted markdown file types or which paths to ignore.  Right now the way the logic is split does not feel efficient; I'd like to make it more readable without killing performance.
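
One way to centralize those filters is a single predicate the walk can call, sketched below with illustrative values rather than staticmd's real configuration:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Illustrative values only; the real accepted types and ignore rules
// may differ.
var extensions = map[string]bool{".md": true, ".markdown": true}
var ignored = []string{".git", "node_modules"}

// keep makes the filtering decision in one place, so the directory walk
// itself stays free of filter logic.
func keep(path string) bool {
	for _, fragment := range ignored {
		if strings.Contains(path, fragment) {
			return false
		}
	}
	return extensions[strings.ToLower(filepath.Ext(path))]
}

func main() {
	for _, p := range []string{"src/index.md", "src/.git/config", "src/logo.png"} {
		fmt.Println(p, keep(p))
	}
}
```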

While my original concept was to avoid configuration files, I am seeing where the need arises.  If filtered paths, accepted file types, and a bunch of other properties all become flagged options, the command line becomes just as complicated as a configuration file.  I would like to allow configuration file support, with flags for the same set of options, and fall back to a set of _sane_ default values otherwise.


## references

I've been fond of reading about game development lately, and came across [gameprogrammingpatterns.com](http://gameprogrammingpatterns.com/), which was written using markdown.  I enjoyed it primarily because the best practices we use in our field today conflict so greatly with game development's version of best practices, where performance far outweighs the value of things like object-oriented design.

Another great resource I found was [the golang book](http://www.golang-book.com/), which is also written in markdown.  I find great beauty in the simplest of aesthetics, and it is a perfect example of that.