Shorten the README; save the story for the blog post.

Sam Fredrickson 2023-12-06 18:33:26 -08:00
parent 62dcffc2c9
commit 0346b96449

@@ -2,65 +2,21 @@
 ## Background
-I have a niche problem: my storage server's ZFS pool is lumpy!
-```
-NAME SIZE ALLOC FREE FRAG CAP HEALTH
-zones 32.6T 12.2T 20.4T 3% 37% ONLINE
-mirror 3.62T 2.21T 1.41T 5% 61.1% ONLINE
-c0t5000CCA25DE8EBF4d0 - - - - - ONLINE
-c0t5000CCA25DEEC08Ad0 - - - - - ONLINE
-mirror 3.62T 2.22T 1.40T 6% 61.3% ONLINE
-c0t5000CCA25DE6FD92d0 - - - - - ONLINE
-c0t5000CCA25DEEC738d0 - - - - - ONLINE
-mirror 3.62T 2.28T 1.34T 6% 63.0% ONLINE
-c0t5000CCA25DEAA3EEd0 - - - - - ONLINE
-c0t5000CCA25DE6F42Ed0 - - - - - ONLINE
-mirror 3.62T 2.29T 1.33T 5% 63.2% ONLINE
-c0t5000CCA25DE9DB9Dd0 - - - - - ONLINE
-c0t5000CCA25DEED5B7d0 - - - - - ONLINE
-mirror 3.62T 2.29T 1.34T 5% 63.1% ONLINE
-c0t5000CCA25DEB0F42d0 - - - - - ONLINE
-c0t5000CCA25DEECB9Dd0 - - - - - ONLINE
-mirror 3.62T 237G 3.39T 1% 6.38% ONLINE
-c0t5000CCA24CF36876d0 - - - - - ONLINE
-c0t5000CCA249D4AA59d0 - - - - - ONLINE
-mirror 3.62T 236G 3.39T 0% 6.36% ONLINE
-c0t5000CCA24CE9D1CAd0 - - - - - ONLINE
-c0t5000CCA24CE954D2d0 - - - - - ONLINE
-mirror 3.62T 228G 3.40T 0% 6.13% ONLINE
-c0t5000CCA24CE8C60Ed0 - - - - - ONLINE
-c0t5000CCA24CE9D249d0 - - - - - ONLINE
-mirror 3.62T 220G 3.41T 0% 5.93% ONLINE
-c0t5000CCA24CF80849d0 - - - - - ONLINE
-c0t5000CCA24CF80838d0 - - - - - ONLINE
-```
-You can probably guess what happened: I had a zpool with five mirrors, and then
-expanded it by adding four more mirrors. ZFS doesn't automatically rebalance
-existing data, but does skew writes of new data so that more go to the newer
-mirrors.
-To rebalance the data, the algorithm is straightforward:
-* for file in dataset,
-* copy the file to a temporary directory in another dataset
-* delete the original file
-* copy from the temporary directory to recreate the original file
-* delete the temporary directory
-As the files get rewritten, not only do the newer mirrors get more full, but
-also the older mirrors free up space. Eventually, the utilization of all mirrors
-should converge.
-## Solution
-The `datashake` program aims to automate the rebalancing process, while also
-adding some robustness and heuristics.
-* Gracefully handle shutdowns (e.g. Ctrl-c) to prevent files from getting lost.
-* Keep track of processed files, so that if the program stops and resumes, it
-can skip those files.
-* Write a journal of operations so that, if shut down ungracefully, files in
-the temporary directory can be identified and recovered.
-* Don't bother processing really small files.
+See [this blog post](https://blog.humancabbage.net/posts/datashake) for the
+motivation behind this program. Basically, this program copies files back-and-
+forth between ZFS datasets to attempt to address unbalanced utilization among
+vdevs.
+## Usage
+```text
+$ datashake --source /tank/stuff --temp /tank/temp --concurrency 2
+```
+## Shortcomings
+* The way actions and errors are logged in-memory and only persisted at the
+end is not robust enough. Program crashes or system power loss can cause
+files to be lost in the temporary directory. In the meantime, the program
+still writes to `stdout` for each copy operation, so piping the output to
+`tee` should suffice for now.
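
For reviewers, the rebalancing steps described in the removed README text (copy each file to a temporary directory in another dataset, delete the original, copy it back, delete the temporary directory) can be sketched roughly as below. This is an illustrative Python sketch only, not `datashake`'s actual code: the names `shake_file` and `shake_tree` and the `min_size` parameter are hypothetical, and it omits the journaling, resume tracking, and concurrency that the README mentions.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def shake_file(path: Path, temp_root: Path) -> None:
    """Rewrite one file by bouncing it through a temporary directory.

    Assumption: temp_root lives in a different dataset than `path`,
    so that both copies are genuine rewrites of the data.
    """
    temp_dir = Path(tempfile.mkdtemp(dir=temp_root))
    temp_copy = temp_dir / path.name
    shutil.copy2(path, temp_copy)   # 1. copy the file to the temporary directory
    path.unlink()                   # 2. delete the original file
    shutil.copy2(temp_copy, path)   # 3. recreate the original from the copy
    # 4. Remove the temporary directory only after every step succeeded.
    #    If an exception was raised above, the temporary copy is left in
    #    place so the file can still be recovered by hand.
    shutil.rmtree(temp_dir)


def shake_tree(source: Path, temp_root: Path, min_size: int = 0) -> list[Path]:
    """Apply shake_file to every regular file under `source`.

    `min_size` (hypothetical) skips files smaller than that many bytes.
    Returns the list of files that were rewritten.
    """
    processed = []
    # sorted() materializes the listing before any file is touched.
    for path in sorted(source.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_size < min_size:
            continue
        shake_file(path, temp_root)
        processed.append(path)
    return processed
```

Between steps 2 and 3 the temporary copy is the only copy of the file, which is the window the "Shortcomings" note about crashes and power loss refers to.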