Deduplication Basics & Best Practices

After receiving many questions about the data deduplication module for Windows 10 (article), I decided to write this short article to share some best practices and usage tips. Note that data deduplication is disabled by default and is not supported on certain volumes, such as any volume that is not formatted with NTFS or any volume smaller than 2 GB. You can retrieve the entire list of deduplication cmdlets here.

When you want to enable data deduplication on one or more volumes, you need to use the Enable-DedupVolume cmdlet; you can then use the Set-DedupVolume cmdlet to customize the data deduplication settings afterward. The most important parameter of this command is UsageType, which specifies the expected type of workload for the volume. This parameter sets several low-level settings to default values appropriate for the usage type you specify, as shown in the example after this list. The acceptable values for this parameter are:

  • HyperV – A volume for Hyper-V storage.
  • Backup – A volume that is optimized for virtualized backup servers.
  • Default – A general purpose volume. This is the default value.
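
For example, enabling deduplication on a volume used for Hyper-V storage could look like the following sketch (the drive letter V: is just a placeholder):

    # Enable data deduplication on the V: volume with Hyper-V tuned defaults
    Enable-DedupVolume -Volume "V:" -UsageType HyperV

    # Review the resulting settings
    Get-DedupVolume -Volume "V:" | Format-List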

Once you have enabled deduplication, you may want to customize the data deduplication settings on one or more volumes. Set-DedupVolume exposes many parameters that are not available in Enable-DedupVolume. In particular, you can exclude an array of file extensions, or an array of root folders, that the deduplication engine will skip during optimization. Two other parameters are also important: MinimumFileAgeDays, which specifies the number of days to wait before the deduplication engine optimizes files, and MinimumFileSize, which specifies the minimum size threshold, in bytes, for files to be optimized. This last parameter can be useful for a Hyper-V usage: for example, you can force deduplication of .vhdx files but not of small configuration files. One last useful parameter is ChunkRedundancyThreshold, which specifies the number of identical chunks of data that the deduplication engine must encounter before the server creates a redundant copy of the data chunk. This increases the reliability of the server by adding redundancy to the most referenced chunks of data: deduplication detects corruption, and the deduplication scrubbing job restores corrupted chunks from a redundant copy if one is available. The default value is 100 and the minimum value you can set is 20; note that a low value reduces the effectiveness of data deduplication by creating more redundant copies of a chunk, and consumes more memory and disk space.
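
Here is a sketch of what such a customization could look like (the drive letter, excluded extensions and folder, and threshold values are only illustrative):

    # Exclude temporary files and a scratch folder from deduplication,
    # and only optimize files older than 3 days and larger than 64 KB
    Set-DedupVolume -Volume "D:" `
        -ExcludeFileType tmp, log `
        -ExcludeFolder "D:\Temp" `
        -MinimumFileAgeDays 3 `
        -MinimumFileSize 65536 `
        -ChunkRedundancyThreshold 100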

Then you need to launch your deduplication job by using the Start-DedupJob cmdlet. Note that the deduplication job can be queued if the server is already running another job on the same volume (which you can check using the Get-DedupJob cmdlet) or if the computer does not have sufficient resources to run the job. Jobs that you start with this cmdlet are marked as manual jobs and are given priority over scheduled jobs. Thanks to the Cores and Memory parameters, you can control the maximum percentage of physical cores and memory that a job uses. You can also use the StopWhenSystemBusy parameter to make the server stop the job when the system is busy and retry later (this can be particularly useful for a scheduled job). But the most important parameter of this cmdlet is Type, which specifies the type of data deduplication job; see the example after this list. The acceptable values for this parameter are:

  • Optimization – Starts the data deduplication (optimization) process.
  • GarbageCollection – Frees up space by removing deleted or unreferenced data chunks on the volume.
  • Scrubbing – Validates the integrity of all deduplicated data on the volume.
  • Unoptimization – Reverts the data deduplication process on the volume.
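
As an illustration, here is how a manual optimization job could be started and monitored (the resource caps are only examples):

    # Start a manual optimization job on D:, limited to 50% of cores and memory,
    # and let it back off if the system becomes busy
    Start-DedupJob -Volume "D:" -Type Optimization -Cores 50 -Memory 50 -StopWhenSystemBusy

    # List running and queued deduplication jobs
    Get-DedupJob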

But what about real life? In my case, I use the deduplication engine for my Data and Virtual volumes. The first one contains a bunch of different files, from standard Office documents to movies or even games, so I prefer to use the Default usage type with a minimum file size of 5 GB, a minimum file age of 15 days and a chunk redundancy threshold of 80.
The second volume is used to store one of my virtual labs, with a lot of virtual machines running under Hyper-V. In this case, I use the HyperV usage type, with a minimum file size of 512 MB and a minimum file age of 0 days.
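
As a rough sketch (the drive letters D: and V: are placeholders for my Data and Virtual volumes), this setup could be configured along these lines:

    # Data volume: general purpose content, conservative settings
    # (the minimum file size threshold is set the same way, with -MinimumFileSize)
    Enable-DedupVolume -Volume "D:" -UsageType Default
    Set-DedupVolume -Volume "D:" -MinimumFileAgeDays 15 -ChunkRedundancyThreshold 80

    # Virtual volume: Hyper-V lab storage, optimize files almost immediately
    Enable-DedupVolume -Volume "V:" -UsageType HyperV
    Set-DedupVolume -Volume "V:" -MinimumFileAgeDays 0 -MinimumFileSize 512MB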

Finally, remember to run maintenance jobs from time to time, especially if you work with a large amount of data on your deduplicated volumes. For example, if you delete 10 virtual machines from your volume, you will need to run manual garbage collection and scrubbing jobs to clean up the old chunks, free up some space, and check the integrity of the remaining data, as you can see in the example below.
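
A minimal sketch of that cleanup, assuming the virtual lab volume V: from above:

    # Reclaim the space left behind by the deleted virtual machines
    Start-DedupJob -Volume "V:" -Type GarbageCollection

    # Then validate the integrity of the remaining deduplicated data
    Start-DedupJob -Volume "V:" -Type Scrubbing

    # Monitor the jobs and check the volume savings afterwards
    Get-DedupJob
    Get-DedupStatus -Volume "V:"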
