I had a serious problem last year: my Compellent SANs were running out of free space, and FAST. Turns out, this happens with arrays from lots of storage vendors and is caused by virtual disk (vmdk) migrations. Any time a virtual disk moves, the SAN sees it as new writes and doesn't relinquish the original space. So if you do lots of SVMotion jobs or disk-based backups that create snapshots and then delete them, this could be a serious problem!
So, here is the extremely condensed version of what I did to resolve the issue and ultimately reclaim 30TB, yes, THIRTY TERABYTES on my production arrays.
I ended up installing vCenter Orchestrator and using scheduled workflows to automate the process.
(I owe credit for this idea to my good friend Sean Howard at VMware)
There are a few catches, though. A few things have to be done before creating workflows in Orchestrator:
- you have to enable SSH on a host (or a few; depending on how many hosts you want running the workflows)
- you have to/should suppress the SSH security warnings on those hosts
- you have to enable SSH password authentication on those same hosts
- you have to enable use of the UNMAP commands for space reclamation (VAAI primitives)
- then you can issue the UNMAP commands against the datastores/LUNs
Enable SSH
SSH is required for command-line access to ESXi hosts.
- Using the vSphere Client, select the ESXi host, then the “Configuration” tab, then under “Software” click “Security Profile”.
- In the top-right corner, click “Properties”, then select “SSH”, then click “Options”
- From here you can start or stop the SSH service and set the startup type to manual or automatic.
Disable SSH security warnings
If SSH is enabled, it will generate warnings on the summary tab of each host. These warnings can be suppressed.
- Using the vSphere Client, select the ESXi host, then the “Configuration” tab, then under “Software” click “Advanced Settings”.
- Locate “UserVars” in the left-hand pane, then change “UserVars.SuppressShellWarning” to 1.
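If you already have a shell on the host, the same setting can be changed from the command line instead of the vSphere Client. Here's a minimal sketch of that. The function name and the ESXCLI override variable are my own additions (the override is only there so the call can be dry-run on a machine that doesn't have esxcli), so treat this as an illustration rather than the official procedure.

```shell
#!/bin/sh
# Sketch: suppress the SSH/shell warning from the command line.
# suppress_shell_warning and the ESXCLI variable are hypothetical helpers;
# ESXCLI defaults to the real esxcli binary on an ESXi host.
suppress_shell_warning() {
    "${ESXCLI:-esxcli}" system settings advanced set \
        --option /UserVars/SuppressShellWarning --int-value 1
}

# On the ESXi host:
# suppress_shell_warning
```

Setting ESXCLI=echo before calling the function prints the arguments instead of running them, which is handy for checking the command before touching a production host.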
SSH Password Authentication
To allow for applications such as vCenter Orchestrator to have root or shell access to an ESXi host, Password Authentication must be enabled.
- Log into the console with SSH (putty).
- Navigate to /etc/ssh and edit sshd_config using vi
- Type vi sshd_config and press Enter.
- Now you are in the vi text editor. You can move around using page up, page down, and the arrow keys.
- Enter insert mode. You can press i to start editing where the cursor is, you can also press shift+o to start editing on a new line above the cursor, or o to start editing on a new line below the cursor.
- Change the line "PasswordAuthentication no" to "PasswordAuthentication yes".
- To save and quit, press Esc, then type :wq and press Enter.
- To quit without saving, press Esc, then type :q! and press Enter.
- Restart the SSH service on the ESXi host from the “Configuration” tab, under “Security Profile”.
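If you'd rather not do the edit by hand in vi on every host, the same change can be sketched with sed. This is only a rough example: enable_password_auth is a name I made up, it assumes the file contains a plain "PasswordAuthentication no" line as described above, and it keeps a .bak copy next to the original in case something goes wrong.

```shell
#!/bin/sh
# Sketch: flip "PasswordAuthentication no" to "yes" in an sshd_config-style file.
# enable_password_auth is a hypothetical helper, not an ESXi tool.
enable_password_auth() {
    config_file="$1"
    # Keep a backup copy before changing anything.
    cp "$config_file" "$config_file.bak" || return 1
    # Read from the backup and rewrite the original file in place,
    # changing only a line that reads "PasswordAuthentication no".
    sed 's/^PasswordAuthentication[[:space:]][[:space:]]*no[[:space:]]*$/PasswordAuthentication yes/' \
        "$config_file.bak" > "$config_file"
}

# On the ESXi host (path taken from the steps above):
# enable_password_auth /etc/ssh/sshd_config
```

The SSH service still has to be restarted afterwards, exactly as in the manual steps.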
Enable use of UNMAP commands for space reclamation
vSphere 5.0 introduced the VAAI Thin Provisioning Block Space Reclamation (UNMAP) primitive. This feature was designed to efficiently reclaim deleted space to meet continuing storage needs. ESXi 5.0 issues UNMAP commands for space reclamation during several operations.
- Log into the console with SSH (putty) and issue the command below
- esxcli system settings advanced set --int-value 1 --option /VMFS3/EnableBlockDelete
- This is a per-host setting and must be issued on each ESXi 5.0 host in your cluster.
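Since this has to be repeated on every host, a small loop saves some typing. The sketch below assumes SSH is already enabled on each host and that you log in as root. The host names, the function names, and the DRY_RUN switch are placeholders I added for illustration.

```shell
#!/bin/sh
# Build the esxcli command from the step above.
# $1 is the value to set: 1 enables UNMAP, 0 disables it.
block_delete_cmd() {
    printf 'esxcli system settings advanced set --int-value %s --option /VMFS3/EnableBlockDelete\n' "$1"
}

# Run the command on every host passed as an argument.
# With DRY_RUN=1 the ssh command lines are printed instead of executed.
enable_block_delete_on_hosts() {
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh root@$host $(block_delete_cmd 1)"
        else
            ssh "root@$host" "$(block_delete_cmd 1)"
        fi
    done
}

# Example with made-up host names:
# DRY_RUN=1 enable_block_delete_on_hosts esx01 esx02
```

To check the result on a host, "esxcli system settings advanced list --option /VMFS3/EnableBlockDelete" shows the current value.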
Issue UNMAP command against VMFS datastore (manually)
The actual space reclamation command is as follows. This writes a “balloon file” to the top blocks of the datastore to persuade the SAN to relinquish the free space.
- Log into the console with SSH (putty)
- Change directories into the VMFS datastore you wish to reclaim space from
- cd /vmfs/volumes/datastorename (where datastorename is the actual name of the datastore)
- Issue the UNMAP command
- vmkfstools -y 60 (where 60 is the percentage of the datastore's free space that the balloon file will attempt to reclaim)
So to automate this, what you can do is create custom scheduled workflows in vCenter Orchestrator and run something like this:
cd /vmfs/volumes/LUN-NAME && vmkfstools -y 60
Where "LUN-NAME" is the VMFS datastore name.
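If you have more than a handful of datastores, the one-liner can be wrapped in a small script that walks through them one at a time. This is a rough sketch, not what my Orchestrator workflows literally contain: the function names, the DRY_RUN switch, and the datastore names in the example are made up, and the 1 to 99 range check is my own safety guard, not a documented vmkfstools limit.

```shell
#!/bin/sh
# Build the reclaim command for one datastore.
# $1 = datastore name, $2 = percentage of free space to reclaim (default 60).
reclaim_cmd() {
    printf 'cd "/vmfs/volumes/%s" && vmkfstools -y %s\n' "$1" "${2:-60}"
}

# Accept only whole numbers from 1 to 99 (an assumed guard, see above).
valid_percent() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 99 ]
}

# Reclaim space on each named datastore, one after another.
# $1 = percentage, remaining arguments = datastore names.
# With DRY_RUN=1 the commands are printed instead of executed.
reclaim_datastores() {
    percent="$1"
    shift
    valid_percent "$percent" || { echo "bad percentage: $percent" >&2; return 1; }
    for ds in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            reclaim_cmd "$ds" "$percent"
        else
            # Subshell so the cd does not leak into the next iteration.
            (cd "/vmfs/volumes/$ds" && vmkfstools -y "$percent") || return 1
        fi
    done
}

# Example with made-up datastore names:
# DRY_RUN=1 reclaim_datastores 60 datastore1 datastore2
```

Running the datastores sequentially rather than all at once keeps the I/O load from the balloon files down to one LUN at a time.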
You will see tremendous spikes in front-end I/O while the reclaim runs. Don't be surprised if you see 25,000,000 KBps. This is normal, however, and it's not truly using that much. The physical host is just writing the balloon file to the datastore as fast as it can, then deleting it immediately.
That's pretty much it!!! I hope this helps someone.