Scenario

I recently had a situation whereby an NSX-v DLR Control VM had been unintentionally removed from vSphere Inventory, this caused a lot more grief than you would expect.

This DLR CVM was performing L2 bridging for some production workloads that were being migrated off VLAN-backed segments into logical switches. As a result, it was critical that the CVM be powered on and connectivity be restored as soon as possible. This DLR CVM was deliberately deployed as non-HA due to some issues previously faced with L2 bridging/HA split brain/STP loops/shut ports, another post on that some time.

I attempted to redeploy the DLR CVM after being informed of the outage, only to find out that this would not be straight forward. NSX was referencing a non-existent/stale Edge VM that was completely gone and unable to be restored.

This was a NSX 6.4.3 environment. These steps are applicable to both Edge Services Gateway VMs as well as DLR Control VMs.

The steps that created this scenario were as follows:

DLR CVM was powered off and removed from inventory unintentionally.
DLR CVM was found on the relevant datastore, reregistered on another host, powered back on, connectivity restored.
Later that day, the DLR CVM was shutdown and destroyed with no warning whatsoever.
- This was observed by a colleague after the fact while analysing vCenter/NSX Manager logs. DO NOT attempt to reregister the Edge VM if it has been removed from inventory.
I then attempted to redeploy the CVM via the Networking & Security interface in the vSphere Web Client, this did not work.
- I also tried using the NSX Manager REST API to delete the CVM so that I could deploy a new one, this did not work.
- Following the above failed attempt, I tried to redeploy using the NSX-v Manager REST API, this did not work.
Tried adding new appliance VM so that it would deploy and then remove the reference to the old/non-existent one, this did not work.
- Although this didn’t error, it did display odd behaviour whereby the new Edge VM showed up as “Undeployed” but disappeared immediately after a UI refresh.

At this point I raised a severity 1 ticket with VMware support who were able to help me resolve this issue, and I hope the following details are useful to anyone else that faces this same issue.

Relevant Information

Things to note:

My DLR CVM had an ID of edge-4

edge-id

My DLR CVM was referencing a VM with an ID of vm-421 (see below)

Log Snippets

These log snippets have been taken from my lab where I reproduced the same issue.

2018-11-25T06:55:35.546Z a-nsxmgr NSXV 6346 – [nsxv@6876 comp=”nsx-manager” subcomp=”manager”] Failed to find vm with ID = ‘vm-421‘ in the inventory.

2018-11-25T06:55:35.549Z a-nsxmgr NSXV 6346 – [nsxv@6876 comp=”nsx-manager” subcomp=”manager”] populateSystemEvent parameters : sourceName edge-4, morefIdOfObjectOnVc null, moduleName NSX Edge Gateway, eventCode EDGE_VM_NOT_FOUND_IN_VC_INVENTORY, severity Critical, messageParams [vm-421] eventMetaData {edgeVmVcUUId=5017c675-8084-8ba2-4e25-0ad0d3baa310, missing vmId=vm-421}

2018-11-25T06:55:35.552Z a-nsxmgr NSXV 6346 – [nsxv@6876 comp=”nsx-manager” subcomp=”manager”] [SystemEvent] Time:’Sun Nov 25 06:55:35.549 GMT 2018′, Severity:’High‘, Event Source:’edge-4‘, Code:’30032‘, Event Message:’NSX Edge appliance with vmId : vm-421 not found in the vCenter inventory.‘, Module:’NSX Edge Gateway‘, Universal Object:’false’

vShield Edge:10165:Invalid Configuration. To deploy NSX Edge appliances, valid appliance configuration or valid deployment container configuration is required.:The requested object : DeploymentContainer [globalcontainer] could not be found. Object identifiers are case sensitive.

2018-11-25T07:03:13.938Z a-nsxmgr NSXV 6346 – [nsxv@6876 comp=”nsx-manager” subcomp=”manager”] populateSystemEvent parameters : sourceName edge-4, morefIdOfObjectOnVc datacenter-2, moduleName NSX Edge Health Check, eventCode EDGE_GATEWAY_HEALTHCHECK_NO_PULSE, severity Critical, messageParams null eventMetaData {edgeId=edge-4}

2018-11-25T07:03:13.941Z a-nsxmgr NSXV 6346 – [nsxv@6876 comp=”nsx-manager” subcomp=”manager”] [SystemEvent] Time:’Sun Nov 25 07:03:13.938 GMT 2018′, Severity:’Critical‘, Event Source:’edge-4‘, Code:’30034‘, Event Message:’None of the NSX Edge VMs found in serving state. There is a possibility of network disruption.‘, Module:’NSX Edge Health Check‘, Universal Object:’false’

Caused by: ObjectNotFoundException: core-services:202:The requested object : DeploymentContainer [globalcontainer] could not be found. Object identifiers are case sensitive.

Verify the broken state

Open a console to the relevant NSX-v Manager (preferably via SSH) and go into engineering mode with these steps.

Open the postgres database console:

/opt/vmware/vpostgres/current/bin/psql -U secureall

Enable extended output mode (much easier to read):

\x

Run the following command (substituting your relevant edge-id):

SELECT * FROM edge_vm_info WHERE edge_id_fk = '<edge-id>';

Take note of the vm_mo_id0 value as this is the moref that NSX Manager is using to reference the appropriate DLR CVM in the vSphere Inventory.

At this point you should make certain that the VM in question does not exist in the vSphere inventory. There are a couple of ways of doing this:

Use the vCenter Managed Object Browser
1. https://<vcenter>/mob/?moid=<vm-id>
  1. This will return a 404 if the VM cannot be found
Use PowerCLI to find the VM based on ID
1. Get-VM -ID VirtualMachine-<vm-id>
  1. This should return no VMs if the VM cannot be found

At this point you have confirmed that we do in fact need to make some modifications to the NSX-v Manager postgres database so that it no longer references the stale Edge VM.

Resolution

In order to resolve this, the following steps need to be followed:

Ensure you have a current NSX Manager backup.
Using the same steps above, ensure you have a postgres console open on the NSX-v Manager.

Run the following postgres query, take note of the values returned under id_to_be_deleted:

SELECT t1.id AS "id_to_be_deleted", t1.datastore_moid, t1.host_moid, t1.resource_pool_moid, t1.vm_hostname, t1.vm_name, t2.edge_id_fk
FROM edge_vm t1
INNER JOIN edge_service_config t2
ON t1.edge_config_id_fk = t2.id
WHERE t2.edge_id_fk = '<edge-id>';

ids_to_be_deleted

Run the following select statement(s) and save the output in the event that they need to be referenced after we delete the relevant rows:
- ```
SELECT * FROM edge_vm WHERE id = <id_to_be_deleted>;
```
Run the following delete statement(s) to remove the problematic rows so that we can redeploy the edge VM:
- ```
DELETE FROM edge_vm WHERE id = <id_to_be_deleted>;
```

delete_statements

You will now observe that there are no configured Edge VMs for the relevant Edge, you can now redeploy a new Edge VM as you would do normally.

create_new_cvm

One last step is required to get the postgres database records updated correctly. The edge_vm table is correct, but there will be stale entries in the edge_vm_info table (referencing stale VC UUID and VC moref values). In order to update this data we need to perform a redeploy operation (could have updated the postgres database manually, but this is a much safer method).

correct_uuid_moref

You should now have restored your Edge VM.

Thoughts

Although it is unlikely that an Edge VM would be deleted from disk or removed from inventory in a production/stable environment, these steps will minimise your outage window in the event that it occurs in your environment.

Other options to resolve this were all evaluated for feasibility/effort/reliability but were deemed not the preferred action plan:

Pulling Edge VM config programatically via API, deleting existing Edge, redeploying new Edge via API
- This may have been problematic due to trying to delete Edge while it was still referencing a non-existent Edge VM. This method could have left the environment in a worse state, stuck trying to delete the Edge.

NSX-v Edge VM has been deleted from disk/removed from inventory – How to resolve