A quick note on the log output below. These logs were mainly captured by logging onto the Photon Controller installer appliance (using the default credentials of esxcloud/vmware) and then running the "docker logs -f <container id>" command against the "deployer" container id to follow the log output from that container. This shows the various tasks that are carried out to deploy Photon Controller. To list the containers, use the "docker ps" command.
In other places, the logs were captured using the same methodology on the Photon Controller itself, and also by logging into individual framework containers to look at the logs directly. I will highlight in each case where the logs were captured.
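For reference, the sequence of commands looks something like this (the container id shown is just an illustration; use whatever id "docker ps" reports for the deployer container in your environment):

# list the running containers and note the id of the "deployer" container
docker ps
# follow that container's log output (the id below is illustrative)
docker logs -f 3f2a1b9c8d7e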
*** Please note that at the time of writing, Photon Controller is still not GA ***
*** The steps highlighted here may change in the GA version of the product ***
Networking Considerations – initial deployment
There are three distinct components that need to be considered when deploying Photon Controller. The first is the ESXi host(s) that will be used for management or cloud by Photon Controller; the second is the Photon Controller installer appliance; and the third is the Photon Controller itself. The installer appliance must be able to communicate with both the ESXi host(s) and the Photon Controller deployment, so it must be deployed on a VM network that can reach the ESXi host(s) as well as the VM network of the Photon Controller. If the installer cannot reach the ESXi host(s), the following error appears in the installer logs:
INFO [2016-05-12 10:24:58,742] com.vmware.photon.controller.common.xenon.scheduler.\
TaskSchedulerService: [/photon/task-schedulers/vib-uploads] \
Host[5cfcbe7b-36f4-4f96-ac91-a23607a4543e]: TaskSchedulerService moving service \
/photon/vib-uploads/e8e3bec4-6e9b-43cd-a9d2-16cb96029e8c from CREATED to STARTED
ERROR [2016-05-12 10:24:58,761] com.vmware.photon.controller.deployer.dcp.task.\
UploadVibTaskService: [/photon/vib-uploads/e8e3bec4-6e9b-43cd-a9d2-16cb96029e8c] \
java.net.SocketException: Network is unreachable
In my testing, the installer was deployed on one VM Network VLAN, and I attempted to deploy the Photon Controller on a different VM Network VLAN that was not reachable from the installer. The deployment never completed, but instead appeared to run indefinitely, displaying the following message every 2 minutes in the installer's "deployer" logs:
com.vmware.photon.controller.deployer.dcp.task.WaitForDockerTaskService: \
[/photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-be78-f131b6de7dfd] Handling patch \
operation for service /photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-be78-f131b6de7dfd
INFO [2016-05-12 11:47:45,583] com.vmware.photon.controller.deployer.dcp.task.\
WaitForDockerTaskService: [/photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-be78-\
f131b6de7dfd] Setting iteration count to 62
INFO [2016-05-12 11:47:45,583] com.vmware.photon.controller.deployer.dcp.task.\
WaitForDockerTaskService: [/photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-\
be78-f131b6de7dfd] Setting successful iteration count to 0
INFO [2016-05-12 11:47:45,584] com.vmware.photon.controller.deployer.dcp.task.\
WaitForDockerTaskService: [/photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-be78-\
f131b6de7dfd] Performing poll of VM endpoint (iterations: 62)
Lesson: Ensure that the ESXi host(s), the Photon Controller installer and the Photon Controller can all communicate with one another.
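Before kicking off a deployment, a quick sanity check from the installer appliance can save a lot of waiting. Something along these lines (the hostname and addresses are purely illustrative) confirms basic reachability:

# from the installer appliance, verify the ESXi host(s) are reachable
ping -c 3 esxi-host-01.example.com
# and verify the VM network that the Photon Controller will be deployed on,
# e.g. by pinging the gateway of that network
ping -c 3 192.168.100.1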
Storage Capacity Considerations
During one roll-out, the Photon Controller deployment failed with the following error in the installer logs:
ERROR [2016-05-12 08:59:05,861] com.vmware.photon.controller.deployer.dcp.workflow.\
CreateManagementVmWorkflowService: [/photon/workflow/create-mgmt-vm/1c2c8838-3550-4265-\
adce-5aeea83a701a] java.lang.RuntimeException: [Task "START_VM": step "START_VM" \
failed with error code "InternalError", message "Please contact the system \
administrator about request #124e1b27-844c-4503-8a13-3ce07343c37f"]
In the ESXi host client, under the 'Tasks' view, I found the following error:
Task: Power On VM
Key: haTask-51-vim.VirtualMachine.powerOn-3127178
Description: Power On this virtual machine
Virtual machine: b470a706-9a9c-4170-b9e6-122c2b3a0e70
State: Failed - Failed to extend swap file from 0 KB to 25161728 KB.
Errors:
•Failed to extend swap file from 0 KB to 25161728 KB.
•Current swap file size is 0 KB.
•Could not power on virtual machine: No space left on device.
•Failed to power on VM.
•Failed to start the virtual machine.
On closer inspection of the datastore, I found that it had only 14GB of free space. This meant that the swap file for the Photon Controller (25GB) could not be created. Once I cleaned up the datastore, the next deployment succeeded.
Lesson: Ensure there is enough free space on the datastore to deploy and power-on the Photon Controller.
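It is also worth checking the free capacity of the target datastore before deploying. For example, on the ESXi host:

# list the datastores along with their size and free space
esxcli storage filesystem list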
Network considerations – frameworks (1)
Once the Photon Controller is deployed, the next step is to deploy a framework. On the Photon Controller wiki on GitHub, we provide both the sample images and the steps to help you deploy Kubernetes, Mesos and Docker Swarm, which are common container orchestration, clustering and scaling solutions. The first thing to be aware of when deploying one of these frameworks is that you will require a combination of static IP addresses and DHCP. If DHCP is not available, the deployment of parts of the framework will fail with a "VM failed to acquire an IP address" error, as follows:
2016/05/13 10:45:37 photon: Task 'e44b1cbb-7233-40e7-9c9d-df3f01b7fa3f' is in error \
state: {@step=={"sequence"=>"1","state"=>"ERROR","errors"=>[photon: { HTTP status: '0',\
code: 'InternalError', message: 'Failed to rollout KubernetesEtcd. \
Error: MultiException[java.lang.IllegalStateException: VmProvisionTaskService failed \
with error VM failed to acquire an IP address. /photon/clustermanager/vm-provision-tasks\
/411ea466-c817-45d8-ae06-a87771298389]', data: 'map[]' }],"warnings"=>[],\
"operation"=>"CREATE_KUBERNETES_CLUSTER_SETUP_ETCD","startedTime"=>"1463132002000",\
"queuedTime"=>"1463132002000","endTime"=>"1463132737000","options"=>map[]}}
Note that you will need to log in to the Photon Controller, not the installer, to find these log messages. The procedure is the same as before: use "docker logs -f <container id>" to examine the logs in real time. However, for the deployment of the frameworks, you need to examine the logs on the controller rather than the installer; once the Photon Controller has been deployed, the installer plays no further role.
Lesson: Ensure that the network supports both static IP addresses and addresses allocated via DHCP.
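If you are not sure whether DHCP is actually available on the VM network used by the framework, one rough check, assuming you have a Linux test VM attached to that portgroup, is to request a lease manually:

# on a test VM attached to the framework's VM network, request a DHCP lease
# (the interface name eth0 is an assumption; adjust for your VM)
dhclient -v eth0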
Network considerations – frameworks (2)
The ESXi host(s) that are consumed by Photon Controller can have multiple VM networks. When it comes to deploying a framework, you need a way of selecting the correct network for the framework containers. I described the procedure to do this in a previous post here. The important part is to use the ID of the network. If you use the name of the network, you will get a "network <name> not found" error from the cluster/framework create command, as shown here:
API Errors: [photon: { HTTP status: '0', code: 'InternalError', message: 'Failed to \
rollout KubernetesEtcd. Error: MultiException[java.lang.IllegalStateException: \
VmProvisionTaskService failed with error [Task "CREATE_VM": step "RESERVE_RESOURCE" \
failed with error code "NetworkNotFound", message "Network con-nw not found"]. \
/photon/clustermanager/vm-provision-tasks/9bd0619e-5ecf-415f-9136-0ac01d192be5]', \
data: 'map[]' }]
Lesson: When there are multiple VM networks to choose from, ensure you create and correctly select the one you wish to use when deploying a framework.
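As a rough illustration (the syntax here is from the pre-GA CLI and may differ in your build, so check the CLI help), the network ID can be listed with the photon CLI, and it is that ID, not the name, that should be passed to the cluster create command:

# list the networks known to Photon Controller and note the ID column
photon network list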
Network considerations – frameworks (3)
While the Mesos framework was deploying, the logs repeatedly showed connection-refused exceptions such as the following:
! java.net.ConnectException: Connection refused
! at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_51-BLFS]
! at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) \
~[na:1.8.0_51-BLFS]
Eventually the deployment of the Mesos framework failed with a timeout:
API Errors: [photon: { HTTP status: '0', code: 'InternalError', message: \
'Failed to rollout MesosMaster. Error: MultiException[java.lang.IllegalStateException: \
ClusterWaitTaskService failed with error Wait period expired. /photon/clustermanager/\
wait-for-cluster-tasks/463f3f62-1af1-4e06-a62a-c2e1173d798e, \
java.lang.IllegalStateException: ClusterWaitTaskService failed with error Wait \
period expired. /photon/clustermanager/wait-for-cluster-tasks/c39bcf03-9df5-4987-a06a-\
81438b019d7d, java.lang.IllegalStateException: ClusterWaitTaskService failed with \
error Wait period expired. /photon/clustermanager/wait-for-cluster-tasks/37c59e19-f690\
-4981-ae4d-8493dc625e50]', data: 'map[]' }]
Thanks to some guidance from my colleague Michael West, the root cause of this issue was discovered by looking through the logs on the master. We SSH'ed to the master (the credentials in this case are root/vmware) and ran the journalctl command. This displayed the startup logs, and this is what they contained:
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 docker[256]: \
time="2016-05-13T12:19:58.055197176Z" level=info msg="POST /v1.20/containers/create?\
name=photon-mesos-master"
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 docker[256]: \
time="2016-05-13T12:19:58.064556198Z" level=error msg="Handler for POST /containers/\
create returned error: No such image: mesosphere/mesos-master:0.26.0-0.2.145.ubuntu1404\
(tag: 0.26.0-0.2.145.ubuntu1404)"
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 docker[256]: \
time="2016-05-13T12:19:58.064593792Z" level=error msg="HTTP Error" \
err="No such image: mesosphere/mesos-master:0.26.0-0.2.145.ubuntu1404 \
(tag: 0.26.0-0.2.145.ubuntu1404)" statusCode=404
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 cloud-init[297]: \
Unable to find image 'mesosphere/mesos-master:0.26.0-0.2.145.ubuntu1404' locally
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 docker[256]: \
time="2016-05-13T12:19:58.064933334Z" level=info msg="POST /v1.20/images/create\
?fromImage=mesosphere%2Fmesos-master&tag=0.26.0-0.2.145.ubuntu1404"
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 cloud-init[297]: \
Pulling repository docker.io/mesosphere/mesos-master
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 cloud-init[297]: \
Error while pulling image: Get https://index.docker.io/v1/repositories/mesosphere/\
mesos-master/images: dial tcp 52.73.94.64:443: network is unreachable
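For reference, the commands used to get at these logs were along these lines (the master IP address shown is purely illustrative):

# SSH to the Mesos master VM deployed by the framework
ssh root@192.168.100.50
# then display the system/startup logs captured by journald
journalctl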
So it would seem that the Mesos master image we built needs to retrieve some images externally from the docker.io repository. I was not aware of this requirement and had placed the framework on an internal network with no external route, which is why the Mesos framework deployment timed out. The same issue can be seen with the Tomcat pod example on the Kubernetes framework, where the pod fails to pull its image:
C:\Users\chogan>kubectl.exe -s 172.30.0.100:8080 create -f \
Downloads\photon-Controller-Tomcat-rc.yml
replicationcontroller "tomcat-server" created
C:\Users\chogan>kubectl.exe -s 172.30.0.100:8080 create -f \
Downloads\photon-Controller-Tomcat-service.yml
service "tomcat" created
C:\Users\chogan>kubectl.exe -s 172.30.0.100:8080 get pods
NAME READY STATUS RESTARTS AGE
k8s-master-master-342412bb-a963-4855-bc94-77dc7390fa53 3/3 Running 0 5m
tomcat-server-za0e4 0/1 Error while \
pulling image: Get https://index.docker.io/v1/repositories/library/tomcat/images: \
dial tcp 52.72.231.247:443: network is unreachable 0 16s
Lesson: To deploy frameworks such as Mesos using our sample images, or to use our Tomcat pod example on the Kubernetes framework, ensure that the container network has a route to the outside world to pull down the required images.
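One simple way to confirm that the container network has the required outbound access is to test a pull directly from one of the framework VMs, using the image name reported in the logs above:

# from a framework VM, check that the Docker registry is reachable
curl -sI https://index.docker.io/v1/
# or attempt the pull that was failing in the logs above
docker pull mesosphere/mesos-master:0.26.0-0.2.145.ubuntu1404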
Hope this is useful to those of you looking to get started with Photon Controller.