Photon Controller – Deployment and Troubleshooting Tips

I’ve spent the past week or so getting familiar with Photon Controller (v0.8) and deploying various frameworks such as Kubernetes, Mesos and Docker Swarm. As I did this setup a number of times, I learned quite a bit about the potential gotchas and common pitfalls that a newbie (like me) could run into when getting up to speed with Photon Controller. In this post I highlight some of the more important considerations to watch out for when getting started with Photon Controller.

A quick note on the log outputs below. These logs were mainly captured by logging onto the Photon Controller installer appliance (using the default credentials of esxcloud/vmware), and then running “docker logs -f <container id>” against the “deployer” container to follow the log output from that container. This shows the various tasks that are run to deploy Photon Controller. To list the containers, use the “docker ps” command.
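
For reference, the sequence looks something like this (the container id is just a placeholder; use whichever id “docker ps” reports for the deployer container on your appliance):

docker ps
docker logs -f <deployer-container-id>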

In other places, which I shall highlight in the post, the logs were captured using the same methodology on the Photon Controller itself, and also by logging into the individual framework containers to look at the logs directly. I will call out in each case where the logs can be captured.

*** Please note that at the time of writing, Photon Controller is still not GA ***

*** The steps highlighted here may change in the GA version of the product ***

Networking Considerations – initial deployment

There are three distinct components to consider when getting Photon Controller deployed. The first is the ESXi host(s) that will be used for management or cloud by Photon Controller; the second is the Photon Controller installer appliance; and the third is the Photon Controller itself. When deploying the installer appliance, it must be able to communicate with both the ESXi host(s) and the Photon Controller deployment. Therefore the installer must be deployed on a VM network that can reach the ESXi host(s) as well as the VM network of the Photon Controller. If the installer cannot reach the ESXi host(s), the following error appears on the installer:

INFO  [2016-05-12 10:24:58,742] com.vmware.photon.controller.common.xenon.scheduler.\
TaskSchedulerService: [/photon/task-schedulers/vib-uploads] \
Host[5cfcbe7b-36f4-4f96-ac91-a23607a4543e]: TaskSchedulerService moving service \
/photon/vib-uploads/e8e3bec4-6e9b-43cd-a9d2-16cb96029e8c from CREATED to STARTED
ERROR [2016-05-12 10:24:58,761] com.vmware.photon.controller.deployer.dcp.task.\
UploadVibTaskService: [/photon/vib-uploads/e8e3bec4-6e9b-43cd-a9d2-16cb96029e8c] \
java.net.SocketException: Network is unreachable

In my testing, the installer was deployed on one VM network VLAN, and I attempted to deploy the Photon Controller on a different VM network VLAN that was not reachable from the installer. The deployment never completed; it appears to run indefinitely, displaying the following message every 2 minutes in the installer’s “deployer” logs:

com.vmware.photon.controller.deployer.dcp.task.WaitForDockerTaskService: \
[/photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-be78-f131b6de7dfd] Handling patch \
operation for service /photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-be78-f131b6de7dfd
INFO  [2016-05-12 11:47:45,583] com.vmware.photon.controller.deployer.dcp.task.\
WaitForDockerTaskService: [/photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-be78-\
f131b6de7dfd] Setting iteration count to 62
INFO  [2016-05-12 11:47:45,583] com.vmware.photon.controller.deployer.dcp.task.\
WaitForDockerTaskService: [/photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-\
be78-f131b6de7dfd] Setting successful iteration count to 0
INFO  [2016-05-12 11:47:45,584] com.vmware.photon.controller.deployer.dcp.task.\
WaitForDockerTaskService: [/photon/wait-for-docker-tasks/aa6f2dbc-87d0-49e1-be78-\
f131b6de7dfd] Performing poll of VM endpoint (iterations: 62)

Lesson: Ensure that the ESXi host(s), the Photon Controller installer and the Photon Controller can all communicate with one another.
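
A quick sanity check, assuming ICMP is allowed on the networks involved, is to open a shell on the installer appliance and confirm that both the ESXi host(s) and the address planned for the Photon Controller are reachable (the addresses below are placeholders):

ping -c 3 <esxi-host-ip>
ping -c 3 <photon-controller-ip>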

Storage Capacity Considerations

During one roll-out, the Photon Controller deployment failed with the following error in the installer logs:

ERROR [2016-05-12 08:59:05,861] com.vmware.photon.controller.deployer.dcp.workflow.\
CreateManagementVmWorkflowService: [/photon/workflow/create-mgmt-vm/1c2c8838-3550-4265-\
adce-5aeea83a701a] java.lang.RuntimeException: [Task "START_VM": step "START_VM" \
failed with error code "InternalError", message "Please contact the system \
administrator about request #124e1b27-844c-4503-8a13-3ce07343c37f"]

I looked on the ESXi host client, and in the ‘Tasks’ view, I found the following error:

Task: Power On VM
Key: haTask-51-vim.VirtualMachine.powerOn-3127178
Description: Power On this virtual machine
Virtual machine: b470a706-9a9c-4170-b9e6-122c2b3a0e70 
State: Failed  - Failed to extend swap file from 0 KB to 25161728 KB.
Errors:
•Failed to extend swap file from 0 KB to 25161728 KB.
•Current swap file size is 0 KB.
•Could not power on virtual machine: No space left on device.
•Failed to power on VM.
•Failed to start the virtual machine.

On closer inspection of the datastore, I found that it had only 14GB of free space. This meant that the swap file for the Photon Controller (25GB) could not be created. Once I cleaned up the datastore, the next deployment succeeded.

Lesson: Ensure there is enough free space on the datastore to deploy and power-on the Photon Controller.
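
One way to check this up front, assuming you have SSH access to the ESXi host, is to look at the free space on the datastore from the ESXi shell (the datastore name below is a placeholder):

df -h /vmfs/volumes/<datastore-name>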

Network considerations – frameworks (1)

Once the Photon Controller is deployed, the next step is to deploy a framework. In the examples that you will find on the Photon Controller wiki on GitHub, we provide both the sample images and the steps to help you deploy Kubernetes, Mesos and Docker Swarm, which are common container orchestration, clustering and scaling solutions. The first thing to be aware of when deploying one of these frameworks is that you will require a combination of static IP addresses and DHCP. If you do not have DHCP, the deployment of parts of the framework will fail with a “VM failed to acquire an IP address” error, as follows:

2016/05/13 10:45:37 photon: Task 'e44b1cbb-7233-40e7-9c9d-df3f01b7fa3f' is in error \
state: {@step=={"sequence"=>"1","state"=>"ERROR","errors"=>[photon: { HTTP status: '0',\
 code: 'InternalError', message: 'Failed to rollout KubernetesEtcd. \
Error: MultiException[java.lang.IllegalStateException: VmProvisionTaskService failed \
with error VM failed to acquire an IP address. /photon/clustermanager/vm-provision-tasks\
/411ea466-c817-45d8-ae06-a87771298389]', data: 'map[]' }],"warnings"=>[],\
"operation"=>"CREATE_KUBERNETES_CLUSTER_SETUP_ETCD","startedTime"=>"1463132002000",\
"queuedTime"=>"1463132002000","endTime"=>"1463132737000","options"=>map[]}}

Note that you will need to log in to the Photon Controller, and not the installer, to find these log messages. The procedure is the same as before: use “docker logs -f <container id>” to examine the logs in real time, but for the deployment of the frameworks you need to examine the logs on the controller, not the installer. Once the Photon Controller has been deployed, the installer plays no further role.

Lesson: Ensure that the network can provide both static IP addresses and addresses allocated via DHCP.

Network considerations – frameworks (2)

The ESXi host(s) consumed by Photon Controller can have multiple VM networks. When it comes to deploying a framework, you need a way of selecting the correct network for the framework containers. I described the procedure to do this in a previous post here. The important part is to use the ID of the network. If you use the name of the network, you will get a “network <name> not found” error from the cluster/framework create command, as shown here:

API Errors: [photon: { HTTP status: '0', code: 'InternalError', message: 'Failed to \
rollout KubernetesEtcd. Error: MultiException[java.lang.IllegalStateException: \
VmProvisionTaskService failed with error [Task "CREATE_VM": step "RESERVE_RESOURCE" \
failed with error code "NetworkNotFound", message "Network con-nw not found"]. \
/photon/clustermanager/vm-provision-tasks/9bd0619e-5ecf-415f-9136-0ac01d192be5]', \
data: 'map[]' }]

Lesson: When there are multiple VM networks to choose from, ensure you create and correctly select the one you wish to use when deploying a framework.
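
As a rough sketch of the workflow (the exact output format may differ between builds), list the networks first to retrieve the ID, and then pass that ID, rather than the name, to the -w option of the cluster create command:

photon network list
photon cluster create ... -w <network-id>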

Network considerations – frameworks (3)

When trying to deploy the Mesos framework, I could see that the etcd and master containers were successfully created, but the marathon container (for orchestration) was never deployed. On examining the deployment logs on the controller, I could see the following errors appearing repeatedly:

! java.net.ConnectException: Connection refused
! at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_51-BLFS]
! at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) \
~[na:1.8.0_51-BLFS]

Eventually the deployment of the Mesos framework failed with a timeout:

API Errors: [photon: { HTTP status: '0', code: 'InternalError', message: \
'Failed to rollout MesosMaster. Error: MultiException[java.lang.IllegalStateException: \
ClusterWaitTaskService failed with error Wait period expired. /photon/clustermanager/\
wait-for-cluster-tasks/463f3f62-1af1-4e06-a62a-c2e1173d798e, \
java.lang.IllegalStateException: ClusterWaitTaskService failed with error Wait \
period expired. /photon/clustermanager/wait-for-cluster-tasks/c39bcf03-9df5-4987-a06a-\
81438b019d7d, java.lang.IllegalStateException: ClusterWaitTaskService failed with \
error Wait period expired. /photon/clustermanager/wait-for-cluster-tasks/37c59e19-f690\
-4981-ae4d-8493dc625e50]', data: 'map[]' }]

Thanks to some guidance from my colleague Michael West, the root cause of this issue was discovered by looking through the logs on the master. We SSH’ed to the master (credentials in this case are root/vmware) and ran the journalctl command. This displayed the startup logs, and this is what they showed:

May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 docker[256]: \
time="2016-05-13T12:19:58.055197176Z" level=info msg="POST /v1.20/containers/create?\
name=photon-mesos-master"
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 docker[256]: \
time="2016-05-13T12:19:58.064556198Z" level=error msg="Handler for POST /containers/\
create returned error: No such image: mesosphere/mesos-master:0.26.0-0.2.145.ubuntu1404\
 (tag: 0.26.0-0.2.145.ubuntu1404)"
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 docker[256]: \
time="2016-05-13T12:19:58.064593792Z" level=error msg="HTTP Error" \
err="No such image: mesosphere/mesos-master:0.26.0-0.2.145.ubuntu1404 \
(tag: 0.26.0-0.2.145.ubuntu1404)" statusCode=404
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 cloud-init[297]: \
Unable to find image 'mesosphere/mesos-master:0.26.0-0.2.145.ubuntu1404' locally
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 docker[256]: \
time="2016-05-13T12:19:58.064933334Z" level=info msg="POST /v1.20/images/create\
?fromImage=mesosphere%2Fmesos-master&tag=0.26.0-0.2.145.ubuntu1404"
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 cloud-init[297]: \
Pulling repository docker.io/mesosphere/mesos-master
May 13 12:19:58 master-ec124282-0bdf-4cbc-aeb3-9551921cf8f3 cloud-init[297]: \
Error while pulling image: Get https://index.docker.io/v1/repositories/mesosphere/\
mesos-master/images: dial tcp 52.73.94.64:443: network is unreachable

So it would seem that the Mesos master image that we built needs to retrieve some images externally from the docker.io repository. I was not aware of this requirement, and had placed the framework on an internal network with no external route. This is why the Mesos framework deployment was timing out.
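
A simple way to confirm this from the master node (SSH as root/vmware, as above, assuming the docker client is available on the node, which the logs suggest) is to try pulling the image referenced in the logs by hand; with no external route, the pull fails with the same “network is unreachable” error:

docker pull mesosphere/mesos-master:0.26.0-0.2.145.ubuntu1404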

In fact, there is a similar issue with the Kubernetes “tomcat” example. The tomcat server also needs to pull some external images. If you create the pods as per our example, the tomcat server will report an error while pulling images, as follows:

C:\Users\chogan>kubectl.exe -s 172.30.0.100:8080 create -f \
Downloads\photon-Controller-Tomcat-rc.yml
replicationcontroller "tomcat-server" created
C:\Users\chogan>kubectl.exe -s 172.30.0.100:8080 create -f \
Downloads\photon-Controller-Tomcat-service.yml
service "tomcat" created
C:\Users\chogan>kubectl.exe -s 172.30.0.100:8080 get pods
NAME                                                     READY     STATUS    RESTARTS   AGE
k8s-master-master-342412bb-a963-4855-bc94-77dc7390fa53   3/3       Running   0          5m
tomcat-server-za0e4                                      0/1       Error while \
pulling image: Get https://index.docker.io/v1/repositories/library/tomcat/images: \
dial tcp 52.72.231.247:443: network is unreachable   0          16s
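
If the status gets truncated in the “get pods” output, the full image pull error and related events can also be retrieved by describing the pod; for example (pod name taken from the output above):

C:\Users\chogan>kubectl.exe -s 172.30.0.100:8080 describe pod tomcat-server-za0e4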

Lesson: To deploy frameworks such as Mesos using our sample images, or to use our tomcat pod example on the Kubernetes framework, ensure that the container network has a route to the outside world to pull down required images.

Hope this is useful to those of you looking to get started with Photon Controller.

11 Replies to “Photon Controller – Deployment and Troubleshooting Tips”

  1. I’m having trouble just adding additional management and cloud hosts.

    com.vmware.photon.controller.api.common.filters.LoggingFilter: Response: /deployments//hosts [404] in

    This appears when I try to add an additional cloud host. Would you have any idea why?

    1. I’m afraid not. I get the same thing when I try to add a host after the initial deployment. Not sure if it is something related to having CLOUD and MGMT on the same host initially. Let me ask some experts about it.

    2. I just added some feedback on the photon-controller slack channel. It seems that interactive mode has issues, but if you put everything on a single command line (other than metadata), it progresses further.

  2. I’ve been trying to deploy Photon Controller on a fresh ESXi host. I’ve followed William Lam’s guide, and successfully reached the point to set up a Kubernetes cluster.

    However, I’m getting the “VM failed to acquire an IP address” error you have described at the 2/4 stage of creating the cluster. I have checked, and there is a DHCP service available.

    I’ve also tried deploying Mesos and Swarm clusters, but get the same error.

    I’ve also tried setting all IP addresses statically when running the “photon cluster create” command. In all cases, I get the same error.

    Could you please let me know how I can troubleshoot this further?

    Many thanks

    1. I’m guessing you would need to use some sort of network monitoring tool (tcpdump) to verify that the DHCP requests are being generated, and whether or not your DHCP server is responding.
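
      For example, something along these lines, run on one of the framework nodes (the interface name here is just an example; adjust for whatever the node actually uses), would show whether DHCP requests are leaving the node and whether replies come back:

      tcpdump -i eth0 -n 'port 67 or port 68'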

      1. Many thanks Cormac – would you recommend using the tool within the Photon Controller VM?

        1. … I created a new Linux Ubuntu VM alongside the Photon controller VM (same portgroup / subnet 192.168.0.0/24) on the ESXi host. The Linux VM booted and retrieved its IP address from the DHCP server, so from that I could prove that the DHCP server is responding.

  3. Hi Cormac,

    I ensured that there is only one VM portgroup on the host, and then used the “cluster create” command in the article, with the -w option set to the photon network ID. I put everything into the 192.168.0.0/24 subnet (including master-ip and etcd1), and left the value blank when prompted for etcd2.

    This time it got to 3/4, but failed with a timeout rather than with “VM failed to acquire an IP address”.

    Could you advise how I should troubleshoot this? You’ve probably already written this, and I’ve probably missed it… but which log files should I be looking at, and how do I access them, please?

    Apologies for the tech-support request!

    Many thanks

    1. So I had a similar issue in my lab, and it proved to be something on a particular VLAN. I switched to another VLAN, and it worked fine, every time. But I do not know why the problem was occurring, since DNS, DHCP and NTP were all working for VMs on the original VLAN.

      This is most likely to do with discovery, which is the responsibility of the etcd node. Log onto the etcd node, and review logs such as “journalctl” output to see if you can spot anything. You may need to open a bash shell to the container and try running “etcdctl” commands to see if you can figure out why there is a problem with discovery. This is all new to me as well, so I’m learning as I go along.
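
      As a starting point (the container id below is a placeholder; use whatever “docker ps” shows for the etcd container on that node), something like this should report whether the etcd cluster actually formed:

      docker exec -it <etcd-container-id> etcdctl cluster-health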

      Like I said, my nodes were getting IP addresses, but the Swarm cluster simply wouldn’t form on that network. Never got to the root cause, so would dearly like to hear if you figure this out.
