automation: shore up rebooting behavior
The old code had a race condition: ec2client.reboot_instances() is
asynchronous, so we could start interacting with the instance before
it had actually rebooted. Use instance.stop()/instance.start() with an
'instance_stopped' waiter to eliminate the race.
While debugging this, I also found another race condition related to
PowerShell permissions after the reboot. Unfortunately, I'm not sure of
the best way to work around it. I've added a comment for now.
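
The stop/wait/start pattern used in the diff below can be sketched in
isolation as follows. This is a minimal sketch assuming a boto3 EC2
client and an EC2.Instance resource; the helper name reboot_instance is
illustrative and not part of this change:

```python
def reboot_instance(ec2client, instance):
    """Reboot an EC2 instance via an explicit stop/start cycle.

    Unlike reboot_instances(), stopping the instance exposes an
    'instance_stopped' waiter we can block on, so we know the machine
    is fully down before starting it again.
    """
    instance.stop()

    # Poll every 5 seconds until the instance reports 'stopped'.
    ec2client.get_waiter('instance_stopped').wait(
        InstanceIds=[instance.id],
        WaiterConfig={'Delay': 5})

    instance.start()
```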
Differential Revision: https://phab.mercurial-scm.org/D6288
--- a/contrib/automation/hgautomation/aws.py Fri Apr 19 06:07:00 2019 -0700
+++ b/contrib/automation/hgautomation/aws.py Fri Apr 19 07:34:55 2019 -0700
@@ -808,10 +808,26 @@
     )
 
     # Reboot so all updates are fully applied.
+    #
+    # We don't use instance.reboot() here because it is asynchronous and
+    # we don't know when exactly the instance has rebooted. It could take
+    # a while to stop and we may start trying to interact with the instance
+    # before it has rebooted.
     print('rebooting instance %s' % instance.id)
-    ec2client.reboot_instances(InstanceIds=[instance.id])
+    instance.stop()
+    ec2client.get_waiter('instance_stopped').wait(
+        InstanceIds=[instance.id],
+        WaiterConfig={
+            'Delay': 5,
+        })
 
-    time.sleep(15)
+    instance.start()
+    wait_for_ip_addresses([instance])
+
+    # There is a race condition here between the User Data PS script running
+    # and us connecting to WinRM. This can manifest as
+    # "AuthorizationManager check failed" failures during run_powershell().
+    # TODO figure out a workaround.
 
     print('waiting for Windows Remote Management to come back...')
     client = wait_for_winrm(instance.public_ip_address, 'Administrator',