Thursday, 26. November 2009 12:43
1 Problem
According to Citrix best practices, you should reboot all your production servers once a week, and some even recommend rebooting them every night. This is not a major issue if you have a downtime window during the weekend. However, terminal server environments tend to attract remote and branch offices, and these offices can be in different time zones or work different hours.
In enterprise environments it’s therefore very common that the whole environment moves to 24/7 operation. As the number of applications grows, it becomes almost impossible to arrange downtime for servers that host multiple applications.
Once you split your servers into multiple silos (groups of servers), the problem becomes even bigger: you can no longer handle your farm servers as one group, and you should carefully plan how you divide your servers into groups. If, for example, you have 10 servers in your farm and divide them into silos A (2 servers) and B (8 servers), you cannot simply reboot 5 servers and then the other half, because silo A might become completely unavailable.
There are two common approaches to rebooting your servers: either each server has a scheduled task to reboot itself (decentralized structure), or you control the reboots centrally (centralized structure). The decentralized structure is easier to implement, but it is very cumbersome when you want to skip one reboot, and you get no feedback about how many servers rebooted successfully.
As mentioned above, it’s important not only to reboot your servers, but also to check whether each server rebooted successfully. Installing one corrupted application or hotfix can make your whole farm unavailable after a scheduled reboot.
An important part of maintenance design is that on terminal servers, the best moment to install new applications (or update existing ones) is immediately after a server reboot, before users log on.
First we will talk about design decisions, and later we will show how they are implemented in the S4Matic job.
2.1 Server groups
Maintenance of XenApp servers is a daunting task, and it’s very important to think about it in advance. The first question we need to discuss is how to split our servers during maintenance, because without a downtime window you don’t want to reboot all your servers at once. Rebooting all servers at once is also not recommended even if you do have a downtime window: a huge number of simultaneous IMA connections can corrupt your data store, and if you applied a wrong patch or an application was not updated correctly, you can break all your servers at once.
It is not recommended to reboot all servers at once.
One of the prerequisites is that only a single version of an application may be available at any time. That means we should use 2 groups of servers for maintenance; let’s call them A and B. Each consists of 5 servers (so we have 10 in total).
1.) A and B are available to all users (with application version 1).
2.) A goes into maintenance and the new application version is installed. B is still available to users (still with version 1).
3.) A goes into production and B goes into maintenance. From this moment, only version 2 is available.
4.) B finishes maintenance. A and B both have version 2 deployed.
A common question is why we can’t split servers into 3 or even more groups. If we did, we could not guarantee that a single version of the application is available. In the scenario above, we would end up with group B (version 1) in maintenance, group A (version 2) in production and group C (version 1) in production.
Dividing servers into more than two groups is possible only if you don’t deploy new application versions and simply want to reboot servers for stability, and even then it’s not recommended. This principle is called 50:50 (fifty-fifty): split your servers into two equal groups.
It is very important that these groups are based on your silos. Consider a scenario with 2 silos in your environment, Silo1 and Silo2. The first silo consists of 2 servers, the second of 8 servers (10 in total). You will need groups A1 and B1 (1 server each) and A2 and B2 (4 servers each). If you implemented only 2 groups for your whole farm, all of Silo1 would become unavailable whenever both of its servers landed in the same group.
Dividing servers into groups for maintenance purposes should be based on silos. You should have 2 groups per silo, not 2 groups for your whole farm.
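The silo-based 50:50 split can be sketched in a few lines of Python. This is only an illustration of the principle, not S4Matic code: the function names and the server names S01 to S10 are made up, and the silo membership is assumed to be known already.

```python
def split_silo(servers):
    """Split one silo's servers into two maintenance groups of near-equal size."""
    ordered = sorted(servers)
    half = (len(ordered) + 1) // 2  # group A gets the extra server when the count is odd
    return ordered[:half], ordered[half:]


def split_farm(silos):
    """Apply the 50:50 split inside every silo, never across the whole farm."""
    return {name: split_silo(members) for name, members in silos.items()}


# Hypothetical farm matching the example: Silo1 has 2 servers, Silo2 has 8.
silos = {
    "Silo1": ["S01", "S02"],
    "Silo2": ["S03", "S04", "S05", "S06", "S07", "S08", "S09", "S10"],
}
groups = split_farm(silos)
# groups["Silo1"] holds one server per group, groups["Silo2"] holds four per group.
```

Because the split happens per silo, Silo1 always keeps one server in production while the other one is in maintenance.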
S4Matic way: servers are split into 2 groups, but optionally (for emergency maintenance) you can decide to use more groups. Servers are automatically divided into silos by an algorithm, so there is no need to manually configure which servers belong to which silo. This detection is based on published applications: S4Matic splits the servers so that every published application is available all the time (provided at least 2 servers are assigned to it).
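The S4Matic algorithm itself is not described here, but the guarantee it aims for can be checked with a small sketch. The Python function below is a hypothetical helper (its name, the application names and the server names are assumptions): it lists every published application that is hosted on at least 2 servers yet would disappear when one of the two groups is taken out of production.

```python
def apps_at_risk(published, group_a, group_b):
    """Return apps hosted on 2+ servers that would vanish when one group is drained."""
    a, b = set(group_a), set(group_b)
    at_risk = []
    for app, servers in sorted(published.items()):
        hosts = set(servers)
        if len(hosts) < 2:
            continue  # a single-server app cannot stay online during maintenance anyway
        if not (hosts & a) or not (hosts & b):
            at_risk.append(app)
    return at_risk


# Hypothetical mapping of published application -> servers hosting it.
published = {
    "Office": ["S01", "S02"],
    "SAP": ["S03", "S04", "S05", "S06"],
}
# A farm-wide split that ignores silos puts both Office servers into group A.
bad = apps_at_risk(published, ["S01", "S02", "S03"], ["S04", "S05", "S06"])
# A silo-aware split keeps one Office server in each group.
good = apps_at_risk(published, ["S01", "S03", "S04"], ["S02", "S05", "S06"])
```

Here `bad` contains "Office" while `good` is empty, which is exactly the difference between a farm-wide and a silo-based split.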
2.2 Downtimes
Implementing maintenance with a downtime window isn’t that complicated and can be achieved pretty easily. If you are not allowed any downtime, the situation becomes more complicated.
The easiest solution is to disable access to a server a few hours before its maintenance starts. In this drain mode, current users stay logged on, but no new users are allowed to log on to servers that are in maintenance.
In a typical scenario, you disable logons 8 hours before the reboot starts to give users enough time to finish their work, and you send them a few messages right before the reboot.
Hide the maintenance window from users by implementing a stage with disabled logons.
There are a few different ways to disable access to servers. The most popular is to simply turn off terminal services logons (Change Logon /Disable), or in some cases even to remove the server from all published applications.
There is, however, also a third way: create a load evaluator that always reports full load. The easiest way is to create a scheduling rule and remove all available times from it. We usually refer to this load evaluator as No New Logons. It has a few advantages. First, this is a farm operation, not a per-server operation, which means the server doesn’t even need to be online for the change to be applied. Second, you can still log on using RDP if you need to troubleshoot something. Third, you can easily see which servers are in maintenance just by looking at your load evaluator assignments (if a server has No New Logons, it is currently being processed by maintenance).
The best way to disable access to a server is to apply a special load evaluator.
With a disabled-logons phase, your maintenance shouldn’t affect any logged-on users.
You should apply the same principle not only when you reboot all servers in your farm, but also when you simply want to reboot a single server. If a malfunctioning server doesn’t affect your business, you shouldn’t force your users to log off.
S4Matic way: the disadvantage of the above scenario is that the server sits unused for those 8 hours even if no users are logged on, or if the users logged off 1 hour after maintenance started. S4Matic dynamically monitors these servers to determine whether the waiting phase is still needed or whether there are no sessions left and the server can be rebooted. S4Matic can therefore reboot some servers after 8 hours (users still logged on), others a few minutes after maintenance started (no sessions), and others after 5 hours when the last user logged off.
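The decision behind this dynamic waiting phase is simple enough to sketch. The Python below is an illustration only, not the S4Matic implementation: the function name is hypothetical, the 8-hour limit comes from the example above, and how the session count is obtained is left out.

```python
from datetime import timedelta

MAX_WAIT = timedelta(hours=8)  # assumed drain limit, taken from the 8-hour example


def ready_to_reboot(active_sessions, elapsed, max_wait=MAX_WAIT):
    """A drained server can reboot once it is empty or the wait limit has passed."""
    return active_sessions == 0 or elapsed >= max_wait


# The three cases from the text:
empty_early = ready_to_reboot(0, timedelta(minutes=5))   # no sessions, reboot now
still_busy = ready_to_reboot(3, timedelta(hours=5))      # users working, keep waiting
timed_out = ready_to_reboot(3, timedelta(hours=8))       # limit reached, reboot
```

Polling this check per server, instead of sleeping for a fixed 8 hours, is what lets idle servers return to production hours earlier.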
S4Matic supports not only full maintenance (all servers), but also per-silo maintenance and even maintenance of a single server (or a group of servers).
2.3 Reboot check
A very common mistake in maintenance implementations is the “shoot and run” principle. Administrators just call an external utility (shutdown.exe or psshutdown.exe) and, at best, judge whether the reboot succeeded from the returned error level.
Many customers are very surprised when they check the success rate of their reboots. Even though the utility returns a success value, the reboot can get stuck in the middle of the process, and the server is either not rebooted at all (the better case) or hangs during the reboot and is not usable anymore (the worse case). Another common approach is to ping the server, but of course a server that was never rebooted still responds.
It is highly recommended to use better ways of detecting whether a server was rebooted. One approach, for example, is to stop the IMA service before the reboot and consider the server rebooted once the service is running again (this, however, requires some additional logic in the startup script).
Do not rely on error levels or ping status. Implement a better way to detect whether a server was rebooted.
S4Matic way: S4Matic implements a few different checks to detect whether a server was rebooted. The most important check is based on server uptime: using this value, S4Matic can detect whether the server was really rebooted after the reboot command was issued. Because S4Matic is a framework, you can easily extend it to check services or even your infrastructure (SCCM/Altiris DB etc.).
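An uptime-based check boils down to comparing the server's last boot time with the moment the reboot command was sent. The following Python sketch shows that comparison only. It assumes the uptime has already been read from the server by some other means and that both clocks are reasonably in sync; the function name is hypothetical and this is not S4Matic's code.

```python
from datetime import datetime, timedelta


def was_rebooted(now, uptime, reboot_issued_at):
    """True when the last boot happened after the reboot command was issued."""
    last_boot = now - uptime
    return last_boot > reboot_issued_at


issued = datetime(2009, 11, 26, 2, 0)    # reboot command sent at 02:00
checked = datetime(2009, 11, 26, 2, 30)  # health check runs at 02:30

fresh = was_rebooted(checked, timedelta(minutes=10), issued)  # booted at 02:20
stale = was_rebooted(checked, timedelta(days=6), issued)      # never went down
```

Unlike a ping or an exit code, this test fails for a server that answered "success" but never actually restarted.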
2.4 Switch
As mentioned in the previous chapter about server groups, splitting your servers into 2 groups is highly recommended. If you follow this best practice, you will realize that every silo has one very important moment called the “switch”: during the switch, the first group of servers is released to production and the second group is “removed” (logons are disabled).
All changes you made to the first group of servers become available in production at a single moment.
In automated environments, the switch is the most important moment of the week for your farm.
This switch event also has a side effect: if your maintenance failed (for example, you installed a new patch that broke your installation), your whole silo is affected by the problem after the switch. The best moment to check the health of your servers is therefore right before the switch.
It is very important to check the condition of your servers before the switch.
You should define some rules for the switch. Waiting for 100% of the servers from the first group is not recommended, simply because one server can always fail. Consider a scenario with a silo (don’t forget, groups should be based on silos) of 100 servers. You create two groups, A (50) and B (50). If one of the servers from group A fails, it doesn’t really affect your environment.
Our recommendation is to wait for 85% of the servers. In our scenario, that means at least 43 servers (85% of 50 is 42.5, rounded up) must finish successfully before the switch can occur.
Expecting 100% of servers to finish successfully is not recommended; you should expect that some servers may fail during reboots.
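The threshold arithmetic can be written down as a short Python sketch. The function names are hypothetical, and the 85% default simply mirrors the recommendation above; integer math is used so the rounding up is exact.

```python
def required_successes(group_size, threshold_percent=85):
    """Smallest number of servers that meets the threshold, rounded up."""
    return (group_size * threshold_percent + 99) // 100


def switch_allowed(succeeded, group_size, threshold_percent=85):
    """The switch may happen once enough servers of the first group are healthy."""
    return succeeded >= required_successes(group_size, threshold_percent)


needed = required_successes(50)  # 43, because 85% of 50 is 42.5
ok = switch_allowed(43, 50)      # True
not_yet = switch_allowed(42, 50) # False
```

With a group of 50 servers, up to 7 failures are tolerated before the switch has to be rejected.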
In a typical implementation, you give the servers some time (30 minutes, for example) to reboot. After this time, you check how many servers finished successfully, and based on that information you either approve or reject the switch. If you don’t want any downtime (as discussed), the time that maintenance can take grows dramatically (8 hours waiting for users + 1 hour sending messages + 30 minutes for the reboot = 9 hours and 30 minutes).
The switch should occur when enough servers from the first group have finished successfully.
Most people think there are 2 switch events for every silo: the first when the first group of servers finishes maintenance, and the second when the second group finishes. This is not exactly correct (or shouldn’t be). The reason we use a switch is to guarantee that only 1 version of an application is available at a time. Once the first group is released to production, version 2 is already there, so any server from the second group can be released as soon as it finishes (because it has the same configuration as the servers from the first group). Consider a scenario with a silo of 100 servers split into two groups, A and B.
1.) Once maintenance starts, we have 50 + x servers in use (50 from group B, plus x servers from group A where existing sessions are still running).
2.) After some time, we have 50 servers available (all users have logged off from the servers in group A, or were logged off after the timeout).
3.) After the switch, we again have up to 50 servers, but now they come from group A (!)
4.) 51, 52, 53…100: whenever any server from group B finishes, it can go to production immediately.
If you have 2 groups, there is only 1 switch moment. Servers from the second group can be released to production as soon as they finish. After the switch, the number of available servers grows steadily (ideally to 100%).
Based on the explanation above, the most time-consuming operation during maintenance is waiting for the switch. Two sets of servers wait: servers from the second group that are waiting to enter maintenance, and servers from the first group that have finished maintenance but are waiting for the rest of the servers in their group.
Why do finished servers from the first group also wait? As we said, you should wait for your users to log off without spamming them with messages. Some servers can be rebooted within minutes (no sessions), but they still need to wait for the rest of the servers (until the required percentage of successfully rebooted servers is reached).
Waiting for the switch is the most time-consuming operation during maintenance. Servers from the second group must wait for the first group to finish, and servers from the first group that have already finished must wait for the rest of their group.
S4Matic way: calculating and validating the switch conditions is probably the most complex algorithm in S4Matic. Because groups are silo based, S4Matic applies the same principle as with sessions and doesn’t wait a hardcoded time before switching: as soon as enough servers are ready, the switch occurs. That means a silo that is not used during the maintenance window can finish its complete maintenance within 1 hour. For silos that are used during the maintenance window, S4Matic automatically adapts, and the switch can occur hours later. All these settings are configurable.
An additional feature of S4Matic is that it evaluates all servers constantly: if it sees that too many servers from group A have failed, it automatically stops maintenance for that silo. There is no need to wait for a timeout; S4Matic reports that, based on its calculations, there is a 0% chance that the silo will finish successfully.
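The "0% chance" conclusion follows from simple counting: once more servers have failed than the threshold tolerates, the required number of successes can no longer be reached, no matter how the remaining servers behave. A minimal Python sketch of that check, with a hypothetical function name and the 85% threshold assumed from the recommendation earlier in this chapter:

```python
def silo_can_still_switch(group_size, failed, threshold_percent=85):
    """False as soon as the first group can no longer reach the switch threshold."""
    required = (group_size * threshold_percent + 99) // 100  # rounded up
    best_case = group_size - failed  # assume every server not yet failed succeeds
    return best_case >= required


# Group A with 50 servers needs 43 successes, so it tolerates at most 7 failures.
still_possible = silo_can_still_switch(50, failed=7)
hopeless = silo_can_still_switch(50, failed=8)
```

Evaluating this after every failure lets maintenance for the silo stop immediately instead of running into the timeout.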
S4Matic is an automation-oriented framework. It is designed to process and automate very complex tasks. One of its (many) supported jobs is called XenApp maintenance. This job is based on our best practices as a consultancy company.
A few highlights of S4Matic itself first:
1.) No clients needed on end points
2.) Easy to implement
3.) Full logging into event log
4.) Able to handle unexpected situations (stuck IMA, hung sessions, failed reboot…)
5.) Designed for extensions
6.) Integration with 3rd party tools supported (Altiris, SCCM…)
7.) Based on open technologies – PowerShell, XML…
8.) Tons of other features…
Our goal is to design a maintenance solution that supports the following:
1.) All XenApp servers should be rebooted based on centralized configuration (controlling servers)
2.) No downtime should be needed
3.) End users should not be affected (forced to log off)
4.) Silos should be automatically detected and maintenance schedule should be changed according to their structure
5.) All servers should be checked to confirm that they rebooted successfully
6.) All unexpected situations should be handled correctly (IMA crashes, hung sessions, reboot failures)
7.) Maintenance should guarantee that only one version of an application is available to all users at any time
8.) After a server reboot, applications should be deployed before the server is released to production (optional)
Below you can find a high-level overview of the maintenance process:
[Figure: High-level overview of the maintenance process]
As you can see, the whole process is divided into a few different phases (and each phase consists of a few different steps, internally called “containers”):
Initialization phase: during this phase, servers are assigned to silos (automatically, based on their published applications), each silo is divided into groups, and the server groups are created. Finally, before maintenance starts, we check whether the servers are online.
Switch: all containers involved in this phase are used to calculate whether a silo can be switched. In S4Matic: Hold Servers we store servers that are waiting for maintenance to start, and in S4Matic: Calculate switch we store servers that finished successfully but are waiting for the rest of the servers from their group. To read more about the switch, refer to chapter Switch.
Preparation phase: the preparation phase makes sure that business users who are logged on to the servers are not affected. To read more about this phase, refer to chapter Downtimes.
Reboot phase: during the reboot phase, we not only reboot servers but also check their health after the reboot. To read more about this phase, refer to chapter Reboot check.
Release to production: during this phase, we release servers back to production. The original load evaluator is reassigned once maintenance is finished.
If you are interested in additional details, just send me an email at m.zugec(at)loginconsultants.com. Our biggest customer has 350 servers maintained by S4Matic, and at this moment we are building a farm with 700 servers in total (however, S4Matic can easily be used in small environments as well).
Martin