The Many Senses of Recovery in SMSyncServer

4/26/16
(4/27/16: Update: the Bitbucket repo now has these changes.)

SMSyncServer is an open-source framework with two primary goals: 1) providing mobile device app users with permanent access to their app data, and 2) synchronizing data across mobile devices. For more background, see this initial blog article. You can find the code for the framework on Bitbucket.

I recently bit the bullet and am restructuring much of the internals of SMSyncServer. What motivated this restructuring? Largely it was mechanisms of recovery, but in addition some of the algorithms I was using internally were just getting out of hand: the code was hard to read and understand. This restructuring is about ¾ done, with perhaps one week left to finish it. When I say recovery, I mean it in the broad sense of handling various error and unwanted conditions: conditions that are a fact of life, but that as a developer you’d rather not have. I want SMSyncServer to recover from the following issues:

1) Network loss/regain
2) App transitioning between background and foreground
I take these two to be the most common recovery cases, and similar in their effects. The operating environment of a mobile device is one of having an unreliable network connection. As opposed to a computer with a fixed wired network connection, a cell phone can lose its connection with a cell tower at any time. At any time, a tablet device can move out of range of a wifi signal. This is normal operation for a mobile device, and I want to be able to handle these situations gracefully. Because this is such an intrinsic characteristic of the mobile environment, recovering from such situations should be a main characteristic of the design of SMSyncServer.

Similarly, the app transitioning to the background and coming back to the foreground is a typical user operation. The user stops using your app (the app making use of SMSyncServer) and begins using another app. Your app goes into the background, and stops having access to CPU and/or network resources. While a little different from losing network connectivity because it also involves loss of CPU access, this is a common occurrence and needs to be front and center in the design of the internal SMSyncServer architecture.

3) User app termination
This is also fairly typical of normal operation in a mobile environment. It is similar to transitioning to the background, but with a complete loss of non-persistent state. If the app developer hasn’t saved variables and other RAM values to files or databases, those values will be lost. SMSyncServer needs to be able to restart very nearly where it left off when the app launches again.

4) App crashes
Hopefully this is a somewhat unusual case, but we all know software has bugs. For whatever reason (e.g., programmer error, software rot), programs will crash. We should do our best to recover from these crashes.

5) Client API Errors
A developer making use of the SMSyncServer will, of course, make errors in development. This is just part and parcel of software development. While SMSyncServer cannot necessarily deal with these situations, it can report them to the calling app, and provide mechanisms for reset and recovery.

6) Server failures, both short-term and long-term
7) Internal Errors
8) Non-recoverable errors
Now we’re really getting into murky territory. These situations fall under the rubric of programmer error within the SMSyncServer or other SMSyncServer problems—e.g., the hosting system on which the sync server is running fails. Or communication between the sync server and the cloud storage system fails. There is not all that much that can be systematically done about these situations. But, the SMSyncServer client should persistently mark the app as being in a failure mode, and before resuming normal operation, require that the calling app take steps to reset from this failure mode.

In general, and where possible, before assuming the worst and throwing an error, the SMSyncServer needs to retry an operation. This makes sense for server API operations used by the client interface. This recently reared its head with OAuth authentication for Google Drive. Credentials can expire with OAuth, and a retry of an operation will sometimes be sufficient to fix the problem.
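The retry-before-failing idea can be sketched in a few lines of Swift. This is only an illustration, not code from SMSyncServer: the names withRetries, RetriesExhausted, and CredentialsExpired are hypothetical, and the closure stands in for a server API call that fails once (say, because credentials expired) and then succeeds.

```swift
/// Thrown when every attempt failed; carries the last underlying error.
/// (Hypothetical type, for illustration only.)
struct RetriesExhausted: Error {
    let lastError: Error
}

/// Runs `operation` up to `maxAttempts` times, returning the first success.
func withRetries<T>(maxAttempts: Int, _ operation: () throws -> T) throws -> T {
    precondition(maxAttempts > 0, "maxAttempts must be positive")
    var lastError: Error? = nil
    for _ in 1...maxAttempts {
        do {
            return try operation()
        } catch {
            lastError = error  // remember the failure and try again
        }
    }
    // Only reached when every attempt threw, so lastError is non-nil.
    throw RetriesExhausted(lastError: lastError!)
}

// Hypothetical stand-in for an expired-credentials failure.
struct CredentialsExpired: Error {}

var attemptCount = 0
let retryResult = try? withRetries(maxAttempts: 3) { () throws -> String in
    attemptCount += 1
    if attemptCount == 1 { throw CredentialsExpired() }  // first call fails
    return "ok"                                          // retry succeeds
}
```

A real client would probably also wait between attempts and only retry errors that are known to be transient; both are left out here to keep the sketch short.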

Synchronization Algorithm
There are two main ways that SMSyncServer deals with the most common recovery cases (e.g., network loss from a mobile device). The first is a synchronization technique that I’ve been using from the start of the client/server design, and I will revisit it here. The general assumption is that the mobile device is far more likely to lose its network connection than the server. The server is assumed to be far more stable, and less likely to lose its network connection. Of course, the server can fail. But primary use cases should be dealt with first.

In light of this assumption of difference in network stability from client to server, the SMSyncServer does uploads and downloads, i.e., synchronization, in two phases. Consider uploading first. A group of files is first uploaded to the SMSyncServer server. While the upload could be done directly from the mobile device to the cloud service, the mobile device client is prone to failure, and a failure partway through could affect the integrity of files stored in the cloud storage. Once the upload from client to SMSyncServer server is done, the server takes over in an asynchronous operation, disconnected from the client, and does an outbound transfer: it sends the files to the cloud storage system (e.g., Google Drive). This outbound transfer is considered relatively likely to succeed because it is done independently of the mobile device. If a mobile device upload to the sync server fails, this doesn’t affect the files on the cloud storage system. The upload is just retried later.

Downloading occurs in a symmetric but reversed fashion. A group of files is first inbound transferred from the cloud storage system to the sync server. This, as above, occurs disconnected from the client app and is relatively likely to succeed. Thus, we are relatively likely to obtain a consistent state of a collection of files on the SMSyncServer server. Once the inbound transfer is completed, individual files are downloaded to the client device. Once all files are downloaded to the client device, the app using SMSyncServer is informed (through a delegate callback) that the download has completed.

SMSyncServer provides transactional or locking support for synchronization of a group of files. For upload, this amounts to obtaining an exclusive server lock for the device/user for the entire duration of the upload and outbound transfer. Because no other device can alter cloud storage for that user or alter the file index stored in the SMSyncServer server, integrity of the collection of files is assured. For download, this amounts to holding the exclusive server lock for the duration of inbound transfer from cloud storage to the SMSyncServer server. Because downloading doesn’t require a change to the SMSyncServer server file index, the lock can be released prior to the actual file downloads to the client.

Persisting App State
The second main way SMSyncServer deals with the primary recovery case is by persisting app state. This is the guts of the change I’ve been making recently and the main reason I’m writing this blog article.

Prior to the current recovery changes, I was making only limited use of Core Data to persist state information for SMSyncServer. This resulted in some hard-to-understand client-side algorithms for uploading and downloading. I relied too much on some persisted and overly complicated mode variables, which attempted to encode the current operating state so that, on regaining network access or after a relaunch of the app, they could be used to restart synchronization. While all 80 or so synchronization XCTest cases passed, the code was fragile and hard to understand. It also, all too often, resulted in infinite loops internally. Not good.

After getting fed up with this fragile and difficult-to-understand code, I decided to make a large change to the internal client-side architecture of the SMSyncServer. While the client API would remain largely unchanged, as would the server design, I decided to make more use of Core Data in the iOS client to persist app state, and to do away with the complicated persisted mode variable. From a previous total of two NSManagedObject subclasses, I now have 16 of these Core Data classes. At the heart of this revised internal client-side architecture is a set of queues, persisted in Core Data:

[Figure: Queues]

The beingDownloaded queue persists meta data describing downloads that have yet to be done, but that are actively being processed. When there are downloads to be carried out from the server, the beingDownloaded queue is populated with meta data from the server describing these downloads, and this is persisted to Core Data. After each successful file download is carried out, this queue is updated and persisted. This way, even if the network connection is lost, or the app loses CPU access, the next time the app executes, this persisted information about downloads can be accessed and downloading can be resumed.

Similarly, the beingUploaded queue persists meta data describing uploads that have yet to be done, but that are actively being processed. In this case, uploads are initiated from the client API. For example, when the app user makes a change to a file, this can be queued up using the SMSyncServer client method uploadImmutableFile. As with the idea of persisting download meta data, persisting upload meta data enables recovery from loss of network connectivity, and loss of CPU access. Downloads are considered higher priority than uploads (because the server is taken to have the currently most up-to-date information) and so only one of the beingDownloaded queue or the beingUploaded queue is actively processed at any one time, and uploads are only handled when all currently active downloads have been processed.
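To make the download-before-upload rule and the role of persistence concrete, here is a small Swift sketch. It is not the SMSyncServer implementation: it swaps Core Data for a JSON round trip through Codable so that it is self-contained, and the type names PendingTransfer and TransferQueues (and their methods) are hypothetical. Only the queue names beingDownloaded and beingUploaded come from the text above.

```swift
import Foundation

/// Hypothetical meta data for one pending file transfer.
struct PendingTransfer: Codable, Equatable {
    let fileName: String
}

/// Hypothetical container for the two active queues.
struct TransferQueues: Codable, Equatable {
    var beingDownloaded: [PendingTransfer] = []
    var beingUploaded: [PendingTransfer] = []

    enum Next: Equatable {
        case download(PendingTransfer)
        case upload(PendingTransfer)
    }

    /// Downloads take priority; uploads are handled only when no
    /// downloads remain.
    func next() -> Next? {
        if let download = beingDownloaded.first { return .download(download) }
        if let upload = beingUploaded.first { return .upload(upload) }
        return nil
    }

    /// Removes a finished operation. `op` must be the value just returned
    /// by `next()`, so the matching queue is known to be non-empty.
    mutating func complete(_ op: Next) {
        switch op {
        case .download: beingDownloaded.removeFirst()
        case .upload: beingUploaded.removeFirst()
        }
    }
}

// Simulate persisting the queues, losing the app, and relaunching.
let beforeTermination = TransferQueues(
    beingDownloaded: [PendingTransfer(fileName: "a.txt")],
    beingUploaded: [PendingTransfer(fileName: "b.txt")])
let savedQueues = try! JSONEncoder().encode(beforeTermination)
var afterRelaunch = try! JSONDecoder().decode(TransferQueues.self, from: savedQueues)
```

After the simulated relaunch, afterRelaunch.next() still reports the pending download first, which is the property that lets synchronization pick up where it left off.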

The remaining two collections of queues in SMQueues are also for uploads. The client operation of an app is modeled as queuing a series of upload and upload-deletion operations followed by a commit operation. Prior to committing, using the SMSyncServer client commit method, the upload operations are persisted in the uploadsBeingPrepared queue. After a commit, that particular queue is itself queued in the committedUploads queue. To process a queue of uploads, one of the queues from the committedUploads queue is moved to the beingUploaded queue.
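The movement of operations between these queues can be modeled roughly as follows. This is a sketch under assumptions, using in-memory arrays rather than Core Data; the type UploadQueues and the methods enqueue, startNextUpload, and finishUpload are hypothetical names, while the queue names follow the text.

```swift
/// Hypothetical representation of one queued client operation.
enum UploadOperation: Equatable {
    case upload(fileName: String)
    case deletion(fileName: String)
}

/// In-memory model of the three upload-related queues.
struct UploadQueues {
    var uploadsBeingPrepared: [UploadOperation] = []
    var committedUploads: [[UploadOperation]] = []   // a queue of queues
    var beingUploaded: [UploadOperation]? = nil

    /// Called for each upload or upload-deletion before a commit.
    mutating func enqueue(_ op: UploadOperation) {
        uploadsBeingPrepared.append(op)
    }

    /// Moves everything prepared so far, as one group, into committedUploads.
    mutating func commit() {
        guard !uploadsBeingPrepared.isEmpty else { return }
        committedUploads.append(uploadsBeingPrepared)
        uploadsBeingPrepared = []
    }

    /// Promotes the oldest committed group to beingUploaded, if nothing
    /// is currently being uploaded. Returns whether a group was promoted.
    mutating func startNextUpload() -> Bool {
        guard beingUploaded == nil, !committedUploads.isEmpty else { return false }
        beingUploaded = committedUploads.removeFirst()
        return true
    }

    /// Marks the active group as finished.
    mutating func finishUpload() {
        beingUploaded = nil
    }
}
```

Keeping each commit as its own group is what lets a later commit wait its turn without being mixed into a group that is already being uploaded.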

The overall algorithm for uploading follows, given in Swift code. Each method (e.g., self.doUploadDeletions), when successful, removes one or more operations from a Core Data queue or changes the state of operation(s) in the Core Data queue so that the algorithm has a persistent quality. This code is from the file SMUploadFiles.swift.

[Code listing 1: the upload algorithm from SMUploadFiles.swift]
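The "persistent quality" described above can be illustrated with a much smaller example. To be clear, this is not the code from SMUploadFiles.swift. It is a hypothetical sketch in which an array stands in for a Core Data queue, and UploadStep and processPending are invented names. The point is only that a step is removed from the queue after it succeeds, so an interrupted run resumes at the step that failed.

```swift
/// Hypothetical unit of upload work.
enum UploadStep: Equatable {
    case deletion(fileName: String)
    case upload(fileName: String)
}

/// Performs queued steps in order. A step is removed only after `perform`
/// reports success, so a failed or interrupted run leaves the remaining
/// work in the queue. Returns true when the queue has been drained.
func processPending(_ queue: inout [UploadStep],
                    perform: (UploadStep) -> Bool) -> Bool {
    while let step = queue.first {
        guard perform(step) else { return false }  // keep step for the next run
        queue.removeFirst()
    }
    return true
}

var pendingSteps: [UploadStep] = [
    .deletion(fileName: "old.txt"),
    .upload(fileName: "a.txt"),
    .upload(fileName: "b.txt"),
]

// First run: the "network" drops just before b.txt is uploaded.
var networkIsUp = false
let finishedFirstRun = processPending(&pendingSteps) { step in
    if case .upload(let name) = step, name == "b.txt" { return networkIsUp }
    return true
}

// Second run, after the network returns: only b.txt is left to do.
networkIsUp = true
let remainingBeforeRetry = pendingSteps
let finishedSecondRun = processPending(&pendingSteps) { _ in networkIsUp }
```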

Server-based Recovery
Part of the operation of recovery is based on the server API calls operating in the Node.js code. Previously, I had separate server API entry points dedicated to recovery operations. E.g., a server API call to recover from a failed outbound transfer. Now, in order to provide improved recovery operation, I’m using a different strategy, and one that better decouples the client from the server.

My goal with each server API call is to enable each call to be made multiple times, sequentially, without failure. In normal operation, the first time the call is made, the call succeeds. If the same call, with the same arguments, is made a second time, then the call does nothing, but returns success. For example, the server API call lock normally works to establish a server lock for the calling user/device. But suppose that while the lock operation succeeds, the return communication fails, say because of a temporary network loss. The user/device thus owns a server lock, but observes a failed server API call. If the user/device were to do a retry, what I want is for the lock server API call to notice that that user/device already has the lock and to return as if this was the first call to the lock operation. I.e., it will not fail. This greatly assists in recovery because it makes a retry more likely to succeed.
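The repeatable behavior of the lock call can be modeled in a few lines. The real server code is Node.js; the Swift below is just a model of the idea, with hypothetical names (ServerLock, Outcome, deviceID) that do not come from the SMSyncServer API.

```swift
/// Model of a server-side exclusive lock that tolerates repeated calls.
struct ServerLock {
    enum Outcome: Equatable {
        case success
        case heldByAnotherDevice
    }

    /// Identifier of the user/device currently holding the lock, if any.
    var holder: String? = nil

    /// Acquires the lock. Calling this again for the device that already
    /// holds the lock changes nothing and still reports success.
    mutating func lock(deviceID: String) -> Outcome {
        if let current = holder {
            return current == deviceID ? .success : .heldByAnotherDevice
        }
        holder = deviceID
        return .success
    }
}

var serverLock = ServerLock()
let firstCall = serverLock.lock(deviceID: "device-A")   // acquires the lock
let retryCall = serverLock.lock(deviceID: "device-A")   // response was lost; retry
let otherCall = serverLock.lock(deviceID: "device-B")   // a different device
```

Here the retry by device-A reports success just as the first call did, while device-B is still refused, so the exclusivity of the lock is not weakened by making the call repeatable.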

I’m generally carrying out this strategy with other server API operations as well, so that if a retry is carried out, the retry does any necessary recovery handling. The more specific server API entry points that previously did specific recovery operations have been removed.

Mode State
Previously, I had a complicated persisted mode variable which encoded information about how to recover, given a relaunch or network reconnection. While this complication has been removed, a simpler persisted mode state has been retained. Now, the primary use of the mode is for indicating error states that the calling app needs to be informed of. Here are the various current modes:

[Code listing 2: the current mode values]
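As a rough illustration of how a simple persisted mode can gate normal operation until the calling app resets it, consider the sketch below. The mode names here are hypothetical placeholders chosen to echo the error categories discussed earlier; they are not the actual SMSyncServer modes, and a plain string stands in for whatever persistent storage the client uses.

```swift
/// Hypothetical mode values; not the actual SMSyncServer modes.
enum SyncMode: String {
    case normal
    case clientAPIError
    case internalError
    case nonRecoverableError
}

/// Error reported to the calling app while in a failure mode.
struct SyncFailure: Error, Equatable {
    let mode: SyncMode
}

struct ModeStore {
    /// Stand-in for a persisted value that survives relaunches.
    var persistedValue: String = SyncMode.normal.rawValue

    var mode: SyncMode {
        get {
            // An unreadable stored value is treated as an internal error.
            return SyncMode(rawValue: persistedValue) ?? .internalError
        }
        set { persistedValue = newValue.rawValue }
    }

    /// Throws unless the client is in normal mode.
    func ensureOperational() throws {
        guard mode == .normal else { throw SyncFailure(mode: mode) }
    }

    /// The explicit step the calling app takes to leave a failure mode.
    mutating func reset() {
        mode = .normal
    }
}

// A failure is recorded, the app is relaunched, and the mode is still set.
var beforeCrash = ModeStore()
beforeCrash.mode = .nonRecoverableError
var afterRestart = ModeStore(persistedValue: beforeCrash.persistedValue)
```

Because the mode is read back from the persisted value, afterRestart.ensureOperational() keeps throwing until the calling app invokes reset().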

Conclusion
So far, the code for synchronization seems easier to understand and debug, and more reliable in terms of the various goals for recovery. Because it is easier to understand, I also have more confidence that the internal architecture will be well suited to porting to other mobile platforms (e.g., Android). Time will tell!

About the author: Christopher G. Prince has his B.Sc. in computer science (University of Victoria, B.C., Canada), an M.A. in animal psychology (University of Hawaii, Manoa), an M.S. in computer science (University of Louisiana, Lafayette, USA), and a Ph.D. in computer science (University of Louisiana, Lafayette, USA). His M.S. and Ph.D., while officially in computer science, were unofficially in cognitive science, split between animal psychology and computer science. Chris is a dedicated animal person, and has also developed: Catsy Caty Toy, a customizable and shareable iPhone and iPad app for your cats (http://GetCatsy.com); Petunia, an app for recording and sharing pet health information (http://GetPetunia.com); and WhatDidILike, an iPhone app to keep track of restaurants and food that you like (http://WhatDidILike.com).

Creative Commons License
“The Many Senses of Recovery in SMSyncServer” by Christopher G. Prince is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at chris@SpasticMuffin.biz.