To Lock or Not to Lock in the SMSyncServer

SMSyncServer allows multiple mobile device apps to access the same collection of cloud-based files. For example, two people using apps on different mobile devices can modify the same Google Drive data (using the same Google credentials, or through an invitation and their Facebook credentials). File uploads from the apps can modify the same shared file. Upload-deletions can delete the same shared file. Downloads by an app on one device need to capture a consistent snapshot of the current cloud-based data—not a state of the data midway through updates. With this shared access, then, there must be some means of enabling cooperative access to the data.

Currently, SMSyncServer uses a system of mutually exclusive locks to enable this cooperative access. Prior to a sequence of file-uploads and/or upload-deletions, an exclusive lock is obtained by an app on a mobile device, and this lock is released by the app after the uploads have taken place. Prior to a sequence of downloads, a lock is obtained by the mobile device app, and released by the app after the downloads have completed. Such a locking mechanism works well until either:

  • An error occurs on the server or the mobile device, or
  • A mobile device becomes disconnected from the network.

In these cases, a lock can be held for an arbitrarily long period of time, which will exclude apps running on other devices from accessing the same user’s data. That is, sharing will no longer be possible in this situation. One solution to this problem is to implement lock breaking—which would require a fairly extensive set of changes to SMSyncServer (see Appendix I, below). Another potential solution, also requiring an extensive set of changes, is to move to a non-locking design. Modern database technology provides alternatives to lock-based mechanisms for enabling cooperative access to shared data. Two interesting alternatives are eventual consistency [1] and versioning [2].

How could the concepts of eventual consistency and/or versioning be applied to SMSyncServer? It seems helpful to restate a main assumption of this system: we assume a relatively small number of apps/devices concurrently accessing the same user cloud-based data (e.g., Google Drive files). For example, with just me accessing the same data across an app running on multiple devices, there would typically be no concurrent access—because I typically don’t use more than one device at the same time. Given this assumption, the locks as we currently use them seem too strong, particularly because they are currently causing grief. For example, in development, I often have errors that leave locks on the server, and I have to take manual steps to remove those locks.

What would it look like to rely more heavily on our assumption of relatively little concurrent access to the same user data? And how could this be leveraged to limit or eliminate the locking that is taking place on the server? Let’s look at this in terms of the primary operations we carry out on the server.

1) Downloads. The initial step in all operations on the client side of SMSyncServer (i.e., on an app) is a download, or an attempted download. With file downloads in this system, no modifications are made to files on the server. A file download updates or creates a file on a device. A download-deletion removes a file from a device. The emphasis is on the device, not the server or user cloud storage. Currently, a lock is held for the duration of the file download operation in order to provide a consistent snapshot of the user cloud data to the device. However, since we are assuming little actual concurrency of access to user cloud data, this consistent snapshot should normally be obtained even without locking.

To remove locking, we need to consider a possible edge condition: device1 starts a download, and during device1’s download an upload by device2 occurs. This could leave the downloading device (device1) with an inconsistent snapshot of the data on the server, because some of the downloaded files could be earlier versions than those resulting from device2’s upload. One way to deal with this is for the device, after it has received the download (but before the files are provided through the SMSyncServer delegate callbacks to the app), to fetch the file index again. If the file index contents have not changed in the interval between starting and completing the download, then we are certain to have a consistent snapshot of the data. If the file index has changed, then we can compute a delta between the two file indices and download the updated files. In the majority of cases, given our assumption of limited real concurrency, there will be no change between the first and subsequent file indices.
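
As an illustration, here is a minimal sketch of that check in TypeScript. The FileIndexEntry shape and the fetchFileIndex/downloadFiles calls are assumptions standing in for the real SMSyncServer client operations, not its actual API.

```typescript
// Sketch of the lock-free download check, assuming a per-file version
// number in the file index. All names here are hypothetical.
interface FileIndexEntry {
    uuid: string;      // unique id of the file
    version: number;   // per-file version number
}

declare function fetchFileIndex(): Promise<FileIndexEntry[]>;
declare function downloadFiles(entries: FileIndexEntry[]): Promise<void>;

// Loop until two consecutive file index fetches agree; only then do we
// know the downloaded files form a consistent snapshot.
async function consistentDownload(): Promise<FileIndexEntry[]> {
    let index = await fetchFileIndex();
    let toDownload = index;
    while (toDownload.length > 0) {
        await downloadFiles(toDownload);
        const newIndex = await fetchFileIndex();
        // Delta: entries that are new, or whose version changed, since the
        // previous index fetch. Given limited real concurrency, this is
        // nearly always empty on the first pass.
        toDownload = newIndex.filter(after => {
            const before = index.find(b => b.uuid === after.uuid);
            return !before || before.version !== after.version;
        });
        index = newIndex;
    }
    return index;
}
```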

2) File-uploads and upload-deletions. How can uploads take place without locks, or with at most a limited use of locks? A main reason for the use of locks in SMSyncServer is so that we do not have multiple devices modifying or deleting the same file(s) in user cloud storage at the same time. E.g., suppose that both device1 and device2 tried to update file1 in user cloud storage at the same time. We could be left in a state where we don’t know which update was successful, or whether both were partially successful. What if, however, these devices were to upload their updates as separate file versions, say file1_vX and file1_vY? If multiple files were being uploaded in a sequence, the same scheme could be used—each file could be uploaded with a version number. Of course, version numbers would have to be distinct across devices.
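
To make the idea concrete, here is one way version numbers could be kept distinct across devices: include the uploading device’s UUID in the cloud file name alongside a per-device version counter. This naming scheme is an assumption for illustration, not SMSyncServer’s actual one.

```typescript
// Hypothetical naming scheme: device UUIDs differ, so two devices can
// never collide on a cloud file name, even when updating the same file.
function cloudVersionName(fileUUID: string, deviceUUID: string,
                          deviceVersion: number, extension: string): string {
    return `${fileUUID}_${deviceUUID}_v${deviceVersion}.${extension}`;
}

// e.g., device1 and device2 both updating the same file:
// cloudVersionName("file1", "device1", 3, "dat") => "file1_device1_v3.dat"
// cloudVersionName("file1", "device2", 7, "dat") => "file1_device2_v7.dat"
```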

After this use of file versioning to upload files, the file index on the server would need updating. We plan to keep the concept we have had so far: to maintain just a single final version of each file. The last (or latest) writer to the file index on the server would be the winner, establishing the current “truth” of the files. We would need a server-side means of ensuring that the file index is updated atomically. This, however, may be easier to implement than a locking mechanism accessed by individual devices. That is, the atomic file index update could be imposed by the server itself, and never be seen by devices. We should be able to eliminate the possibility of lingering locks interfering with the system—for example, by holding locks only for a limited duration and removing any locks when the server restarts.
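
As a sketch of what such a server-internal lock might look like (SMSyncServer’s server already uses MongoDB, so the example uses the MongoDB Node driver from TypeScript; the collection and field names are assumptions, not the actual schema):

```typescript
import { Collection } from "mongodb";

// One lock document per user's file index. The unique _id makes
// acquisition atomic: the insert succeeds for exactly one caller.
interface IndexLock {
    _id: string;        // the user id
    acquiredAt: Date;   // lets a TTL index expire stale locks
}

async function withFileIndexLock(locks: Collection<IndexLock>,
                                 userId: string,
                                 updateIndex: () => Promise<void>) {
    try {
        await locks.insertOne({ _id: userId, acquiredAt: new Date() });
    } catch {
        // Duplicate key: another request holds the lock; retry shortly.
        throw new Error("file index busy");
    }
    try {
        await updateIndex();   // apply the batched file index changes
    } finally {
        await locks.deleteOne({ _id: userId });
    }
}
```

Note that no device ever sees this lock; it is held only across the brief file index update, not across uploads themselves.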

Upload-deletions would be done in a similar manner—they would be delayed. First, the file index would be updated in an atomic manner, and only later would the actual deletions be done. These deletions would be part of a new requirement for garbage collection: file versions that were uploaded but never used, because they were losers in a file index update race, would need to be deleted from user cloud storage. We would need to systematically keep references to these to-be-garbage-collected files in a separate server index.
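
A sketch of that garbage-collection bookkeeping, under the same assumed MongoDB schema (all names hypothetical):

```typescript
import { Collection } from "mongodb";

// Versions uploaded to user cloud storage that lost a file index update
// race, recorded in a separate server index so they can be purged later.
interface OrphanedVersion {
    cloudFileName: string;
    recordedAt: Date;
}

async function sweepOrphans(orphans: Collection<OrphanedVersion>,
                            purgeFromCloud: (name: string) => Promise<void>) {
    // Only purge versions old enough that no in-flight index update can
    // still reference them (24 hours is an arbitrary illustrative cutoff).
    const cutoff = new Date(Date.now() - 24 * 60 * 60 * 1000);
    const stale = await orphans.find({ recordedAt: { $lt: cutoff } }).toArray();
    for (const orphan of stale) {
        await purgeFromCloud(orphan.cloudFileName);
        await orphans.deleteOne({ cloudFileName: orphan.cloudFileName });
    }
}
```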

A series of uploads could fail in a manner not currently possible. Suppose device1 was about to carry out an upload-deletion, but device2 carried out the same upload-deletion first. Since a lock would not be held, device1 could fail in this operation, which would force it to do a download (to process the consequent download-deletion). This is effectively just a variant of current client/server operation.

How do we deal with a file index update race? For example, suppose device1 and device2 both concurrently do a sequence of file-uploads and upload-deletions. We will have a server-based locking mechanism (not exposed to client mobile apps) that ensures the file index is updated atomically. Given that we have two devices concurrently doing uploads, there are two main situations that can occur with this file index update:

A) The sets of files are distinct – entirely different files are being uploaded and/or deleted by the devices. Suppose that device1 wins the race—device1 updates the file index, then device2 updates the file index. For device1, this sequence of events would be the same as it is with the SMSyncServer client now—with strong locks. It would have done a download first, and then successfully carried out an upload.

For device2, the situation is different. It would have done a download first, but now there is an intervening upload (by device1) that has taken place. Device2 does need to download those files uploaded by device1. The question is: when? Should device2 a) download those device1 uploads first, i.e., before applying its updates to the file index, or can device2 b) download those device1 uploads after applying its updates to the file index? This really seems like an application-dependent question. For some apps, the user might be making decisions about data based on a set of files, and a) would be necessary. For other apps, there might be more independence across files, and b) would be acceptable. For now it seems easier, if a little more conservative, to just adopt option a) for SMSyncServer.

B) The sets of files are not distinct – some files being uploaded/deleted by device1 are the same as those being uploaded/deleted by device2. Suppose again that device1 wins the file index update race. In this case, device2 really needs to know that it lost the race—because it needs to (i) do a download, to update its files and resolve conflicts, and (ii) redo the previous upload (which might have been modified by the prior conflict resolution).

How can we determine the winner and loser of this file index update race? What we really want to know is: between our download of files and our upload, was there an intervening upload by another device? To accomplish this we can introduce another level of versioning: a version for the entire set of user files. Each time the file index is updated, we increment this global version number. When a client app does a download, it is given this global version number. When an upload is being carried out, if the global version number on the server is the same as the one we know about, then we are the winner of the update race. If not, we lost to someone else and need to do a download (obtaining a new global version number), and then redo the upload. With the limited concurrency we expect, any given device would nearly always be the winner of the file index update race.
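
This is essentially optimistic concurrency control. A minimal sketch of the check, again assuming a MongoDB-backed index and a hypothetical masterVersion field for the global version number:

```typescript
import { Collection } from "mongodb";

interface MasterVersion {
    _id: string;            // the user id
    masterVersion: number;  // version of the entire set of user files
}

// Compare-and-swap on the global version number. Returns true if we won
// the file index update race; false means another device's upload
// intervened, so the caller must re-download and redo its upload.
async function tryWinUpdateRace(versions: Collection<MasterVersion>,
                                userId: string,
                                versionFromDownload: number): Promise<boolean> {
    const result = await versions.updateOne(
        { _id: userId, masterVersion: versionFromDownload },  // compare...
        { $inc: { masterVersion: 1 } });                      // ...and swap
    return result.modifiedCount === 1;
}
```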

Summary
Currently, there is a client-app oriented locking mechanism in SMSyncServer. This generally works, but is problematic in failure situations. It seems better to have a more robust mechanism that reduces the amount of locking and improves access to shared data. This is particularly relevant because we assume there is actually little real concurrency. Why have such strong locking mechanisms when we have limited concurrency?

After the change described here, we would still have locking. However, the locks would now be held for only a brief period of time—the time it takes to update the file index. Additionally, the locks would be server-based and far less prone to being retained after failures. Since we rarely actually need the locks (i.e., we have very little real concurrent access to the same user data), there would be very little actual contention for this briefly held lock. We should also be able to design this lock so it is automatically released in the case of an error in updating the file index. E.g., on rebooting the server, we could automatically release this lock.
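
A sketch of that auto-release behavior, continuing the assumed MongoDB lock collection from above: clear all lock documents at server startup, and add a TTL index so that even while the server is running, no lock outlives a bounded duration.

```typescript
import { Db } from "mongodb";

async function initFileIndexLocks(db: Db) {
    const locks = db.collection("fileIndexLocks");   // hypothetical name
    // After a restart, any remaining lock is stale by definition.
    await locks.deleteMany({});
    // MongoDB's TTL monitor removes lock documents once acquiredAt is
    // older than ~60 seconds (expiry is approximate; it runs periodically).
    await locks.createIndex({ acquiredAt: 1 }, { expireAfterSeconds: 60 });
}
```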

Appendix I: Lock Breaking
Lock breaking can be needed for several reasons: 1) If a user app connects, takes out a lock, then loses network connectivity and doesn’t come back online for a long time (or never comes back online), this would leave a lock on the server—and no other device could access the data. 2) When an error occurs on an app (or the server) that requires a server Cleanup, that cleanup will remove the lock and the operation status on the server. This leaves us in a situation very much like a lock having been broken. On the app, after the cleanup, we will have to restart the app operation.

Lock breaking requires something of a redesign of the way upload and download operations take place in the SMSyncServer framework. It must become possible to restart a sequence of operations from the beginning, from any point along that sequence. Thus, we cannot delete the component operations of a sequence until we know absolutely that all of the operations have been successfully completed.

One other issue with lock breaking and restarting sequences of operations from scratch: when doing a sequence of file-uploads or upload-deletions, we can be left in a state where some uploads or upload-deletions have already been applied to the cloud-based file system (e.g., Google Drive) when the lock is broken. That is, some file-uploads or upload-deletions remain to be carried out for the given sequence of operations. This should only ever happen if there is a failure on the server side after allowing the server to proceed asynchronously with the outbound transfer of the file-uploads or upload-deletions.

One way to deal with this latter issue is to apply version numbering to files in user cloud storage, and to not immediately delete files with upload-deletions in cloud storage. New versions of files could be uploaded to user cloud storage under new names, e.g., <UniqueName>_<VersionNumber>.<Extension>. After all new versions of files are uploaded, the changes could be applied to the MongoDB file index. Later still, the old versions of the files could be purged from user cloud storage. File deletions could be handled in the reverse manner: the changes could be applied to the MongoDB file index, and sometime later the deleted files could actually be purged from user cloud storage. Of course, both of these changes would require temporary storage in user cloud storage. If we purged files from cloud storage before beginning new sequences of file-uploads and upload-deletions, this could alleviate the additional storage requirement. Or we could initiate a purge after we know absolutely that a given sequence of file-uploads and upload-deletions was successful.
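
A sketch of that ordering, using the appendix’s naming scheme; the upload, index-update, and purge calls are hypothetical stand-ins for the real transfer operations:

```typescript
declare function uploadToCloud(cloudName: string, localPath: string): Promise<void>;
declare function updateMongoFileIndex(uniqueName: string, version: number): Promise<void>;
declare function purgeFromCloud(cloudName: string): Promise<void>;

// <UniqueName>_<VersionNumber>.<Extension>
function versionedName(uniqueName: string, version: number, ext: string): string {
    return `${uniqueName}_${version}.${ext}`;
}

async function uploadNewVersion(uniqueName: string, currentVersion: number,
                                localPath: string, ext: string) {
    const next = currentVersion + 1;
    // 1) Upload under a fresh name; the old version stays intact, so a
    //    broken lock mid-sequence loses nothing.
    await uploadToCloud(versionedName(uniqueName, next, ext), localPath);
    // 2) Point the MongoDB file index at the new version.
    await updateMongoFileIndex(uniqueName, next);
    // 3) Later still, once the whole sequence is known to have succeeded,
    //    purge the superseded version from user cloud storage.
    await purgeFromCloud(versionedName(uniqueName, currentVersion, ext));
}
```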

This use of versioning and delayed deletion from user cloud storage would also fit well with direct streaming of files from a device to user cloud storage.

[1] https://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf; https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/; https://blog.acolyer.org/2015/03/18/a-comprehensive-study-of-convergent-and-commutative-replicated-data-types/; https://hal.inria.fr/inria-00177693/document; http://doc.akka.io/docs/akka/current/scala/distributed-data.html

[2] https://en.wikipedia.org/wiki/Multiversion_concurrency_control

Acknowledgments: Thanks to the developers at Comcast VIPER. The inspiration for the design of this change in SMSyncServer came in part from discussions with them during a presentation I made on SMSyncServer.