The SyncServer: Permanent Access to Your App Data

12/29/15
(Update: 1/21/16; Corrected typos).

The Problem
This blog article is about a new software development project I’ve started. This project has been on my mind for about two years. My original motivation stemmed from the need to keep data synchronized across mobile devices. Various apps do this. For example, Apple’s Calendar app. You can access your calendar on your iPhone, on your iPad and on your Mac, all because the apps keep their data synchronized across various mobile devices and the web.

I could have approached this using some existing software technologies. However, two thoughts made me want to embark on a custom development effort: 1) I am a one-man development shop. I want to keep the budget for this project low in terms of on-going maintenance costs. More specifically, I don’t want to pay for users’ data storage. And 2) I want to enable users to have permanent access to their own data. The app I want to use this data sync software for is Petunia, my Pet Health Records app. I wanted users of Petunia to permanently own their own data. For example, if they decided to no longer use Petunia, I wanted them to still have their data. If I suddenly stopped being willing to support Petunia for some reason, I wanted them to still have their own data. In my opinion, all too often software hides or otherwise obscures the data you enter. And when you make a significant data entry investment, such as entering the health care data for your pets, you don’t want to lose access to that data.

Solution Concepts
My first thought at a solution to this user-owned cross-device data sync problem was to use existing third-party cloud data storage, e.g., Google Drive or Dropbox. The app running on one device would just upload its data to, say, Google Drive, and the same app running on another device would download the data from Google Drive. Problem solved. Since the data would be stored in a user-owned service, they would permanently retain ownership of the data. And, if the app could store the data in an industry standard format (e.g., XML or JSON), the app user would stand a good chance of being able to retain access to that data even if they changed to using a different app or if the app stopped being maintained. Lovely! Hmmm. If only things were so easy.

As I thought more about this direct use of file-based cloud services and wrote up initial software designs, I realized there were some problems. One problem lies in sharing data. Users often want to be able to share their data with others. If the data resided in the Cloud in a service such as Google Drive, how does a user share that data with others? With services like Google Drive there are sometimes facilities for sharing a link to a file with others, but how do you generalize that? How do you enable some app-specific subset of the data files to be shared with some specific permissions with someone else? And enable that access to occur in a manner controlled by the app. Certainly the user wouldn’t want to share their Google username and password with someone else. E.g., a user of Petunia wouldn’t want to give a veterinarian their Google username/password in order to allow their vet access to their data on Google Drive.

Another problem with this direct use of file-based cloud services lies in ensuring data consistency or integrity. Technically, this problem is also known as data locking or providing atomic transactions. Suppose that the app has a collection of files that only make sense when taken as a unit. In the case of Petunia, my pet health records app, an example of this comes in a Pet Profile. A Pet Profile consists of a description of basic information about the pet (cat/dog, birthday, color, breed etc.) and a collection of photos of the pet. This info only makes sense when taken as a unit. Specifically, the Pet Profile refers or links to the photos. So the question technically becomes: How can the Petunia app upload (or download) the Pet Profile and the pet’s photos as a locked or atomic group or unit? To my knowledge, services such as Google Drive don’t specifically provide support for such locking.

Locking is important in a mobile environment because it would be very easy for one part of the data such as the Pet Profile to be uploaded (or downloaded), but for the photos, say, to fail in uploading. And another user of the same data could then access only part of the Pet Profile, which was intended to be accessed only as a whole. This could happen if the network (cellular or wifi) connection is lost. Or if the app crashes. Or the user decides temporarily to use some other app, putting Petunia in the background, causing Petunia to lose network and CPU access. Apps are, as some say, ephemeral.

Thinking about these issues led me to a new solution concept: Add an intermediary server between the Cloud-based storage system (e.g., Google Drive), and the app. This intermediary server, which I call the SyncServer, would allow both locking of groups of files and sharing of data with others. The overall system architecture looks like this:

Figure 1: SyncServer System Architecture

How would this allow locking and sharing?
In terms of sharing, through some controls in the app (e.g., Petunia), the user could indicate which people (through say, Facebook usernames or other identifiers) could have access to which data (e.g., one pet and her photos), and for what purposes (e.g., for viewing only, not modification). These access rights could be stored in the SyncServer, and the SyncServer could thus act as a sharing gateway to the user’s data. The SyncServer would also store more general permissions for access to the user’s cloud storage (e.g., for Google Drive, technically this can be done using OAuth2 tokens).
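The access rights described above could be represented roughly as follows. This is only a minimal sketch: the names `grants`, `grantAccess`, and `isAllowed`, the two purposes "view" and "modify", and the in-memory array (standing in for server-side storage) are all assumptions made for illustration, not part of the SyncServer.

```javascript
// Hypothetical in-memory stand-in for access rights kept by the SyncServer.
// Each grant records who (granteeId), which data (fileGroupId), and for
// what purpose ("view" or "modify").
const grants = [];

function grantAccess(ownerId, granteeId, fileGroupId, purpose) {
  if (purpose !== 'view' && purpose !== 'modify') {
    throw new Error('Unknown purpose: ' + purpose);
  }
  grants.push({ ownerId, granteeId, fileGroupId, purpose });
}

// A "modify" grant also implies "view"; a "view" grant never implies "modify".
function isAllowed(granteeId, fileGroupId, requestedPurpose) {
  return grants.some((g) =>
    g.granteeId === granteeId &&
    g.fileGroupId === fileGroupId &&
    (g.purpose === requestedPurpose || g.purpose === 'modify'));
}

// Usage: a pet owner lets their vet view, but not change, one pet's records.
grantAccess('owner-1', 'vet-42', 'pet-rosie', 'view');
console.log(isAllowed('vet-42', 'pet-rosie', 'view'));   // true
console.log(isAllowed('vet-42', 'pet-rosie', 'modify')); // false
console.log(isAllowed('vet-42', 'pet-max', 'view'));     // false
```

The point of keeping these grants on the server is that the vet never needs the owner's Google username or password; the server checks the grant and then uses its own stored permissions to reach the cloud storage.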

In terms of locking, this could be done using the following technique or algorithm. First consider uploads. And say we’re talking about Petunia’s pet profile and associated photos. The first step in the upload procedure is to upload the pet profile (say as one file) and its associated photos (some number of additional files). This upload occurs to the SyncServer, and not directly to the cloud-based storage system. This first step represents the fragile part of the process. At any time and for many reasons, the data transfer from the app to the SyncServer could fail. However, because we are uploading to the SyncServer and not to a service such as Google Drive, we (as developers) have more control. We can temporarily store these files on the SyncServer, and wait until the entire logical unit (i.e., the pet profile and all associated photos) is correctly uploaded to the SyncServer before transferring those files to the user-accessible cloud storage. In this way, if the transfer to the SyncServer fails, we can recover and restart, all without having altered the files that the app user has access to. This process looks like this:

Figure 2: Outline of Uploading Algorithm

The reason that Step 2) is less likely to fail is that it is running on the server, disconnected from the user’s mobile device. It is going to run to completion generally without fear of network loss or the app going into the background because the user decides to use some other app. Yes, the SyncServer itself could fail for various reasons. But apps running on mobile devices are typically more likely to fail than servers in the right circumstances. An app “fails” in this sense when it loses the network connection, or when it loses CPU runtime (the user uses another app). For a server, in contrast, when picking a hosting service or setting up a server, it is possible for the app developer or publisher to establish a high level of server hardware reliability. It is fairly common these days for hosting services to have > 99% uptime.
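The two steps can be sketched as below. This is an illustration under stated assumptions, not the SyncServer's actual code: the function names (`startUpload`, `stageFile`, `commitUpload`), the in-memory maps standing in for temporary server storage and for Google Drive, and the idea that the app tells the server which UUIDs make up the group are all hypothetical.

```javascript
// All names below are hypothetical; Maps stand in for real storage.
const locks = new Map();        // cloud account id -> device currently uploading
const staging = new Map();      // cloud account id -> Map(uuid -> contents) on the server
const cloudStorage = new Map(); // stand-in for the user's cloud storage: uuid -> contents

function holdsLock(accountId, deviceId) {
  return locks.has(accountId) && locks.get(accountId) === deviceId;
}

// Take the per-account lock. Re-entry by the same device keeps any files
// already staged, so an interrupted upload can resume.
function startUpload(accountId, deviceId) {
  if (locks.has(accountId) && !holdsLock(accountId, deviceId)) return false;
  locks.set(accountId, deviceId);
  if (!staging.has(accountId)) staging.set(accountId, new Map());
  return true;
}

// Step 1 (fragile): files arrive one at a time and are only staged.
// Nothing in the user's cloud storage changes here.
function stageFile(accountId, deviceId, uuid, contents) {
  if (!holdsLock(accountId, deviceId)) throw new Error('Lock not held');
  staging.get(accountId).set(uuid, contents);
}

// Step 2 (server side): transfer only when the whole group has arrived.
function commitUpload(accountId, deviceId, expectedUuids) {
  if (!holdsLock(accountId, deviceId)) throw new Error('Lock not held');
  const staged = staging.get(accountId);
  const missing = expectedUuids.filter((uuid) => !staged.has(uuid));
  if (missing.length > 0) return { committed: false, missing };
  for (const uuid of expectedUuids) cloudStorage.set(uuid, staged.get(uuid));
  staging.delete(accountId);
  locks.delete(accountId);
  return { committed: true, missing: [] };
}

// Usage: the photo upload "fails" the first time, so nothing reaches the cloud.
startUpload('acct-1', 'iphone');
stageFile('acct-1', 'iphone', 'profile-uuid', '{"name":"Rosie"}');
const first = commitUpload('acct-1', 'iphone', ['profile-uuid', 'photo-uuid']);
const cloudSizeAfterFirst = cloudStorage.size;
const otherDeviceAdmitted = startUpload('acct-1', 'ipad');
stageFile('acct-1', 'iphone', 'photo-uuid', '<jpeg bytes>');
const second = commitUpload('acct-1', 'iphone', ['profile-uuid', 'photo-uuid']);
console.log(first.missing, cloudSizeAfterFirst, otherDeviceAdmitted, second.committed);
```

Returning the list of missing UUIDs is one possible way to let the app retry only the files that did not arrive. The sketch does not cover a failure partway through Step 2 itself, which would need its own recovery.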

I think downloading of data can happen in a similar manner, but that’s still a work in progress. So far I’ve been focusing on uploading.

Other Goals
Another goal that I have in mind for this project is to enable app users to have offline access to their data. For example, a Petunia user should be able to add a new pet, and associated photos (and other data such as vet records), when they are offline, without access to the network (which could easily happen with the app running on an iPad). Then, later, when they get access to a network, and are using Petunia, the app could upload their new information.

This is a goal for me in no small part because of my beliefs about mobile devices. These are very powerful devices, typically with lots of space for storage of information. Their apps should provide more functionality than web pages. Their apps should be able to work at least temporarily without access to a network.
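The offline behavior described in this section amounts to queuing changes locally and sending them once a network is available. Here is a minimal sketch of that idea. It is written in JavaScript only for consistency with the server-side examples (the real client is in Swift), and `pendingUploads`, `saveLocally`, `flushPending`, and the `sendFn` callback are hypothetical names.

```javascript
// Hypothetical client-side queue of changes made while offline.
const pendingUploads = [];

// Record a change locally; this works with or without a network.
function saveLocally(uuid, contents) {
  pendingUploads.push({ uuid, contents });
}

// Try to send queued changes in order. sendFn is an assumed callback that
// returns true when a file was sent successfully. Returns how many were sent.
function flushPending(isOnline, sendFn) {
  if (!isOnline) return 0;
  let sent = 0;
  while (pendingUploads.length > 0) {
    if (!sendFn(pendingUploads[0])) break; // leave it queued; try again later
    pendingUploads.shift();
    sent += 1;
  }
  return sent;
}

// Usage: two changes made offline, then a flaky connection, then a good one.
saveLocally('profile-uuid', '{"name":"Rosie"}');
saveLocally('photo-uuid', '<jpeg bytes>');
const sentOffline = flushPending(false, () => true);
let attempts = 0;
const sentFlaky = flushPending(true, () => { attempts += 1; return attempts < 2; });
const sentLater = flushPending(true, () => true);
console.log(sentOffline, sentFlaky, sentLater); // 0 1 1
```

An item is removed from the queue only after it has been sent, so a dropped connection or a backgrounded app loses nothing that was saved locally.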

First Implementation Phase
During Thanksgiving this year, I took the plunge to start development of this data-sync system. Several factors conspired to make this happen now. First, I wanted a new project to think about and work on. Too much of my programming lately had been maintenance of existing apps. Second, a colleague of mine asked for help with a network programming task, and that had gotten me thinking about this data-sync issue again. Third, I had been wanting to have a significant open-source project to contribute to the community and to provide a work-example for my consulting. So, this project is going to be open-source too! Fourth, I had been wanting to experiment with Node.js. I’ve heard a lot about this technology and wanted to sink my teeth into it. So, the server side of this system, the SyncServer, is going to be written in Node.js/JavaScript. And last, but not least, I had just been jonesing to dig into this. I like the concept. IMHO, it’s got some interesting innovations in it.

I’ve now been working on the SyncServer for about a month and have a pre-alpha system implemented. Overall, it weighs in at nearly 6,000 lines of code and comments so far. About 3,250 lines in JavaScript for the SyncServer, and 2,500 lines of code for the client/app. I’m an iOS developer and so naturally, I’m writing an example mobile app (for testing the SyncServer) in Swift for iPhone. I’m also starting with Google Drive and at some point will move onto incorporating Dropbox, but want to get the system working with Google Drive first. Right now, for development purposes, the SyncServer itself runs on my Mac. Eventually, I’ll need to find a hosting service for it to run on the web. The SyncServer uses MongoDB as its persistent storage system.

The example mobile app enables a user to sign into their Google account, and those sign-in credentials are sent up to the SyncServer using HTTP/REST. All of the SyncServer services are requested over a REST interface. The SyncServer stores these credentials in a Mongo collection (think: SQL table) in terms of a collection of users known to the SyncServer. Every REST API call to the SyncServer requires these sign-in credentials, to ensure that the server is accessed by an authorized user.
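Requiring credentials on every call can be pictured as a wrapper around each request handler. The sketch below makes several assumptions: `knownUsers`, `registerUser`, and `requireCredentials` are hypothetical names, requests are plain objects rather than real HTTP requests, and a simple token comparison stands in for whatever verification the server actually performs against Google.

```javascript
// Stand-in for the Mongo collection of users known to the server.
const knownUsers = new Map(); // userId -> token

function registerUser(userId, token) {
  knownUsers.set(userId, token);
}

// Wrap a handler so it only runs when the request carries credentials that
// match a known user. Otherwise respond with HTTP status 401 (Unauthorized).
function requireCredentials(handler) {
  return function (request) {
    const creds = request.credentials;
    const authorized =
      creds !== undefined && creds !== null &&
      typeof creds.token === 'string' &&
      knownUsers.get(creds.userId) === creds.token;
    if (!authorized) return { status: 401, body: 'Unauthorized' };
    return handler(request);
  };
}

// Usage: every endpoint is wrapped the same way.
registerUser('user-1', 'token-abc');
const uploadFile = requireCredentials((request) => (
  { status: 200, body: 'Stored ' + request.uuid }
));

const ok = uploadFile({ credentials: { userId: 'user-1', token: 'token-abc' }, uuid: 'f1' });
const badToken = uploadFile({ credentials: { userId: 'user-1', token: 'wrong' }, uuid: 'f1' });
const noCreds = uploadFile({ uuid: 'f1' });
console.log(ok.status, badToken.status, noCreds.status); // 200 401 401
```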

The primary goal that I’m focusing on right now is enabling a group of files to be uploaded in a locked and robust manner. Downloading of files, and sharing data with others (e.g., for Petunia, sharing data with veterinarians) will follow after I get the uploading working properly. Currently, I have the basics of enabling a group of files to be uploaded in a locked/atomic manner to the SyncServer, and then transferred to Google Drive. At some point in this development process, I’ll release a public open-source version.

Other Technical Issues
1) Robust error handling: Perhaps the single biggest issue on my mind right now with this project is recovery from errors. And maintaining intellectual control over this error recovery process. That is, I want a fairly simple means of categorizing and dealing with errors. This goal led to Figure 2 as presented above. I’m still in the midst of working out an algorithm (hopefully one that’s fairly simple) for recovery from failure in each of the two steps in Figure 2. Since failure in step 1) is estimated to be more likely than failure in step 2), a main fault tolerance issue here is being able to deal with failures during an upload. E.g., failing after only 1 of 2 files in a group of files gets uploaded from the mobile device to the SyncServer.

2) How much of the network activity should be explicitly exposed to the user? Some programs show a user their ongoing network activity explicitly via the user interface of the program. For example, the Cornerstone program, used by developers for SVN version control of program code, gives the following UI message when doing a specific long-running operation:

Figure3

In general, it seems like good practice to hide much of the network activity from the user. Why should the app user really care about the internal network activity of the app?

But, at times, it can be suitable to interrupt the user’s activity and require them to wait. In the case of my Petunia app, it may be suitable to give such an explicit user-interface message when initially converting over to using the SyncServer. In this case, a user might have a considerable amount of data (e.g., profiles on various pets, health care info on those pets, pictures of the pets, and images linked to the health care info). While recovering from failure would still be a good idea, it would seem suitable to try to get the user to let the app operate for a continuous time interval in the foreground while it uploaded their initial set of data to the SyncServer.

3) JavaScript and syntax errors while executing: This is something of a long-standing personal annoyance. While I’ve been wanting to work with Node.js, I’ve not been looking forward to the fact that JavaScript is an interpreted language. Perhaps it’s because of the way I first learned how to program, but I just don’t like finding syntax errors in my code when I run the code. I’m sorry. No I’m not. It’s just wrong! :).

4) Inheritance and JavaScript: One of the things I’ve learned through this project is that JavaScript by itself doesn’t support class-based inheritance in the classical object-oriented sense; its object model is built on prototypes instead. I didn’t know that before. I have to figure out some way to do this. While I’ve not had a strong need for a design using inheritance yet, I will. For example, when it comes to incorporating multiple different cloud-based storage systems (e.g., Dropbox in addition to Google Drive), I will want to be able to do this in an object-oriented manner.

5) Lack of transactional support in MongoDB: This was surprising to me. I’ve done some PHP and MySQL programming work before this and I am accustomed to the facility that MySQL has with rollbacks. That is, if you have made a series of changes to a database with MySQL, and then later find that an error has occurred, you can roll back the entire series of changes. MongoDB, from what I can tell so far, does not have this support.
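Returning to item 4) above: JavaScript's inheritance is prototype-based, and the ES2015 `class` syntax available in current Node.js releases is a layer over those prototypes. One possible way to organize multiple storage backends with it is sketched below. The class and method names (`CloudStorage`, `GoogleDriveStorage`, `DropboxStorage`, `uploadFile`) are hypothetical, and Maps stand in for the real services.

```javascript
// Hypothetical base class: shared behavior plus a method each backend must supply.
class CloudStorage {
  constructor(name) {
    this.name = name;
    this.files = new Map(); // stand-in for the remote service
  }

  // Subclasses override this with service-specific upload logic.
  uploadFile(uuid, contents) {
    throw new Error(this.name + ' must implement uploadFile');
  }

  // Shared by all backends through the prototype chain.
  describe() {
    return this.name + ' holds ' + this.files.size + ' file(s)';
  }
}

class GoogleDriveStorage extends CloudStorage {
  constructor() {
    super('Google Drive');
  }

  uploadFile(uuid, contents) {
    this.files.set(uuid, contents); // a real backend would call the Drive API here
    return true;
  }
}

class DropboxStorage extends CloudStorage {
  constructor() {
    super('Dropbox');
  }

  uploadFile(uuid, contents) {
    this.files.set(uuid, contents); // a real backend would call the Dropbox API here
    return true;
  }
}

// Usage: the rest of the server only needs the shared interface.
const backends = [new GoogleDriveStorage(), new DropboxStorage()];
for (const backend of backends) {
  backend.uploadFile('profile-uuid', '{"name":"Rosie"}');
  console.log(backend.describe());
}
```

The same structure can be written without `class` by using `Object.create` and constructor functions directly; the class form is shown because it reads closest to the object-oriented style mentioned in item 4).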

Assumptions
1) The data of an app can be represented as a set of files, where each file is identified by a UUID known to the app and to the SyncServer. To use the SyncServer it will be necessary that the app translate its representations into a set of files. For example, if the app makes use of Core Data, then in some manner, those Core Data objects will need to be converted into files. In the case of Petunia, my plan is to convert each Core Data object into a single file. In some apps, a set of Core Data objects could be converted to a single file.

2) The end user is willing to have an account in such a cloud-based storage system. This doesn’t seem like such a big deal. It seems like services like Google Drive and Dropbox have good market penetration.

3) The end users are not going to mess with those files. The files will be stored in a user-accessible manner in the user’s own Cloud-storage system. The app can put a “README” file in the relevant directory which says “Please don’t alter or delete the files in this directory”, but fundamentally, they are the user’s files. Of course, this is an assumption made by pretty much any program running on a desktop computer or laptop. The files exist somewhere in the file system. Typically programs depend on integrity through obscurity: The programs to some extent hide their data and assume that end users won’t know about the location of that data.

4) Initially, in the development progression of the SyncServer, the same cloud storage account will be used across devices by users wishing to share data. This could be an issue. It could be annoying, for example, for users to have to create a separate Google Drive account to share data for Petunia across separate devices. We should, however, be able to make this issue go away later when we incorporate sharing facilities into the SyncServer. For example, in a family that wanted access on multiple devices to their pet data, one person could have a Google Drive account, and other people could share that data using Petunia using their Facebook account credentials.

5) A single SyncServer instance would be used for a single app. E.g., Petunia would have its own SyncServer running. Another app, published by another developer, for example, and also using the SyncServer, would run a different SyncServer instance, likely hosted on an entirely different hosting service. Since the SyncServer would be open-source this seems reasonable. A possible revenue stream for making money off of the SyncServer would be to offer hosting and support for the SyncServer to app/business clients.

6) Lock contention will be relatively infrequent for a specific cloud storage account. That is, I’m assuming that there will be a relatively small number of concurrent accesses to a specific cloud storage account — and typically there will be only a single user modifying or accessing the data for a particular set of files in cloud storage. For example, one family using Petunia would typically consist of a few people sharing the data for a set of pets across a number of mobile devices. And rarely would multiple family members be modifying, or even accessing that pet data, at exactly the same time, where that access needed to retrieve data from cloud storage.

7) Individual file size is small enough that we are likely to get individual files transferred to/from the SyncServer in the amount of time that the app is active on the mobile device and we have a network connection. For example, if individual image file sizes are say 2MB, then a 2MB file can be transferred relatively quickly. This is jointly an assumption about file size and network transmission rate. With larger file sizes/lower network transmission rates, we’ll need to incorporate mechanisms for restarting upload or download of individual files given that part of the file was already uploaded/downloaded. We have not yet done this.

8) Worst case temporary storage size on SyncServer is sufficient. When uploading data files, the data files are first temporarily stored in a file system on the SyncServer before being transferred to the final user cloud storage. The maximum size of this temporary data is dependent on the implementation of the app, though could be restricted by the SyncServer itself. In typical usage with Petunia, say, to upload a pet profile and some pictures of the pet, the temporary data storage space would be say, 10-100 MB depending on the size of the image files. Image file size, again, could be controlled by the app. The worst case, for Petunia, would likely be the initial conversion to using the SyncServer. If the user had many pets and images, this data size could be substantial, e.g., 100MB-1GB. If many users were simultaneously converting over to using the SyncServer, this could place a substantial temporary file space burden on the SyncServer. Likely we will have to place limits on the temporary file size that can be used by any given uploading app instance. The Petunia app could work within these limits in the case of converting to use the SyncServer by uploading the data for a single logical group of information at a time: E.g., the data for a single pet.

9) Long-term size requirements for Mongo collections are relatively small. As I write this, the SyncServer has five Mongo collections (again, think tables in MySQL). These collections represent: Locks, Operation Id’s, a File Index, Outbound File Changes, and User Credentials. The following table presents a synopsis of the usage and approximate data sizes involved in these collections.

For the following, it is important to realize that the SyncServer’s concept of a user is a single distinct account on a cloud storage system. E.g., a Google Drive user account.

| Mongo Collection | Purpose | Life Span | Size |
| --- | --- | --- | --- |
| File Index | Per user, records one document (think: one table row) per file stored in cloud storage. | Permanent. This will be the largest collection used by the SyncServer. Entries are typically only ever added, never deleted[1]. | 8 fields per row; approx. 150 bytes per row; estimate 72KB per user assuming 500 files per user. |
| User Credentials | Per user, this has identifying information and information granting access to cloud storage. | Permanent. Some users may remove their accounts, but largely this will just grow with the number of users. | Size varies according to the specific cloud storage system. For Google Drive, has 6 fields and is approx. 200 bytes per user. |
| Locks | To make the upload of a logical group of files atomic, this is used to exclude other app instances from uploading to the same specific cloud storage. | Temporary. Locks are created and removed as app instances need. | The sizes of these temporary entries will be swamped out by the sizes of the uploaded files. E.g., a 1MB image file is much larger than 200 bytes for a lock in a Mongo collection. |
| Operation Id’s | Used to track data transfer operations to cloud storage from the SyncServer for specific users. | Temporary. Removed after transfer of data to cloud storage has completed. | |
| Outbound File Changes | Entries in this collection are created per file that will be transferred to cloud storage. | Temporary. After transferring files to cloud storage, entries are removed from this collection. | |

[1] File Index entries can be marked as deleted, but to allow for proper synchronization across devices, entries are never actually deleted. Of course, an entire collection of file index entries could be deleted if the user decides to remove their SyncServer account.
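The mark-as-deleted behavior in the footnote, and the File Index size estimate, can be illustrated with a small sketch. Everything here is an assumption for illustration: the names `fileIndex`, `addFile`, `markDeleted`, `liveFiles`, and `estimateIndexBytes`, the `deleted` field, and the in-memory Map standing in for the Mongo collection.

```javascript
// Hypothetical stand-in for the File Index collection: one entry per file.
const fileIndex = new Map(); // key: userId + '/' + uuid

function addFile(userId, uuid) {
  fileIndex.set(userId + '/' + uuid, { userId, uuid, deleted: false });
}

// Mark an entry as deleted but keep it, so that a device syncing later
// can still learn that the file was deleted.
function markDeleted(userId, uuid) {
  const entry = fileIndex.get(userId + '/' + uuid);
  if (entry === undefined) return false;
  entry.deleted = true;
  return true;
}

// UUIDs of the files a user currently has (not marked as deleted).
function liveFiles(userId) {
  return Array.from(fileIndex.values())
    .filter((entry) => entry.userId === userId && !entry.deleted)
    .map((entry) => entry.uuid);
}

// Rough per-user size of the index, in bytes.
function estimateIndexBytes(filesPerUser, bytesPerRow) {
  return filesPerUser * bytesPerRow;
}

// Usage
addFile('user-1', 'profile-uuid');
addFile('user-1', 'photo-uuid');
markDeleted('user-1', 'photo-uuid');
console.log(liveFiles('user-1'));          // [ 'profile-uuid' ]
console.log(fileIndex.size);               // 2
console.log(estimateIndexBytes(500, 150)); // 75000
```

With exactly 150 bytes per row and 500 files, the estimate comes to 75,000 bytes, which is in the same range as the 72KB per user given in the table.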

About the author: Christopher G. Prince has his B.Sc. in computer science (University of Victoria, B.C., Canada), an M.A. in animal psychology (University of Hawaii, Manoa), an M.S. in computer science (University of Louisiana, Lafayette, USA), and a Ph.D. in computer science (University of Louisiana, Lafayette, USA). His M.S. and Ph.D., while officially in computer science, were unofficially in cognitive science, split between animal psychology and computer science. Chris is a dedicated animal person, and has also developed: Catsy Caty Toy, a customizable and shareable iPhone and iPad app for your cats (http://GetCatsy.com), Petunia, an app for recording and sharing pet health information (http://GetPetunia.com), and WhatDidILike, an iPhone app to keep track of restaurants and food that you like (http://WhatDidILike.com).

Creative Commons License
“The SyncServer: Permanent Access to Your App Data” by Christopher G. Prince is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at chris@SpasticMuffin.biz.