Multithreading, Swift and Core Data

The Swift documentation on multithreading is poor, particularly with respect to Core Data, so here is what I have learned. I assume you are familiar with basic multithreading and Core Data.

Traditional Unix multithreading

The standard for multithreading on Unix has been defined by POSIX since 1995. It gives programmers a complete set of functions for managing threads and transferring data between threads. Some functions are heavyweight, some are lightweight, and there are enough options that it’s not necessary to contort one’s program to fit the library. The standard gives a two-sentence explanation for how the functions synchronize memory access, which is almost enough information to write well-defined multithreaded C programs. The drafters of the standard felt that it was more important to be understandable than to be technically complete. “Formal definitions of the memory model were rejected as unreadable by the vast majority of programmers. ... It was believed that a simple statement intuitive to most programmers would be most effective.”

Alternatively, programmers can use the new multithreading features in the 2011 version of the C standard. The 2011 C standard sets precise rules for when loads and stores are atomic, when loads and stores can occur out of order and when loads and stores can happen earlier or later than the programmer might otherwise expect.

Core Data multithreading

Apple provides less information about multithreading and Core Data. If you are not familiar with Core Data, it is essentially an easy-to-use, object-oriented database that is integrated with Swift. You use Core Data by defining types of “managed objects,” which are similar to normal Swift objects but have the ability to be saved into a database. Managed objects in Core Data are created, fetched and saved by an object called a “managed object context.”

A managed object context has a worker thread attached to it (strictly speaking, a private dispatch queue). All of the work you do with managed objects is performed on these worker threads, by passing closures to the perform or performAndWait method of the managed object context. (Exception: A managed object context on the main thread doesn’t have a separate worker thread. The main thread can work on managed objects from this managed object context without using closures.) For example, if you want to create a File managed object on a background thread, you might create a managed object context and pass it a closure which does the work to create the File managed object, saves the managed object context and then exits.
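Here is a minimal sketch of that File example. The model is built in code and kept in an in-memory store so the sketch stands alone; the entity name "File", the attribute "name" and the helper functions makeModel, makeContainer and createFile are my own assumptions for illustration, not anything Apple prescribes.

```swift
import CoreData

// Hypothetical model: one "File" entity with an optional "name" string,
// built in code so the sketch needs no .xcdatamodeld file.
func makeModel() -> NSManagedObjectModel {
    let name = NSAttributeDescription()
    name.name = "name"
    name.attributeType = .stringAttributeType
    name.isOptional = true

    let file = NSEntityDescription()
    file.name = "File"
    file.properties = [name]

    let model = NSManagedObjectModel()
    model.entities = [file]
    return model
}

// In-memory store, so nothing is written to disk.
func makeContainer() -> NSPersistentContainer {
    let container = NSPersistentContainer(name: "Sketch", managedObjectModel: makeModel())
    let description = NSPersistentStoreDescription()
    description.type = NSInMemoryStoreType
    container.persistentStoreDescriptions = [description]
    container.loadPersistentStores { _, error in
        if let error = error { fatalError("Could not load store: \(error)") }
    }
    return container
}

// Creates and saves a File on a background context. Returns the object ID,
// or nil if the save failed.
func createFile(named fileName: String, in container: NSPersistentContainer) -> NSManagedObjectID? {
    let context = container.newBackgroundContext()
    var savedID: NSManagedObjectID?
    // performAndWait runs the closure on the context's queue and blocks the
    // caller until it finishes; perform would schedule it and return at once.
    context.performAndWait {
        let file = NSEntityDescription.insertNewObject(forEntityName: "File", into: context)
        file.setValue(fileName, forKey: "name")
        do {
            try context.save()
            savedID = file.objectID // permanent once the save has succeeded
        } catch {
            print("Save failed: \(error)")
        }
    }
    return savedID
}
```

Calling createFile(named: "photo.jpg", in: makeContainer()) from any thread should hand back the object ID of the new File. The object ID, rather than the Swift object, is the thing that is safe to pass to another managed object context.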

As long as a managed object is in memory, it is represented by a Swift object. In fact it is possible for two or more Swift objects to represent the same managed object. This is because Swift objects representing a managed object cannot be shared across managed object contexts (i.e., they are not thread-safe). Each managed object context accessing a managed object has its own Swift object representing the managed object. The Swift objects can be distinguished by their memory addresses (as with any ordinary Swift object) and the managed object is identified by an “object ID” assigned by Core Data.

What happens if both managed object contexts make changes to the same managed object, then try to save the changes? This is an error, even if the changes are identical. Core Data will detect the conflict when the second managed object context saves, and the save will fail with an error; if the error is not handled, your program halts. This conflict is called an “optimistic locking failure.” Core Data offers a few ways to resolve optimistic locking failures. The crudest way is to set a “merge policy” which is often used to tell Core Data to stomp over inconsistent data. (If you’ve ever wondered why your iPhone apps seem to become corrupted over time, this is one reason why.) The better way is to use thread synchronization primitives to prevent conflicts. Unfortunately this is not as straightforward as it should be.
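To make the failure concrete, here is a sketch that provokes an optimistic locking failure between two managed object contexts and then applies the crude fix of a merge policy. The "File" entity, the temporary SQLite store and the function names are assumptions for illustration, and try! is used only to keep the sketch short.

```swift
import CoreData

// Hypothetical model: one "File" entity with an optional "name" string.
func makeModel() -> NSManagedObjectModel {
    let name = NSAttributeDescription()
    name.name = "name"
    name.attributeType = .stringAttributeType
    name.isOptional = true
    let file = NSEntityDescription()
    file.name = "File"
    file.properties = [name]
    let model = NSManagedObjectModel()
    model.entities = [file]
    return model
}

// SQLite store in the temporary directory (left behind; delete it in real code).
func makeContainer() -> NSPersistentContainer {
    let container = NSPersistentContainer(name: "Sketch", managedObjectModel: makeModel())
    let url = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString + ".sqlite")
    container.persistentStoreDescriptions = [NSPersistentStoreDescription(url: url)]
    container.loadPersistentStores { _, error in
        if let error = error { fatalError("Could not load store: \(error)") }
    }
    return container
}

// Returns whether the second save conflicted, and the name that ended up stored.
func demonstrateConflict(in container: NSPersistentContainer) -> (conflicted: Bool, finalName: String?) {
    let contextA = container.newBackgroundContext()
    let contextB = container.newBackgroundContext()
    var fileID: NSManagedObjectID!
    var conflicted = false
    var finalName: String?

    contextA.performAndWait {
        let file = NSEntityDescription.insertNewObject(forEntityName: "File", into: contextA)
        file.setValue("original", forKey: "name")
        try! contextA.save()
        fileID = file.objectID
    }
    // B loads its own Swift object for the same managed object and edits it.
    contextB.performAndWait {
        let file = try! contextB.existingObject(with: fileID)
        file.setValue("from B", forKey: "name")
    }
    // A edits and saves first, so B's copy is now out of date.
    contextA.performAndWait {
        let file = try! contextA.existingObject(with: fileID)
        file.setValue("from A", forKey: "name")
        try! contextA.save()
    }
    contextB.performAndWait {
        do {
            try contextB.save() // default policy: fail on conflict
        } catch {
            conflicted = (error as NSError).code == NSManagedObjectMergeError
            // The crude fix: let B's in-memory values overwrite the store.
            contextB.mergePolicy = NSMergePolicy.mergeByPropertyObjectTrump
            try? contextB.save()
        }
    }
    // Read the result back through a fresh context.
    let reader = container.newBackgroundContext()
    reader.performAndWait {
        finalName = (try? reader.existingObject(with: fileID))?.value(forKey: "name") as? String
    }
    return (conflicted, finalName)
}
```

In this sketch the change saved by context A is silently discarded once the merge policy is set, which is exactly the stomping behavior described above.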

The difficulty with Core Data

As a general matter, for thread synchronization to work in any language, the compiler must recognize and respect the thread synchronization primitives. For example, if POSIX allowed a variable access to be hoisted above a semaphore acquisition, the semaphore would not protect the variable. Unfortunately, Core Data caches data in a way that means that it effectively does not respect thread synchronization primitives. Apparently, the entire time from when a managed object is first fetched until all changes to the managed object are complete and saved must be protected by a critical section.

Continuing the example from before, along with File managed objects you might have Category managed objects to describe what kind of files they are. For example, you might have 1,000 File managed objects related to a “JPEG” Category managed object. When you modify a File managed object to relate it to a Category managed object, you are also modifying the Category managed object to relate back to the File managed object because Core Data relationships are two-way. Therefore, if two managed object contexts running at the same time create File managed objects related to the same Category managed object without thread synchronization, when the managed object contexts save changes, there may be an optimistic locking failure.

The obvious solution is to simply wrap in a mutex the entire block of code where the shared managed object is fetched, modified and saved. However, this can impact performance if the shared managed object is used for a lengthy period of time. (There is also no guarantee from Apple that this will work. Hypothetically, the managed object context could pre-fetch data before being requested by the program.)
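A sketch of the mutex approach, under some simplifying assumptions of mine: a hypothetical "Category" entity carries a "fileCount" integer in place of the real relationship, the store is in memory, and each critical section creates a fresh managed object context so that no cached copy of the Category predates the lock. The class name CategoryUpdater is made up for illustration, and, as noted above, Apple does not promise that this pattern is sufficient.

```swift
import CoreData

// Hypothetical model: a "Category" entity with a "fileCount" counter.
func makeModel() -> NSManagedObjectModel {
    let fileCount = NSAttributeDescription()
    fileCount.name = "fileCount"
    fileCount.attributeType = .integer64AttributeType
    fileCount.isOptional = false
    fileCount.defaultValue = 0
    let category = NSEntityDescription()
    category.name = "Category"
    category.properties = [fileCount]
    let model = NSManagedObjectModel()
    model.entities = [category]
    return model
}

func makeContainer() -> NSPersistentContainer {
    let container = NSPersistentContainer(name: "Sketch", managedObjectModel: makeModel())
    let description = NSPersistentStoreDescription()
    description.type = NSInMemoryStoreType
    container.persistentStoreDescriptions = [description]
    container.loadPersistentStores { _, error in
        if let error = error { fatalError("Could not load store: \(error)") }
    }
    return container
}

final class CategoryUpdater {
    private let container: NSPersistentContainer
    private let lock = NSLock() // one mutex guarding the shared Category

    init(container: NSPersistentContainer) { self.container = container }

    func createCategory() -> NSManagedObjectID {
        let context = container.newBackgroundContext()
        var id: NSManagedObjectID!
        context.performAndWait {
            let category = NSEntityDescription.insertNewObject(forEntityName: "Category", into: context)
            try! context.save() // try! only to keep the sketch short
            id = category.objectID
        }
        return id
    }

    // Fetch, modify and save all happen inside the critical section.
    func incrementFileCount(of categoryID: NSManagedObjectID) {
        lock.lock()
        defer { lock.unlock() }
        let context = container.newBackgroundContext() // nothing cached yet
        context.performAndWait {
            do {
                let category = try context.existingObject(with: categoryID)
                let count = category.value(forKey: "fileCount") as? Int64 ?? 0
                category.setValue(count + 1, forKey: "fileCount")
                try context.save()
            } catch {
                print("Update failed: \(error)")
            }
        }
    }

    func fileCount(of categoryID: NSManagedObjectID) -> Int64? {
        let context = container.newBackgroundContext()
        var count: Int64?
        context.performAndWait {
            count = (try? context.existingObject(with: categoryID))?.value(forKey: "fileCount") as? Int64
        }
        return count
    }
}
```

The cost is visible in the code: every thread that touches the Category queues up behind the same lock for the whole fetch, modify and save sequence.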

An incorrect approach to solving the problem is to refetch the shared managed object immediately before modifying it and to put the mutex only around those few lines of code. This does not work because, even if the previous Swift object representing the shared managed object has been destroyed, Core Data may return a stale version of the shared managed object from the cache of the managed object context. You might think that the refresh method and the stalenessInterval property would solve the caching problem, but Apple warns “the staleness interval is a hint and may not be supported by all persistent store types,” which means that even if this seems to work based on testing, it’s not guaranteed to work.

Another incorrect approach is to simply set automaticallyMergesChangesFromParent on the managed object contexts. This causes a managed object context to automatically fetch new versions of managed objects from its parent, which is either the persistent store coordinator or another managed object context. It does not mean that all references to a managed object will be kept in perfect sync at all times. The managed object still needs to be fetched, modified and saved by each managed object context, which means there’s a race condition. It is therefore still possible for two references to the same managed object to become inconsistent, resulting in an optimistic locking failure. Is it possible to combine this approach with the re-fetch with mutex approach? Possibly, but it did not work for me.

For what it’s worth, I settled on the first approach, reorganizing my code to minimize the performance impact.

More notes on concurrency and Core Data

To synchronize access to a managed object class, you may want to add an NSRecursiveLock member to the managed object class. However, managed objects can only contain Core Data attributes, and an NSRecursiveLock is not a Core Data attribute. A workaround is the Objective-C functions objc_sync_enter and objc_sync_exit, which add locking functionality to an object without requiring space in the object. (They work by keeping a private dictionary of object addresses and locks. These are the functions used internally by the Objective-C @synchronized syntax.) You might think this doesn’t solve anything, because we want to add a lock to a class, not a specific object. However, writing .self after the class name (for example, File.self) yields a metatype object which is shared by all instances of that class. This object can be passed to objc_sync_enter and objc_sync_exit. As a bonus, these functions use a re-entrant lock, which simplifies code design.
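A small sketch of class-level locking with these functions. The File class here is a plain NSObject stand-in for a managed object class, and the synchronized helper and countConcurrently function are names I made up for illustration.

```swift
import Foundation

// Stand-in for a managed object class; in a real app this would be an
// NSManagedObject subclass.
final class File: NSObject {}

// Runs body while holding the lock that objc_sync_enter associates with token.
func synchronized<T>(on token: AnyObject, _ body: () throws -> T) rethrows -> T {
    objc_sync_enter(token)
    defer { objc_sync_exit(token) }
    return try body()
}

// Usage: many threads bump a shared counter, all guarded by the lock attached
// to File.self (the metatype object, one per class).
func countConcurrently(iterations: Int) -> Int {
    final class Counter { var value = 0 }
    let counter = Counter()
    DispatchQueue.concurrentPerform(iterations: iterations) { _ in
        synchronized(on: File.self) {
            counter.value += 1
        }
    }
    return counter.value
}

// The lock is re-entrant: nesting on the same token does not deadlock.
func nestedLockValue() -> Int {
    return synchronized(on: File.self) {
        synchronized(on: File.self) { 42 }
    }
}
```

Because the token is the class itself, every piece of code that locks on File.self contends for the same lock, regardless of which managed object context or File instance it is working with.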

To synchronize access to a single managed object, you have the same problem that NSRecursiveLock is not an attribute. Here objc_sync_enter and objc_sync_exit are no help, because they require an object address, which a managed object does not have. (A managed object has an object ID, which is not the same thing. A Swift object representing a managed object has an address, but there can be multiple Swift objects representing the same managed object.) The solution is to make a utility class which contains a Dictionary mapping Core Data object IDs to NSRecursiveLock instances, with methods to lock and unlock.
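One possible shape for such a utility class is sketched below. The names KeyedLocks and ManagedObjectLocks are hypothetical, and the class is written generically over any Hashable key so that it can be exercised without a Core Data stack; with Core Data the key type would be NSManagedObjectID.

```swift
import CoreData

// A table of re-entrant locks, one per key, created on first use.
final class KeyedLocks<Key: Hashable> {
    private var locks: [Key: NSRecursiveLock] = [:]
    private let tableLock = NSLock() // guards the dictionary itself

    // Looks up the lock for key, creating it if needed.
    private func recursiveLock(for key: Key) -> NSRecursiveLock {
        tableLock.lock()
        defer { tableLock.unlock() }
        if let existing = locks[key] { return existing }
        let created = NSRecursiveLock()
        locks[key] = created
        return created
    }

    func lock(_ key: Key) { recursiveLock(for: key).lock() }
    func unlock(_ key: Key) { recursiveLock(for: key).unlock() }

    // Convenience wrapper so an early return cannot skip the unlock.
    func withLock<T>(_ key: Key, _ body: () throws -> T) rethrows -> T {
        lock(key)
        defer { unlock(key) }
        return try body()
    }
}

// With Core Data, the key is the managed object's ID.
typealias ManagedObjectLocks = KeyedLocks<NSManagedObjectID>
```

Two caveats with this sketch. A newly inserted managed object has a temporary object ID that is replaced when it is saved, so take the lock only on permanent IDs (obtainPermanentIDs(for:) on the managed object context can provide them early). Also, entries are never removed from the dictionary, which is fine for a modest number of objects but would need pruning in a long-running app.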