The engineering team at Facebook is constantly tuning its systems and apps to run more smoothly and better suit users, particularly those on mobile devices.
To that end, the Facebook team set out to reduce the number of crashes in the Facebook iOS app and increase its reliability.
“In the past, most of the crashes have been due to programmatic errors, and they always came with a stack trace that blamed the line in the code that caused the crash and always offered a hint as to what the issue might be,” said a post on the Facebook engineering blog written by Ali Ansari and Grzegorz Pstrucha, two engineers at Facebook.
According to the post, Facebook witnessed a drop in its measured crash rate but also noticed from App Store reviews that the community was still frustrated with the app crashing.
“We dug into the user reports and began to theorize that out-of-memory events (OOMs) might be happening,” the post said. “OOMs occur when the system runs low on memory and the OS kills the app to reclaim memory. It can happen whether the app is in the foreground or the background. We refer to these internally as FOOMs and BOOMs, respectively — it’s just a bit more fun to say that the app went BOOM!”
Facebook eventually fixed the problem with a combination of internal tooling, migration to the newest iOS technologies and some additional cleverness that helped the company accurately measure crash and reliability problems in the first place.
To get a handle on how often the app was being terminated by OOMs, Facebook started counting all the known paths through which the application could be terminated and then logging them, the blog said. The question the team asked was, “What can cause the application to start up?” The company came up with six reasons why an app could need to start up:
* The app was upgraded.
* The app called exit or abort.
* The app crashed.
* A user swiped up to force-exit the application.
* The device restarted (which includes an OS upgrade).
* The app ran out of memory (an OOM) in the background or the foreground.
“By process of elimination, looking for instances that didn’t fall into the other cases, we could then figure out when an OOM had occurred,” the post said. “We also kept track of when the app backgrounded and foregrounded so that we could accurately break down OOMs into BOOMs and FOOMs, respectively.”
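The process of elimination the engineers describe can be sketched in a few lines. The following is a minimal illustration in Python, not Facebook's implementation: the `LastSession` record, its field names and the order of the checks are all assumptions made for the example, and a real iOS app would gather these facts through its own launch and lifecycle hooks.

```python
from dataclasses import dataclass
from enum import Enum


class Termination(Enum):
    UPGRADE = "app upgraded"
    EXIT = "exit or abort"
    CRASH = "crash"
    USER_QUIT = "user force-quit"
    REBOOT = "device restart or OS upgrade"
    FOOM = "foreground OOM"
    BOOM = "background OOM"


@dataclass
class LastSession:
    """Facts saved during the previous run (all field names are hypothetical)."""
    app_version: str
    os_version: str
    boot_id: str            # any value that changes whenever the device reboots
    called_exit: bool       # set just before a deliberate exit or abort
    crash_reported: bool    # set when a crash report exists for the session
    user_quit: bool         # set when the app was told it was being terminated
    was_backgrounded: bool  # updated on every background/foreground transition


def classify(last: LastSession, app_version: str, os_version: str,
             boot_id: str) -> Termination:
    """Rule out every explainable cause; whatever is left is counted as an OOM."""
    if last.app_version != app_version:
        return Termination.UPGRADE
    if last.called_exit:
        return Termination.EXIT
    if last.crash_reported:
        return Termination.CRASH
    if last.user_quit:
        return Termination.USER_QUIT
    if last.os_version != os_version or last.boot_id != boot_id:
        return Termination.REBOOT
    # Nothing else explains the restart, so treat it as an out-of-memory kill.
    return Termination.BOOM if last.was_backgrounded else Termination.FOOM
```

Because an OOM is inferred rather than observed, the count is only as good as the other five signals: any termination path that goes unrecorded would be misreported as an OOM.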
The logging showed that there was a higher rate of OOMs on devices with less memory, “which was expected and reassuring since the application process was more likely to be evicted on a constrained-memory device,” the post said.
The team’s first effort to reduce the number of OOMs was to shrink the app’s memory footprint by proactively deallocating memory as soon as it was no longer needed.
Fixing the leaks led to some reduction in the OOM crash rate, but not the significant reduction the team was hoping for. “Next, we dived into the memory profiler in Apple’s Instruments application and noticed a repeated pattern of UIWebView allocating a lot of memory once the application opened any Web page,” the post said. “We also found that the memory was often not reclaimed, even after the user navigated away from the page and the web view was closed.”
How Facebook Cut Down the Crashes in Its iOS App
The team tried a number of optimizations, such as cleaning up the cache and clearing the content, but the memory footprint of the app’s process still grew significantly after navigating to a Web view, the team said.
However, iOS 8 introduced a new class, WKWebView, that performs most of its work in a separate process, which means that most Web-view-related memory usage is not attributed to the Facebook app’s process. In a low-memory event, the Web view’s process can be killed instead, leaving the Facebook app more likely to stay alive, the post said. “After we migrated our app to WKWebView, we did indeed see a significant reduction of OOMs in our app,” the engineers said in the post.
Yet, even after migrating to WKWebView, the team still found that a small memory leak could affect the OOM rate significantly, especially on the more memory-constrained devices. With the company’s frequent release schedule and many teams contributing to the app, Facebook knew it was important to both catch and prevent memory leaks in the apps it releases. So the company used its CT-Scan infrastructure, originally designed to test for mobile performance, to log the amount of resident memory in the process, allowing CT-Scan to flag regressions as soon as they were introduced. This has helped to keep the OOM rate much lower than when the team first started working on it.
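The post does not say how CT-Scan decides that a change is a regression. One simple way such a check could work is to compare the resident memory logged for a new build against a baseline build, as in the hedged Python sketch below. The function name, the use of medians and the 5 percent tolerance are all assumptions made for illustration, not details from Facebook.

```python
from statistics import median


def flag_regression(baseline_mb, candidate_mb, tolerance=0.05):
    """Return True when the candidate build's median resident memory exceeds
    the baseline's median by more than `tolerance` (a fraction, e.g. 0.05).

    Both arguments are lists of resident-memory samples in megabytes,
    taken from repeated runs of the same test scenario (hypothetical setup).
    """
    base = median(baseline_mb)
    cand = median(candidate_mb)
    return cand > base * (1 + tolerance)


# Example: a build that adds roughly 10 MB to a 100 MB baseline is flagged.
print(flag_regression([100, 102, 98], [110, 112, 111]))
```

Using the median rather than a single sample keeps one noisy run from triggering a false alarm, though any real system would need to tune the tolerance to its own measurement noise.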
The last key tactic the team used was to build an in-app memory profiler that tracks all Objective-C object allocations, making it possible to profile the app quickly. The team configured this in CT-Scan and in the internal builds of the app.
“Here’s how it works: For each class in the system, it keeps a running count of how many instances are currently alive,” the post said. “We can query it at any point and log the current number of objects for each class. We can then analyze this data for any abnormalities release-to-release to identify changes in the overall allocation patterns of our app, usually helping identify leaks when any count shifts drastically. We managed to implement this in a way that is performant enough that it doesn’t produce any noticeable impact in application performance to the user.”
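The bookkeeping the engineers describe, a running count of live instances per class that can be queried and compared across releases, can be modeled in a few lines. The Python sketch below is only an illustration of that idea under stated assumptions: the real profiler hooks Objective-C allocations inside the app, whereas here allocation and deallocation are reported through explicit calls, and the class names, the growth ratio and the minimum count are all hypothetical.

```python
from collections import Counter


class AllocationTracker:
    """Keeps a running count of live instances for each class name."""

    def __init__(self):
        self._live = Counter()

    def did_allocate(self, class_name):
        self._live[class_name] += 1

    def did_deallocate(self, class_name):
        self._live[class_name] -= 1

    def snapshot(self):
        """Return the current live count for every class that has any."""
        return {name: n for name, n in self._live.items() if n > 0}


def suspicious_classes(previous, current, ratio=2.0, min_count=100):
    """List classes whose live count grew at least `ratio` times between two
    snapshots (e.g. two releases), ignoring classes with few instances."""
    flagged = []
    for name, now in current.items():
        before = previous.get(name, 0)
        if now >= min_count and now >= ratio * max(before, 1):
            flagged.append(name)
    return sorted(flagged)


# Example with made-up class names: "PhotoView" jumps from 50 to 400 live objects.
old = {"FeedCell": 100, "PhotoView": 50}
new = {"FeedCell": 110, "PhotoView": 400}
print(suspicious_classes(old, new))
```

A sudden jump in one class's live count does not prove a leak, but it tells engineers where to look first, which matches the post's description of counts that shift drastically from release to release.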
After the company rolled out the changes to resolve memory issues in the Facebook iOS app, the team saw a significant decrease in (F)OOMs and in the number of user reports of the app crashing.
“OOM crashes were a blind spot for us because there is no formal system or API for observing the events and their frequency,” the post said. “No one likes it when an app suddenly closes.”