Site Reliability Engineering for Native Mobile Apps
Mobile apps play a vital role in how users perceive the reliability of a service. Indeed, users almost always interact with the service via mobile apps. It’s often assumed that mobile app development is simple, but this is far from true. In fact, delivering reliable apps at scale poses a number of challenges, including those listed below:
- A plethora of device variations, especially when it comes to display sizes
- Limited battery capacity
- Memory constraints
- Ensuring backward compatibility on multiple versions of the OS
- Varying network conditions
- Large teams coordinating across the organization
- No change control. Unlike with servers, users who have a problem with a build can’t roll back the binary.
Given the above challenges, continuous delivery of features is a daunting task.
Site reliability engineering (SRE) is an approach founded on principles, practices, and organizational dynamics that aims to ensure the reliability of continuous application development. Large-scale, mature organizations like Netflix and Dropbox adopt this approach for greater feature velocity with improved reliability.
SRE started with the aim of achieving reliability for large-scale distributed systems. In this article, we will describe how SRE principles can be applied to the reliability of mobile apps. Established organizations and startups like Halodoc follow this approach, albeit not explicitly, with the help of various tools, processes, and organizational dynamics, as will be discussed here.
This article is divided into two sections. First, we will describe the key SRE tenets for mobile apps. Then we will delve into organization topology, i.e. how the organization can be designed to adopt SRE with mobile apps.
- 1 SRE tenets for Mobile Apps
- 2 Organization topology
- 3 Conclusion
SRE tenets for Mobile Apps
Achieving 100% reliability is the wrong target once its cost is taken into account. Site reliability engineering strives instead to balance the risk of unavailability with the goal of greater feature velocity. The end goal is to keep both the business and end users happy.
Measuring app risks
Measuring apps against a risk tolerance, defined in terms of an acceptable level of unplanned downtime, is essential. It acts as a guard rail against unanticipated risks and, with the help of alerts, lets us take the necessary actions at the right time.
In SRE, we make use of service level indicators (SLIs), objectives (SLOs), and agreements (SLAs) to describe the basic properties of the metrics that matter. Choosing the right metrics helps the team take the necessary action at the right time and also gives it confidence.
Service level agreements (SLAs) are contracts between a team developing a service and its users that define a set of objectives (SLOs) in terms of availability, responsiveness, etc. A service level objective (SLO) is a target within an SLA for a specific metric, such as responsiveness. A service level indicator (SLI) is a quantitative measurement of some aspect of that metric, e.g., the 95th percentile latency of homepage requests over the past 5 minutes. An SLO then sets a target for the SLI, e.g., that this latency stays below 300 ms.
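To make the distinction concrete, here is a minimal sketch of computing the 95th percentile latency SLI and checking it against the 300 ms objective from the example above. The sample latencies and the `percentile` helper (a nearest-rank percentile) are assumptions made for illustration, not Halodoc's actual tooling.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical homepage request latencies (ms) from the past 5 minutes.
latencies_ms = [120, 135, 150, 160, 180, 190, 210, 240, 280, 450]

SLO_P95_MS = 300  # target taken from the example in the text

sli_p95_ms = percentile(latencies_ms, 95)  # the SLI: a measurement
slo_met = sli_p95_ms < SLO_P95_MS          # the SLO: a target for the SLI
```

In this made-up sample, a single slow request is enough to push the 95th percentile over the target, which is the kind of signal an alert would be built on.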
App availability is one of the most important metrics to measure reliability. Two broad categories where the app becomes unavailable are crashes and app version management.
Frequent crashes will make any app unusable. Fortunately, a great set of tools, such as Firebase Crashlytics, is available to help monitor them. Unhandled exceptions in particular need to be fixed immediately, with high priority.
At Halodoc, we follow these rules:
- Respond to Firebase Crashlytics velocity alerts. We use dedicated Slack channels with relevant engineers to monitor and quickly respond to high-velocity issues.
- Monitor Android Vitals to detect “Application Not Responding” conditions and fix them.
- Fuzz test with tools like Monkey and SwiftMonkey to stress the apps.
- Roll out apps in phases. This reduces the blast radius, i.e., it limits the impact to fewer users, and allows you to halt the rollout and release a newer version of the app.
- Wrap changes with feature flags. This helps in reverting any buggy features on the fly. It also provides greater flexibility when the bugs are caused by server changes.
- Update the server response to circumvent crashes. For example, if a server response causes a crash due to parsing or some unanticipated values, the response can be changed as an immediate first step to contain the damage.
- Define SLI and SLO for crash-free sessions.
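As an illustration of the last rule, the sketch below computes a crash-free sessions SLI and the share of the error budget still unused. The 99.5% objective and the session counts are hypothetical numbers chosen for the example, not Halodoc's actual targets.

```python
def crash_free_rate(total_sessions, crashed_sessions):
    """SLI: fraction of sessions that ended without a crash."""
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    return 1 - crashed_sessions / total_sessions


def error_budget_remaining(total_sessions, crashed_sessions, slo):
    """Fraction of the allowed crashed sessions not yet used.
    Negative when the budget is exhausted."""
    allowed_crashes = (1 - slo) * total_sessions
    return 1 - crashed_sessions / allowed_crashes


CRASH_FREE_SLO = 0.995  # hypothetical objective: 99.5% crash-free sessions

rate = crash_free_rate(total_sessions=200_000, crashed_sessions=600)
budget = error_budget_remaining(200_000, 600, CRASH_FREE_SLO)
slo_met = rate >= CRASH_FREE_SLO
```

With these numbers, 600 of the roughly 1,000 allowed crashed sessions are used, so about 40% of the budget remains. A shrinking budget is a reasonable trigger for slowing feature work in favor of stability fixes.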
Performance and Efficiency
Apps run on devices that depend on batteries and flaky networks and have limited storage, CPU, and memory. Mobile operating systems themselves kill resource-hogging apps, which makes those apps unavailable.
Performance plays a vital role, and we need to define SLIs for key flows which impact users and business.
Platform tools such as Android Vitals, Xcode Instruments, and Firebase Performance Monitoring help in identifying and monitoring the defined SLIs. As an example, we can define custom engineering metrics that are logged to Firebase Performance Monitoring. Additionally, we can define engineering events, which are non-product events used to measure non-functional behaviors, to be logged to another analytics engine. These engineering events include the response time as an additional parameter, which helps us detect and monitor anomalies.
An event of this kind, for example one recording the homepage load time, helps measure the SLI of homepage load latency. If the value breaches the defined SLO, an anomaly alert is raised.
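The sketch below shows one way such an engineering event could be shaped: an action is timed and the elapsed time is attached to the event as a parameter. The names `EngineeringEvent`, `timed_event`, `breaches_slo`, and `homepage_load` are hypothetical, and sending the event to an analytics backend is deliberately left out.

```python
import time
from dataclasses import dataclass


@dataclass
class EngineeringEvent:
    """A non-product event carrying a response time as a parameter."""
    name: str
    response_time_ms: float


def timed_event(name, action, clock=time.perf_counter):
    """Run `action` and return an event holding how long it took."""
    start = clock()
    action()
    elapsed_ms = (clock() - start) * 1000
    return EngineeringEvent(name, elapsed_ms)


def breaches_slo(event, slo_ms):
    """True when the measured latency exceeds the objective."""
    return event.response_time_ms > slo_ms


# Usage with a fake clock so the example is deterministic:
# the "homepage load" appears to take 0.45 seconds.
ticks = iter([0.0, 0.45])
event = timed_event("homepage_load", lambda: None, clock=lambda: next(ticks))
alert = breaches_slo(event, slo_ms=300)
```

Injecting the clock keeps the measurement logic testable without real delays. In an app, the default monotonic clock would be used.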
At Halodoc, we built NFMonitor, a tool to monitor non-functional requirements like consumed memory, network bandwidth, etc. for critical business flows. Using such tools, any anomalies or SLA violations can be detected and fixed prior to release.
The ability of mobile apps to respond to backend state changes with minimal latency is crucial for user experience, e.g., when an order’s state changes from confirmed to delivered. Polling solves the problem of fetching data, but at the cost of latency. At Halodoc, we make use of our in-house push mechanism, LiveConnect, which makes it possible to avoid polling and pushes certain commands to clients with near-zero latency. Uber implemented their own push platform, RAMEN, to solve the same problem.
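To see where the latency cost of polling comes from, consider a simplified model in which the client polls at fixed multiples of an interval and network time is ignored. Under those assumptions, a change is only noticed at the next poll, so on average the client lags by about half the interval. The function names and the 10-second interval below are illustrative only.

```python
import math


def polling_delay(change_time_s, poll_interval_s):
    """Seconds between a backend state change and the next poll,
    assuming polls happen at multiples of poll_interval_s."""
    next_poll = math.ceil(change_time_s / poll_interval_s) * poll_interval_s
    return next_poll - change_time_s


def mean_polling_delay(change_times_s, poll_interval_s):
    """Average delay over a set of state-change times."""
    delays = [polling_delay(t, poll_interval_s) for t in change_times_s]
    return sum(delays) / len(delays)


# State changes spread evenly across one 10-second polling interval.
change_times = [0.5 + i for i in range(10)]  # 0.5, 1.5, ..., 9.5
average_delay = mean_polling_delay(change_times, poll_interval_s=10)
```

Shortening the interval reduces this lag but increases request volume and battery use, which is the trade-off a push mechanism avoids.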
Debugging mobile apps is a challenge. There are many platform-supported tools for debugging while in development, but in production, it’s a different story altogether. Solving issues in production with a good turnaround time is key since it will reduce the non-availability of the apps.
Unlike server logs, debug logs and other diagnostic information from mobile apps in production are not readily available. At Halodoc, we make use of our in-house framework, Transporter, to get the logs required for debugging at scale, in real time, from our production apps.
App version management
Unlike server applications, native mobile apps can’t be reverted once rolled out. This makes bugs introduced in a release hard to fix. As a precautionary measure, new releases need to be rolled out progressively, as the following table exemplifies.
For each rollout phase, the stability of the app version is analyzed before moving to the next step. To adopt phased rollout, please check out the platform-specific details for Android and iOS.
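The gating logic described above can be sketched as follows. The rollout percentages and the crash-free threshold are hypothetical placeholders chosen for the example. They are not the values from the article's table, and the function does not call any store API.

```python
# Hypothetical rollout steps (percent of users); not the article's table.
ROLLOUT_STEPS = [1, 5, 20, 50, 100]


def next_rollout_step(current_pct, crash_free_rate, min_crash_free_rate=0.995):
    """Return the next rollout percentage, or None to halt the rollout.

    The release only advances when the stability observed at the
    current step meets the (assumed) crash-free threshold.
    """
    if crash_free_rate < min_crash_free_rate:
        return None  # halt and ship a fixed version instead
    later_steps = [step for step in ROLLOUT_STEPS if step > current_pct]
    return later_steps[0] if later_steps else current_pct


advance = next_rollout_step(current_pct=5, crash_free_rate=0.998)
halt = next_rollout_step(current_pct=5, crash_free_rate=0.990)
```

In practice the stability check would likely combine several signals (crashes, ANRs, key-flow latency), but the shape of the decision stays the same: advance, or halt and roll forward.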
Users are slow to adopt the latest version of an app, so from a maintenance standpoint, the fewer releases there are, the better. However, releasing less often hurts the business, especially one run in an agile way. In an agile environment, new features are released frequently, and the business benefits from users being on the latest version.
This clearly poses two problems:
- New features are not available to users of an older version of the app
- Engineering resources are spent on maintaining legacy features.
At Halodoc, we mitigate this using in-app updates, which have helped us expedite new version adoption and roll forward. We followed the same strategy of flexible and immediate updates on iOS as well; there, however, the updates were managed via the App Store.
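One possible policy for choosing between flexible and immediate updates is sketched below: versions older than a minimum supported version are forced to update, while other outdated versions are only prompted. This policy, the `update_mode` function, and the version numbers are assumptions for illustration. The article does not describe how Halodoc makes this decision.

```python
def parse_version(text):
    """Turn a dotted version such as '4.10.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def update_mode(installed, latest, min_supported):
    """Pick an update flow for the installed app version.

    'immediate': blocking update, the version is no longer supported.
    'flexible':  optional update that can proceed in the background.
    'none':      the app is already up to date.
    """
    installed_version = parse_version(installed)
    if installed_version < parse_version(min_supported):
        return "immediate"
    if installed_version < parse_version(latest):
        return "flexible"
    return "none"


forced = update_mode("3.9.0", latest="4.2.0", min_supported="4.0.0")
optional = update_mode("4.1.0", latest="4.2.0", min_supported="4.0.0")
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "4.10.0" as older than "4.9.0".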
Monitoring and Alerting
Monitoring and alerting on issues and anomalous behavior at the right time helps resolve them more quickly. We have already mentioned Firebase Crashlytics velocity alerts and how they help us respond to high-velocity issues.
Alerting on issues while a feature is in development is important as well. At Halodoc, we make use of tools such as StrictMode to catch bad coding practices. This is one of the best white-box monitoring tools to help developers identify issues, especially in larger teams.
For memory leaks, we use LeakCanary for Android and Memgraph and MLeaksFinder for iOS. These tools help us filter out memory leaks before release.
Security is an important aspect, and ensuring that all the features we deliver are secure is an important step. At Halodoc, we use MobSF alerts and dependency checks to identify and respond to any security issues or concerns. Integrating them into the CI/CD pipeline helps in monitoring the builds going out to production.
Postmortem Culture: Learning from Failure
Every production outage needs a postmortem analysis. At Halodoc, we write and publish root cause analysis (RCA) documents. Typically, an RCA document covers:
- The problem statement
- Impact on the business
- The timeline of acknowledgment of the problem
- Short-term fixes to get the business going
- Long-term fixes with learnings
RCA documents are reviewed and then shared across the tech groups for wider learning and comments. The learnings from these documents serve as a playbook for the teams to handle future incidents.
Organization topology
Organization topology plays an important role in adopting the SRE mindset. I will recount here my experience at Halodoc with the way we structured our organization to achieve this.
Already at an early stage of Halodoc’s life, we recognized that building and running software systems requires a sociotechnical approach, not an assembly line like in a factory. We adopted this approach in three phases.
Phase 1: Developing tools and operating principles
In the above picture, Digital outpatient, In-Hospital Services, and Insurance are stream-aligned teams at Halodoc, i.e. teams aligned to a flow of work from a segment of the business domain.
The platform team interacts with the stream-aligned teams to understand both the functional and non-functional needs and the problems in developing mobile applications reliably. The various tools and methodologies referred to in this article were developed to that end.
Phase 2: Socializing and adopting SRE
In the second phase, the platform team acted as enablers for the stream-aligned teams, working in close collaboration with them to ensure that SRE principles and practices were followed. A feedback loop was also established to continuously improve the SRE of mobile applications.
Phase 3: Stream-aligned teams operating autonomously
The end goal of enabling teams is to increase the autonomy of stream-aligned teams by growing their capabilities in SRE. We usually had a few weeks of overlap between the teams to achieve autonomy, with the exact duration depending on the problem being solved. From there on, the practices tend to be business as usual for stream-aligned teams.
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR). The most relevant metric for these teams to measure against is MTTR. With good incident response, postmortems, and help from enablers, stream-aligned teams constantly work to maintain a good MTTR. Stream-aligned teams also defined SLAs with various stakeholders, such as product, business, and operations teams, and strove to abide by them.
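One common way to express the relationship between these two quantities is steady-state availability, MTTF / (MTTF + MTTR). The article does not state which formula it has in mind, so the sketch below is an assumption, and the hour values are made up.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the share of time the system is up."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("times must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers: a failure every 99 hours on average.
slow_repair = availability(mttf_hours=99, mttr_hours=1)    # about 0.99
fast_repair = availability(mttf_hours=99, mttr_hours=0.5)  # about 0.995
```

With the failure rate held constant, halving the repair time roughly halves the unavailability, which is why MTTR is the lever these teams focus on.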
Many startups adopt SRE only at a very late stage of their growth, mostly due to a lack of resources. This may have a high impact on the overall development cost. The approach taken by Halodoc can instead be employed by early-stage startups with fewer resources.
Conclusion
Delivering reliable mobile apps at scale is a challenge. Adopting the SRE approach early has served Halodoc well. The right set of tools and practices enables this process, along with CI/CD, which quickly becomes the development backbone. SRE is a journey, and a sociotechnical approach to organization helps achieve it.