How to Design Software — Feature Flags
Take an in-depth look at how I designed, developed, and migrated a legacy system to a new feature flag system
A previous company had a problem: our deploys were thousands of lines in size, took nearly an hour, and were massively risky. Engineers hated deploying, which led to a backed-up queue of dozens of pull requests awaiting deployment. Product managers reasonably wanted to wait to release things until it was all done, which meant our releases could contain thousands of lines of code, any one of which could cause a catastrophic crash. The result? Incomplete work was piling up, value wasting away without delivery.
Something had to change.
Releasing Deploys
To help resolve these issues, I implemented technology improvements that let us separate the concept of a deploy from the concept of a release.
Previously, the deploy was the release: as soon as code made it to production, it was in front of millions of users. This made every deployment take on incredible risk.
This put engineers directly at odds with product managers. Engineers wanted to minimize risk through smaller deploys, while product managers wanted to release working, comprehensive solutions. Instead of enabling collaboration, our lack of technical capability created a combative dynamic.
We wanted to ensure that our technology could support releasing functionality at a later time than when it was deployed. Put another way, we wanted to be able to send code to production without users being able to see it, and then allow the business and product managers to own the release of features on their schedule.
To accomplish this separation, I pursued the development and integration of a centralized feature flagging system as a ninja project in-between my scheduled work.
What Are Feature Flags?
A feature flag system, also known as a feature toggle system, boils down to storing whether a feature is enabled and checking that value when you need to. If the feature is enabled, allow whatever it is you are toggling. If not, hide it from users.
Code-wise, it resolves to a boolean check and the execution (or not) of an accompanying code path. An example:
run_code_path if flag
The flag
The flag itself is the value the code checks at runtime when determining whether to run a specific piece of code.
The code path
The code path is the code being toggled on or off.
Perhaps it checks whether a page component can be displayed or not. Maybe it adds an extra fee to a transaction or not. Maybe it isn’t even completed, and you are just putting the skeleton in place.
These various reasons for hiding the code will determine the longevity of the flag, how it is used, and where it is placed, but the end mechanics are the same.
The check
The check is the code that determines whether the code should run or not. It’s the if-statement that will look at the flag and run the code path.
For us, we divided the system conceptually into two different kinds of checks: system-level checks and context-level checks.
System-level checks apply system-wide. If the check passed for one case, it passed for all cases.
Context-level checks apply within a specific context, such as the scope of a User record. A check passing in one context said nothing about whether it would pass in another.
Finally, checks could be layered. By combining system-level and context-level flags, we could toggle functionality for specific records, for records belonging to a specific user, or for records belonging to users within a specific group; the possibilities are endless.
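For illustration, a layered check might combine the two levels like this minimal sketch (the flag and helper names here are hypothetical, not our actual API):

def show_new_checkout?(user)
  # System-level: is the feature turned on at all? Acts as a kill switch.
  return false unless system_flag_enabled?(:new_checkout)

  # Context-level: is it turned on for this particular user?
  context_flag_enabled?(:new_checkout, user)
end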
What Feature Flags Are Not
One critical thing to point out is what feature flags specifically are not intended for.
Because it is a boolean check, it can be tempting to leverage it in any situation where you would have a boolean check. But just because you can doesn't mean you should. Developers can fall into the trap of using it for account-specific authorization logic simply because it exists. It is important not to mix these areas of concern.
You should never use a feature flag system as an account permission check, such as checking if a user is an administrator or not. Even if the base concept of checking a flag is the same, everything else is different enough to warrant separate consideration. Security is important, and concepts like authorization should be addressed as a primary concern within your system, not relegated to an unrelated feature flag infrastructure.
Dealing With Legacy Work
The product already had a few false starts in this area scattered throughout the codebase.
The most common feature flag implementations I saw in this product, as well as many others throughout my career, fell into these three camps:
A boolean on a record
A string array on a record
An environment variable check
The boolean column
The boolean on a record is simple: add a boolean column representing whether the feature is enabled, then check it:
process_surcharge if @cause.surcharge_enabled?
The downside of this approach is that if you have dozens of features, or flags shared across multiple kinds of records, you end up with just as many feature flag columns scattered across your system, along with various checks for them.
It also clutters database tables with details about the mechanics of how the system works rather than the domain. This gets unwieldy quickly.
The string array
The second approach was clearly designed to solve the first approach's downside around supporting multiple features.
process_surcharge if @cause.features.include?('surcharge')
It solved one problem but kept the other downsides. It also introduced an inconsistency: the already established boolean-column flags were never ported over to the new approach.
Add in a couple of years of proliferation and we ended up with dozens of places where one or the other were being used.
The environment variable
Finally, system-wide features were flagged with environment variables.
process_surcharge if ENV['IS_SURCHARGE_ENABLED']
Solving for legacy constraints
Greenfield would allow us to do anything we wanted, but we live in a legacy world. Any new solution we built had to corral all of these various partial solutions to ensure there was only one true way of toggling a feature.
Otherwise, the inconsistency would lead developers to keep creating their own mechanisms or copying one of the existing feature toggling approaches.
First Steps
We started by addressing the legacy constraints. We solved the problem of all of these disparate methods of feature checking by adding a layer of indirection and consolidating all of the differences within that layer.
The implementation
We created a service class called FeatureToggler with a method called enabled? that accepted a flag and a record:
class FeatureToggler
  def enabled?(flag, record)
  end
end
Because different kinds of records in our system each had a different way to check whether a feature was enabled, we enumerated all of those ways within the enabled? function:
def enabled?(flag, record)
  return record.features.include?(flag) if record.instance_of?(User)
  return record.send("#{flag}_enabled?") if record.instance_of?(Cause)
  # ...and so on...
end
We then replaced all of the various if x.<y>_enabled? and if z.features.include?(a) checks with calls to the new function.
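For example, the surcharge check from earlier would become something like this (illustrative; exact call sites varied):

process_surcharge if FeatureToggler.new.enabled?('surcharge', @cause)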
The code movement screamed of violations of every kind from “don’t repeat yourself” to “single responsibility principle” to “breaking the rules of basic inheritance.”
It was all for the greater good, though. Sometimes you have to make the code a bit messier before you can clean it up. Like a sliding puzzle, you may need to do the “3 steps forward, 2 steps back” dance until you complete the refactor.
Minor updates to our existing test suite ensured it continued to pass. The new consolidated interface also made it possible to split feature toggling tests out from the rest and create a dedicated suite, which made testing flagging much easier and more comprehensive.
A New Data Model
Once the FeatureToggler abstraction layer was in place, we wanted to transition to a new data model, one divorced from all of the other data models within our system.
The new data model
value was the flag itself, the value we would check in our code to determine whether a flag was running.
status was a string representing whether the flag was enabled or disabled.
owner_id and owner_type were two fields that tracked the owning record polymorphically. If the owner was a Contract record with ID 2, owner_id would be 2 and owner_type would be Contract.
Having no owner meant that the flag was intended to be a system flag, not a context flag.
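In Rails terms, the new model boiled down to a table like this minimal migration sketch (reconstructed from the fields above; the Rails version and index choices are assumptions):

class CreateFeatureFlags < ActiveRecord::Migration[6.0]
  def change
    create_table :feature_flags do |t|
      t.string :value, null: false                        # the flag identifier checked in code
      t.string :status, null: false, default: 'disabled'  # 'enabled' or 'disabled'
      t.references :owner, polymorphic: true              # nil owner means a system flag
      t.timestamps
    end
    # One flag value per owner, assuming flags are unique per record.
    add_index :feature_flags, [:owner_type, :owner_id, :value], unique: true
  end
end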
Transitioning to the New Data Model
Transitioning was a bit more complex than we desired. Our transition plan required us to transition two aspects of flags: reads and writes.
Reads were easy
Reads were straightforward: we could insert a check in our new FeatureToggler layer to look at the new data model:
def enabled?(flag, record)
  FeatureFlag.exists?(owner: record, status: 'enabled', value: flag)
end
Writes were a bit more challenging
Writes were made harder due to legacy constraints.
To allow operations to toggle features on our old system, we had a CRUD interface that was data-model aware and generated via a DSL. It was hard-coded: it knew specifically whether it was inserting into an array or setting a boolean column on a specific record.
There was no way to abstract that behavior, and we were severely limited in our ability to swap it out. A flag added in the new data model would not be reflected in the old one, and vice versa. Flags could silently go out of sync, a potential source of significant confusion.
Legacy constraints were forcing us to head down the path of an all-or-nothing release, which was explicitly an anti-goal of our initiative.
We couldn’t have both systems being written-to in parallel, which meant we had to do a complete transition if we wanted to migrate writes.
The solution: Only transition reads
We decided not to migrate writes just yet and only migrate the reads. We kept the old data model as the source of truth for changes by propagating any changes made to it down to the new data model.
We wrote a migration to take all of the feature flags we had stored in the other various approaches and copied them to the new data model.
To keep the old and new data models in sync, we added callbacks to the existing write paths, ensuring the same data was also written to the new data model everywhere flags were being written with the legacy approaches.
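Concretely, a sync callback looked something like this sketch (simplified to the Cause surcharge example from earlier; the actual callbacks covered every legacy flag location):

class Cause < ApplicationRecord
  # Keep the new data model in sync whenever the legacy boolean flips.
  after_save :sync_feature_flags

  private

  def sync_feature_flags
    return unless saved_change_to_surcharge_enabled?

    flag = FeatureFlag.find_or_initialize_by(owner: self, value: 'surcharge')
    flag.status = surcharge_enabled? ? 'enabled' : 'disabled'
    flag.save!
  end
end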
Finally, we changed the reads of feature flags to point to the new system, but also verified results via a quorum system that logged whenever the checks disagreed:
def enabled?(flag, record)
  new_result = new_enabled?(flag, record)
  old_result = old_enabled?(flag, record)

  return new_result if new_result == old_result

  log_quorum_failure(flag, record)
  old_result
end
We also decided to eat our own dog food and feature flag the feature flag check to determine whether to use the old system or the new system as a just-in-case. Talk about extra safe.
By removing the requirement of transitioning writes, we greatly simplified the project and it allowed us to deliver a small piece of value rapidly. Sometimes the simplest thing is to do nothing.
Flawless victory
We ran the system in production for a while, and it ran perfectly: we didn't see any disagreements logged except those we intentionally created as a test.
At that point, we switched over to using the new data model for all of our reads and started looking at transitioning writes.
Transitioning Writes
Because we had full confidence in the stability of the reads and the usefulness of the new system as a whole, we approached migrating writes with a whole lot more confidence.
We had two data models in play, with the old data model being the source of truth for changes. Since changes to the old data model got synced to the new data model, this meant we were still reliant on all of the limitations of the old approach.
We needed to make our new data model the source of changes. Because our old system was reliant on an inflexible DSL for feature flag modification, this meant that we had to create a new user interface to provide the same functionality that was expected before we could actually transition writes over.
It turned out to be a piece of cake with the new data model. Operationally supporting it was no more complex than inserting and reading records in a table. There were no special rules or gotchas; the interface literally boiled down to a table with some buttons, CRUD functionality in its purest form.
Releasing
The actual release was pretty straightforward.
We provided a brief training to the main group of people who modified feature flags, wrote and shared a one-sheet to document it extensively, and then provided links to the new interface where the old interfaces were.
Iterations and Improvements
Our new system was fantastic.
It allowed us to rapidly apply different kinds of feature flags or apply multiple flags within different contexts without the need for database migrations.
Cascading flag checks were now easy: we could toggle functionality for a specific record or for a specific user. And we didn't stop there. Freed from the legacy constraints, we could quickly add more functionality over time.
Descriptive enumerations
One problem with the old feature flag approach was that nobody knew what the flags actually did or meant.
Because many of them were boolean fields or string values without descriptions, it was difficult to tell exactly what the consequences of toggling them were. Unless you were already familiar with the system, you would never be able to find out.
When we transitioned them over, we preserved their unfortunate naming.
For non-engineers, flags with values like bpfee and gp were indecipherable. They provided a terrible user experience and led to a lot of operational errors that ultimately made their way back to engineering as bug reports and investigations, which we wanted to prevent.
We wanted to clearly describe what flags did and what they meant, so we introduced a data model to list what the available options were:
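Sketched as a migration, the lookup table was keyed by flag value (the exact column names are assumptions based on how the options are displayed below):

class CreateFeatureFlagOptions < ActiveRecord::Migration[6.0]
  def change
    create_table :feature_flag_options do |t|
      t.string :value, null: false   # matches FeatureFlag value, e.g. 'bpfee'
      t.string :name, null: false    # human-readable name
      t.text :description            # what toggling the flag actually does
      t.timestamps
    end
    add_index :feature_flag_options, :value, unique: true
  end
end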
We changed our modification interface to decorate the list of feature flags with these details, providing significantly more clarity to anyone modifying the feature flags.
They no longer had to see just:
bpfee
They also saw what it meant:
Receipt Basis Points Fee (bpfee)
Display fee percentages charged as basis points on the receipt.
The increased clarity led to significantly fewer errors and questions reaching us, resolving a large source of upstream issues.
Defaults
Sometimes when a record gets created, we want certain flags automatically enabled. We achieved this by adding a default_status column to the FeatureFlagOption table:
Now, whenever a record was created, we could pull up all of the flags where default_status was enabled and automatically enable those flags on the record.
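A minimal sketch of that hook, assuming the models described above:

class Cause < ApplicationRecord
  after_create :apply_default_feature_flags

  private

  def apply_default_feature_flags
    # Enable every flag whose option says it should be on by default.
    FeatureFlagOption.where(default_status: 'enabled').find_each do |option|
      FeatureFlag.create!(owner: self, value: option.value, status: 'enabled')
    end
  end
end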
Categorization
Over time, you get a lot of different feature flags, all for different purposes.
Some flags are intended for short-term use, typically to hide code until it is done or to mitigate risk via canary releases; once the feature is fully released, the flag should be removed. Some are intended for mid-term use, such as running experiments. And some are intended for long-term use, such as features toggled based on the contract or plan the customer is on.
Each of these flags has different owners and usages, so it is important to be able to separate them. By adding a category to FeatureFlagOption, we were able to scope the enumeration and behavior even further:
Experimentation
Sometimes, we want flags that split groups into different cohorts for A/B testing of features. We could store the various parameters of experiments on the FeatureFlagOption, to be sent to the A/B subsystem:
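One simple shape for that is a structured column on the option, sketched here assuming Postgres (the column and payload shape are assumptions):

class AddExperimentParamsToFeatureFlagOptions < ActiveRecord::Migration[6.0]
  def change
    # e.g. { "cohorts" => ["control", "variant_a"], "weights" => [50, 50] }
    add_column :feature_flag_options, :experiment_params, :jsonb, default: {}
  end
end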
While an A/B system is outside the scope of this post, I recommend checking out the split gem if you are using Ruby.
Performance
As usage of the feature flag system grows, you don't want to be performing that many database reads. The abstraction layer provides a perfect place to add a caching layer, ensuring rapid lookups of feature flags.
As always, remember the words of the famed Donald Knuth:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Ensure you have an actual need to improve performance and that the usage patterns are appropriate for your solution.
As another tip: don't randomly add caching logic ad hoc. It quickly gets out of hand when you try to figure out whether you are dealing with cached data or not. It becomes monumentally more complex when caching layers at the controller, data, and database levels all start interacting with each other.
Add caching thoughtfully and with centralized controls so you know exactly how stale the data is and how to force a refresh, if necessary.
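Kept inside the abstraction layer, that might look something like this (a sketch assuming Rails.cache; the TTL and key shape are assumptions):

class FeatureToggler
  CACHE_TTL = 1.minute

  def enabled?(flag, record)
    Rails.cache.fetch(cache_key(flag, record), expires_in: CACHE_TTL) do
      FeatureFlag.exists?(owner: record, status: 'enabled', value: flag)
    end
  end

  # Explicitly bust the cache when a flag changes, so staleness is bounded
  # and controlled from one place.
  def refresh!(flag, record)
    Rails.cache.delete(cache_key(flag, record))
  end

  private

  def cache_key(flag, record)
    # record is nil for system flags, so guard the class/id lookups.
    ['feature_flag', flag, record&.class&.name, record&.id]
  end
end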
Client-side checks
Our new technology platform leaned heavily on the feature flag infrastructure. We created a front-end service and component that showed or hid nested components based on a feature flag's status.
By exposing flag checks in an API and then consolidating all of its logic within the service and component, we were able to avoid having feature flag logic all over the front-end.
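Server-side, such an API can stay tiny because the FeatureToggler already encapsulates the logic; a sketch, with route and parameter names assumed:

class FeatureFlagsController < ApplicationController
  # Illustrative endpoint: GET /feature_flags/:value?owner_type=User&owner_id=1
  # Whitelist owner types rather than constantizing raw params.
  ALLOWED_OWNERS = { 'User' => User, 'Cause' => Cause }.freeze

  def show
    owner = ALLOWED_OWNERS[params[:owner_type]]&.find_by(id: params[:owner_id])
    render json: {
      value: params[:value],
      enabled: FeatureToggler.new.enabled?(params[:value], owner)
    }
  end
end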
The Effects
The transition went incredibly smoothly, and it was shocking how much separating releases from deploys improved team collaboration and delivery effectiveness.
Deploy sizes went down and frequency went up because developers could deploy code without causing production issues. This in turn decreased the change failure rate since deploys were smaller and bugs were easier to catch before they made it to production.
Releases went more smoothly and adapted better to the business' timetable because they no longer required engineers to keep track of different branches of work or deal with integration challenges across tens of thousands of lines of code.
It wasn't a silver bullet for all of our problems, but it helped streamline processes and remove a lot of collaboration friction, paying dividends.
Frequently Asked Questions
Why didn’t we go with a vendor?
There are a lot of vendors that provide flag functionality, like Optimizely or LaunchDarkly. Why didn't we go with them? Why reinvent the wheel? In a word: budget.
At the time, we were under significant budget restrictions, and it was difficult to get any additional spending approved. As engineering didn’t control our budget, we had to make do with what we had.
Even getting resources to work on technical debt like this was impossible; we had to make progress on engineering initiatives during the time we found in the "in-between": the small slices of time between tickets, projects, and meetings.
We had to work within the context we had.
Why didn’t we go with a library or gem?
We examined a lot of gems and realized that none of them suited our particular desires for future use cases, legacy constraints, or migration requirements. Several gems in particular would have required a big-bang approach to deployment, which was a level of risk we didn’t want to accept.
A few would have required a complete overhaul of how teams external to our department understood flagging, which expanded the scope of the changes we wanted to make and increased the delivery burden significantly by adding more stakeholders. We were already working on this as an unofficial ninja-project. We didn’t want to increase the odds of it never being delivered.
Most of the gems we looked at would have required us to go the route we did anyway. We decided to kick the selection decision down the road and roll our own quickly, hiding all access behind an interface so we could swap out the implementation later if resources freed up.
Why didn’t you just transition everything over at once?
An all-or-nothing approach was not desired for a few reasons.
Recall that this was a ninja project we were doing in-between actual work. This meant I had a few minutes here, an hour or two there to actually complete this migration. An all-or-nothing approach would have required significantly more focus than we could afford to allot.
We did introduce extra complexity in the migration by doing it in smaller pieces. However, we de-risked almost the entire initiative by doing so. Feature flags were used extensively within the system for a variety of reasons including contractual obligations, and an error in it would have been highly consequential.
Doing it in smaller pieces was absolutely worth the safety, even if it ultimately wasn’t needed in the end.
Where Do You Go From Here?
If you’re interested in learning more about feature toggling, Martin Fowler’s site has an in-depth article written by Pete Hodgson on feature toggling which I highly recommend.
Feature Toggles (aka Feature Flags)
by Pete Hodgson, martinfowler.com