droidcon News
Rolling-out like a Rock-Star
By
Nicola Corti
droidcon EMEA 2020
Rolling-out a feature is a fine art. Users are becoming more and more demanding. A single crash could entice them away from your app (and land a shameful 1-star review). In this context, the possibility of remote controlling your app is a key tool. First, it can protect you against crashes and incidents. Moreover, it can help you experiment on the user experience to fit your users’ tastes. At Spotify, feature rollout and experimentation are at the foundation of our development. We deliver daily more than a thousand feature flags on a variety of different apps (Android, iOS, and more). To achieve this, we built an in-house solution to support our experimentation needs. Throughout the years we collected a series of learnings, success stories and pitfalls. In this talk, I will share some of them. Afterwards, you will be able to set the stage for a flawless rollout.
Transcript
English
00:10
I assume I can start, right? Okay. Hi everyone, thank you very much for joining this session. My name is Nicola Corti, and today we're going to talk about rolling out like a rock star. So, the mandatory slide about myself: I work as an Android infrastructure engineer at Spotify in Stockholm, Sweden. I want to spend a couple of seconds on what infrastructure means here exactly. I mostly spend my day working on SDKs and tools for other Android engineers, so that's a little bit different from the canonical Android engineer. You can find me on Twitter as cortinico, and also on GitHub. By the way, if you like running, make sure you scan this Spotify code: if you go inside the Spotify app and search, there is a camera icon, and you can scan this code and subscribe to my running playlist. I really love running, so make sure we are synced on the same tune.

But now let's get started. I want to start with a little bit of history, and I need to start with this. If you don't know them, this is ABBA. Why am I starting with ABBA? First, because I love them. Second, because they're Swedish, and Spotify is also a Swedish company; we are based in Stockholm. The reason why I'm telling you this is that if you happen to be in Stockholm, make sure you stop by the ABBA museum, because it is amazing and you will enjoy it for sure. But ABBA is also the name of our first experimentation framework, and I think the name was just genius: ABBA, Sweden, music, Spotify, and also ABBA is the best tool for running A/B testing.

So I want to start from here. As I said, ABBA was our first, naive experimentation framework at Spotify. We used it as the foundation for running all the experiments and rollouts inside the Spotify app. Unfortunately, it came with some drawbacks. Specifically, it was really hard to scale: once Spotify started to grow more and more, it was really hard to keep ABBA
following our growth pace. The problem with ABBA was that everything was designed as global. Imagine that you have a single file on Android where all of your flags are defined. Once you reach something like 100 Android engineers touching the same code base, it gets really complicated to touch the same file concurrently, and it's just a mess from the ownership point of view. It was hard to maintain, as I said, because it was not designed to fulfill our needs, and it was also not flexible enough. First, it was hard for us to adapt it to the developer experience that we wanted to offer. Also, we have a lot of data scientists at Spotify, and it was really complicated to define custom metrics. So ABBA was really elementary, and we spent quite some time recently reshaping it. Specifically, we divided ABBA into different components and rewrote them from the ground up.

So today I'm here to share our work with the rest of the community, with the idea of giving inspiration to others who might want to start doing experimentation or feature rollout. The tool that I'm going to present today is called Remote Configuration, and it's a tool that we use internally at Spotify to do A/B testing, experimentation and feature rollout.

First I want to give a couple of disclaimers. There are a couple of topics that are extremely crucial for running proper A/B testing and experimentation that I will not touch today, and here is why. The first one is all the data science behind A/B testing, specifically how you do user allocation. Whenever you're running multiple experiments, you want to make sure your users are not biased by having been inside a previous experiment, so you want to make sure the allocation is fully random. I will not touch this. We could give an entire talk only on this, and it's extremely interesting, but it is a little bit out of scope for what I want to talk about today. And then I will not mention analytics. Analytics
are a fundamental part of an experimentation framework because they allow you to understand what your users are doing, either through event reporting or, on a higher level, through user behavior: is the user clicking on a certain button, are they more interested in a certain screen rather than another? You need those to understand what users are doing, but again, we could give another talk only on this, so they will be outside the scope of this talk.

Whenever we talk about feature rollout, experimentation and A/B testing, the code that an Android engineer has in mind looks more or less like this: you have an if-then-else. You check if a feature, like showing the lyrics of a song, is enabled or not. If it's enabled, you want to show a certain experience, in this case the real-time lyrics experience; if it's not, you want to fall back to a default experience. Unfortunately, it's not that simple. The code here is really trivial: it could be a boolean guard like this one, or something a little bit more complicated, but it's really elementary. What needs to happen under the hood is a little bit more complicated, specifically if you also want to control the value of the lyrics flag remotely, so that you are able to change the behavior from remote.

So, as I said, we started rewriting basically from the ground up, and we followed some principles when doing it. Today I want to walk you through some of our principles, and we will see during the talk why they are crucial for our architecture. The first principle that we followed while developing is safety. Our lighthouse is, and will always be, to always deliver a consistent configuration. At Spotify we deeply care about the user experience, so we want to make sure that our users always end up with a consistent and valid configuration. We will see later what this means, but we want to make sure that the experience is always
consistent and not broken for our users.
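The boolean guard described in this part of the talk can be sketched in a few lines of Java. The slide itself is not part of the transcript, so every name here (`LyricsView`, `LyricsFeature`, `select`) is hypothetical; the sketch only illustrates the if-then-else shape with a fallback to the default experience.

```java
// Hypothetical names; the real Spotify code is not shown in the transcript.
interface LyricsView {
    String render(String track);
}

final class LyricsFeature {
    // Chooses the experience from a boolean flag and falls back to the
    // default experience when the flag is off.
    static LyricsView select(boolean lyricsEnabled) {
        if (lyricsEnabled) {
            return track -> "real-time lyrics for " + track;
        }
        return track -> "default player for " + track;
    }

    public static void main(String[] args) {
        System.out.println(select(true).render("Dancing Queen"));
        System.out.println(select(false).render("Dancing Queen"));
    }
}
```

The guard itself stays trivial; the rest of the talk is about where the value of `lyricsEnabled` comes from.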
00:36
We also designed a system that is fault tolerant. Specifically, we need to support a lot of offline scenarios: maybe our users are on a train or on an airplane, and we still want to have a system that works correctly. We rely a lot on default values, so all of our flags always have a fallback scenario that we can fall back to if something goes bad, for example when there is no internet connection and there are no values. We also try to bake in developer safety as much as possible. We followed principles typical of software development, like single responsibility, immutability, et cetera, and we baked them into our own system. We will see some examples later.

The second principle is ubiquity. The Spotify experience is shipped across a lot of different devices, not only Android apps. That's the reason why we spent a lot of time building SDKs that support Android, iOS, web and C++. As I said, we don't support only Android but also iOS devices, and not only phones but also wearable devices, tablets and automotive. And if you don't know, Spotify doesn't only have the so-called Spotify Music app; there are a lot of other apps, like Spotify Stations, Spotify Lite and Spotify Kids, and all of those apps should be able to experiment.
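The reliance on default values in offline scenarios can be illustrated with a small Java sketch. This is an assumption-laden illustration, not Spotify's SDK: `FlagResolver`, `declare`, `onFetched` and `isEnabled` are hypothetical names, and string keys are used only to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: every flag must be declared with a default, and the
// default is returned whenever no remote value has been received.
final class FlagResolver {
    private final Map<String, Boolean> defaults = new HashMap<>();
    private final Map<String, Boolean> fetched = new HashMap<>();

    void declare(String name, boolean defaultValue) {
        defaults.put(name, defaultValue);
    }

    // Stores remote values, ignoring flags that were never declared.
    void onFetched(Map<String, Boolean> remoteValues) {
        for (Map.Entry<String, Boolean> e : remoteValues.entrySet()) {
            if (defaults.containsKey(e.getKey()) && e.getValue() != null) {
                fetched.put(e.getKey(), e.getValue());
            }
        }
    }

    // Returns the remote value if one exists, otherwise the declared default.
    boolean isEnabled(String name) {
        Boolean fallback = defaults.get(name);
        if (fallback == null) {
            throw new IllegalArgumentException("Undeclared flag: " + name);
        }
        return fetched.getOrDefault(name, fallback);
    }
}
```

With this shape, a device that has never reached the backend still gets a valid answer for every declared flag.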
01:02
The last principle is simplicity. Our motto here is: it should just work. We want to have a system that our developers internally are able to set up with a couple of lines, and everything should just work smoothly. It should be really easy to use; it should not be that complicated, really. To achieve this, we baked code generation into every possible step of our setup. We also spent time offering developer tools to support our developers in their daily work, for example debug menus if they're running the app and want to see a different experience, and also in testing. We deeply care about testing, so if we are offering two different user experiences, we want to have the tools to write tests for those two different experiences.

Now that I have presented those principles, let me spend some time giving you a bigger-picture overview of our architecture. To explain our architecture I need to introduce three actors: the first is a developer, the second is a data scientist, and the third is a Spotify user. It all starts from a developer. A developer wants to create a new rollout or a new experiment, so they create a properties.yaml file where they define all the flags. Once they have created this file, they are able to build a new feature and use the flags inside their business logic. After they're done building the feature, the feature is merged and the properties go through the so-called publish phase. The publish phase allows those properties and flags to reach one of our backend services, called Remote Config Admin. This is where we store all of our properties.

Then it is up to the data scientist. The data scientist has all the configurations for setting up a new experiment or a new rollout. They use a tool called Backstage, which allows them to interact with Remote Config Admin and send out all the configurations for running a proper
experiment. Then it is up to our mobile users. When they open the Spotify app, they go through the so-called resolve phase. The resolve phase is when the Spotify app hits another service, called Remote Config Resolver, which is responsible for giving runtime values to our mobile app; and obviously it interacts with that other service, Remote Config Admin. Now I want to walk you through those dark green boxes one by one, and we will deep dive into each of them to understand the nuances.

The first one is the data scientist story. As I said before, this talk is not really focused on data scientists, but I want to give a short mention of Backstage. Backstage is an open source tool that we open sourced some months ago; you can find it on GitHub and on backstage.io. It's basically a framework for building developer portals. Imagine that you have a web application where you have direct access to your CI tool, to the source code, to the documentation, to all the tools that are used by developers. Internally we also have a plugin for Backstage that is used by data scientists to set up new experiments and rollouts, see all the metrics, and do all the evaluations that a data scientist needs to do. So I invite you to check out backstage.io.

The next step is the build phase, so the developer phase, and this is where the interesting part actually starts. As I said, it all starts from a properties.yaml, so let me introduce this properties.yaml and let's see what it looks like. This is how a developer can define a first flag; in this case it's a flag to control whether lyrics are enabled or not. As you can see here, the properties are namespaced, so they look like player module dot lyrics enabled, and this is a first form of ownership: with this syntax we are always able to identify who owns this specific flag. In this case it is just a boolean flag, but you can define others. For example, this
is a hello message property, where we define an enum property with a list of all the possible values. The next one is message size, which is the text size of a message; it is an integer property for which we can define a lower and an upper bound. For all of those properties there is always a default value; this is a requirement of the system.

Once one of these files is ready, the developer can start building a feature. Let's look at it from the developer's point of view: they're working on the player module, and they have a new file called properties.yaml. Our experience is shipped as a Gradle plugin. This means that a developer has a task called generate properties that they can just invoke. This task takes properties.yaml as input and runs code generation. Specifically, two or more files are generated: one is a properties class, and then we generate a Dagger module. This means that the output looks more or less like this: PlayerProperties.java and PlayerPropertiesModule.java. Those two files can be used inside the business logic of the feature, and the developer can just start using them and writing the logic.

As I said, we baked developer safety into our system, so PlayerProperties is fully immutable: you can't change the values once you have an instance. If you want an updated value of a flag, you need to get a new instance of PlayerProperties. We are also strongly opinionated on type safety. Specifically, we decided not to have string properties: you can't have a string flag. That might sound odd, especially if you have worked with tools like Firebase Remote Config, where you can actually use a string. The idea behind this is that with our types, which are integer, enum and boolean, we are able to restrict the set of possible values and prevent scenarios like developers sending source code in a string property, or developers sending JSON or data that might break the app in some way. We want
strong enforcement on data types, and we realized that with enum, boolean and integer we are able to cover the vast majority of the experimentation use cases. Last point: we also enforce strong ownership, as I said before, with this naming convention. We also use Backstage again, because Backstage has the capability of giving ownership to every source code file. In this case, this properties.yaml needs to be registered and needs to have an owner, so we can always pinpoint which team owns the YAML file. We can always say: hey, this is the team, we need to ping them and ask what's going on with this property.

From the developer's point of view it looks more or less like this: they go inside their build.gradle and drop in these three lines that enable the remote config plugin, and then there is the properties.yaml file that we already saw. Let's have a look at what the feature code of a developer looks like. For example, let's say that they have a Dagger module with all the dependencies already, like this PlayerModule.kt, and they just need to include our generated properties module. That's really easy: this module already has all the provides for all the properties classes. Inside the business logic, let's assume there is a PlayerPresenter file; they just need to inject the properties, which can be done with field injection or constructor injection. It's classical Dagger code, really nothing crazy. And inside the business logic code they can write something like this: if properties dot lyrics enabled. Here we reuse the same name of the property, so we do a camel-casing of the property name, and they can just do their branching: if true, show the lyrics experience, otherwise show the default experience. This is the code that I presented at the beginning, but it is made possible by code generation. I also want to mention that using code generation in this field allows
us to avoid a lot of common errors. For example, a lot of basic experimentation frameworks require you to write the string of the property or flag name, and if you have a typo there, or if you use lyrics dash enabled instead of underscore, you might end up in situations where you ship to the Play Store, you enable the flag, and then nothing happens because you have a typo in the property name. That is really annoying, so using code generation helps us avoid a lot of common errors like this one.

Let's see the second step, the publish step. The developer created a properties.yaml file and started building, and now it's time to publish the properties. So again, the player module has its properties.yaml. In our code base we have tons of modules, so imagine that there are a lot of other modules, like search, podcast, playlist and so on, and all of them could have a properties.yaml; it is not mandatory, but they could define one. So we offer another Gradle task, called publish properties, that is responsible for searching for all the properties, discovering them throughout the code base, hitting our backend, Remote Config Admin, and telling it which properties were found. To fully understand this step, I need to mention that this happens on CI, and it happens post-merge: once the feature is ready, they can just click merge, and this task will run and publish all the properties.

Why do we do this? First, because with this system we don't need to define properties twice. They're just defined inside the properties.yaml file; you don't need to log in to a web UI and create a new property there. It is taken from the code base and made available inside Admin, and all the other systems that use this property can benefit from this setup. We can also run CI checks, like static analysis, on this properties.yaml to validate that the properties are valid, that they use the correct naming, et cetera. And this also helps us to achieve
strict versioning. That means that for every version of the app we are always able to go back in time and see which properties were published, because we can just check out the code base at that point in time, run this task, and find the same properties. Again, this is hugely powerful, because we can rely on our code base as the single source of truth: properties are defined only once, they are defined in the code base, and they are close to the features that actually use them. This solves a lot of the headaches that we had with ABBA, where, again, imagine this huge file with thousands of properties where you don't know who owns what. With this setup, instead, it's really easy to understand the code and who owns it.

One little note: at Spotify, to share code between Android and iOS, we use C++. So imagine that there is an underlying C++ layer shared between Android and iOS, and developers can define their properties.yaml there too. In the C++ layer we have, for example, code for playing music, and imagine that a developer there wants to experiment on the bit rate. So there are actually several properties.yaml files, and it's up to this Gradle task to find all of them.

Now to the last step, the resolve. This is the step that happens on another front, the runtime part: it happens when our users open the Spotify app. Let's see it in detail.
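To make the build phase more concrete, here is a sketch of what a generated, immutable properties class could look like for the three properties described in the talk (a boolean `lyrics_enabled`, an enum `hello_message`, and a bounded integer `message_size`). The actual generated code is not shown in the transcript, so the enum values, the bounds, the defaults, the `fromResolved` factory and the string-based wire format are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical shape of a generated class: typed accessors, no setters,
// and a mandatory default for every property.
final class PlayerProperties {
    enum HelloMessage { HELLO, HI, WELCOME }        // assumed enum values

    static final int MESSAGE_SIZE_MIN = 10;         // assumed lower bound
    static final int MESSAGE_SIZE_MAX = 30;         // assumed upper bound
    static final PlayerProperties DEFAULT =
            new PlayerProperties(false, HelloMessage.HELLO, 14);

    private final boolean lyricsEnabled;
    private final HelloMessage helloMessage;
    private final int messageSize;

    private PlayerProperties(boolean lyricsEnabled, HelloMessage helloMessage, int messageSize) {
        this.lyricsEnabled = lyricsEnabled;
        this.helloMessage = helloMessage;
        this.messageSize = messageSize;
    }

    boolean lyricsEnabled() { return lyricsEnabled; }
    HelloMessage helloMessage() { return helloMessage; }
    int messageSize() { return messageSize; }

    // Builds a new instance from raw resolved values. Anything missing,
    // malformed or out of bounds falls back to the default for that property.
    static PlayerProperties fromResolved(Map<String, String> raw) {
        return new PlayerProperties(
                parseBoolean(raw.get("lyrics_enabled"), DEFAULT.lyricsEnabled),
                parseEnum(raw.get("hello_message"), DEFAULT.helloMessage),
                parseBoundedInt(raw.get("message_size"), DEFAULT.messageSize));
    }

    private static boolean parseBoolean(String v, boolean fallback) {
        if ("true".equals(v)) return true;
        if ("false".equals(v)) return false;
        return fallback;
    }

    private static HelloMessage parseEnum(String v, HelloMessage fallback) {
        if (v == null) return fallback;
        try {
            return HelloMessage.valueOf(v.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return fallback;
        }
    }

    private static int parseBoundedInt(String v, int fallback) {
        if (v == null) return fallback;
        try {
            int n = Integer.parseInt(v);
            return (n >= MESSAGE_SIZE_MIN && n <= MESSAGE_SIZE_MAX) ? n : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

Because the accessors are generated from the property names, a typo such as `lyrics-enabled` becomes a compile error rather than a flag that silently never turns on, which is the class of mistake the talk attributes to string-keyed frameworks.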
01:28
We have several modules, again the same ones that I showed before, and as I said, we do code generation, so we have a lot of files, like Java files, that are generated by our framework. We also ship a runtime client SDK: a part of the code inside the Spotify app that is responsible for doing a network call to Remote Config Resolver and getting a runtime allocation, that is, for a specific user, in a specific country, on a specific version of the Spotify app, getting the values of all the properties. The rest is up to the runtime: once the player module starts, it needs an instance of PlayerProperties and asks the client SDK for it. The client SDK returns a valid instance of PlayerProperties with the values allocated at runtime, and this is the same mechanism for all the features that require a valid allocation at runtime.

To fully understand the resolve, I need to quickly mention our fetch strategy. At Spotify we use the so-called activate and fetch. This is terminology that comes from Firebase Remote Config. Fetch is the action of actually hitting the backend, so it's this red arrow: you do a fetch whenever you hit the backend and get new values for all the properties. You do an activate inside the client SDK when you make those values available for all the modules to use. We will see this strategy in a little more detail in the next slide, but I want to mention that we also do recurrent fetches in the background. Polling is probably not the correct term, but in the background we constantly interact with Remote Config Resolver to get fresher values to be used by the app.

Let's see what the fetch strategy looks like in more detail. The user opens the app, and the first question that we ask ourselves is: is the user logged in? Is the user seeing the login screen, or are they already logged in to the Spotify app? If
they're not logged in, so they're seeing the login screen, we do the so-called fetch and activate; so not activate and fetch, but fetch and activate. It means that we immediately reach our backend, get a configuration, and activate it immediately to make it available at the first run of the application. This is the so-called first user session: we want to make sure that when you open the Spotify app for the first time, you see the login screen, there is a loading spinner, and then you see your correct experience.

If you are already logged in, then we do the activate and fetch. How does it work? First we load the remote configuration SDK and check whether there is a cached configuration, something that we fetched in the past. If yes, we just activate it; if not, we fall back to the default configuration. After we do this, we schedule a delayed fetch. This is what I mentioned before: the user sees the app with the experience that was fetched for them in the past, we activate what we have right now, and in the future a new fetch happens, so newer values are refreshed for that user and become available the next time they reopen the Spotify app. These scheduled delayed fetches happen in every scenario, and they allow us to be in constant connection with Remote Config Resolver and make sure that we have up-to-date values.

So now I want to share with you some lessons learned.
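The two startup paths described above (fetch-and-activate on the login screen, activate-and-fetch for logged-in users) can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: `RemoteConfigClient` and its methods are hypothetical, the backend is replaced by a `Supplier`, the cache lives in memory rather than on disk, and the delayed background fetch is collapsed into an immediate call.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical client illustrating the order of fetch and activate.
final class RemoteConfigClient {
    private final Map<String, Boolean> defaults;
    private final Supplier<Map<String, Boolean>> backend; // stands in for the network call
    private Map<String, Boolean> cached;                  // last fetched configuration, null if none
    private Map<String, Boolean> active;                  // configuration visible to features

    RemoteConfigClient(Map<String, Boolean> defaults, Supplier<Map<String, Boolean>> backend) {
        this.defaults = Collections.unmodifiableMap(new HashMap<>(defaults));
        this.backend = backend;
        this.active = this.defaults;
    }

    // Fetch: hit the backend and store the result without exposing it.
    void fetch() {
        cached = Collections.unmodifiableMap(new HashMap<>(backend.get()));
    }

    // Activate: expose the cached configuration, or the defaults if none exists.
    void activate() {
        active = (cached != null) ? cached : defaults;
    }

    void onAppStart(boolean loggedIn) {
        if (!loggedIn) {
            fetch();      // first session: fetch and activate immediately
            activate();
        } else {
            activate();   // use what was fetched in the past (or defaults)
            fetch();      // a real client would schedule this with a delay, in the background
        }
    }

    boolean isEnabled(String name) {
        Boolean value = active.get(name);
        if (value != null) return value;
        Boolean fallback = defaults.get(name);
        if (fallback == null) throw new IllegalArgumentException("Undeclared flag: " + name);
        return fallback;
    }
}
```

In the logged-in path the freshly fetched values only become visible on the next `onAppStart`, which is what keeps the configuration stable for the whole session.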
01:54
When building this kind of SDK we went through several failures: things that didn't work as we expected, things that we would have done differently. We want to share them with the rest of the community now, and they might be helpful if you are also starting to develop a system like this one.

The first one is correctness. At the beginning I mentioned that one of our goals is to always deliver a consistent configuration. We truly believe in this, and it is also one of the reasons why we decided to follow this activate and fetch mechanism. To explain this, I have a simple GIF. Imagine that you have two buttons for confirming an order, and the user can click on yes or no. Then you do an activate at a certain point and you change the experience, because you have a pro tip banner to show on top. This changes the layout for the user and can end up with them clicking the yes button without meaning to. That is extremely annoying. The idea behind this is that we deeply care about the user experience, and we don't want our users to experience a fragmented app. For example, if you're experimenting on the color of the buttons, buttons should always be of the same color; you should not see screens with a red button and others with a green button. That's just not acceptable. That's why we do this activate and fetch: we re-fetch configurations, but they become available to you at the next opening of the app.

On this point, I also want to mention: make sure you cover every entry point to your app. In the diagram I presented before, I showed the open-the-app entry point; on the left there was open the Spotify app. But what if I have a deep link, or push notifications? You need to take care that every entry point of your app is covered with the proper experimentation initialization.
02:20
The problem is that if you don't do this, you might end up in a scenario that is called mis-bucketing: you have users that should be in a certain group, that should have a certain experience, but because they entered through an entry point that you haven't handled, they see a different experience. That's something that you want to avoid.

Another lesson learned: think about the differences between experiments and rollouts. They are not the same thing. This whole talk was about having values that you can control remotely, and you can use them to do both experiments and rollouts, but those two use cases are actually somewhat different. An experiment is something that you run to evaluate user behavior and to collect insights, to understand what your users are doing and which screen of your app is the best. The end goal is to make an informed decision: maybe you have, again, two shapes of a button, you want to experiment on them, and the end goal is to decide which shape is the best to use. A rollout, instead, can be used to test a new library, or to do a gradual release of a new feature, or maybe you just have a kill switch that you want to control remotely. So the life cycle of those two use cases is really different: an experiment might run for just a couple of weeks, while rollouts might run for months. It's crucial for you to develop a system that adapts to your use cases, and your users, that is, the developers in your company, should understand the difference between those two.

Another crucial point is propagation time. Let me explain it with an example. A developer is building the feature, and a data scientist is setting up the experiment. Once the developer is done, there is a release of a new version to the Play Store; in our case, at Spotify, we release every week, so this can take up to seven days. Then
once the app is out in the store, you can actually enable the experiment. You could potentially enable it before, but you need to press the button; that could be immediate, but it might not be. Once you press the button, that is, once you go to your web UI and change a configuration from false to true, your users should start experiencing it, and this can take from minutes to days. There might even be users that never experience what you changed over there. This is called propagation time, or ramp-up time, and it is the time between you coding the feature and your users actually experiencing it. Sometimes I really feel that developers just go to the web UI, change the value, and expect that the value is true for everyone. No, that's not actually the case: it might take days, especially if you have a setup like ours, where you do a background fetch of the values and make them available only at the next start of the app. So you need to find a way to mitigate this propagation time, and here I have a potential solution.

To explain it, I think it's also crucial to understand the implications of experiments for the app life cycle. Specifically, there are the so-called startup experiments, experiments that happen early on during the app life cycle. Let's say that you want to experiment on app startup: the app starts, and you want to choose between two different libraries. You enable the value, because you want to use the new library, and then the app starts crashing. You need to have a way to wipe the so-called faulty configurations, because you enabled something that is crashing for your users. You might go to your web UI and disable it, but you need to be able to somehow recover all the users that have the value set to true. There are multiple ways: you could have a counter of crashes, and if the app is crashing, I don't know, more
than x times in a matter of minutes, just wipe whatever you have there. Or you can have something like a push mechanism that allows you to clear broken configurations and also speed up the propagation time. This is something that Firebase Remote Config is doing, for example: it allows you to use Firebase Cloud Messaging to push your configurations to your users. That is in beta right now.

Then, just to close, I want to mention a little bit about buy versus build. This whole talk was on how we built our own system, and the idea was to share with you folks what our story was, what we built and what our goal was. But this doesn't mean that every company needs to build something like this. There are a lot of valid alternatives out there, and I want to mention some. If you want to use something pre-built, Firebase Remote Config might be that; it's probably the de facto solution on Android. You can use other tools like rollout.io, and if you want instead to build your own SDK, you can use something like Facebook PlanOut. Going pre-built is probably the most cost-effective solution, because you don't have to build anything and you rely on other tools, and it's probably the fastest: you will have something working in a matter of hours or days.

But you can end up building something custom, and this was the decision for us. Specifically, we ended up with this for scalability reasons: for example, Firebase Remote Config doesn't allow you to have more than, I think, 2000 flags, and that didn't work for us; we would have already reached the limit. Then flexibility: in our case we have a lot of custom metrics, and we want to do a lot of data analysis on our experimentation, so we wanted to build our own system for it. And also safety: as I mentioned before, we built a system that was safe for our developers to use and safe for our infrastructure to run. The end goal, and here I want to close: as I
said, I work for an infrastructure team, so my customer at the end of the day is another Android engineer, or other engineers in the company. In this case we deeply care about developer happiness: we want to make sure that our developers at Spotify like our tools, are able to do experiments, and are fine with it. That's why we decided to go for a build solution rather than using a third-party tool. That being said, thank you very much for listening. We are hiring, so if you're interested in any position just drop me a message, and I'm open to answering any questions. Awesome.
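One of the mitigations mentioned in the talk for faulty startup configurations is a crash counter: if the app crashes more than x times within a few minutes, wipe the cached configuration. A minimal Java sketch of that idea follows. The class name `CrashGuard`, the threshold and the window are hypothetical, and a real implementation would have to persist crash timestamps across process restarts, which this in-memory sketch does not do.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window crash counter.
final class CrashGuard {
    private final int maxCrashes;
    private final long windowMillis;
    private final Deque<Long> crashTimes = new ArrayDeque<>();

    CrashGuard(int maxCrashes, long windowMillis) {
        this.maxCrashes = maxCrashes;
        this.windowMillis = windowMillis;
    }

    // Records a crash and returns true when more than maxCrashes crashes fall
    // inside the window, meaning the cached configuration should be wiped.
    boolean recordCrash(long nowMillis) {
        crashTimes.addLast(nowMillis);
        // Forget crashes that are older than the window.
        while (!crashTimes.isEmpty() && nowMillis - crashTimes.peekFirst() > windowMillis) {
            crashTimes.removeFirst();
        }
        if (crashTimes.size() > maxCrashes) {
            crashTimes.clear(); // start counting again after a wipe
            return true;
        }
        return false;
    }
}
```

When `recordCrash` returns true, the caller would drop the cached configuration so that the next start falls back to default values.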
02:46
okay so uh
03:12
the first one is from anonymous okay so did you begin with this developer data scientist architecture when spotify first started if not how smooth was the transition did you have to rewrite a lot of the code from scratch uh no so as i said at the beginning our first tool was called abba then something like one two years ago we decided to reshape our approach and we ended up rebuilding basically our new tools from the ground up and actually um that was the occasion to actually split the tools into more logical tools so like a tool for doing experimentation another tool for handling metrics um and from the transition point of view i mean as every migration it was complicated um we relied a lot on ownership so also with abba we had a way to find the owner of every flag so it was um somehow possible for us to allocate a sort of like migration task to every squad to every team and ask them like hey you have 10 flags please migrate them and then it was not really complicated at the end of the day like once every team was pinged multiple times to do the migration we were able just to um like remove the flags and our code the values and then obviously the older service will keep on running for the foreseeable future but we saw like a really strong adoption of the system because developers generally liked it okay so next one
03:38
how does this config interact with client version for example if a rollout of a new feature showed a bug how could you then limit the new feature to only upstream clients that had a fix for the bug so
04:04
this yeah um so as i said before we strongly rely on this publishing mechanism so for every version of the app we are able to specify which flags are published and are available and on our web ui um data scientists can see the range of possible versions that have that flag defined if um like for example sometimes a developer might just decide to create a new feature flag and if they remove the previous feature flag and they create a new one the data scientist will be informed that the flag is not available i don't know if this answers exactly your question if not feel free to ping me on twitter and i'll be happy to follow up um so the next one is from guillermo hey and he asks if consider a feature that is not shared between android and ios what happens if ios and android defines a property so yaml inside the module doesn't that break the single source of truth of the properties it is on two places so properties are client-scoped so if you define the same property inside android and ios they will actually be different properties like we have a way to distinguish the client at the top level if it's like music android music ios stations android stations ios and then for all of them we have all the properties so even if the file looks the same it's actually like a different client
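The two ideas in this answer, properties scoped per client and flags published for a range of app versions, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class names, the client identifiers such as `music-android`, and the use of integer version codes are hypothetical and do not describe Spotify's internal tooling.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical key: the same property name under two clients is two different properties. */
final class PropertyKey {
    final String client;
    final String name;

    PropertyKey(String client, String name) {
        this.client = client;
        this.name = name;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof PropertyKey)) return false;
        PropertyKey that = (PropertyKey) other;
        return client.equals(that.client) && name.equals(that.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(client, name);
    }
}

/** Hypothetical registry of which app versions publish which property. */
final class PropertyRegistry {
    // Value is {minVersionCode, maxVersionCode}, both inclusive.
    private final Map<PropertyKey, int[]> published = new HashMap<>();

    void publish(String client, String name, int minVersionCode, int maxVersionCode) {
        published.put(new PropertyKey(client, name), new int[] {minVersionCode, maxVersionCode});
    }

    /** True only if this client publishes the property in the given app version. */
    boolean isAvailable(String client, String name, int versionCode) {
        int[] range = published.get(new PropertyKey(client, name));
        return range != null && versionCode >= range[0] && versionCode <= range[1];
    }
}
```

With a layout like this, limiting a feature to clients that contain a fix amounts to publishing the property only from the fixed version onward, and a property with the same name on `music-ios` stays independent of the one on `music-android`.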
04:30
and there was like another question that they asked me if the tool is going to be open-sourced i want to follow up on that uh it's not on the table right now mostly because this is like a really customized tool that works on our internal needs so it wouldn't be really helpful for others um it's not really as generalized as firebase remote config that you can really like use for any kind of experimentation but really um we do a lot of open source so feel free to follow up on github.com uh that being said i don't think we have any other questions thank you very much for having me feel free to ping me on twitter if i wasn't able to answer your question
Droidcon is a registered trademark of Mobile Seasons GmbH Copyright © 2020. All rights reserved.
