Scaling Android Builds in Pandemic Times
By
Inaki Villar
droidcon Americas 2020
In 2020 COVID sent us home. This created a completely different scenario for companies whose employees had enjoyed fast corporate networks. For teams working remotely, optimizing the build system has a direct impact on the speed and productivity of the engineers. Learn how we worked on caching strategies at Tinder to optimize build speed.
Transcript
English
00:10
Hi everybody, thank you very much for having me. I'm Inaki Villar from Tinder, and this presentation is called "Scaling Android Builds in Pandemic Times". Since the pandemic started and everybody moved to working from home, at Tinder we have tried to analyze ways to improve the developer experience in a scenario with a high volume of changes on the main branch and massive modularization. Developers work on specific features, and when they pull the changes from the main branch they have to compile and execute different parts of the build again, even parts they did not change themselves. We found that caching is the simplest and most effective way to improve performance, and this talk focuses on remote caching for our build system, Gradle.

First of all, as an introduction, we need to understand how it works internally from the build system's perspective. Imagine that in your Android project you need to execute your build, and your build needs some dependencies. These are what we call the build dependencies: the Gradle version and the Gradle wrapper; some plugins, because we are working with Android, so the Android Gradle Plugin and the Kotlin Gradle plugin; and some of the build scripts are going to have additional build dependencies too. Then there are the dependencies of the runtime and compile configurations in Gradle. Those are the things we need in order to build.

Once it's built, Gradle generates an output. This output contains all the artifacts and the results of the different tasks. Gradle also stores, in the .gradle folder under your user home, information about the dependencies that we used in the build, meaning that when we build again we don't need to download the Gradle wrapper again, or the Android Gradle Plugin, or dependencies like Dagger that you are using. So when we run a Gradle clean we remove the output of the build, but we still use the dependencies that we previously stored in the .gradle folder.
00:27
In addition to this mechanism there is something called the local cache. The local cache doesn't come from the dependencies or from the artifacts themselves; it comes from the outputs of the tasks. Gradle is a build system based on tasks, like many other build systems, and we store the output of each task there, meaning that once the build is finished we have those outputs available. Most importantly, we store not only the current state of the project: if you are switching between different branches, you can store different states for the same task output.

How does this work internally from Gradle's perspective? As we mentioned, a Gradle task generates an output. To calculate the key, that is, to decide whether something can be reused from a previous build, some values have to match. For example, if I want to reuse the output of a previously executed assemble task, the inputs of the task have to be the same. Gradle computes a hash for each of the inputs of the task. To give a clearer picture, one of these inputs is the classpath. Another is the build stack components: every time you update, for example, the Android Gradle Plugin, some tasks are going to generate a different output, meaning that I cannot reuse those task outputs. The same goes for the output properties that we define on the task. Something else that matters is the build script, when it affects the execution of the task. Here is something we have asked ourselves all the time: imagine that you have 100 modules and you update one build.gradle. Does this change have an impact on the task? Only if the build script affects the execution of the task.

Finally, Gradle combines the hashes calculated for each of the inputs and generates a single key for all the elements. That's how Gradle identifies that a task can be reused, taken from the cache of a previous build. And this is where it starts to get cool, because when we do a Gradle clean we remove the output of the project, but we still have the local cache available and we can restore the values. The local cache survives not only branch changes but also a clean.
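The key calculation described above can be illustrated with a small sketch. This is not Gradle's actual algorithm; it is a simplified model, assuming the key is a hash over the task implementation, the per-input hashes and the output property names. The function names and the sample values are hypothetical.

```python
import hashlib


def fingerprint(value):
    """Hash a single input value (a stand-in for Gradle's per-input hashing)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def task_cache_key(implementation, inputs, output_names):
    """Combine the per-input hashes into one key for the task.

    implementation: string describing the task class / plugin version (assumed)
    inputs: dict of input name -> input value, e.g. the classpath
    output_names: list of declared output property names
    """
    hasher = hashlib.sha256()
    hasher.update(fingerprint(implementation).encode("utf-8"))
    # Sort by name so the key does not depend on declaration order.
    for name in sorted(inputs):
        hasher.update(name.encode("utf-8"))
        hasher.update(fingerprint(inputs[name]).encode("utf-8"))
    for name in sorted(output_names):
        hasher.update(name.encode("utf-8"))
    return hasher.hexdigest()


inputs = {"classpath": "android.jar:dagger.jar", "versionCode": "42"}
key_before = task_cache_key("plugin 4.0.1", inputs, ["sourceOutputDir"])
key_after = task_cache_key("plugin 4.1.0", inputs, ["sourceOutputDir"])
# A plugin update changes the key, so the old cache entry cannot be reused.
print(key_before != key_after)
```

In this model any change to a single input, to the plugin version or to the declared outputs produces a new key, which matches the behaviour described in the talk: identical inputs give a cache hit, anything else is a miss.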
00:44
In addition to the local cache we have the remote cache. The fundamentals are the same, but instead of being local on your own machine, the entries come from a third party, a remote cache.
01:01
It works in a similar way. The remote cache stores the outputs of different tasks in the cloud. When you start from a completely clean state, imagine that you are starting a project from scratch or you have removed everything, the thing that still persists is the remote cache. That's cool, because we save computation time and we reduce the memory pressure on our machines. So we have two types of caching: the local cache in our local environments and the remote cache. The remote cache can be whatever you want: the typical build cache node, the built-in node that comes with Gradle Enterprise, Redis, or S3, whatever you want to use as a persistent solution in the cloud. We are not going to go into the implementation details here.

Configuring a remote cache is quite easy. The build cache configuration from the Gradle API adds a specific closure to update the configuration. HttpBuildCache is a common object from the Gradle API that allows us to define when we want to push, what the URL is, and some kind of authentication too. On the other hand, if you want to use a custom S3 implementation, for example, the build cache configuration also lets you register different cache services. Here we are registering our S3 implementation, and later we configure it according to our requirements and whatever else a different implementation needs. Caching is not something new, and there have been different plugins over the years. I remember years ago working with the American Express plugin, and at Tinder we even created our own plugin because of our requirements. The last one, for example, the Gradle S3 cache plugin, is quite recent and adds some statistics about the usage of the remote cache too.

Finally, we have to talk about the caching strategy. I would say that the basic configuration is that on CI every commit on the main branch triggers the action of populating the remote cache, and later the local builds consume these elements from the build cache. Again, this is a basic setup, and it is a perfectly good place to start. But everything we have seen so far is not new: I remember working with caching maybe three years ago, so it's something we have done for years. The real key in caching is how we can optimize it and, even better, how we can identify its problems. The scope of this talk is big projects, and in big projects we need to optimize remote caching. A basic setup is not enough; we need to understand why we are having problems with the cache.

To get a better understanding, take this example where we have a project with nine tasks. Remember that Gradle is a build system based on a directed acyclic graph that combines different tasks. Once we apply remote caching from a clean state, we get something like this. So far so good, right? But in another scenario, closer to reality, we have thousands and thousands of tasks. When we apply the same behavior on our local machine, we get this scenario, and in numbers it means that we are hitting 99% of the cacheable tasks in 13 minutes. That could sound good, but I don't think that times like this are good. One thing we can do to understand the size of our problem is to run the same test in a CI build. If we do that, we get a 100% hit ratio in three minutes. So 10 minutes of difference in total. That's a lot, but it's perfectly normal. Imagine that you have a huge project with massive modularization, and I'm talking about more than 500 modules, with a lot of developers updating different parts of the application. That means that the lower parts of the project are more sensitive to changes in the upper tasks. And imagine in your own project that you have to execute kapt or the compile task of the main entry module: it's going to take a while. That's what we want to investigate, to see where these problems come from.

The first approach is to compare the builds, the CI build and the local build. To do that we use build scans, a feature from Gradle; there is Gradle Enterprise and there is the free version of build scans. You go to the Performance section, Build cache, and it tells you the different hit levels for the local and remote caches. You have to use --scan, or include it in your build configuration. For the CI build we requested 6,000 tasks. The first thing to note in our case is that we disable the local cache for CI builds, because we want totally clean remote builds from the CI perspective. We get a very good number there, so this is the perfect scenario. If we do the same for the local build and generate the build scan, we can see the difference in the same section: we are using the local cache for 2,000 tasks and the remote cache for the rest, with quite good numbers, but we still have 50 missing tasks.
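The basic strategy described here (CI pushes to the remote cache, local builds read from it) can be modelled with a short sketch. It is a toy model with plain dictionaries standing in for the caches, not Gradle's implementation; it assumes the local cache is consulted first and that remote hits are copied into the local cache. All names are hypothetical.

```python
def run_task(key, execute, local_cache, remote_cache, push_to_remote):
    """Resolve one task: local cache first, then remote, else execute.

    Returns a tuple (output, outcome) where outcome describes what happened.
    """
    if key in local_cache:
        return local_cache[key], "FROM-CACHE (local)"
    if key in remote_cache:
        # Assumption: a remote hit is also stored locally for next time.
        local_cache[key] = remote_cache[key]
        return remote_cache[key], "FROM-CACHE (remote)"
    output = execute()
    local_cache[key] = output
    if push_to_remote:
        remote_cache[key] = output
    return output, "EXECUTED"


remote = {}

# CI build on the main branch: clean machine, allowed to push.
ci_local = {}
print(run_task("key-1", lambda: "classes.jar", ci_local, remote, True)[1])

# Developer build: clean checkout, read-only access to the remote cache.
dev_local = {}
print(run_task("key-1", lambda: "classes.jar", dev_local, remote, False)[1])
print(run_task("key-1", lambda: "classes.jar", dev_local, remote, False)[1])
```

The three calls report an execution on CI, then a remote hit and a local hit for the developer. A developer whose key differs from the one CI pushed would fall through to execution, which is the situation the rest of the talk investigates.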
01:18
The build scan has more detail in the lower section. In the header you can see the hits, and below you can see which operations ran and which misses you had in your build. We see that we missed kapt in some library modules, and compileDebug as well. So we have a problem there, but we don't know how to fix it. At Tinder we use Gradle Enterprise, and the product allows you to compare build scans. If I want to see what differs from CI, I can compare both build scans and go deeper to understand the problem better. To do that, you go to your build scan and click compare; it shows another screen with the possible candidates to compare. I select the CI build, and then I get a user interface like this.

The comparison of the build scans is separated into different sections. For example, in the Infrastructure section we can see the differences between the builds. In my case one was built on macOS and the other on Linux. Having different operating systems shouldn't be decisive, because in the end the output of the task should be agnostic of the operating system. But in another investigation, maybe one developer is using a different version of the JVM, and it is here that you would find the problem. Another section is Switches, the Gradle switches. Usually we have the same switches in the Gradle configuration, but it's possible that somewhere somebody has different values, and this is where you would find that. The Build dependencies and Dependencies sections are quite clear: they show their differences.

For the task inputs we get this screen, where we compare both builds, but we get a message saying that the builds cannot be compared in terms of task inputs because we are not capturing them. To do that we need to apply the captureTaskInputFiles property in the build scan configuration, or if you prefer you can add it to the command line of your build. That generates a screen like this. We have to repeat the process: run our local build again, do the same in CI with this property, and compare again in Gradle Enterprise.

Now the comparison gives us information about the differences per task. We can filter them, and that gives us a case to investigate. In our case we had seen 50 missing tasks, and we want to understand the origin of these differences. One of the tasks that was not hitting the cache was the app's compileFlavorDebugKotlin, and it has this kind of entry in the build scan comparison. There we can see, first of all, that the cache key is different for the two tasks, which we already knew, and the outcome is different as well: FROM-CACHE in CI, SUCCESS locally. The next section shows the differences found in the file properties or value properties of the inputs that are used to calculate the task's key. If we click there, we start to see where the differences are located. Here, for example, we can see that the difference comes from the flavor-debug BuildConfig. But we can go deeper. Every dot you see on the screen represents one of the build variants in the comparison. If we click again on flavorDebug, we see specifically where the problem is. In this use case the problem is some differences in the BuildConfig. That was expected in our case, because on CI we were building with a specific property, and this property was not enabled in the local build, meaning that the content of the BuildConfig is different. The simple addition of this property to the build means that the remote cache is not hit.
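To see why one extra property is enough to lose the cache for a compile task, here is a minimal sketch. It assumes a simplified model in which the generated BuildConfig source is one of the inputs of the compile task; the field name EXAMPLE_CI_FLAG and the helper functions are hypothetical and do not come from the talk.

```python
import hashlib


def digest(*parts):
    """Hash a sequence of strings, separated so that boundaries matter."""
    hasher = hashlib.sha256()
    for part in parts:
        hasher.update(part.encode("utf-8"))
        hasher.update(b"\0")
    return hasher.hexdigest()


def generate_build_config(fields):
    """Render a BuildConfig-like source file from a dict of fields."""
    return "\n".join(
        'public static final String {} = "{}";'.format(name, value)
        for name, value in sorted(fields.items())
    )


def compile_task_key(build_config_source, kotlin_sources):
    """The compile task consumes the generated file, so its key depends on it."""
    return digest("compileKotlin", digest(build_config_source), digest(kotlin_sources))


common = {"APPLICATION_ID": "com.example.app", "BUILD_TYPE": "debug"}
sources = "fun main() = Unit"

local_key = compile_task_key(generate_build_config(common), sources)
# Hypothetical property that only the CI build sets.
ci_fields = dict(common, EXAMPLE_CI_FLAG="true")
ci_key = compile_task_key(generate_build_config(ci_fields), sources)

print(local_key == ci_key)  # prints False
```

The Kotlin sources are identical in both builds, yet the keys differ because an upstream generated file differs. Aligning the property between CI and local builds makes the two keys equal again in this model.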
01:35
At Tinder, thanks to Gradle, we ran a long investigation to better understand our cache misses. In a big project it's very important to understand this part. The typical case, which Nelson commented on yesterday, is empty folders: for example, your workspace has an empty folder and CI does not. If your project has a high volume of commits and a high number of developers, it's normal that your local workspace ends up in a bad state, meaning that you may have stale files or stale directories that don't correspond to reality. As basic as this: removing all the empty directories in your local checkout is going to help you with the remote cache.
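A cleanup like the one described can be scripted. The following is a minimal sketch, assuming that directories named .git and .gradle should be left alone; the function name and the skip list are choices made for this example, not something taken from the talk.

```python
import os


def remove_empty_dirs(root, skip_names=(".git", ".gradle"), dry_run=True):
    """Find (and optionally delete) empty directories below root.

    Walks bottom-up so that a directory containing only empty directories
    is itself removed once its children are gone. In dry-run mode nothing
    is deleted, so only directories that are already empty are reported.
    """
    removed = []
    for current, _dirs, _files in os.walk(root, topdown=False):
        if current == root:
            continue
        rel_parts = os.path.relpath(current, root).split(os.sep)
        if any(part in skip_names for part in rel_parts):
            continue
        if not os.listdir(current):
            removed.append(current)
            if not dry_run:
                os.rmdir(current)
    return removed


if __name__ == "__main__":
    # Report only; pass dry_run=False to actually delete.
    for path in remove_empty_dirs(".", dry_run=True):
        print("empty:", path)
```

Running it in dry-run mode first gives a list to review before anything is deleted.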
01:52
Yesterday Nelson commented that maybe in 6.8 or in future versions this problem is going to be fixed, and Gradle is going to take care of it. But think about the state of your codebase: maybe you haven't cleaned your local workspace in a year, and that means you surely have some empty folders and files that don't exist in the remote repository.

Another problem that we found at Tinder was this one. One developer had hundreds and hundreds of tasks missing the remote cache. We proceeded with the investigation, comparing the build scans, and we saw this difference: compared with CI, the local build had a different classpath, meaning that the version or the content of android.jar was different. It was very strange. The reason was that the developer was using an Android preview version. You know how it typically goes: when a new Android version is coming you install the preview, and it never gets updated. Because of this simple thing, this developer was not enjoying remote caching. Thanks to this investigation we were able to help and fix the problem.

Another case that we found at Tinder was quite special. Again, one developer with a lot of misses in the remote cache, and we were seeing a difference in the SHA of one library. It seemed totally impossible: you check what the SHA of the current dependency is, and it was correct on CI, but the developer had a different one. We started the investigation and ended up finding the problem in the repository itself: different versions of the artifact had been deployed to the repository, and the developer had the bad luck of fetching the dependency at a time when it was different from the one we used after that.

So it can be a little hard sometimes to investigate cache misses, but it's quite important to understand them, and your builds improve when you resolve these issues. If you want to know more about how to investigate cache misses, Nelson, who actually works at Gradle, has these two great articles about how to investigate the cache misses in your builds.
02:09
But it's fair to say that not everybody is using Gradle Enterprise, and maybe you want to investigate too. It's possible. If we take as an example the classpath problem that we saw before, we can use this flag to investigate further.
02:26
This flag makes the build print the values of the different inputs for a given task. For example, for this task called generateDebugBuildConfig, we have these inputs that matter for the cache key calculation. This task is, of course, very coupled to the Android plugin, as you can see here: we have the flavor names, the library, the version code. The output tells you the hash calculated for each of these fields, and at the end we get the result of composing all the different hashes into the key for the task itself. So one way to try to understand this problem, which of course you need to automate, is to generate both a local build and a CI build with the caching debug flag. Then we can pick the inputs that matter for the calculation of the task key and try to match them to see where the problem comes from. And here it is the same input, the classpath, that differs, found this time with a lot of manual comparison instead of the comparison we were doing with Gradle Enterprise.
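The manual comparison can be automated along these lines. The transcript does not name the flag; this sketch assumes it is `-Dorg.gradle.caching.debug=true` and that the build output contains lines of roughly the shape shown in the sample below. The exact wording of the log lines varies between Gradle versions, so the regular expressions are assumptions to adapt, and the hash values are made up.

```python
import re

# Assumed shapes of the debug lines; adjust to the Gradle version in use.
INPUT_RE = re.compile(
    r"Appending input (?:value fingerprint|file fingerprints) for '([^']+)' "
    r"to build cache key: (\w+)"
)
KEY_RE = re.compile(r"Build cache key for task '([^']+)' is (\w+)")


def parse_cache_debug_log(text):
    """Return {task path: {"key": hash, "inputs": {input name: hash}}}."""
    tasks = {}
    pending = {}
    for line in text.splitlines():
        match = INPUT_RE.search(line)
        if match:
            pending[match.group(1)] = match.group(2)
            continue
        match = KEY_RE.search(line)
        if match:
            # Input lines are assumed to precede the key line of their task.
            tasks[match.group(1)] = {"key": match.group(2), "inputs": pending}
            pending = {}
    return tasks


def differing_inputs(local, ci):
    """For tasks whose keys differ, list the input names whose hashes differ."""
    report = {}
    for task, local_info in local.items():
        ci_info = ci.get(task)
        if ci_info is None or ci_info["key"] == local_info["key"]:
            continue
        names = set(local_info["inputs"]) | set(ci_info["inputs"])
        report[task] = sorted(
            name for name in names
            if local_info["inputs"].get(name) != ci_info["inputs"].get(name)
        )
    return report


LOCAL_LOG = """
Appending input value fingerprint for 'versionCode' to build cache key: aaa1
Appending input file fingerprints for 'classpath' to build cache key: bbb2
Build cache key for task ':app:generateDebugBuildConfig' is ccc3
"""
CI_LOG = """
Appending input value fingerprint for 'versionCode' to build cache key: aaa1
Appending input file fingerprints for 'classpath' to build cache key: ddd4
Build cache key for task ':app:generateDebugBuildConfig' is eee5
"""

print(differing_inputs(parse_cache_debug_log(LOCAL_LOG), parse_cache_debug_log(CI_LOG)))
```

With the sample logs the report points at the classpath input of the one task whose key differs, which is the kind of answer the manual comparison in the talk was looking for.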
02:43
If you are building on a Mac and your CI is Linux, you will see that it's quite hard to achieve the 100% of remote cache hits that we see, for example, on the Linux machine. Once I have optimized my local build and aligned my local copy of the repository with upstream, we still see some tasks with differences. You start the process again, comparing the builds, comparing the outputs, and it's not clear how to fix these problems. For problems related to the Android Gradle Plugin, there is a plugin from Gradle called the Android Cache Fix Gradle Plugin, which provides workarounds for some tasks until the fixes are merged into the Android Gradle Plugin itself. In our case, I had problems with compileDebugLibraryResources. Checking this plugin, I saw that there is a workaround for our Android Gradle Plugin version, and applying the plugin in my project means that I can hit the cache there. Usually when you update to a new version of the Android Gradle Plugin there are improvements, but there may be some tasks that give you problems. Take a look at this plugin and you will likely increase the remote cache hits for those tasks.

Okay, now that we have learned how to analyze our builds better, it's the moment to go one step further. If you think about caching as a concept in computer science, we are just in one corner of it: from the Android perspective we are only fetching elements from the cache or pushing elements to it, and that's it. We are not getting into complex hierarchies or complex things. Still, there is a lot of room to improve our builds by applying some caching concepts.

In the first section we saw that the typical setup to populate the cache is that a commit on the main branch populates the remote cache, and then the local developers enjoy the outputs of those tasks. But not everybody works in the same way. For example, there is this typical scenario with an external build environment and an internal build environment. The external build environment could be a CI provider: CircleCI, Travis, GitHub Actions, whatever. Then you have your internal build environment, and there is no connection between the caching you could enjoy on CI and your remote cache. Of course, with some kind of hook on the repository you can still populate it: if you have a secondary Jenkins system in your internal build environment, you can still run a hook and populate the remote cache. But the CI is still totally independent.

Something that we tried at Tinder is to speed up the CI process with the remote cache alone. Imagine this picture, where we have parallel CI build executions, meaning that maybe 20 or 25 builds can be running at the same time. We implemented a build cache with no restriction to the main branch on when the cache has to be populated or when it has to be consumed: the remote cache was always enabled for all these steps. We implemented this approach with an S3 bucket, as simple as that. Our external CI provider is CircleCI. We analyzed how beneficial it would be to use AWS instead of Google Cloud, and we chose AWS to be closer to our CircleCI provider, because CircleCI runs in AWS as well. We applied some security compliance in line with our security department; for example, we were using assumed roles and encryption for our items in the remote cache. Something as simple as this sped up the process in our build system a lot.

This is the result. Previously, and that's the truth, as you grow in terms of modules you need more and more CI resources. The test tasks are quite expensive, and if you have Mockito, Robolectric and so on, you consume a lot of memory. We had this problem, and we were struggling a lot with the test task configuration. We went very conservative: while we were growing in modules, we tried to minimize the number of workers, and that increased our time. But once we implemented the S3 build cache approach, we went down by up to 20 minutes, and that was good. The most important part comes after that, and this is from the last three months: we were able to stabilize the time for the main branch independently of the number of modules. Now we have maybe 600 modules and we still enjoy the same time as three months ago. Quite interesting.

Another benefit is related to the business model of, for example, CircleCI, where you can use different kinds of instances, bigger or smaller. With the build cache and remote caching your instance doesn't have the same memory pressure, meaning that you can downgrade the instance you are using: instead of a 2xlarge you can use an xlarge, saving credits or money, because you are saving the operations there.

Another interesting, experimental approach is this one. We said that once we have a new commit on the main branch, we start a process, and this commit populates the cache: for a given task X we have this output. The theory that we saw before is that when the local build pulls the changes from the main repository, it gets the output of the task from the remote cache. That works, and we have the same elements. But I don't think that's the normal use case. Usually we have this situation: there is a new commit on the main branch, with a given cache key for a task, but the local build contains changes as well, so the same task generates a different output, which translates into a different cache key. So when the developer pulls the changes from the repository, they are not going to enjoy the remote cache, and the combination of both deltas generates yet another key.

How can we try to improve this? Of course, today we cannot predict the future or anticipate what the developer is going to do. But maybe we can think of simpler things. One thing we know is that there are some kinds of changes related to the build stack: the Gradle version, the Android Gradle Plugin version, the Kotlin Gradle plugin version, and buildSrc. And most importantly, we know that those changes are going to invalidate the cache. So we ended up with an intermediate solution. When we are going to update some build stack component, we get the latest pending pull requests, maybe the latest 40, we merge the build stack changes into them, and this generates different outputs for a given task. We store these values, meaning that the different developers are going to enjoy those values from the remote cache because we pre-populated them the day before.

How it is implemented: we wanted to minimize the updates of the build stack components, because we know they invalidate the cache. But once we have a pull request that contains such an update, we start this process of pre-populating the cache with the combination of the different pull requests. We measured this new process, we had some results, and it was working well. This is the Gradle Enterprise user interface, where you can see the avoidance savings, that is, how much time we gained thanks to the remote build cache, and we can see the peaks on the days after we followed this process. So the hypothesis worked: we can try to anticipate this kind of change, we can pre-populate the cache, and this has a benefit the next day when the developers retrieve the results. Most importantly, it opens the door to thinking about more clever ways to populate the cache.

And this is the end of the talk. First of all, I want to thank my team at Tinder, the Android developers there. They are awesome, always helping and giving suggestions. Thanks to Sami, of course, for coordinating everything. Additionally, I want to thank all these people. Maybe they are not aware of it, but working as a build engineer on big projects is quite a challenge, and one cool thing that I really enjoy is how people experiment with new things, far from the conventional ways. When you have 1,000 project modules, you have to think about how to improve the experience not only in terms of build times, not only in terms of CI, but in terms of local development too. So that's it. This is my Twitter handle in case you have any questions. And now I think we have four minutes left for questions.
03:00
Well, anyway, I want to mention that the future looks very good. In terms of caching, for example, we are going to enjoy the configuration cache, which is going to be in Android Gradle Plugin 4.2. We also have file system watching, which has been improved in version 6.7.1, released this week. And if you notice, all the... well, I have one question here: how do you configure the PR pre-population? We have a process that is set up with a custom script. It retrieves the latest PRs from GitHub, does the different merges of the branches, and builds the main task for the entry point. That generates an output for each task and pushes it to the cache.
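The custom script itself is not shown in the talk. As a rough sketch of the idea, the following function only builds the list of commands such a process might run for each pending pull request; the branch names, the entry task and the prewarm/ prefix are hypothetical, and fetching the list of pull requests from GitHub is left out.

```python
def prepopulation_plan(build_stack_branch, pending_pr_branches,
                       entry_task=":app:assembleDebug", limit=40):
    """Return one list of shell commands per pull request to pre-warm.

    build_stack_branch: branch containing the build stack update (assumed name)
    pending_pr_branches: branch names of pending pull requests, newest first
    entry_task: the main task of the entry point module (assumed)
    limit: how many of the latest pull requests to combine with the update
    """
    plan = []
    for pr_branch in pending_pr_branches[:limit]:
        plan.append([
            # Start from the pull request as it exists on the remote.
            "git checkout -B prewarm/{0} origin/{0}".format(pr_branch),
            # Combine it with the build stack change that invalidates the cache.
            "git merge --no-edit origin/{}".format(build_stack_branch),
            # Build with the build cache enabled so outputs get stored.
            "./gradlew {} --build-cache".format(entry_task),
        ])
    return plan


for commands in prepopulation_plan("update-build-stack", ["feature-a", "feature-b"]):
    print("\n".join(commands))
    print("---")
```

Whether the outputs actually reach the remote cache depends on the build being configured to push, and merge conflicts would need handling; both are outside this sketch.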
03:17
So, as I was saying, we have file system watching in Gradle 6.7.1 as well, and if you notice, all these solutions are based on caching too.
03:34
Question: what is the most challenging task that you had to do during busy weeks of constant delivery? Well, I remember having problems with Docker, Java 8 and Linux, and it was a nightmare. When you have memory pressure in Docker with Java 8, you don't get a heap memory error: the container just kills your Java process, and you don't know where it comes from.

Did you manage to test the configuration cache? Yes, I tested it, but in an Android project, and the Android Gradle Plugin version I was testing with, 4.1 or 4.0.1, was not ready. How much did that improve? I don't know, but yesterday Nelson said that a typical configuration time for 100 modules is around 10 seconds. Imagine that for 500 modules; the growth is not linear, I can tell you. We have situations where the time spent in configuration is bigger than the execution time. So we could not be more excited about the configuration cache, because it is going to save all this time.
03:51
Thank you.
04:08
And yes, again, exciting times are coming for the build system.
04:25
Thank you. Yes, regarding your question about Java 8: we moved to Java 11 and we don't have this situation anymore, because the integration with Docker is better in terms of the JVM arguments applied to memory consumption, and everything is okay. But yes, I remember hard times with Java 8, Docker and Linux.
04:42
Okay, anyway, if you have more questions, ask them. Thank you very much, everybody. If you have any questions about caching, about build systems and so on, that was my Twitter handle there. Thank you very much, everybody.
Droidcon is a registered trademark of Mobile Seasons GmbH Copyright © 2020. All rights reserved.