After my success with the Markov Model, I became the go-to mathematician on the team and was subsequently tasked with optimizing storage compression policies for Cosmos (click for the whitepaper). This project looked much simpler at face value than my Markov Model, and it did turn out to be rather simple in the end, but it could have become quite complicated had circumstances been different. Let me elaborate.
A data center has resources (e.g., storage, computation, network), but these resources are not independent when executing tasks. If I'm writing data to a distributed storage system, I'm consuming a bit of every resource along the way: disk space to store files, memory to buffer data sent over the network, network bandwidth to transfer data, and compute to orchestrate and execute everything. Because different compression algorithms consume different amounts of CPU to compress the same piece of data, I needed to make sure the extra storage savings didn't come at the cost of too much computation.
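This tradeoff is easy to see empirically. Here's a small sketch (not the actual codecs or data involved in the project) that compresses the same synthetic payload with Python's standard-library `zlib` and `lzma` modules and reports the compression ratio alongside the CPU time each codec spent:

```python
import time
import zlib
import lzma

# Synthetic payload: repetitive text compresses well, like typical log data.
data = b"timestamp=2020-01-01 level=INFO msg=request served bytes=1024\n" * 20000

def benchmark(name, compress_fn):
    """Return (name, compression ratio, CPU seconds) for one codec."""
    start = time.process_time()
    compressed = compress_fn(data)
    elapsed = time.process_time() - start
    ratio = len(data) / len(compressed)
    return name, ratio, elapsed

results = [
    benchmark("zlib level 1", lambda d: zlib.compress(d, 1)),
    benchmark("zlib level 9", lambda d: zlib.compress(d, 9)),
    benchmark("lzma preset 6", lambda d: lzma.compress(d, preset=6)),
]

for name, ratio, secs in results:
    print(f"{name}: {ratio:.1f}x smaller, {secs:.3f}s CPU")
```

On payloads like this, the heavier codecs typically squeeze out a better ratio but burn noticeably more CPU, which is exactly the balance the policy had to strike at fleet scale.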
There are also tradeoffs between optimization and manageability given the system design. In our case, compression policies were managed in configuration files controlled by engineers, so it wasn't wise to squeeze out extra savings at the risk of making the configuration process overly complicated. To keep things simple, we decided to have just two compression levels at any time: one default compression applied immediately upon write, and another used for long-term storage after some time had passed (e.g., 6 months), by which point short-lived data was likely deleted and the CPU cost to recompress wouldn't go to waste.
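A two-level, age-based policy like this boils down to a single rule. Here's a minimal sketch; the codec names and the 6-month threshold are illustrative placeholders, not the actual Cosmos configuration:

```python
from datetime import datetime, timedelta

# Hypothetical policy parameters -- illustrative names, not real codecs.
DEFAULT_CODEC = "fast-lz"          # cheap codec applied at write time
ARCHIVE_CODEC = "high-ratio-lzma"  # heavier codec for long-term storage
ARCHIVE_AGE = timedelta(days=180)  # recompress once data is ~6 months old

def codec_for(created_at: datetime, now: datetime) -> str:
    """Pick the compression codec based solely on the data's age."""
    return ARCHIVE_CODEC if now - created_at >= ARCHIVE_AGE else DEFAULT_CODEC

now = datetime(2020, 6, 30)
print(codec_for(datetime(2020, 6, 1), now))  # recent data -> default codec
print(codec_for(datetime(2019, 6, 1), now))  # year-old data -> archive codec
```

Keeping the rule this simple is the point: engineers editing a config file only ever have to reason about two codecs and one age threshold.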
Evaluating the compression gains for different policies requires some care because different datasets have different compressibility, and I needed to estimate compression gains across an exabyte of data for each policy without actually compressing an exabyte of data. This was done by intelligently identifying and sampling the key datasets that consumed the majority of the storage space in the data centers.
Some data exploration highlighted that Pareto’s 80/20 principle applied to our data, where the vast majority of the storage space was consumed by a small subset of all datasets. This meant that sampling these key datasets and learning their characteristics (e.g., retention policies, compressibility, etc.) would allow me to accurately predict overall compression gains from a small subset of samples.
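The extrapolation itself is a size-weighted sum. The sketch below shows the idea with made-up numbers (the dataset names, sizes, and ratios are all illustrative, not measurements from the project): measure a compression ratio on a sample of each large dataset, weight by that dataset's share of total storage, and treat the unsampled long tail conservatively as incompressible:

```python
# Hypothetical sampled measurements: each large dataset's footprint (PB)
# and the compression ratio measured on a small sample of it.
datasets = [
    {"name": "logs",      "size_pb": 400, "ratio": 4.0},
    {"name": "telemetry", "size_pb": 300, "ratio": 3.0},
    {"name": "snapshots", "size_pb": 200, "ratio": 2.0},
]
total_pb = 1000  # whole fleet; the sampled datasets cover 900 PB of it

covered = sum(d["size_pb"] for d in datasets)
compressed_covered = sum(d["size_pb"] / d["ratio"] for d in datasets)

# Conservative assumption: the unsampled long tail of small datasets
# does not compress at all.
compressed_total = compressed_covered + (total_pb - covered)
savings_pb = total_pb - compressed_total

print(f"Estimated savings: {savings_pb:.0f} PB "
      f"({100 * savings_pb / total_pb:.0f}% of the fleet)")
```

Because the sampled datasets dominate total storage, the error introduced by the conservative tail assumption stays small, which is what makes the Pareto structure of the data so useful here.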
The rest of the project was rather straightforward.
The effort and time involved in the project were small given the amount of cost savings to the business. The key takeaway is that optimizing even small parts of a system operating at scale is worthwhile.