Alex Mitelman Personal website

System Design Weekly 019: August 2021

Highlights HashiCorp State of Cloud Strategy Survey HashiCorp surveyed 3200+ decision-makers from their contact database. Here are the key takeaways: 76% of the companies are already multi-cloud. 86% plan to be multi-cloud in two years. 90% of large corporations adopt multi-cloud solutions, while only 60% of startups have moved in this direction. Digital transformation is the main multi-cloud driver. Other reasons are avoiding single-vendor cloud lock-in, cost reduction, and scaling. Service mesh adoption is expected to grow 2.

System Design Weekly 018: August 2021

Highlights Logging at Twitter: Updated Centralized logging at Twitter was limited by low ingestion capacity and weak query capabilities, which resulted in poor adoption. The previous solution ingested around 600K events per second per data center. However, only around 10% of the logs were submitted, with the remaining 90% discarded by the rate limiter. To address this, Twitter adopted Splunk Enterprise and migrated centralized logging to it. It now ingests 4 times more logging data, has a better query engine, and sees better user adoption.

System Design Weekly 017: July - August 2021

Highlights How WhatsApp enables multi-device capability The WhatsApp phone client was previously the source of truth. If someone wanted to use WhatsApp on another device, the messages would be relayed through the smartphone app, because the smartphone kept the data. If the smartphone battery was drained, the companion app would stop working. WhatsApp now allows connecting 4 additional devices that are independent of the smartphone. Each device gets an identity key.

System Design Weekly 016: July 2021

Highlights DoorDash: Building Faster Indexing with Apache Kafka and Elasticsearch The DoorDash team found that updating the search index took a very long time. They built a search system that relies on open source technologies. It uses Kafka as a message queue and for data storage, and Flink for transforming data and sending it to Elasticsearch. A reliable indexing system would ensure that changes in stores and items are reflected in the search index in real time.

System Design Weekly 015: July 2021

Highlights Managing Asynchronous Workflows with a REST API When building a REST API, there is sometimes a need to run complicated logic that takes a while. In these cases, the REST call starts an asynchronous job. For example, a call to generate a PDF report: POST /api/v1/report. In response, the REST API answers with the status HTTP/1.1 201 Created and a Location header that points to the result: Location: /api/v1/report/123. What are the options to fetch the result of this asynchronous job?
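One possible answer to the question above is polling: the client repeatedly requests the URL from the Location header until the job is finished. The sketch below models that flow in memory and makes no claim about which option the article recommends. The ReportService class, the Response type, the finish method, and the "pending"/"done" state names are hypothetical and exist only for illustration; only the paths and the 201 status come from the excerpt.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: dict = field(default_factory=dict)


class ReportService:
    """In-memory stand-in for the report API (hypothetical, not a real server)."""

    def __init__(self):
        self._ids = itertools.count(123)  # start at 123 to match the example URL
        self._jobs = {}

    def create_report(self):
        # POST /api/v1/report: register the job and return immediately.
        job_id = next(self._ids)
        self._jobs[job_id] = {"state": "pending", "result": None}
        return Response(201, headers={"Location": f"/api/v1/report/{job_id}"})

    def finish(self, job_id, result):
        # Would be called by a background worker once the PDF is ready.
        self._jobs[job_id] = {"state": "done", "result": result}

    def get_report(self, path):
        # GET /api/v1/report/<id>: the endpoint the client polls.
        job_id = int(path.rsplit("/", 1)[-1])
        job = self._jobs.get(job_id)
        if job is None:
            return Response(404)
        if job["state"] != "done":
            return Response(200, body={"state": "pending"})
        return Response(200, body={"state": "done", "result": job["result"]})
```

A client would call create_report once, read the Location header, and call get_report on that path until the state is "done". Polling is simple to implement, at the cost of extra requests while the job is still running.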

System Design Weekly 014: July 2021

I came across the word “exabyte” three times in just one day. Previously I didn’t even know this word existed. So 1 exabyte is 1,000 petabytes, or 1,000,000 terabytes. Companies operate at a scale of millions of terabytes now. “Apple is apparently Google’s largest customer now, followed by ByteDance (parent company of the TikTok app). Apple holds 8 exabytes of data with Google Cloud, ByteDance is in the region of 500 petabytes — 16x less.
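The unit conversions and the 16x figure can be checked with a few lines of arithmetic. This is a small sketch that assumes decimal (SI) units, as the text does; the variable names are made up for illustration.

```python
# Decimal (SI) storage units, as used in the text above.
PB_PER_EB = 1_000
TB_PER_PB = 1_000

tb_per_eb = PB_PER_EB * TB_PER_PB   # terabytes in one exabyte

apple_pb = 8 * PB_PER_EB            # 8 exabytes expressed in petabytes
bytedance_pb = 500                  # roughly 500 petabytes

ratio = apple_pb / bytedance_pb     # how many times more data Apple stores
print(tb_per_eb, ratio)
```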

System Design Weekly 013: June 2021

Highlights Learn how Dream11, the World’s largest fantasy sports platform, scale their social network with Amazon Neptune and Amazon ElastiCache Dream11 is a fantasy sports platform that has social network features. The team evaluated different graph database solutions for the social network service and chose Amazon Neptune after a load/stress PoC. Dream11 already operates within AWS infrastructure, so being able to include a fully managed graph DB in its VPC was one of the factors.

System Design Weekly 012: June 2021

Highlights Uber: Handling Flaky Unit Tests in Java While the headline mentions Java, this experience is language-agnostic and can be helpful with any other programming language. The Uber team has moved all their repositories into a single monolithic repository. This move helps to better manage dependencies, testing infrastructure, build systems, and static analysis tooling. Although the individual repos had stable tests, there were lots of flaky tests after merging into a monorepo. Why did it happen?

System Design Weekly 011: May 2021

Highlights Pinterest: Shallow Mirror Kafka MirrorMaker is a tool to replicate Kafka clusters across different regions. Data from different Source Brokers is transferred to MirrorMaker, which then sends this data to Destination Brokers in other regions. At some point, Pinterest started experiencing scalability issues. Monitoring showed CPU and memory spikes. During the investigation, it became apparent that most of the CPU time was spent on message decompression and recompression. Memory consumption was often 2 to 10 times larger than the actual data being sent.

System Design Weekly 010: May 2021

Highlights AWS: Diving Deep on S3 Consistency Werner Vogels, CTO at Amazon, shares the journey of building strong consistency for AWS S3. When S3 was launched 15 years ago in 2006, it was simple storage for files, backups, etc. The eventual consistency model was more than enough for such purposes. Under this model, the API would sometimes return an older version of an object because the latest write had not yet propagated through the nodes.