An Open-Source ML Pipeline
Whether our model is predicting churn, detecting fraud, or forecasting sales, there are a few components that are common across any ML pipeline. In particular, every ML pipeline includes the following…
Goodhart's Law
I've recently become less reliant on measures like Yelp and Google reviews when selecting a restaurant. I've noticed that restaurants with many high Yelp reviews are sometimes not as good as…
The Dead Sea Effect
Recently, I read a post from Bruce Webster's blog that highlights a pattern happening at large organizations, where the quality of retained employees exponentially worsens over time when talented…
Comparing Common Implicit Recommenders
Since ALS was used by researchers at Yahoo in , more recommenders have been developed to handle implicit data interactions. Many of these recommenders are based on matrix factorization and still…
Improving Health Care with 1% Steps
According to the 1% Steps Project, the three most effective ways for lowering health care costs include reducing surprise billing, capping provider prices, and providing real-time adjudication for…
Outlining the A/B Testing Procedure
Experimentation using A/B testing is a crucial component in measuring changes in customer behavior when making any changes to a business, including site changes, product changes, etc. Most of the…
Intuition behind BTYD Models
In many of Peter Fader's presentations, he has demonstrated how LTV forecasts can be accurately estimated by modeling and combining two other estimates. In particular, these two models refer…
Potential Reasons for Recent Shortages
Recently, there have been shortages for many different products, especially in the last few years since COVID-19. The latest shortage of semiconductors has impacted multiple industries, including the…
Six Dimensions of National Culture
In the field of cross-cultural psychology, Dimensionalizing Cultures by Hofstede is still one of the most widely accepted and cited papers, even though his first paper about dimensionalizing cultures…
Collider Bias and Police Use of Force
Over the last decade, I've grown up hearing and reading about police-related killings of unarmed black men, including Eric Garner, Michael Brown, Ronell Foster, George Floyd, and countless others…
Gram Matrices in Neural Style Transfer
In this paper, it has been shown that matching the Gram matrices of feature maps is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with the second order polynomial kernel. Thus, the paper…
An Argument against Meat
Over this last year, I've grown to enjoy plant-based foods, while becoming convinced agricultural biotechnology will be increasingly important over the next decade as well. I think most people have…
Understanding the Glicko Rating System
In 2007, Microsoft released their TrueSkill 1 paper, which essentially was a modified implementation of the Elo system. At the time, most of their games used TrueSkill 1 as a player ranking system…
Building a Prototyping Pipeline in Python
Most data science projects undergo various stages, which require communication with the business, determining use cases and opportunities with high ROIs, collecting and exploring raw data, feature…
Selection Bias and COVID-19
Survivorship bias refers to our tendency to focus only on the observations that make it past some selection process, while overlooking those that do not. Typically, these observations aren't…
Creating Custom Awaitable Objects
The goal of the asyncio module is to implement asynchronous programming in Python. It achieves concurrency by using evented I/O and cooperative multitasking, whereas a module like achieves…
Internal Structure of Pandas DataFrames
A DataFrame object relies on underlying data structures to improve performance of row-oriented and column-oriented operations. One of these data structures is the BlockManager. The BlockManager is a core…
Performance Benchmarks: PyArrow
As of 2020, there has been development towards parquet-cpp, which is a native C++ implementation of Parquet. This development process was moved to the Apache Arrow repository. At a very high level…
Performance Benchmarks: Parquet
A Parquet file is a popular column-oriented storage format for Hadoop. For more information about column-oriented stores, refer to my previous post. A Parquet file is used for fast analytics that…
Basics of Database Internals
A data store is a place used for storing data. This includes a database, repository, file system, etc. There are two ways of storing data in a database, which are the following: Row-oriented data…
NoSQL Basics: Graph Databases
In a previous post about NoSQL databases, graph stores were described at a fairly high level. In this post, we'll dive into more low-level details, including features, behavior, and use cases…
NoSQL Basics: Column-Family Databases
In a previous post about NoSQL databases, column-family stores were described at a fairly high level. In this post, we'll dive into more low-level details, including features, behavior, and use…
NoSQL Basics: Document Databases
In a previous post about NoSQL databases, document stores were introduced at a fairly high level. In this post, we'll dive into more low-level details, including features, behavior, and use cases…
NoSQL Basics: Key-Value Databases
In a previous post about NoSQL databases, key-value stores were described at a fairly high level. In this post, we'll dive into more low-level details, including features, behavior, and use cases…
Relational and Non-Relational Databases
Ensuring a stable form of data storage is an important decision for any business. The data in an organization can last much longer than many of its applications. Unfortunately, there isn't a single…
File System and Database Differences
In most cases, database storage is implemented on top of a file system: databases are usually stored in files, which exist in file systems. The data within a database are usually stored in files…
Database Sharding
Query optimization, indexing, and NoSQL solutions are all popular scalability strategies when designing server-side systems. If those options aren't enough, then sharding may be the next best strategy…
Consistent Hashing for Load Balancing
As you know, a hash function maps key values to index values. Typically, these functions are used to determine the location (i.e. index) of a record within a table. They have other applications, such…
Scaling a Pizza Chain
In computer science, systems design refers to the process of defining and developing a system that satisfies certain requirements made by the user. Obviously, this involves a detailed understanding of…
Testing Spark Applications with Mesos
This post walks through an example of running a cluster using a Mesos cluster manager on Mac OS. In the coming posts, we'll explore other examples, including clusters running a standalone cluster…
Testing Spark Applications with YARN
This post walks through an example of running a cluster using a YARN cluster manager on Mac OS. In the coming posts, we'll explore other examples, including clusters running a standalone cluster…
Testing Spark Applications in Standalone
This post walks through an example of a cluster running in standalone mode. In the coming posts, we'll explore other examples, including clusters running a YARN cluster manager and Mesos cluster…
Data Locality in Spark
This post provides an overview of the different types of data locality in Spark. In the coming posts, we'll dive deeper into more low-level concepts. In other words, we'll explore the Spark internals using…
Spark Deployment Modes
This post provides an overview of the different deployment modes in Spark and how each deployment mode changes the behavior of Spark components. In the coming posts, we'll dive deeper into more low…
Runtime Architecture in Spark
This post provides a high-level introduction to generic objects in the Spark API, along with the responsibilities for each object. In the coming posts, we'll dive deeper into more low-level concepts…
Datasets and DataFrames
Unlike the basic Spark API, the Spark SQL API provides additional data structures used for holding data and performing computations. As a result, Spark SQL is able to perform…
Visualizing DAGs in Spark
The goal of this post is to provide a general introduction to the API. Each example has a snippet of PySpark code with explanations. Another goal is to provide a general introduction to Spark's web…
Spark RDD Fundamentals
This post provides a high-level introduction to the RDD object in the Spark API. In the coming posts, we'll dive deeper into more generic objects in the Spark API. Then, we'll explore low-level…
Hadoop as a Distributed OS
Before investigating Spark in detail, we should develop a core intuition behind Hadoop. This post compares Hadoop to a traditional computer operating system. In the coming posts, we'll begin exploring…
Relevance of Hadoop
It's been 14 years since the initial release of Apache Hadoop, which is a long time for any software. Unsurprisingly, the internet is flooded with clickbait articles about Hadoop being replaced by…
Running Hugo on GitHub
In this post, we walk through the steps of running a site on GitHub created by a static site generator. This post assumes a directory containing Hugo source files has already been created. For more…