machine learning

An Open-Source ML Pipeline

Whether our model is predicting churn, detecting fraud, or forecasting sales, there are a few components that are common across any ML pipeline. In particular, every ML pipeline includes the following…

I've recently become less reliant on measures like Yelp and Google reviews when selecting a restaurant. I've noticed that restaurants with many high Yelp reviews are sometimes not as good as…

The Dead Sea Effect

Recently, I read a post from Bruce Webster's blog that highlights a pattern at large organizations, where the quality of retained employees exponentially worsens over time when talented…

machine learning

Comparing Common Implicit Recommenders

Since ALS was used by researchers at Yahoo in , more recommenders have been developed to handle implicit data interactions. Many of these recommenders are based on matrix factorization and still…

Improving Health Care with 1% Steps

According to the 1% Steps Project, the three most effective ways for lowering health care costs include reducing surprise billing, capping provider prices, and providing real-time adjudication for…

machine learning

Outlining the A/B Testing Procedure

Experimentation using A/B testing is a crucial component in measuring customers' changes in behavior when making any changes to a business, including site changes, product changes, etc. Most of the…

machine learning

Intuition behind BTYD Models

In many of Peter Fader's presentations, he has demonstrated how LTV forecasts can be accurately estimated by modeling and combining two other estimates. In particular, these two models refer…

Potential Reasons for Recent Shortages

Recently, there have been shortages of many different products, especially in the last few years since COVID-19. The latest shortage of semiconductors has impacted multiple industries, including the…

Six Dimensions of National Culture

In the field of cross-cultural psychology, Dimensionalizing Cultures by Hofstede is still one of the most widely accepted and cited papers, even though his first paper about dimensionalizing cultures…

Collider Bias and Police Use of Force

Over the last decade, I've grown up hearing and reading about police-related killings of unarmed black men, including Eric Garner, Michael Brown, Ronell Foster, George Floyd, and countless others…

machine learning

Gram Matrices in Neural Style Transfer

In this paper, it has been shown that matching the Gram matrices of feature maps is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with the second order polynomial kernel. Thus, the paper…

An Argument against Meat

Over this last year, I've grown to enjoy plant-based foods, while becoming convinced agricultural biotechnology will be increasingly important over the next decade as well. I think most people have…

machine learning

Understanding the Glicko Rating System

In 2007, Microsoft released their TrueSkill 1 paper, which essentially was a modified implementation of the Elo system. At the time, most of their games used TrueSkill 1 as a player ranking system…

Building a Prototyping Pipeline in Python

Most data science projects undergo various stages, which require communication with the business, determining use cases and opportunities with high ROIs, collecting and exploring raw data, feature…

Selection Bias and COVID-19

Survivorship bias refers to our tendency to focus only on the observations that make it past some selection process, while overlooking those that do not. Typically, these observations aren't…

Creating Custom Awaitable Objects

The goal of the asyncio module is to implement asynchronous programming in Python. It achieves concurrency by using evented I/O and cooperative multitasking, whereas a module like achieves…

Internal Structure of Pandas DataFrames

A DataFrame object relies on underlying data structures to improve performance of row-oriented and column-oriented operations. One of these data structures is the BlockManager. The BlockManager is a core…

Performance Benchmarks: PyArrow

As of 2020, there has been development towards parquet-cpp, which is a native C++ implementation of Parquet. This development process was moved to the Apache Arrow repository. At a very high level…

Performance Benchmarks: Parquet

A Parquet file is a popular column-oriented storage format for Hadoop. For more information about column-oriented stores, refer to my previous post. A Parquet file is used for fast analytics that…

Basics of Database Internals

A data store is a place used for storing data. This includes a database, repository, file system, etc. There are two ways of storing data in a database, which are the following: Row-oriented data…

NoSQL Basics: Graph Databases

In a previous post about NoSQL databases, graph stores were described at a fairly high level. In this post, we'll dive into more low-level details, including features, behavior, and use cases…

NoSQL Basics: Column-Family Databases

In a previous post about NoSQL databases, column-family stores were described at a fairly high level. In this post, we'll dive into more low-level details, including features, behavior, and use…

NoSQL Basics: Document Databases

In a previous post about NoSQL databases, document stores were introduced at a fairly high level. In this post, we'll dive into more low-level details, including features, behavior, and use cases…

NoSQL Basics: Key-Value Databases

In a previous post about NoSQL databases, key-value stores were described at a fairly high level. In this post, we'll dive into more low-level details, including features, behavior, and use cases…

Relational and Non-Relational Databases

Ensuring a stable form of data storage is an important decision for any business. The data in an organization can last much longer than many of its applications. Unfortunately, there isn't a single…

File System and Database Differences

In most cases, databases are stored in files, which in turn exist in a file system. The data within a database are usually stored in files…

Database Sharding

Query optimization, indexing, and NoSQL solutions are all popular scalability strategies when designing server-side systems. If those options aren't enough, then sharding may be the next best strategy…

Consistent Hashing for Load Balancing

As you know, a hash function maps key values to index values. Typically, these functions are used to determine the location (i.e. index) of a record within a table. They have other applications, such…

Scaling a Pizza Chain

In computer science, systems design refers to the process of defining and developing a system that satisfies certain requirements made by the user. Obviously, this involves a detailed understanding of…

Testing Spark Applications with Mesos

This post walks through an example of running a cluster using a Mesos cluster manager on Mac OS. In the coming posts, we'll explore other examples, including clusters running a standalone cluster…

Testing Spark Applications with YARN

This post walks through an example of running a cluster using a YARN cluster manager on Mac OS. In the coming posts, we'll explore other examples, including clusters running a standalone cluster…

Testing Spark Applications in Standalone

This post walks through an example of a cluster running in standalone mode. In the coming posts, we'll explore other examples, including clusters running a YARN cluster manager and Mesos cluster…

Data Locality in Spark

This post provides an overview of different types of data locality in Spark. In the coming posts, we'll dive deeper into more low-level concepts. That is, we'll explore the Spark internals using…

Spark Deployment Modes

This post provides an overview of the different deployment modes in Spark and how each deployment mode changes the behavior of Spark components. In the coming posts, we'll dive deeper into more low…

Runtime Architecture in Spark

This post provides a high-level introduction to generic objects in the Spark API, along with the responsibilities for each object. In the coming posts, we'll dive deeper into more low-level concepts…

Datasets and DataFrames

Unlike the basic Spark API, the Spark SQL API provides additional data structures used for holding data and performing computations. As a result, Spark SQL is able to perform…

Visualizing DAGs in Spark

The goal of this post is to provide a general introduction to the API. Each example has a snippet of PySpark code with explanations. Another goal is to provide a general introduction to Spark's web…

Spark RDD Fundamentals

This post provides a high-level introduction to the RDD object in the Spark API. In the coming posts, we'll dive deeper into more generic objects in the Spark API. Then, we'll explore low-level…

Hadoop as a Distributed OS

Before investigating Spark in detail, we should develop a core intuition behind Hadoop. This post compares Hadoop to a traditional computer operating system. In the coming posts, we'll begin exploring…

Relevance of Hadoop

It's been 14 years since the initial release of Apache Hadoop, which is a long time for any software. Unsurprisingly, the internet is flooded with clickbait articles about Hadoop being replaced by…

Running Hugo on GitHub

In this post, we walk through the steps of running a site on GitHub created by a static site generator. This post assumes a directory containing Hugo source files has already been created. For more…