We want to give a warm welcome to the 300+ people who have joined us since our last newsletter. To everyone else reading this, your support means a great deal to us. We hope you enjoy reading the curated links below as much as we did.
This week’s experiment is having an open comment section for everyone! So stop on by and share whatever’s on your mind. :) We’ll see you in there!
And now, let's dive in:
Editor's Picks
Meaningful metrics: How data sharpened the focus of product teams How do you decide on the metrics that matter? And how do you advocate for an organization to adopt new metrics? And what happens if existing metrics stop moving? Our Data Science team developed a growth framework that helped to grow DAUs by 4x since 2019. Let’s explore the path that led us to that framework (the Growth Model), the tangible impact it’s had on our business, and how we’re thinking of evolving the framework to take us into a new phase of growth…
Data science maturity and the cloud I regularly speak to data scientists who are frustrated in their roles because the tech in their organization simply does not give them the ability to do their job in the best way possible; or, even worse, they do not have the agency to do their job well. Data science, and data scientists, need the right conditions to flourish. So, if you’re looking at your own organization’s data science offering, what are the key things you should be able to do? And how can we ensure that data scientists have them? Let’s take a look at how to check an organization’s data science maturity…
What you’ll get as a subscriber above and beyond this weekly link roundup:
Office hours covering: careers, job-hopping, getting started, working with recruiting agencies
Study groups: an invite-only Discord server with chat rooms dedicated to different data science / data engineering / ML / AI learning materials, where we and/or “teaching assistants” will help guide you and answer your questions
Q&As with various companies whose tools you are already using
and more! (People have asked us to produce a “theory”-style weekly newsletter as well as a “tool”-only weekly newsletter, so maybe we’ll put that together for you as well)
If you belong to a professional organization, the subscription may be tax-deductible. You may also be able to expense it from your learning, professional development, or training budget.
Why didn't DeepMind build GPT3? Trying to answer the question “Why didn’t DeepMind initiate and deliver GPT3?” is one way of shedding light on this puzzle. I say specifically GPT3 because that was the significant innovation — we’ve been following a playbook since then, and most of the perceived advantage of OpenAI stems primarily from how fast they ship, and their appetite for it, not from the pace of discovery. As someone professionally interested in how you build extraordinary scientific teams, there are three things that strike me quite profoundly about GPT3…
Founded Upon an Error A recent post on Reddit asks, “Why was Bayes’ Theory not accepted/popular historically until the late 20th century?” Great question! As always, there are many answers to a question like this, and the good people of Reddit provide several. But the first and most popular answer is, in my humble opinion, wrong. The story goes something like this…
Snowblowing is NP-complete The recent winter storm left a lot of snow on my driveway. A lot. My driveway is the perfect place for huge snowdrifts to form. A tweet of my shoveling resulted in the discovery of The Snowblower Problem by Esther M. Arkin, Michael A. Bender, Joseph S. B. Mitchell, and Valentin Polishchuk… The Snowblower Problem (SBP) answers the following question: “How does one optimally use a snowblower to clear a given polygonal region?”… The snowblower problem is like the Traveling Salesman Problem…
Algorithmic Black Swans Organizations building AI systems do not bear the costs of diffuse societal harms and have limited incentive to install adequate safeguards. Meanwhile, regulatory proposals such as the White House AI Bill of Rights and the European Union AI Act primarily target the immediate risks from AI, rather than broader, longer-term risks. To fill this governance gap, this Article offers a roadmap for “algorithmic preparedness” — a set of five forward-looking principles to guide the development of regulations that confront the prospect of algorithmic black swans and mitigate the harms they pose to society…
Long commutes show structural inequality in cities, and bad health outcomes During President Biden’s State of the Union address, he spoke a lot about rebuilding our highways and railroads to improve the infrastructure of America. But if we don’t address the inequalities in the ways we use those roads and trains, and which communities are most in need of support, we risk embedding structural challenges in the very fabric of our cities… Research from Raj Chetty at Harvard on 5 different social factors found that shorter commute times were the strongest predictor of upward mobility. In fact, investments in public transportation have been shown to reduce local inequality and drive down local crime…
The Significance of A/B Testing and Power Analysis in Fraud Detection In this post, I will focus on the fraud detection domain, specifically on cases where we constantly retrain and replace multiple models…In the first part, I will present possible approaches for collecting data to compare models and discuss their advantages and disadvantages in the context of fraud detection. I will show that A/B testing has powerful advantages over other options…In the second part, I will address cases where we cannot afford to perform A/B testing for all model replacements…
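For readers new to power analysis, here is a minimal sketch of the usual sample-size calculation for comparing two rates, such as the fraud rates under two models. It is not taken from the linked post: the function name `sample_size_per_group`, the two-sided z-test approximation, and the example rates are all our own assumptions for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control: float, p_treatment: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    Hypothetical helper for illustration; uses the standard normal approximation.
    """
    if p_control == p_treatment:
        raise ValueError("The two rates must differ to have an effect to detect.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Made-up example rates: detecting a drop from 2% to 1.5% needs far more
# traffic per group than detecting a drop from 2% to 1%.
small_effect = sample_size_per_group(0.02, 0.015)
large_effect = sample_size_per_group(0.02, 0.01)
print(small_effect, large_effect)
```

The takeaway is that the required sample grows quickly as the difference between models shrinks, which is one reason running an A/B test for every model replacement can be expensive.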
I tested how well ChatGPT can pull data out of messy PDFs (and here’s a script so you can too) I spent about a week getting familiarized with two datasets and doing all of the preprocessing. Once it’s done, getting ChatGPT to convert a piece of text into JSON is really easy. You can paste in a record and say “return a JSON representation of this” and it will do it…But doing this for multiple records is a bad idea because ChatGPT will invent its own schema, using randomly chosen field names from the text. It will also decide on its own way to parse values. Addresses, for example, will sometimes end up as a string and sometimes as a JSON object or an array, with the constituent parts of an address split up…
Exploring Data Distributions with an Interactive Ridge Plot Ridge plots are particularly helpful for identifying differences in distributions between multiple groups or variables…We'll start with an overview of what ridge plots are and why they're useful, then dive into the technical details of building an interactive ridge plot…Along the way, we'll discuss different use cases for interactive ridge plots and the benefits they can offer for data analysis and decision-making. By the end of this post, you'll have a solid understanding of how to create an interactive ridge plot and how to apply it to your own data analysis projects…
Training Deep Networks with Data Parallelism in Jax One of the main challenges in training large neural networks, whether they are LLMs or VLMs, is that they are too large to fit on a single GPU. To address this issue, their training can be parallelized across multiple GPUs. This means either parallelizing the data or model to distribute computation across several devices. In this post, we'll cover batch splitting, also known as data parallelism, and show how to use JAX's pmap function to parallelize computations across multiple devices…
What should you use ChatGPT for? So I’ve been trying to understand the hype. I’m interested in what its impact is on the ML systems I’ll be building over the next ten years. And, as a writer and Extremely Online Person, I’m thinking about how it could change how I create and navigate content online…
Building a Bloom filter In this post, we will explore the Bloom filter, a data structure that is ingenious in its simplicity and elegant in its design. We will delve into the underlying principles of Bloom filters, understand their benefits and drawbacks, and walk you through the process of implementing a Bloom filter using Python. Bloom filters are a popular topic in computer science, particularly in systems design and optimization, and in technical interviews. Understanding this data structure and how to implement it can be an excellent way to showcase your skills as a data engineer…
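If you want a feel for the idea before reading the post, here is a minimal sketch of a Bloom filter in Python. It is our own illustration, not the linked author's implementation: the class name, the default sizes, and the choice of salted SHA-256 hashes are all assumptions made for brevity.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (illustrative only): k salted SHA-256 hashes over a bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item: str):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for word in ("apple", "banana", "cherry"):
    bf.add(word)

print("apple" in bf)   # True: added items are always reported as present
print("durian" in bf)  # most likely False; false positives are possible by design
```

The trade-off to notice is that the filter never stores the items themselves, so it cannot produce false negatives but can produce false positives, and the rate depends on the number of bits and hashes relative to the number of items.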
DiffusionFastForward A course on diffusion generative models in a fast forward mode…There are three elements integrated into this project: a) 💻 Code, b)💡 Notes (in notes directory), and c) 📺 Video Course (to be released on YouTube)…
Soundscapes: Creating Visuals with Meyda+Shaders Lately, I have found some very inspiring 3D pieces, most of them combined with sound. I wanted to create some personal pieces (still work in progress) and share here some insights from my experience. With two basic examples (Experience 1 and Experience 2), I explain the whole process from setting up the audio to retrieving the data and visualizing it.
Do Right Joins even matter? [Reddit Discussion] This is one of those out-of-the-blue thoughts that you get randomly. I am an expert in SQL with many years of experience, but have yet to use Right Joins lol. Is there any specific reason or use-case for this type of join?…
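One common observation on this question is that a right join can be rewritten as a left join with the table order swapped. The sketch below illustrates that with Python's built-in sqlite3 module. The tables and rows are made up for the example, and the right-join form appears only in a comment because support for it depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 'book');
""")

# Logically equivalent to:
#   SELECT c.name, o.item
#   FROM orders AS o RIGHT JOIN customers AS c ON o.customer_id = c.id
rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

print(rows)  # [('Ada', 'book'), ('Grace', None)]
conn.close()
```

Every customer is kept, including the one with no orders, which is exactly what the right-join form would return.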