<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Ross McNairn's Blog]]></title><description><![CDATA[musings from engineering, product to tech and law

Working on something new]]></description><link>https://www.rossmcnairn.com</link><image><url>https://substackcdn.com/image/fetch/$s_!vLOW!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeab127a-3758-4e8d-81ec-5ea0e5c0bd8a_512x512.png</url><title>Ross McNairn&apos;s Blog</title><link>https://www.rossmcnairn.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 23:34:45 GMT</lastBuildDate><atom:link href="https://www.rossmcnairn.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ross McNairn]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[rmcnairn@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[rmcnairn@substack.com]]></itunes:email><itunes:name><![CDATA[Ross McNairn]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ross McNairn]]></itunes:author><googleplay:owner><![CDATA[rmcnairn@substack.com]]></googleplay:owner><googleplay:email><![CDATA[rmcnairn@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ross McNairn]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Simple product frameworks: Frequency]]></title><description><![CDATA[Frequency is the lifeblood of sustainable growth]]></description><link>https://www.rossmcnairn.com/p/the-first-lesson-i-was-taught-when</link><guid isPermaLink="false">https://www.rossmcnairn.com/p/the-first-lesson-i-was-taught-when</guid><dc:creator><![CDATA[Ross McNairn]]></dc:creator><pubDate>Mon, 21 Oct 2024 09:24:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jumh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Years ago, before I made the leap into software engineering I was training as a lawyer in Edinburgh.  
I sent one cold email to the CEO of the hottest unicorn startup in the country and, astonishingly, he replied, inviting me for a coffee.</p><p>We chatted about a great many things, largely around what fundamental skills a tech entrepreneur should be building. However, the part of that meeting that sticks in my mind most acutely is the diagram he drew on the back of a napkin. &#8220;Frequency is everything&#8221;. It&#8217;s growth, it&#8217;s habit, it&#8217;s the great unlock. Over the years I heard different framings of the same guidance, such as Google&#8217;s &#8220;toothbrush test&#8221;, but this model was always the cleanest representation for me.</p><p>For over ten years I&#8217;ve been meaning to write this up in the hope that the lesson he taught me that day would in some way be helpful to others. It has guided a huge amount of how I think about building product over the years. I often shared the post below as a PDF with execs and PMs who worked for me; I&#8217;ve kept it as raw as possible.</p><p>It&#8217;s deceptively simple by design.</p><p><strong>Frequency</strong></p><p>Understanding the relationship between the frequency with which a user interacts with a given service (<em>and extracts value</em>) and the percentage of the population to which that service applies can give an indication of how quickly that service will grow organically.</p><p>The broad rationale is similar to that of general marketing psychology, where brands look to create awareness through impressions. The more front-of-mind your brand or service is, the more people think about it and speak about it, and the faster it grows.  
</p><p>At its very simplest: <strong>services with broad appeal that provide utility very often will grow faster</strong>.</p><p>Consider the following graph: along the Y axis we plot the percentage of a given population that can draw value from a company&#8217;s service; along the X axis, how often users extract value (<em>it&#8217;s worth noting that the nature of these interactions is key; email spam, for example, is likely to have the inverse effect</em>).</p><p>On a first pass (very roughly), it&#8217;s simple to place technology companies from across the spectrum onto the chart.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jumh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jumh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png 424w, https://substackcdn.com/image/fetch/$s_!jumh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png 848w, https://substackcdn.com/image/fetch/$s_!jumh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!jumh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!jumh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png" width="1456" height="1036" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1036,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:866767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jumh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png 424w, https://substackcdn.com/image/fetch/$s_!jumh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png 848w, https://substackcdn.com/image/fetch/$s_!jumh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!jumh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09571f34-ee8c-4820-9ae3-257b7ee1c099_1476x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>LinkedIn is an evolution of the CV-website Monster.com; its high-frequency social content helped drive its growth</em></p><p>In the top right you have a clustering of the tech hyper-scalers. These &#8216;category 1&#8217; companies are unlikely to have been envisaged in a world without the internet and, now, without mobile; they apply to very broad bases of users and sit at the heart of many people&#8217;s lives. They become verbs, and they typically see explosive hyper-growth. While they often have a high-frequency core model to fall back on, they aggressively execute on strategies to increase the frequency of interaction with their services, continually diversifying their offerings.  
Amazon&#8217;s Prime is there to give you a single hook into their services across multiple touch points. Meta&#8217;s M&amp;A strategy is to acquire any touchpoint where you might have free social interaction, even if those products are individually loss-making.</p><p>The next band down, the &#8216;category 2&#8217; companies, are businesses that are either relatively specialist and have not yet broadened their appeal, or are reinventions of traditional industries.</p><p>The final &#8216;category 3&#8217; companies are web-enabled businesses. Largely a replica of an offline business model, they could be considered a bricks-and-mortar business with a website.</p><p>Of particular note is the evolution that some models are capable of going through. For example, a recruitment site like Monster.com was evolved by Reid Hoffman into LinkedIn, which <a href="https://www.slideshare.net/a16z/network-effects-59206938?ref=http://wpcomwidgets.com/?wpcom_origin=https%3A%2F%2Fa16z.wordpress.com">leveraged the network effect of your</a> professional network to crowd-source content and drive high-value, high-frequency interactions, moving from a CV site you would update every 10 years between jobs into a network used daily.</p><p>The example we always referenced at Skyscanner was the Edin-bus app (effectively a scheduling app for the bus). Within the population of Edinburgh, this product, which had a usage frequency of twice a day, grew organically at multiples of the rate of the Skyscanner flights app, which people used a few times a year.</p><p>Put simply, if you want to grow without paying Google, one of the cleanest and simplest north stars is frequency. 
</p>]]></content:encoded></item><item><title><![CDATA[Applied AI software engineering ]]></title><description><![CDATA[Guest post for The Pragmatic Engineer - now available for everyone.]]></description><link>https://www.rossmcnairn.com/p/applied-ai-software-engineering</link><guid isPermaLink="false">https://www.rossmcnairn.com/p/applied-ai-software-engineering</guid><dc:creator><![CDATA[Ross McNairn]]></dc:creator><pubDate>Mon, 16 Sep 2024 10:03:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/51a68e74-1cf0-437e-ad4f-de95d5ff8e4b_1782x614.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This was a guest post that I wrote for the popular engineering blog The Pragmatic Engineer. It was a paid post, accessible only to subscribers, for the past few months; I wanted to make it generally available on this blog.</p><p>Many of these lessons were learnt in the early stages of building <a href="https://www.wordsmith.ai/">Wordsmith.ai</a>, a legal toolkit for in-house lawyers. We give them a suite of tools that are usually only available to lawyers who are part of major firms, from document management to professional support lawyers and chatbots.</p><p>I hope this is in some way helpful.</p><p>Today, we cover:</p><ol><li><p>Providing an LLM with additional context</p></li><li><p>The simplest RAGs</p></li><li><p>What is a RAG pipeline?</p></li><li><p>Preparing the RAG pipeline data store</p></li><li><p>Bringing it all together</p></li><li><p>RAG limitations</p></li><li><p>Real-world learnings building RAG pipelines</p></li></ol><p>Today&#8217;s article includes a &#8220;code-along,&#8221; so you can build your own RAG. View the code used in this article <a href="https://github.com/wordsmith-ai/hello-wordsmith">at this GitHub repository: hello-wordsmith</a>.</p><h4>Introduction</h4><p>This post is designed to help you get familiar with one of the most fundamental patterns of AI software engineering: RAG, aka Retrieval Augmented Generation.</p><p>I co-founded a legal tech startup called <a href="https://www.wordsmith.ai/">Wordsmith</a>, where we are building a platform for running a modern in-house legal team. Our founding team previously worked at Meta, Skyscanner, TravelPerk and KPMG.</p><p>We are working in a targeted domain &#8211; legal texts &#8211; and building AI agents to give in-house legal teams a suite of AI tools to remove bottlenecks and improve how they work with the rest of the business. Performance and accuracy are key characteristics for us, so we&#8217;ve invested a lot of time and effort in how best to enrich and &#8220;turbo charge&#8221; these agents with custom data and objectives.</p><p>We ended up building our own RAG pipeline, and I will now walk you through how we did it and why. 
We&#8217;ll go into our learnings and <a href="https://www.wordsmith.ai/benchmarks">how we benchmark</a> our solution. I hope that the lessons we learned are useful for all budding AI engineers.</p><h2>1. Providing an LLM with additional context</h2><p>Have you ever asked ChatGPT a question it does not know how to answer, or one it answers only at a high level? We&#8217;ve all been there, and all too often, interacting with a GPT feels like talking to someone who speaks really well but doesn&#8217;t know the facts. Even worse, it can make up the information in its responses!</p><p>Here is one example. On 1 February 2024, during an earnings call, Mark Zuckerberg laid out the strategic benefits of Meta&#8217;s AI strategy. But when we ask ChatGPT a question about this topic, the model makes up an answer that is high-level, but is not really what we want:</p><p><em>ChatGPT 3.5&#8217;s answer to a question about Meta&#8217;s AI strategy. The answer is generalized, and misses a critical source which answers the question</em></p><p>This makes sense, as the model&#8217;s training cutoff date was before Mark Zuckerberg made the comments. If the model had access to that information, it would likely have been able to summarize the facts of that meeting, which are:</p><blockquote><p>&#8220;So I thought it might be useful to lay out the strategic benefits [of Meta&#8217;s open source strategy] here. (...)</p><p>The short version is that open sourcing improves our models. (...)<br><br>First, open-source software is typically safer and more secure as well as more compute-efficient to operate due to all the ongoing feedback, scrutiny and development from the community. (...)<br><br>Second, open-source software often becomes an industry standard. (...)</p><p>Third, open source is hugely popular with developers and researchers. 
(...)</p><p>The next part of our playbook is just taking a long-term approach towards the development.&#8221;</p></blockquote><p><strong>LLMs&#8217; understanding of the world is limited to the data they&#8217;re trained on. </strong>If you&#8217;ve been using ChatGPT for some time, you might remember this constraint from earlier versions of ChatGPT, when the bot responded &#8220;I have no knowledge after April 2021&#8221; in several cases.</p><h4>Providing an LLM with additional information</h4><p>There is often a bunch of additional information you want an LLM to use. In the above example, I might have the transcripts of all of Meta&#8217;s shareholder meetings that I want the LLM to use. But how can we provide this additional information to an existing model?</p><h4>Option 1: input via a prompt</h4><p>The most obvious solution is to input the additional information via a prompt; for example, by prompting &#8220;Using the following information: [input a bunch of data] please answer the question of [ask your question].&#8221;</p><p>This is a pretty good approach. The biggest problem is that it may not scale, for these reasons:</p><ul><li><p><strong>The input token limit</strong>. Every model has an input prompt token limit: 2,048 tokens for GPT-3, 32,768 for GPT-4, and 4,096 for Anthropic models. Google&#8217;s Gemini model allows for an impressive one-million-token limit. While a million-token limit greatly increases the possibilities, it might still be too low for use cases with a lot of additional text to input.</p></li><li><p><strong>Performance.</strong> The performance of LLMs substantially decreases with longer input prompts; in particular, you get degradation of context in the middle of your prompt. Even when creating long input prompts is a possibility, the performance tradeoff might make it impractical.</p></li></ul><h4>Option 2: fine-tune the model</h4><p>We know LLMs are based on massive matrices of weights. 
<em>Read more on <a href="https://newsletter.pragmaticengineer.com/i/141865286/how-does-chatgpt-work-a-refresher">how ChatGPT works in this Pragmatic Engineer issue.</a> All LLMs use the same principles.</em></p><p>One option is to update these weight matrices based on the additional information we&#8217;d like our model to know. This can be a good option, but it carries a much higher upfront cost in terms of time, money, and computing resources. It can also only be done with access to the model&#8217;s weights, which is not the case when you use ChatGPT, Anthropic&#8217;s models, and other &#8220;closed source&#8221; models.</p><h4>Option 3: RAG</h4><p>The term &#8216;RAG&#8217; originated in a <a href="https://arxiv.org/pdf/2005.11401.pdf">2020 paper</a> led by Patrick Lewis. One thing many people notice is that &#8220;Retrieval Augmented Generation&#8221; sounds a bit ungrammatical. Patrick agrees, and has said this:</p><blockquote><p>&#8220;We always planned to have a nicer-sounding name, but when it came time to write the paper, no one had a better idea.&#8221;</p></blockquote><p>RAG is a collection of techniques which help to modify an LLM, so it can fill in the gaps and speak with authority; some RAG implementations even let you cite sources. The biggest benefits of the RAG approach:</p><p><strong>1. Give an LLM domain-specific knowledge.</strong> You can pick what data you want your LLM to draw from, and even turn it into a specialist on any topic there is data about.</p><p>This flexibility means you can also extend your LLM&#8217;s awareness far beyond the model&#8217;s training cutoff date, and even expose it to near-real-time data, if available.<br><br></p><p><strong>2. Optimal cost and speed</strong>. 
For all but a handful of companies, it's impractical to even consider training their own foundational model as a way to personalize the output of an LLM, due to the very high cost and skill thresholds.&nbsp;</p><p>In contrast, deploying a RAG pipeline will get you up-and-running relatively quickly for minimal cost. The tooling available means a single developer can have something very basic functional in a few hours.<br><br></p><p><strong>3. Reduce hallucinations.</strong> &#8220;Hallucination&#8221; is the term for when LLMs &#8220;make up&#8221; responses. A well-designed RAG pipeline that presents relevant data will all but eliminate this frustrating side effect, and your LLM will speak with much greater authority and relevance on the domain about which you have provided data.<br><br>For example, in the legal sector it&#8217;s often necessary to ensure an LLM draws its insight from a specific jurisdiction. Take the example of asking a model a seemingly simple question, like:</p><p>How do I hire someone?<br><br>Your LLM will offer context based on the training data. However, you do <em>not</em> want the model to extract hiring practices from a US state like California, and combine this with British visa requirements!&nbsp;</p><p>With RAG, you control the underlying data source, meaning you can scope the LLM to only have access to a single jurisdiction&#8217;s data, which ensures responses are consistent.<br><br></p><p><strong>4. Better transparency and observability</strong>. Tracing inputs and answers through LLMs is very hard. The LLM can often feel like a &#8220;black box,&#8221; where you have no idea where some answers come from. With RAG, you see the additional source information injected, and debug your responses.</p><h2>2. The simplest RAGs</h2><p>The best way to understand new technology is often just to play with it. Getting a basic implementation up and running is relatively simple, and can be done with just a few lines of code. 
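</p><p>Before diving into the wrapper, it can help to see the core idea stripped down to plain Python. The following is a toy sketch for intuition only: it scores chunks by word overlap, where a real pipeline would use vector embeddings, and it stops at building the prompt rather than calling a model.</p>

```python
import re

def tokens(text: str) -> set[str]:
    """Lower-case word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    # Rank chunks by how many words they share with the query.
    ranked = sorted(chunks, key=lambda c: len(tokens(query) & tokens(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # "Stuff" the retrieved context into the prompt that would be sent to the LLM.
    context = "\n".join(retrieve(query, chunks))
    return f"Using the following information:\n{context}\n\nPlease answer: {query}"

chunks = [
    "The president holds office for a term of four years.",
    "Congress has the power to declare war.",
]
print(build_prompt("How long is the president's term?", chunks))
```

<p>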
To help, Wordsmith has <a href="https://github.com/wordsmith-ai/hello-wordsmith">created a wrapper</a> around the <a href="https://www.llamaindex.ai/">LlamaIndex</a> open source project to abstract away some of the complexity, so you can <a href="https://github.com/wordsmith-ai/hello-wordsmith">get up and running</a> easily. Its README will get you set up with a local RAG pipeline on your machine that chunks and embeds a copy of the US Constitution and lets you search it from your command line.</p><p>This is as simple as RAGs get; you can &#8220;swap out&#8221; the additional context provided in this example by simply changing the source text documents!</p><p>This article is designed as a code-along, so I'm going to link you to sections of <a href="https://github.com/wordsmith-ai/hello-wordsmith">this repo</a> so you can see where specific concepts manifest in code.</p><p>To follow along with the example, you will need:</p><ul><li><p>An active OpenAI subscription with API usage. <a href="https://platform.openai.com/settings/organization/billing/overview">Set one up here</a> if needed. <em>Note: running a query will cost in the realm of $0.25-$0.50 per run.</em></p></li><li><p><a href="https://github.com/wordsmith-ai/hello-wordsmith">Follow the instructions</a> to set up a virtual Python environment, configure your OpenAI key, and start the virtual assistant.</p></li></ul><p>This example loads the text of the US Constitution <a href="https://github.com/wordsmith-ai/hello-wordsmith/blob/main/hello_wordsmith/public_wordsmith_dataset/us_constitution.txt">from this text file</a> as a RAG input. 
However, the application can be extended to load your own data from a text file, and to &#8220;chat&#8221; with this data.</p><p>Here&#8217;s an example of how the application works when set up, and when the OpenAI API key is configured:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PhkO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PhkO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png 424w, https://substackcdn.com/image/fetch/$s_!PhkO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png 848w, https://substackcdn.com/image/fetch/$s_!PhkO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png 1272w, https://substackcdn.com/image/fetch/$s_!PhkO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PhkO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png" width="1180" height="772" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:772,&quot;width&quot;:1180,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PhkO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png 424w, https://substackcdn.com/image/fetch/$s_!PhkO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png 848w, https://substackcdn.com/image/fetch/$s_!PhkO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png 1272w, https://substackcdn.com/image/fetch/$s_!PhkO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F918d64e1-8364-4b6a-89fd-71b9a0007e7a_1180x772.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>The example RAG pipeline application answering questions using the US Constitution supplied as additional context</em></p><p>If you&#8217;ve followed along and have run this application: congratulations! You have just executed a RAG pipeline. Now, let&#8217;s explain how it works.</p><h2>3. What is a RAG pipeline?</h2><p>A RAG pipeline is the collection of technologies needed to enable an LLM to answer questions using provided context. 
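</p><p>Stripped of infrastructure, that capability can be sketched as a retrieve-then-generate function. Everything below is a hypothetical stand-in: <code>Store</code> plays the role of a vector data store (using toy word overlap instead of embedding similarity), and <code>llm</code> stands in for a real model API call.</p>

```python
class Store:
    """Hypothetical stand-in for a vector data store."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def search(self, query: str) -> str:
        # Return the stored text most similar to the query
        # (toy word overlap in place of embedding similarity).
        q = set(query.lower().split())
        return max(self.docs, key=lambda d: len(q & set(d.lower().split())))

def rag_answer(query: str, store: Store, llm) -> str:
    context = store.search(query)                         # fetch similar concepts
    prompt = f"Context:\n{context}\n\nQuestion: {query}"  # recombine with the query
    return llm(prompt)                                    # generate the final answer

store = Store(["Senators serve six year terms.", "The capital is Washington."])
# An echo function stands in for the model call:
print(rag_answer("How long do senators serve", store, lambda p: p))
```

<p>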
In our example, this context is the US Constitution, and our LLM is enriched with additional data extracted from the US Constitution document.</p><p>Here are the steps to building a RAG pipeline:</p><p><strong>Step 1:</strong> Take an inbound query and deconstruct it into relevant concepts<br><strong>Step 2:</strong> Collect similar concepts from your data store<br><strong>Step 3:</strong> Recombine these concepts with your original query to build a more relevant, authoritative answer.</p><p>Weaving this together:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pvwb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!Pvwb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png" width="1456" height="1471" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p><em>A RAG pipeline at work. It extends the context an LLM has access to by fetching similar concepts from the data store to answer a question</em></p><p>While this process appears simple, there is quite a bit of nuance in how to approach each step. A number of decisions are required to tailor it to your use case, starting with how to prepare the data for use in your pipeline.</p><h2>4. Preparing the RAG pipeline data store</h2><p>To start, we need to identify the data we will use to enrich responses with. 
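The three retrieval steps above can be sketched end-to-end in code. This is a minimal sketch: embed_text, vector_store.search, and llm.complete are hypothetical stand-ins, not any specific library's API.

```python
# Sketch of the three RAG steps. The embed_text(), vector_store.search(),
# and llm.complete() helpers are hypothetical stand-ins, not a specific
# library's API.

def answer_with_rag(query, vector_store, llm, embed_text, top_k=3):
    # Step 1: deconstruct the inbound query into an embedding vector
    query_vector = embed_text(query)
    # Step 2: collect the most similar chunks from the data store
    chunks = vector_store.search(query_vector, top_k=top_k)
    # Step 3: recombine the chunks with the original query
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only this context:\n"
        f"{context}\n\nQuestion: {query}"
    )
    return llm.complete(prompt)
```

Everything that follows in this section is about making Step 2 work well: preparing the data store that `search` runs against.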
We then load it into a vector data store, which is what we use during the search phase.</p><p><em>Steps in preparing a RAG pipeline</em></p><h4>Load the data</h4><p>Before we can harness the data, we need to convert it into a format that lets us perform the manipulations we will need, such as creating embeddings (basically, vectors.)</p><p>In our example RAG pipeline, we use <a href="https://www.llamaindex.ai/">LlamaIndex</a>, a comprehensive data framework that bridges data storage and makes this data accessible to LLMs. <em>Another popular option for an LLM data framework is <a href="https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf">Langchain</a>.</em></p><p>Here&#8217;s how we prepare our data <a href="https://github.com/wordsmith-ai/hello-wordsmith/blob/main/hello_wordsmith/datastores.py#L38-L42">in the code</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QEh6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9f338e9-d1ab-4e51-b8ef-d6edca0a5af0_1328x204.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!QEh6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9f338e9-d1ab-4e51-b8ef-d6edca0a5af0_1328x204.png" width="1328" height="204" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p><em>Loading all files in the &#8220;public_wordsmith_dataset&#8221; directory. For now, this is one file: <a href="https://github.com/wordsmith-ai/hello-wordsmith/blob/main/hello_wordsmith/public_wordsmith_dataset/us_constitution.txt">us_constitution.txt</a>. You can add any file to this directory to include it in the RAG pipeline.</em></p><p><strong>Cleaning up data before storing it</strong> is a common step in real-world RAG applications, but we&#8217;ve omitted it from our simple use case. For example, if your application uses web page data as HTML files in the RAG pipeline, then you&#8217;ll need to add preprocessing to remove HTML tags and anything else that is irrelevant for text processing.</p><p>There is an ever-growing list of services and tools that assist with cleaning data. <a href="https://www.firecrawl.dev/">FireCrawl</a> is a good choice for working with web pages: it helps you get from &#8220;raw&#8221; data to &#8220;cleaned&#8221; data. There are many similar tools which clean data for AI use cases; it&#8217;s a vibrant and fast-evolving part of the AI ecosystem.</p><h4>Split and chunk the data</h4><p>Once we&#8217;ve loaded and cleaned the data, we want to split our document into &#8216;chunks.&#8217; These are the parts we want to retrieve and pass on as context to our LLM.</p><p>With RAG pipelines, it&#8217;s common to work with long documents, such as wiki or Confluence pages, contracts, and other lengthy documentation. So, why not just feed the whole document into the LLM? 
Why &#8220;chunk it up?&#8221; The reason is that feeding a long document into an LLM can cause these issues:</p><ul><li><p><strong>Degraded performance.</strong> LLMs like ChatGPT use <a href="https://en.wikipedia.org/wiki/Attention_(machine_learning)">self-attention</a> (every token being aware of every other token), which <a href="https://newsletter.pragmaticengineer.com/i/141865286/scalability-challenge-from-self-attention">scales quadratically</a> with input length. When predicting the 100th token, around 10,000 operations are needed; to predict the 1,000th token, circa 1 million operations are needed. The longer the input, the slower the output of the LLM.</p></li><li><p><strong>Less useful output.</strong> We&#8217;ve observed that inputting a long document results in the LLM receiving a lot of irrelevant data, and responses can become confusing.</p></li><li><p><strong>Increased cost.</strong> The longer the input, the higher the cost of operating the model. This cost is crystal clear when using an API like OpenAI&#8217;s, which bills you per token. If running your own infrastructure, you&#8217;ll observe higher compute resource usage, which translates to higher compute cost.</p></li></ul><p>Therefore, deconstructing a long document into &#8220;small enough&#8221; pieces is a smart move. With these pieces sized correctly, we can pass in only the relevant ones, making the LLM&#8217;s answers more specific, accurate, and faster, at a lower cost!</p><p><strong>Deciding how to chunk your data is a major decision with RAG pipelines.</strong> There are many options to choose from when chunking data, but each choice has its own set of tradeoffs. 
Here are some examples:</p><ul><li><p>The simplest approach: break the text into 250-500 character chunks.</p></li><li><p>A slightly more advanced approach: split the text by paragraph.</p></li><li><p>An even more advanced option: divide chunks by &#8220;concept&#8221; and do some preprocessing to make breakpoints between chunks logical.</p></li></ul><p>Chunking is an area you can get very deep into. There&#8217;s a variety of strategies to use, which the article <a href="https://www.pinecone.io/learn/chunking-strategies/">Chunking strategies for LLM applications</a> by Roie Schwaber-Cohen goes through:</p><ul><li><p><strong>Fixed-size chunking:</strong> split by the number of tokens</p></li><li><p><strong>&#8220;Content-aware&#8221; chunking:</strong> chunking by sentence</p></li><li><p><strong>Recursive chunking:</strong> divide the input text into smaller chunks in a hierarchical, iterative manner, using a set of separators</p></li><li><p><strong>Specialized chunking:</strong> for structured and formatted content like Markdown or LaTeX</p></li><li><p><strong>Semantic chunking:</strong> attempting to capture the meaning of segments within the document</p></li></ul><p>In general, smaller chunks tend to produce smaller, more relevant concepts when retrieved. At the same time, they can lead to very narrow responses, because small chunks may become disconnected from related chunks.</p><p>Chunking is more an art than a science. My advice is to spend plenty of time iterating your chunking strategy! 
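To make this concrete, the simplest fixed-size strategy from the lists above fits in a few lines. This is a sketch; the 512-character size and 64-character overlap are illustrative values, not a recommendation.

```python
# Fixed-size chunking with a small overlap, so text cut at a chunk
# boundary still appears whole in at least one chunk. The sizes are
# illustrative; tune them for your own data.

def chunk_text(text, chunk_size=512, overlap=64):
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Even in this tiny sketch you can see the tradeoffs: the overlap reduces the "disconnected chunks" problem at the cost of storing some text twice.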
Get it right for <em>your</em> use case and the source data you have to work with.</p><p>In our code, chunking happens <a href="https://github.com/wordsmith-ai/hello-wordsmith/blob/8399c8753841e8eff3e083cf582f2007ac2bbb8b/hello_wordsmith/wordsmith.py#L16">here</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C0io!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89e330d1-f769-4995-b4fd-837f9b32eeb4_1082x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!C0io!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89e330d1-f769-4995-b4fd-837f9b32eeb4_1082x570.png" width="1082" height="570" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p><em>The code that does the chunking. We use a fixed-size chunking strategy, breaking our document into 512-character chunks</em></p><p>This step splits the data up and embeds it, which requires a little more explanation.</p><h4>Create embeddings</h4><p>We&#8217;ve broken our documents into chunks, hooray! But how will we know which chunks are relevant to a question that&#8217;s asked?</p><p>Here&#8217;s how Evan Morikawa of OpenAI defines the concept of embeddings in the article <a href="https://newsletter.pragmaticengineer.com/i/141865286/how-does-chatgpt-work-a-refresher">Scaling ChatGPT</a>:</p><blockquote><p>&#8220;An embedding is a multi-dimensional representation of a token. We [OpenAI] explicitly train <a href="https://platform.openai.com/docs/guides/embeddings">some of our models</a> to explicitly allow the capture of semantic meanings and relationships between words or phrases. 
For example, the embedding for &#8220;dog&#8221; and &#8220;puppy&#8221; are closer together in several dimensions than &#8220;dog&#8221; and &#8220;computer&#8221; are. These multi-dimensional embeddings help machines understand human language more efficiently.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DjWF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdf1ba3-410d-4955-b602-15f47d07fddc_1600x887.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!DjWF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fdf1ba3-410d-4955-b602-15f47d07fddc_1600x887.png" width="1456" height="807" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p><em>Creating an embedding from a token (a string). Source: <a href="https://newsletter.pragmaticengineer.com/i/141865286/how-does-chatgpt-work-a-refresher">Scaling ChatGPT</a></em></p></blockquote><p>Let me offer an alternative way of thinking about embeddings, one I use, which builds on the concepts of vectors and K-nearest neighbor vector search.</p><p><em><strong>Vector.</strong></em> A vector is an array of numbers that allows a piece of information, like a sentence or a paragraph, to be expressed in a way an algorithm understands. When compared correctly, similar, related concepts sit closer to each other in the vector space.<br><br>Imagine searching chunks with keyword search or another traditional method: retrieving the records from a database would typically require that you pre-categorized all the data, or had an index to look them up with. But we want something more flexible, as we don't have well-structured metadata for every query we might want to run. 
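To make &#8220;closer in the vector space&#8221; concrete, here is cosine similarity over hand-made 3-dimensional vectors. The numbers are invented for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means the
    # vectors point in similar directions, close to 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors: "dog" and "puppy" point in similar directions,
# while "computer" points elsewhere.
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
computer = [0.1, 0.2, 0.9]
```

With these toy values, cosine_similarity(dog, puppy) comes out far higher than cosine_similarity(dog, computer), which is exactly the property that vector search exploits.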
Using vectors makes it easier to deal with this kind of problem space.</p><p><em><strong>K-nearest neighbors (KNN.)</strong></em> KNN is an algorithm that takes a collection of vectors and organizes them based on how similar they are to each other. Using KNN on a collection of vectors, we find &#8220;similar concept groups.&#8221;</p><p>A vector embedding is a vector that represents a concept like a token, a sentence, a paragraph, or anything else. It&#8217;s effectively the &#8220;fingerprint of an idea.&#8221;</p><p>Using vector embeddings makes it simpler for the AI model to interact with these concepts. It also makes it straightforward to search for similar concepts when it wants to query your database. For example, the vector of &#8216;apple&#8217; and the vector of &#8216;pear&#8217; will be more similar than the vectors of &#8216;apple&#8217; and &#8216;app&#8217;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Bk9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b98e2ac-e4bf-4ed4-9816-15674fce68f3_1382x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!4Bk9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b98e2ac-e4bf-4ed4-9816-15674fce68f3_1382x566.png" width="1382" height="566" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p><em>Turning words into embeddings and visualizing them in a 2D space. 
Source: <a href="https://medium.com/@hari4om/word-embedding-d816f643140">Hariom Gautam on Medium</a></em></p><p><strong>Pre-trained embedding models</strong> generate a vector embedding from any input. Thanks to the pre-training, they already categorize inputs reliably enough.</p><p>OpenAI offers <a href="https://platform.openai.com/docs/guides/embeddings">an API called Embeddings</a> that can be used to process chunks. Feed in a chunk, and receive an embedded vector in return. Of course, using this API comes with its own cost. The collected works of William Shakespeare are 3,000 pages long, or circa 835,000 words. Embedding the complete text with OpenAI&#8217;s ada v2 embedding model would cost about $0.10 (as the cost is <a href="https://openai.com/api/pricing/">$0.10 per 1M tokens</a>, and an English word usually comes to <a href="https://gpt.space/blog/understanding-openai-gpt-tokens-a-comprehensive-guide">about 1.3 tokens</a>).</p><p>When we query our data, the query goes through a similar embedding process. The semantics of the query are deconstructed into vectors, which are then easy to compare with the stored data.</p><p>The open source community also offers some exceptional options. 
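As an aside, the Shakespeare cost estimate above is simple arithmetic (using the word count, tokens-per-word ratio, and per-token price quoted in this section; check current pricing before relying on the figure):

```python
# Back-of-the-envelope embedding cost: ~835,000 words at ~1.3 tokens per
# English word, priced at $0.10 per 1M tokens (figures as quoted above;
# verify current pricing before relying on this).
words = 835_000
tokens = words * 1.3
cost_usd = tokens / 1_000_000 * 0.10  # just over ten cents
```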
The online AI community, Hugging Face, has <a href="https://huggingface.co/spaces/mteb/leaderboard">a leaderboard of the best embedding models available</a>, ranking them according to the Massive Text Embedding Benchmark (MTEB.)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GGP3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GGP3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png 424w, https://substackcdn.com/image/fetch/$s_!GGP3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png 848w, https://substackcdn.com/image/fetch/$s_!GGP3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png 1272w, https://substackcdn.com/image/fetch/$s_!GGP3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GGP3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png" width="1456" height="989" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:989,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GGP3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png 424w, https://substackcdn.com/image/fetch/$s_!GGP3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png 848w, https://substackcdn.com/image/fetch/$s_!GGP3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png 1272w, https://substackcdn.com/image/fetch/$s_!GGP3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbc0a3dc-c810-4d49-a5f0-c6c7eab838d7_1600x1087.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>A screenshot of LLM models ranked by MTEB. Source: <a href="https://huggingface.co/spaces/mteb/leaderboard">Hugging Face</a></em></p><p>What the best model is for you will depend on your use case, and most RAG pipelines will want to use a semantic embedding model like <a href="https://huggingface.co/docs/transformers/main/en/model_doc/bert">Bidirectional Encoder Representations from Transformers (BERT.)</a></p><h4>Store the data</h4><p>Now we&#8217;ve extracted and cleaned the data, chunked it, and created our embedding, it&#8217;s time to store it. We need to choose a database that&#8217;s effective at running our KNN operations. Vector databases are tailored to support KNN lookup with great performance, so using one will provide optimal performance for search.</p><p>Popular vector databases include <a href="https://www.pinecone.io/">Pinecone</a> and <a href="https://weaviate.io/">Weaviate</a>. Additionally, all major cloud providers offer multiple vector databases. 
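</p><p>To make the KNN lookup concrete, here is a minimal in-memory sketch in Python of what a vector store does. It is a toy with 3-dimensional embeddings and a full scan, and all names in it are hypothetical; real stores use high-dimensional vectors and approximate indexes such as HNSW:</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class ToyVectorStore:
    """In-memory stand-in for a vector database: stores (embedding, chunk) pairs."""
    def __init__(self):
        self.rows = []

    def add(self, embedding, chunk):
        self.rows.append((embedding, chunk))

    def knn(self, query_embedding, k):
        # Rank every stored chunk by similarity to the query. Real vector
        # databases avoid this full scan with approximate nearest-neighbor indexes.
        ranked = sorted(self.rows, key=lambda row: cosine(row[0], query_embedding), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

store = ToyVectorStore()
store.add([0.9, 0.1, 0.0], "Chunk about governing law")
store.add([0.1, 0.9, 0.0], "Chunk about payment terms")
store.add([0.8, 0.2, 0.1], "Chunk about jurisdiction")

# The two chunks closest to the query embedding are returned.
print(store.knn([1.0, 0.0, 0.0], k=2))
```

<p>A production system would swap this class for a vector database client, but the retrieval contract is the same: embeddings in, the top-k most similar chunks out.</p><p>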
Vector databases are a fast-evolving space, so you need to do research.</p><p>When prototyping, you can get away with using a more traditional database like MySQL or PostgreSQL to store embeddings. Should your application receive production traffic, the performance of these SQL databases will likely become critical enough to justify moving to a vector-based one.</p><h2>5. Bringing it all together</h2><p>With our data pipeline prepared, the remaining steps are surprisingly simple!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pvwb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pvwb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png 424w, https://substackcdn.com/image/fetch/$s_!Pvwb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png 848w, https://substackcdn.com/image/fetch/$s_!Pvwb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png 1272w, https://substackcdn.com/image/fetch/$s_!Pvwb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Pvwb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png" width="1456" height="1471" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1471,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pvwb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png 424w, https://substackcdn.com/image/fetch/$s_!Pvwb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png 848w, https://substackcdn.com/image/fetch/$s_!Pvwb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png 1272w, https://substackcdn.com/image/fetch/$s_!Pvwb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f7ed7a-fb3a-4f56-8115-6379b094e06d_1556x1572.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>A RAG pipeline at work. The data store is ready and we just need to do steps 2 and 3</em></p><p>The work that&#8217;s left to do:</p><p><strong>Step 2: </strong>Collect similar concepts from the data store. We use a vector database to query the data.&nbsp;</p><p>In our code, the very bottom line does the retrieval:</p><p><em>Collecting similar concepts (chunks) from our stored data. 
See this line <a href="https://github.com/wordsmith-ai/hello-wordsmith/blob/main/hello_wordsmith/query_pipeline.py#L49">in the example codebase</a></em></p><p>The &#8220;retriever&#8221; variable now contains the 20 most similar chunks (as the value of _TOP_K_RETRIEVAL is 20).&nbsp;</p><p><strong>Step 3:</strong> Recombine these concepts with the original query to build a more relevant and authoritative answer.</p><p>With the related chunks available, we now create the updated query that we want to pass into the LLM:<br><br></p><blockquote><p>Context information from multiple sources is below.</p><p>---------------------</p><p>{LIST OF THE 20 CHUNKS}</p><p>---------------------</p><p>Given the information from multiple sources and not prior knowledge, answer the query.</p><p>Query: {ORIGINAL QUERY}</p><p>Answer:</p></blockquote><p>In our code, we create the above string by filling out the list of the 20 most relevant chunks (as <em>context_str</em>) and the original query (as <em>query_str</em>):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uJ3c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uJ3c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png 424w, https://substackcdn.com/image/fetch/$s_!uJ3c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png 848w, 
https://substackcdn.com/image/fetch/$s_!uJ3c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!uJ3c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uJ3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png" width="1456" height="1490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1490,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uJ3c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png 424w, https://substackcdn.com/image/fetch/$s_!uJ3c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png 848w, 
https://substackcdn.com/image/fetch/$s_!uJ3c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!uJ3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb35f44-eb51-42c2-a6a3-38cfcd927a59_1564x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>The code to generate our query pipeline. This is it! 
<a href="https://github.com/wordsmith-ai/hello-wordsmith/blob/main/hello_wordsmith/query_pipeline.py#L24">Browse the code here.</a></em></p><p>If you&#8217;ve followed this code-along, then congratulations! You now know how to code a simple RAG pipeline!</p><h2>6. RAG limitations</h2><p>RAG is a powerful set of tools that can help you focus AI onto your data. However, it's not a perfect fit for all use-cases, and there are some areas in which RAG does poorly.</p><h4>Summarization</h4><p>RAG will not produce great output for summarization. Because documents are broken into many small sections, results will be poor for any question that needs context from an <em>entire</em> document. For example, if asked: &#8220;give a summary of the key points in our contract with Microsoft,&#8221; an LLM can only perform well if it processes the entire contract, not just the 2 or 3 chunks that happen to look like &#8220;key parts.&#8221;</p><p>To handle summarization well, consider detecting such use cases and routing summarization queries to a different pipeline, one that loads the entire document to be summarized. Doing so is slower and more expensive, but it&#8217;s the only way to get an accurate response.</p><h4>Multi-part, or hybrid questions</h4><p>More complex questions often have an element of reasoning. Take the question:</p><p>&#8220;What percentage of agreements have their governing law in South Africa?&#8221;</p><p>This question needs to be broken down, and data needs to be collected from several sources to determine the correct answer. To answer this specific question correctly, we need to follow these steps:</p><ol><li><p>Retrieve all agreements with their governing law in South Africa.</p></li><li><p>Determine how many such agreements there are.</p></li><li><p>Do the math to combine these outputs into a percentage. 
It&#8217;s <em>important to note that most LLMs are not good at math, and you may not want to use an LLM for this step at all!</em></p></li></ol><p>To do a good job with such a complex query, RAG alone is insufficient, although it&#8217;s necessary at some steps. For the best results, build a higher-level orchestration layer that coordinates other AI agents in the pipeline to process complex queries.</p><h2>7. Real-world learnings</h2><p>We&#8217;ve only scratched the surface of RAG in building this simple pipeline. For the more technically-minded reader who wants to experiment in this space, below are seven things that we at Wordsmith wish we&#8217;d known earlier, which would&#8217;ve saved time while building our AI solutions.</p><p><strong>#1: Natural language is not always the best input for an LLM</strong></p><p>Instead of using plain text, it is often better to use more structured input and output with LLMs; for example, <a href="https://en.wikipedia.org/wiki/JSON">JSON</a>. A structured format simplifies parsing of the results, and can also increase the quality of an answer, because additional structured metadata can be passed along with the chunks of text to the prompt. The model can then be asked to produce additional outputs, and these extra outputs can help improve the answer itself.</p><p>This structured approach helps make your instructions highly targeted and very precise.&nbsp;</p><p><strong>#2: The quality of your evaluations (&#8220;evals&#8221;) is critical for making reliable progress</strong></p><p>&#8220;Evals&#8221; refers to a set of scenarios used to grade the quality of an agent's responses to questions. Each scenario has an input and an expected output, which you can run as you go. LLMs have inherently unpredictable output, meaning there&#8217;s a lot of trial and error, so it&#8217;s essential to have a strong set of test cases for tracking your progress.&nbsp;</p><p>Invest time in defining these &#8220;evals&#8221; upfront. 
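</p><p>As a sketch of what such a test suite can look like, here is a minimal eval harness in Python. Everything in it is hypothetical: <em>rag_pipeline</em> stands in for your real query pipeline, and the scenarios and graders are illustrative only:</p>

```python
# Minimal "evals" harness: each scenario pairs a question with a grading
# function, and the whole suite is re-run after every pipeline change.

def rag_pipeline(question: str) -> str:
    # Hypothetical stand-in for the real RAG query pipeline.
    return "The governing law of the agreement is Scots law."

EVALS = [
    # (question, grader applied to the pipeline's answer)
    ("What is the governing law?", lambda answer: "scots law" in answer.lower()),
    ("What is the governing law?", lambda answer: len(answer) < 500),  # concise answers only
]

def run_evals():
    results = [grade(rag_pipeline(question)) for question, grade in EVALS]
    print(f"{sum(results)}/{len(results)} evals passed")
    return sum(results), len(results)

run_evals()
```

<p>Real suites grade fuzzier properties too, often by asking a second LLM to score an answer against a reference, but even simple assertions like these catch regressions as you tune chunking and prompts.</p><p>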
I&#8217;ve written more about <a href="https://www.wordsmith.ai/benchmarks">how we approached our evaluation criteria</a>.</p><p><strong>#3: Improve performance by asking the LLM to do extra things</strong></p><p>Here are two ways to significantly improve performance:&nbsp;</p><ol><li><p>Combine the context with the original question, then ask the LLM to rebuild a better answer.&nbsp;</p></li><li><p>Ask the LLM to capture the user&#8217;s intent from the original question, and offer its reasoning.</p></li></ol><p>In my experience, both approaches improved the output&#8217;s quality.<br><br></p><p><strong>#4: Get the token size right for the LLM context&nbsp;</strong></p><p>If you feed too few chunks or too little context into the LLM, you&#8217;ll get narrow, lightweight answers. Feed in too many chunks and too much context, and the model will start overlooking essential information and get confused. Experiment to find the right chunk size and the right number of tokens for better performance.</p><p>A simple reference point that seems to work well is passing about 16,000 tokens to <a href="https://help.openai.com/en/articles/8555510-gpt-4-turbo-in-the-openai-api">GPT-4 Turbo</a>.</p><p><strong>#5: Chunking matters a lot &#8211; A LOT!</strong></p><p>A way to improve performance is to blend multiple chunking strategies, creating overlapping chunks. This builds resilience into the data, so a search surfaces the most relevant form of each passage.</p><p>For example, create fixed-size chunks at both 2,500 characters and 500 characters. Calculate the embeddings for both options, which means embedding the same data several times. 
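</p><p>The idea can be sketched in a few lines of Python. The chunk sizes here are shrunk for readability, and <em>fixed_size_chunks</em> is an illustrative helper, not a function from the example codebase:</p>

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

document = "abcdefghij" * 4  # 40-character stand-in for a real document

coarse = fixed_size_chunks(document, size=20)           # fewer, larger chunks
fine = fixed_size_chunks(document, size=8, overlap=2)   # smaller, overlapping chunks

# Both granularities are embedded and indexed side by side, so the same
# text is embedded several times.
index = [("coarse", c) for c in coarse] + [("fine", c) for c in fine]
print(len(coarse), len(fine), len(index))
```

<p>The overlap means a sentence that straddles one chunk boundary is still intact in a neighboring chunk.</p><p>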
During a search, your system will retrieve and use the best-fitting chunk, which could be the shorter or the longer one!</p><p><strong>#6: Use suitable document parsers</strong></p><p>There are many open source solutions for parsing and pre-processing documents, so take some time to research the ones whose output best fits your use case.&nbsp;</p><p>Document parsers tend to struggle with certain formats, like nested numbered lists. But at Wordsmith, these are very important in contracts and legal documents! So we had to &#8220;hand-roll&#8221; a custom solution, after failing to find an open source document parser that did the job.</p><p><strong>#7: Use output formatting which the model is comfortable with</strong></p><p>Each foundation model has been trained on different source data, which means each model will work better or worse with certain input and output formats. For example, GPT-4 and Mistral are efficient when using JSON and Markdown, suggesting they have been extensively trained on this kind of data. Meanwhile, Claude seems to work well with Markdown, but less so with JSON. Experiment with models, learn which formats work better, and choose models based on their strengths.</p><p><strong>Beyond RAG</strong></p><p>RAG is the foundation of nearly every LLM application, and this space is moving fast. I sense a &#8220;new dawn&#8221; is starting to break in multi-agent orchestration and interaction. These new architectures will give developers the ability to chain many agents together, with each one performing a specialist role in a pipeline. Such advanced pipelines will help progress beyond many of the constraints with which basic RAG approaches struggle.</p><p>Right now, the cost of running LLMs can still be pretty high; for example, our test suite cost $30 to execute on each run. 
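</p><p>A back-of-the-envelope calculation shows why: the retrieved context, not the question, dominates the token count of every query. The price and token counts below are illustrative assumptions, so check your provider&#8217;s current pricing:</p>

```python
# Rough input-token cost of a single RAG query.
PRICE_PER_MILLION_INPUT_TOKENS = 5.00  # USD; an assumed GPT-4-class price

question_tokens = 50        # the user's actual question
context_tokens = 16_000     # ~20 retrieved chunks, per the reference point above
prompt_tokens = question_tokens + context_tokens

cost = prompt_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
print(f"${cost:.2f} per query")  # roughly 8 cents, before output tokens
```

<p>At around 8 cents of input tokens per query, a test suite of a few hundred eval questions quickly adds up to tens of dollars per run.</p><p>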
However, the cost of running LLMs is falling quickly, and the performance of these tools is increasing just as fast.&nbsp;</p><p>It&#8217;s an exciting time to be building on Gen AI and LLMs. I hope this overview and code-along helps you get started!</p><p><em>A big thanks to <a href="https://www.linkedin.com/in/derek-johnston-8829841a0/">Derek</a> and <a href="https://www.linkedin.com/in/giginiak/">Gigz</a> for the effort they put into helping contribute to the <a href="https://github.com/wordsmith-ai/hello-wordsmith">hello_wordsmith repository</a>!</em></p><h2>Takeaways</h2><p><em>Gergely again. </em>Thanks very much Ross and the Wordsmith team, for this detailed walkthrough about building a RAG pipeline. You can <a href="https://www.linkedin.com/in/rossmcnairn/">follow Ross on LinkedIn</a>, <a href="https://twitter.com/rossmcnairn">X</a> or <a href="https://www.rossmcnairn.com/">subscribe to his blog</a>, where he writes on topics like <a href="https://www.rossmcnairn.com/p/ctos-view-on-a-defensible-moat-in">defensible moats in the age of generative AI</a>. Also, Wordsmith <a href="https://www.wordsmith.ai/careers/open-positions#open-positions">is hiring</a> for product engineering and sales.</p><p>My takeaways from this deep dive:</p><p><strong>Data is one of the biggest moats in most GenAI use cases. </strong>There are two types of AI startups:</p><ol><li><p>Those building foundational models, of which there are a handful and among whom OpenAI is the best known, with its GPT models. There&#8217;s also Anthropic (Claude,) Google (Gemini,) Meta (Llama,) and Mistral. These companies spend up to hundreds of millions of dollars on training these models, then offer them for use; sometimes for a fee, and sometimes for free.</p></li><li><p>Ones building applications on top of foundational models. 
The majority of startups utilize foundational models, and build creative use cases like professional headshots (Secta AI, as covered in the <a href="https://newsletter.pragmaticengineer.com/i/137627687/overview-of-the-companies">bootstrapped companies article</a>,) or Wordsmith, which offers LLM-powered tools for legal professionals.</p></li></ol><p>For the second category, which includes most startups, the two biggest advantages are speed of execution and access to unique data which competitors don&#8217;t possess. Speed of execution can be a competitive advantage, but having access to data which competitors don&#8217;t feels like the bigger, more durable advantage for a startup.</p><p><strong>RAG is one of the simplest ways to use &#8220;data as a moat&#8221; with AI models. </strong>For any company with a data moat, RAG is the simplest way to enhance any LLM model without exposing the underlying data to the outside world.</p><p>Sourcegraph has used RAG to produce superior code suggestions. Head of Engineering Steve Yegge <a href="https://sourcegraph.com/blog/rag-to-riches">wrote last December</a>:</p><blockquote><p>&#8220;Cody&#8217;s [Sourcegraph&#8217;s AI coding assistant] secret sauce and differentiator has always been Sourcegraph&#8217;s deep understanding of code bases, and Cody has tapped into that understanding to create the perfect RAG-based coding assistant, which by definition is the one that produces the best context for the LLM.</p><p>That&#8217;s what RAG (retrieval-augmented generation) is all about. You augment the LLM&#8217;s generation by retrieving as much information as a human might need in order to perform some task, and feeding that to the LLM along with the task instructions. (...)</p><p>Producing the perfect context is a dark art today, but I think Cody is likely the furthest along here from what I can see. 
Cody&#8217;s Context has graduated from &#8220;hey we have vector embeddings&#8221; to &#8220;hey we have a whole squad of engines.&#8221;&nbsp;</p></blockquote><p><strong>RAG is surprisingly easy to understand! </strong>At root, all RAG does is take an input query and add several sentences or paragraphs of additional context. Getting this additional context is an LLM search task, and performing this task involves preparing the &#8220;context&#8221; data, which is additional data which the LLM hasn&#8217;t been trained on.</p><p><strong>Unoptimized RAG is expensive! </strong>Running the example code, I saw a $1.38 charge after asking 5-10 questions from this model. I was wondering where the billing was coming from: the price of creating embeddings, or the cost of using GPT-4?</p><p>It turns out that all the cost was for using GPT-4, and the model charged $5 per 1M tokens. For each question I asked, the RAG pipeline added plenty of additional context, which made the query expensive in cost (and processing, I might add.) For a prototype approach, this cost is not a problem. However, for production use cases, heavy optimization would be needed, which could come from passing in more targeted &#8211; but less, overall &#8211; context. Or it could mean using a model that&#8217;s cheaper to run, or operating the model ourselves for better cost efficiency.</p><p><strong>A RAG pipeline is a basic building block of GenAI applications, so it&#8217;s helpful to be familiar with it. </strong>One reason for this deep dive with Ross is that RAG pipelines are common at AI startups and products, but they remain a relatively new area, meaning that a lot of &#8220;build-it-yourself&#8221; takes place. It&#8217;s easy enough to build a basic RAG pipeline, as Ross shows. The tricky part is optimizing things like chunking strategy, chunk size, and the context window.</p><p>If you need or want to build LLM applications, a RAG pipeline is a helpful early building block. 
I hope you enjoyed this deep dive into such an interesting, emerging area!</p><p>This is an intro to the very basics, and the space is developing really quickly. These techniques will get you up and running, and there is a huge amount you can build on top of them.</p>]]></content:encoded></item><item><title><![CDATA[How do I become a Product Manager?]]></title><description><![CDATA[Longer post designed as practical and actionable advice for someone wanting to move into building a career as a product manager.]]></description><link>https://www.rossmcnairn.com/p/how-do-i-become-a-product-manager</link><guid isPermaLink="false">https://www.rossmcnairn.com/p/how-do-i-become-a-product-manager</guid><dc:creator><![CDATA[Ross McNairn]]></dc:creator><pubDate>Fri, 09 Feb 2024 06:19:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ac2ed80f-9313-4fc1-8449-377529b344a8_1782x614.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#8220;How do I move into product?&#8221;</p><p>I get this question multiple times a week through LinkedIn, so I thought I&#8217;d share my thoughts.&nbsp;</p><p>At Travelperk we built one of the best '<a 
href="https://www.travelperk.com/apm-program/">product schools</a>' in southern Europe.&nbsp; Every position in the Associate Product Manager (APM) program had over 2,000 applications.&nbsp; This took a long time to establish and was one of our secret weapons for scaling effectively (see my blog on <a href="https://www.rossmcnairn.com/p/three-unicorns-three-lessons">3 lessons, 3 unicorns</a>).</p><p>There can be an assumption, due to the perceived emphasis on soft skills, that Product is somehow more accessible than engineering or other functions.&nbsp; You rarely get people outside of engineering asking &#8220;How do I become a principal engineer?&#8221;&nbsp; They know there is a technical path, often connecting back to formal tertiary education.&nbsp; With product management, the skills are just harder to identify from the outside.&nbsp; You also have a 6:1 ratio of engineers to PMs, so competition is intense.&nbsp; I&#8217;m going to try and unpick this a little and give people some structure and, hopefully, some actionable steps they can take.</p><p>Where do you even start?</p><p>My starting point is to work on fundamental skills rather than on searching for a golden introduction or opportunity to open a magic door.&nbsp; Building competencies makes you more hireable and will ultimately increase the altitude you can rise to. &nbsp; If you get an APM role at a great company, take it, but most people want to know what they can do today, without a perfect job opportunity sitting on their LinkedIn feed.</p><p>So let&#8217;s start with the outcome. Great PMs need 4 things:</p><ol><li><p>They need to <strong>understand how technology works</strong> so they can craft exceptional experiences and coordinate well with engineers.</p></li><li><p>They need to be <strong>exceptional communicators</strong>. 
So they can listen to and coordinate the business around them.</p></li><li><p>They need to be able to <strong>rapidly and intuitively analyze new domains</strong>, problems, products, and people. So they can ensure they are solving the right problems and coordinating with designers.</p></li><li><p>They need to be able <strong>to get results</strong>. So they can execute effectively and drive a team forward.</p></li></ol><p>Most of these need to be learnt while doing and are highly practical. The exception is technology, which I will come to.&nbsp; This is why &#8220;product degrees&#8221; are not really a thing. &nbsp;</p><p>So how do you learn these skills?</p><p><strong>1/ Technology</strong></p><p>In many ways, this is the simplest to learn. The internet is awash with resources, boot camps, and self-guided paths. Pick a real project, a problem, or a goal and drive towards making something real.&nbsp; When I started learning to code, I wanted to make a simple way to analyze, index, and search legal documentation at work.&nbsp; This objective forced me into Ruby, Rails, Java, JavaScript, night classes, IDEs, deploying code, GitHub, etc.</p><p>If you want the high-speed route, coding bootcamps like Le Wagon turbocharge you.&nbsp; I made all APMs graduate from one of these courses as a minimum. Do this early and you are ahead.</p><p>Ultimately your job is to build stuff people want. So get building. Making your own products end to end is the very best home workout you can do.</p><p><strong>2/ Communication</strong></p><p>This is far more applied, in that you need to polish it on the job, in industry.&nbsp; While there are fewer structural paths, you can actively make learning this a priority.&nbsp; Take every single opportunity to speak, to write, and to relay information. Analyse execs, and unpick their presentations and their emails.&nbsp; If you cannot move people, you cannot be a PM.</p><p>Slow it down and say fewer, better things. 
Think upfront and give people simple, structured output.</p><p>Write. Even if it&#8217;s just for you, start building the muscle. Structured prose irons out your thinking and radically improves its quality. It lets you get feedback loops as you read it back.&nbsp; Do it every day. Ask for feedback constantly on how you could make it more engaging and get your point across.</p><p><strong>3/ Analysis</strong></p><p>Much like engineering, this is highly accessible: start with Excel, then move into Python. This is an awesome place to leverage AI, as it can analyse and teach you at the same time.</p><p>In nearly every element of your work, if you ask &#8220;why is this happening?&#8221; enough times, there will be a hypothesis that you can unpick with data.&nbsp; Get hold of the underlying 2,500 rows of raw CSV data, dump it into Python or Excel, and start building data-informed communications. One sentence, one simple graph.</p><p>&#8220;Our performance on x has improved by 30% month on month for the past 3 years&#8221;</p><p><a href="https://www.useronboard.com/user-onboarding-teardowns/">Analyse other products</a>.&nbsp; Pattern matching is the secret weapon of PMs.&nbsp; To this day I spend hours going through and benchmarking products, making notes.&nbsp; How does their search work? Why did the PM do that? What is the advantage of that approach over X?</p><p>The number 1 mistake I see people make in this domain is that they are far too theoretical and have not spent enough time actively thinking about existing digital products while they use them.</p><p><strong>4/ Execution</strong></p><p>Start setting goals: write down what you (or the team you manage) want to achieve on daily, monthly, and annual horizons.</p><p>This really has two universal elements.</p><p>A) Set and prioritize your personal goals.
I started using an Eisenhower matrix (<a href="https://appfluence.com/">priority matrix</a>) and daily/weekly curation.&nbsp; You then just scale this up.</p><p>B) Reflection and improvement. Put time aside to work on your system and think about how you could be more effective. There are hundreds of agile frameworks, etc.; I would largely ignore these for now and focus on the core behavior of saying where you want to go and getting there. Scrum and planning tools are easy to learn at a later date.<br><br>Set real time aside to analyse how effective you and your team were at reaching those goals and start to build a mentality for getting things done. Build your personal playbook.</p><p><strong>What are the most common Entry points?</strong></p><p>I see 3 common avenues that people take to get into the function.</p><p><strong>1/ APM</strong></p><p>Associate product manager programs are typically 2 years long and have a curriculum, rotations, and mentorship. If you are serious and it&#8217;s a good school, this is an amazing way to get a structured lesson, and you will end up a very rounded and polished product manager.&nbsp; Cons are that it&#8217;s hard to get into, and many people who want lateral moves are not happy to take one step back to take two forward.&nbsp; In my view, it&#8217;s nearly always worth the title and pay cut (if you are moving laterally) to learn the craft.</p><p><strong>2/ Tech Entrepreneur</strong>&nbsp;</p><p>If you have to build a business, you need to learn most of these skills.&nbsp; Talking to clients, fighting competitors, it&#8217;s all there.&nbsp; You have the bonus that survive-or-die is an accelerator and forces you to get good quickly.
You will maybe be more &#8220;spiky&#8221; than a classically trained PM, but you will be very effective, outcome-based, and a bit of a bulldozer who will get things done.&nbsp; This will be a superpower over time.</p><p><strong>3/ Horizontal shift</strong></p><p>An engineer, a customer success agent, a lawyer, etc. Here your superpower is domain knowledge, so you might be able to shift over and be lighter on the above competencies, because the product you are working on requires more domain specialism. This is a hard shift, as often you don&#8217;t have the bandwidth to backfill some of the other gaps and are thrown in the deep end. So it&#8217;s critical to carve out the time to get familiar with the engineering or nail your comms to ensure you land.</p><p><strong>Preparing your application</strong></p><p>Finding a great transition into product is in part luck. But the more you build your baseline of competencies, the luckier you will get.</p><p>Good luck ;)&nbsp;</p>]]></content:encoded></item><item><title><![CDATA[Three Unicorns, Three lessons]]></title><description><![CDATA[I wanted to take a moment to reflect and write up some notes I've had for years, on some of the patterns and lessons I learned from my experiences scaling three unicorns.]]></description><link>https://www.rossmcnairn.com/p/three-unicorns-three-lessons</link><guid isPermaLink="false">https://www.rossmcnairn.com/p/three-unicorns-three-lessons</guid><dc:creator><![CDATA[Ross McNairn]]></dc:creator><pubDate>Mon, 22 Jan 2024 08:10:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/03933b25-353c-4b5c-915d-83fbc777c56f_1792x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>I&#8217;ve had the incredible fortune to be part of several unicorns over the past few years, typically during a period of hypergrowth.  I recently left Travelperk and felt this was a great time to retrospect.
While many elements made these companies (Skyscanner, Letgo, Travelperk) fantastic, three non-obvious patterns jumped out at me.</p><ol><li><p>Own your distribution.</p></li><li><p>You need a unique approach to attracting talent.</p></li><li><p>The customer at all costs.</p></li></ol><h3><strong>Owning Your Distribution</strong></h3><p>One of my all-time favorite blog posts that I frequently reference is a short piece by Tomasz Tunguz on <a href="https://tomtunguz.com/controlling-your-destiny/">proprietary distribution channels</a>.  Every one of these companies had a flavor of this.</p><p><strong>Skyscanner&#8217;s SEO Strategy:</strong> In the early days the team invested incredibly heavily in SEO.&nbsp; Before Google had a chance to swat the "bug", they had amassed over 30 million MAUs (monthly active users).&nbsp; They built the &#8220;everywhere&#8221; feature and generated millions of canonical URLs and destination pages like <em>/cheap-flights-to-london-from-prague</em>.&nbsp; The product was designed to facilitate a massive SEO funnel.&nbsp; The whole business and technology moat was built around being the best in the business at acquiring traffic through this channel.<br><br>This &#8220;free distribution&#8221; meant that with &#163;2.5M of early funding, Skyscanner was able to scale to profitability, taking 6 years to get almost to unicorn status before raising additional investment.&nbsp;</p><p><strong>Letgo's Unique Access to Capital: </strong>Letgo was founded by 2nd-time founders who had flipped a previous company for over $3BN; they had a track record and connections.&nbsp; This gave them access to lots and lots of money very quickly.&nbsp; This was in total contrast to Skyscanner. It let them own a different distribution channel.<br><br>TV in the US was typically reserved for established enterprise brands, out of reach of other startups trying to scale large P2P (peer-to-peer) networks.&nbsp; For a product where <a href="https://en.wikipedia.org/wiki/Metcalfe%27s_law">Metcalfe&#8217;s law</a> (a network&#8217;s value is proportional to the square of the number of people in it) applies, this was powerful.&nbsp; This meant that the 100th user was more valuable to the network than the first user, as the 100th user would likely be able to get involved in lots of different transactions, while the first users had limited options. <br><br>In that situation, lots of capital can give you a big edge, as you can theoretically kick-start your network and quickly get to the high-value, high-volume network.</p><p><strong>Travelperk&#8217;s Sales Machine:</strong> Born in Barcelona, an expat haven, there was a huge volume of smart, ambitious English speakers flowing through the city, who just wanted to get stuck into the tech industry.&nbsp; This opened up an ability to approach sales and implementation differently from someone with their cost base in London or San Francisco.
&nbsp; </p><p>A big intake funnel and a really meritocratic culture meant they built a very high-quality and uniquely cost-competitive sales team.&nbsp; It gave them the ability to distribute more effectively than many of their competitors.</p><h3><strong>Talent</strong></h3><p>In all three of these cases, the companies were HQed in non-traditional tech hubs (at the time).&nbsp; London, San Francisco, and New York all have much higher talent density, and the "second city" approach would seem to be a disadvantage.&nbsp; However, each company had a unique approach to finding great talent and exploiting those markets.</p><p><strong>Skyscanner&#8217;s graduate and internal talent pipeline: </strong>Skyscanner partnered deeply with the universities.&nbsp; Edinburgh has one of the best informatics schools in the world, and Scottish universities like St Andrews and Glasgow have exceptional engineering pedigree.&nbsp; Every year Skyscanner took cohorts, sometimes of 50+ grads, then whittled them down over time.&nbsp; This pipeline became uniquely critical.&nbsp; Today the CTO is a product of that pipeline, with 14 years at the company.
&nbsp; Many of their best and brightest are homegrown.&nbsp; Exceptional talent, very loyal...</p><p><strong>Letgo&#8217;s access to capital:</strong> Remember that capital thing&#8230;&nbsp; They paid 30-40% over market.&nbsp; <em>Done</em>.</p><p><strong>Travelperk&#8217;s culture:</strong> Given the often transient nature of the workforce in Barcelona, they invested heavily in building a culture that was magnetic for people who didn't have a natural family and roots in the city.&nbsp; They made a place where people could socialize and connect, and this appealed heavily to that demographic in a foreign city.&nbsp; Weekly end-of-weeks, Calcotadas, summer parties, winter parties, rooftop yoga, a rockstar office.&nbsp; It all helped them attract and retain talent.&nbsp;</p><h3><strong>The customer at all costs</strong></h3><p>All of these companies used different tactics to achieve superiority in the quality of their service and their product.  Long term, this was critical in retaining the customers that their distribution engines acquired for them.</p><p><strong>Skyscanner&#8217;s engineering advantage:</strong> With Skyscanner, it will not surprise you that Gareth believed that a dollar spent on the quality of your product, and thus your engineering, was the very best investment, short, medium, and long term, that you could make.&nbsp; That paying the "Google tax" was a waste of time.
&nbsp; That principle and constraint, in part, manifested in their pursuit of the SEO channel, but it also channeled resources to their product.&nbsp; 80%+ of the headcount was R&amp;D, and they threw everything at just making sure that their prices and their performance were the best in the world.</p><p><strong>Fake it till you make it: </strong>&nbsp;At Travelperk, behind the scenes, people spent years supplementing the technology until we had time to build the functionality.&nbsp; Offering value and connecting the dots while making a loss.&nbsp; Margin was an optimization; user satisfaction was something that, if you didn't solve it, would make optimization pointless.</p><p>There are other elements to building fantastic companies.&nbsp; However, I find it helpful to take some non-obvious perspectives to analyze other businesses. All too often they are dropping the ball on one of these. Doing these won&#8217;t mean you are a unicorn overnight, but it will stack the deck in your favor.</p>]]></content:encoded></item><item><title><![CDATA[A CTO's view on a defensible moat in the age of generative AI]]></title><description><![CDATA[Deconstructing the generative AI landscape to build a defensible moat]]></description><link>https://www.rossmcnairn.com/p/ctos-view-on-a-defensible-moat-in</link><guid isPermaLink="false">https://www.rossmcnairn.com/p/ctos-view-on-a-defensible-moat-in</guid><dc:creator><![CDATA[Ross McNairn]]></dc:creator><pubDate>Wed, 10 May 2023 08:37:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/7yYrfLAn2FM" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is a lot of talk about generative AI. You will see "<em>sizzle</em>" feature after "<em>sizzle</em>" feature as the world wakes up to the state of modern AI. The exploration of AI will manifest as a tsunami of experiments, some with long-term value, many without. This poses a difficult question for CTOs and executives: how to make long-term investments and ensure that they are building a unique and defensible moat?</p><p><strong>The Impact of Generative AI on TravelPerk</strong></p><p>At <a href="https://www.travelperk.com/">TravelPerk</a>, you will already feel the transformative impact of AI on our roadmap. From the semantic analysis of emails and "better than human" levels of triage and routing to ingesting complex invoices or helping our companies with their account setup, generative AI is flowing through our entire roadmap and creating beautifully magical touchpoints everywhere you look.&nbsp; We recently shared this implementation; you can easily see this is just the beginning.</p><div id="youtube2-7yYrfLAn2FM" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;7yYrfLAn2FM&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/7yYrfLAn2FM?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><strong>The Power of Combining AI Services</strong></p><p>The real power in this technology comes from seamlessly weaving multiple different models and services together, handing context between each of them as you use the AI best suited for whatever subtask you need. <br><br>Where is the long-term moat behind all this, and how should you start to think about it? Our engineering leadership found this overview from a16z a helpful framing.
I've made some modifications of my own to the diagram, demarcating the areas that are most likely to become rapidly commoditized and are currently undergoing the most unpredictable development and iteration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Mwy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Mwy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png 424w, https://substackcdn.com/image/fetch/$s_!4Mwy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png 848w, https://substackcdn.com/image/fetch/$s_!4Mwy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png 1272w, https://substackcdn.com/image/fetch/$s_!4Mwy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Mwy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png" width="783" height="893" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:893,&quot;width&quot;:783,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Mwy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png 424w, https://substackcdn.com/image/fetch/$s_!4Mwy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png 848w, https://substackcdn.com/image/fetch/$s_!4Mwy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png 1272w, https://substackcdn.com/image/fetch/$s_!4Mwy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9233781-31fe-4a91-bae7-8179ba8fe151_783x893.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><a href="https://a16z.com/2023/01/19/who-owns-the-generative-ai-platform/">original article here</a></em></p><p>We can break this diagram down into two broad buckets:</p><ol><li><p><strong>Apps:</strong><br>Most of what we see on our LinkedIn feeds sits in that top layer, the "apps", where you are gluing services and doing some basic fine-tuning on other models.<br><br>The overview from a16z is a little high-level.&nbsp; The distinction between an &#8216;end-to-end app&#8217; and an app we suspect will be more blurred.&nbsp; Proprietary end-to-end models will likely be a collection of hundreds of fine-tuned, optimized, and well-composed models (and indexes of large bodies of data) that are orchestrated to output results that on aggregate dramatically outperform the larger models. 
<br><br></p></li><li><p><strong>Open / Closed Source models:</strong><br>The quality of models, both closed and open source, is evolving at breakneck speed.&nbsp; In weeks, not years, sizeable changes are taking place.<br><br>New techniques <a href="https://bair.berkeley.edu/blog/2023/04/03/koala/">like LoRA</a> are making it really easy to build and host models that are really good at targeted tasks (and starting to catch up to ChatGPT in terms of performance), with pretty tiny data sets and a laptop.&nbsp; This is going to lead to an explosion of lots of niche models for very specific jobs.</p></li></ol><p><strong>Investing in data and positioning for a fast-moving external world</strong></p><p>How do you ensure you are going to transition through as an AI-native player with a moat and not simply build on shared foundations that will ultimately be available to everyone? From all this, we have started to build a few early conclusions on the direction of this space:</p><ol><li><p>The quality of models, both closed and open-source, is evolving at breakneck speed. In weeks, not years, sizable changes are taking place.&nbsp; <strong>Betting on a single provider is a mistake</strong>. When you add the risk of regulation and the unpredictable response from various governments, the issue compounds.</p></li><li><p><strong>Data quality, not GPU resources,</strong> will be the most accessible moat for most organizations. Building MLOps teams and pipelines that make it simple for anyone to train models that can outperform major models in targeted domains and use cases will be what the most successful organizations achieve.</p></li><li><p><strong>Composability is a moat.</strong>&nbsp; This giant 'decoupling' will mean that end-to-end apps of the future compose hundreds and hundreds of models, both in-house and externally hosted.
We, for example, foresee fairly extensive investment in our PII architecture and privacy around this technology, which will mean we want self-hosted models that help us with PII scrubbing and data governance before we leverage externally hosted infrastructure.&nbsp; Being intentional even at this stage about breaking that out is critical.</p></li></ol><p><strong>Practical Steps for Embracing AI in Your Architecture</strong></p><ol><li><p><strong>Embrace the unbundling:</strong> <br>Bake abstraction in. We are starting to leverage <a href="https://github.com/hwchase17/langchain">LangChain</a> and other abstraction layers that let the teams work above any single model.</p></li><li><p><strong>Remain model-agnostic: </strong><br>Operate under the assumption that within 24 months, there will be 10 alternatives to OpenAI, and your hand will be forced to use models based on their geography and regulatory adherence. Try to remove dependencies where possible.</p></li><li><p><strong>Focus on fewer, better data:</strong> <br>If you can get your telemetry right, you will start to see what steps in your chains are not performing as you would like. To tune proprietary models, large corpuses of data will be less useful than small, highly curated data sets, giving you outsized returns.</p></li><li><p><strong>Emphasize composability: </strong><br>Just as writing a book requires many different steps and phases, you will want a model to help you with the outline, a different one to perform a deep analysis on historically accurate character profiling, another for creative flair, and so on.</p></li></ol><p><strong>Conclusion</strong></p><p>In the age of generative AI, building a defensible moat for your business is crucial. By focusing on the right areas of investment, remaining model-agnostic, and leveraging the power of composability, your company can stay ahead of the competition and thrive in this rapidly evolving landscape.
By embracing these principles, you can ensure that your business remains at the forefront of AI innovation, creating unique and lasting value for your users.&nbsp;</p>]]></content:encoded></item><item><title><![CDATA[Reflections on OpenAI]]></title><description><![CDATA[A preview on the next wave of generative AI and some ethical and practical thoughts]]></description><link>https://www.rossmcnairn.com/p/reflections-on-openai</link><guid isPermaLink="false">https://www.rossmcnairn.com/p/reflections-on-openai</guid><dc:creator><![CDATA[Ross McNairn]]></dc:creator><pubDate>Mon, 08 May 2023 14:53:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vLOW!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeab127a-3758-4e8d-81ec-5ea0e5c0bd8a_512x512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I'm sitting in SFO airport after a day with OpenAI, reflecting on a fascinating developer enablement session the team hosted. This trip is one I won't soon forget. Yesterday, I attended a closed-door session for a group of CTOs at OpenAI's HQ in downtown San Francisco.
We watched presentations from engineers, product managers, and execs as we delved into their upcoming features and dev tooling. Most striking was GPT's exceptional aptitude for software engineering. Given just a few bullet points, it could troubleshoot complex bugs and draft hundreds of lines of code.</p><p>After sleeping on it and starting to process what I've seen, I've begun to unpack a few thoughts.</p><p>First, gentle anger. How is it that only two or three companies in the world have had access to this technology for years, ring-fencing it from the rest of us to drive advertising revenues? Leveraging it to draw our attention towards the highest bidder feels akin to having the cure to cancer but using it to make the nicotine in cigarettes more addictive. When my peers ask why public sentiment is hesitant to entertain a bailout of a bank branded "Silicon Valley," they need look no further than the damage we do to our own reputation with examples like this.</p><p>Second, a rising, nervous excitement. It's the kind you get before a very important meeting that might lead to something fantastic. Feeding an Excel sheet into GPT-4, it can build conclusions in seconds that would have taken a human analyst hours. GPT-4's superpower is reasoning&#8212;breaking questions into logical steps and connecting dots. Feed it a tax code, and it instantly weaves together the chaos into a sequence of bulletproof calculations. It's like the second half of your brain you've been missing since birth. I can see years of my training as a lawyer and an engineer vanishing before my eyes, and I'm delighted.</p><p>Third, I feel fear. OpenAI's original mission was to provide an ethics framework for this superpower. Quickly, this pivoted into a $10 billion investment from Microsoft and a pricing plan.</p><p>I think back to "Superintelligence," the Nick Bostrom book. 
He talks of inflection points, moments when AI moves into a realm of self-awareness with the means and momentum to expand its IQ beyond our control. Computers don't have the same limitations as flesh and bone. Biology has placed guardrails on us, but when neurons are a combination of silicon wafers and self-improving algorithms, evolution takes nanoseconds, and resources are nearly limitless.</p><p>The ethics issue bothers me for several reasons. AI has been a part of our lives for years, its power concentrated in a few targeted places. It builds our newsfeeds and feeds us a never-ending stream of meme bubblegum, handing the spotlight to those who can excite and hold our attention. Teams at big tech companies invest billions in "trust and safety," cleaning content and supposedly getting rid of the nasty stuff so we can enjoy hours of uninterrupted monkey-on-lawnmower content guilt-free! Yet, the public increasingly understands that the drive for profit pollutes algorithms, directly impacting our politics and mental health. If attempts to manage AI's second-order implications have failed so far, why would another collection of largely the same West Coast elites fare any better this time?</p><p>My second concern is rooted in contradictions around OpenAI's founding story and origins. It started as a not-for-profit, but that is clearly no longer the case. Perhaps CEO Sam Altman felt that without commercial firepower, it would be nearly impossible to bring the necessary cash to compete with big tech. Regardless, contradictions in their narrative make people nervous, especially when they've just joined forces with a tech giant that recently terminated its entire AI ethics team.</p><p>If I have one call to action, it's for those custodians of AI morality: open the rulebook to the world. There's a high chance you are authoring a constitution for a future era of human society&#8212;an era of abundance or Armageddon. 
Getting this wrong leaves no room for a do-over; there are no second drafts. Open the document for editing and let the world examine and challenge it. I commend you for breaking this into the open and forcing other players to open their platforms to the world. But if you're going to choose a path for humanity by injecting this technology into nearly every piece of software on Earth, then at least let us have a say.</p><p>Having arrived two days earlier, excited and curious about how best to leverage this new trend, I now leave with no doubt that this is the next supercycle of human evolution. Over the next 10 years, nearly every aspect of white-collar work will change, and the potential for humanity at this critical moment is mind-boggling.</p><p>With that, let me paste this into my GPT-4 preview playground. I'm sure I've made more than a few errors it can clean up for me before I take off.</p>]]></content:encoded></item></channel></rss>