Wait, like it's hard?

The easy and hard parts of building AI applications.


A few weeks ago, I went to an AI infrastructure event. I heard talks about early-stage products that help AI assistants work more effectively. These covered ways to make them remember conversations more efficiently, systems that give them a consistent sense of identity, and tools that help large companies manage the deployments of AI systems. I attended as part of my self-education in AI. I wanted to understand what the frontier looked like, and to develop mental models for how AI would shift business, work, institutions, and the social fabric.

I debriefed afterwards with two friends. My overall takeaway? Damn, this stuff is really not that uniquely complex on a technical level. One of my friends is running an AI agent startup of his own, and he agreed with my sentiment. AI application development is actually shockingly simple compared to most other forms of software engineering, including fullstack engineering. But while these applications are technically simple to get off the ground, it’s hard to get AI models to behave consistently. My friend described his challenges with testing and constructing prompts that ensured his end users would get what they wanted.

High agency river flows.

Modern AI development often comes down to API calls to foundation models. In just a few lines of code, you can connect to OpenAI’s latest Large Language Model (LLM) and start chatting with it. Beyond that, you might need to organize the responses, store information for the AI to reference, and connect multiple AI outputs to solve complex problems. Behind the glitz and the glam of AI agent-based software are simple programming concepts - function calls, prompt templates and fancy for loops.
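Those building blocks can be sketched in a few lines of Python. This is a toy skeleton, not any particular framework: `call_llm` is a hypothetical stand-in for a real provider API call (e.g. to OpenAI or Anthropic), stubbed out so the sketch runs on its own.

```python
# A minimal sketch of an LLM application skeleton:
# a prompt template, a function call to a model, and a plain for loop.

PROMPT_TEMPLATE = "You are a helpful assistant. Answer briefly.\nQuestion: {question}"

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real API call to a foundation model.
    # Stubbed here so the sketch runs without credentials or network access.
    return f"(model response to: {prompt.splitlines()[-1]})"

def answer_questions(questions: list[str]) -> list[str]:
    answers = []
    for q in questions:  # the "fancy for loop"
        prompt = PROMPT_TEMPLATE.format(question=q)  # the prompt template
        answers.append(call_llm(prompt))             # the function call
    return answers

print(answer_questions(["What is an LLM?", "Why are agents simple?"]))
```

Swapping the stub for a real client call is roughly all it takes to turn this into a working application, which is exactly the point: the scaffolding is ordinary code.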

While there are deep technical ideas behind training, testing, and even deploying LLMs, most of that is hidden away by abstraction. Just as most developers can use cloud services without understanding data center cooling systems, today’s AI developers can leverage powerful models without needing to understand their underlying architecture or the intricacies of training on petabytes of data.

That said, AI development isn’t trivial just because LLMs are simple to invoke. High-quality LLM-based software has to dexterously integrate AI into stunning and intuitive workflows and interfaces, which requires deep domain knowledge and strong product skills to do well.

The hard thing about AI

There are two key hard things about AI development:

  1. Pace of change
  2. Psychology of LLMs

Pace of change

There is a flood of daily updates about AI. Use cases that were mere science fiction or cost-prohibitive just months ago become possible overnight. Moreover, it takes effort to filter the signal from the noise. The market is full of tools that are not quite user-friendly yet, or that don’t offer a meaningful improvement over existing workflows.

Sifting through the massive waves of updates, and those dreaded listicles (“10 AI Tools that Will Supercharge your Deal Flow” 🤮), requires staying grounded and slowly curating your information sources. Leaning on your friends helps as well. Just yesterday I hosted an AI tools exploration event where a design friend sifted through some of the latest AI-based design tools to see if the hype was well-founded. Turns out, it wasn’t - and that saved me time checking out how AI is transforming the design industry, at least for the next month or two.

Psychology of LLMs

A friend of mine made a fascinating observation: developing AI-enabled applications is closer to psychology than architecture. In contrast, when we build traditional software, we create precise blueprints that execute the same way every time.

The phrasing of an LLM prompt is make-or-break for whether the model will understand and fulfill our stated objectives. People building AI agents constantly fine-tune prompts to ensure their users receive on-topic, helpful, and personable responses. The LLM also needs to hold up against adversarial input, and as the context window grows with continued use, the space of possible inputs grows, as does the potential for out-of-distribution behavior. Moreover, when the underlying models are updated, the same prompts can yield significantly different outputs.

Tracking down variances in LLM outputs is challenging. Like humans, LLMs don’t respond to inputs in a deterministic fashion; they are stochastic by design, sampling each token from a probability distribution. No two runs of the same prompt are guaranteed to match. Even with sampling settings pinned down, parallel floating-point operations on GPUs can execute in different orders from run to run, and those tiny numerical differences compound into different results.
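This stochasticity is easy to see in miniature. The sketch below is a toy, not any provider's actual decoder: it samples a "next token" from a softmax distribution with a temperature knob. At temperature 1.0 repeated runs of the same input can disagree, while a very low temperature concentrates nearly all probability on the top token.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["yes", "no", "maybe"]
logits = [2.0, 1.5, 0.5]
rng = random.Random(0)

# At temperature 1.0, repeated runs of the "same prompt" can differ.
print([sample_token(tokens, logits, 1.0, rng) for _ in range(5)])
# At a very low temperature, the top token dominates almost every time.
print([sample_token(tokens, logits, 0.05, rng) for _ in range(5)])
```

Real decoders add further sources of variation on top of sampling, but the core point survives even in this toy: identical inputs, different outputs.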

To better see the art behind prompt construction, let’s look at an example. I got the following prompt from Anthropic - it’s one of the prompts that popped up from their starting suggestions.

Hi Claude! Could you write speech drafts? If you need more information from me, ask me 1-2 key questions right away. If you think I should upload any documents that would help you do a better job, let me know. You can use the tools you have access to — like Google Drive, web search, etc. — if they’ll help you better accomplish this task. Do not use analysis tool. Please keep your responses friendly, brief and conversational.

Please execute the task as soon as you can - an artifact would be great if it makes sense. If using an artifact, consider what kind of artifact (interactive, visual, checklist, etc.) might be most helpful for this specific task. Thanks for your help!

I find this prompt highly curious. Here are some of my observations:

  1. It starts and ends with friendly formalities.
  2. It has language that pushes for urgency and brevity: “ask me 1-2 key questions right away” or “keep your responses friendly, brief” or “Please execute the task as soon as you can.”
  3. It includes specifics about the possible space of actions that Claude can take, including the ability to invoke tools or create an artifact.

Prompts like this make clear that working with LLMs is not a hard science. Engineers won’t be able to rely on standard unit tests to verify correctness. A probabilistic process needs corresponding testing methodology. There are new companies, like Patronus, that seek to solve this kind of issue. Through iterative calls, Patronus is able to diagnose your prompt for common failure modes, ranging from hallucination to erroneous tool calling.
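One common shape for such a methodology (a generic sketch, not a claim about how Patronus works internally) is to run the same prompt many times and assert on a pass *rate* rather than on any single output. Here `run_model` is a hypothetical stand-in for a real LLM call that occasionally drifts off topic:

```python
import random

def run_model(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for a real LLM call: on topic most of
    # the time, but occasionally drifting, to mimic stochastic output.
    if rng.random() < 0.9:
        return "Here is a short speech draft about your topic."
    return "As an aside, let me tell you about something unrelated."

def pass_rate(check, prompt, n, rng):
    """Run the prompt n times; return the fraction of outputs passing check."""
    hits = sum(1 for _ in range(n) if check(run_model(prompt, rng)))
    return hits / n

rng = random.Random(42)
on_topic = lambda out: "speech draft" in out
rate = pass_rate(on_topic, "Write me a speech draft.", 200, rng)

# A probabilistic test asserts a threshold, not an exact string match.
assert rate >= 0.8, f"on-topic rate too low: {rate:.2f}"
print(f"on-topic rate: {rate:.2f}")
```

The unit-test mindset of "this input always yields this output" gives way to statistical acceptance criteria, which is exactly the shift in testing methodology the paragraph above describes.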

Final thoughts

AI is technically simple to invoke, yet requires psychological dexterity to deploy skillfully. In that way, AI is the ultimate genie in a bottle. With an unassuming brush against metal keys, a powerful LLM will be ready to answer your every question. Let’s just hope nothing gets lost in translation.

© 2025 Pavitthra Pandurangan
