Lessons From Building an AI Storybook App

May 1, 2024

The idea for the StoryTime app came to me after using ChatGPT to generate stories for my two sons (ages 4 and 3). Rather than type an elaborate prompt every time, I threw together an interface that lets me pick an age group and a topic, then generate a story with a single click. As a secondary goal, I wanted to improve my prompt engineering skills and learn as much as I could about AI-related tools. This article covers some of what I learned.

Tools Used

The current iteration of the app is built on:

Next.js
Prisma ORM
OpenAI
Stability.ai
The Vercel AI SDK

Lessons Learned

No consistent characters

Dall-E and Stable Diffusion are great at generating a single, hi-res image based on a text prompt. But ask either of them to generate a red-speckled fish in a bowl and a red-speckled fish in the ocean, and you'll get two very different fish even with the same prompt. Getting consistent character imagery is very tricky with the current state of AI APIs, so for now I'm only generating a single image.

Varying rate limits and pricing

I was initially generating images with the OpenAI API (which uses Dall-E, their text-to-image model), and getting very nice cartoon-like images quite easily. I quickly ran into rate limits, though. After looking it up, I realized that it could only handle five images per minute (!), which is not ideal for an app that more than one person is using!

So I quickly switched to Stable Diffusion. Stability's rate limits are much more reasonable, and the fee is modest. In addition, their image generation capabilities are amazing.

Streaming JSON isn't perfect

Conversational chat interfaces on the web work by streaming content word by word to the front end. But what if you want to stream something more structured to your front end, like, say, a JSON object with the title of a story, an array of paragraphs, and so on? That's a bit trickier.

The OpenAI SDK does support JSON streaming, but at any given moment the partial object you have received might look like this:

{"title": "Sammy's Big Adventure", "content": "Once upon a time there was an elephant named Sammy

Notice anything funny? There's no ending quote or curly brace, making this invalid JSON. Any attempts to parse this on your front end will fail, causing an error (in fact, several in a row).

Luckily there are patterns for parsing broken JSON and making something useful out of it, like the one featured in best-effort-json-parser. This article by Mike Borozdin explains the approach in great detail. You basically have to attempt to add quotes and curly braces until the thing is valid. You'll also have to have fallbacks for when it simply can't be parsed, so that you don't lose the content that is already displaying to the user and get a nasty flicker effect.
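To make the idea concrete, here is a minimal sketch of the "add quotes and braces until it parses" approach, with a fallback to the last good value. It is not the implementation used by best-effort-json-parser or by the app itself. The Story shape and the function names closePartialJson and parseStreamedStory are made up for illustration, and the repair step only handles the simple cases (an unclosed string, unclosed braces or brackets, a trailing comma).

```typescript
// Hypothetical shape of the streamed story; field names follow the example above.
interface Story {
  title?: string;
  content?: string;
}

// Append whatever closing quote, brackets, and braces the truncated text is missing.
function closePartialJson(partial: string): string {
  const closers: string[] = []; // stack of pending "}" and "]"
  let inString = false;
  let escaped = false;

  for (let i = 0; i < partial.length; i++) {
    const ch = partial[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }

  let repaired = partial;
  if (inString) {
    // A dangling backslash would escape the quote we are about to add.
    if (escaped) repaired = repaired.slice(0, -1);
    repaired += '"';
  } else {
    // A trailing comma right before a closer is invalid JSON, so drop it.
    repaired = repaired.replace(/,\s*$/, "");
  }
  return repaired + closers.reverse().join("");
}

// Parse the stream so far; if it still cannot be parsed, keep the last good value
// so the UI does not lose what it is already showing.
function parseStreamedStory(partial: string, lastGood: Story | null): Story | null {
  try {
    return JSON.parse(closePartialJson(partial)) as Story;
  } catch {
    return lastGood;
  }
}

// Usage with the truncated chunk from the example above.
const chunk =
  '{"title": "Sammy\'s Big Adventure", "content": "Once upon a time there was an elephant named Sammy';
const story = parseStreamedStory(chunk, null);
console.log(story?.title);
```

Some partial chunks still can't be repaired this way (for example, one that ends halfway through a key name), which is why the second function hands back the previous result instead of throwing. On the front end you would call it on every update and pass in whatever you rendered last.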

Prompt Engineering May (Not) be Enough

So far, the content generated by the StoryTime app has been good, and in some cases more than good. But it could be a lot better. It remains to be seen how far the content can be improved by adjusting the prompts alone, or whether more invasive methods for improving quality (like Retrieval-Augmented Generation, or RAG) will be necessary. Stay tuned.