A few years ago I built a dictation app intended to solve one of my biggest frustrations with other dictation applications: they didn’t handle punctuation. As a writer I dreamed about just speaking my stories aloud, but the cold hard reality was that until recently, every dictation application required you to end all of your sentences by saying “Period”, “Question mark” or “End quote.” This is about as natural as getting up from your keyboard and doing a jumping jack every time you want to add punctuation.
This led me to create an app that simplified adding punctuation by using a one-button input you released at the end of each sentence, plus some natural language processing to figure out if it was a question, a statement or dialog. It was only a stopgap until voice dictation got smart enough to add punctuation, which finally happened with applications like Google’s Live Transcribe and Otter.ai. However, neither of them understands when to put something in quote marks. When I decided to try the NaNoWriMo challenge (writing a 50,000-word novel in a month) in just 24 hours using dictation, I had to hack together my own little script to add quote marks and format dialog. It was tedious and worked well enough, but there was nothing elegant about it. I wished at the time there was a better way to do this through AI (I even tried to make a small ML model to do it). Enter OpenAI’s API.
One of the first things I tried to do with OpenAI’s API was write a script to simplify the hundreds of lines of code and the imported NLP library I was using to detect dialog and add appropriate punctuation. I was stunned by the first result.
I supplied the API with some examples of how to format text where some of it should be written as dialog:
Input: What's happening.
Output: What's happening?
"""
Input: The keys are on the table she said.
Output: "The keys are on the table," she said.
"""
Input: This is interesting.
Output: This is interesting.
"""
Input: Where are you going asked Amanda.
Output: "Where are you going?" asked Amanda.
"""
Without any adjustments, the API returned this:
Input: The kids went to the Halloween party Marnie replied.
Output: "The kids went to the Halloween party," Marnie replied.
It understood who each speaker was, where to put the quote marks and even whether it was a statement or a question. This is something I’ve yet to see any speech-to-text application do.
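Here is a rough Python sketch of how a prompt in that Input/Output format could be assembled from a list of example pairs. The function name `build_few_shot_prompt` and the `SEPARATOR` constant are my own illustrative choices, not part of the API; the sketch only builds the prompt text and leaves sending it to the API out of scope.

```python
SEPARATOR = '"""'  # delimiter between examples, as in the prompt above


def build_few_shot_prompt(examples, query):
    """Join (input, output) pairs into one prompt that ends on an open 'Output:'."""
    blocks = [f"Input: {src}\nOutput: {dst}" for src, dst in examples]
    # The final block leaves the output blank for the model to fill in.
    blocks.append(f"Input: {query}\nOutput:")
    return f"\n{SEPARATOR}\n".join(blocks)


examples = [
    ("What's happening.", "What's happening?"),
    ("The keys are on the table she said.", '"The keys are on the table," she said.'),
    ("This is interesting.", "This is interesting."),
    ("Where are you going asked Amanda.", '"Where are you going?" asked Amanda.'),
]

prompt = build_few_shot_prompt(
    examples, "The kids went to the Halloween party Marnie replied."
)
print(prompt)
```

Ending the prompt on a bare `Output:` is what invites the model to continue the pattern with the formatted sentence.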
Even more impressive…
Parsing dialog and questions from text is something you can do with a lot of code and rules for grammar. But what about something that’s a bit more subjective?
Paragraphs are extremely challenging, and grammar rules alone can’t tell you how to break text into distinct expressions. However, if I give the API a few examples of my own text with and without paragraph breaks, it’s pretty good at figuring out when to break up the text into separate thoughts.
We start with a context of a few examples of how I break my text into paragraphs:
Input: It wasn’t a hard rule, but it had some useful explanatory power. It remains to be seen if it’s just a theory that fits the available data. Julian called me after reading the article and encouraged me to do more research along those lines. I’d hesitate to call us friends. He lives his life in five-minute chunks, and you’re keenly aware that as soon as this conversation ends, he’s going to go to the next name on a very long list of people he talks to. I answer the phone with a slightly froggy voice. “Hey, Julian.”
Output: It wasn’t a hard rule, but it had some useful explanatory power. It remains to be seen if it’s just a theory that fits the available data. Julian called me after reading the article and encouraged me to do more research along those lines. I’d hesitate to call us friends. He lives his life in five-minute chunks, and you’re keenly aware that as soon as this conversation ends, he’s going to go to the next name on a very long list of people he talks to. I answer the phone with a slightly froggy voice. “Hey, Julian.”
"""
(I provide three more examples similar to the previous ones.)
This is just a “first shot” example of the API trying to figure out how to break my text into paragraphs. With more examples it gets even better, but as is, it’s a pretty impressive example of how the API can even pick up somewhat arbitrary rules:
Input: I race through the woods, weaving around the rusted junk heaps, and try to find my way toward the gap out of here. My foot hits a half-buried piece of metal, and I trip. There’s an icy pain as my elbow smashes into the side-view mirror of a Toyota Celica. When I pull my arm free, there are bits of glass in my skin and blood on the door. Damn it. I try to wipe the blood away with my sleeve, but all I do is smear it over the panel. I see an upstairs light turn on through the trees. Not good. She has to have heard me.
Output: I race through the woods, weaving around the rusted junk heaps, and try to find my way toward the gap out of here. My foot hits a half-buried piece of metal, and I trip. There’s an icy pain as my elbow smashes into the side-view mirror of a Toyota Celica. When I pull my arm free, there are bits of glass in my skin and blood on the door. Damn it. I try to wipe the blood away with my sleeve, but all I do is smear it over the panel. I see an upstairs light turn on through the trees. Not good. She has to have heard me.
At the risk of sounding redundant, this is huge. Paragraph structure varies from author to author, depending on how each one defines a paragraph: a complete thought, a single action, a moment in time, etc. The wonderful thing about the API is that I only have to provide it with a few examples of how I define a paragraph and it can quickly learn from them.
Keywords and formatting
Using OpenAI’s API reminds me of the first few years of having a smartphone. It took me a long time to realize that I had a camera, a voice recorder, a TV studio and all the electronic information in the world in my pocket. I’d see something and forget that I had a camera, or have a question and forget an answer was a second away. It took a long time to really understand the utility of that magical device.
As I play with OpenAI’s API this thought occurs to me a lot. For example, if I need to know the main character in a paragraph, I only have to write a few lines to show the API what I want.
Input: My sister interviewed the mayor of our town.
Main character: Sister
Action: Interviewing the mayor
"""
Input: Doug started loading crates onto the back of Tom's truck.
Main character: Doug
Action: Loading crates
"""
Input: Hazel, John and Amy surprised Beth with a birthday party.
Main character: Beth
Action: Surprise
Those examples will allow me to do this:
Input: Kelly called Bill over the intercom and told him she was igniting the rocket.
Main character: Kelly
Action: Igniting the rocket
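Once the API answers with labeled fields like these, a script still has to pull the values out. The following is a minimal Python sketch of one way to do that, assuming the completion uses exactly the labels shown in the examples; `parse_fields` is a hypothetical helper of my own, not something the API provides.

```python
import re


def parse_fields(completion, labels):
    """Split text such as 'Main character: Kelly Action: ...' into a dict keyed by label."""
    # Match any known label followed by a colon; longer labels are tried first.
    ordered = sorted(labels, key=len, reverse=True)
    pattern = re.compile(r"(" + "|".join(re.escape(l) for l in ordered) + r"):\s*")
    parts = pattern.split(completion.strip())
    # re.split with one capture group yields [prefix, label, value, label, value, ...]
    return {label: value.strip() for label, value in zip(parts[1::2], parts[2::2])}


completion = "Main character: Kelly Action: Igniting the rocket"
fields = parse_fields(completion, ["Main character", "Action"])
print(fields)
```

This works whether the fields arrive on one line or on separate lines, but it would be fooled by a value that itself contains one of the labels followed by a colon.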
If I want the API to give me a list of the actors and directors in a text passage, it can do that using a script like this:
Input: There's a new movie by Doug Liman starring Tom Cruise set in space.
Actors: Tom Cruise
Directors: Doug Liman
"""
Input: Quentin Tarantino, Brad Pitt and Leonardo DiCaprio have made several movies.
Actors: Brad Pitt, Leonardo DiCaprio
Directors: Quentin Tarantino
"""
Input: Tom Hanks likes to work a lot with Stephen Spielberg.
Actors: Tom Hanks
Director: Steven Spielberg
"""
Input: Christopher Nolan works with Michael Caine in several of his films.
Actors: Michael Caine
Director: Christopher Nolan
If I want the API to format an address, that’s fairly easy and it only takes one example:
Input: Bill Kane 23334 Wedge Street austin texas 78652
Output: Bill Kane 23334 Wedge Street Austin, TX 78652
"""
Input: jill thompson 2121 hollywood way burbank ca 91505
Output: Jill Thompson 2121 Hollywood Way Burbank, CA 91505
If the API is capable of picking up content and learning from a few examples, that means it can take all kinds of text and reformat it into something else. If I want to code a website, I can make my own shorthand and then give the API the text I want formatted into HTML.
Input: Text input form 2 placeholder email
Output: <input id="form2" type="text" placeholder="Email">
"""
Input: Title Interesting things
Output: <title> Interesting things </title>
"""
Input: Paragraph These are some of the interesting things you can do. Paragraph Like create code by speaking.
Output: <p> These are some of the interesting things you can do.</p> <p> Like create code by speaking. </p>
"""
All it takes is a few examples like the ones above for the API to understand what I want in a request like this:
Input: Text input form 3 placeholder name
Output: <input id="form3" type="text" placeholder="Name">
It only took me two minutes to create a simple natural-language-to-HTML interpreter using the API.
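To give a sense of how little glue code such an interpreter needs, here is a minimal Python sketch. It is an illustration under assumptions, not my actual script: `make_translator` is a hypothetical helper, and `complete` stands in for whatever function sends a prompt to the API and returns the completion text. A canned `fake_complete` is used so the sketch runs offline.

```python
SEPARATOR = '"""'  # delimiter between examples in the prompt


def make_translator(examples, complete):
    """Return a function that maps shorthand to output via a few-shot prompt.

    `complete` is a hypothetical callable (prompt text in, completion text out)
    standing in for whatever client is used to reach the API.
    """
    shots = "".join(
        f"Input: {src}\nOutput: {dst}\n{SEPARATOR}\n" for src, dst in examples
    )

    def translate(shorthand):
        prompt = f"{shots}Input: {shorthand}\nOutput:"
        raw = complete(prompt)
        # Keep only the first answer in case the model continues with more examples.
        return raw.split(SEPARATOR, 1)[0].strip()

    return translate


html_examples = [
    ("Text input form 2 placeholder email",
     '<input id="form2" type="text" placeholder="Email">'),
    ("Title Interesting things", "<title> Interesting things </title>"),
]


def fake_complete(prompt):
    # Stand-in for a real model call so the sketch runs offline.
    return ' <input id="form3" type="text" placeholder="Name">\n"""\nInput: Title More'


to_html = make_translator(html_examples, fake_complete)
print(to_html("Text input form 3 placeholder name"))
```

Cutting the completion at the first separator matters because a model that has learned the pattern may keep going and invent further Input/Output pairs of its own.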
Now let me really blow your mind…
Let’s ask the API to do something the examples didn’t cover:
Input: Div named main contents paragraph I'm inside the element
This is a completely new request, but it gives us this:
Output: <div id="main"> <p>I'm inside the element</p> </div>
All it took was a handful of examples for the API to recognize we were making things with HTML. Its ability to grasp context allowed the API to understand when I said “Div” and “contents” that I wanted a div element and that it should put a paragraph inside of there with the text I gave it. This is insane.
I try to think about the OpenAI API any time I want to change text from one form to another. This could even be changing the text of an idea into a more detailed paragraph or an unstructured sequence into something that makes sense. Its ability to pick up patterns and understand context continues to amaze me.