How Do You Make an AI Hallucinate? Like This…
Every year I try to take on some new tech, which can be anything from old news (bookbinding, one year) to trendy. This time I decided to fool around with the currently hot ‘Large Language Models’ (LLMs), known in the media as just ‘AI’ (artificial intelligence). I’d had an involvement with very early neural nets and related math over 30 years ago, so seeing how those then very limited capabilities had turned into a cause célèbre was interesting.
To get up to speed, I ran through a Great Courses class on general machine learning (taught by a prof who was a grad student at Bellcore when I knew his boss), then the lecture notes and supporting files for a Stanford undergrad course specifically on LLMs. Among other things I learned terms like ‘embedding’ and ‘hallucinations’, spicier than the dry math from back in the day. Thus prepared with updated jargon and a list of players, it was time to get my hands a bit dirty.
Training an LLM from scratch takes a giant pile of data and a monster server farm, or at least five figures to rent usage of the same. I was looking for something I could do on my rather aged, though once top-of-the-line, Intel-based Mac as a first try. I decided to attempt to replicate a project from the same 30+ years ago that created simulated tutor characters drawn from early American history. The application of LLMs to create non-player characters for fantasy, anime, and SF role-playing games and MMOs means there are a number of platforms that might be suitable. I settled on one called Faraday.dev, which lets you download an app to your own machine, where you can securely run various LLMs to power characters already defined on their site or built yourself – my interim goal.
I downloaded (from huggingface.co) a model called mistral.7b.synthia. Mistral is the company that built the LLM. 7b stands for 7 billion; that’s how many parameters there are in the ‘deep’ neural net, or transformer, that implements the LLM. That’s a small model – the well-known ones like GPT-4 or the now-infamous Google Gemini have hundreds of billions, but require a big server farm just to chat with you. ‘Synthia’ is the name of a ‘fine-tuned’ version of mistral.7b, specifically designed for extended conversations.
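For the curious, here is roughly what running a quantized 7B model on your own machine looks like if you skip the GUI and drive it from code. This is a minimal sketch using the open-source llama-cpp-python library rather than Faraday itself, and the model file name is only a placeholder for whatever quantized download you actually grab from huggingface.co:

```python
# Minimal local-inference sketch with llama-cpp-python (hypothetical file name).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-synthia.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,    # context window in tokens
    n_threads=8,   # CPU threads; no GPU required
)

out = llm(
    "You are a helpful assistant.\nUser: Who was Samuel Adams?\nAssistant:",
    max_tokens=128,
    temperature=0.7,
    stop=["User:"],
)
print(out["choices"][0]["text"])
```

A 4-bit quantized 7B model fits in a few gigabytes of RAM, which is why models this size are the usual choice for aging hardware.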
‘Fine-tuning’ is tweaking an existing LLM to better handle a specialized vocabulary or type of dialog. Rather than building an entire new model ($$$$$), you start with a working model and partially retrain it on a new sample of appropriate texts ($$$$). Or you can apply a technique called LoRA that reduces the amount of training needed even further ($$$). But it’s even cheaper to find that someone’s already done the tuning you require.
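To give a flavor of the LoRA idea: instead of updating all seven billion parameters, you bolt small low-rank ‘adapter’ matrices onto a handful of layers and train only those. The sketch below uses the Hugging Face peft library against the base Mistral model; it’s a generic illustration of the technique, not how the Synthia tuning was actually done, and the corpus and training loop are left out.

```python
# LoRA sketch with the Hugging Face peft library: only the small adapter
# matrices attached to the attention projections get trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_cfg = LoraConfig(
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the 7 billion
# ...then train on a period-appropriate text corpus with a standard Trainer.
```

The $$$ discount shows up in that last line: the trainable parameter count drops by a couple of orders of magnitude compared to full retraining.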
Anyone who’s dealt with early American English sources knows that vocabulary, dialog, and even syntax can vary from what can be scraped from the modern Internet. If I’m to build a credible historical character, I’ll eventually need a model fine-tuned for the period of interest. How to find such a model? Well, maybe I should use AI itself; it’s supposed to be helpful, and I can bootstrap…
I downloaded a generic help bot character, imaginatively called Faraday AI, as a front end to the Mistral 7B model, and away we went:
Faraday AI: Hello User. My name is Faraday AI, your AI Assistant. You can ask me anything and I’ll try my best to provide you with a helpful and coherent answer. How can I help you today?
User: Has anyone done a large language model for historical dialects of english?
Faraday AI: Certainly! There are several large language models that have been developed for historical dialects of English. One of the most well-known is the Old English Language Model, which was created by researchers at the University of Oxford. This model is based on texts written in Old English, the language spoken in England in the early Middle Ages. Another example is the Early Modern English Language Model, which was created by researchers at the University of Cambridge. This model is based on texts written in Early Modern English, the language spoken in England in the late 1500s and early 1600s. Additionally, there are several other language models that have been developed for historical dialects of English, such as the Middle English Language Model, the Renaissance English Language Model, and the Restoration English Language Model. If you have any specific questions about these language models, I would be happy to provide you with more information.
Not quite what I was looking for, but promising!
User: Are they any models for colonial american english?
Faraday AI: Yes, there are several language models that have been developed for Colonial American English. One of the most well-known is the Colonial English Language Model, which was created by researchers at the University of Texas at Austin. This model is based on texts written in Colonial English, the language spoken in America during the colonial period. Another example is the Early American English Language Model, which was created by researchers at the University of Michigan. This model is based on texts written in Early American English, the language spoken in America in the late 1700s and early 1800s. Additionally, there are several other language models that have been developed for Colonial American English, such as the Revolutionary War English Language Model and the Early Republic English Language Model. If you have any specific questions about these language models, I would be happy to provide you with more information.
Jackpot, just what I need! There’s just one small catch…
… and you may have guessed it from the title of the post: Not a single one of those projects actually exists.
On my first serious excursion, I’d hit the famous ‘hallucination’ problem. I might just call it ‘over-extrapolation’, but that’s not so catchy. LLMs in their current state aren’t exactly self-critical. If they don’t have verifiable info to answer your question, well, they’ll just make up something that sounds reasonable. Just what was going on became more visible when I poked it a bit further:
User: Tell me more about the Revolutionary War English Language Model. Who developed it and what platform is it built on?
Faraday AI: The Revolutionary War English Language Model was developed by researchers at the University of Pennsylvania. This model is based on texts written during the Revolutionary War period in America. The model uses a combination of machine learning techniques, including deep learning, to analyze the language and structure of the texts. The model has been built on the TensorFlow platform, which is a popular open-source library for machine learning….
User: Who are the researchers behind it?
Faraday AI: The Revolutionary War English Language Model was developed by a team of researchers from the University of Pennsylvania. The primary researchers involved in the project were Professor James P. Allison, Professor William J. Turkel, and Professor Daniel H. Weiss. Professor Allison is an expert in digital humanities and has been involved in a number of projects related to historical language analysis. Professor Turkel is an expert in digital history and has been involved in a number of projects related to historical language analysis and digital archives. Professor Weiss is an expert in digital humanities and has been involved in a number of projects related to historical language analysis and digital archives.
Now TensorFlow is in fact a deep learning and LLM platform, and the three named men are professors. But Weiss is an art historian at Johns Hopkins, Allison is a Nobel laureate immunologist at the University of Texas, and Turkel is at the University of Western Ontario with an eclectic agenda that at least includes computational history – though in his case that means the history of computation: he recreates vacuum tube computers, among other things.
In this case, Faraday/Mistral-7B has delivered a high-tech version of Mad Libs. The syntax is impeccable, and it’s figured out that computational linguistics, historical English, and research projects are the topic, but from there things fall apart. Lacking specifics – and, in this simple platform, any way to retrieve them from elsewhere – it grabs a few bits of information that appear to at least have the right types (platform, universities, research professors) and regurgitates them uncritically.
This might have worked out better on one of those hundred-billion-parameter models that could have captured some relevant specifics during training, but maybe not: I haven’t found any clear answers to the questions I posed in a fair amount of Web searching. There’s a basic hole in the current state of the art, which lacks the ability to throw up its hands and say “Beats me!”
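You can see why at the mechanical level: at every step the model just ranks candidate next tokens and emits one of the likely ones, and there is no separate ‘abstain’ path. Here is a toy sketch with the Hugging Face transformers library that peeks at those rankings; the model name is real, but the rest is purely illustrative (and in practice you’d use a smaller or quantized model to try this on a laptop):

```python
# Toy illustration: an LLM always has a ranked list of "most likely next
# tokens," whether or not it knows anything about the subject. Nothing in
# this process can say "Beats me!" unless those literal tokens score highest.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "The Revolutionary War English Language Model was developed by"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]      # scores for the next token
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode(int(i))!r}: {float(p):.3f}")  # plausible continuations, ranked
```

Whatever comes out of that ranking, the generation loop keeps going; confidence in the ordinary human sense never enters into it.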
Since the AI topic has shown up on Rico of late, I thought its denizens might be interested in a real-life adventure. I’ve since ignored, for the moment, the anachronistic-dialect problem and plowed ahead into building some characters. I have a small digital Samuel Adams fulminating against King George’s tyranny and predicting trouble if General Gage tries anything aggressive. The platform works a lot better when dealing with background knowledge that is known in advance. Soon I will imitate a mischievous schoolboy user and see how far I can lead ‘Sam’ into hallucinations. Fun guaranteed, as Elon might say.
Published in Science and Technology
I asked Leo what the lyrics to “Louie, Louie” are, but I can’t verify whether its answer is correct.
“Louie Louie” is a song made famous by the Kingsmen in 1963. The lyrics go like this: (Verse 1) “Louie, Louie, oh no, Me gotta go, me gotta go, Louie, Louie, oh baby, Me gotta go.”
(Chorus) “Then I’ll catch him, I’ll catch him, I’ll catch him by the leg, I’ll bring him, I’ll bring him, I’ll bring him to you.”
(Verse 2) “Louie, Louie, oh no, Me gotta go, me gotta go, Louie, Lougie, oh baby, Me gotta go, me gotta go, me gotta go.” The song’s lyrics are somewhat mumbled and hard to understand, which led to the FBI investigating the song in 1964 to determine if it contained any obscene content. The investigation concluded that the song was “unintelligible at the fast tempo” and that it was “unable to find any obscene or indecent words or phrases contained in the record.”
When it comes to AI, size matters.
This is something I never would have anticipated. I assumed that neural networks would saturate when the data reached the kinds of levels we’re seeing in the latest models. Saturation would lead to an inability to discriminate, to find nuance, and it would be obvious that we’re talking to a dumb machine.
Instead, and surprisingly to most people (and certainly to me), AI scales. The larger the matrix and the more training data provided, the better the results. I don’t know if the scaling is linear, but it’s certainly close.
You’re working with a network containing about 7,000,000,000 parameters. GPT-4, which reportedly cost about $100M to train, is estimated to have on the order of a trillion or more, though OpenAI hasn’t published the actual number.
It still makes things up, but it’s a lot harder to lead astray.
Unless I’m hallucinating, Henry Petroski pointed out that each new invention requires energy to create and operate it, adding to the whole energy requirement for civilization. My understanding is that large language models are sucking up a lot of power (and throwing off a lot of heat) to operate. I’m not sure how that energy requirement compares to, say, a trip from point A to point B by automobile, but it sounds as if our solar and wind farms won’t cut it for this next level of inventiveness.
BTW, I used ChatGPT 3.5 for Table Topics at Toastmasters yesterday. In Table Topics, the leader asks questions out of the blue and then one of the audience stands up to respond. I asked ChatGPT for 10 questions on the subject “Experiences: Old and New”, with each question having no more than 10 words. All of the questions were usable.
Fascinating post.
I wouldn’t trust it. The lyric is obviously “We gotta go.” It’s guaranteed by the last lyric, “Let’s go!”
In some cultures it would be so impolite to not give an answer to a query; you wouldn’t want to disappoint the questioner, so you give a plausible-sounding answer. It’s better than making him feel bad.
I hope this makes it to the main feed: I know an 8th-grade student who has been assigned to write an abstract using ChatGPT, and I’d like her to see this.
I have read that in the Middle East, if you need to ask for directions you should ask several people and hope that there is a consensus. It’s a sign of unmanliness or something to not know where places are, so a man is unlikely to admit it and will just make up directions. Whether this is true or a legend, I cannot verify.
Same pretty much everywhere.
“Open the pod bay doors, Clippy.”
“Sorry, Dave, I’m tripping.”
How useful is a galactic brain that makes things up when it doesn’t know the answer?
Stereotypical shame-based culture. Stereotypes exist because they’re reliable.
It depends what you’re trying to do with it.
If all you want it for is to produce entertaining content, scam elderly people out of their savings, undermine the reputation of your competitors, or swing political elections, then a smart machine that makes mistakes is just fine.
If you want to solve truly hard problems, sometimes it’s okay to have a machine that gets it wrong. Many hard problems are of a type that are difficult to solve but easy to check. For those cases there’s a high value on those occasions when the machine gets it right, and relatively little cost to the failures.
My first job was with a company that worked on military signal processing/recognition – things like identifying an aircraft based on signals in the radar return (oops, now I have to kill you). Eventually, we worked on discriminating between various tanks and trucks based on seismic signals (oops, again), sonar signal processing (oops again), recognition of hand-sent Morse code (including identifying the sender by their ‘fist’), and so on. Later, we worked on one of the first commercially available speech recognition systems. The best uses were in what were called ‘hands-busy’ applications. For example, we worked on a system for the Navy to be used by pilots, and I taught a paraplegic high school boy to use speech recognition to program.
I learned several things – mainly, that AI was basically stupid. It didn’t actually ‘know’ anything. The training set and features (for signal processing) were critical. The training took a lot of computer power. Back in the day, the radar ‘train’ took several days, and my first big break was due to my willingness to run the ‘train’ on the night shift. Eventually, I got tired of working on systems that, even if they got the answer, couldn’t ‘say’ how they did it. I got a degree in Control Theory and spent the rest of my career
Any time I hear the excitement over ‘AI’, I remember the trouble I went through to get one of our systems to stop misclassifying ‘sub’ and ‘destroyer’.
We used to have a saying about ‘Artificial Intelligence’ being for those without the real thing…..
There’s a substantial difference between a galactic brain making an error because of flawed programming, bad inputs or insufficient information; and a galactic brain making up “facts” supporting a conclusion.
Steve, I agree. But your question was about utility, and that’s all I was addressing.
We are probably getting too deep in the semantic weeds. A failed experiment can provide useful information. Like Edison’s quote about learning 999 ways not to make a light bulb. A solution confirmed by bogus references tells me nothing.
Steve,
I agree that this can get pretty weedy pretty fast. But I think there’s some interesting stuff here that’s worth kicking around.
Consider this situation.
A doctor asks the AI to offer a diagnosis of a patient’s condition, based on the data collected about the patient.
The AI does what modern AIs do: performs pretty sophisticated pattern-matching against a vast collection of billions or trillions of pieces of data, all encoded as numerical values in a huge web of interconnected virtual neurons.
The AI then states its conclusion regarding the patient’s malady.
So far, so good: the doctor has a useful tentative diagnosis based on more data than he’s ever likely to have reviewed himself, and a basis for further tests. The diagnosis might even be correct: AIs have shown a real skill at performing such diagnoses.
But then the doctor says to the AI: “Explain the reasons for that diagnosis.”
To understand what happens next, you have to understand something about modern AIs. They’re based on huge numerical models, networks of weights and biases and activation levels. There’s nothing “procedural” about it, nothing that can be walked through in a logical fashion. It’s all just a web of numbers.
The AI can’t explain its conclusion, other than to say “the weights in my neural network led me to the conclusions I stated.” But that isn’t a satisfactory answer, no better than saying “I said it because I think it.”
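To make that concrete, here is a toy illustration in plain NumPy, with made-up numbers that have nothing to do with any real medical AI: the entire “diagnosis” is a pile of matrix arithmetic, and no step in it corresponds to a reportable chain of reasoning.

```python
# Toy two-layer network: the whole "diagnosis" is just arithmetic on weights
# and activations. There is nothing here that could be read back out as a
# human-style explanation.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)   # layer 1 weights and biases
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)    # layer 2 weights and biases

patient = np.array([0.7, 1.2, -0.3, 0.9])         # made-up input features

hidden = np.maximum(0, patient @ W1 + b1)          # ReLU activations
scores = hidden @ W2 + b2
probs = np.exp(scores) / np.exp(scores).sum()      # one score per "diagnosis"

print(probs)   # a ranking of candidate diagnoses, not an explanation
```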
So the AI makes up a justification. It doesn’t think logically; it can’t refer to a train of reasoning.
It could have said “I can’t explain my reasoning.” But modern AIs don’t consider that a praise-worthy answer (that is, an answer likely to be “rewarded” and thus reinforced). They could be programmed to be more honest, but that would really be little more than telling them to answer all questions of the form “how do you know?” with “I can’t explain.”
Maybe that would be a good thing, but it isn’t how it’s done. Meanwhile, the doctor has a useful diagnosis, and perhaps a correct one.
The problem in all of that isn’t so much that the AI made stuff up, as that the AI was asked to provide something AIs can’t provide, and failed to make that clear. People should understand that AIs can’t answer the “how do you know” questions.