Eva Dale 0:00 From the heart of the Ohio State University on the Oval, this is Voices of Excellence from the College of Arts and Sciences, with your host, David Staley. Voices focuses on the innovative work being done by faculty and staff in the College of Arts and Sciences at The Ohio State University. From departments as wide ranging as art, astronomy, chemistry and biochemistry, physics, emergent materials, mathematics and languages, among many others, the college always has something great happening. Join us to find out what's new now. David Staley 0:33 Joining me today over Zoom is Professor Michael White of the Department of Linguistics at The Ohio State University College of Arts and Sciences. His research interests have been primarily in natural language generation, paraphrasing, and spoken language dialogue systems. Before joining the faculty at Ohio State, he was a Research Fellow in the School of Informatics at the University of Edinburgh, and prior to that, worked for many years at CoGenTex, a small company dedicated to developing commercial natural language generation software. Welcome to Voices, Dr. White. Michael White 1:08 Thank you. Delighted to be here. David Staley 1:10 Well, I'm going to start by oversimplifying a bit, or maybe rather a lot here. So, are you looking to make a better Siri or a better Alexa? Michael White 1:21 That's funny, you asked that question as one of these A or B type of questions, where those are the only two options. So, in some sense, I would say the answer is neither: I'm not specifically working on Siri or Alexa, but in terms of those kinds of systems, definitely I would say I'm working, with lots of colleagues, on methods for building better conversational systems. So, if you think about systems like Siri and Alexa today, you know, everyone knows that they're pretty limited. I guess my favorite anecdote for that is a few years ago, I asked Siri to text my wife my location, and Siri said, here's the message for Amea, and the message says the words "my location" - like, not my actual location, just the words "my location". So I said, Siri, you're an idiot, and Siri said, I'm doing the best I can, James Bond - see, my son had told Siri that my name was James Bond - and that improved my mood a little bit, but it does indicate, you know, we have a ways to go with these systems. We would like them to be more like, you know, a concierge or a travel agent or any kind of professional that works with information in useful ways, not something that only occasionally does what you really want it to. David Staley 2:30 Well, let's dive into this, because I know that your research follows at least four different paths, and I'd like to explore some of these paths. So, one of these is about addressing the problem of inaccurate or misleading output texts. Tell us about this research, please. Michael White 2:48 Yeah, for this question, it's going to help to back up a little bit and talk about the role of big tech in my field. So, it's not just Apple and Amazon, it's also Google and Facebook, now Meta, and the many other companies increasing their presence in the field of natural language processing or computational linguistics, especially over the past ten years. And some of the most influential research has been happening in the big labs at Google or Facebook and other places, often in collaboration with academia.
But, there's been a big move towards investing heavily in new kinds of AI technologies that are very expensive to develop, and that's part of the reason why they've been so influential: these big labs have the resources to build some of these models. So, what's been particularly influential, and it's influenced my research, is this new kind of technology called a neural language model. So, language models have been around, you know, since the 80s at least, and have been used in practice in automatic speech recognition. Language model is a technical term that linguists don't really like, but we've come to accept it. It's a model that all it does is predict the next word, right? So, you give it a lot of text, and it learns; the oldest models were just counting word sequences - very simple statistical models - and they get some kind of idea of what words are likely to come next given the preceding words. And over time, these models have become much more powerful, and if you've heard the term deep learning or some of the other buzzwords in AI, these models have gotten much, much better at predicting the next word. They use more and more text, like billions and billions of words of text that they're trained on, and now these models are getting so good that, you know, people are starting to worry about them, right? So, you can ask them, you know, to predict the next word and the word after that and so on - kind of like the autocomplete you've seen when you're texting someone. If you essentially just do autocomplete over and over now, these models can carry on a conversation or write a story, and people have even worried about automatic production of fake news articles or misleading messages about elections or things like that. People wouldn't even be able to tell the difference anymore between what's done by some sort of neural language model versus what's done by a person. So there are real ethical concerns about these models now, but the advances in their ability to predict the next word are also pretty amazing, which would actually make them useful for all kinds of technologies, if you actually think about using them responsibly and creatively. So, the first strand of my research that you asked me to talk about was, in some sense, about controlling them to make them do what you want a bit more. So, the amazing thing is, if you're actually trying to get these to help you with this problem of natural language generation, or what linguists call production - deciding what to say and how to say it - in some sense, you can train these models to help have these conversations, but it's amazing how they will mess up even the simplest things. So, to give you an example, there was a popular data set that's been used for a lot of research in the restaurant recommendation domain, and effectively, what you would do is you would give these models as input, you know, some facts or attributes about some restaurant that you want the system to convey, and one way to do it is you actually give each fact in a simple sentence by itself, and then you ask the model, well, how would you actually say that fluently and naturally? So, for example, if you told the model, I want you to say the following: "The Wrestlers is rated five out of five stars. The Wrestlers is family friendly", right?
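To make that input format concrete, here is a minimal Python sketch of the one-fact-per-sentence input a generation model would be asked to rewrite, in the spirit of the restaurant data set described here. The attribute names and templates are illustrative assumptions, not the actual data set's schema.

```python
# A minimal sketch of the input side of the restaurant-description task:
# each fact about the venue is rendered as its own short sentence, and the
# generator is then asked to rewrite them fluently. Attribute names and
# templates here are illustrative, not the real data set's schema.

FACT_TEMPLATES = {
    "rating":          "{name} is rated {value} stars.",
    "family_friendly": "{name} is family friendly.",
    "food_type":       "{name} is a {value}.",
}

def facts_to_input(name, facts):
    """Render a dict of attributes as one simple sentence per fact."""
    sentences = []
    for attr, value in facts.items():
        sentences.append(FACT_TEMPLATES[attr].format(name=name, value=value))
    return " ".join(sentences)

# The two facts from the example in the conversation:
print(facts_to_input("The Wrestlers",
                     {"rating": "5 out of 5", "family_friendly": True}))
# -> "The Wrestlers is rated 5 out of 5 stars. The Wrestlers is family friendly."
```

A trained generator would take a string like this as input and be asked to produce one fluent, natural-sounding sentence as output.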
And you can see, if you put a few more facts together, it's gonna be very repetitive and robotic sounding, so no one would really want to have a conversation like that. But at least it's accurate, it's conveying the information accurately. So these models are actually really good at rewriting that into something that's much more natural and conversational. So, for the example I gave, it might produce something like, "The Wrestlers is a five star, family friendly sushi bar." But the only problem is, you didn't tell it that it was a sushi bar. It kind of made that part up, right? And you can't tell that it made that part up, if you're the listener in that conversation; other than the fact that "The Wrestlers" doesn't really sound like a sushi bar, you would have had no idea that the system just completely misled you about this restaurant that you're looking for. You know, to some extent, we can figure out what's going on there. So, in this data set, it's usually more than just two facts that you're conveying. So, in some sense, the model kind of felt compelled to add something - it kind of felt, that's boring, let's add a third fact, we'll give it the restaurant type of sushi bar. We want these models to accurately convey the information that they're supposed to convey, so, you know, one thing you can do is just throw in a generic word like "venue": if you don't have specific information about what type of restaurant it is, you still have to give some kind of noun at the end of the sentence if you've given it a few adjectives. So part of my research is oriented around, how can we get these models to do what we want and take advantage of what they've learned about language in all of this pre-training, as it's technically called. In learning to predict the next word, these models actually learn a lot about language, but we want to take advantage of that without having them do crazy things like make up stuff we didn't want them to say. David Staley 8:02 You talk about these algorithms learning to predict the next word. Is that what I do, I mean, is that what's happening right now? Am I trying to anticipate your next word? Michael White 8:13 Well, absolutely. There's some really interesting research coming out of psycholinguistics and even neurolinguistics, looking at next word prediction as a fundamental aspect of language. So, it looks like even down at the neural level, there are neural circuits that encode meaning as kind of a difference between what was predicted and what was actually found. So, taking advantage of prediction seems to be a very efficient way to manipulate and store information, and in some sense, these models are inspired by that. For the most part, though, these models are really just inspired by engineering concerns, so the engineers, in some sense, have gotten way out in front of the science. So all of these neural language models are very loosely inspired by what we think goes on in the brain, and to some extent, when they actually match what's going on in the brain, that's almost by accident. But next word prediction does seem to be a fundamental aspect of human language communication. That's not to say that our conversations are really just like picking the next word in autocomplete; we usually actually have something we really want to say, we're not just stringing together what sounds natural. David Staley 9:20 Tell us about the strand of your research that's focused on data efficient modeling. Michael White 9:27 Okay, great.
So, that strand of research is one of the successes that I'm particularly happy about in terms of actually using these neural language models for something practical, right? There's been a lot of popular press about these neural language models recently, both about the concerns and about, you know, how they can also tell amazing stories and so on. It's really unclear how useful that is, whereas, if you're actually trying to use these systems to do practical things, like a better Siri, or, you know, other kinds of things we've been working on here at Ohio State, you have to come up with techniques that really take advantage of what these models do, but still control them adequately. So, in terms of taking advantage of these models, where we've really seen a payoff is in terms of data efficiency, and what that means is: the dream behind these systems is, if we want to do something useful, we just want to give the system some examples of inputs and outputs that illustrate what we want it to do, right? You know, if we say, here are some examples of restaurant facts you want to convey, or here are some examples of a whole bunch of weather attributes, and we want you to give us a weather summary for the weather in the next week and so on, right? So, you give it examples of the inputs the system is going to see, and then examples of the outputs you'd like it to produce. And you know, the field has seen a lot of progress in making systems where all you really have to do is give the system these examples of input and output, and it will learn how to generalize from that limited data to any kind of data that fits the mold. So, you know, regardless of what the weather is, it's going to convey that weather information correctly in a natural and fluid way, right? So it's not just memorizing the inputs and the outputs, it's generalizing from them appropriately. So data efficiency means, well, how many examples do we have to give it to train the system effectively before, you know, it actually generalizes like we want it to, and doesn't just make stuff up, like saying something's a sushi bar when it's really not. And, you know, in the early days - I should say the early days for these models, which means only a few years ago - even with 50,000 examples in this very simple restaurant description domain, models would still make a lot of mistakes, even with 50,000 examples of inputs and outputs. And you know, people who'd been working in the field for a long time thought this was kind of crazy, because the traditional way of building systems, where you take advantage of linguistic knowledge and kind of build in a lot of facts about language through what's called knowledge engineering, or linguistic engineering - those systems would reliably convey the information, but they were expensive to build and required a lot of expert knowledge to build, whereas, with these new systems, the idea was, well, if you just have to give it these inputs and outputs, you know, you can use lower cost methods like what's known as crowdsourcing to get the data. So, you can ask people on platforms like Mechanical Turk to help collect examples of how you would say things naturally, for example. But it seemed kind of sad that even with 50,000 examples, these systems still weren't fully reliable on this task, and what we've been able to show over the past few years is we can get that down from the tens of thousands into the hundreds.
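A minimal sketch of what training from input/output pairs can look like, assuming the Hugging Face transformers library and a small pretrained T5 checkpoint. This is one common way to set up the kind of fine-tuning described here, not necessarily the exact setup used in this research; the training pairs are made up for illustration.

```python
# Fine-tuning a pretrained sequence-to-sequence model on a small set of
# (input facts, output text) pairs, in the spirit of the data-efficiency
# discussion above. Illustrative only; assumes `torch` and `transformers`.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# A few hundred pairs like this would stand in for the training data.
pairs = [
    ("The Wrestlers is rated 5 out of 5 stars. The Wrestlers is family friendly.",
     "The Wrestlers is a highly rated, family friendly venue."),
    # ... more (input, output) examples ...
]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for epoch in range(10):
    for source, target in pairs:
        inputs = tokenizer(source, return_tensors="pt")
        labels = tokenizer(target, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # standard cross-entropy loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# After fine-tuning, new inputs can be rewritten with model.generate(...)
```

Because the model has already seen billions of words in pre-training, the fine-tuning step only has to teach it the task format, which is what makes a few hundred examples plausible at all.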
So, for realistic systems, with just a few hundred examples, we can now get them to perform quite reliably, and that's been in part through taking advantage of these neural language models, these big models that big tech has been working on. So there, we've actually been able to harness them to do something useful, you know, rather than just talk abstractly about things they might do or harmful things that, you know, they could be doing. Does that make sense? David Staley 13:03 It does, yes. And in fact, speaking about useful and thinking about applications, I know that you are doing work on a virtual patient dialogue system, and so I'd like to hear more about this research. Michael White 13:14 Yeah, so the virtual patient dialogue system is something I've been working on for at least eight or nine years now, with colleagues in computer science as well as linguistics and in the medical school. One thing we've been able to show recently, which is quite pleasing, is that this virtual patient dialogue system, at least for practice purposes, works as well as actual humans do in training early stage medical students. So, if we back up for a second: medical students in their first and second year, before they're allowed to see actual patients, will get some practice with what are called standardized patients, which are basically actors pretending to be patients. So, they'll be told what their symptoms are supposed to be, and so on, and they'll come in and have a conversation. And for grading purposes, the standardized patients - the human actors - are still used, but for practice purposes, to give early stage medical students some opportunity to practice taking a history or other kinds of conversational skills, they can now interact with this virtual patient system we developed, which has a whole avatar: someone who looks like a patient sitting in what looks like a doctor's office and so on. And the medical student can come in and say hello and start asking questions and take a complete history and come up with a differential diagnosis and so on, and you know, we've been able to show that it works. It's far from perfect, still, even after all these years, but it works well enough that it serves its purpose in terms of providing the same kind of feedback they'd get from actual physicians grading the interaction. David Staley 13:33 Does it pass the Turing test? Michael White 14:47 No, absolutely not. So, it's funny... the Turing test owes its name to one of the founders of computer science, Alan Turing, whom people may know now from "The Imitation Game", and he had this idea that, well, how are we ever going to tell if a system is intelligent, right? And one way might be, well, have a person carry on a conversation with a computer that's in another room, so they can't see who they're talking to; if the person actually can't tell from the contents of the conversation whether they're talking to a computer or a human, maybe at that point, it makes sense to call the system intelligent. So, that was kind of a thought experiment about what it would mean for a system to be intelligent. You know, one of the questions these days is, does it make sense for systems like Siri and Alexa or a virtual patient or any of these systems to pretend to be people, right, or to try to fool people into thinking they're talking to a person.
And the intention of the system is not to be one that would actually fool anyone into thinking they were talking to an actual patient; the idea is for it to be a realistic enough simulation that it's useful practice. So, one thing we found was, in the early days of the system, it would only get about half of the questions right - it would give a misleading or incorrect answer to nearly half the questions - and that caused a level of frustration that was a little bit too high. These days, it's getting closer to ninety, you know, between 80 and 90% correct. It varies a lot by how experienced the user is, so the earlier stage medical students are a bit more all over the place than later stage students - less focused in their questioning - which makes it more difficult. But, the system still will give wrong answers often enough that it's a little jarring. You would not be fooled for long as to whether you're talking to an actual person. And it's not quite photorealistic anyway, in terms of what the avatar looks like, but the interaction is now realistic enough that it does provide useful practice, even though it continues to get things wrong. David Staley 16:47 And you say that sometimes these... frequently, it sounds like these systems get things wrong. In the restaurant example you were giving, did I understand you to say that the AI was sort of adding or making up that it was a sushi restaurant? Michael White 17:02 Right. David Staley 17:03 Where does that come from? Where does the AI... I mean, unless it's programmed as such, how is it making up this information? Michael White 17:11 Yeah, that's a great question, thanks. If you're talking about a system that's trying to understand what a person is saying, right - so like in the virtual patient system, there are questions that it's expecting from users, but there's an infinite number of ways people might ask those specific questions. We don't just want them to choose from a list, like, which of these 300 questions do you want to ask next; they have to come up with it on their own, say it in their own words. So, whenever you're dealing with interpreting what someone is saying, it's an inherently very difficult problem. When you study linguistics, that's the thing that you're constantly amazed by: the richness and complexity of human language. So that is an inherently difficult task, when you're trying to make your best guess at what someone means, right? But in the other direction, where, you know, say the system has already understood a simple request, like, tell me about this restaurant, or what's the weather going to be this weekend, right? Then, it's taking some information that it knows, like, say, information about the restaurant or the weather this weekend, and it actually knows what it's supposed to say, right? So there's no uncertainty there, like there is when people are asking a question, for example, and the fact that it would get it wrong is kind of amazing, right? David Staley 18:18 Yeah. Michael White 18:18 Yeah, you know, why is it getting it wrong? So, think about that next word prediction task again, like on your phone: you're typing away, and it's suggesting words you might say next, right? So, that's what, for the most part, these models have been trained to do - figure out what are likely words to say next, given the words you've been saying so far. But when we want to use them for some useful task, what we're actually doing is saying, okay, given some input, express this input in a natural way, right?
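The "autocomplete over and over" idea behind "given some input, express it" can be made concrete with a minimal greedy decoding sketch, again assuming a Hugging Face seq2seq model. A raw pretrained checkpoint without the kind of fine-tuning sketched earlier would not produce a faithful restaurant description; the point here is only the loop of repeatedly predicting the most likely next word.

```python
# Greedy decoding: condition on the input, then repeatedly append whichever
# word the model thinks is most likely to come next, until an end marker.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.eval()

source = "The Wrestlers is rated 5 out of 5 stars. The Wrestlers is family friendly."
encoder_inputs = tokenizer(source, return_tensors="pt")

decoded = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    for _ in range(40):  # cap the output length
        logits = model(**encoder_inputs, decoder_input_ids=decoded).logits
        next_token = logits[0, -1].argmax().unsqueeze(0).unsqueeze(0)
        decoded = torch.cat([decoded, next_token], dim=1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(decoded[0], skip_special_tokens=True))
```

Real systems typically use smarter search than pure greedy decoding, but the basic picture is the same: fluency comes from next-word prediction, and faithfulness to the input has to be learned or enforced on top of it.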
And that could be all kinds of information, and these models are also used for, say, summarization: you've got a whole big, long text you want to summarize into fewer words and so on. And there it has to learn a somewhat more complex task: given this input - take this weather information, say - convey it concisely. It has to figure out what's a natural next word given that it's trying to express this information, and that's the part of the task that it hasn't learned completely when it's making mistakes. And in particular, what it has to balance effectively is, you know, given the words I've said so far, what are some natural next words? So, it may have to fill in some grammatical words - some function words, some words that express tense or definiteness, or things like that - given the input. So, not every single word it should use is explicit in the input: some of the words it needs to fill in based on its grammatical knowledge, say, but the actual content it needs to express based on the information that it's been given. So finding that balance between the words it needs to add versus the words it needs to use to express the information it's supposed to express: that's the real tension there. So, if you think of next word prediction as just picking a next word while paying no attention to what you intended to say, that's where it's getting lost. It kind of goes too far in picking its next word, because it sounds nice, forgetting what it meant to say to begin with - maybe that analogy helps. Eva Dale 19:13 Did you know that 23 programs in the Ohio State University College of Arts and Sciences are nationally ranked as top 25 programs, with more than ten of them in the top ten? That's why we say the College of Arts and Sciences is the intellectual and academic core of the Ohio State University. Learn more about the college at artsandsciences.osu.edu. David Staley 20:48 Well, I know that AI is mastering chess and "Go" - not just simply mastering it, but these systems are orders of magnitude better at playing "Go" and chess and these sorts of things. What we're talking about here is mastering language, or maybe mastering conversation: is that a harder problem than, say, "Go" or chess or something like that? Michael White 21:10 Well, certainly. You know, these systems that do amazingly well at these games are still given one thing: the rules of the game, right? So, it's been pretty amazing, the progress that's been made with those systems through what is just characterized as self play: you have the system playing against itself in chess, and it gradually learns what are effective moves - like, okay, that was a winning sequence of moves, let me try something like that again. And you know, the amazing thing is, there's some evidence that it actually learns some kind of abstract representation of the game, like what's a strong position in the center, or something like that. It's hard to actually probe what's going on in these models, but at least with the language technology we're familiar with, there's been some progress towards actually understanding these complex models and figuring out what it is that they're learning, and there's some evidence that just by learning to predict the next word, a model is learning fairly abstract things, like, you know, what's the subject of a sentence and does the verb need to agree with the subject? It seems to pick up on some of these things without being explicitly taught.
So, it's the same kind of thing with chess: it's learning strategies just through self play without being explicitly taught. But the difference is that chess is a very rigid, well spelled out, well formalized game where the rules are told to the system in advance. No one's built a system yet where it also has to learn the rules as it goes, so it hasn't completely mastered what people can do yet. But language is much more open ended; we can talk about anything in language. But I think there is a relevant connection I'd like to make there, that part of how we've made so much progress on data efficiency is exactly through, in some sense, getting the system to teach itself how to do the task. And you know, there, it's not so much about winning strategies, because there's not necessarily a notion of winning, but there is a notion of success, which comes from when the model can accurately convey the information. So, here, the trick is that you need to model both sides of the conversation - just like when the chess system is learning through self play, it's actually playing against itself, two different versions of itself. The same kind of technique can be effective where you have, you know, one model that's trying to figure out what to say next, and then another model that's trying to understand what the first model just said, right? And you're effectively rewarding the system if you're able to close that loop - if the model that's trying to understand what the first model said came to the same understanding that the original model started with, right? So the first model is supposed to take some information and convey it in natural language, and the second model then is supposed to take that natural language and figure out what information the first model is trying to convey; if that information matches - if it successfully understands what the first model said - then that's good, and you can use that kind of round trip consistency as a way to drive the system's learning in a technique we call self training. So, it's essentially teaching itself, and that technique has been very effective in helping to promote the effective use of these models and this kind of data efficiency. We only need to give it a few hundred examples, and it can generalize like we want it to and behave, essentially. David Staley 24:26 Early on in this conversation, you had mentioned people that maybe are worried about some of this research, or worried about what it could lead to. Should we be worried? Michael White 24:35 Yeah, I think the question of whether we should be worried about these models is a fascinating one, and you know, one thing I might point to: quite famously, a few years ago, people like Elon Musk, who's been in the news a lot recently, but even Stephen Hawking and others, talked about how we should be more afraid of AI than we should be of nuclear war, for example. And those kinds of concerns are really overblown. It's really unclear that we're anywhere near "The Terminator", where these systems are just going to go off on their own. There are many more mundane concerns, and some of my academic colleagues, including here at Ohio State, are looking into these ethics issues with these models.
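Returning briefly to the round trip consistency idea described a little earlier, here is a toy sketch of the check at its core: a generator turns facts into text, a second "listener" model parses the text back into facts, and the output only counts as a success when the round trip recovers what the generator started with. The two model functions below are placeholders, not real trained models.

```python
# Toy round-trip consistency check behind self training (illustrative only).

def generate_text(facts):
    """Stand-in for the generation model: facts -> natural language."""
    return "The Wrestlers is a five star, family friendly venue."

def parse_text(text):
    """Stand-in for the understanding model: natural language -> facts."""
    return {"name": "The Wrestlers", "rating": "5 out of 5", "family_friendly": True}

def round_trip_consistent(facts):
    text = generate_text(facts)
    recovered = parse_text(text)
    return recovered == facts, text

facts = {"name": "The Wrestlers", "rating": "5 out of 5", "family_friendly": True}
ok, text = round_trip_consistent(facts)
if ok:
    # In self training, consistent (facts, text) pairs like this one can be
    # added back into the training data, so the models improve with very
    # little hand-labeled data.
    print("keep:", text)
else:
    print("discard:", text)
```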
And unfortunately, what we're finding is that these models are trained on lots and lots of data just kind of scraped off the internet, including, you know, some Reddit sub forums that don't exactly model the language we'd like our kids to use, right? So, you know, there are all kinds of examples of toxic language and language bias that these models are picking up on, and unfortunately, if you don't try to constrain these models, they all too easily degenerate into saying awful things, right? So, one issue is that these models can not only exhibit toxic language or bias, there's some evidence they exacerbate that bias, so they can take actual problems with human language use and make them worse - they can be even more susceptible to the kinds of things we'd like to avoid than people are. So, that's one concern. And then, of course, there's always the nefarious uses, right? So, people who want to automate misinformation on social media, for example - these kinds of models are going to make those kinds of nefarious uses just much easier, so there's always concerns about that. People have heard of deep fakes in terms of images that you can't tell whether they're real or not; there's the same kind of issue with, you know, stories that someone might be telling. It's so convincing that you feel this could have only been written by a person, so in that sense, maybe it would pass the Turing test. But that's true with pretty much any technology that gets developed - technology can be used for good or for harm - so that's always a concern. David Staley 26:49 Tell us what's next for your research. Michael White 26:51 So, we come back around to a better Siri and Alexa. So it's clear that today's conversational systems are very limited, but they still can do useful things. In the future, we're moving towards having them do even more useful things, right? And this will, in part, require more sophisticated conversational capabilities. So, one brand new project I'm excited about: recently, we got some seed money from Ohio State to develop a system that's intended to help with patient prep. A new colleague in the med school, Dr. Subhankar Chakraborty, came to us in the virtual patient group and was asking us about some new potential applications, and one thing we all got excited about was the potential for conversational systems to help patients get ready for complex procedures. So, you know, when you get to a certain age you need to have a colonoscopy. I don't know if you've gotten to that age yet, but the preparation for it is - David Staley 27:49 I'm afraid I am, yes. Michael White 27:49 Well, you know, the preparation is not the most fun thing, but unfortunately, if you don't do it properly, then you can show up for the appointment and end up having to cancel it because your preparation was incomplete. And unfortunately, you know, sometimes people will get an information sheet and they think they followed the instructions, but they misunderstood something, or they forgot a step, or something like this. So, this causes immense frustration on the part of patients, and substantial loss of money to the providers, you know, including at the Wexner Medical Center. And it can help if people call in and talk to a nurse or something, but people think, well, maybe I don't have to do that.
If there were a system that was available 24/7 to answer your questions or to provide you a reminder for that next step, then you can imagine a conversational assistant that's always there to help talk you through things substantially improving the outcomes. But to do that, it's going to have to have some more sophisticated capabilities than our systems to date. So it's going to need to keep track of what's already been discussed, you know, what you actually need to be reminded to do next - aspects of the earlier conversation, or even earlier conversations - and that requires a level of context sensitivity that we haven't even tried to achieve in the systems to date: memory of where we are on a task and what's been discussed so far. This is the weak point of systems right now; systems like Siri and Alexa are very much designed to take a command and execute it and be done with it. They're not really designed to have extended conversations. So the need to have more extended conversations will really drive, you know, some interesting research on better technology to support that. And that intersects with many interesting issues in linguistics: so much of what's interesting about language is how words can mean somewhat different things in context, and how we can use words effectively in different contexts. Yeah, and then one other thing I'd like to mention that relates to this issue of reliably using these new kinds of technologies. I have a new project getting going with colleagues in computer science, where we're working on, in some sense, democratizing access to information. So, you know, these days for data science, but more generally, for anyone who wants to get access to the wealth of information that's stored in structured form - things that are in databases, say - it effectively takes a computer science degree to get at that information, if you have some complex request that you want to ask of some system that you can't just Google, right? You want to know, you know, how many students were enrolled each year, over the past ten years, in the following history courses, or something like that, right? So, you know, Ohio State has that information; I bet you couldn't get access to that information easily. What we would like to do is support people asking those questions in natural language and actually getting an answer that they trust. And you know, there's been research on this for over 30 years, and systems get better and better, but they're still not reliable, right? So they make their best guess at what someone is saying, and they take that and convert it into a database query language - a specialized programming language for extracting information out of databases - and then they'll just give you an answer, but you don't have any confidence that the system really understood what you said, and if it got it wrong, you don't have any way to correct it. So, there's a need to be more transparent about what the system has done.
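An illustration of the kind of translation and transparency being described: a natural language question, one plausible database query a semantic parser might produce for it, and the plain-language steps a system could show the user for confirmation before running anything. The table, column names, course numbers, and SQL here are all made up for the example.

```python
# Hypothetical example of natural language -> database query, plus the
# step-by-step plan a more transparent system could show for correction.

question = ("How many students were enrolled each year, over the past ten "
            "years, in the following history courses?")

# One plausible query over a made-up enrollment table:
sql = """
SELECT year, course, COUNT(student_id) AS enrolled
FROM enrollments
WHERE department = 'History'
  AND course IN ('HIST 1211', 'HIST 1212')
  AND year >= 2013
GROUP BY year, course
ORDER BY year;
"""

# What the user would see and confirm, so a wrong step (say, the year range
# or the course list) can be corrected before the query is executed:
plan = [
    "1. Look only at History enrollments.",
    "2. Keep only the courses HIST 1211 and HIST 1212.",
    "3. Keep only the last ten years.",
    "4. Count the enrolled students for each course in each year.",
]
print(question)
print("\n".join(plan))
```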
So that's the basic idea behind our project: to make it interactive, for the system to be able to say, okay, here's what I understood you to be asking, and effectively, here's how I'm going to get that information for you - to have it spell out, step by step, this is how I'm going to calculate that information, and allow you to say, oh, okay, that's right, but step three needs to change, and it needs to be this. And then once you've agreed on what you're actually asking for, it can compute the results and present them to you in a way that you understand, and you can actually believe you got the right answer, because you understood how it went about doing what it was doing. And you never had to take a computer science course in order to get there. David Staley 31:47 Michael White, thank you. Michael White 31:49 Thank you. Eva Dale 31:50 Voices from the Arts and Sciences is produced and recorded at The Ohio State University College of Arts and Sciences Technology Services Studio. Sound engineering by Paul Kotheimer. Produced by Doug Dangler. I'm Eva Dale. Transcribed by https://otter.ai