What An AI Learns From A Baby’s-Eye View Of The World

An 18-month-old baby wearing a head-mounted camera. Photo by Wai Keen Vong.

There’s a lot to learn in the first couple of years of a child’s life—not the least of which is how to talk. But little kids don’t sit down and study a vocabulary book. They soak up language from daily experiences, which are often filled with parents and caregivers saying things like “look at the kitty cat.” Scientists wondered whether an artificial intelligence model could learn about language using a similar strategy—not by being fed a curated set of pictures and words, but by eavesdropping on the day-to-day activities of a small child.

They found that associating images and sounds from 60 hours of video captured by a camera mounted on a baby’s head could teach a computer model a set of several dozen basic nouns, such as “car,” “cat,” and “ball.” And the learning was generalizable, meaning that the computer was able to properly identify cars and cats that it had not seen before.

Dr. Wai Keen Vong, a research scientist in the Center for Data Science at New York University and one of the authors of a study recently published in the journal Science, joins SciFri’s Kathleen Davis to talk about the research and what it can teach us about learning.

Segment Guests

Wai Keen Vong

Dr. Wai Keen Vong is a research scientist in the Center for Data Science at New York University in New York, New York.

Segment Transcript

JOHN DANKOSKY: This is Science Friday. I’m John Dankosky.

KATHLEEN DAVIS: And I’m Kathleen Davis.

There’s a lot to learn in the first couple of years of a child’s life, not the least of which is how to talk. But little kids don’t sit down and just study a vocabulary book. They soak up language from their daily experiences, which are often filled with parents and caregivers asking questions like, “Do you want some water?” or saying things like, “Oh, look, a kitty cat.”

Scientists wondered whether an AI model could learn about language the same way, not by being fed a curated set of pictures and words or the entire content of the internet, but by eavesdropping on the day-to-day activities of a two-year-old. Dr. Wai Keen Vong is a research scientist in the Center for Data Science at New York University and one of the authors of a report on that effort recently published in the journal Science.

Welcome to Science Friday, Dr. Vong.

WAI KEEN VONG: Great to speak to you, Kathleen.

KATHLEEN DAVIS: So you trained your AI model on about 60 hours of video. Where did all that video come from?

WAI KEEN VONG: Yeah. So this video was really the efforts of a number of pioneering developmental psychologists and cognitive scientists, who realized that we really needed data that better matched what children really experienced to formulate better theories. And so about a decade ago, a number of researchers got together and started coming up with how to gather this data. How could we get something that resembled the real experiences of a child? And they produced this mini-camera that they were able to attach to a number of babies’ heads and record them over the course of their developing years.

KATHLEEN DAVIS: OK. So I’m trying to imagine this for myself. This may be the cutest study ever done. I mean, what would it look like if I was looking at this baby with a camera?

WAI KEEN VONG: So the initial set of babies wore the camera on a headband or a strap. But a second set of kids, whose data is now being collected, involves a more sturdy and firm helmet.

It looks like a home video. You are seeing these kids playing with their toys. Their parents are reading books to them. They’re having mealtimes or they’re out in the playground on a slide or a swing. You’re really getting a real first-person glimpse at what the baby is interested in, like on a second-to-second level, really. It’s incredible.

KATHLEEN DAVIS: Wow. So you fed 60 hours of this, what I would imagine is stream-of-consciousness data, to this AI model, as you said, which could include things like going to the park or talking to a caregiver. What did this model actually learn?

WAI KEEN VONG: Right. So the way the model works is it’s trained to learn language by associating what it’s seeing with what it’s hearing. So in cognitive science, this style of learning is what’s known as associative learning because you’re learning to associate certain words with certain images or certain objects out there. And we’re really the first to apply these kinds of theories to this representative subset of what children are seeing and hearing and see what it could learn.

So what it could learn is basically a number of the earliest object words, or nouns, that children also display when learning a language. So those are often the first words that children learn to speak. And we wanted to see if we could capture that aspect of learning from this data and this model.
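[Editor's illustration, not part of the broadcast.] The idea of associating what is seen with what is heard can be sketched with a toy co-occurrence counter. This is only a minimal sketch under loud assumptions: the `moments` data, the function names `learn_associations` and `best_referent`, and the count-based scoring are all hypothetical and chosen for illustration. It is not the model used in the study, which learned from raw video frames and transcribed speech.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each "moment" pairs the objects in view
# with the words spoken at that moment. Real input would be video
# frames and transcribed utterances, not hand-labeled sets.
moments = [
    ({"ball", "crib"}, ["look", "at", "the", "ball"]),
    ({"ball", "chair"}, ["get", "the", "ball"]),
    ({"car", "chair"}, ["the", "car", "goes"]),
    ({"car", "crib"}, ["a", "red", "car"]),
    ({"crib", "chair"}, ["time", "for", "the", "crib"]),
    ({"crib", "ball"}, ["in", "the", "crib"]),
]


def learn_associations(moments):
    """Count how often each heard word co-occurs with each visible object."""
    counts = defaultdict(Counter)
    for objects, words in moments:
        for word in set(words):
            for obj in objects:
                counts[word][obj] += 1
    return counts


def best_referent(counts, word):
    """Return the object seen most often while the word was heard, or None."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


counts = learn_associations(moments)
for word in ("ball", "car", "crib"):
    print(word, "->", best_referent(counts, word))
```

No single moment says which object a word refers to, but across moments the intended object is the one that keeps showing up when the word is spoken.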

KATHLEEN DAVIS: So what kind of words? I’m imagining like “dada” or “mama” or “cat” or “dog,” things like that.

WAI KEEN VONG: Exactly. Yeah, words like “bowl,” words like “crib,” or “car” or “chair.” Yeah, these objects pop up almost in every video that you see. And so they’re the ones that are relatively straightforward to test for as well.

KATHLEEN DAVIS: Were there any words that were surprisingly hard for this AI to learn?

WAI KEEN VONG: Yeah. I think one of the obvious failure modes we saw was that the model just couldn’t really acquire the word “hand” no matter how we tested it.

KATHLEEN DAVIS: “Hand,” like your hand?

WAI KEEN VONG: Exactly. Exactly. And one reason is that hands appear all over the video data set because the kid is manipulating things or playing with things. They always show up in the video no matter what frame or what words are being spoken. And so that is a bit of a challenge with how our model operates. And so we didn’t find consistent learning with that word.

KATHLEEN DAVIS: Did the AI think that “hand” meant something else, or vice versa?

WAI KEEN VONG: Yeah, exactly. So if you look through the data, which I did quite a lot while working on this study, the word “hand” really only gets spoken when the child is playing at the beach or at the sandpit. And so because of this association between words and objects that is built into our model, the model mistakenly learns to associate the word “hand” with bits of sand instead.
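[Editor's illustration, not part of the broadcast.] One way to see why an object that is always in view gives a learner nothing to latch onto is to score each object by how much more often it appears when a word is spoken than it appears overall. The sketch below is an assumption-laden toy: the `frames` data, the function `association_scores`, and the ratio-based score are hypothetical and are not the study's model or data.

```python
from collections import Counter

# Hypothetical frames: (objects visible, words spoken).
# A hand is visible in every frame; "hand" is only spoken at the sandpit.
frames = [
    ({"hand", "ball"}, ["ball"]),
    ({"hand", "ball"}, ["ball"]),
    ({"hand", "car"}, ["car"]),
    ({"hand", "car"}, ["car"]),
    ({"hand", "sand"}, ["hand"]),
    ({"hand", "sand"}, ["hand"]),
]


def association_scores(frames, word):
    """Score each object by P(object | word spoken) / P(object).

    A score of 1.0 means hearing the word tells us nothing about
    whether the object is in view; higher means a stronger link.
    """
    total = len(frames)
    object_freq = Counter()   # frames in which each object is visible
    joint = Counter()         # frames with the object visible AND the word spoken
    frames_with_word = 0
    for objects, words in frames:
        object_freq.update(objects)
        if word in words:
            frames_with_word += 1
            joint.update(objects)
    return {
        obj: (joint[obj] / frames_with_word) / (object_freq[obj] / total)
        for obj in joint
    }


hand_scores = association_scores(frames, "hand")
print(hand_scores)
print("best guess for 'hand':", max(hand_scores, key=hand_scores.get))
```

In this toy data the hand scores exactly 1.0 for every word, because it is visible whether or not the word is spoken, while sand scores higher for "hand" because it only appears in the frames where that word is heard.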

KATHLEEN DAVIS: [LAUGHS]. That is very interesting. There are all sorts of television shows that we hear are supposed to be educational for young kids. Do you think that the model may learn more from watching Sesame Street or Peppa Pig than it would from an actual child?

WAI KEEN VONG: That’s a fantastic question. I don’t know. I do know a number of other researchers from the Netherlands ran a similar study, training a similar kind of model, but on episodes from Peppa Pig, and showed that it was able to learn a similar number of words as our model.

Obviously, children start watching television a little bit later in life. We’re primarily interested in language at the very conception, at the very beginning of life, and showing how that emerges from this blooming, buzzing confusion that every newborn is experiencing.

KATHLEEN DAVIS: I mean, to be clear, we aren’t saying that this is necessarily how kids’ brains actually work, but just that this approach is enough to start to train this AI model. Is that right?

WAI KEEN VONG: That’s right. This work speaks to this larger debate in cognitive science around nature versus nurture. What needs to be built in for humans to learn versus what can we acquire from experience? So our model, our work, is really focused on that latter aspect, where we actually have very few elements, or innate components to our model, and much of the learning is really driven by the video data that’s being fed into our model.

So I would say it’s something like 5% built-in, or innate, elements and 95% experience. And I couldn’t give you a range on people’s opinions in the field of what the right split is, but I hope our work shifts people’s minds to realize that experience is really valuable and we don’t really need that much built in on top of that.

KATHLEEN DAVIS: There are parts of language that aren’t just the words, but things like tone. Do you think that you could train the AI to understand that it’s different for a calm voice to say “Watch out for the dog,” or a scared voice saying, “Watch out for that dog”?

WAI KEEN VONG: Yeah, absolutely. I think that’s something in the future we’d definitely like to try. Like I mentioned before, most children’s words are these common objects or these concrete nouns. But children also learn words like “uh-oh” quite early on in life, right? And the tone in which that is spoken indicates they’re recognizing something about the situation is surprising or a mistake has been made. So I think tone is a huge factor.

Another thing that many other researchers have looked into is this notion of child-directed speech, and recognizing that the way that we often talk to babies is different from how we talk to adults. We often accentuate certain words or certain phonemes. And other researchers have shown that that also might be a pedagogical strategy to help with learning.

KATHLEEN DAVIS: If you gave this AI model more hours of video to work with, do you think that it would just get better at working with this set of nouns that it learned, or would it possibly have a bigger vocabulary?

WAI KEEN VONG: I think the answer is probably both to those. We see that in children as well. We see that children, a little bit after the age of one, really show this large comprehension boost in their word recognition abilities. And we’d hope that our model displays a similar qualitative effect if we were able to feed in more data.

Now, that’s something we’re actually already working on. So stay tuned for that.

KATHLEEN DAVIS: OK. And so far, as you said, you’ve been pretty focused on these concrete nouns. But I mean, what’s next? Do you do color or verbs? How do you expand this language understanding to the next level?

WAI KEEN VONG: Yeah, absolutely. I think going to verbs, to color words, to adjectives, to prepositions. I mean, we really want, in some shape or form, to be able to explain all of the different elements of language that children eventually acquire.

In this work, we really focused on this multi-modal aspect of linking language to vision because that is really where children get started. But later on, it seems like children might have other mechanisms available, perhaps similar to the next token prediction objective in language models, that might enable acquiring some of these more complex words.

I’ve looked into this question a little bit for verbs. It’s quite funny because you think, if a child is learning a word like “run,” if you look in the videos when the word “run” is spoken, it’s because the kid is running, and the camera is really just bobbing up and down. So feeding that into the model, it’s going to have a very different notion of what running is, I think, to most people.

KATHLEEN DAVIS: Right. Well, that’s all the time that we have for today. Dr. Wai Keen Vong is a research scientist in the Center for Data Science at New York University. Thank you so much for taking the time to talk with us today.

WAI KEEN VONG: Thank you for all of your lovely questions, Kathleen. It was great to speak to you.

Copyright © 2023 Science Friday Initiative. All rights reserved. Science Friday transcripts are produced on a tight deadline by 3Play Media. Fidelity to the original aired/published audio or video file might vary, and text might be updated or amended in the future. For the authoritative record of Science Friday’s programming, please visit the original aired/published recording. For terms of use and more information, visit our policies pages at http://www.sciencefriday.com/about/policies/

Meet the Producers and Host

About Charles Bergquist

As Science Friday’s director and senior producer, Charles Bergquist channels the chaos of a live production studio into something sounding like a radio program. Favorite topics include planetary sciences, chemistry, materials, and shiny things with blinking lights.

About Kathleen Davis

Kathleen Davis is a producer and fill-in host at Science Friday, which means she spends her weeks researching, writing, editing, and sometimes talking into a microphone. She’s always eager to talk about freshwater lakes and Coney Island diners.
