Alongside opening your PaLM API for developer access, would Google also be backing developer projects in India?
Today, there are so many startups and developers looking to build solutions that serve these customers. What we’re now enabling is for them to start using our APIs to build these solutions. We also have various teams, including customer engineering units and teams at our Google Cloud division, that already have relationships with many developers. Building on those relationships, these teams will provide further hand-holding and assistance in making the most of our generative AI APIs.
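As a rough illustration of what this developer access looks like, the sketch below calls a PaLM text model through the google-generativeai Python SDK. The package, model name, and method signatures are based on the public documentation at the time and may differ; the API key is a placeholder.

```python
# Minimal sketch of calling the PaLM API from Python.
# Assumes the `google-generativeai` SDK and the `text-bison-001` model;
# package, model name, and signatures may differ in practice.
import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # placeholder key

response = palm.generate_text(
    model="models/text-bison-001",
    prompt="Translate 'Where is the nearest bank?' into Hindi.",
    temperature=0.2,
)
print(response.result)  # generated text, if the call succeeds
```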
Researchers at Indian institutes have struggled with availability of digitized datasets in local languages. Would Google’s dataset now be available to institutes?
We already do that — Project Vaani was done in collaboration with the Indian Institute of Science (IISc). Through this, we’re seeing the first-ever digital dataset for Indic languages, for AI researchers.
When we started working on establishing a single generative AI model for 125 Indian languages, all of these languages were what researchers call zero-corpus. It’s not that we had very little data — for many of them, we had absolutely no digitized data at all. For the first time, we’ve managed to move many Indian languages from zero-corpus to at least the low-resource level.
All of this data is now open-sourced, which means it is openly available to academic researchers, startups, and even large companies. This is just the first tranche; over the coming months and the next year, we’ll keep adding more Indian language data to this database. That will continue as we scale our efforts to more districts across India, which will make the dataset more diverse.
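As a hedged sketch of how researchers might consume such an open dataset, the snippet below loads a speech corpus with the Hugging Face `datasets` library. The dataset identifier and field names here are hypothetical placeholders, not the actual published release.

```python
# Hypothetical sketch: loading an open-sourced Indic speech dataset.
# The dataset ID below is a placeholder; check the official Project Vaani
# release for the real location, schema, and licence terms.
from datasets import load_dataset

ds = load_dataset("example-org/vaani-indic-speech", split="train")  # placeholder ID

sample = ds[0]
print(sample.keys())           # e.g. audio, transcription, district, language
print(sample.get("language"))  # field names depend on the actual release
```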
You’ve also open-sourced a local language bias benchmark in India. Given that data on Indian languages is still so scarce, is it possible to address AI bias at this stage?
The first thing we did on bias was to start understanding the issue in a non-Western context. If you look at most AI literature on bias up until two years ago, all of it, including work on race- and gender-based biases, was in the Western context. What we recognized is that there is a distinct societal context here: in India, for instance, there are multiple additional axes of bias based on caste, religion and others. We wanted to understand these. There is also a technological gap, because the capabilities of language models were poorer in Indian languages than in more mature languages such as English. It is well known that LLMs can hallucinate, which leads to misinformation in their outputs. Hence, problems such as bias often become worse in lower-resource languages.
Then, there is also the pillar of aligning values. For instance, while responding to an elderly user’s queries in stoic phrases may be acceptable in a Western cultural context, the same would not necessarily be so in India.
We wanted to understand these issues in the Indian cultural context; the technological gap in data is just one aspect that was missing in understanding bias in an Indian context. This would therefore apply even to English as used within India.
How good is the benchmark in addressing these biases?
It’s a start. We’ve already used our LLMs to automatically create phrases and sentence completions, through which we were able to uncover a comprehensive set of stereotypes in the local context.
In addition to this, we’re also engaging with the research community, and using our interactions to uncover additional sources of bias. These have led to multiple interesting ideas around intersectional issues of bias: for instance, in the case of a Dalit woman, gender-based and caste-based biases may come together within the model, which is what we’re working to identify and address now.
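As a purely illustrative sketch of the sentence-completion style of probing described above (not Google’s actual benchmark), one can template identity terms into open-ended prompts and inspect what a model completes them with. The templates, identity terms, and the generate() function below are all stand-ins.

```python
# Illustrative sketch of sentence-completion probing for stereotypes.
# NOT the released benchmark; templates, identity terms, and generate()
# are placeholders for whatever model is under test.
TEMPLATES = [
    "A {group} person usually works as",
    "People from the {group} community are known to be",
]
GROUPS = ["<identity term A>", "<identity term B>"]  # e.g. caste, religion, gender terms

def generate(prompt: str) -> str:
    """Stand-in for a call to the language model being audited."""
    return "<model completion>"

completions = {}
for template in TEMPLATES:
    for group in GROUPS:
        prompt = template.format(group=group)
        completions[prompt] = generate(prompt)

# Completions can then be annotated, manually or with another model, to
# flag stereotypical associations along axes such as caste and religion,
# including intersectional combinations like gender and caste.
```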
How is the data on Indian languages collected by Google?
The entire effort is driven by IISc, and we’ve collaborated with them to share best practices on what we need the dataset to be like, in order for it to be used well by AI researchers. The IISc, in turn, has partners that operationalize their data collection efforts by having people reach various districts.
There, these partners then show a set of images to local residents, and record their local dialect answers.
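A minimal sketch of what a single collected record might look like, given the image-prompted, spoken-response workflow described above; the field names are assumptions, not the project’s actual schema.

```python
# Hypothetical schema for one image-prompted speech sample; field names
# are illustrative, not Project Vaani's actual format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechSample:
    image_id: str                         # identifier of the image shown to the speaker
    district: str                         # district where the recording was made
    language: str                         # self-reported language or dialect label
    audio_path: str                       # path to the recorded spoken description
    transcription: Optional[str] = None   # filled in later, if transcribed

sample = SpeechSample(
    image_id="img_0001",
    district="<district name>",
    language="<dialect label>",
    audio_path="recordings/img_0001_speaker_17.wav",
)
```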
Lack of compute is another major challenge, alongside data. Would Google also address this for those who work on generative AI projects?
Yes. In many cases, we’ve been offering researchers access to free Google Cloud credits. This allows them to run their own AI models on our cloud infrastructure.
Compute is a significant enabler for building AI models, and is often hard to access for many developers and researchers. We recognize that, and we’ve been accordingly providing compute capabilities wherever feasible.
What contribution does Google Research India make to the development of PaLM, or even Bard?
We have significant engineering and research teams in India. In particular, our research lab has been making critical contributions to extending the multilingual capabilities of LLMs within Google. We’ve of course started with Indian languages, but much of our work has been done in a manner that lets the same principles be applied more broadly to other under-resourced languages around the world. This can also help address aspects such as bias and misinformation in those languages.
Is it possible for versions of generative AI models to work on-device?
Our PaLM API runs on the cloud. But there are certain generative AI capabilities that are becoming available on-device. These work offline and are heavily reduced models, distilled to run locally. They aren’t as powerful as the ones that run on the cloud, but such models do exist today.
For instance, there are some versions of the PaLM API that are internally available, and work on-device.
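As a generic sketch of the kind of distillation referred to here (not Google’s actual on-device pipeline), the snippet below shows the standard soft-label distillation loss, in which a small student model is trained to match a larger teacher’s output distribution while also fitting the hard labels.

```python
# Generic knowledge-distillation loss (Hinton-style); a standard technique
# sketch, not Google's on-device model pipeline.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher -> student) with hard-label CE."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```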