Senior Advocate Sibal explained that OpenAI gathered data, information, and facts from diverse sources, with news representing only a small fraction and ANI’s contributions forming an even smaller subset. He emphasized that the information collected was transformed and never presented in ANI’s original expression.
On Wednesday, OpenAI asserted before the Delhi High Court that the purpose of its large language model (LLM) is to generate content, not to regurgitate it. OpenAI claimed that content is sometimes regurgitated due to a glitch, but that this is a rare circumstance.
These observations were made in a suit filed by ANI Media against OpenAI, alleging copyright infringement related to the use of ANI's content for training artificial intelligence models. The case was argued before the bench of Justice Amit Bansal by Advocate Sidhanth Kumar on behalf of ANI, while Senior Advocate Amit Sibal represented OpenAI.
During the proceedings, Senior Advocate Amit Sibal presented arguments regarding the functioning of large language models (LLMs). He explained that ChatGPT did not reiterate previous responses but instead generated answers dynamically by selecting the most probable next word.
Senior Advocate Sibal further clarified that the data utilized by the model was gathered and stored abroad. Before being processed, the data underwent filtration to remove unwanted and duplicative content. The model did not directly store textual information but converted it into tokens, which were then transformed into numerical representations. These numerical values helped the system recognize and record patterns within the dataset.
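The tokenization step described in the argument can be sketched as follows. This is a toy word-level vocabulary for illustration only; the actual tokenizer used by OpenAI (typically a learned subword scheme such as byte-pair encoding) was not detailed in the proceedings.

```python
# Illustration of tokenization: text is split into tokens, and each
# token is mapped to a numeric ID. The vocabulary here is hypothetical.
vocab = {"the": 0, "model": 1, "reads": 2, "text": 3, "as": 4, "numbers": 5}

def tokenize(text: str) -> list[int]:
    """Convert text into numeric token IDs using a toy word-level vocabulary."""
    return [vocab[word] for word in text.lower().split()]

ids = tokenize("The model reads text as numbers")
print(ids)  # [0, 1, 2, 3, 4, 5]
```

The point the sketch illustrates is that, after this step, the system operates on numbers rather than on the original text.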
Elaborating on the technical aspects, Senior Advocate Sibal stated that once tokenization occurred, the training data was introduced to the LLM, which then generated vectors for each token. These vectors represented the meanings of words and enabled the model to discern different contexts and relationships among words. He emphasized that the model did not extract specific expressions from any single author but instead identified broader linguistic patterns.
Senior Advocate Sibal argued that due to the vast and diverse data sources involved, no individual author could claim exclusive ownership of the generated content. He maintained that the LLM did not store text but relied on a dictionary composed of vectors rather than words. A prompt entered into the system was first converted into tokens outside the LLM, and these tokens subsequently generated corresponding vectors. The sequence of vectors was then used to create a probability distribution, determining the model’s output.
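The final step of the pipeline described above, turning scores over candidate next tokens into a probability distribution, can be sketched with a softmax. The candidate tokens and their scores here are hypothetical.

```python
import math

# Hypothetical raw scores (logits) for candidate next tokens.
logits = {"court": 2.0, "hearing": 1.5, "kitchen": 0.1}

def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into a probability distribution summing to 1."""
    exps = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
next_token = max(probs, key=probs.get)
print(next_token)  # "court", the most probable next token
```

Selecting the most probable next token in this way, repeatedly, is the "generating answers dynamically" described in the argument, as opposed to retrieving stored text.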
Furthermore, Senior Advocate Sibal asserted that while an author’s specific expression might be protected, the LLM did not extract or replicate such expressions verbatim. Instead, it identified general patterns, correlations, and trends found across publicly available works. He described these extracted elements as fundamental aspects of language, such as word meanings and grammatical rules, rather than unique creative expressions.
Senior Advocate Sibal concluded that the purpose of OpenAI’s model was to analyze structural information from various sources rather than to reproduce any particular work. He maintained that the model’s responses were based on an extensive churn of patterns and syntax, ensuring that no single individual served as the sole source of any generated content.
For ANI: Advocates Sidhant Kumar, Akshit Mago, Manyaa Chandok, Om Batra, Anshika Saxena and Shagun Chopra

For OpenAI: Senior Advocate Amit Sibal with Advocates Sanjeev Kapoor, Nirupam Lodha, Madhav Khosla, Gautam Wadhwa, Moha Paranjpe, Vanshika Thapliyal, Ankit Handa, Darpan Sachdeva, Saksham Dhingra and Rajat Bector

For Intervenors: Advocates Ameet Datta, Ayush Hoonka, Riddima, Naimish Tewari, Akshay Natrajan and Harsh Kaushik

Case Title: Ani Media Pvt Ltd v Open AI (CS(COMM) 1028/2024)