{"id":20351,"date":"2024-03-06T05:19:11","date_gmt":"2024-03-06T05:19:11","guid":{"rendered":"https:\/\/interface.media\/?p=20351"},"modified":"2024-03-25T14:12:14","modified_gmt":"2024-03-25T14:12:14","slug":"big-data-isnt-big-enough-to-train-generative-ai","status":"publish","type":"post","link":"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/","title":{"rendered":"\u201cBig Data\u201d isn\u2019t big enough to train generative AI"},"content":{"rendered":"\n<p>It\u2019s difficult to say exactly how much data exists on the internet at any one time. Billions of gigabits are created and destroyed every day. However, if we were to try and capture the scope of the data that exists online, estimates suggest that the figure was about 175 zettabytes in 2022.&nbsp;<\/p>\n\n\n\n<p>A zettabyte is equal to 1,000 exabytes, or 1 trillion gigabytes, by the way. At 175 zettabytes, that\u2019s (roughly) 3.5 trillion Blu-ray copies of <em>Blade Runner: The Director\u2019s Cut<\/em>. If you converted all the data on the internet into Blu-ray copies of <em>Blade Runner: The Director\u2019s Cut<\/em>, and smashed every disc after watching it, you could spend roughly 770 million years, about a twentieth of the age of the universe, watching Blade Runner before you ran out of copies.&nbsp;<\/p>\n\n\n\n<p>Was that a weird, tortured metaphor? Yes. Was it any more weird and unnecessary than Jared Leto\u2019s presence in <em>Blade Runner 2049<\/em>? Absolutely not. But I digress. The sheer amount of data that\u2019s out there in the world is mind-boggling. It\u2019s hard to fit into metaphors and defies real-world examples.&nbsp;<\/p>\n\n\n\n<p>Also, it seems we\u2019re going to run out of it, and it might happen as early as 2030.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-we-re-running-out-of-good-data\">We\u2019re running out of (good) data?<\/h3>\n\n\n\n<p>The value of data has skyrocketed over the past few years. 
A global preoccupation with extracting, measuring, analysing, and\u2014above all\u2014monetising data defined the past decade. Big data has profoundly impacted our politics, entertainment, social spheres, and economies.&nbsp;<\/p>\n\n\n\n<p>Awareness of the things that can be accomplished with data\u2014from optimising e-commerce revenues to <a href=\"https:\/\/interface.media\/blog\/2024\/03\/06\/generative-ai-is-creating-hew-headaches-for-cybersecurity-teams\/\">cybercrime<\/a> and putting people like Donald Trump in positions of political power\u2014has led to a frenzied scramble for the stuff. Data is the world&#8217;s most valuable resource. Like many other valuable resources, the rate at which we&#8217;re consuming it is turning out to be unsustainable. Organisations have tried frantically to gather as much data as possible. Any and all information about environmental conditions, personal spending habits, racial demographics, political bias, financial markets, and more has been gathered up into huge pools of Big Data.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-ai-training-models-are-to-blame\">AI training models are to blame<\/h3>\n\n\n\n<p>However, there\u2019s a problem related to the hot new use for huge data sets: training AI models.<\/p>\n\n\n\n<p>\u201cThe gigantic volume of data that people stored but couldn\u2019t use has found applications,\u201d writes Atanu Biswas, a Professor at the Indian Statistical Institute in Kolkata. \u201cThe development and effectiveness of AI systems \u2014 their ability to learn, adapt and make informed decisions \u2014 are fuelled by data.\u201d&nbsp;<\/p>\n\n\n\n<p>Training a large language model like the one that fuels OpenAI\u2019s ChatGPT takes a lot of data. It took approximately 570 gigabytes of text data (about 300 billion words) to train ChatGPT. 
AI image generators are even hungrier, with diffusion models like those powering DALL-E and Midjourney requiring over 5.8 billion image-text pairs to generate weird, unpleasant pictures where the hands are all wrong that <a href=\"https:\/\/faroutmagazine.co.uk\/hayao-miyazaki-on-ai-utterly-disgusted\/\">Hayao Miyazaki described as \u201can insult to life itself.\u201d<\/a><\/p>\n\n\n\n<p>This is because these generative AI models \u201clearn\u201d by ingesting an almost unfathomable amount of data, then using statistical probability to create results based on the observable patterns in that data.&nbsp;<\/p>\n\n\n\n<p>Basically, what you put in defines what you get out.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-bad-data-poisons-ai-models\">Bad data poisons AI models<\/h3>\n\n\n\n<p>Increasingly, the huge reserves of data used to train these generative AI models are starting to look thin on the ground. Sure, there\u2019s a brain-breakingly large amount of data out there, but putting low quality\u2014even dangerous\u2014data into a model can produce low quality\u2014even dangerous\u2014results.&nbsp;<\/p>\n\n\n\n<p>Information sourced from social media platforms may exhibit bias or prejudice, or contain disinformation or illicit material, all of which may be unwittingly adopted by the model.&nbsp;<\/p>\n\n\n\n<p>For example, Microsoft trained an AI bot using Twitter data in 2016. Almost immediately, the endeavour resulted in outputs <a href=\"https:\/\/www.theverge.com\/2016\/3\/24\/11297050\/tay-microsoft-chatbot-racist\">tainted with racism and misogyny<\/a>. Another problem is that, as the amount of AI-generated content on the internet increases, new models could end up being trained by cannibalising the content created by old models. 
Since AI can\u2019t create anything \u201cnew\u201d, only rephrase existing content, development would stagnate.&nbsp;<\/p>\n\n\n\n<p>As a result, developers are locked in an increasingly desperate hunt for \u201cbetter\u201d content sources. These include books, online articles, scientific papers, Wikipedia, and specific curated web material. For instance, Google\u2019s AI Assistant was trained using around 11,000 romance novels. The nature of the data supposedly made it a better conversationalist (and, one presumes, a hornier one?). The problem is that this kind of data\u2014books, research papers, and so on\u2014is a limited resource.&nbsp;<\/p>\n\n\n\n<p>The paper <a href=\"https:\/\/arxiv.org\/pdf\/2211.04325.pdf\"><em>Will we run out of data?<\/em><\/a> suggests that the point of data exhaustion could be alarmingly close. Comparing the projected \u201cgrowth of training datasets for vision and language models\u201d to the growth of available data, its authors concluded that \u201cwe will likely run out of language data between 2030 and 2050.\u201d Additionally, they estimate that \u201cwe will likely run out of vision data between 2030 to 2070.\u201d&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-where-will-we-get-our-ai-training-data-in-the-future-nbsp\">Where will we get our AI training data in the future?&nbsp;<\/h3>\n\n\n\n<p>There are several ways this problem could resolve itself. Proposed solutions include smaller language models and even synthetic data created specifically to train AIs. There has even been an open letter calling for a pause on the development of the most powerful AI systems, signed by Elon Musk and Steve Wozniak, among others.&nbsp;<\/p>\n\n\n\n<p>\u201cThis is an existential risk,&#8221; commented Geoffrey Hinton, one of AI\u2019s most prominent figures, shortly after quitting Alphabet last year. 
&#8220;It\u2019s close enough that we ought to be &#8230; putting a lot of resources into figuring out what we can do about it.\u201d<\/p>\n\n\n\n<p>One hellish vision for the future appeared during the 2023 actors\u2019 strike. During the strike, the MIT Technology Review reported that tech firms extended an opportunity to unemployed actors. They could earn $150 per hour by portraying a range of emotions on camera. The captured footage was then used to aid in the &#8216;training&#8217; of AI systems.<\/p>\n\n\n\n<p>At least we won\u2019t <em>all<\/em> lose our jobs. Some of us will be paid to write new erotic fiction to power the next generation of Siri.&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The task of separating useful data from deepfakes, junk, and spam is getting harder for big data scientists looking to train the next generation of AI.<\/p>\n","protected":false},"author":480,"featured_media":20352,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"apple_news_api_created_at":"2024-03-06T05:19:16Z","apple_news_api_id":"de4b5d99-4de2-40ec-b9e4-0b0acf2f536d","apple_news_api_modified_at":"2024-03-25T14:12:11Z","apple_news_api_revision":"AAAAAAAAAAAAAAAAAAAAAQ==","apple_news_api_share_url":"https:\/\/apple.news\/A3ktdmU3iQOy55AsKzy9TbQ","apple_news_cover_media_provider":"image","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_cover_video_id":0,"apple_news_cover_video_url":"","apple_news_cover_embedwebvideo_url":"","apple_news_is_hidden":"","apple_news_is_paid":"","apple_news_is_preview":"","apple_news_is_sponsored":"","apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":[],"apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[3],"tags":[],"topic":[614],"class_list":["post-20351","post","
type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-the-interface","topic-data-ai"],"acf":[],"apple_news_notices":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.6 (Yoast SEO v26.6) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>\u201cBig Data\u201d isn\u2019t big enough to train generative AI - Interface<\/title>\n<meta name=\"description\" content=\"Separating useful data from deepfakes, junk, and spam is getting harder for big data scientists looking to train the next generation of Gen AI.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"\u201cBig Data\u201d isn\u2019t big enough to train generative AI\" \/>\n<meta property=\"og:description\" content=\"Separating useful data from deepfakes, junk, and spam is getting harder for big data scientists looking to train the next generation of Gen AI.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/\" \/>\n<meta property=\"og:site_name\" content=\"Interface\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-06T05:19:11+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-03-25T14:12:14+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/interface.media\/wp-content\/uploads\/sites\/3\/2024\/03\/iStock-1408572301.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"1024\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Dan Brightmore\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta 
name=\"twitter:data1\" content=\"Dan Brightmore\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/\",\"url\":\"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/\",\"name\":\"\u201cBig Data\u201d isn\u2019t big enough to train generative AI - Interface\",\"isPartOf\":{\"@id\":\"https:\/\/interface.media\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/interface.media\/wp-content\/uploads\/sites\/3\/2024\/03\/iStock-1408572301.jpg\",\"datePublished\":\"2024-03-06T05:19:11+00:00\",\"dateModified\":\"2024-03-25T14:12:14+00:00\",\"author\":{\"@id\":\"https:\/\/interface.media\/#\/schema\/person\/7c33499ca8e42b097028109cccb22748\"},\"description\":\"Separating useful data from deepfakes, junk, and spam is getting harder for big data scientists looking to train the next generation of Gen 
AI.\",\"breadcrumb\":{\"@id\":\"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#primaryimage\",\"url\":\"https:\/\/interface.media\/wp-content\/uploads\/sites\/3\/2024\/03\/iStock-1408572301.jpg\",\"contentUrl\":\"https:\/\/interface.media\/wp-content\/uploads\/sites\/3\/2024\/03\/iStock-1408572301.jpg\",\"width\":1024,\"height\":1024,\"caption\":\"Brutalist Art inspired vector pattern artwork made with abstract geometric shapes and bold forms. Digital graphics design for poster, cover, art, presentation, prints, fabric, wallpaper and etc.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/interface.media\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"\u201cBig Data\u201d isn\u2019t big enough to train generative AI\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/interface.media\/#website\",\"url\":\"https:\/\/interface.media\/\",\"name\":\"Interface\",\"description\":\"Delivering World Class Content \u201cFrom Executive, For 
Executive\u201c\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/interface.media\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-GB\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/interface.media\/#\/schema\/person\/7c33499ca8e42b097028109cccb22748\",\"name\":\"Dan Brightmore\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/interface.media\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/e9ca282f0ef431735a64685769ad57886e24b074c4c58314392755fb79164164?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/e9ca282f0ef431735a64685769ad57886e24b074c4c58314392755fb79164164?s=96&d=mm&r=g\",\"caption\":\"Dan Brightmore\"},\"url\":\"https:\/\/interface.media\/blog\/author\/dbrightmore\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"\u201cBig Data\u201d isn\u2019t big enough to train generative AI - Interface","description":"Separating useful data from deepfakes, junk, and spam is getting harder for big data scientists looking to train the next generation of Gen AI.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"og_locale":"en_GB","og_type":"article","og_title":"\u201cBig Data\u201d isn\u2019t big enough to train generative AI","og_description":"Separating useful data from deepfakes, junk, and spam is getting harder for big data scientists looking to train the next generation of Gen AI.","og_url":"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/","og_site_name":"Interface","article_published_time":"2024-03-06T05:19:11+00:00","article_modified_time":"2024-03-25T14:12:14+00:00","og_image":[{"width":1024,"height":1024,"url":"https:\/\/interface.media\/wp-content\/uploads\/sites\/3\/2024\/03\/iStock-1408572301.jpg","type":"image\/jpeg"}],"author":"Dan Brightmore","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Dan Brightmore","Estimated reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/","url":"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/","name":"\u201cBig Data\u201d isn\u2019t big enough to train generative AI - 
Interface","isPartOf":{"@id":"https:\/\/interface.media\/#website"},"primaryImageOfPage":{"@id":"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#primaryimage"},"image":{"@id":"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#primaryimage"},"thumbnailUrl":"https:\/\/interface.media\/wp-content\/uploads\/sites\/3\/2024\/03\/iStock-1408572301.jpg","datePublished":"2024-03-06T05:19:11+00:00","dateModified":"2024-03-25T14:12:14+00:00","author":{"@id":"https:\/\/interface.media\/#\/schema\/person\/7c33499ca8e42b097028109cccb22748"},"description":"Separating useful data from deepfakes, junk, and spam is getting harder for big data scientists looking to train the next generation of Gen AI.","breadcrumb":{"@id":"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/"]}]},{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#primaryimage","url":"https:\/\/interface.media\/wp-content\/uploads\/sites\/3\/2024\/03\/iStock-1408572301.jpg","contentUrl":"https:\/\/interface.media\/wp-content\/uploads\/sites\/3\/2024\/03\/iStock-1408572301.jpg","width":1024,"height":1024,"caption":"Brutalist Art inspired vector pattern artwork made with abstract geometric shapes and bold forms. 
Digital graphics design for poster, cover, art, presentation, prints, fabric, wallpaper and etc."},{"@type":"BreadcrumbList","@id":"https:\/\/interface.media\/blog\/2024\/03\/06\/big-data-isnt-big-enough-to-train-generative-ai\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/interface.media\/"},{"@type":"ListItem","position":2,"name":"\u201cBig Data\u201d isn\u2019t big enough to train generative AI"}]},{"@type":"WebSite","@id":"https:\/\/interface.media\/#website","url":"https:\/\/interface.media\/","name":"Interface","description":"Delivering World Class Content \u201cFrom Executive, For Executive\u201c","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/interface.media\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-GB"},{"@type":"Person","@id":"https:\/\/interface.media\/#\/schema\/person\/7c33499ca8e42b097028109cccb22748","name":"Dan Brightmore","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/interface.media\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/e9ca282f0ef431735a64685769ad57886e24b074c4c58314392755fb79164164?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e9ca282f0ef431735a64685769ad57886e24b074c4c58314392755fb79164164?s=96&d=mm&r=g","caption":"Dan 
Brightmore"},"url":"https:\/\/interface.media\/blog\/author\/dbrightmore\/"}]}},"_links":{"self":[{"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/posts\/20351","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/users\/480"}],"replies":[{"embeddable":true,"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/comments?post=20351"}],"version-history":[{"count":3,"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/posts\/20351\/revisions"}],"predecessor-version":[{"id":20527,"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/posts\/20351\/revisions\/20527"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/media\/20352"}],"wp:attachment":[{"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/media?parent=20351"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/categories?post=20351"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/tags?post=20351"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/interface.media\/wp-json\/wp\/v2\/topic?post=20351"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}