• BombOmOm@lemmy.world
    link
    fedilink
    English
    arrow-up
    201
    arrow-down
    2
    ·
    1 year ago

    The difficult part of software development has always been the continuing support. Did the chatbot set up a versioning system, a build system, a backup system, a ticketing system, unit tests, and help docs for users? Did it get a conflicting request from two different customers and intelligently resolve them? Was it given a vague problem description that it then had to get on a call with the customer to figure out and hunt down what the customer actually wanted before devising/implementing a solution?

    This is the expensive part of software development. Hiring an outsourced, low-tier programmer for almost nothing has always been possible; the low-tier programmer being slightly cheaper doesn’t change the game in any meaningful way.

    • Knusper@feddit.de
      link
      fedilink
      English
      arrow-up
      12
      ·
      1 year ago

      Yeah, I’m already quite content, if I know upfront that our customer’s goal does not violate the laws of physics.

      Obviously, there’s also devs who code more run-of-the-mill stuff, like yet another business webpage, but those are still coded anew (and not just copy-pasted), because customers have different and complex requirements. So, even those are still quite a bit more complex than designing just any Gomoku game.

      • NoRodent@lemmy.world
        link
        fedilink
        English
        arrow-up
        8
        ·
        1 year ago

        I’m already quite content, if I know upfront that our customer’s goal does not violate the laws of physics.

        Haha, this is so true and I don’t even work in IT. For me there’s bonus points if the customer’s initial idea is solvable within Euclidean geometry.

        • Knusper@feddit.de
          link
          fedilink
          English
          arrow-up
          4
          ·
          1 year ago

          Well, as per above, these are extremely complex requirements, so most don’t make for a good story.

          One of the simpler examples is that a customer wanted a solution for connecting special hardware devices across the globe, which are normally only connected directly.

          Then, when we talked to experts for those devices, we learnt that for security reasons, these devices expect requests to complete within a certain timeframe. No one could tell us what these timeframes usually are, but it certainly sounded like the universe’s speed limit, a.k.a. the speed of light, could get in our way (takes roughly 66 ms to go halfway around the globe).
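The 66 ms figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes a straight path along the surface, a mean Earth radius of 6,371 km, and (as a further assumption not made in the comment) that light in optical fibre travels at roughly two thirds of its vacuum speed.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum, km/s
EARTH_MEAN_RADIUS_KM = 6_371.0     # assumed mean Earth radius, km


def one_way_delay_ms(distance_km: float, speed_km_s: float = SPEED_OF_LIGHT_KM_S) -> float:
    """Time in milliseconds for a signal to cover distance_km at speed_km_s."""
    return distance_km / speed_km_s * 1000.0


# Halfway around the globe along the surface is half the circumference.
half_circumference_km = math.pi * EARTH_MEAN_RADIUS_KM

vacuum_ms = one_way_delay_ms(half_circumference_km)

# Assumption: signals in fibre propagate at about 2/3 of c.
fibre_ms = one_way_delay_ms(half_circumference_km, SPEED_OF_LIGHT_KM_S * 2 / 3)

print(f"vacuum: {vacuum_ms:.1f} ms one way, fibre: {fibre_ms:.1f} ms one way")
```

Under these assumptions the one-way vacuum figure lands between 66 and 67 ms, and a request/response round trip is at least double that before any routing or processing overhead.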

          Eventually, we learned that the customer was actually aware of this problem and was fine with a solution, even if it only worked across short distances. But yeah, we didn’t know that upfront…

      • Vlyn@lemmy.zip
        link
        fedilink
        English
        arrow-up
        10
        ·
        1 year ago

        If you just let it do a full rewrite again and again, what protects against breaking changes in the API? Software doesn’t exist in a vacuum, there might be other businesses or people using a certain API and relying on it. A breaking change could be as simple as the same endpoint now being named slightly differently.

        So if you now start to mark every API method as “please no breaking changes for this” at what point do you need a full software developer again to take care of the AI?

        I’ve also never seen AI modify an existing code base, it’s always new code getting spit out (80% correct or so, it likes to hallucinate functions that don’t even exist). Sure, for run of the mill templates you can use it, but even a developer who told me on here they rely heavily on ChatGPT said they need to verify all the code it spits out, because sometimes it’s garbage.

        In the end it’s a damn language model that uses probability on what the next word should be. It’s fantastic for what it does, but it has no consistent internal logic and the way it works it never will.

          • Vlyn@lemmy.zip
            link
            fedilink
            English
            arrow-up
            9
            ·
            1 year ago

            Mate, I’ve used ChatGPT before, it straight up hallucinates functions if you want anything more complex than a basic template or a simple program. And as things are in programming, if even one tiny detail is wrong, things straight up don’t work. Also have fun putting ChatGPT answers into a real program you might have to compile, are you going to copy code into hundreds of files?

            My example was public APIs, you might have an endpoint /v2/device that was generated the first time around. Now external customers/businesses built their software to access this endpoint. Next run around the AI generates /v2/appliance instead, everything breaks (while the software itself and unit tests still seem to work for the AI, it just changed a name).
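One way to catch that kind of rename is a contract check that pins the published endpoint names and compares them with whatever a regeneration produces. The sketch below is hypothetical: the `/v2/device/{id}` path, the `breaking_changes` helper, and the idea that the route list is available as a set of strings are all assumptions made for illustration.

```python
# Hypothetical published contract: paths that external customers already call.
PUBLISHED_ENDPOINTS = {"/v2/device", "/v2/device/{id}"}


def breaking_changes(published: set[str], generated: set[str]) -> set[str]:
    """Return the published endpoints that the regenerated API no longer serves."""
    return published - generated


# Routes from two imagined generation runs of the same application.
first_run = {"/v2/device", "/v2/device/{id}"}
second_run = {"/v2/appliance", "/v2/appliance/{id}"}  # same logic, new names

unchanged = breaking_changes(PUBLISHED_ENDPOINTS, first_run)
missing = breaking_changes(PUBLISHED_ENDPOINTS, second_run)

print("first run breaks:", sorted(unchanged))
print("second run breaks:", sorted(missing))
```

The check itself is trivial; the point of the comment still stands, because someone has to decide what goes into `PUBLISHED_ENDPOINTS` and keep it up to date.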

            If you don’t want that change you now have to tell the AI what to name things (or what to keep consistent), who is going to do that? The CEO? The intern? Who writes the perfect specification?

              • Vlyn@lemmy.zip
                link
                fedilink
                English
                arrow-up
                8
                ·
                1 year ago

                Management and sound technical specifications, that sounds to me like you’ve never actually worked in a real software company.

                You just said what the main problem is: ChatGPT is not perfect. Code that isn’t perfect (compiles + has consistent logic) is worthless. If you need a developer to look over it you’ve already lost and it would be faster to have that developer write the code themselves.

                Have you ever gotten a pull request with 10k lines of code? The AI could spit out so much code in an instant, no developer would be able to debug this mess or do a code review. They’ll just click “Approve” and throw whatever the AI decided to spit out onto the giant garbage heap.

                If there’s a bug down the line (if you even get the whole thing to run), good luck finding it if no one in your developer team even wrote the code in the first place.

    • akrot@lemmy.world
      link
      fedilink
      English
      arrow-up
      6
      arrow-down
      2
      ·
      1 year ago

      Absolutely true, but there are many efforts directed at implementing those solutions with AIs.

    • doublejay1999@lemmy.world
      link
      fedilink
      English
      arrow-up
      4
      ·
      1 year ago

      Which is why plenty of companies merely pay lip service to it, or don’t do it at all and outsource it to ‘communities’

  • theluddite@lemmy.ml
    link
    fedilink
    English
    arrow-up
    141
    arrow-down
    1
    ·
    1 year ago

    “I gave an LLM a wildly oversimplified version of a complex human task and it did pretty well”

    For how long will we be forced to endure different versions of the same article?

    The study said 86.66% of the generated software systems were “executed flawlessly.”

    Like I said yesterday, in a post celebrating how ChatGPT can do medical questions with less than 80% accuracy, that is trash. A company with absolute shit code still has virtually all of it “execute flawlessly.” Whether or not code executes is not the bar by which we judge it.
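To make the gap between "executes" and "correct" concrete, here is a deliberately buggy sketch of a Gomoku win check. It is invented for illustration and has nothing to do with the code the study generated; the function and board layout are assumptions.

```python
Board = list[list[str]]


def has_five_in_a_row(board: Board, player: str) -> bool:
    """Buggy on purpose: only scans rows, so columns and diagonals are never checked."""
    for row in board:
        run = 0
        for cell in row:
            run = run + 1 if cell == player else 0
            if run >= 5:
                return True
    return False


def empty_board(size: int = 9) -> Board:
    """Build a size x size board filled with '.' for empty cells."""
    return [["." for _ in range(size)] for _ in range(size)]


# Five in a row horizontally: detected.
horizontal = empty_board()
for col in range(5):
    horizontal[0][col] = "X"

# Five in a row vertically: a real win that the function misses.
vertical = empty_board()
for row in range(5):
    vertical[row][0] = "X"

detects_horizontal_win = has_five_in_a_row(horizontal, "X")
detects_vertical_win = has_five_in_a_row(vertical, "X")
print(detects_horizontal_win, detects_vertical_win)
```

Both calls run without raising anything, so a check that only asks "did it execute?" would pass this code even though it gives the wrong answer for the vertical win.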

    Even if it were to hit 100%, which it does not, there’s so much more to making things than this obviously oversimplified simulation of a tech company. Real engineering involves getting people in a room, managing stakeholders, navigating conflicting desires from different stakeholders, getting to know the human beings who need a problem solved, and so on.

    LLMs are not capable of this kind of meaningful collaboration, despite all this hype.

    • thantik@lemmy.world
      link
      fedilink
      English
      arrow-up
      33
      arrow-down
      1
      ·
      1 year ago

      AI regularly hallucinates API endpoints that don’t exist, functions that aren’t part of that language, libraries that don’t exist. There’s no fucking way it did any of this bullshit. Like, yeah - it can probably do a mean autocomplete, but this is being pushed so hard because they want to drive wages down even harder. They want know-nothing middle-managers to point to this article and say “I can replace you with AI, get to work!”…that’s the only purpose of this crap.

      • Corkyskog@sh.itjust.works
        link
        fedilink
        English
        arrow-up
        13
        arrow-down
        1
        ·
        edit-2
        1 year ago

        I think there is less of a conspiracy, and it’s just pushing investment. These AI articles sound exactly like when the internet was new and most people only had a cursory experience with it and people were pumping any company if they just said the word internet.

        Now that “Blockchain” has been beaten to death, they need a new hype word to drive mindless investment.

    • PlexSheep@feddit.de
      link
      fedilink
      English
      arrow-up
      23
      arrow-down
      4
      ·
      edit-2
      1 year ago

      Thank you for writing this so I only have to upvore upvote you.

      Edit: What the difference between one key can be

        • NoRodent@lemmy.world
          link
          fedilink
          English
          arrow-up
          5
          ·
          1 year ago

          Is it… vore but… upwards? So… vomiting people? Nah, I don’t want to know either.

          • Bleeping Lobster@lemmy.world
            link
            fedilink
            English
            arrow-up
            3
            ·
            1 year ago

            What’s up, vore!

            AFAIK vore is a rare fetish where someone gains sexual gratification from imagining swallowing someone whole (or imagining themselves being swallowed whole). Like the Bilquis scenes from American Gods, which I found oddly arousing.

            Oh fuck.

            • RiikkaTheIcePrincess@kbin.social
              link
              fedilink
              arrow-up
              2
              ·
              1 year ago

              Well, there are different kinds. Not all involve swallowing a critter whole, not all involve death, not all involve, er, mouths.

              Hey wait, where’s everyone going? Oh well, more vore for me 🤣 Guess I should go check out American Gods. … And look for a particular kind of place to hang out 🤔

              • Bleeping Lobster@lemmy.world
                link
                fedilink
                English
                arrow-up
                2
                ·
                1 year ago

                It’s not for everyone, but I loved it and was saddened that the show got cancelled. It’s very surreal in places, the settings switch from standard middle America to jaw-droppingly-stunning god realm stuff.

    • merc@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      5
      ·
      1 year ago

      80% accuracy, that is trash

      More than 80% of most codebases is boilerplate stuff: including the right files for dependencies, declaring functions with the right number of parameters using the right syntax, handling basic easily anticipated errors, etc. Sometimes there’s even more boilerplate, like when you’re iterating over a list, or waiting for input and handling it.

      The rest of the stuff is why programming is a highly paid job. Even a junior developer is going to be much better than an LLM at this stuff because at least they understand it’s hard, and at least often know when they should ask for help because they’re in over their heads. An LLM will “confidently” just spew out plausible bullshit and declare the job done.

      Because an LLM won’t ask for help, won’t ask for clarifications, and can’t understand that it might have made a mistake, you’re going to need your highly paid programmers to go in and figure out what the LLM did and why it’s wrong.

      Even perfecting self-driving is going to be easier than a truly complex software engineering project. At least with self-driving, the constraints are going to be limited because you’re dealing with the real world. The job is also always the same – navigate from A to B. In the software world you’re only limited by the limits of math, and math isn’t very limiting.

      I have no doubt that LLMs and generative AI will change the job of being a software engineer / programmer. But, fundamentally programming comes down to actually understanding the problem, and while LLMs can pretend they understand things, they’re really just like well-trained parrots who know what sounds to make in specific situations, but with no actual understanding behind it.

    • R0cket_M00se@lemmy.world
      link
      fedilink
      English
      arrow-up
      4
      arrow-down
      3
      ·
      1 year ago

      LLMs are not capable of this kind of meaningful collaboration

      Which is why they’re a tool for professionals to amplify their workload, not a replacement for them.

      • CmdrShepard@lemmy.one
        link
        fedilink
        English
        arrow-up
        3
        arrow-down
        1
        ·
        1 year ago

        But C-suites will read articles like this and fire their development teams “because AI can do it.” I have my popcorn ready for the day it begins.

    • Nougat@kbin.social
      link
      fedilink
      arrow-up
      60
      arrow-down
      2
      ·
      1 year ago

      I’ve tried to have ChatGPT help me out with some PowerShell, and it consistently wanted me to use cmdlets which do not exist for on-premises Exchange. I told it as much, it apologized, and wanted me to use cmdlets that don’t exist at all.

      Large Language Models are not Artificial Intelligence.

    • thorbot@lemmy.world
      link
      fedilink
      English
      arrow-up
      6
      arrow-down
      2
      ·
      1 year ago

      This also completely glosses over the fact that AI capable of writing this had huge R&D costs to get to that point and also has ongoing costs associated with running it. This whole article is a fucking joke, probably written by AI

    • aard@kyu.de
      link
      fedilink
      English
      arrow-up
      3
      ·
      1 year ago

      You meant to say “a competent human”, which a lot of programmers are not.

      While I’d expect this to be of rather low quality I’d bet money on having seen worse projects done by actual humans in the last 25 years.

  • flamekhan@lemmy.world
    link
    fedilink
    English
    arrow-up
    88
    arrow-down
    1
    ·
    1 year ago

    “We asked a Chat Bot to solve a problem that already has a solution and it did ok.”

    • merc@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      58
      ·
      1 year ago

      to solve a problem that already has a solution

      And whose solution was part of its training set…

      • variaatio@sopuli.xyz
        link
        fedilink
        English
        arrow-up
        19
        ·
        edit-2
        1 year ago

        half the time hallucinating something crazy in the mix.

        Another funny: Yeah, it’s perfect we just need to solve this small problem of it hallucinating.

        Ahemm… solving hallucinating is the “no it actually has to understand what it is doing” part aka the actual intelligence. The actually big and hard problem. The actual understanding of what it is asked to do and what solutions to that ask are sane, rational and workable. Understanding the problem and understanding the answer, excluding wrong answers. Actual analysis, understanding and intelligence.

        • merc@sh.itjust.works
          link
          fedilink
          English
          arrow-up
          9
          ·
          1 year ago

          Not only that, but the same variables that turn on “hallucination” are the ones that make it interesting.

          By the very design of generative LLMs, the same knob that makes them unpredictable makes them invent “facts”. If they’re 100% predictable they’re useless because they just regurgitate word for word something that was in the training data. But, as soon as they’re not 100% predictable they generate word sequences in a way that humans interpret as lying or hallucinating.

          So, you can’t have a generative LLM that is both “creative” in that it comes up with a novel set of words, without also having “hallucinations”.

          • JoBo@feddit.uk
            link
            fedilink
            English
            arrow-up
            4
            ·
            1 year ago

            the same knob that makes them unpredictable makes them invent “facts”.

            This isn’t what makes them invent facts, or at least not the only (or main?) reason. Fake references, for example, arise because it encounters references in text, so it knows what they look like and where they should be used. It just doesn’t know what one is or that it’s supposed to match up to something real which says what the text implies that it says.

            • merc@sh.itjust.works
              link
              fedilink
              English
              arrow-up
              1
              ·
              1 year ago

              so it knows what they look like and where they should be used

              Right, and if it’s set to a “strict” setting where it only ever uses the 100% perfect next word, if the words leading up to a reference are a match for a reference it has seen before it will spit out that specific reference from its training data. But, when it’s set to be “creative”, and predict words that are a good but not perfect match, it will spit out references that are plausible but don’t exist.

              So, if you want it to only use real references, you have to set it up to not be at all creative and always use the perfect next word. But, that setting isn’t very interesting because it just word-for-word spits out whatever was in its training data. If you want it to be creative, it will “daydream” references that don’t exist. The same knob controls both behaviours.
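For reference, the "knob" in this exchange is usually called sampling temperature. The sketch below shows only the mechanics of that setting, using invented tokens and scores rather than any real model or API; whether it accounts for hallucination is exactly what is being disputed in this thread, and the code does not settle that.

```python
import math
import random


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def pick_next_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Temperature 0 is treated as greedy: always take the highest-scoring token."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = softmax_with_temperature(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Invented scores for three candidate next tokens.
logits = {"cat": 3.0, "dog": 2.0, "car": 1.0}
rng = random.Random(0)

greedy = pick_next_token(logits, 0.0, rng)
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
sampled = pick_next_token(logits, 2.0, rng)

print(greedy, round(cold["cat"], 3), round(hot["cat"], 3), sampled)
```

At the lower temperature the top-scoring token takes most of the probability mass, while at the higher temperature the lower-scoring tokens are picked more often.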

              • JoBo@feddit.uk
                link
                fedilink
                English
                arrow-up
                2
                arrow-down
                1
                ·
                1 year ago

                That’s not how it works at all. That’s not even how references work.

    • dustyData@lemmy.world
      link
      fedilink
      English
      arrow-up
      3
      ·
      1 year ago

      Please ignore the hundreds of thousands of dollars and the corresponding electricity that was required to run the servers and infrastructure required to train and use these models, please. Or the master cracks the whip again, please, just say you’ll invest in our startup, please!

    • thanks_shakey_snake@lemmy.ca
      link
      fedilink
      English
      arrow-up
      5
      ·
      1 year ago

      They did do management-- They modeled the whole company as individual “staff” communicating with each other: CEO-bot communicates a product direction to the CTO-bot who communicates technical requirements to the developer-bot who asks for a “beautiful user interface” (lol) from the “art designer” (lol).

      It’s all super rudimentary and goofy, but management was definitely part of the experiment.

        • thanks_shakey_snake@lemmy.ca
          link
          fedilink
          English
          arrow-up
          3
          ·
          1 year ago

          It was testing that the code worked, of course :) That was the only place that had human intervention, other than a) providing the initial prompt, and b) providing icons and stuff for the GUI, instead of using generated ones. That was the “get out of jail free” card:

          In cases where an interpreter struggles with identifying fine-grained logical issues, the involvement of a human client in software testing becomes optional. CHATDEV enables the human client to provide feedback and suggestions in natural language, similar to a reviewer or tester, using black-box testing or other strategies.

    • mrginger@lemmy.world
      link
      fedilink
      English
      arrow-up
      5
      arrow-down
      1
      ·
      edit-2
      1 year ago

      This is who will get replaced first, and they don’t want to see it. They’re the most important, valuable part of the company in their own mind, yet that was the one thing the AI got right, the management part. It still needed the creative mind of a human programmer to do the code properly, or think outside the box.

  • scarabic@lemmy.world
    link
    fedilink
    English
    arrow-up
    55
    ·
    1 year ago

    A test that doesn’t include a real commercial trial or A/B test with real human customers means nothing. Put their game in the App Store and tell us how it performs. We don’t care that it shat out code that compiled successfully. Did it produce something real and usable or just gibberish that passed 86% of its own internal unit tests, which were also gibberish?

    • ArbiterXero@lemmy.world
      link
      fedilink
      English
      arrow-up
      68
      arrow-down
      1
      ·
      1 year ago

      As someone that uses ChatGPT daily for boilerplate code because it’s super helpful…

      I call complete bullshite

      The program here will be “hello world” or something like that.

      • LazaroFilm@lemmy.world
        link
        fedilink
        English
        arrow-up
        28
        arrow-down
        1
        ·
        edit-2
        1 year ago

        Absolutely I can create a code for your app.

        void myApp(void) {
          // add the code for your app here
          return true;
        }
        

        You may need to change the code above to fit your needs. Make sure you replace the comment with the proper code for your app to work.

      • Semi-Hemi-Demigod@kbin.social
        link
        fedilink
        arrow-up
        6
        ·
        1 year ago

        It’s great for things like “How do I write this kind of loop in this language” but when I asked it for something more complex like a class or a big-ish function it hallucinates. But it makes for a very fast way to get up to speed in a new language

          • Semi-Hemi-Demigod@kbin.social
            link
            fedilink
            arrow-up
            2
            arrow-down
            2
            ·
            1 year ago

            It’s a lot less in my opinion, because you can just ask it a question rather than having to read and interpret things. Every programming tutorial in every language is going to waste my time explaining how loops and conditionals work, when all I want is how this language does them.

              • Semi-Hemi-Demigod@kbin.social
                link
                fedilink
                arrow-up
                1
                arrow-down
                1
                ·
                1 year ago

                Right, but you can’t give it the variable names you’re using and have it fill them in, and if you want to do something inside that loop with

                I can ask ChatGPT “Write me a loop in C# that will add the variable value_increase to the variable current_value and exit when current_value is equal to or greater than the variable limit_value, with all the variables being floats”
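For scale, the loop described in that prompt looks roughly like this. It is a sketch in Python rather than the C# the prompt asks for; the variable names come from the comment, while the function name and the guard against a non-positive step are additions.

```python
def accumulate_until_limit(current_value: float, value_increase: float, limit_value: float) -> float:
    """Add value_increase to current_value until it is equal to or greater than limit_value."""
    # Added guard (not part of the prompt): a zero or negative step would loop forever.
    if value_increase <= 0 and current_value < limit_value:
        raise ValueError("value_increase must be positive to reach limit_value")
    while current_value < limit_value:
        current_value += value_increase
    return current_value


result = accumulate_until_limit(current_value=0.0, value_increase=0.25, limit_value=1.0)
print(result)
```

With floats, a step such as 0.1 does not sum exactly, which is why the exit condition is "equal to or greater than" rather than strict equality.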

                You won’t find that answer immediately on the Internet, and you’re more likely to make errors synthesizing the new syntax.

                But you do you, I’ll keep using ChatGPT and looking like a miracle worker.

                • Kerfuffle@sh.itjust.works
                  link
                  fedilink
                  English
                  arrow-up
                  3
                  ·
                  1 year ago

                  Right, but you can’t give it the variable names you’re using and have it fill them in, and if you want to do something inside that loop with

                  Why are you actively trying to avoid learning how to write the loop? Are you planning to have ChatGPT fill in your loop templates for the rest of your life?

                  But you do you, I’ll keep using ChatGPT and looking like a miracle worker.

                  It’s going to be slower overall than just using the reference and learning how to do it. I really, really am skeptical that a developer at the level where they need that feature is going to seem like a miracle worker to anyone other than people who are just impressed when you can do anything with a computer.

                • Vlyn@lemmy.zip
                  link
                  fedilink
                  English
                  arrow-up
                  2
                  ·
                  1 year ago

                  If writing simple loops with ChatGPT makes you a miracle worker then you might have other problems than AI.

                  And even simple things break down when you ask it about using library functions (it likes to hallucinate heavily there).

      • Ertebolle@kbin.social
        link
        fedilink
        arrow-up
        6
        ·
        1 year ago

        OTOH, if you take that hello world program and ask it to compose a themed cocktail menu around it, it’ll cheerfully do that for you.

      • kitonthenet@kbin.social
        link
        fedilink
        arrow-up
        3
        ·
        edit-2
        1 year ago

        I can totally see the use case for boilerplate, but I’m also very very rarely writing new classes from scratch or whatever.

        As always, proof of concept or gtfo

    • KoboldCoterie@pawb.social
      link
      fedilink
      English
      arrow-up
      16
      arrow-down
      1
      ·
      1 year ago

      The study said 86.66% of the generated software systems were “executed flawlessly.”

      But…

      Nevertheless, the study isn’t perfect: Researchers identified limitations, such as errors and biases in the language models, that could cause issues in the creation of software. Still, the researchers said the findings “may potentially help junior programmers or engineers in the real world” down the line.

        • KoboldCoterie@pawb.social
          link
          fedilink
          English
          arrow-up
          12
          arrow-down
          1
          ·
          1 year ago

          And when the reviews are terrible and end users start reporting unreal quantities of bugs, they’ll fire the junior devs. They should have fixed those!

      • radix@lemmy.world
        link
        fedilink
        English
        arrow-up
        8
        arrow-down
        1
        ·
        1 year ago

        🎵🎵 99 little bugs in the code, 99 bugs in the code, Fix one bug, compile it again, 101 little bugs in the code. 101 little bugs in the code, 101 bugs in the code, Fix one bug, compile it again, 103 little bugs in the code. 🎵🎵

    • scarabic@lemmy.world
      link
      fedilink
      English
      arrow-up
      11
      ·
      1 year ago

      And how long did it take to compose the “assignments?” Humans can work with less precise instructions than machines, usually, and improvise or solve problems along the way or at least sense when a problem should be flagged for escalation and review.

    • BombOmOm@lemmy.world
      link
      fedilink
      English
      arrow-up
      24
      arrow-down
      2
      ·
      1 year ago

      The new role of a senior dev will be contract work slicing these Gordian knots.

      The amount of money wasted building and destroying these knots is immeasurable. Getting things right the first time takes experienced individuals who know the product well and can anticipate future pain points. Nothing is as expensive as cheap code.

  • kitonthenet@kbin.social
    link
    fedilink
    arrow-up
    17
    ·
    edit-2
    1 year ago

    At the designing stage, the CEO asked the CTO to “propose a concrete programming language” that would “satisfy the new user’s demand,” to which the CTO responded with Python. In turn, the CEO said, “Great!” and explained that the programming language’s “simplicity and readability make it a popular choice for beginners and experienced developers alike.”

    I find it extremely funny that project managers are the ones chatbots have learned to imitate perfectly; they were already doing the robot’s work: saying impressive-sounding things that are actually borderline gibberish.

    • thanks_shakey_snake@lemmy.ca
      link
      fedilink
      English
      arrow-up
      7
      ·
      1 year ago

      What does it even mean for a programming language to “satisfy the new user’s demand?” Like when has the user ever cared whether your app is built in Python or Ruby or Common Lisp?

      It’s like “what notebook do I need to buy to pass my exams,” or “what kind of car do I need to make sure I get to work on time?”

      Yet I’m 100% certain that real human executives have had equivalent conversations.

    • realharo@lemm.ee
      link
      fedilink
      English
      arrow-up
      2
      ·
      edit-2
      1 year ago

      And ironically Python (with Pygame which they also used) is a terrible choice for this kind of game - they ended up making a desktop game that the user would have to download. Not playable on the web, not usable for a mobile app.

      More interestingly, if decisions like these are going to be made even more based on memes and random blogposts, that creates some worrying incentives for even more spambots. Influence the training data, and you’re influencing the decision making. It kind of works like that for people too, but with AI, it’s supercharged to the next level.

  • Knusper@feddit.de
    link
    fedilink
    English
    arrow-up
    15
    ·
    1 year ago

    the CTO responded with Python. In turn, the CEO said, “Great!” and explained that the programming language’s “simplicity and readability make it a popular choice for beginners and experienced developers alike.”

    Yep, that does sound like my CEO.

  • blazera@kbin.social
    link
    fedilink
    arrow-up
    11
    arrow-down
    1
    ·
    1 year ago

    Researchers, for example, tasked ChatDev to “design a basic Gomoku game,” an abstract strategy board game also known as “Five in a Row.”

    What tech company is making Connect Four as their business model?

    • realharo@lemm.ee
      link
      fedilink
      English
      arrow-up
      3
      ·
      edit-2
      1 year ago

      This is also the kind of task you would expect it to be great at - tutorial-friendly project for which there are tons of examples and articles written online, that guide the reader from start to finish.

      The kind of thing you would get a YouTube tutorial for in 2016 with a title like “make [thing] in 10 minutes!”. (see https://www.google.com/search?q=flappy+bird+in+10+minutes)

      Other things like that include TODO lists (which is even used as a task for framework comparisons), tile-based platformer games, wordle clones, flappy bird clones, chess (including online play and basic bots), URL shorteners, Twitter clones, blogging CMSs, recipe books and other basic CRUD apps.

      I wasn’t able to find a list of tasks in the linked paper, but based on the gomoku one, I suspect a lot of it will be things like these.

  • gencha@feddit.de
    link
    fedilink
    English
    arrow-up
    11
    arrow-down
    2
    ·
    1 year ago

    What a load of bullshit. If you have a group of researchers provide “minimal human input” to a bunch of LLMs to produce a laughable program like tic-tac-toe, then please just STFU or at least don’t tell us it cost $1. This doesn’t even have the efficiency of a Google search. This AI hype needs to die quick

  • atzanteol@sh.itjust.works
    link
    fedilink
    English
    arrow-up
    8
    ·
    1 year ago

    This research seems to be more focused on whether the bots would interoperate in different roles to coordinate on a task than about creating the actual software. The idea is to reduce “hallucinations” by providing each bot a more specific task.

    The paper goes into more about this:

    Similar to hallucinations encountered when using LLMs for natural language querying, directly generating entire software systems using LLMs can result in severe code hallucinations, such as incomplete implementation, missing dependencies, and undiscovered bugs. These hallucinations may stem from the lack of specificity in the task and the absence of cross-examination in decision-making. To address these limitations, as Figure 1 shows, we establish a virtual chat-powered software technology company – CHATDEV, which comprises of recruited agents from diverse social identities, such as chief officers, professional programmers, test engineers, and art designers. When presented with a task, the diverse agents at CHATDEV collaborate to develop a required software, including an executable system, environmental guidelines, and user manuals. This paradigm revolves around leveraging large language models as the core thinking component, enabling the agents to simulate the entire software development process, circumventing the need for additional model training and mitigating undesirable code hallucinations to some extent.