Will Large Language Models Really Change How Work Is Done?

Even as organizations adopt increasingly powerful LLMs, they will find it difficult to shed their reliance on humans.

Peter Cappelli, Prasanna Tambe, and Valery Yakubovich, March 5, 2024

Dan Page/theispot.com

Large language models (LLMs) are a paradigm-changing innovation in data science. They extend the capabilities of machine learning models to generating relevant text and images in response to a wide array of qualitative prompts. While these tools are expensive and difficult to build, multitudes of users can use them quickly and cheaply to perform some of the language-based tasks that only humans could do before.

This raises the possibility that many human jobs — particularly knowledge-intensive jobs that primarily involve working with text or code — could be replaced or significantly undercut by widespread adoption of this technology. But in reality, LLMs are much more complicated to use effectively in an organizational context than is typically acknowledged, and they have yet to demonstrate that they can satisfactorily perform all of the tasks that knowledge workers execute in any given job.

LLMs in Organizations

Most of the potential areas of use for LLMs center on manipulating existing information, much of it specific to an individual organization. This includes summarizing content and producing reports (which represents 35% of use cases, according to one survey) and extracting information from documents, such as PDFs containing financial information, and creating tables from them (33% of use cases).1 Other popular and effective uses of LLMs include creating images with tools like Dall-E 2 or generating synthetic data for applications when real data is difficult to obtain, such as data to train voice recognition tools like Amazon’s Alexa.2

Most organizations using LLMs are still in the exploration phase. Customer interactions, knowledge management, and software engineering are three areas of extensive organizational experiments with generative AI. For example, Audi recruited a vendor to build and deploy a customized LLM-based chatbot that would answer employees’ questions about available documentation, customer details, and risk evaluations. The chatbot retrieves relevant information from a variety of proprietary databases in real time and is supposed to avoid answering questions if the available data is insufficient. The company used prompt engineering tools developed by Amazon Web Services for retrieval-augmented generation (RAG), a common customization procedure that supplies organization-specific data to the model at query time without requiring changes to the underlying foundation model.3
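To make the RAG idea concrete, the sketch below shows the general pattern in Python: look up relevant internal text, place it in the prompt, and decline to answer when nothing relevant is found. It is an illustration under stated assumptions, not a description of Audi's system. The document store, the word-overlap scoring, and every function name are hypothetical, and a production system would typically use a vector index and an actual model call.

```python
import re

# Hypothetical in-memory stand-in for an organization's proprietary documents.
DOCUMENTS = {
    "warranty-policy": "Warranty claims must be filed within 24 months of delivery.",
    "risk-guideline": "Supplier risk evaluations are reviewed every quarter by the audit team.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, min_overlap=2):
    """Return (doc_id, text) pairs sharing at least min_overlap words with the question."""
    question_words = tokenize(question)
    scored = []
    for doc_id, text in documents.items():
        overlap = len(question_words & tokenize(text))
        if overlap >= min_overlap:
            scored.append((overlap, doc_id, text))
    # Highest overlap first; ties are broken alphabetically by document id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, text) for _, doc_id, text in scored]


def build_prompt(question, documents, min_overlap=2):
    """Build a grounded prompt, or return None when no relevant document is found."""
    hits = retrieve(question, documents, min_overlap)
    if not hits:
        return None  # Caller should refuse to answer rather than query the model.
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The foundation model itself is untouched in this pattern; only the prompt changes. Returning None when retrieval comes up empty is one simple way to approximate the "avoid answering if the data is insufficient" behavior described above.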

Unlike conventional automation tools that presume a fixed input, an explicit process, and a single correct outcome, LLM tools’ input and output can vary, and the process through which the response is produced is a black box. Managers can’t evaluate and control these tools the same way they do conventional machines. So there are practical questions that must be answered before using these tools in an organizational context: Who will determine the input? Who will evaluate the quality of the output, and who will get access to it?

Challenges With Integrating LLMs Into Organizations

In this section, we focus on five key challenges likely to arise when integrating LLMs into the organization and why they are likely to require continued involvement from human employees.

1. The Knowledge Capture Problem. Organizations produce huge volumes of proprietary, written information that they cannot easily process themselves: strategic plans, job descriptions, organizational charts and processes, product documentation, performance evaluations, and so on. An LLM trained on such data can produce insights that the organization likely did not have access to before. This might be a company’s most important advantage in using LLMs.

Organizations that make the most of LLMs will use them to generate outputs that pertain specifically to their needs and are informed by their data sources. For example, “What are consumer interests likely to be in China?” is a less-pertinent question for a business than “How should we adapt our products for consumers in China?” To usefully answer the latter, the LLM requires access to the organization’s proprietary data; the quality of the response depends on the quality and relevance of the data on which the LLM has been trained.

Feeding the right information to the LLM is no small task, given the considerable effort required to sort out the volumes of junk or irrelevant data organizations produce. Useful knowledge about organizational culture, survey results from employees, and so forth take time to assemble and organize. Even then, a lot of important knowledge might be known to individuals but not documented. In one study, only about 11% of data scientists reported that they have been able to fine-tune their LLMs with the data needed to produce good and appropriate answers specific to their organization.4 The process is expensive and requires powerful processors, thousands of high-quality training and verification examples, extensive engineering, and ongoing updates.

There is also the problem of data pollution within an LLM: If poor-quality data from anywhere in the organization is provided to the LLM, it affects not only current answers but future ones as well. A set of rules about the curation of data used to train an LLM should be put in place, and ultimately someone in the organization must police such activities.

Since customizing LLMs requires vast amounts of high-quality data, companies must organize and standardize explicit knowledge and codify it in standard operating procedures, job descriptions, employee manuals, user guides, computer algorithms, and other units of organizational knowledge for use in LLMs. Computer programming is an area where explicit knowledge can be particularly important. LLMs are already very helpful with answering programming questions, and there are numerous LLM-based tools, like GitHub’s Copilot and Hugging Face’s StarCoder, that assist human programmers in real time. One study suggests that programmers prefer using LLM-based tools for generating code because they provide a better starting point than the alternative of searching online for existing code to reuse. However, this approach does not improve the success rate of programming tasks. The main drawback so far has been that additional time is required to debug and understand the code the LLM has generated.5

The difficulty of the knowledge capture task for organizations is likely to drive the creation of new jobs. For instance, data librarians, who catalog and curate organization-specific data that can be used to train LLM applications, could become critical in some contexts.

2. The Output Verification Problem. LLM outputs for programming tasks can be tested for correctness and usefulness before they are rolled out and used in situations with real consequences. Most tasks are not like this, though. Strategic recommendations or marketing ideas, for example, are not outputs that can be tested or verified easily. For these kinds of tasks, the output simply has to be “good enough” rather than perfectly correct in order to be useful. When is an LLM answer good enough? For simple tasks, employees with the relevant knowledge can judge for themselves simply by reading the LLM’s answer.

So far, the evidence on whether users will take the task of checking output seriously is not encouraging. In an experiment, white-collar workers were given the option to use an LLM for a writing task. Those who chose to use the tool could then opt to either edit the text or turn it in unedited. Most participants chose the latter.6

And what happens if employees lack the knowledge required to judge an LLM’s more complicated, unusual, and consequential outputs? We may well ask questions for which we do not know what good enough answers look like. This calls for a higher degree of skilled human judgment in assessing and implementing LLM outputs.

Unlike an LLM, a human employee is accountable for their outputs, and a track record of accuracy or good judgment can allow their employer to gauge their future outputs. A human can also explain how they reached certain conclusions or made certain decisions. This is not the case with LLMs: Each prompt sends a question on a complex path through the model’s corpus of knowledge to produce a response that is unique and unexplainable. Further, LLMs can “forget” how to do tasks that they previously did well, making it hard to provide a service guarantee for these models.7

Ultimately, a human is needed to assess whether LLM output is good enough, and they must take that task seriously. One challenge when integrating LLM output with human oversight is that in many contexts, the human must know something about the domain to be able to assess whether the LLM output is valuable. This suggests that specific knowledge cannot be “outsourced” to an LLM — domain experts are still needed to evaluate whether LLM output is any good before it is put into use.

3. The Output Adjudication Problem. LLMs excel at summarizing large volumes of text. This might help bring valuable data to bear on decision-making and allow managers to check the state of knowledge on a particular topic, such as what employees have said about a particular benefit in past surveys. However, that does not mean that LLM responses are more reliable or less biased than human decisions: LLMs can be prompted to draw different conclusions based on the same data, and their responses can vary even when they’re given the same prompt at different times.8

This makes it easy for different parties within an organization to generate conflicting outputs. For instance, if individuals or groups with different interests were to generate LLM outputs that supported their own positions, leaders would then be faced with adjudicating these disagreements. Such a challenge existed before the emergence of LLMs, but given that the cost of creating new content is now so much lower than the cost of adjudication, managers are faced with a new and more complicated task than ever before.

Whether the task of adjudicating LLM outputs is added to existing jobs or will create new ones will depend on how easy it is to learn. The optimistic idea that lower-level employees will be empowered by access to LLMs to take on more of the tasks of higher-level employees requires particularly heroic assumptions. The long-standing view about job hierarchies is that incumbents need skills and judgment that are acquired through practice, and the disposition to handle certain jobs, not just textbook knowledge made available on the fly by LLMs. The challenge has long been to get managers to empower employees to use more of that knowledge as opposed to making decisions for them. That reluctance has been much more about a lack of trust than a lack of employee knowledge or ability.9 As discussed above, effective adjudication of LLM output might also require a great deal of domain expertise, which further limits the extent to which this task can be delegated to lower-level employees.

One way to address both the decision rights and robustness issues for high-stakes output is to centralize the use of LLMs. Creating a role to generate an organization’s key reports using an LLM both facilitates the development of expertise in using such tools and limits the number of documents generated using data specific to the organization.

Establishing a centralized role with a standard approach to producing reports can also help prevent the problem of dealing with conflicting outputs and the need to adjudicate content differences. An LLM office might well conduct its own robustness experiments to see how modest changes in data, guardrails, and the language of prompts change the output. This would make the role of adjudicator less a technical one than a compliance one, so it could easily be situated within the general counsel’s office rather than in an IT function.
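A robustness experiment of the kind described above could be as simple as running several rewordings of the same request multiple times and measuring how much the answers differ. The Python sketch below is one minimal way to do that, under stated assumptions: `generate` is a hypothetical placeholder for whatever function calls the organization's LLM, and plain text similarity from the standard library stands in for a more meaningful comparison of content.

```python
from difflib import SequenceMatcher
from itertools import combinations


def collect_outputs(generate, prompts, repeats=3):
    """Call generate (a hypothetical stand-in for an LLM call) repeats times per prompt."""
    return [generate(prompt) for prompt in prompts for _ in range(repeats)]


def mean_pairwise_similarity(outputs):
    """Average text similarity over all pairs of outputs; 1.0 means they are identical."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# Two rewordings of the same request, used as the "modest changes" under test.
PROMPT_VARIANTS = [
    "Summarize survey comments on the benefit.",
    "What do employees say about the benefit?",
]


def stable_stub(prompt):
    """Toy model that ignores wording and always gives the same answer."""
    return "Employees value the benefit."


def sensitive_stub(prompt):
    """Toy model whose conclusion flips with the wording of the prompt."""
    if "summarize" in prompt.lower():
        return "Employees value the benefit."
    return "Employees criticize the benefit."


stable_score = mean_pairwise_similarity(collect_outputs(stable_stub, PROMPT_VARIANTS))
sensitive_score = mean_pairwise_similarity(collect_outputs(sensitive_stub, PROMPT_VARIANTS))
```

A score well below 1.0 signals that the report depends heavily on how the question was phrased, which is exactly the case where a standard approach to prompting matters most. Character-level similarity is a crude measure; a real office would likely also have a person compare the conclusions.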

4. The Cost-Benefit Problem. The benefits of using LLM output within an organization can be unpredictable. For instance, LLMs are terrific at drafting simple correspondence, which often just needs to be good enough. But simple correspondence that occurs repeatedly, such as customer notifications about late payments, has already been automated with form letters. Interactive connections with customers and other individuals are already handled rather well with simple bots that direct them to solutions the organization wants them to have (though not necessarily what those customers actually want). Call centers are already replete with templates and prepared text tailored to the most common questions that customers ask.

A study of customer service representatives where some computer-based aids were already in place found that the addition of a combination of LLM and machine learning algorithms that had been trained on successful interactions with customers improved problem resolution by 14%.10 Whether that is a lot or a little for a job often described as uniquely suited to LLM output, and whether the result is worth the cost of implementation, are open questions. A preregistered experiment with 758 consultants from Boston Consulting Group showed that GPT-4 significantly increased consultants’ productivity on some tasks but significantly decreased it on others.11 These were jobs where the central tasks were well suited to being done by LLMs, and the productivity effects were real but well short of impressive.

Even where opportunities exist for LLMs to generate better and more finely tuned responses than existing templates and chatbots, the question is whether organizations will see a need to use them. They might choose to in contexts like sales calls, where the gain could be significant, but they might not in contexts like customer service, where organizations have not shown much interest in improving their performance using the resources they already have.

Further, the time and cost savings afforded by LLMs in various contexts might be undone by the other costs they impose. For instance, converting chatbots to LLMs is a considerable undertaking, even if it might eventually be useful; besides, putting customers or clients in direct conversation with LLM-powered chatbots can expose the organization to security and brand risks. High-stakes correspondence or messaging often must be vetted by lawyers or communications professionals (an expensive process) regardless of whether it is drafted by a human or an LLM.

5. The Job Transformation Problem. How will LLMs work with workers? Predicting the answer is far from straightforward. First, given that employees are typically engaged in multiple tasks and responsibilities that are dynamic in nature, LLMs that take over one task cannot replace the whole job and all of its separate subtasks. It is worth recalling the effect of the introduction of ATMs: Even though the machines were able to do many of the tasks that bank tellers performed, they did not significantly reduce the number of human workers because tellers had other tasks besides handling cash and were freed up to take on new responsibilities.12

The variability and unpredictability of the need for LLMs in any given workflow is a factor that essentially protects existing jobs. Most jobs now do not have a need to use LLMs all that often, and it can be difficult to predict when they will need them. The jobs that LLMs are most likely to replace are of course those where the tasks that take up most of people’s time can consistently be done correctly by the technology.

But even in those cases, there are serious caveats. The projections of enormous job losses from LLMs rely on the unstated assumption that tasks can simply be redistributed among workers.13 This might have worked with old-fashioned typing pools, where all of the employees performed identical tasks. If the pool’s productivity increased by 10%, it would be possible to reallocate the work and cut the number of typists by roughly 10%. But if workers are not organized into a pool, such a reduction is not possible without significant and costly workplace transformation. Also, clearly, we cannot cut 10% of an executive’s personal assistant if they become 10% more productive.
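The arithmetic behind the typing pool example can be checked with a few lines of Python. This is a simplified sketch that assumes a fixed workload, fully interchangeable workers, and whole-person headcount; the function names are illustrative only.

```python
def workers_needed(current_workers, gain_percent):
    """Smallest whole number of workers that covers the same workload after the gain.

    Before the gain each worker handles 100 units of work; afterward each handles
    100 + gain_percent units. Integer math avoids floating-point rounding surprises.
    """
    total_work = current_workers * 100
    per_worker_after = 100 + gain_percent
    return -(-total_work // per_worker_after)  # ceiling division


def positions_cut(current_workers, gain_percent):
    """Number of whole positions that can be removed without dropping any work."""
    return current_workers - workers_needed(current_workers, gain_percent)


pool_cut = positions_cut(100, 10)      # a pool of 100 interchangeable typists
assistant_cut = positions_cut(1, 10)   # a single personal assistant
```

Under these assumptions, a pool of 100 can shed 9 positions, slightly fewer than 10 because each remaining worker covers 1.1 times the old load, so 100 / 1.1 rounds up to 91 workers. A single assistant yields no cut at all, which is the indivisibility problem the paragraph describes.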

Cutting labor is easier with outsourced work than with employment. An organization can negotiate to reduce the cost or the number of hours of outsourced work it gets from a vendor if some of that work can be performed by LLMs. The biggest tech vendors, such as the giant IT outsourcing companies, are most likely to have large numbers of programmers performing reasonably interchangeable tasks (as in a typing pool) and have the biggest chance of making job cuts. The extent to which cost reductions created by AI lead to lower prices for clients versus higher profits for the contractors is an open issue.

Individual contractors are also at risk. It might be true that a contractor can get more done with an LLM than without one, but that is equally true for an employee. If there is less work to go around, the company will likely cut back the number of contractors before employees are let go, because it’s easier. As with outsourcing vendors, companies can also try to renegotiate lower prices with contractors that are using LLMs.14 Initial evidence indicates that the volume of standardized gig work that can be done by contractors has declined disproportionately with LLMs’ introduction.15

Another possibility is that LLMs could improve productivity enough across an entire organization that it has an effect not on specific occupations but on the overall need for labor. There is no evidence of this yet, but it would be a welcome effect for many business leaders, given how slow productivity growth has been in the U.S. and elsewhere and the difficulty so many employers report in expanding their workforces.

A surprising area where LLMs might make inroads is one that we had thought of as the most human: providing one-on-one feedback, as in coaching, counseling, and tutoring. There is some evidence that people prefer that at least their initial interaction in such situations be with an AI chatbot rather than a human because they find it less threatening.

Recommendations for Managers

The history of IT-related innovations suggests that their impacts vary considerably depending on the job, organization, and industry and take a long time to play out. The fact that LLM tools are constantly becoming easier to use, and that they are being incorporated into widely adopted software products like Microsoft Office, makes it likely that they will see faster uptake. Our conversations suggest that at this point, though, most organizations are simply experimenting with LLMs in small ways.

How should organizations get ready for LLMs?

First, acceptable use standards should be drawn up and circulated. It is probably impossible to prevent employees from experimenting with LLMs, but even at this early stage, it is important to establish ground rules for their use, such as prohibiting proprietary data from being uploaded to third-party LLMs, and disclosing whether and how LLMs were used in preparing any documents that are being shared. Acceptable use policies already limit how employees can use company equipment and tools.16 Another approach is to use a tool like Amazon Q, a generative AI-powered chatbot that can be customized to adhere to an organization’s acceptable use policies around who can access an LLM, what data can be used, and so forth.

Second, it is worth thinking about creating a central office to produce all important LLM output, at least initially, to help ensure that acceptable use standards are followed and manage problems like data pollution. Central offices can provide guidance in best practices for creating prompts and interpreting the variability of answers. They also offer the opportunity for economies of scale. Having one data librarian in charge of all the company data that could be used in analyses is far more efficient and easier to manage than having each possible user manage it themselves.

At least initially, setting up rules and practices calls for the convening of a task force that includes representatives from IT, the general counsel’s office, and likely users. The task force and, later, a central office could help address the data management challenges that have slowed the use of machine learning and simpler data analysis. As a first step, simply identifying where analysis is being held up by data that is not being shared, that cannot be shared (because it’s controlled by vendors, for instance), or that has not been codified would be a big step toward breaking down those silos and making more and better information possible.

Third, everyone who is likely to ask for LLM reports or will need to use them should have simple training to understand the tools’ quirks — especially their ability to hallucinate — and how to evaluate AI-generated documents and reports. The next step would be to train employees in prompt design and refinement. It is also important to articulate and communicate a standard for what constitutes clearing the “good enough” bar for your organization before using LLM output. A central office can facilitate training that best fits the organization.

Should employers change their hiring criteria for future jobs or start making plans for where they can cut? The many claims in the popular media about how AI will eliminate enormous numbers of jobs will create pressure from investors and stakeholders to deliver those cuts. It might help to remind them how inaccurate other forecasts have been; for example, predictions that truck drivers would be largely replaced by robotic drivers by now have not come to pass.

In the longer term, once we figure out the different ways in which LLMs might be put to work, we can see whether tasks can be reorganized to create efficiencies. It would be imprudent to begin to rewrite contracts with vendors or start cutting jobs today.

The history of technology has shown that in the long run, new technologies create more jobs than they eliminate. Forecasts about massive job losses from IT innovations, and particularly AI, have not materialized. Developments that change the allocation of work across jobs typically move slowly. We expect LLMs’ use to be widespread but job losses to be relatively small, even where LLMs are used extensively. The idea that these tools can replace jobs wholesale must confront the reality that the simplest tasks LLMs perform have already been automated to some extent, that the most important tasks in a given job that LLMs can do will likely generate new tasks, and that rearranging work among existing employees to find excess positions that can be cut is unlikely to be easy or cost-effective. Technological determinism — the notion that changes in technology are the main factor shaping society — is a popular theory with people who create technology, but it has little credibility among those who study it.

About the Authors

Peter Cappelli is the George W. Taylor Professor of Management; Prasanna (Sonny) Tambe is associate professor of operations, information, and decisions; and Valery Yakubovich is executive director of the Mack Institute for Innovation Management, all at the Wharton School of the University of Pennsylvania.

References

1. “Beyond the Buzz: A Look at Large Language Models in Production,” PDF (San Francisco: Predibase, 2023), https://go.predibase.com.

2. A. Rosenbaum, S. Soltan, and W. Hamza, “Using Large Language Models (LLMs) to Synthesize Training Data,” Amazon Science, Jan. 20, 2023, www.amazon.science.

3. “Storm Reply Launches RAG-Based AI Chatbot for Audi, Revolutionising Internal Documentation,” Business Wire, Dec. 21, 2023, www.businesswire.com.

4. “Beyond the Buzz.”

5. P. Vaithilingam, T. Zhang, and E.L. Glassman, “Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models,” in “CHI EA ’22: Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems,” ed. S. Barbosa, C. Lampe, C. Appert, et al. (New York: Association for Computing Machinery, April 2022), 1-7.

6. S. Noy and W. Zhang, “Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence,” Science 381, no. 6654 (July 13, 2023): 187-192.

7. L. Chen, M. Zaharia, and J. Zou, “How Is ChatGPT’s Behavior Changing Over Time?” arXiv, revised Oct. 21, 2023, https://arxiv.org.

8. S. Ouyang, J.M. Zhang, M. Harman, et al., “LLM Is Like a Box of Chocolates: The Non-Determinism of ChatGPT in Code Generation,” arXiv, submitted Aug. 5, 2023, https://arxiv.org.

9. P. Cappelli, “Stop Overengineering People Management,” Harvard Business Review 98, no. 5 (September-October 2020): 56-63.

10. E. Brynjolfsson, D. Li, and L.R. Raymond, “Generative AI at Work,” working paper 31161, National Bureau of Economic Research, Cambridge, Massachusetts, April 2023. We cannot tell the extent to which the improvement was due to the LLM per se because it was bundled together with an algorithm, which is a different tool.

11. F. Dell’Acqua, E. McFowland III, E. Mollick, et al., “Navigating the Jagged Technological Frontier: Field Experimental Evidence on the Effects of AI on Knowledge Worker Productivity and Quality,” working paper 24-013, Harvard Business School, Boston, September 2023.

12. C.B. Leon, “Occupational Winners and Losers: Who They Were During 1972-80,” Monthly Labor Review 105, no. 6 (June 1982): 18-28.

13. M. Cerullo, “Here’s How Many U.S. Workers ChatGPT Says It Could Replace,” CBS News, April 5, 2023, www.cbsnews.com; and L. Nedelkoska and G. Quintini, “Automation, Skills Use, and Training,” working paper 202, Organization for Economic Cooperation and Development, Paris, March 2018.

14. X. Hui, O. Reshef, and L. Zhou, “The Short-Term Effects of Generative Artificial Intelligence on Employment: Evidence From an Online Labor Market,” SSRN, Aug. 1, 2023, https://papers.ssrn.com.

15. J. Liu, X. Xu, Y. Li, et al., “‘Generate’ the Future of Work Through AI: Empirical Evidence From Online Labor Markets,” SSRN, Aug. 3, 2023, https://papers.ssrn.com; and O. Demirci, J. Hannane, and X. Zhu, “Who Is AI Replacing? The Impact of ChatGPT on Online Freelancing Platforms,” SSRN, Oct. 15, 2023, https://papers.ssrn.com.

16. For an example of an acceptable use policy for LLMs, see ACA Global’s template: “Sample Policy: Acceptable Use Policy for Employee Use of Large Language Models on Company Devices,” ACA Aponix, May 2023, https://web.acaglobal.com.
