Over the last few years, there’s been a steady stream of new developments in generative AI that have had significant ramifications for engineering teams.
For one, generative AI tools like GitHub Copilot and ChatGPT have hit the mainstream and are now regularly used by developers. Additionally, recent U.S. Copyright Office (USCO) decisions have provided some clarity on whether generative AI output is actually copyrightable — while at the same time creating practical challenges for software developers.
All the while, there’s been debate about whether engineering teams that use generative AI are at risk of violating open source software licensing requirements. (Generative AI tools train, in part, on open source libraries, which leads to questions about whether generative AI output is subject to the same licenses as the training material.)
For answers to these questions — and to help engineering organizations make decisions about generative AI tools in the proper copyright law context — we reached out to a leading expert on the topic, Kate Downing. Kate is an IP attorney who counsels top enterprises (including Zoom, Illumio, and Squarespace) on matters related to open source. (You can visit Kate’s website for more information about her legal services and to read her blog.)
We’ll cover several topics core to the copyright law conversation about using generative AI in software development, including:
- The significance of the USCO’s recent decisions, including copyright implications
- Whether generative AI output puts users at risk of license non-compliance
- Policies and tools to guard against IP risk when using generative AI
U.S. Copyright Office Issues Guidance on Generative AI Output
On Feb. 21, 2023, the USCO responded to author Kristina Kashtanova’s appeal to copyright the comic book “Zarya of the Dawn.” “Zarya of the Dawn” is no ordinary comic book: Kashtanova wrote its text and arranged its content, but used Midjourney — a generative AI program — to create the images. The USCO found that while the text and the “selection, coordination, and arrangement of the Work’s written and visual elements” are copyrightable, the images themselves are not.
As Kate Downing explained in her blog:
“The crux of the USCO’s refusal to recognize any copyright interest in the images rests on the idea that Midjourney’s output is unpredictable and that the prompts users provide to it are mere suggestions, with too much “distance between what a user may direct Midjourney to create and the visual material Midjourney actually produces” such that “users lack sufficient control over generated images to be treated as the ‘mastermind’ behind them.” Repeatedly, the USCO seems to argue that the final result has to reflect the artist’s “own original conception,” even going so far as to argue that the “process is not controlled by the user because it is not possible to predict what Midjourney will create ahead of time.”
The USCO’s decision has major implications — and creates potentially significant challenges — for engineering teams. It would require developers to distinguish between code they wrote with and without generative AI, which is often impractical.
“I think if the USCO strictly applies this reasoning to the software realm, this creates a really big challenge for anyone using generative AI because I don't know how they're going to submit their software to the copyright office and claim the part they wrote without the tool,” Downing says. “It’s harder to distinguish than when you’re submitting a book where, say, you wrote the text and a tool created the images.”
It’s unlikely, though, that the USCO’s decision will be the final word on whether generative AI output is copyrightable. (It is worth noting, however, that in September, the USCO again rejected copyright protections for AI-produced art, citing the fact that the work wasn’t the result of human authorship.)
Downing thinks it’s likely the USCO will soon receive a software copyright registration application that mentions Copilot. That would trigger a back-and-forth over whether the new guidelines apply to software, and would test how strictly the USCO demands evidence of modification to AI-generated content.
Generative AI and Open Source License Compliance
Organizations that manufacture generative AI tools like GitHub Copilot and ChatGPT train those tools on open source code, some of which is under strong copyleft licenses like GPL or AGPL. This has sparked some debate about whether generative AI output should be considered a derivative work of the code upon which it’s trained.
If generative AI output is considered a derivative work of the training materials, engineering teams that use it would be required to comply with the license(s) of the code upon which the tool is trained. This, of course, could come with requirements to disclose source code, generate attribution notices, and more.
But this is only the case if generative AI output contains copyrightable expression. And, although the book isn’t closed on this question, there are considerations that suggest it doesn’t. Sufficiently short phrases — which generative AI often produces in the software development context — are not copyrightable, and are therefore not subject to copyright protection or any open source license.
Downing explains: “Tools like GitHub Copilot suggest to you the code that they have most commonly seen — that's the algorithm. It’s similar to Google auto-complete, which suggests completions for your queries that are most common. And so, almost by definition, what it suggests to you is mostly not copyrightable.
“For context, in certain languages, there are specific class names and there are specific function names. There are a lot of pieces that get reused throughout code, almost like Lego blocks. So if the suggestion is fairly small, it probably doesn’t have any copyrightable expression in it in the first place; the suggestion is likely to be purely functional (i.e., this is the only way to do x in this language).”
There’s a related but separate discussion on whether the ML model itself — as opposed to its output — is a derivative work of the training material, and, as such, should be open sourced in accordance with copyleft licensing requirements. An ongoing class action lawsuit against GitHub alleges as much in its complaint, so we could get more clarity on this issue when the case is decided. In the interim, we recommend reading Kate’s blog for a full breakdown of the case and its potential impact.
Policies to Control Generative AI-Related IP Risk
There are several steps organizations can take to reduce IP risk when using generative AI in software development.
The GitHub Copilot class action lawsuit (and potential future litigation) could, in theory, require teams to re-work their code without the generative AI elements. With that in mind, Downing recommends that teams implement a tagging system to more easily distinguish human-written from generative AI-created output.
“I have started telling my more conservative clients that I recommend they tag generative AI files or identify parts of the code base where Copilot has/will be used and parts where it won’t,” Downing says. “This is the case for several reasons, including that we don't really know what will happen with the class action lawsuit against Copilot. If, for whatever reason, you decide you need to rip out all the code written by a generative AI tool, you’ll need to know which files you touched with the tool, even if you don't know the specific elements. And this way, you are likely to have certain files remain Copilot-free, and therefore simple to submit for copyright registration under the USCO’s new generative AI guidance.”
“The fear is that, if court cases are decided in a certain way, you may decide that you need to remove certain code to limit your liability. But without tagging, you’d have no way of doing it because you can't differentiate between what you wrote and what the AI wrote.”
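One lightweight way to implement the tagging system Downing describes is a marker-comment convention plus a small script that inventories which files carry the marker. The sketch below is purely illustrative: the `AI_MARKER` string, the file extensions, and the `classify_files` helper are assumptions for this example, not an established standard or anything Downing prescribes.

```python
from pathlib import Path

# Hypothetical convention (an assumption for this sketch, not a standard):
# any file touched by a generative AI tool carries this comment near the top.
AI_MARKER = "AI-ASSISTED: GitHub Copilot"

def classify_files(repo_root, extensions=(".py", ".js", ".go")):
    """Split source files into AI-assisted and (presumed) human-only lists."""
    ai_files, human_files = [], []
    for path in Path(repo_root).rglob("*"):
        if not path.is_file() or path.suffix not in extensions:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # skip unreadable files rather than fail the audit
        (ai_files if AI_MARKER in text else human_files).append(str(path))
    return ai_files, human_files
```

If litigation or future USCO guidance ever makes it necessary, the `ai_files` list identifies the code to review or re-work, while the files in `human_files` remain candidates for straightforward copyright registration.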
From a license compliance standpoint, it’s always wise to use a scanning tool (like FOSSA), which detects open source components in your code.
Additionally, you can add another layer of protection by enabling GitHub Copilot’s optional duplication detection filter. If you do, Copilot’s suggestions won’t include exact or near matches of public code on GitHub. Turning the filter on is a relatively straightforward process that takes just a few steps — visit GitHub’s docs for step-by-step instructions. Downing adds that to the extent your company gets an IP indemnity from GitHub, GitHub will only honor it if you have all the filters enabled.
Learn More: Generative AI and Managing IP Risk
If you’d like more information or legal guidance about using generative AI in your engineering organization, you can contact Kate by visiting her website.
Or, if you’d like to learn more about using FOSSA to automate license compliance management, please fill out the form on this page, and our team will be in touch.