Part of our series The IP in AI.

In this series, we have explored whether IP rights protect AI systems themselves and whether copyright or patents provide protection for AI-generated works or inventions. Equally controversial, however, is the way in which AI systems use others’ works. At their core, AI systems are computer systems working on large volumes of data. Those systems – and most obviously those data – are often the products of others’ intellectual and economic investment. This article explores the degree of protection likely afforded to IP rights holders against unsanctioned use of this material by an AI system.

Copyright infringement

What amounts to infringement?

In general, copyright prevents the unauthorised use of certain categories of subject matter (for example, literary or artistic works). Although the requirements for infringement differ between jurisdictions, in general if copyright subsists in a work (which we discussed in more detail in part 3 of our series), proving infringement of that copyright requires the copyright owner to show that:

  • a relevant act was done in relation to a work (the “junior work”), for example that it has been reproduced, published or transmitted electronically;
     
  • the junior work bears objective similarity to the work that is protected by copyright (the “senior work”), or a “substantial part” of it; and
     
  • that objective similarity arises because of copying of the senior work.

There are also a number of exceptions or defences to infringement that can apply, which differ from jurisdiction to jurisdiction. For example:

  • In Australia, use of a copyrighted work will not amount to infringement if it is a “fair dealing” for certain specified purposes, such as research and study,1 criticism and review,2 news reporting,3 or parody and satire.4 These exceptions are, however, relatively narrow.5 As well as being for one of the prescribed purposes, the use must be “fair”, which requires the assessment of a number of factors such as the purpose and character of the dealing, the possibility of obtaining the work on commercial terms, and the effect of the dealing on the market for the original work.
     
  • In the EU there are multiple exceptions and limitations (“E&Ls”) available but each applies separately in each EU Member State as there is no harmonisation of the exceptions and limitations across the EU. However, the so-called EU “Copyright Directive”6 introduced some mandatory E&Ls, such as text and data mining (separate E&Ls for research and other purposes), teaching and educational purposes, and preservation of cultural heritage.
     
  • In the UK there are exceptions to copyright infringement for “fair dealings” for certain prescribed purposes including non-commercial research and private study;7 criticism, review and news reporting;8 and parody, caricature and pastiche.9 Again, these exceptions require the use to be “fair”, which involves consideration of factors such as whether the use affects the market for the original work, and whether the amount of the work used is reasonable and appropriate. Exceptions also exist for certain purposes such as text and data mining for non-commercial research10 and assisting accessibility for the disabled.11 All also require sufficient acknowledgement.
     
  • In China there are exceptions to copyright infringement for purposes including personal study, research or appreciation; introducing or commenting on a certain work, or illustrating a point; news reporting; certain uses for non-commercial teaching or research; and provision of published works to people with dyslexia in an accessible manner that they can perceive.

It is noteworthy that none of these jurisdictions have an equivalent to the relatively broad and flexible “fair use” doctrine that applies in the US.

Copyright infringement by training

Infringement by training

Central to almost any AI system is a large mass of data on which the system is trained. Although referred to as “data”, the training materials are frequently themselves original works, in which copyright subsists. For example, these may be artworks (as in Stable Diffusion), or passages of code (as in GitHub Copilot). The process of training an AI or machine learning (ML) system on those inputs almost certainly involves the creation of a copy (in a copyright sense) – and most likely many copies – of those copyright works, even if those copies are only ever used “internally” within the system (eg in training the system) and never reproduced as outputs from it.

This kind of copying is among the key allegations in proceedings brought by Getty Images against Stability AI (Getty Images). Getty Images, a global media provider distributing royalty-free images, photos, music and video, has sued Stability AI in the UK and US for allegedly using over 12 million of its copyrighted images and associated captions and metadata to train Stability AI's text-to-image tool, Stable Diffusion, without consent or compensation. In the US, the case is in its discovery stages,12 whilst in the UK the High Court, on 1 December 2023, allowed the case to proceed to trial on the basis that the claims have real prospects of success.13 The UK litigation also involves allegations of infringement of database rights, trade mark infringement and passing off, as well as copyright infringement (see our recent update on this case and other generative AI litigation worldwide here).

Authors including Jodi Picoult and George RR Martin have also sued OpenAI in the US (Authors Guild, et al. v. OpenAI, Inc.), alleging infringement of fiction authors’ rights through the wholesale copying of their works, without permission or compensation, to train OpenAI’s large language models (LLMs). They also argue that the outputs of these LLMs are derivative works which mimic or paraphrase the authors’ works and harm the market for them. The Authors Guild alleges that this threatens the livelihood of authors, and has most recently joined Microsoft as a defendant. Unsurprisingly, many other groups of authors have brought separate suits against OpenAI based on similar concerns about ChatGPT (including Tremblay v. OpenAI, Inc.).

Practical challenges

Although, in a legal sense, this act of infringement may be straightforward conceptually, there are practical matters that make it difficult to establish infringement:

  • Proving that a particular copyright work was part of the training data - because the training set is rarely published (and often protected as a trade secret), this may be difficult. In Getty Images, this problem was to some extent circumvented because Getty located its watermark on some output images from Stable Diffusion. Similarly, in J. Doe 1, et al., v GitHub, Inc., et al No. 22-cv-06823 (GitHub), some outputs were shown to be almost identical to specific code stored on GitHub. In other cases, however, there may be no such link. In those cases, pre-action investigative processes, such as preliminary discovery or subpoenas, may be required.
     
  • Jurisdictional considerations – namely, that the act of infringement must occur in the jurisdiction in question. In Getty Images, for example, Stability AI applied for reverse summary judgment in the UK litigation on the basis that the relevant acts did not occur in the UK. The court did not accept that the position was clear enough for summary judgment, and the matter is proceeding to a full trial. In that case there is also a claim of infringement based on an alleged importation of an infringing "article" (ie the AI model), raising the question whether a service is an "article" in the sense anticipated by this element of the UK legislation. The specific location of the act of infringement may also raise difficulties if, for example, the jurisdiction in which the training takes place has specific defences that are not available in the copyright owner’s jurisdiction – such as the US’s fair use defence, or Singapore’s computational data analysis provisions.

Government responses

These issues, and the challenges they present for rights holders, are a high priority for governments worldwide. For example, the current draft of the EU AI Act, which is being negotiated between the EU Council, EU Parliament and EU Commission, contains provisions mandating transparency of training data so that copyright-protected materials used in training an AI can be identified (see our blog post here). In addition, the EU AI Act requires providers of general purpose AI models to make publicly available a sufficiently detailed summary of the content (including text and data protected by copyright) used for training the model.

In Australia, in December 2023, Commonwealth Attorney-General Mark Dreyfus announced the establishment of a copyright and AI reference group “to better prepare for future copyright challenges emerging from AI”, expressly referring to the need to address copyright issues concerning “the material used to train AI models” and “transparency of inputs and outputs”.

The UK House of Lords Communications and Digital Committee issued a report on LLMs and Generative AI in February 2024 (see our blog post here), which called on the UK Government to support copyright holders, saying the Government “cannot sit on its hands” while LLM developers exploit the works of rightsholders. The report expressly called for a way for rightsholders to check training data for copyright breaches, and the Committee Chair was quoted as saying:

One area of AI disruption that can and should be tackled promptly is the use of copyrighted material to train LLMs. LLMs rely on ingesting massive datasets to work properly but that does not mean they should be able to use any material they can find without permission or paying rightsholders for the privilege. This is an issue the Government can get a grip of quickly and it should do so.

In its February 2024 response to the consultation on its AI Regulation White Paper, the UK Government did not produce the definitive solution that the House of Lords had called for, but referenced the UK IPO's failed attempts over the last 18 months to find a solution between stakeholders. Instead, the response referred to further examination of ways to improve transparency of use of copyright material (see our blog post here). As a result, in the UK it may well be for the courts to determine the copyright position in the short term, although this may not be to the liking of those investing in AI development.

Copyright infringement by outputs

Aside from infringement during the training of an AI system, it may also be the case that an AI system can produce outputs that infringe copyright, in the sense that they bear sufficient objective similarity to an original work. Since this only requires a side-by-side comparison of a given output from the AI system and a given original work (rather than a forensic enquiry into whether the original work was in fact among the training data set), this kind of claim avoids some of the difficulties referred to above. However, here the difficulty is primarily in showing the requisite degree of objective similarity between a given output and a given input.

This was the primary challenge faced in GitHub and Andersen v. Stability AI Ltd. (Andersen),14 where many of the claims originally brought have been dismissed because the plaintiffs were unable to establish a specific original work that bore sufficient objective similarity to a specific output work. This difficulty is caused by a multitude of practical factors, including poor or inaccurate referencing and lack of transparency from developers, as well as the technical nature of AI systems.

This problem can be exacerbated by the “Snoopy problem” (also referred to as the “Italian plumber problem”). If the training data uses enough example images of a particular and well-known subject (such as Snoopy), or a particular style of work, it may be difficult to draw a sufficient causal link between a given output and a specific input image. This, too, is an issue in Getty Images, where one area that is being debated in relation to potential defences (the Defence has yet to be filed) is that the outputs are "inspired by" rather than directly copying the originals, since they mix elements from multiple sources. In that respect the replication of watermarks or parts of them (discussed above) may assist Getty. 

Academics and software developers have recently sought to develop methods to identify whether text is generated by an AI, but these methods appear to be currently limited to text-based output, and have limited reliability and accuracy. Any input from governments to mandate, as a matter of policy, a framework for watermarking or indicating the source of an AI output will also need to consider countervailing issues including economic policy, competition, and the promotion of innovation.

Another challenge faced with these kinds of cases is the identification of the infringer. If an AI system can be used to generate an output work bearing similarity to a given input, but only when that AI system is used by a user who is determined to infringe, who is (or should be) liable for that infringement?15 In many jurisdictions the answer may be both the user and the AI system owner – the former for the primary infringement and the latter for “authorisation”, “vicarious” or “secondary” infringement. However, assessment of such “secondary” liability often requires an examination of the degree to which the AI system owner can control or prevent the allegedly infringing conduct of the user.

Aside from copyright infringement, the owners of works used to train an AI system may have other causes of action in relation to a given output. For example, even if a given input work is available on open-source licence terms, those terms may require attribution information, or require that any derivative works are licensed on terms no less open than those applying to the inputs (so-called “copyleft” licences). Indeed, the removal of attribution (or copyright management information) is part of the complaint brought by the plaintiffs in GitHub.

Patent infringement

With the rapid development of AI systems, companies like Google, Samsung, and Microsoft led the market in terms of AI-related patent applications at the EPO in the period 2016 to 2020.16

While copyright infringement has so far dominated IP litigation concerning works generated by AI, patent disputes involving AI systems are emerging. Following Dr Thaler's series of applications (see our blog post here on the UK Supreme Court decision of December 2023 in that regard), it is now relatively well established in most jurisdictions worldwide that an AI system itself cannot be an “inventor” for the purposes of patent law (as also discussed in our previous article here). The focus has accordingly shifted to infringement of patents that seek to protect the AI system itself.

In July 2023, FriendliAI commenced proceedings in the United States District Court for the District of Delaware against Hugging Face (FriendliAI Inc. v. Hugging Face, Inc.), which offers an inference server for LLMs called Text Generation Inference (“TGI”). The founder and CEO of FriendliAI is Dr. Byung-gon Chun, the inventor of PeriFlow/Orca, which uses iteration-level or dynamic “batching” that allegedly makes the serving of generative AI transformer models more efficient and scalable by allowing the system to process multiple requests at once. Hugging Face clearly states on its website that it uses PeriFlow/Orca, which FriendliAI contends constitutes infringement of its patent entitled ‘Dynamic Batching for Inference System for Transformer-Based Generation Tasks’. This matter is in its early stages, and it will be one of the first patent infringement cases relating to an AI technology.

These patent cases, while dealing with AI subject matter, will grapple with relatively traditional patent law concepts, including construction of the patent claims, considerations of whether those claims have been exploited, as well as counterclaims attacking the patent’s validity (see our previous article on patent protection of AI systems). An example of the latter occurred in December 2023, just before the Supreme Court's decision in the DABUS/Thaler case on inventorship: the High Court of England and Wales rejected a challenge to the patentability of an AI system, relating to an artificial neural network (ANN), which was held not to be excluded from patentability (see our blog post here). The UK IPO has been granted leave to appeal the decision to the Court of Appeal. However, in the interim, in response to the High Court's decision, the UK IPO has temporarily suspended its guidance on the examination of AI inventions while it considers the impact of this decision and has issued a practice update specifically relating to the examination of ANNs.

Conclusions

A common thread amongst the cases discussed above is the set of normative considerations associated with IP protection and enforcement in relation to materials used and produced by AI systems. These include the adequate compensation of copyright owners, the lost opportunity to license their works, and market usurpation through derivative works.

Copyright holders asserting their rights, including Getty Images, have often reiterated that they do not seek to have a chilling effect on the development of AI technology, but instead are focusing on ethical sourcing of data, including compensation for copyright holders, consent (including by exploring opt-out models), and an opportunity to license. These issues have been behind the debates worldwide over regulation of AI and attempts to balance opportunity with equity.

At the same time, organisations hosting large volumes of data are realising the potential value of those data to new and upcoming AI systems and putting in place systems to protect them. Reddit, for example, has announced that it plans to charge companies for accessing its application programming interface (which is used by external entities to download conversations from the forum), even though its User Agreement confirms that users retain ownership of content they post to the platform.

Outside of the strict bounds of the law, developers of AI systems may also begin to see the fair and ethical sourcing of their input data as forming a part of their ESG public image and "social licence to operate". In late 2023, for example, Canva announced a commitment not to train its proprietary AI models on its creators’ content without express permission, and established a $200 million compensation program for creators who consent to having their content used to train those models.

The growing frequency of attempts to regulate these issues – and disputes arising from them – demonstrates the challenge IP law is currently contending with in striking the balance between encouraging investment in AI technologies and protecting investments that have already been made in the material being used to train them. The legal reforms and market restrictions that might lead to this balance are yet to be implemented, but the results of the various disputes around the world may help to illustrate the difficulties of the current position and provide added impetus towards an international solution to an international problem.


  1. Copyright Act 1968 (Cth) s 40.
  2. Copyright Act 1968 (Cth) s 41.
  3. Copyright Act 1968 (Cth) s 42.
  4. Copyright Act 1968 (Cth) s 41A.
  5. See our previous articles on this topic, including ‘Not all’s “fair dealing” in war and Greenpeace: Federal Court confirms limits of the “parody or satire” exception to copyright infringement’ and ‘Copyright owners “Don’t have to take it”: Federal Court of Australia awards substantial remedy for copyright infringement, plus double damages for flagrancy’.
  6. Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market.
  7. Copyright, Designs and Patents Act 1988 (CDPA) s 29.
  8. CDPA s 30.
  9. CDPA s 30A.
  10. CDPA s 29A. The UK Government considered extending this exception to commercial use but withdrew its proposals in 2023 (see our blog post here).
  11. CDPA ss 31A-31F.
  12. https://dockets.justia.com/docket/delaware/dedce/1:2023cv00135/81407.
  13. Getty Images (US) et al., v Stability AI Ltd [2023] EWHC 3090, [108].
  14. Case No. 3:23-cv-00201-WHO.
  15. https://crsreports.congress.gov/product/pdf/LSB/LSB10922; http://eprints.lse.ac.uk/117745/1/McDonagh_can_artificial_intelligence_infringe_copyright_accepted.pdf.
  16. Google and Samsung top the list of applicants for AI-related patents at the EPO - IAM (iam-media.com).


Key contacts

  • Aaron Hayward, Senior Associate, Sydney
  • Anna Vandervliet, Senior Associate, Sydney
  • Byron Turner, Solicitor, Sydney
  • Rachel Montagnon, Professional Support Consultant, London
  • Heather Newton, Of Counsel, London
  • Peng Lei, Partner, Kewei, Mainland China
  • Giulia Maienza, Senior Associate (Italy), London
