GoCoMA: New Framework for Attributing LLM-Generated Code
Researchers introduce GoCoMA, a multimodal framework to identify the source of LLM-generated code. This addresses security and licensing concerns by analyzing code stylometry and binary image representations.

Researchers have developed GoCoMA, a novel multimodal framework designed to attribute code generated by Large Language Models (LLMs). The tool leverages both code stylometry, which captures structural and stylistic signatures, and image representations of binary pre-processed code to determine a program's origin. This advancement is crucial as LLMs become increasingly proficient at producing code indistinguishable from human-written code, raising concerns about security vulnerabilities and licensing ambiguities.
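To make the two modalities concrete, here is a minimal, hypothetical sketch of what stylometric feature extraction and a byte-level "image" representation might look like. The function names and the specific features are illustrative assumptions, not GoCoMA's actual pipeline, which is not detailed here.

```python
import math
import re

def stylometric_features(code: str) -> dict:
    """Illustrative stylometric signals: line lengths, blank lines,
    identifier lengths, and comment density (assumed features, not
    GoCoMA's actual feature set)."""
    lines = code.splitlines()
    n = max(len(lines), 1)
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    return {
        "avg_line_length": sum(len(l) for l in lines) / n,
        "blank_line_ratio": sum(1 for l in lines if not l.strip()) / n,
        "avg_identifier_length": (
            sum(map(len, identifiers)) / len(identifiers) if identifiers else 0.0
        ),
        "comment_ratio": sum(1 for l in lines if l.lstrip().startswith("#")) / n,
    }

def code_to_image(code: str, width: int = 16) -> list[list[int]]:
    """Map the UTF-8 bytes of pre-processed code onto a fixed-width
    grayscale grid, zero-padding the final row (a common
    binary-visualisation technique)."""
    data = code.encode("utf-8")
    rows = math.ceil(len(data) / width) or 1
    padded = data + bytes(rows * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(rows)]

sample = "def add(a, b):\n    # sum two numbers\n    return a + b\n"
feats = stylometric_features(sample)
img = code_to_image(sample)
```

In a real system, the image grid would typically be resized to a fixed shape and fed to a vision backbone, while the stylometric vector would go to a tabular or sequence model.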
The ability to trace the source of LLM-generated code is significant for several reasons: it helps identify potential security threats, ensure compliance with licensing agreements, and understand the capabilities and limitations of different LLMs. GoCoMA's multimodal approach sets it apart from traditional single-modality methods by integrating multiple data modalities to improve accuracy and reliability. This could be particularly useful in legal and forensic contexts where the origin of code is under scrutiny.
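One simple way the modalities could be integrated, sketched here purely as an assumption since the article does not describe GoCoMA's architecture, is late fusion: concatenating the stylometric feature vector with a flattened, normalised byte-image into a single input for a downstream classifier.

```python
def fuse(features: dict, image: list[list[int]]) -> list[float]:
    """Toy late-fusion step (hypothetical): sort features by name for a
    stable ordering, scale image bytes to [0, 1], and concatenate."""
    style_vec = [float(v) for _, v in sorted(features.items())]
    image_vec = [px / 255.0 for row in image for px in row]
    return style_vec + image_vec

fused = fuse(
    {"avg_line_length": 17.0, "comment_ratio": 1 / 3},
    [[100, 200], [0, 255]],
)
```

Real multimodal attribution systems more often fuse learned embeddings from each branch rather than raw features, but the principle of combining complementary signals is the same.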
Moving forward, the adoption of GoCoMA could lead to more robust frameworks for code attribution, potentially influencing policy and industry standards. However, the effectiveness of such tools will depend on continuous updates to keep pace with the rapid evolution of LLMs. The research also raises questions about the ethical implications of tracking and attributing AI-generated content, particularly in contexts where anonymity or privacy might be desired.