
Platonic Leaks, Tectonic Threats: Vec2Vec, A New Challenge To Data Pseudonymization

written by Guest Author


A recent paper from Cornell University's Department of Computer Science introduces Vec2Vec, a novel unsupervised method that translates embeddings between different models with high fidelity despite variations in their architecture, size, or training data. This article discusses how the development challenges present notions of consent and data pseudonymization in India, and raises questions around embedding governance, consent management, and AI model accountability. In doing so, the article analyses India's data governance regime and argues that the country's ambiguous position on data pseudonymization creates an unwanted illusion of privacy for all stakeholders. On this basis, the article recommends several ways in which the Central Government and its agencies can develop policies that realise the legal intention of empowering a Data Principal, not just in theory but in practice.

Introduction to Vec2Vec

With attention on the AI industry at an all-time high, user privacy and consent hang in the balance. A recent breakthrough in this domain is the Vec2Vec method. In May 2025, researchers at Cornell University's Department of Computer Science, in a paper titled "Harnessing the Universal Geometry of Embeddings" ("Paper"), introduced Vec2Vec, a method that requires no aligned data or knowledge of the source or target models to translate embeddings from one model to another. A user can take any embedding produced by Model A and obtain a semantically similar representation in Model B's embedding space, without any paired data.

Put simply: when an AI feature offers a smart reply after a message is decrypted on a Data Principal's device, the decrypted message (even if it was encrypted in transit) is converted into embeddings. By employing Vec2Vec, an actor can infer the meaning of these embeddings without ever accessing the message itself, effectively bypassing the control, friction, and limitations that traditionally safeguarded against uncontrolled re-identification. This means that while a Data Principal may have consented to their personal data being processed within Model A, that consent does not automatically extend to its derivative representations in Model B, into which the message can be translated.

At the core of this method lies the Platonic Representation Hypothesis, inspired by Plato's Theory of Forms, which posits that all language models, regardless of architecture or training corpus, organise semantic meaning within a shared, latent geometric space. The Paper proposes that a stronger, constructive version of this hypothesis can be used to enable unsupervised cross-model translation.
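To give a concrete feel for what "translating" between two embedding spaces means, here is a toy sketch. The setup is entirely hypothetical: two models view the same latent semantic space through different rotations, and a linear map is recovered with orthogonal Procrustes alignment. Note the key difference from the Paper: Procrustes is supervised and needs paired embeddings, whereas Vec2Vec's contribution is learning an equivalent map with no pairs at all (via adversarial training and cycle-consistency). This sketch only illustrates the geometry that makes such a translation possible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two embedding models: a shared "semantic" space
# (the Platonic Representation Hypothesis), viewed by each model
# through its own random rotation. This is a simplified assumption.
semantic = rng.normal(size=(100, 8))            # latent meaning vectors
rot_a, _ = np.linalg.qr(rng.normal(size=(8, 8)))
rot_b, _ = np.linalg.qr(rng.normal(size=(8, 8)))
emb_a = semantic @ rot_a                        # Model A's embeddings
emb_b = semantic @ rot_b                        # Model B's embeddings

# Orthogonal Procrustes: rotation W minimising ||emb_a @ W - emb_b||.
# NOTE: this step uses paired data; Vec2Vec learns the map WITHOUT pairs.
u, _, vt = np.linalg.svd(emb_a.T @ emb_b)
w = u @ vt

translated = emb_a @ w
# Cosine similarity between translated vectors and Model B's originals.
cos = np.sum(translated * emb_b, axis=1) / (
    np.linalg.norm(translated, axis=1) * np.linalg.norm(emb_b, axis=1)
)
print(f"mean cosine similarity after translation: {cos.mean():.3f}")
```

In this idealised toy the recovered rotation is exact, so the mean cosine similarity is close to 1.0; the Paper reports high (though not perfect) similarity for real, architecturally different models.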

How Vec2Vec Bypasses Present Consent Mandates

In the landmark K.S. Puttaswamy judgment, the Supreme Court in 2017 upheld an individual's right to privacy and elevated it to the status of a fundamental right.[i] Justice Sanjay Kishan Kaul, recognising every citizen's right to be forgotten, held in his concurring opinion:

“the right of an individual to exert power over his personal data and to be able to control his/her own life would also encompass his right to control his existence on the Internet”.

To ensure that businesses manage and process user data in a non-exploitative and transparent manner, Section 43A was inserted into the Information Technology Act, 2000. The Section empowered the Central Government to frame the Information Technology (Reasonable Security Practices and Procedures and Sensitive Personal Data or Information) Rules, 2011, mandating entities to establish reasonable security practices for collecting, storing, and processing user personal data. Furthermore, the Judiciary has time and again upheld the constitutional validity of Section 69A(2) of the Information Technology Act, 2000, read with Rule 3 of the Information Technology (Procedure and Safeguards for Blocking for Access of Information by Public) Rules, 2009, which stipulates that the Central Government, via a "Designated Officer", has the power to issue blocking directions, including for decrypting a person's personal data, in the interest of matters such as national security or sovereignty. This means that while the State may intercept or decrypt a citizen's personal data, these powers do not automatically extend to private entities.

This position is reinforced by the Digital Personal Data Protection Act, 2023 (DPDP Act), which grants Data Principals the right to correction, completion, and erasure of their consent-based personal data, and obligates Data Fiduciaries to ensure that data processing is purpose-specific, subject to timely erasure, and protected by adequate security safeguards against unauthorised use.

These concepts are expanded further by the latest Business Requirement Document for Consent Management guidelines (BRD), the Technical Guidelines on Bills of Materials (BOM), and the draft Digital Personal Data Protection Rules, 2025 (Draft Rules), all released this year.

However, the Vec2Vec method counteracts this long-held position against exploitation of individual personal data by businesses, and exposes multiple blind spots in how consent, control, and compliance are presently conceptualised. A key reason for this is India's non-committal stance on data pseudonymity. While definitions such as "personal data", "personal data breach", and "processing" within the DPDP Act are broad enough to cover embedding data within their ambit, the absence of any guideline or notification on embeddings within India's data governance infrastructure signals that legislators have failed to distinguish between raw data and embedded data. It is this lack of distinction that leads to confusion amongst stakeholders.

Even after several domestic consultation papers and reports, the DPDP Act, the BRD, the BOM, and the Draft Rules conceptualise data processing only as a linear, step-by-step process of data collection, storage, anonymisation/pseudonymization, and conversion into an end product, and fail to recognise the need for dedicated definitions and governance frameworks for pseudonymization and re-identification.

This creates an unwanted illusion of privacy. When a Data Principal consents to data processing, they assume that the DPDP Act empowers them to control, access, and erase their personal data, and that such data is secure once pseudonymised by Data Fiduciaries. However, Vec2Vec shows that data, even while encrypted in transit, is susceptible to re-identification via semantic inference the moment it is received and embedded. Since India's data governance ecosystem does not safeguard Data Principals against such re-identification, a Data Principal's data can be used for illicit purposes without any accountability attaching to Data Fiduciaries or Consent Managers. It also means that data within Model A, believed to be pseudonymized, can be translated into the vector space of a more open model, where reverse-engineering tools such as nearest-neighbour search or attribute inference already exist.
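The nearest-neighbour re-identification risk above can be illustrated with a toy sketch. Everything here is invented for illustration (the names, the reference corpus, the noise level); it is not the Paper's attack, only the generic matching step an attacker could run once an embedding has been translated into a space they understand.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical attacker setup: a reference corpus whose Model B
# embeddings and associated identities are already known to the attacker.
names = ["alice", "bob", "carol", "dave"]
reference = rng.normal(size=(4, 32))
reference /= np.linalg.norm(reference, axis=1, keepdims=True)

# A "pseudonymised" embedding leaked after cross-model translation:
# Carol's vector plus a little translation noise (invented values).
leaked = reference[2] + 0.02 * rng.normal(size=32)

# Nearest-neighbour search by cosine similarity matches the leaked
# embedding back to a named record -- re-identification without ever
# reading the underlying message.
scores = reference @ (leaked / np.linalg.norm(leaked))
print("re-identified as:", names[int(np.argmax(scores))])
```

Because the leaked vector stays close to Carol's reference embedding, the cosine match recovers her identity; the pseudonym protecting the stored record never enters the picture.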

The Way Forward

  1. Distinction Between Raw Data and Processed Data: Without a clear definition of pseudonymized data, establishing a framework to address this issue becomes challenging. Presently, Section 2(t) of the DPDP Act, which defines "personal data", does not distinguish between data collected from primary sources (i.e., Raw Data) and data that is processed for a purpose (i.e., Processed Data). Only once data is recognised as "Processed Data" or "Embedded Data" can the Central Government issue guidelines on procedures to prevent its re-identification.
  2. Dedicated Framework for "Processed Data": To address the menace of pseudonymised data being re-identified, the EU conceptualised the framework of the Pseudonymization Domain ("Framework"). The Framework clarifies that pseudonymised data will still be considered personal data if it can be re-identified using additional information, meaning that Data Controllers/Processors are obligated to extend data security practices even to pseudonymised data. Such a framework can be useful not only in safeguarding individual privacy rights, but also in attributing accountability for pseudonymization failures caused by methods such as Vec2Vec, where it is difficult to correctly identify the source of a leak.
  3. Reliance on Third-Party Certifications: Beyond compliance with applicable law, several Data Fiduciaries adopt third-party certifications to build trust with their Data Principals. These certifications revolve around informing, implementing, and training Data Fiduciaries on multiple information security measures, including preventing Personal Data from being re-identified. Some global and regional certifications adopted by Data Fiduciaries in India include the ISO/IEC 27001 Information Security Management certification, the ISO/IEC 27701 Privacy Information Management Systems certification, SOC 2 Type II attestation, and compliance with MeitY's CERT-In guidelines.

With the Central Government on the verge of finalising a fresh draft of the DPDP Rules, it is imperative that it recognises the challenges that new methods such as Vec2Vec bring. If employed maliciously, the method not only has widespread implications for user privacy but also the potential to disturb national sovereignty and security at scale. Recognising Processed Data as distinct from Raw Data and establishing a dedicated data pseudonymization framework are two ways to begin tackling this threat.


[i] K.S. Puttaswamy (Retd.) & Anr. v. Union of India & Ors., (2017) 10 SCC 1.


The views expressed are personal and do not represent the views of Virtuosity Legal or its editors.
