Virtual Data Lakes

Key Ideas

This page explains the key ideas that make virtual data lakes work.

A virtual data lake is implemented as a simple triple store that supports dynamic connection and disconnection of sources and enforces granular access control.

A virtual data lake can also be considered as a graph database. The items in the store are nodes of the graph, and the triples define edges. Paths through the graph are not stored explicitly using pointers, but can be retrieved by searching the triples. Because search operations are very efficient, this does not impose a serious overhead.
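The idea of recovering paths by searching triples, rather than following stored pointers, can be illustrated with a minimal sketch. The names here (GraphSketch, Edge, objectsOf, reachable) are hypothetical and are not part of the virtual data lake Java API. Items are plain strings for brevity, and the search is a simple linear scan, which is an assumption made to keep the example short. It uses Java records, so it needs Java 16 or later.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: not the virtual data lake API.
class GraphSketch {
    // A triple seen as a graph edge: subject --verb--> object.
    record Edge(String subject, String verb, String object) {}

    private final List<Edge> triples = new ArrayList<>();

    void add(String subject, String verb, String object) {
        triples.add(new Edge(subject, verb, object));
    }

    // Search step: find the objects of all triples with the given subject and verb.
    List<String> objectsOf(String subject, String verb) {
        List<String> result = new ArrayList<>();
        for (Edge e : triples) {
            if (e.subject().equals(subject) && e.verb().equals(verb)) {
                result.add(e.object());
            }
        }
        return result;
    }

    // Follow one verb repeatedly from a start node, using only searches.
    Set<String> reachable(String start, String verb) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            for (String next : objectsOf(node, verb)) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }
}
```

No edge holds a pointer to the next one. Each step of the path is found by a fresh search on subject and verb.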

Triple Stores

A triple store holds items and triples from the sources to which it is connected. It has methods to create, delete, and update items and triples in connected sources that it owns; to connect and disconnect sources; and to read items and triples, and search for triples, in all connected sources.

Triple stores are often encountered in Semantic Web applications, where they implement the Resource Description Framework (RDF) standard. Virtual data lake triple stores are not the same as RDF triple stores, and differ from them in three main ways. They do not natively use Uniform Resource Identifiers (URIs) to identify items, which makes them simpler and more generally applicable (URIs can still be associated with items, using triples). They connect to sources of data, rather than storing it all internally. And they have native access control.

Triple stores are implemented by a software program written in Java. It exposes a Java API that other Java programs can use.

Items

An item generally represents something, and can be the subject, verb, or object of a triple.

For example, one item might represent the person John Smith, and another might represent the concept of a person having a family name. A triple whose subject is the first item, whose verb is the second item, and whose object is the string "Smith" then conveys the information that John Smith's family name is "Smith".

Items do not convey meaning in themselves. There is nothing about the item representing John Smith to indicate that it represents anything or anyone in particular. The meaning can be conveyed in separate textual descriptions, such as "The item represents the person John Smith", and by the triples in which the item appears.

Each item, apart from a few special items, is stored in a source.

Each item is uniquely identified by a combination of two numbers, one identifying the source that contains the item, and the other uniquely identifying the item within that source. (For special items, the source number is zero; there is no real source with that number.)
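As a minimal sketch of this two-part identifier, assuming both numbers fit in a Java long (the actual number types are not stated here), an item identifier could be modelled as follows. The names ItemId, sourceNr, itemNr and isSpecial are hypothetical and are not taken from the virtual data lake Java API.

```java
// Hypothetical sketch: not the virtual data lake API.
// An item is identified by its source number plus its number within that source.
record ItemId(long sourceNr, long itemNr) {
    // Special items use source number zero; no real source has that number.
    boolean isSpecial() {
        return sourceNr == 0;
    }
}
```

Because a record compares by value, two identifiers with the same pair of numbers are equal, and the same item number in two different sources gives two different identifiers.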

Each item has two access levels: a read level, which controls who can read it, and a write level, which controls who can write it (and enables anyone who can write it to read it also). Access to a triple is controlled by the access levels of the items in the triple. The access levels of the items thus provide fine-grained access control for all the data in a triple store.

An item can be locked to prevent simultaneous writes by different sessions. This also prevents simultaneous writes to triples, because writing to a triple requires its subject to be locked. The series of write actions performed while an item is locked is a transaction. When the item is unlocked, the transaction is completed, and the write actions are carried out in the item's source.
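The locking behaviour described above can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the class ItemLockSketch and its methods are hypothetical, sessions and write actions are represented by strings, and a list stands in for the item's source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the virtual data lake API.
class ItemLockSketch {
    private String lockedBy;                                 // session holding the lock, or null
    private final List<String> pending = new ArrayList<>();  // writes in the open transaction
    private final List<String> source = new ArrayList<>();   // stands in for the item's source

    // Returns false if a session already holds the lock.
    synchronized boolean lock(String session) {
        if (lockedBy != null) {
            return false;
        }
        lockedBy = session;
        return true;
    }

    // A write is only accepted from the session that holds the lock.
    synchronized void write(String session, String action) {
        requireHolder(session);
        pending.add(action);
    }

    // Unlocking completes the transaction: pending writes reach the source.
    synchronized void unlock(String session) {
        requireHolder(session);
        source.addAll(pending);
        pending.clear();
        lockedBy = null;
    }

    synchronized List<String> sourceContents() {
        return List.copyOf(source);
    }

    private void requireHolder(String session) {
        if (!session.equals(lockedBy)) {
            throw new IllegalStateException("item is not locked by " + session);
        }
    }
}
```

In this sketch, a second session cannot lock the item until the first unlocks it, and nothing is written to the source until the unlock.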

Named Items

A named item is an item that has, in addition to its identifier, a name that is guaranteed to be unique within its source. The name is associated with the item by a special triple. Named items can be created by source administrators. They are a powerful way of giving meaning to the data in the source. They are most often used as verbs of triples (for example, a named item could be created with the name "Product Has Price"). They can also be triple subjects or objects (for example, "The home page" or "The color green").

Triples

A triple consists of a subject that is an item, a verb that is also an item, and an object that may be an item or a data value that is a boolean, an integer, a real number, a piece of text, or a sequence of bytes.
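The shape of a triple can be sketched in Java as below. This is an assumption-laden illustration, not the API of the implementation: the names TripleSketch, Item, Kind, Triple and kindOf are hypothetical, and the mapping of "integer" to Long and "real number" to Double is a choice made for the example.

```java
// Hypothetical sketch: not the virtual data lake API.
class TripleSketch {
    record Item(long sourceNr, long itemNr) {}

    // The kinds of object listed in the text.
    enum Kind { ITEM, BOOLEAN, INTEGER, REAL, TEXT, BINARY }

    // Subject and verb are always items; the object is an item or a data value.
    record Triple(Item subject, Item verb, Object object) {
        Triple {
            kindOf(object); // reject unsupported object types early
        }

        Kind kind() {
            return kindOf(object);
        }
    }

    static Kind kindOf(Object o) {
        if (o instanceof Item) return Kind.ITEM;
        if (o instanceof Boolean) return Kind.BOOLEAN;
        if (o instanceof Long) return Kind.INTEGER;
        if (o instanceof Double) return Kind.REAL;
        if (o instanceof String) return Kind.TEXT;
        if (o instanceof byte[]) return Kind.BINARY;
        throw new IllegalArgumentException("Unsupported object type");
    }
}
```

With this sketch, the John Smith example becomes a triple whose subject is the item for John Smith, whose verb is the item for "has family name", and whose object is the text "Smith".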

A triple is held in the same source as its subject, and has its own numeric identifier in that source.

Access to a triple is determined by the access levels of its subject, its verb, and its object if that is an item.

Objects that are pieces of text or sequences of bytes can be of arbitrary size. Non-item triples contain summaries of their objects, rather than the objects themselves. Searches in the triple store first find triples with objects whose summaries match the criteria. These objects are then retrieved from the sources to determine whether they match. Objects whose summaries do not meet the criteria are not retrieved.

For triples with objects that are booleans, integers, or real numbers, the value of the object can be determined from its summary. For triples with objects that are items, the object is stored in the triple. It is only text and binary objects that need to be retrieved from the sources.
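The two-stage search can be sketched as follows for an equality search on text objects. The format of a summary is not specified here, so the sketch assumes, purely for illustration, that a summary is the hash code of the text. All names (SummarySearchSketch, addTextTriple, findEqual, retrievals) are hypothetical, and a map stands in for the connected source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the summary format (a hash code) is an assumption.
class SummarySearchSketch {
    // Stands in for a connected source holding the full text objects.
    private final Map<Integer, String> source = new HashMap<>();
    // Held in memory by the store: triple id -> summary of its text object.
    private final Map<Integer, Integer> summaries = new HashMap<>();
    int retrievals = 0; // counts how often the source is consulted

    void addTextTriple(int tripleId, String text) {
        source.put(tripleId, text);
        summaries.put(tripleId, text.hashCode());
    }

    private String retrieve(int tripleId) {
        retrievals++;
        return source.get(tripleId);
    }

    // Find the ids of triples whose text object equals the wanted text.
    List<Integer> findEqual(String wanted) {
        List<Integer> matches = new ArrayList<>();
        int wantedSummary = wanted.hashCode();
        for (Map.Entry<Integer, Integer> e : summaries.entrySet()) {
            // Stage 1: compare summaries in memory; skip non-matching objects.
            if (e.getValue() != wantedSummary) {
                continue;
            }
            // Stage 2: retrieve the full object and confirm the match.
            if (retrieve(e.getKey()).equals(wanted)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```

A summary match is only a candidate: two different texts can share a summary, which is why the second stage retrieves the object and checks it. Objects whose summaries differ are never fetched.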

Sources

Rather than storing all the data, a virtual data lake connects to data sources, so that users and client programs get the current data without having to keep loading it.

When a source is connected, the virtual data lake loads the items and triples, with summaries of the text and binary objects. It then performs periodic updates to keep the data in sync. The items, triples and summaries are kept in memory, so that they can be searched rapidly. The text and binary objects, which generally account for most of the data volume, are retrieved only as needed.

Access Levels

An access level determines which users can access what data.

Each item has two access levels, which determine which users can read, update, or delete it. They are the item's read level and write level.

Each store session has an access level. It can read an item if its access level is the same as or superior to the item's read level or the item's write level. It can update or delete the item if its access level is the same as or superior to the item's write level.

Access levels also determine the circumstances in which users can read, update, or delete triples. A session can read a triple if it can read the triple's subject and verb, and also its object if that is an item. A session can update or delete a triple if it can write the triple's subject, can read the triple's verb, and can also read the triple's object if that is an item.
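The read and write rules for items and triples can be collected into a short sketch. Everything named here (AccessCheckSketch, Item, Triple, canRead, canWrite, canReadTriple, canWriteTriple) is hypothetical and not part of the actual Java API. Access levels are represented by strings, and the "same as or superior to" test is passed in as a predicate, since how it is evaluated is a separate matter.

```java
import java.util.function.BiPredicate;

// Hypothetical sketch: not the virtual data lake API.
class AccessCheckSketch {
    record Item(String name, String readLevel, String writeLevel) {}
    record Triple(Item subject, Item verb, Object object) {}

    // superiorOrSame.test(a, b) is true when level a is the same as or superior to level b.
    private final BiPredicate<String, String> superiorOrSame;

    AccessCheckSketch(BiPredicate<String, String> superiorOrSame) {
        this.superiorOrSame = superiorOrSame;
    }

    boolean canWrite(String sessionLevel, Item item) {
        return superiorOrSame.test(sessionLevel, item.writeLevel());
    }

    // Anyone who can write an item can also read it.
    boolean canRead(String sessionLevel, Item item) {
        return superiorOrSame.test(sessionLevel, item.readLevel()) || canWrite(sessionLevel, item);
    }

    boolean canReadTriple(String sessionLevel, Triple t) {
        return canRead(sessionLevel, t.subject())
                && canRead(sessionLevel, t.verb())
                && canReadObject(sessionLevel, t);
    }

    // Updating or deleting needs write access to the subject only.
    boolean canWriteTriple(String sessionLevel, Triple t) {
        return canWrite(sessionLevel, t.subject())
                && canRead(sessionLevel, t.verb())
                && canReadObject(sessionLevel, t);
    }

    // Data-value objects have no access levels of their own.
    private boolean canReadObject(String sessionLevel, Triple t) {
        return !(t.object() instanceof Item) || canRead(sessionLevel, (Item) t.object());
    }
}
```

Note the asymmetry in the sketch: writing a triple requires write access only to its subject, while the verb and any item object need only be readable.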

Each access level is represented by an item. The access levels of that item control access to the access level that it represents.

Superiority Relation

There is a superiority relation between access levels such that, if one access level is superior to another, anything that can be done at the inferior level can also be done at the superior level.

This is a powerful feature that system managers can use to control access to information. For example, an access level could be defined to control access to a company's pricing discount data. Another access level could be defined for people with a sales role. Making this level superior to the first one gives the sales team access to the pricing discount data.

The superiority relation is, in the mathematical sense, reflexive and transitive, but not symmetric. "Reflexive" means that an access level is superior to itself. "Transitive" means that if access level A is superior to B, and B is superior to C, then A is superior to C. So, if an access level is defined for people with a sales manager role, and is made superior to the level for the sales role, then sales managers will also have access to the pricing discount data. "Not symmetric" means that, if A is superior to B, then B is not necessarily superior to A. The sales managers can be given access to data that the sales team cannot see.

An organization defines a number of access levels to control access to its information, and states that some are superior to others. Because the superiority relation is transitive, a level can be superior to another without this being stated explicitly (as, in the example, is the case for the sales manager level and the pricing discount access level).
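The way stated relationships give rise to the full superiority relation can be sketched as a reachability check. This is a minimal illustration, assuming access levels are identified by strings and stated relationships are held in a map; the names SuperioritySketch, stateSuperior and isSuperior are hypothetical and do not describe how the implementation evaluates the relation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not the virtual data lake API.
class SuperioritySketch {
    // Stated relationships: level -> levels it is explicitly stated to be superior to.
    private final Map<String, Set<String>> stated = new HashMap<>();

    void stateSuperior(String superior, String inferior) {
        stated.computeIfAbsent(superior, k -> new HashSet<>()).add(inferior);
    }

    // True if a is superior to b under the reflexive, transitive relation.
    boolean isSuperior(String a, String b) {
        if (a.equals(b)) {
            return true; // reflexive: a level is superior to itself
        }
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.add(a);
        while (!todo.isEmpty()) {
            String level = todo.remove();
            for (String below : stated.getOrDefault(level, Set.of())) {
                if (below.equals(b)) {
                    return true; // reached b through a chain of stated relationships
                }
                if (seen.add(below)) {
                    todo.add(below);
                }
            }
        }
        return false; // not symmetric: no chain from a down to b
    }
}
```

Stating only that the sales manager level is superior to the sales level, and the sales level to the pricing discount level, is enough for isSuperior to report that sales managers have access to the pricing discount data, while the reverse does not hold.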

Access by Collaborating Organizations

An access level defined by one organization can be superior to an access level defined by another. The company in the example might appoint another company as a distributor of its products, and make the distributor's sales role access level superior to its pricing discount access level. (In such a situation, the access levels would probably be in different sources, with the distributor's sales role access level in a source owned by the distributor, and the pricing discount access level in a source owned by the producer.)

With access levels in different organizations, the question arises of which organization should be able to make a level superior to another. In the example above, we would expect the producer to have control over whether the distributor's sales role level should be superior to its pricing discount level. In other cases, however, an organization might want to control whether another organization's level is inferior to one of its levels. For example, it might wish to prevent its people from accessing information that it would then have to pay for, or that would put it in breach of a commercial or ethical policy. The principle is therefore adopted that both organizations must give permission for a level to be superior to another. Once both have done so, a stated relationship between the two levels is created automatically. Either can revoke its permission at any time, and the stated relationship is then automatically removed.
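The two-sided permission principle can be sketched as below. This is only an illustration under assumptions: the names MutualPermissionSketch, Relation, setSuperiorOwnerPermission, setInferiorOwnerPermission and isStated are hypothetical, levels are identified by strings, and the sketch derives the stated relationship on demand rather than storing it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: not the virtual data lake API.
class MutualPermissionSketch {
    record Relation(String superior, String inferior) {}

    // Permissions given by the organization that owns the superior level.
    private final Set<Relation> allowedBySuperiorOwner = new HashSet<>();
    // Permissions given by the organization that owns the inferior level.
    private final Set<Relation> allowedByInferiorOwner = new HashSet<>();

    void setSuperiorOwnerPermission(String superior, String inferior, boolean granted) {
        update(allowedBySuperiorOwner, new Relation(superior, inferior), granted);
    }

    void setInferiorOwnerPermission(String superior, String inferior, boolean granted) {
        update(allowedByInferiorOwner, new Relation(superior, inferior), granted);
    }

    // The stated relationship exists exactly while both permissions are in place.
    boolean isStated(String superior, String inferior) {
        Relation r = new Relation(superior, inferior);
        return allowedBySuperiorOwner.contains(r) && allowedByInferiorOwner.contains(r);
    }

    private static void update(Set<Relation> set, Relation r, boolean granted) {
        if (granted) {
            set.add(r);
        } else {
            set.remove(r);
        }
    }
}
```

In the distributor example, the relationship between the distributor's sales role level and the producer's pricing discount level appears once both the distributor and the producer have granted permission, and disappears as soon as either revokes it.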

Built-In Access Levels

There are two built-in access levels:

  • The highest level, which is superior to every other level
  • The lowest level, which is inferior to every other level

The highest level is internal to the virtual data lake implementation. It is needed for some functionality (such as ensuring that a triple object is unique) but it is not exposed to users or client programs. There is no way that a user or client program can gain super-user privileges and see all of the data.

Authorization

Operations that write data or read data other than at the lowest access level must be authorized. Authorization for a series of operations can be provided by a store session, and authorization for a single operation can be provided by a credential and key.

A store session is an association between a client and a triple store that enables the client to perform read or write operations. In the Java API, it is an object that the client program passes to the store in method invocations. A credential and key establishes a store session for a single operation, and can be re-used to provide further sessions for other operations.

The virtual data lake does not provide authentication. The responsibility for allowing users to establish sessions and for issuing credentials and keys lies with the client program that created the virtual data lake.