Virtual Data Lakes

The Virtual Data Lake Web API

With the Virtual Data Lake REST API, you can access data in a simple, powerful and flexible way. It has two basic data access operations, query and update, plus supporting operations. You do not have to use a schema. You can think of your data as a graph or a set of triples, or as tables if you want to. The API enables you to connect and disconnect sources of data, and it has powerful features for access control.

How to Use It

You can use any programming language to invoke the API with POST requests. There is a Python wrapper, which makes this particularly easy if you are using Python.

This page explains the basic operations with Python examples. There is also full documentation of the REST API and the Python wrapper, but it is helpful to understand the basic concepts before you get to the nitty-gritty.

Queries

A query finds the data that satisfies a set of constraints.

For example, suppose you have an application that keeps notes about web pages. You want to find the notes about pages that are about trees. The data that you want to find is:

  • Let Page be a page
  • Let Url be the URL of the page
  • Let Note be a note
  • Let Text be the text of the note

The constraints on it are:

  • Page is about "Trees"
  • Page has URL Url
  • Note is about Page
  • Note has text Text

This is a bit like elementary algebra, where you solve a set of equations to find the values of some unknowns. The difference is that, instead of solving equations, you are searching your data. You do not discover all the logical possibilities; you just get whatever data is there. Because of this similarity, the API uses the term "Unknown" for data items that you are looking for, and "Solution" for a combination of data values that satisfies the constraints. The solutions to this query might be:

Page         Url                                  Note         Text
1_987654321  https://en.wikipedia.org/wiki/Tree   1_876543219  Wikipedia article
1_765432198  https://trees.org/                   1_654321987  Trees for the Future

The values for Page and Note are items, represented by meaningless identifiers. The values for Url and Text are text strings. Other types of value (Booleans, integers, floating point numbers, or chunks of binary) might be found in other applications.

The query can be made quite simply using the Python wrapper.

solutions = vdl_client.query([
  (vdl.Unknown('Page'), vdl.NamedItm(1, 'is about'), 'Trees'),
  (vdl.Unknown('Page'), vdl.NamedItm(1, 'has URL'), vdl.Unknown('Url')),
  (vdl.Unknown('Note'), vdl.NamedItm(1, 'is about'), vdl.Unknown('Page')),
  (vdl.Unknown('Note'), vdl.NamedItm(1, 'has text'), vdl.Unknown('Text'))])

The solutions can then be examined or printed, e.g. by:

for solution in solutions:
  print(solution['Page'], solution['Url'], solution['Note'], solution['Text'])
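
The way constraints, unknowns and solutions fit together can be illustrated without the API at all. The following is a minimal in-memory sketch, not the Virtual Data Lake implementation: the `Unknown` class, the `match` function and the sample identifiers (`page1`, `note1`, and so on) are all hypothetical names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unknown:
    """A placeholder for a value the query should find (hypothetical helper)."""
    name: str


def match(constraints, triples):
    """Return every binding of unknowns that satisfies all constraints."""
    solutions = [{}]  # start with one empty partial solution
    for constraint in constraints:
        extended = []
        for partial in solutions:
            for triple in triples:
                binding = dict(partial)
                for pattern, value in zip(constraint, triple):
                    if isinstance(pattern, Unknown):
                        # Bind the unknown, or check it against an earlier binding.
                        if binding.setdefault(pattern.name, value) != value:
                            break
                    elif pattern != value:
                        break
                else:
                    # All three positions matched, so keep the extended binding.
                    extended.append(binding)
        solutions = extended
    return solutions


# Toy data: item identifiers are plain strings here (an assumption).
triples = [
    ("page1", "is about", "Trees"),
    ("page1", "has URL", "https://en.wikipedia.org/wiki/Tree"),
    ("note1", "is about", "page1"),
    ("note1", "has text", "Wikipedia article"),
    ("page2", "is about", "Rivers"),
    ("page2", "has URL", "https://example.org/rivers"),
]

constraints = [
    (Unknown("Page"), "is about", "Trees"),
    (Unknown("Page"), "has URL", Unknown("Url")),
    (Unknown("Note"), "is about", Unknown("Page")),
    (Unknown("Note"), "has text", Unknown("Text")),
]

solutions = match(constraints, triples)
for solution in solutions:
    print(solution["Page"], solution["Url"], solution["Note"], solution["Text"])
```

Only the page about trees and its note survive all four constraints; the page about rivers is dropped at the first one. The sketch returns only what is in the data, which is the sense in which a query differs from solving equations.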

Updates

Updates are made using lists of atomic operations. All of the operations in a list will be performed, or none. For example, to add the page at https://minecraft.fandom.com/wiki/Tree with the note "Minecraft Overworld trees", the operations would be:

  • Create a Page item to represent the page
  • Create a triple: Page has URL "https://minecraft.fandom.com/wiki/Tree"
  • Create a Note item to represent the note
  • Create a triple: Note is about Page
  • Create a triple: Note has text "Minecraft Overworld trees"

The update can be made using the Python wrapper.

vdl_client.update([
  vdl.CreateItem('Page', 1),
  vdl.PutTriple((vdl.Unknown('Page'), vdl.NamedItm(1, 'has URL'), 'https://minecraft.fandom.com/wiki/Tree')),
  vdl.CreateItem('Note', 1),
  vdl.PutTriple((vdl.Unknown('Note'), vdl.NamedItm(1, 'is about'), vdl.Unknown('Page'))),
  vdl.PutTriple((vdl.Unknown('Note'), vdl.NamedItm(1, 'has text'), 'Minecraft Overworld trees'))])
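
The "all or none" behaviour of an update list can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not how the Virtual Data Lake works internally: `apply_atomically`, the `"put"` and `"remove"` operation names, and the identifiers `page3`, `note3` and `note9` are hypothetical.

```python
def apply_atomically(triples, operations):
    """Apply every operation to a copy; return it only if all succeed."""
    working = set(triples)  # the caller's store is never modified in place
    for op, triple in operations:
        if op == "put":
            working.add(triple)
        elif op == "remove":
            if triple not in working:
                raise ValueError(f"cannot remove missing triple: {triple}")
            working.remove(triple)
        else:
            raise ValueError(f"unknown operation: {op}")
    return working


store = {("page1", "has URL", "https://en.wikipedia.org/wiki/Tree")}

# A valid list of operations: both changes are committed together.
store = apply_atomically(store, [
    ("put", ("page3", "has URL", "https://minecraft.fandom.com/wiki/Tree")),
    ("put", ("note3", "is about", "page3")),
])

# A list containing a failing operation: the store is left exactly as it was,
# so the first "put" in this list is not applied either.
before = set(store)
try:
    store = apply_atomically(store, [
        ("put", ("note3", "has text", "Minecraft Overworld trees")),
        ("remove", ("note9", "is about", "page9")),  # fails: no such triple
    ])
except ValueError:
    pass
```

Working on a copy and replacing the store only on success is one simple way to get this guarantee; a real service would more likely use transactions.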

Simplicity

In the example, the notes on web pages about trees were retrieved by a query with four constraints, and came with meaningless item identifiers as well as the URLs and the text of the notes. This is more complex than using the SQL query SELECT Url, Text FROM PageNotes. So how is the Virtual Data Lake API simple?

The relational database is a powerful concept that enables access to data on which a clear structure has been imposed. Its value for managing structured data is unquestioned but, in recent years, the need has grown to deal with data that does not follow the relational model, such as sets of name-value pairs and knowledge graphs. This data may be viewed differently by different applications, or by the same application at different times, and trying to impose a static structure on it can make things very complicated.

It doesn't take much for things to start getting complicated, even in the SQL world. For example, suppose we can have more than one note on a web page. Our simple PageNotes table is no longer in First Normal Form. We fix this by introducing a second table for notes, and giving it a primary key. The text of the notes doesn't make a sensible key, so we introduce a meaningless identifier. Already the complexity is building. Then we find that notes can be authored by different people, and we want to keep track of them - so we need a third table for the authors. Imposing this kind of structure, working through first, second and third normal forms, can be a valuable discipline, but it is a big design overhead, and is simply not viable when the structure needed by the data users is unclear or changes frequently.
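
The growth described above can be seen in a short sketch using Python's built-in sqlite3 module. The table and column names (`Pages`, `Notes`, `Authors`, `Topic`, `NoteText`) and the sample rows, including the author name and the second note, are assumptions made up for illustration; the original only names a `PageNotes` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three tables, two surrogate keys and two foreign keys, where one table
# with two columns was enough at the start.
conn.executescript("""
    CREATE TABLE Pages   (PageId INTEGER PRIMARY KEY, Url TEXT, Topic TEXT);
    CREATE TABLE Authors (AuthorId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Notes   (NoteId INTEGER PRIMARY KEY,
                          PageId INTEGER REFERENCES Pages(PageId),
                          AuthorId INTEGER REFERENCES Authors(AuthorId),
                          NoteText TEXT);
    INSERT INTO Pages VALUES (1, 'https://en.wikipedia.org/wiki/Tree', 'Trees');
    INSERT INTO Authors VALUES (1, 'Alice');
    INSERT INTO Notes VALUES (1, 1, 1, 'Wikipedia article');
    INSERT INTO Notes VALUES (2, 1, 1, 'Good overview');
""")

# The one-line SELECT has become a two-join query.
rows = conn.execute("""
    SELECT Pages.Url, Notes.NoteText, Authors.Name
    FROM Notes
    JOIN Pages ON Pages.PageId = Notes.PageId
    JOIN Authors ON Authors.AuthorId = Notes.AuthorId
    WHERE Pages.Topic = 'Trees'
    ORDER BY Notes.NoteId
""").fetchall()

for url, text, author in rows:
    print(url, text, author)
```

Each new requirement (several notes per page, then authors) meant a schema change and a rewritten query, which is the design overhead in question.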

Different query paradigms have been introduced to deal with non-relational data, including SPARQL (which was the inspiration for the queries in this API) and GraphQL. The API has simple query and update operations. It works for triple stores, graph databases and other "NoSQL" paradigms, and for relational data too, because it breaks queries and updates into atomic constraints and changes. When dealing with data whose structure is fluid or obscure, it is as simple as you can get.
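
To see why relational data also fits this model, note that a table row can be read as a handful of triples, one per column. The sketch below is only an illustration of that idea: `row_to_triples` and the identifier `row1` are hypothetical, and the API itself may represent rows differently.

```python
def row_to_triples(item_id, row):
    """Turn one relational row (a dict) into (item, property, value) triples."""
    # Columns holding None are skipped: a missing value is simply a missing triple.
    return [(item_id, column, value)
            for column, value in row.items()
            if value is not None]


row = {"Url": "https://trees.org/", "Text": "Trees for the Future", "Author": None}
triples = row_to_triples("row1", row)

for triple in triples:
    print(triple)
```

Each triple can then be matched by a single constraint or changed by a single update operation, independently of the others.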