LDEO Log Data: RDF to Google Fusion Tables

The start

This is a test of what is involved in building a pipeline (ETL) from a locally hosted RDF store (Virtuoso in our case) to a service like Google Fusion Tables.

The goal is to investigate the process as a means to enable a level of business intelligence capacity with IODP data. The RDF store and the linked data pattern allow us to blend data from multiple graphs and send the results to more user-friendly interfaces like Fusion Tables. The results could also be sent back into the RDF store as a new graph.

Assets

RDF graphs provide a means to query across several models, resulting in a larger interconnected linked open data "graph" for USIO data. However, this can be complex to interact with for a domain scientist inexperienced with the approach. Conversely, the spreadsheet-like metaphors behind Google Docs or Google Fusion Tables (and other services) provide a known and intuitive environment.

This experiment takes a first look at connecting these environments. It demonstrates a search across multiple graphs, a calculation based on the results, and a load of that data into Fusion Tables: a classic extract, transform, and load (ETL) process. The aim is to gauge the effort required and see what some of the potential benefits look like.

Boreholes with logs and age models

The first step was to compare the graphs that hold the age model data and the log data. The SPARQL call below locates all the log file resources that are in boreholes that also have age models.

The result of this call is the set of log file resources for which we have enough information to provide an age for the bottom depth recorded with the resource.

PREFIX chronos: <http://www.chronos.org/loc-schema#>
PREFIX iodp: <http://www.oceandrilling.org/core/1/#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT DISTINCT ?log ?agemodel ?codices ?logbottom ?lat ?long
FROM NAMED <http://data.oceandrilling.org/logdb#>
FROM NAMED <http://chronos.org/janusamp#>
WHERE {
   GRAPH <http://data.oceandrilling.org/logdb#> {
      ?log skos:related ?codices .
      ?log iodp:logbottom ?logbottom .
      FILTER regex(str(?codices), "^http://data.oceandrilling.org/codices/lsh/.*/.*/.*")
   }
   GRAPH <http://chronos.org/janusamp#> {
      ?agemodel skos:related ?codices .
      ?agemodel geo:long ?long .
      ?agemodel geo:lat ?lat .
   }
}
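
For reference, pulling these results back into a script is straightforward. Below is a minimal sketch using the Python SPARQLWrapper library against the Virtuoso SPARQL endpoint; the endpoint URL is an assumption, and the FROM NAMED clauses are omitted since the explicit GRAPH blocks name the graphs directly.

# Sketch: run the cross-graph query above from Python via SPARQLWrapper.
# The endpoint URL is an assumption; substitute the actual Virtuoso endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://data.oceandrilling.org/sparql"  # assumed

QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX iodp: <http://www.oceandrilling.org/core/1/#>
SELECT DISTINCT ?log ?agemodel ?codices ?logbottom ?lat ?long
WHERE {
  GRAPH <http://data.oceandrilling.org/logdb#> {
    ?log skos:related ?codices .
    ?log iodp:logbottom ?logbottom .
    FILTER regex(str(?codices), "^http://data.oceandrilling.org/codices/lsh/.*/.*/.*")
  }
  GRAPH <http://chronos.org/janusamp#> {
    ?agemodel skos:related ?codices .
    ?agemodel geo:long ?long .
    ?agemodel geo:lat ?lat .
  }
}
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Each binding holds the log URI, its bottom depth, and its location.
rows = [(b["log"]["value"], float(b["logbottom"]["value"]),
         b["lat"]["value"], b["long"]["value"])
        for b in results["results"]["bindings"]]
print(len(rows), "log resources with age models")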

Some math and load to Google Fusion Tables

At this point the next step is to pull back all the age-depth pairs (again by SPARQL) and do some math to calculate the age at log bottom. The result is that we have the log file resource URI, the bottom depth of the log, and now the calculated bottom age of the log as well.
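
The interpolation code itself isn't shown here. As a sketch, assuming the age model provides (depth, age) tie points per borehole sorted by depth, a linear interpolation to the log's bottom depth could look like the following (the function and variable names are mine, not from the original code):

# Sketch: linearly interpolate an age at the log's bottom depth from the
# age model's (depth, age) tie points. Assumes depths in meters below sea
# floor (mbsf) and ages in millions of years (Ma).
from bisect import bisect_left

def age_at_depth(tie_points, depth):
    """tie_points: list of (depth_mbsf, age_ma) tuples sorted by depth."""
    depths = [d for d, _ in tie_points]
    i = bisect_left(depths, depth)
    if i == 0:
        return tie_points[0][1]    # shallower than the first tie point
    if i == len(tie_points):
        return tie_points[-1][1]   # deeper than the last tie point
    (d0, a0), (d1, a1) = tie_points[i - 1], tie_points[i]
    return a0 + (a1 - a0) * (depth - d0) / (d1 - d0)

# Example: a hypothetical age model with tie points at 0, 50 and 120 mbsf.
model = [(0.0, 0.0), (50.0, 1.2), (120.0, 3.4)]
print(age_at_depth(model, 85.0))  # -> 2.3 Ma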

Next we want to load this into a Google Fusion Table to begin to see what we can do with it. Using the GData APIs we can load this data in batch mode into a new table; the code creates the table and loads the data into it.
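
That load code isn't reproduced here, and Fusion Tables has since been retired, but the batch load amounted to posting SQL-style CREATE TABLE and INSERT statements to the API's query endpoint. Below is a rough sketch of the pattern, with the endpoint, auth token, column layout, and sample row all assumptions rather than verbatim from the original code.

# Sketch: create a Fusion Table and batch-load rows via the SQL-style
# query endpoint of the (now retired) Fusion Tables API. Endpoint, auth
# header, and column layout are assumptions from the API of the time.
import requests

API = "https://www.google.com/fusiontables/api/query"
HEADERS = {"Authorization": "GoogleLogin auth=YOUR_TOKEN"}  # hypothetical token

def run_sql(sql):
    resp = requests.post(API, data={"sql": sql}, headers=HEADERS)
    resp.raise_for_status()
    return resp.text

# Create the table; the response's second line carries the new table id.
create = "CREATE TABLE logs (log STRING, depth NUMBER, age NUMBER, location LOCATION)"
table_id = run_sql(create).splitlines()[1]

# Illustrative row: (log URI, bottom depth mbsf, interpolated age Ma, lat, lon).
rows_with_ages = [
    ("http://data.oceandrilling.org/logdb/example-log", 120.0, 2.3, -1.2, 145.6),
]

# Batch the INSERTs: several statements per POST, separated by semicolons.
stmts = [
    "INSERT INTO %s (log, depth, age, location) VALUES ('%s', %f, %f, '%f,%f')"
    % (table_id, log_uri, depth, age, lat, lon)
    for (log_uri, depth, age, lat, lon) in rows_with_ages
]
for i in range(0, len(stmts), 100):
    run_sql("; ".join(stmts[i:i + 100]))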

What Google Fusion Tables gives us

As mentioned, Google Fusion Tables (and Google Docs too, by way of loading there) gives us a more familiar look and feel for the data: table oriented, but also potentially enhanced with geospatial mapping and query (ref: http://data.oceandrilling.org/labs/geo/geo.html).

Fusion Tables allows not only mapping but also simple plots, timelines, and other visualizations for data that lends itself to them. The data can also be exported, or copies made by users if permission is given. Filters, aggregates, and other data massaging can also be done, and the results freely shared by third parties and their communities outside of USIO interaction.

Final thoughts

This example is not a full-fledged attempt at this pipeline. The following caveats come to mind:

  • Only about 30 percent of the log file resources match up to an age model. However, there are likely more hits if I use any age model at a site, rather than requiring the log and the age model to share the exact same hole (see the sketch after this list). A site-level age model should be more than accurate enough, as any two holes at a site are only a few meters apart.
  • More elements of the various graphs could be pulled into the Fusion Table to make it a richer resource. For this example, only the reference URI, depth, age, and location were loaded, for the sake of simplicity.
  • The log data doesn't lend itself to filtering and searching the way lithology or taxonomy data might. However, it served as a clean data set for this test. Future efforts would use these more faceted data types to better explore the visualization and analysis capacity.
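
As an illustration of the first point above, assuming the codices URIs end in a leg/site/hole pattern (inferred only from the regex in the query), the hole-level join could be relaxed to a site-level one by keying on the URI with its final segment dropped. A hypothetical sketch:

# Sketch: relax the hole-level match to a site-level one. This assumes the
# codices URIs end in .../leg/site/hole, which is only inferred from the
# regex in the earlier query.
def site_key(codices_uri):
    """Drop the trailing hole segment, leaving a leg/site key."""
    return codices_uri.rstrip("/").rsplit("/", 1)[0]

# logs and age_models: dicts mapping codices URI -> resource URI,
# e.g. pulled back by two simpler per-graph SPARQL queries.
def match_by_site(logs, age_models):
    models_by_site = {}
    for codices, agemodel in age_models.items():
        models_by_site.setdefault(site_key(codices), agemodel)
    return {
        log: models_by_site[site_key(codices)]
        for codices, log in logs.items()
        if site_key(codices) in models_by_site
    }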

KML Link

Scatter plot of depth vs. age for various logs. Here depth in meters below sea floor is on the X axis and age in millions of years is on the Y axis. While this is mildly interesting, one issue with the interface is that we cannot zoom or select a point and obtain its URL to explore it further. Moving the data into something like Google Docs (is there a connection between Fusion Tables and Docs?) might allow for richer interaction.