Personal Data Stores and Context Automation

Issue/Topic: Personal Data Stores and Context Automation (T3F)

Conference: IIW-East, September 9-10, 2010, in Washington DC

Convener: Phil Windley

Notes-taker(s): Phil Windley – Blog Post

Tags for the session - technology discussed/ideas considered:

Discussion notes, key understandings, outstanding questions, observations, and, if appropriate to this discussion: action items, next steps:

www.windley.com/archives/2010/09/pdx_principles.shtml

There was a lot of discussion around Personal Data Stores (PDS) and Personal Data Lockers at IIW East. Every time slot on both days had at least one and sometimes two sessions on the subject. (As an aside, if you’re not familiar with IIW, the agenda is created in real time, by the participants, not months in advance by a program committee, so it represents more fully the interests of the participants than a normal conference agenda might.) I’m confident that this will also be a major theme at the upcoming IIW in Mountain View, CA in November.

The term itself is a problem. When you say “store” or “locker” people assume that this is a place to put things (not surprisingly). While there will certainly be data stored in the PDS, that really misses its primary purposes: acting as a broker for all the data you’ve got stored all over the place and managing the metadata about that data. That is, it is a single place, but a place of indirection, not storage. The PDS is the place where services that need access to your data will come for permission, metadata, and location. The same goes for services that need to give you data. Consequently, some have taken to calling it a PDX, where “x” stands for the “variable x.” That is, we don’t know what to call the last thing, so we’ll say “x” and leave it at that.

In the discussions, I started to tease out a few principles that define the PDX and make it something different from just a database where my stuff is. We all have lots of places where data about us is stored and, since it’s personal data, we might think of them as “personal data stores,” but when people at IIW (and elsewhere) use the term, they’re talking about something larger and more capable than just a passive database. Here’s a list of a few things that I think distinguish a PDX from just places where your personal data is stored:


 * user-controlled - the user needs to be in control of the data, who has access, and how it is used. Once that data is in my PDX, I make decisions about it. That doesn’t mean the data might not also be somewhere else. For example, data about my purchases from Amazon will certainly be stored at Amazon and not under my control. But I might also be emailing the receipts to a service that parses them and puts the data in my PDX for my use.
 * federated - there isn’t one place where your data is stored, but multiple places that the data needs to be able to flow between, in a permissioned way. There’s no center, just a lot of cooperating systems with my PDX orchestrating the interactions. While Amazon might not give my PDX access to and control over my transactions, my phone company might provide a PDX-capable contact service where I choose to store my contact information.
 * interoperable - various PDX services and brokers have to be able to operate together according to standards to perform their roles. When I take money out of my account at Wells Fargo and deposit it at Chase, I don’t lose part of the value because Chase doesn’t know how to handle some part of the transaction. The monetary system is interoperable with standards and, sometimes, shims that connect it all together.
 * semantic - a PDX knows more about the data that it holds than existing data stores do. Consider Dropbox. I can put all kinds of things in my Dropbox, but it’s syntactic, not semantic. By that I mean that if I want to put healthcare data in Dropbox and control who uses it, I create a folder and put the data in it with specific permissions. The fact that there is a folder with a certain name located at a particular place in the folder hierarchy is purely syntactic. In a semantic world, the data itself is tagged as healthcare data and no matter where it is, it’s protected according to the policies I’ve put in place.
 * portability - a PDX doesn’t trap data in proprietary formats. If my phone company is storing my contact data in the cloud and I decide that I want to move it to my own server or another service, I can, from a technical as well as a policy standpoint. Note that this doesn’t mean we have to wait until thousands upon thousands of data format specifications get hammered out. Semantic metadata can provide a means of translating from one format to another.
 * metadata management - one of the primary roles of the PDX is managing data about my data. What are the roles I’ve created? What permissions have I granted as exceptions to the defaults? What semantics surround the various data fields? What data sharing, encoding, and encrypting policies have I created? All of this has to be kept and managed on my behalf in the PDX.
 * broker services - the PDX is a place where the user manages a federated network of data stores. As an example of why this is important, consider the shortcomings of OAuth. If I use an application that needs access to four OAuth-mediated APIs, I have to go through the OAuth ceremony with each API provider separately. Now consider that I might have dozens of apps that use a popular API. I have to go through the OAuth ceremony for each of them separately. In short, a broker saves us from the N x M explosion of permissioning ceremonies. The same holds for various data services.
 * discoverable - a PDX should provide discoverability for its APIs and schemas so that any application I’m interested in knows how to interact with it. Discoverability protects users from having to completely specify addresses, mappings, and schemas to every application that comes along.
 * automatable and scriptable - a PDX without automation is worse than no PDX at all because it burdens the user rather than saving effort. A PDX will be a player in a larger ecosystem of services. I don’t see it as a mere API that allows services and applications to GET and PUT data; it’s not WebDAV on steroids. The PDX is an active participant in the greater ecosystem of services that are cooperating on the user’s behalf.
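
To make two of the principles above a little more concrete, here is a minimal Python sketch of the "semantic" and "broker services" ideas. It is an illustration under stated assumptions, not a description of any existing PDX: the `Item` and `Broker` classes, the tag and party names, and the storage locations are all hypothetical. The N + M count for the brokered case is also an assumption about how a broker might work (each app and each API authorizes once with the broker); the notes only say that a broker avoids the N x M explosion.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Item:
    """A piece of personal data; `location` is only where it happens to live."""
    location: str      # hypothetical path or URL
    tags: frozenset    # semantic tags, e.g. {"healthcare"}


@dataclass
class Broker:
    """Hypothetical PDX broker: policies attach to tags, not to locations."""
    policies: dict = field(default_factory=dict)  # tag -> set of allowed parties

    def allow(self, tag, party):
        self.policies.setdefault(tag, set()).add(party)

    def may_read(self, party, item):
        # Every tag on the item must permit the party; untagged data is denied.
        return bool(item.tags) and all(
            party in self.policies.get(tag, set()) for tag in item.tags
        )


def ceremonies_without_broker(apps, apis):
    # Each app runs a separate OAuth ceremony with each API provider.
    return apps * apis


def ceremonies_with_broker(apps, apis):
    # Assumption: each app and each API is authorized once with the broker.
    return apps + apis


pdx = Broker()
pdx.allow("healthcare", "dr-jones")

# The same kind of data stored in two different places gets the same protection.
record = Item("phone-company-cloud/records/2010.xml", frozenset({"healthcare"}))
moved = Item("my-own-server/archive/2010.xml", frozenset({"healthcare"}))
```

With this setup, `pdx.may_read("dr-jones", record)` and `pdx.may_read("dr-jones", moved)` both hold, because the policy follows the tag rather than the folder. For a dozen apps and four APIs, `ceremonies_without_broker(12, 4)` counts 48 separate ceremonies, while the assumed brokered arrangement needs 16.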

Surely I’ve missed some, but this list is a good start. What would you add?

Update: Kaliya wrote up a vision and principles document for personal data stores a month ago. Not surprising to people who know us both, they differ radically in perspective, but are coherent in spirit.