2007-08-10

Redefining "Document", Part 1: Information Sharing Problems

Recently I've been working to establish some very basic architectural concepts for large-scale, multiparticipant sharing of data, possibly including restricted data. It's been interesting and a little unsettling to realize just how poorly surveyed this problem space really is. I've increasingly come to believe that any large-scale information sharing effort is terribly exposed to two serious problems:
  • Loss of policy control over shared information;

  • (Probably nonlinear) degradation of information quality.

It's taken me a long time to get to this point, but I think I can state with reasonable confidence that the crux of the problem lies in unexamined concepts of what a "document" is. Fortunately, some conceptual investigation of the nature of documents yields a possible solution, and it's happily a solution that seems to be technically feasible.

To begin with, some definitions:
  • Managed data is persistent structured information whose maintenance and dissemination is at least nominally governed by a single authority.

  • Such a managing authority is called a primary source, or, especially in contexts related to policy, a primary authority.

  • Information sharing refers to all transmissions of managed data between primary sources and/or secondary consumers. (Note that in some contexts, "information sharing" has a narrower meaning, referring only to information transmissions whose destination is human eyes (as opposed to an automated data consumption process). That limitation does not apply here.)

  • A Document Management Application is a software application used by the primary source to manage the information for which it's responsible.
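The definitions above can be made concrete with a minimal sketch. Everything here — the `ManagedRecord` type, the field names, the identifiers — is an illustrative assumption, not part of any real system; the point is only that a managed record can carry its primary-source identity alongside its content.

```python
from dataclasses import dataclass

# Hypothetical sketch: a unit of managed data that names the single
# authority (primary source) nominally responsible for it.
@dataclass(frozen=True)
class ManagedRecord:
    record_id: str       # identifier assigned by the primary source
    primary_source: str  # the authority governing maintenance and dissemination
    content: dict        # the structured information itself
    version: int = 1     # revision counter maintained by the primary source

record = ManagedRecord(
    record_id="case-1138",
    primary_source="county-clerk",
    content={"status": "open"},
)
print(record.primary_source)  # → county-clerk
```

Any consumer handed such a record can at least tell who governs it — which, as discussed below, is exactly the knowledge that tends to get lost.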


Primary Sourcing and the mindset of inevitability

As stated, the two major dangers involved in information sharing are loss of policy control and degradation of data quality. Explanations of both are necessary before discussing the essential problem of the nature of documents.

Both of these problems are founded on the concept that a primary authority really is responsible for managing its information, and is to be considered the authoritative source for that information. This primary sourcing concept, often reduced to the somewhat simpler formulation of "data ownership", has actually been abandoned by a lot of policy thinkers in the rush to share information: I've heard more than once that "Data ownership is obsolete". I was troubled by that remark the first time I heard it, and I never got more comfortable with the notion. This post covers a lot of territory, but in large part is a rebuttal to that whole idea.

It's instructive, and mildly distressing, to see how quick policy thinkers have been to embrace a kind of technological determinism. This is probably something of a conditioned response to the expensive and monolithic information technologies of previous decades, and the consequent pervasive insertion of IT professionals (like me) into business, social, and political decision processes, where we serve as gatekeepers, telling people what can and cannot be done.

I think, too, that it reflects something of the ruthless, bottom-line-driven, hypercompetitive nature of today's corporate culture. That culture has pressured society into an unfortunate habit of thought, in which any attempt to assess benefit, risk, or cost according to criteria other than the corporate balance sheet is subject to derision and, if persisted in, retaliation.

These two dynamics — IT people telling policymakers "You can't do that, there's no way to do that", and corporate people telling policymakers "Nice little policymaking job you got there, be a shame if something happened to it" — probably explain the widespread readiness to think of technological progress as both deterministic and sovereign: a great big train barreling down a single line of track. You can't stop it, changing its direction is fundamentally impossible, and you'd best not be in the way.

Fortunately, both of these dynamics seem to be increasingly subject to challenge. Current information technologies are rapidly eroding the concept that IT is rigidly constrained in its capabilities; and the remarkable criminality and corruption of the Republican Party since its ascendancy to power in the Reagan years seems finally to be undermining America's unthinking habit of deference to business priorities. I'm hopeful that it's becoming possible to have design discussions that do not begin with pernicious assumptions of inevitability.


Because seriously: loss of policy control and degradation of information quality are the kinds of problems that are immensely more difficult and expensive to remedy after the fact. To the extent that we can work out concepts, designs, and practices that prevent those problems from developing in the first place, we will be in an immensely better position.

I believe it's urgent to work very seriously at the problem. Information sharing as a general project is gaining traction, and all of its manifestations appear to envision a default behavior that is very detrimental.

The default model of information sharing is to make copies of documents and transmit them to sharing recipients. For present purposes, this model will be dubbed Massive Duplication, because its net result is, of course, many copies everywhere. That this is the default follows directly from the fundamental nature of our concept of a Document.

Massive Duplication disastrously amplifies the two central problems addressed here.
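A bit of illustrative arithmetic shows why "many copies everywhere" is not an exaggeration. Assume — purely as a simplification — that each holder of a document forwards it to a uniform number of new recipients per round of re-sharing; the copy count then grows geometrically with each hop.

```python
# Back-of-envelope sketch, not a model of any real sharing network:
# uniform fanout per hop is an assumption made only for illustration.
def copies_after(hops: int, fanout: int) -> int:
    """Total copies in existence after `hops` rounds of re-sharing,
    counting the original copy held by the primary source."""
    return sum(fanout ** k for k in range(hops + 1))

for hops in range(4):
    print(hops, copies_after(hops, fanout=3))  # 1, 4, 13, 40
```

Three hops at a modest fanout of three already yields forty copies, each one outside the primary source's reach.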


Loss of Policy Control

Losing policy control over shared information is defined as the condition in which the primary source lacks authority over — and in some cases, even knowledge of — what the recipient is doing with the data. Is it being copied to a data warehouse or archive? Is it being secondarily disseminated to other parties? Is it being merged into other records owned by another primary source altogether?

Maintaining policy control is, of course, important only if the information being shared is in some way sensitive: if its disclosure to an inappropriate recipient could cause harm of some sort. But information sharing projects are, implicitly, all about trafficking in such sensitive information. After all, if it's not sensitive, if it's truly public data, it's cheap and easy to just post it on a website and fuggedaboudit.

The problems of policy control are all problems of disclosure policy. Control over other kinds of information operations (e.g., creating, deleting, and modifying records) is not affected by the existence of alternate copies floating around out in the world. It's disclosure that's the issue.

When information is shared, the governance of subsequent disclosure and usage can never be absolute. (If nothing else, a bad actor viewing shared information could always transcribe it and do something nefarious with it.) However, there's an enormous difference between an information sharing architecture which is vulnerable to bad-faith abuse, and a policy-hostile architecture that makes it impossible to maintain policy control even when all participants are well-behaved.

Massive Duplication is of course such a policy-hostile architecture.


Degradation of Information Quality

"Information Quality" is a broad and not entirely tightly defined term that covers a lot of issues. In general, it speaks to the question, "How trustworthy is my data?"

Information Quality topics include issues like:
  • How reliably sourced/observed is the content of the information?

  • How good are the data sanitation practices of the entities that have had possession of the data?

  • What confidence level has been assigned to the data? What are the criteria used for that assignment? How trustworthy are the entities contributing to that assignment?

  • How well-understood are the transformations and rationalizations (if any) applied to the data? Are they clean, correct mappings, or susceptible to semantic error?

  • Is the data normalized or does it reiterate any part of its content?

  • What is the observed or reported incidence of data error in information from that source? Are internal inconsistencies observable within the data?

  • ...and so on.

Information Quality is an acknowledged problem in all information management activity. Information Sharing just happens to magnify the problem's every aspect and manifestation.

As is the case for policy control, the cause of good information quality is hurt badly by adopting the Massive Duplication model of information sharing. Its worst effects are:
  • The n-generation effect, in which a document is subject to a given probability of replication or transcription error for every sequential event of transmittal, and every event of persistence, outside the primary source. This is very similar to the progressive degradation of image quality as successive photocopies are made, each from the last photocopy. The term "n-generation" is derived from the description of how many successive copies-of-copies resulted in a given print.

  • The divergence effect, in which copies transmitted to multiple recipients create a greater likelihood of a "fork" in the data, causing the content to diverge. This is a topologically different problem than the n-generation effect, and its primary risk is the creation of separate version chains of the document: it's difficult and expensive to reconcile inconsistencies in such an information structure.

  • The loss of provenance for all or part of the information in a document. This is the condition in which knowledge of the authoritative source of the information in the document has been lost or erroneously recorded. The most important consequence of this loss is the marked increase in difficulty — and expense — when trying to resolve any questions about the document's content or validity.

  • The patchwork quilt effect, in which successive generations of copies of data have content added, modified, and removed along the way to suit the purposes of the holder of the moment. In this fashion, as the document is forwarded, it becomes increasingly a composite of data from heterogeneous sources. Accuracy and provenance of any given patch in the quilt becomes more difficult to persist and establish; in some situations, essential components of the tracking and identifying information may be among the discards.

  • And finally, there's the devastatingly simple problem of the reverse axis. As hard as it can be, in a massively duplicated information sharing environment, to track a piece of information back to its source, attempts on the part of the primary source to issue corrections, retractions, redactions, etc. forward seem almost certain to fail to reach some of the disseminated copies.
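The n-generation effect, at least, admits a back-of-envelope model. If we assume — a deliberate simplification — that each copy event independently introduces an error with some fixed probability p, then the probability that at least one error has crept in after n generations is 1 − (1 − p)^n, which compounds quickly even for small p.

```python
# Simplified independence assumption: each copy event errs with probability p.
def p_error(n: int, p: float) -> float:
    """Probability of at least one replication error after n
    independent copy events, each with per-copy error rate p."""
    return 1 - (1 - p) ** n

# Even a 2% per-copy error rate compounds noticeably:
for n in (1, 5, 10, 25):
    print(n, round(p_error(n, 0.02), 3))
```

At a 2% per-copy error rate, ten generations already put the odds of corruption near one in five.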

I'm in no position to make authoritative mathematical statements; I haven't constructed a math model for the probability of error in any of these scenarios. And math can surprise you.

But my intuition is strong enough to bet anybody a beer that any rigorous probabilistic error-rate predictive analysis would contain significant terms that would be worse than linear. (I speak here of linearity with respect either to the number of information transmissions, or to the number of participants in the exchange.)
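One worse-than-linear term is easy to exhibit without any real math modeling. If every pair among m participants exchanges copies directly — again, an assumption for illustration, not a claim about any actual sharing network — the number of transmission events, each an opportunity for error, grows quadratically in m.

```python
# Illustrative only: assumes one direct exchange per participant pair.
def pairwise_transmissions(m: int) -> int:
    """Distinct sender/receiver pairs among m participants: m*(m-1)/2."""
    return m * (m - 1) // 2

def expected_errors(m: int, p: float) -> float:
    """Expected count of erroneous transmissions, assuming each of the
    pairwise transmissions independently errs with rate p."""
    return pairwise_transmissions(m) * p

for m in (10, 100, 1000):
    print(m, pairwise_transmissions(m))  # 45, 4950, 499500
```

Growing the participant pool tenfold grows the error exposure roughly a hundredfold — quadratic, not linear, which is the shape of the bet above.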

The consequences of embarking on massively duplicated information sharing are potentially quite grave. And the consequences would grow over time, as the common slush-pile of copies of copies would grow deeper and higher.
