What is data?

The leading voices in technology have exploded in discussion about data portability, data rights, and the future of web applications. As an active member of the DataPortability Policy group, here is my suggestion on how the debate needs to proceed: break it down. Michael Arrington seems pretty convinced you own all your data, but I don’t think that’s a fair thing to say - and at its core that is the reason he is clashing with Robert Scoble’s view. For things to proceed, I really think a deeper analysis of the issues needs to be made.

1) Define the difference between data, information and knowledge. There’s a big difference.
2) Determine what things are. (is an e-mail address data or information?)
3) Recognise the difference between ownership, rights and their implications.
4) Determine what rights (if that’s what it is) the various entities have over data (users, web apps, etc).

This is a big area and has a lot of abstract concepts - break it down and debate it there.

Some of my own thoughts to give some context

1) Data is an object and information is generated when you create linkages between different types of data - the ‘relationships’. Knowledge is the application of information.

  • 2000 is data - a symbol with no meaning. Connect it with other data, like the noun "year", and you have information because 2000 now has meaning. Connect that information with other information, like "computer bug" and "HSBC", and you now have an application of that information: there was an issue with the Y2K bug that has something to do with the bank HSBC.
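The 2000 example can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names Datum, Information and apply_knowledge are hypothetical and do not come from any library or from the DataPortability work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datum:
    """A bare symbol: it has no meaning on its own."""
    value: str


@dataclass(frozen=True)
class Information:
    """Data linked to other data; the link is what carries the meaning."""
    subject: Datum
    relation: Datum

    def describe(self) -> str:
        return f"{self.relation.value} {self.subject.value}"


def apply_knowledge(*facts: Information) -> str:
    """Knowledge as the application of information: combine several facts."""
    return " + ".join(fact.describe() for fact in facts)


symbol = Datum("2000")                      # data: meaningless alone
year = Information(symbol, Datum("year"))   # information: "year 2000"
bug = Information(Datum("bug"), Datum("computer"))
bank = Information(Datum("HSBC"), Datum("bank"))

summary = apply_knowledge(year, bug, bank)
print(summary)
```

The point of the sketch is that the Datum alone answers no question; meaning only appears once a relationship is attached, and something usable only appears once several of those relationships are combined.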

2) Define what things are

What’s an e-mail address, a phone number, a social graph, an image, a podcast…I’m not entirely sure. I wouldn’t be blogging this if I had all the answers. Once we agree on definitions, we can then start categorising them and applying criteria.

3) Ownership:

Here is something Steve Greenberg explained to me

- Ownership is relevant when there is scarcity.
- Ownership is the ability to deny someone else’s use of the asset.
- So, if data is shared and publicly available, it is a practical impossibility for me to deny use
- and if data is available in a form where I can’t control others’ use of it, I can not really claim to own it

Nitin Borwankar has a very different argument: you should have ownership based on property rights. He explained that to me here.

4) Rights over data

I personally think no one owns data (which is inspired by the definition of data being inherently meaningless); instead you own things further down the value chain when that data becomes something with value. You own your overall blog posts - but not the words.

But again, this goes back to what is data?

7 Responses to “What is data?”


  1. 1 Alex Schleber

    Elias,

    I appreciate that you brought some much needed depth to the discussion, when a lot of it appeared to be posturing and chest-thumping especially by Arrington.

    As far as rights and ownership, never forget that “possession is 9/10 of the law”… :) You are right that it is next to impossible to make a strong claim for data ownership.

    As for email addresses, it is in the very nature of the internet to not have any scarcity of them or any other data/information item for that matter (only thing we are running short on are IP addresses, IPv6 will solve that permanently, with enough IPs for every “thing” in the world and then some).

    If you have a hosting account with your own domain and add-on domains, and have access to domain-wide email forwarding with a wildcard (*), you can with a few keystrokes implicitly create an INFINITE number of email addresses… only thing that matters though is if you check the account that you forward them to in an email client, and so forth.

    But in a real sense anyone can now type any possible combination of letters/numbers in front of @yourdomain.com and get the email to you. Food for thought.

    You could give each of your friends a different email address (elias1@, elias2@, etc.) and then be able to tell who “compromised” your email address e.g. by using it in an openly visible CC: recipient list. You could turn each of those off at the hosting level or filter at the email client level.
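The per-friend alias idea can be sketched in Python. This is a toy model, not a mail server configuration: the alias table, the friend labels and the helper names catch_all_accepts and who_leaked are all assumptions made for illustration.

```python
from typing import Optional

DOMAIN = "yourdomain.com"  # placeholder domain, as in the comment above

# Hypothetical bookkeeping: which alias was handed to which friend.
alias_to_friend = {
    f"elias1@{DOMAIN}": "friend A",
    f"elias2@{DOMAIN}": "friend B",
}
disabled = set()  # aliases that have been switched off


def catch_all_accepts(address: str) -> bool:
    """A wildcard forward takes any local part at the domain unless disabled."""
    address = address.lower()
    local, _, domain = address.partition("@")
    return bool(local) and domain == DOMAIN and address not in disabled


def who_leaked(recipient: str) -> Optional[str]:
    """Map the alias that received unwanted mail back to the friend who had it."""
    return alias_to_friend.get(recipient.lower())


spam_recipient = "elias2@yourdomain.com"
leaker = who_leaked(spam_recipient)   # the only person who was given this alias
disabled.add(spam_recipient)          # turn that alias off
```

Any made-up local part still reaches the mailbox, which is the "infinite addresses" point, while the alias that was exposed can be traced to one person and shut down without touching the others.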

    I know a guy who has 40,000+ social bookmarking accounts that he uses for gray-hat SEO… who owns those accounts? Does it matter as long as the services cannot tell these are all him due to defensive measures such as proxy plugins in Firefox, cleaning out cookies, etc.? Until they can detect and delete, or the service goes out of business, they are his to use I presume.

    Possession…

  2. 2 Elias Bizannes

    Good comment Alex, thanks.

    My perspective is that it’s not about owning the data, it’s who gets the economic benefits. Your point about possession has got me thinking, but let me extend it with an applied example (the e-mail address that Scoble loves).

    The economic benefit of that e-mail is on receiving mail via that address. I guess you could say you own the actual account like a piece of property. And whilst the data on that account may not be hosted on your own server, there are techniques available to pull your e-mail account. But the fact remains, the economic benefit is on receiving the e-mail - and you can get an e-mail redirect and change providers. You don’t have to possess that e-mail domain; only have rights to enable a redirect.

    The use of e-mail in the debate is partly about identification. There is value in using an e-mail as an identifier for a person - so as to be able to identify them on another system. But I don’t think you can control identification. It’s a bit like saying because I have brown hair, you are not allowed to use that to identify me. “Hair” is data - a noun - it has no meaning without context. No one owns that. And I can’t control the way you use that against me.

    Arrington’s reaction to the e-mail issue is because he wants to control what gets sent to him. This is because it represents an economic cost depending on what gets sent to him. However that’s a separate issue: the issue is who controls the benefits. The debate needs to focus on the benefits, and not the costs. And whilst it is valid for him to raise this, as spam is a major issue - it confuses the issue. Yes, people should be mindful about how his e-mail is used, but that’s called etiquette. At the end of the day, he has absolutely no control over the representation of his e-mail address. As the issue should be about controlling the economic benefits, there’s not much else we can extend on this.

  3. 3 Justin Davey

    I don’t think true data portability will fly anytime in the near future. If true ownership of data is in the right to deny the usage of that info by another party, we own very little of what we contribute to the world wide web. And I think that includes our identities. As long as monolithic companies like Google and Facebook remain convinced they can monetize “our” data, they will find a way to take ownership and control it. Even under the guise of data portability.

  4. 4 Crosbie Fitch

    Data is the property of the database owner - as I concluded recently here:
    http://www.digitalproductions.co.uk/index.php?id=117

    Inherent in the nature of relationships is that by our actions and speech we communicate information, and that information may be recorded as data by either party - beyond the other’s control. We can only hope to require that such data is only ever used to make truthful statements, and to assure corporations that their trustworthiness will ultimately be revealed (especially through indiscretion).

    The problem we’re having on the web is that corporations have created excellent recording facilities (databases) whereas individuals have not, and trade is currently fairly one-sided public vs corp. It doesn’t help either that corporations are immortal, whereas individuals are human.

    VRM will help rebalance the market.

  5. 5 Elias Bizannes

    Crosbie - you raise some interesting points. I understand what you mean about distinguishing corporations from humans due to the ethical angle, but don’t forget corporations do have a legal identity. The operation of how a corporation works is constrained, and is governed by humans - just because it represents human constituents doesn’t relegate it to the status of a robot with no boundaries.

    Just because someone stores the data, doesn’t mean they somehow get automatic ownership. When I store my cash at a bank, does that mean the bank that possesses my cash now owns it? No - and that’s because I have control over the benefits of that cash - which is its use. So whilst it is correct to say the entity hosting the data has possession, it does not mean that they have exclusive use of it. Like a bank, they can use that data for other activities, but ultimately, control of the usage rests with the consumer it relates to. The bank that stores my data does have the ability to generate economic returns by the fact they store it, but the ultimate decision about what happens with it is my own.

  6. 6 Crosbie Fitch

    I have only been arguing the fundamental nature of data - its natural foundation. Given that, we can then better understand the limitations of the unnatural structures we might build upon it.

    External data communications and storage services are not actually naturally private. A private conduit between two houses is naturally, mutually private, but all parties privy to the communications channel can’t undo the nature of their situation (even if they avert their gaze).

    We can use public key encryption, one time pads, etc. But the encrypted data nevertheless becomes privy to those whose hands it passes through.

    Those who are privy to our data can nevertheless offer assurance and guarantees that they will observe the utmost discretion, and avert their gaze. The state can even regulate corporations to audit and further assure this discretion - in the virtually private communications and storage facilities they provide.

    Some services say “We’ll observe discretion, but we will process and analyse all your data selling such anonymised analysis or utilising it ourselves to provide resulting services back to all of our customers whose data we store/communicate”

    Some ISPs are now saying “We’ll observe discretion, but we will inspect your communications in order to prioritise it depending upon content or communicants”.

    You then have a choice of requiring the state step in to ensure discretion and neutrality (subject to the state’s indiscretion and censorship), or letting ISPs stand or fall on their behaviour in a free market (that the state ensures is competitive) and leaving it to communicants to encrypt their communications if they don’t want them analysed.

    As to your money/bank analogy. Money is both data and contract (promissory note) though treated in the aggregate as a liquid commodity. The numeric amount in your bank account is data visible to the bank and other government agencies (tax, laundering, etc.). The actual value itself relies upon its non-reproducibility (audited to assure this) - one cannot spontaneously generate contracts (without committing fraud). Although one can copy a contract this doesn’t generate two contracts, but two records of one.

    Although money may appear to be data, it is a more complex animal. Money is not information or even intellectual property, but a contract.

    So your analogy works only so far as the data recording the details of your account is kept private to you and your bank (and anyone else they’re required to permit access). The discretion of banks is regulated a little more scrupulously than the discretion of web ventures like Facebook. And as you recognise, the liquid asset aspect of the money you have deposited with the bank is highly mobile behind the scenes - so that is not at all analogous with data you’d like to keep immobile in a bank’s safety deposit box.

    Speaking of which, Swiss banks are of course notorious for being the most secretive of all (not quite so secretive these days).

    However, I doubt that it would be a useful law that required all companies to be as secretive with their customers’ data as banks are. What would be better would be for companies to be prosecuted if they didn’t adhere to their offers of discretion (aka ‘privacy policies’). Thus no company is required to be discreet, but if they make an offer or assurance of discretion, then if there’s ever any evidence they’ve broken that offer, they can’t simply say “Oh, well, we tried. Sorry that we sold your data when we said we wouldn’t - we just couldn’t help ourselves.”. NB I would subject corporations to such law, but not people. People cannot alienate themselves from their right to free speech. Although people can offer guarantees of confidentiality, e.g. “I have deposited $100 as security against the event that I blab within the next year”.

  7. 7 Daniel Parker

    Elias, I want to express agreement and add a few thoughts.

    I am generally agreeing with what you are saying in principle: The economic benefits are the essence of the ownership of data/information. But the bank analogy brings up an interesting point: Both the bank and you are benefiting (hopefully!) from the contract and cash in the account. Your benefits are security, and most likely interest; the bank’s benefits are access to your cash in order to lend and gain interest themselves. I think this plays out very much alike in not only a few other instances, but in every other instance. Google is my typical example in the area of information — Google doesn’t own any data on the web but their own domains, but they make use of much of it by indexing it and generating for themselves new “information” in the form of a searchable database of the web. They are potentially reaping an economic benefit from every piece of data I post on the web, maybe even this comment I’m writing. And in this case, I may not have any economic benefit other than people noticing my name; and you may end up with more benefit than me — who knows? The issue of ownership of this comment might come back around to the control idea. Maybe ownership has more to do with who *controls the economic benefit*. You said it briefly in your reply to Alex: “However that’s a separate issue: the issue is who controls the benefits.” In the case of this comment, that would be you. I’ve given you this comment to do with it as you please. You can make use of a robots.txt to keep Google from their economic benefit, or you could delete it entirely to keep anyone but you from any benefit. So I think I’m beginning to see that a combination of Control and Economic Benefit reveals who is Exercising ownership of the data. But we must be careful, because Exercising ownership does not indicate the Right to Exercise Ownership. So the question becomes in many cases as clear as, “Who has the Right to Control the Economic Benefits of this data?”
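The robots.txt option mentioned above can be illustrated with Python's standard urllib.robotparser module. The rules and the URL below are made-up placeholders, not this blog's actual configuration, and robots.txt is only a request that well-behaved crawlers choose to honour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask Google's crawler to stay out, allow everyone else.
robots_txt = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

post_url = "https://example.com/what-is-data"  # placeholder URL
googlebot_allowed = parser.can_fetch("Googlebot", post_url)
other_allowed = parser.can_fetch("SomeOtherBot", post_url)
print(googlebot_allowed, other_allowed)
```

Under these assumed rules the site owner, not the commenter, decides which crawlers may index the page, which is the sense in which the host controls the economic benefit.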

    With a little more thinking, my policy may be settling in my mind on this issue. I wrote it out here then decided to blog it and link: http://blog.behindlogic.com/2008/09/ethics-of-information.html

    On a completely different subject, the email-as-identification ideas remind me of the current initiative Email-to-OpenID within the DataPortability project. Looking back, the common process of identifying and validating a person by email is exactly the same as the process of OpenID — a site sends the user to a third-party only to be sent back with the secret key that was sent through the third-party in order to verify ownership of identity. Just an interesting correlation.
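The email half of that comparison can be sketched as a secret round trip. This is a simplified illustration, not the OpenID protocol or the Email-to-OpenID design: the in-memory pending store and the function names are assumptions, and a real site would mail the token rather than return it.

```python
import secrets

pending = {}  # hypothetical in-memory store: token -> claimed address


def start_verification(email: str) -> str:
    """Site side: mint a secret and (in real life) mail it to the address."""
    token = secrets.token_urlsafe(16)
    pending[token] = email
    return token  # stands in for the message delivered by the mail provider


def finish_verification(email: str, token: str) -> bool:
    """User side: presenting the secret back shows control of the mailbox."""
    return pending.pop(token, None) == email


token = start_verification("user@example.com")
verified = finish_verification("user@example.com", token)
replayed = finish_verification("user@example.com", token)  # token is single-use
```

The mail provider plays the third-party role here: the site never sees the mailbox, it only learns that whoever came back with the secret could read mail sent to that address.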
