Data Leakage Detection

What is Data Leakage?

Data leakage is the unauthorized transmission of sensitive data or information from within an organization to an external destination or recipient.

Sensitive data of companies and organizations includes intellectual property, financial information, patient information, personal credit card data, and other information, depending on the business and the industry.

How does data leakage take place?

In the course of doing business, data must sometimes be handed over to trusted third parties for enhancement or other operations.

These trusted third parties may then act as points of data leakage.

Examples:

a) A hospital may give patient records to researchers who will devise new treatments.

b) A company may have partnerships with other companies that require sharing customer data.

c) An enterprise may outsource its data processing, so data must be given to various other companies.

[Diagram: development chains, supply chains, outsourcing, business hubs, demand chains]

The owner of the data is termed the distributor, and the third parties are called agents.

When data is leaked, the distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been gathered independently by other means.
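The slides do not say how this likelihood is computed. As a rough illustration only, the sketch below assumes a simple guilt model: each leaked object was either guessed independently with probability p, or leaked by one of the agents holding it, each holder being equally likely. The function name, the parameter p, and the sample data are hypothetical.

```python
def guilt_probability(leaked, allocations, agent, p=0.2):
    """Estimate the probability that `agent` leaked at least one object.

    Assumed model (not taken from the slides): a leaked object was guessed
    independently with probability p; otherwise it came from one of the
    agents holding it, each equally likely.
    """
    not_guilty = 1.0
    for obj in allocations[agent] & leaked:
        # Number of agents that were given this object.
        holders = sum(1 for objs in allocations.values() if obj in objs)
        # Probability that this particular object did NOT come from `agent`.
        not_guilty *= 1.0 - (1.0 - p) / holders
    return 1.0 - not_guilty


# Hypothetical data: U1 and U2 share t1; only U1 holds t2.
allocations = {"U1": {"t1", "t2"}, "U2": {"t1", "t3"}}
leaked = {"t1", "t2"}

for agent in allocations:
    print(agent, round(guilt_probability(leaked, allocations, agent), 2))
# U1 0.88
# U2 0.4
```

Under this assumed model, U1 looks more suspicious because t2 was given to U1 alone, which is the intuition behind keeping the agents' sets as disjoint as possible.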

Existing data leakage detection technique

Watermarking

Overview: A unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified.

Mechanism: The main idea is to generate a watermark W(x, y) using a secret key chosen by the sender, such that W(x, y) is indistinguishable from random noise for any entity that does not know the key (i.e., the recipients).

The sender adds the watermark W(x, y) to the information object I(x, y), forming a transformed object TI(x, y), before sharing it with the recipient(s).

It is then hard for any recipient to guess the watermark W(x, y) and subtract it from the transformed object TI(x, y).

The sender, on the other hand, can easily extract and verify the watermark because it knows the key.
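The additive scheme described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the information object is a list of numbers, the watermark is a keyed sequence of +1/-1 values drawn from Python's `random.Random` (which is not cryptographically secure), and the sender keeps the original object for comparison. The function names, keys, and threshold are hypothetical.

```python
import random


def make_watermark(key, length, strength=1.0):
    """Keyed pseudo-random +/-strength sequence; reproducible only with the key."""
    rng = random.Random(key)
    return [strength * rng.choice((-1.0, 1.0)) for _ in range(length)]


def embed(values, key, strength=1.0):
    """TI = I + W: add the keyed watermark to the information object."""
    w = make_watermark(key, len(values), strength)
    return [v + wi for v, wi in zip(values, w)]


def detect(original, suspect, key, strength=1.0, threshold=0.5):
    """Correlate the residual (suspect - original) with the keyed watermark."""
    w = make_watermark(key, len(original), strength)
    residual = [s - o for s, o in zip(suspect, original)]
    score = sum(r * wi for r, wi in zip(residual, w)) / (strength ** 2 * len(w))
    return score > threshold


# One secret key per recipient, so a leaked copy points back to its recipient.
original = [float(v) for v in range(200)]
keys = {"U1": 1001, "U2": 2002}
copies = {agent: embed(original, key) for agent, key in keys.items()}

leaked_copy = copies["U2"]
suspects = [agent for agent, key in keys.items()
            if detect(original, leaked_copy, key)]
print(suspects)
```

A copy marked with one key correlates strongly with that key's watermark and only weakly with any other, which is what lets the sender tell the recipients apart.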

Drawbacks of watermarking

First, it involves some modification of the data, that is, making the data less sensitive by altering its attributes.

Second, the watermarks can sometimes be destroyed if the recipient is malicious.

Thus we need a data leakage detection technique that fulfils the following objective and abides by the given constraints.

CONSTRAINTS

To satisfy agent requests by providing agents with the number of objects they request, or with all available objects that satisfy their conditions.

To avoid perturbing the original data before handing it to agents.

OBJECTIVE

To be able to detect an agent who leaks any portion of their data.

PROBLEM SETUP AND NOTATION

Entities and Agents: A distributor owns a set T = {t1, ..., tm} of valuable data objects. The distributor wants to share some of the objects with a set of agents U1, U2, ..., Un, but does not wish the objects to be leaked to other third parties.

The distributor distributes a set of records S to the agents based on their requests, which are either sample requests or explicit requests.

Sample request Ri = SAMPLE(T, mi): Any subset of mi records from T can be given to Ui.

Explicit request Ri = EXPLICIT(T, condition): Agent Ui receives all objects in T that satisfy the condition.
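The two request types can be illustrated with a short sketch. It assumes that records are tuples and that a condition is an ordinary predicate function; the function names and the sample records are hypothetical.

```python
import random


def sample_request(T, m, rng=None):
    """SAMPLE(T, m): any subset of m records from T is acceptable."""
    rng = rng or random.Random()
    # random.sample needs a sequence, so the set is sorted into a list first.
    return set(rng.sample(sorted(T), m))


def explicit_request(T, condition):
    """EXPLICIT(T, condition): every record of T that satisfies the condition."""
    return {t for t in T if condition(t)}


# Hypothetical customer records: (customer id, state).
T = {(1, "CA"), (2, "NY"), (3, "CA"), (4, "TX")}

r1 = sample_request(T, 2, random.Random(7))         # any 2 records
r2 = explicit_request(T, lambda t: t[1] == "CA")    # all California records
print(sorted(r2))
# [(1, 'CA'), (3, 'CA')]
```

The difference matters for leak detection: with a sample request the distributor chooses which records to hand out, while an explicit request fixes the answer completely.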

Fake Objects: Fake objects are objects generated by the distributor that are not in set S. They are designed to look like real objects and are distributed to agents together with the S objects in order to increase the chances of detecting agents that leak data.

Data Allocation Problem: “How can the distributor intelligently give data to agents in order to improve the chances of detecting a guilty agent?”

There are four instances of this problem, depending on the type of data requests made by agents (sample or explicit) and whether “fake objects” are allowed.

Data leakage problem instances:

Distribution strategies

Sample data requests:

• The distributor has the freedom to select the data items to provide to the agents.

• General idea: provide agents with sets of data that are as disjoint as possible.

• Problem: there are cases where the distributed data must overlap, e.g., when |R1| + … + |Rn| > |T|.

Explicit data requests:

• The distributor must provide agents with the data they request.

• General idea: add fake data to the distributed data to minimize the overlap of distributed data.

• Problem: agents can collude and identify the fake data.
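For sample requests, the disjoint-allocation idea and its limit can be shown with a small sketch. The helper below is hypothetical: it hands out consecutive blocks of the sorted objects, and returns None when the requested sizes add up to more than |T|, since some overlap is then unavoidable.

```python
def allocate_disjoint(T, sizes):
    """Give agent i a block of sizes[i] objects; blocks are pairwise disjoint.

    Returns None when the requests exceed |T| in total, because the
    distributed sets must then overlap.
    """
    if sum(sizes) > len(T):
        return None
    items = sorted(T)
    allocation, start = [], 0
    for m in sizes:
        allocation.append(set(items[start:start + m]))
        start += m
    return allocation


print(allocate_disjoint({1, 2, 3, 4, 5}, [2, 2]))   # [{1, 2}, {3, 4}]
print(allocate_disjoint({1, 2, 3}, [2, 2, 2]))      # None
```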

Algorithm

Evaluation of Sample Data Request (selecting the next object tk to give to agent Ui):

1: Initialize min_overlap ← 1, the minimum out of the maximum relative overlaps that the allocations of different objects to Ui yield.

2: for k ∈ {k | tk ∉ Ri} do
     Initialize max_rel_ov ← 0, the maximum relative overlap between Ri and any set Rj that the allocation of tk to Ui yields.

3:   for all j = 1, ..., n : j ≠ i and tk ∈ Rj do
       Calculate absolute overlap as abs_ov ← |Ri ∩ Rj| + 1
       Calculate relative overlap as rel_ov ← abs_ov / min(mi, mj)

4:     Find maximum relative overlap as max_rel_ov ← MAX(max_rel_ov, rel_ov)

     if max_rel_ov ≤ min_overlap then
       min_overlap ← max_rel_ov
       ret_k ← k

Return ret_k

For example: T = {1, 2, 3}, U = {a, b, c}, and Ri = SAMPLE(T, 2) for each agent i ∈ {a, b, c}.
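The object-selection steps above can be sketched in Python as follows. The selection function follows the listed steps, with agents indexed from 0. The round-robin driver around it is an assumption, since the slides only describe how a single object is chosen. The example is read here as three objects and three agents that each request two objects.

```python
def select_object(i, R, m, T):
    """Pick the object whose allocation to agent i keeps the maximum
    relative overlap with every other agent as small as possible."""
    min_overlap = 1.0
    ret_k = None
    for k in sorted(T - R[i]):              # objects agent i does not hold yet
        max_rel_ov = 0.0
        for j in range(len(R)):
            if j != i and k in R[j]:
                abs_ov = len(R[i] & R[j]) + 1        # overlap after adding k
                rel_ov = abs_ov / min(m[i], m[j])
                max_rel_ov = max(max_rel_ov, rel_ov)
        if max_rel_ov <= min_overlap:
            min_overlap = max_rel_ov
            ret_k = k
    return ret_k


def allocate_samples(T, m):
    """Assumed driver: serve agents round-robin, one object per turn."""
    R = [set() for _ in m]
    while any(len(R[i]) < m[i] for i in range(len(m))):
        for i in range(len(m)):
            if len(R[i]) < m[i]:
                R[i].add(select_object(i, R, m, T))
    return R


T = {1, 2, 3}
m = [2, 2, 2]          # agents a, b, c each request two objects
R = allocate_samples(T, m)
print(R)
```

Because 2 + 2 + 2 > |T|, overlap cannot be avoided here; the selection rule spreads it out so that every pair of agents shares exactly one object.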

Algorithm for explicit data request

Evaluation of Explicit Data Request:

1: Calculate the total number of fake records as the sum of the fake records allowed.

2: While total fake records > 0:

3:   Select the agent that will yield the greatest improvement in the sum objective, i.e., i = argmax_i (1/|Ri| − 1/(|Ri| + 1)) Σj |Ri ∩ Rj|

4:   Create a fake record.

5:   Add this fake record to the agent's set and also to the fake record set.

6:   Decrement the total number of fake records.
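A minimal Python sketch of the loop above, under several assumptions that are not in the slides: every agent has a per-agent limit on fake records, the overlap sum runs over the other agents (j ≠ i), every Ri is non-empty, and fake records are simply labeled strings. All names are hypothetical.

```python
import itertools


def select_agent(candidates, R):
    """Agent for whom one more fake record improves the sum objective most."""
    def gain(i):
        # Assumption: the overlap sum runs over the other agents only.
        overlap = sum(len(R[i] & R[j]) for j in R if j != i)
        return (1 / len(R[i]) - 1 / (len(R[i]) + 1)) * overlap
    return max(sorted(candidates), key=gain)


def allocate_fakes(R, limits, total_fakes=None):
    """Add fake records to the agents' sets; returns (new sets, fake sets)."""
    R = {i: set(objs) for i, objs in R.items()}     # work on a copy
    F = {i: set() for i in R}
    remaining = dict(limits)
    if total_fakes is None:                         # step 1
        total_fakes = sum(remaining.values())
    candidates = {i for i in R if remaining.get(i, 0) > 0}
    counter = itertools.count(1)
    while total_fakes > 0 and candidates:           # step 2
        i = select_agent(candidates, R)             # step 3
        fake = f"fake-{next(counter)}"              # step 4
        R[i].add(fake)                              # step 5
        F[i].add(fake)
        remaining[i] -= 1
        if remaining[i] == 0:
            candidates.discard(i)
        total_fakes -= 1                            # step 6
    return R, F


requests = {"U1": {"t1", "t2"}, "U2": {"t1"}}
new_R, fakes = allocate_fakes(requests, {"U1": 1, "U2": 1}, total_fakes=1)
print(fakes)
# {'U1': set(), 'U2': {'fake-1'}}
```

With a budget of one fake record, the rule gives it to U2: that agent holds the smaller set, so one extra record dilutes its relative overlap with U1 the most.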

Future Scope

Future work includes investigating agent guilt models that capture leakage scenarios not yet considered.

It also includes extending the data allocation strategies so that they can handle agent requests in an online fashion.

Limitations

The presented strategies assume that there is a fixed set of agents with requests known in advance.

The distributor may have a limit on the number of fake objects.

Applications

It helps detect whether the distributor’s sensitive data has been leaked by trusted or authorized agents.

It helps identify the agents who leaked the data.

It helps reduce cybercrime.

Conclusion

Although leakers can be identified using the traditional technique of watermarking, certain data cannot admit watermarks.

In spite of these difficulties, it is possible to assess the likelihood that an agent is responsible for a leak.

We observed that distributing data judiciously, using the different data allocation strategies, can make a significant difference in identifying guilty agents.