data manipulation - fabrice...
TRANSCRIPT
Data Manipulation
Fabrice Rossi
CEREMADEUniversité Paris Dauphine
2019
Data Manipulation
In this courseI tabular dataI elementary extension to
multiple-table dataI data transformation
I wranglingI filteringI ordering
I data aggregation andsummary
I tidy data and reshaping
In other coursesI database management
systemI data modelsI relational dataI unstructured data
2
Data Model
In this courseI a data set is
I a (finite) set of entities (a.k.a. objects, instances, subjects)I each entity is described by its values with respect to a fix set of
variables (a.k.a. attributes)I in practice a data set is a table with
I a row per entityI a column per variable
ExtensionI multiple-table dataI a data set = several tables
3
Example
age job marital education default balance housing1 30 unemployed married primary no 1787 no2 33 services married secondary no 4789 yes3 35 management single tertiary no 1350 yes4 30 management married tertiary no 1476 yes5 59 blue-collar married secondary no 0 yes6 35 management single tertiary no 747 no7 36 self-employed married tertiary no 307 yes8 39 technician married secondary no 147 yes9 41 entrepreneur married tertiary no 221 yes
10 43 services married primary no -88 yes11 39 services married secondary no 9374 yes12 43 admin. married secondary no 264 yes13 36 technician married tertiary no 1109 no14 20 student single secondary no 502 no15 31 blue-collar married secondary no 360 yes16 40 management married tertiary no 194 no17 56 technician married secondary no 4073 no18 37 admin. single tertiary no 2317 yes19 25 blue-collar single primary no -221 yes20 31 services married secondary no 132 no
4
Variable types
NumericalI essentially “physical”
measurementsI integer or decimalI easier to handle than the
other types
CategoricalI a.k.a. Nominal (factors and
levels in R)I finite number of values
(called categories ormodalities)
I might be ordered
Dates and timesI very important in numerous
applicationsI notoriously difficult to handleI use specific libraries!
Short textsI a.k.a. stringsI could be handled as
categorical dataI specific processing in some
casesI do not confuse them with full
texts
5
Example
Bank datasetI sources
I https://archive.ics.uci.edu/ml/datasets/Bank%2BMarketing
I http://hdl.handle.net/1822/14838
I data typesI age: integerI balance: integerI education: categorical semi orderedI most of the others: categorical with some binary
6
Data Management
Data manipulation softwareI typical examples: R with tidyverse or python with pandasI limited automatic support for enforcing complex data models
I declarative support for broad typesI constraints can be checked explicitly
very complex constraints can be enforcederror/bug pronedifficult to read
I documentation is needed
7
Outline
Introduction
Data transformation
Data grouping and summarizing
Tidy data
Multiple data tables
8
Subsetting
Working on a subpopulationI a.k.a by “removing” rowsI several motivations
I speed (for large data sets)I robustness (removing outliers)I modeling
I generally called filteringI declarative approach in R and python
I give me the subset of the data that fulfills some conditionsI supported by comparison and Boolean operators
9
Example
Bank data setMarried clients with secondary education level in their thirties (agebetween 30 and 39 included)
Python (pandas)bank[ (bank.marital == 'married') &
(bank.education == 'secondary') &(bank.age >= 30) &(bank.age < 40) ]
R (dplyr)
bank %>% filter(marital == "married",education == "secondary",age >= 30,age < 40)
10
Example
Bank data setMarried clients with secondary education level in their thirties (agebetween 30 and 39 included)
Python (pandas)bank[ (bank.marital == 'married') &
(bank.education == 'secondary') &(bank.age >= 30) &(bank.age < 40) ]
R (dplyr)
bank %>% filter(marital == "married",education == "secondary",age >= 30,age < 40)
10
Computational considerations
Running timeI filtering is a row oriented operationI naive implementation
I browse the data row by rowI keep a row if the conditions are fulfilled
I run time proportional to the number of rows in the dataI can be improved in some cases via indexing
Do not program it yourself!I far less efficientI less readable
11
Computational considerations
Running timeI filtering is a row oriented operationI naive implementation
I browse the data row by rowI keep a row if the conditions are fulfilled
I run time proportional to the number of rows in the dataI can be improved in some cases via indexing
Do not program it yourself!I far less efficientI less readable
11
Dropping Variables
Column oriented subsettingI two main motivations
I to restrict the data set to variable types compatible with sometechnique
I to restrict the data set to meaningful variables for an automatedanalysis (e.g. clustering or predictive modeling)
I simple declarative approach
12
Example
Bank data setKeep some numerical variables
Python (pandas)bank[['age', 'balance', 'day', 'duration']]
R (dplyr)
bank %>% select(age, balance, day, duration)
13
Example
Bank data setKeep some numerical variables
Python (pandas)bank[['age', 'balance', 'day', 'duration']]
R (dplyr)
bank %>% select(age, balance, day, duration)
13
Ordering
SortingI standard sorting featureI multiple criteria
Pythonbank.sort_values(by=['age',
'balance'])
Rbank %>% arrange(age, balance)
14
Transformation of Variables
Variable OperationsI modifying a variableI adding new variables from
other sourcesI computing new variables
(based on existing ones)
Data WranglingI low level transformationI recodingI extraction and mergingI etc.
Data ManagementI context variablesI enforcing data model
Preparing AnalysisI e.g. recoding categorical to
numericalI or quantifying numerical
variablesI scaling, normalizationI merging categories
15
Computed variables
PrincipleI row oriented calculationI create a new variable using the existing onesI e.g. duration from starting and ending timesI combines nicely with aggregation/summary functions
SupportI numerous statistical summary functions (column oriented)I column oriented arithmetic (e.g. sum of columns)I column oriented logical operations (e.g. comparison)I function application (e.g. to each row)
16
Example
Bank data setBinary variable telling whether some client has some characteristicsI more than average mean annual balanceI at least one loan
Pythonbank['moreavg'] = bank['balance'] > bank['balance'].mean()bank['oneormoreloan'] = (bank['loan'] == 'yes') | (bank['housing'] == 'yes')
Rbank %>% mutate(moreavg = balance > mean(balance))bank %>% mutate(oneormoreloan = loan=="yes" | housing=="yes")
17
Example
One Hot EncodingI typical preparatory transformationI categorical variable turn into a set of binary variablesI e.g. Gender=Male or Female transformed into GenderMale and
GenderFemale (binary)
Pythonbank.join(pd.get_dummies(bank['education']))bank.join(pd.get_dummies(bank['education'])).drop('education',axis=1)
Rbank %>%
bind_cols(as.data.frame(model.matrix(~education-1,data=bank))) %>%select(-education)
18
Example
age job marital education default balance housing1 30 unemployed married primary no 1787 no2 33 services married secondary no 4789 yes3 35 management single tertiary no 1350 yes4 30 management married tertiary no 1476 yes5 59 blue-collar married secondary no 0 yes6 35 management single tertiary no 747 no7 36 self-employed married tertiary no 307 yes8 39 technician married secondary no 147 yes9 41 entrepreneur married tertiary no 221 yes
10 43 services married primary no -88 yes11 39 services married secondary no 9374 yes12 43 admin. married secondary no 264 yes13 36 technician married tertiary no 1109 no14 20 student single secondary no 502 no15 31 blue-collar married secondary no 360 yes16 40 management married tertiary no 194 no17 56 technician married secondary no 4073 no18 37 admin. single tertiary no 2317 yes19 25 blue-collar single primary no -221 yes20 31 services married secondary no 132 no
19
Example
age job marital primary secondary tertiary unknown1 30 unemployed married 1 0 0 02 33 services married 0 1 0 03 35 management single 0 0 1 04 30 management married 0 0 1 05 59 blue-collar married 0 1 0 06 35 management single 0 0 1 07 36 self-employed married 0 0 1 08 39 technician married 0 1 0 09 41 entrepreneur married 0 0 1 0
10 43 services married 1 0 0 011 39 services married 0 1 0 012 43 admin. married 0 1 0 013 36 technician married 0 0 1 014 20 student single 0 1 0 015 31 blue-collar married 0 1 0 016 40 management married 0 0 1 017 56 technician married 0 1 0 018 37 admin. single 0 0 1 019 25 blue-collar single 1 0 0 020 31 services married 0 1 0 0
20
Improving Representation
Data ManagementI convert values to proper
types, e.g.I integer only to language
supported integerI data as string to language
supported dateI string to semantic content
I proprer encoding of missingdata
I add diagnostic data
Nominal dataI important particular caseI frequently represented by
stringsI loss of efficiencyI cannot leverage automatic
handling (in R in particular)
21
Example
Data RepresentationI make sure that nominal variables are recognized as suchI yes/no nominal variables can be encoded as logical variables
Pythonfor myvar in ['job', 'marital', 'education']:
bank[myvar] = bank[myvar].astype('category')for myvar in ['default', 'housing', 'loan']:
bank[myvar] = bank[myvar] == 'yes'
Rbank %>% mutate_at(vars(job,marital,education), ~(as.factor(.)))bank %>% mutate_at(vars(default,housing,loan), ~(. == "yes"))
22
Example
Unknown valuesI specific “unknown” category in the bank dataset (for several
variables)I not considered “special” by the software while it should be
PythonNumpy provides a special value nan for not availablebank.replace("unknown",np.nan)
RSimilar special value NA
bank %>% mutate_all(~replace(., . == "unknown", NA))
23
Outline
Introduction
Data transformation
Data grouping and summarizing
Tidy data
Multiple data tables
24
Conditional analysis
Finding dependencies and linksOne of the main goal of data analysis, e.g.I predictive models: links between target variables and explanatory
variablesI frequent patterns: variables that are frequently non zero at the
same timeI etc.
Conditional summariesI chose one or more variablesI for each possible combination of the values of the chosen
variablesI find all corresponding objects in the data setI compute a summary of the other variables on this subset
25
Mechanism
X Y4 C2 B2 C3 C4 B1 C4 C4 A3 B3 A1 A1 B1 A3 B2 C
Split
Y XA 4A 3A 1A 1
Y XB 2B 4B 3B 1B 3
Y XC 4C 2C 3C 1C 4C 2
Apply
Y SumA 9
Y SumB 13
Y SumC 16
CombineY SumA 9B 13C 16
26
Example
Bank datasetI balance conditioned on the response to the marketing campaignI age conditioned on marital status and education level
Pythonbank.groupby('y')['balance'].median()bank.groupby(['marital', 'education'])['age'].mean()
Rbank %>% group_by(y) %>% summarize(median_balance = median(balance))bank %>% group_by(marital, education) %>%
summarize(mean_age = mean(age))
27
Example
Balance versus marketing
y median_balanceno 419.50yes 710.00
Age versus marital andeducation
marital education mean_agedivorced primary 51.39divorced secondary 43.50divorced tertiary 45.15divorced missing 50.38married primary 47.51married secondary 42.40married tertiary 41.78married missing 48.44single primary 37.01single secondary 33.05single tertiary 34.51single missing 34.65
28
Pivot Table
Conditioning by two variablesI useful special caseI we can leverage standard tabular representation
I compute a standard aggregated table with two conditioning variablesand a single aggregate
I “pivot” the tableI remove one of conditioning variable and the aggregateI creates as many columns as they are values of the removed variableI use the aggregate to populate cells
29
Mechanism
X Y Z2 B U2 C V3 A U2 B U1 C U4 C V3 B U4 C U1 B V3 A U2 A V4 A U3 A V4 B U3 B U
S-A-C
Y Z SumA U 10A V 5B U 14B V 1C U 5C V 6
PivotY/Z U VA 10 5B 14 1C 5 6
30
Example
Age versus marital and education
marital education mean_agedivorced primary 51.39divorced secondary 43.50divorced tertiary 45.15divorced missing 50.38married primary 47.51married secondary 42.40married tertiary 41.78married missing 48.44single primary 37.01single secondary 33.05single tertiary 34.51single missing 34.65
marital/education primary secondary tertiary missingdivorced 51.39 43.50 45.15 50.38married 47.51 42.40 41.78 48.44single 37.01 33.05 34.51 34.65
31
Howto
PythonSpecific support for Pivot tablesbank.pivot_table('age', index = 'marital', columns = 'education')
RA particular case of table reshaping
bank %>% group_by(marital, education) %>%summarize(mean_age = mean(age)) %>% spread(key = education,value = mean_age)
32
Multidimensional analysis (MDA)
Pivot (hyper)cubeI a pivot table with more than 2 “dimensions” (e.g. a pivot cube)I specific vocabulary:
I a dimension: a variable with a finite set of possible valuesI a measure: a numerical variableI a cell contains aggregate values for objects with given values for the
dimension
a very convoluted way of presenting conditional analysisrich possibilities when the set of values of a “dimension” isstructured (e.g. postcodes)rich support with specific OLAP software (not in this part of thecourse)
33
Example
Bank data setI Possible dimensions: job, marital, education, housing, loanI Possible measures: age and balanceI A cell (such as unemployed, married, primary education, no
housing and no loan) contains the average age and the medianbalance for the persons with the specified values on thedimensions
Tabular point of view
job marital education housing loan mean_age median_balanceadmin. divorced primary no no 57.00 1.00admin. divorced primary no yes 56.00 0.00admin. divorced primary yes no 57.00 179.00admin. divorced secondary no no 41.80 432.00admin. divorced secondary no yes 45.83 175.00
and 337 additional rows...
34
Example
Tabular viewy housing loan nno no no 1394no no yes 267no yes no 1958no yes yes 381yes no no 283yes no yes 18yes yes no 195yes yes yes 25
MDA view
y="no"housing/loan no yesno 1394 267yes 1958 381
y="yes"housing/loan no yesno 283 18yes 195 25
arranged as a cube!
35
Outline
Introduction
Data transformation
Data grouping and summarizing
Tidy data
Multiple data tables
36
Tidy Data
Concept introduced by Hadley Wickham the Tidy Data paper
DefinitionA data set made of several data tables is tidy if
1. each variable forms a column of a table2. each observation forms a row of a table3. each type of observational unit forms a table
Observational unitI a particular type of observations in a data setI e.g.
I personsI daily behavior of personsI etc.
37
Example
Bank data setI mixes person information and marketing campaign information
(and economic variables in an extended version!)I marketing campaign data
I last contact dataI contacts during the campaignI summary of previous campaigns!
I clearly untidyI violates property 3I multiple types of observational unit in a single table
38
Example
Pivot tablehousing/loan no yesno 1677 285yes 2153 406
I intrinsically untidyI columns are not variables but variable values!
Summary table
loan housing nno no 1677no yes 2153yes no 285yes yes 406
I tidy dataI new observational unit: groups of observations!
39
Example
Pivot tablehousing/loan no yesno 1677 285yes 2153 406
I intrinsically untidyI columns are not variables but variable values!
Summary table
loan housing nno no 1677no yes 2153yes no 285yes yes 406
I tidy dataI new observational unit: groups of observations!
39
Tidying Data
Splitting or JoiningI observational unit levelI splitting
I separates different observational unit into several tablesI filtering/selecting + linking
I joiningI merges several tables about the same observational unitI specific tools (see the last part of this course)
40
Tidying Data
Gathering or SpreadingI variable/observation levelI specific toolsI gathering
I reduces the number of columnsI merges several columns in a single one that corresponds to a proper
variableI spreading
I increases the number of columnsI splits a column into several ones that correspond to proper variables
41
Example (Splitting)
Splitting bank marketing dataSteps:
1. add an identifier to each row (to identify clients)2. select variables associated to each observational unit, keeping the
id2.1 persons2.2 current campaign (i.e. last contact)2.3 previous campaigns
3. clean the tables (remove useless rows, e.g. when no previouscampaign is available)
42
Example (Splitting)
PythonI pandas data frames have always row identifiersI splitting and cleaning
bank.persons = bank[['age', 'job', 'marital', 'education', 'default','balance', 'housing', 'loan']]
bank.current = bank[['contact', 'day', 'month', 'duration','campaign']]
bank.previous = bank[bank['pdays'] != -1][['pdays', 'previous','poutcome']]
43
Example (Splitting)
R1. identifier
bank.tidy <- bank %>% mutate(key = 1:nrow(bank))
2. selection with cleanupbank.persons <- bank.tidy %>% select(key, age, job, marital,
education, default, balance, housing, loan)bank.current <- bank.tidy %>% select(key, contact, day, month,
duration, campaign)bank.previous <- bank.tidy %>% filter(pdays != -1) %>% select(key,
pdays, previous, poutcome)
44
Spreading data
Replace variable encoded on rows by columnsI operates on two variables in the original table: a key and a valueI each value taken by the key becomes a columnI the value variable is used to fill the column
Original table
X Y Z1 A 21 B 32 A 42 B 5
Spread tableY is the key, Z is the value
X A B1 2 32 4 5
45
Example (Spreading)
CalIt2 datasetI flow (in number of persons) in and out a buildingI direction encoded in the flow column
flow date time count7 2005-07-24 00:00:00 09 2005-07-24 00:00:00 07 2005-07-24 00:30:00 19 2005-07-24 00:30:00 07 2005-07-24 01:00:00 09 2005-07-24 01:00:00 07 2005-07-24 01:30:00 09 2005-07-24 01:30:00 07 2005-07-24 02:00:00 09 2005-07-24 02:00:00 0
I untidy: flow is not a variable!I spreading is needed
46
Example (Spreading)
CalIt2 datasetI flow (in number of persons) in and out a buildingI tidy version
date time entering leaving2005-07-24 00:00:00 0 02005-07-24 00:30:00 1 02005-07-24 01:00:00 0 02005-07-24 01:30:00 0 02005-07-24 02:00:00 0 02005-07-24 02:30:00 2 02005-07-24 03:00:00 0 02005-07-24 03:30:00 0 02005-07-24 04:00:00 0 02005-07-24 04:30:00 0 0
47
Example (Spreading)
PythonI spreading can be done by leveraging the indexing systemI hierarchical indexing in this case
calit.set_index(['date', 'time'], inplace = True)tcalit = calit.pivot(columns='flow')tcalit.columns = tcalit.columns.to_flat_index()tcalit.rename(columns={('count', 7):'entering',
('count', 9): 'leaving'},inplace=True)
tcalit.reset_index(inplace = True)
I this is somewhat convoluted
48
Example (Spreading)
PythonI spreading can be also be done with pivot_table
tcalit = calit.pivot_table(values='count',index=['date', 'time'],columns='flow')
tcalit.rename(columns={7: 'entering', 9: 'leaving'},inplace=True)
tcalit.reset_index(inplace=True)
I a bit simpler
RStandard use of the spread function
calit %>% spread(flow, count) %>% rename(entering = `7`, leaving = `9`)
49
Gathering data
GatherI gathering is the reverse of spreadingI it reduces the number of columns by encoding them as a series of
rows and two new columns/variablesI the new key variable encode the gathered columns while the new
value variable contains the original value
Original table
Gather X, Y and ZW X Y Za 1 2 3b 5 6 7
Spread table
W K Va X 1a Y 2a Z 3b X 5b Y 6b Z 7
50
Example (Gathering)
Sales transaction data setI weekly product salesI one row per product, one column per week: product view
Product_Code W0 W1 W2 W3 W4 W5P1 11.00 12.00 10.00 8.00 13.00 12.00P2 7.00 6.00 3.00 2.00 7.00 1.00P3 7.00 11.00 8.00 9.00 10.00 8.00P4 12.00 8.00 13.00 5.00 9.00 6.00P5 8.00 5.00 13.00 11.00 6.00 7.00P6 3.00 3.00 2.00 7.00 6.00 3.00P7 4.00 8.00 3.00 7.00 8.00 7.00P8 8.00 6.00 10.00 9.00 6.00 8.00P9 14.00 9.00 10.00 7.00 11.00 15.00P10 22.00 19.00 19.00 29.00 20.00 16.00
with 53 columns and 811 rows
51
Example (Gathering)
Gather the weeksI gather all the columns except the product oneI one observation: (product, week)
Product_Code Week QuantityP1 1 11P2 1 7P3 1 7P4 1 12P5 1 8P6 1 3P7 1 4P8 1 8P9 1 14P10 1 22
with 42162 more rows
52
Example (Gathering)
PythonUsing the melt functionprodlong = pd.melt(prodperweek, 'Product_Code')prodlong.rename(columns={'variable': 'Week',
'value': 'Quantity'},inplace=True)
prodlong['Week'] = pd.to_numeric(prodlong['Week'].str[1:])+1
RStandard use of the gather function
prodperweek %>% gather(Week, Quantity, -Product_Code) %>%mutate(Week = parse_number(Week) + 1)
53
Tidy Data
Modeling assumptionsI data are tidy only with respect to some modeling assumptionsI what is an observation?
I maybe the most important assumptionI very frequently associated to an independence assumption
Practical aspectsI the data format must be adapted to the toolI data mining and machine learning
I generally limited to a single tableI one might need to merge tables (e.g. bank data set)
I column oriented software (R)I have limited row oriented capabilitiesI might need a specific untidy representation
54
Tidy Data
ExamplesI bank marketing: several object types or only one (person +
marketing action)I flow in/out a building
I the half-hour bidirectional flow is an observationI or a day is an observation
I weekly salesI product point of view (original data set)I product × week point of viewI week point of view
55
Outline
Introduction
Data transformation
Data grouping and summarizing
Tidy data
Multiple data tables
56
Multiple data tables
Tidy dataI one table per observational
unitI complex data
I multiple observational units(e.g. persons andproducts)
I multiple tables!
DifficultiesI the vast majority of data
analysis methods are limitedto single tables
I complex real world data usemultiple table!
I a core data manipulationtask: join multiple tables intodata analysis oriented tables
57
Example
Loan application data setI https://relational.fit.cvut.cz/dataset/Financial
I 8 tables includingI client tableI account tableI credit card tableI loan tableI etc.
I open ended data set: no specific goal
58
Associations and keys
Relational dataI tables must be related one to anotherI in the relational model a table is a relationI some relations describe entities while others describe links
between entities
KeysI a key is a (set of) variable(s) that uniquely identifies an entityI a primary key does that in the relation/table that describe the entityI a foreign key does that in another relation/table
59
Loan application data set
Client tableclient_id gender birth_date district_id
1 F 1970-12-13 182 M 1945-02-04 13 F 1940-10-09 14 M 1956-12-01 55 F 1960-07-03 5
KeysI primary key client_idI foreign key
district_id
Account tableaccount_id district_id date
1 18 1995-03-242 1 1993-02-263 5 1997-07-074 12 1996-02-215 15 1997-05-30
KeysI primary key account_idI foreign key
district_id
60
Loan application data set
Disposition tabledisp_id client_id account_id type
1 1 1 OWNER2 2 2 OWNER3 3 2 DISPONENT4 4 3 OWNER5 5 3 DISPONENT
KeysI primary key disp_idI foreign keys client_id
and account_id
Link tableI the disposition table/relation is a typical example of a link tableI it associates clients with accountsI the link is also an entity as it has characteristics
61
Loan application data set
District tabledistrict_id Name Region Inhabitants A5 A6 A7 A8
1 Hl.m. Praha Prague 1204953 0 0 0 12 Benesov central Bohemia 88884 80 26 6 23 Beroun central Bohemia 75232 55 26 4 14 Kladno central Bohemia 149893 63 29 6 25 Kolin central Bohemia 95616 65 30 4 16 Kutna Hora central Bohemia 77963 60 23 4 27 Melnik central Bohemia 94725 38 28 1 38 Mlada Boleslav central Bohemia 112065 95 19 7 1
Direct linksI no link table from accounts and clients to the district tableI district_id is used as a foreign key in the account and client
tables
62
Joining tables
Main operationsI we need to build unique tables that gather information from
separate onesI this is done via join operations
I identifying matching entities in two different tablesI generating tables which combine variables from said tables for
matched entities
ExampleI joining client information with account informationI joining client information with district information
63
Basic case
PrincipleI input
I two tables A and BI A contains a variable V which is a foreign keyI B contains the same variable V as its primary key
I resultI a table with all the variables in A and B (no repeat)I such that each entity in A is merged with the entity in B referenced
by the value of V
64
Example
Client with district informationI join the client table with the district tableI district_id: foreign key in the client table, primary key in the
district tableI (part of the) resultclient_id gender birth_date district_id Name Region Inhabitants A5 A6 A7
1 F 1970-12-13 18 Pisek south Bohemia 70699 60 13 22 M 1945-02-04 1 Hl.m. Praha Prague 1204953 0 0 03 F 1940-10-09 1 Hl.m. Praha Prague 1204953 0 0 04 M 1956-12-01 5 Kolin central Bohemia 95616 65 30 45 F 1960-07-03 5 Kolin central Bohemia 95616 65 30 4
Pythonpd.merge(client, district)
Rclient %>% inner_join(district)
Common semantics: natural joinI common columns/variables are considered as a key
65
Join types
Missing keysI missing foreign keysI unreferenced foreign keysI wrong foreign keys
How to build the join table?I intersection approachI missing data approach
Inner joinI most common solutionI keeps only full rowsI discard rows that would have
missing values
Outer joinsI produce tables with missing
dataI full or asymmetric (left or
right) joins
66
Join types
Missing keysI missing foreign keysI unreferenced foreign keysI wrong foreign keys
How to build the join table?I intersection approachI missing data approach
Inner joinI most common solutionI keeps only full rowsI discard rows that would have
missing values
Outer joinsI produce tables with missing
dataI full or asymmetric (left or
right) joins
66
All cases
Left tablex y
-3 14 25 NA
Missing foreign key
Right tabley z1 a2 b3 c
Unreferenced key
Inner joinx y z
-3 1 a4 2 b
Only full rows
Full outer joinx y z
-3 1 a4 2 b5 NA NA
NA 3 c
All combinations
Left outer joinx y z
-3 1 a4 2 b5 NA NA
All rows from the lefttable
Right outer joinx y z
-3 1 a4 2 b
NA 3 c
All rows from theright table
67
All cases
Left tablex y
-3 14 25 NA
Missing foreign key
Right tabley z1 a2 b3 c
Unreferenced key
Inner joinx y z
-3 1 a4 2 b
Only full rows
Full outer joinx y z
-3 1 a4 2 b5 NA NA
NA 3 c
All combinations
Left outer joinx y z
-3 1 a4 2 b5 NA NA
All rows from the lefttable
Right outer joinx y z
-3 1 a4 2 b
NA 3 c
All rows from theright table
67
All cases
Left tablex y
-3 14 25 NA
Missing foreign key
Right tabley z1 a2 b3 c
Unreferenced key
Inner joinx y z
-3 1 a4 2 b
Only full rows
Full outer joinx y z
-3 1 a4 2 b5 NA NA
NA 3 c
All combinations
Left outer joinx y z
-3 1 a4 2 b5 NA NA
All rows from the lefttable
Right outer joinx y z
-3 1 a4 2 b
NA 3 c
All rows from theright table
67
All cases
Left tablex y
-3 14 25 NA
Missing foreign key
Right tabley z1 a2 b3 c
Unreferenced key
Inner joinx y z
-3 1 a4 2 b
Only full rows
Full outer joinx y z
-3 1 a4 2 b5 NA NA
NA 3 c
All combinations
Left outer joinx y z
-3 1 a4 2 b5 NA NA
All rows from the lefttable
Right outer joinx y z
-3 1 a4 2 b
NA 3 c
All rows from theright table
67
Implementation
PythonI merge functionI pd.merge(left, right):
inner joinI parameters
I how: join type among'left', 'right','outer', 'inner'
I on: column name(s) for thejoin
I many others
RI merge in base RI dplyr:
I several functions in withexplicit names:inner_join,full_join, left_join,right_join
I by parameter: columnname(s) for the join
left %>% full_join(right)
68
Multiple joins
Loan dataI table with clients and
accountsI difficulties
I link tableI duplicate variable name
district_id
SolutionI two joinsI columns renaming
Pythonda = pd.merge(disposition, account)da.rename(columns={
'district_id':'acc_district_id'},
inplace=True)fulldata = pd.merge(da, client)
Rdisposition %>% inner_join(account) %>%
rename(acc_district_id=district_id) %>%
inner_join(client)
69
Multiple joins
Loan dataI table with clients and
accountsI difficulties
I link tableI duplicate variable name
district_id
SolutionI two joinsI columns renaming
Pythonda = pd.merge(disposition, account)da.rename(columns={
'district_id':'acc_district_id'},
inplace=True)fulldata = pd.merge(da, client)
Rdisposition %>% inner_join(account) %>%
rename(acc_district_id=district_id) %>%
inner_join(client)
69
Result
disp_id client_id account_id type1 1 1 OWNER2 2 2 OWNER3 3 2 DISPONENT4 4 3 OWNER5 5 3 DISPONENT6 6 4 OWNER7 7 5 OWNER8 8 6 OWNER9 9 7 OWNER
10 10 8 OWNER11 11 8 DISPONENT12 12 9 OWNER13 13 10 OWNER14 14 11 OWNER15 15 12 OWNER16 16 12 DISPONENT17 17 13 OWNER18 18 13 DISPONENT19 19 14 OWNER20 20 15 OWNER
account_id district_id date1 18 1995-03-242 1 1993-02-263 5 1997-07-074 12 1996-02-215 15 1997-05-306 51 1994-09-277 60 1996-11-248 57 1995-09-219 70 1993-01-27
10 54 1996-08-2811 76 1995-10-1012 21 1997-04-1513 76 1997-08-1714 47 1996-11-2715 70 1993-10-0216 12 1997-09-2317 1 1997-01-0818 43 1993-05-2619 21 1995-04-0720 74 1996-08-24
70
Result
disp_id client_id account_id type acc_district_id date1 1 1 OWNER 18 1995-03-242 2 2 OWNER 1 1993-02-263 3 2 DISPONENT 1 1993-02-264 4 3 OWNER 5 1997-07-075 5 3 DISPONENT 5 1997-07-076 6 4 OWNER 12 1996-02-217 7 5 OWNER 15 1997-05-308 8 6 OWNER 51 1994-09-279 9 7 OWNER 60 1996-11-24
10 10 8 OWNER 57 1995-09-2111 11 8 DISPONENT 57 1995-09-2112 12 9 OWNER 70 1993-01-2713 13 10 OWNER 54 1996-08-2814 14 11 OWNER 76 1995-10-1015 15 12 OWNER 21 1997-04-1516 16 12 DISPONENT 21 1997-04-1517 17 13 OWNER 76 1997-08-1718 18 13 DISPONENT 76 1997-08-1719 19 14 OWNER 47 1996-11-2720 20 15 OWNER 70 1993-10-02
71
Result
client_id type acc_district_id date1 OWNER 18 1995-03-242 OWNER 1 1993-02-263 DISPONENT 1 1993-02-264 OWNER 5 1997-07-075 DISPONENT 5 1997-07-076 OWNER 12 1996-02-217 OWNER 15 1997-05-308 OWNER 51 1994-09-279 OWNER 60 1996-11-24
10 OWNER 57 1995-09-2111 DISPONENT 57 1995-09-2112 OWNER 70 1993-01-2713 OWNER 54 1996-08-2814 OWNER 76 1995-10-1015 OWNER 21 1997-04-1516 DISPONENT 21 1997-04-1517 OWNER 76 1997-08-1718 DISPONENT 76 1997-08-1719 OWNER 47 1996-11-2720 OWNER 70 1993-10-02
client_id gender birth_date district_id1 F 1970-12-13 182 M 1945-02-04 13 F 1940-10-09 14 M 1956-12-01 55 F 1960-07-03 56 M 1919-09-22 127 M 1929-01-25 158 F 1938-02-21 519 M 1935-10-16 60
10 M 1943-05-01 5711 F 1950-08-22 5712 M 1981-02-20 4013 F 1974-05-29 5414 F 1942-06-22 7615 F 1918-08-28 2116 M 1919-02-25 2117 M 1934-10-13 7618 F 1931-04-05 7619 M 1942-12-28 4720 M 1979-01-04 46
72
Result
disp_id client_id account_id type acc_district_id date gender birth_date district_id1 1 1 OWNER 18 1995-03-24 F 1970-12-13 182 2 2 OWNER 1 1993-02-26 M 1945-02-04 13 3 2 DISPONENT 1 1993-02-26 F 1940-10-09 14 4 3 OWNER 5 1997-07-07 M 1956-12-01 55 5 3 DISPONENT 5 1997-07-07 F 1960-07-03 56 6 4 OWNER 12 1996-02-21 M 1919-09-22 127 7 5 OWNER 15 1997-05-30 M 1929-01-25 158 8 6 OWNER 51 1994-09-27 F 1938-02-21 519 9 7 OWNER 60 1996-11-24 M 1935-10-16 60
10 10 8 OWNER 57 1995-09-21 M 1943-05-01 5711 11 8 DISPONENT 57 1995-09-21 F 1950-08-22 5712 12 9 OWNER 70 1993-01-27 M 1981-02-20 4013 13 10 OWNER 54 1996-08-28 F 1974-05-29 5414 14 11 OWNER 76 1995-10-10 F 1942-06-22 7615 15 12 OWNER 21 1997-04-15 F 1918-08-28 2116 16 12 DISPONENT 21 1997-04-15 M 1919-02-25 2117 17 13 OWNER 76 1997-08-17 M 1934-10-13 7618 18 13 DISPONENT 76 1997-08-17 F 1931-04-05 7619 19 14 OWNER 47 1996-11-27 M 1942-12-28 4720 20 15 OWNER 70 1993-10-02 M 1979-01-04 46
73
Advanced join topics
Support for real world issuesI key selectionI variable renamingI filtering join (dplyr R only)I enforcing/checking unicity of keys (python only)I index based join (python only)I etc.
74
Licence
This work is licensed under a Creative CommonsAttribution-ShareAlike 4.0 International License.
http://creativecommons.org/licenses/by-sa/4.0/
75
Version
Last git commit: 2019-12-11By: Fabrice Rossi ([email protected])Git hash: af2cee4da140c15fb47c0ff45a9ff8b1028fcbd0
76
Changelog
I November 2019: added multiple table dataI October 2019: initial version
77