© 2005 the mathworks, inc. handling large data sets efficiently in matlab ® stuart mcgarrity the...
TRANSCRIPT
© 2
005
The
Mat
hWor
ks, I
nc.
Handling Large Data Sets Efficiently in MATLAB®
Stuart McGarrity
The MathWorks, Inc.
2
Handling Large Data Sets is Like…
3
Agenda
Problems in handling large data sets Strategies for handling large data set Maximizing available memory on your system Minimizing required memory in MATLAB®
4
Problems in Handling Large Data Sets
5
What are Large Data Sets?
Data in MATLAB® represent physical quantities
Large Data– Lots of quantities– Varying by time, space– High resolution– Example: 5-10 TB data
per flight test Trends: Devices, Computers,
RAM, Hard drives
6
What are Large Data Handling Problems?
Running out of memory– Large data sets need lots of memory to store and process– Computers have finite memory– Data set size > available memory: “Out of memory” errors
Slowness – Large data sets need lots of operations to process, and
access– Today's CPUs have limited speed– Slowness due to page file use
7
Causes of Out of Memory Error
Required memory > available memory– Memory constraints on (32-bit) computer system– Memory usage characteristics of MATLAB
Lack of understanding of memory constraints or requirements– >>A=rand(6e3,6e3);B=svd(A);” Why out of
memory?”– “I have 1 GB file but I have 3 GB of RAM. Why out of
memory?” Mistakes
– >>a=rand(10000); % need 800MB storage
8
Example from CSSM: MATLAB Memory Limitation Problem!!!!!!!!!!!"erican" <[email protected]> wrote in message news:<ijkq1gh4vqd8@legacy>...
It seems Matlab has a bad memory management. I got a new PC with 4G physical memory and hoped to free my worry about memory usage.However, I found that I can only use about 1G memory for several big matrices. If I continue to create small matrix, I can use about 1.5G memory. Although I tried to change the system performance so that the swap space was set to the maximum, it still doesn't work. I want to maximize the usage of my 4G memory, at least to 2G. Any suggestions?
Thanks!
9
MATLAB Users’ Data Set Sizes
25% of MATLAB users have data set sizes > 100 MB
10
MATLAB Users’ File Sizes
30% of MATLAB users access files > 100 MB
11
Strategies for Handling Large Data Sets
12
Strategies
Ensure available memory > required memory Maximizing available memory on your system
– System configuration Minimize required memory in MATLAB
– During access, storage, processing, plotting
13
Two Tactics Not Focused on Today
Use 64-bit– Removes one key memory constraint allowing processes
to address many tera bytes of data. Distributed computing
– N Machines ~= N x Memory– Subset of all applications (data parallel)
14
Maximizing Available Memory on Your System
15
What’s the biggest array in MATLAB under Windows XP?
>>a=zeros(?,1);>>whos
a) 600MBb) 1GBc) 1.5GB
16
Memory Constraints Causing Out of Memory Errors
DataDatazeros(1e9,1)zeros(1e9,1)
New Data New Data RequirementsRequirements
ContiguousContiguousFree BlockFree Block}}
Other FragmentsOther Fragments
MATLABMATLABWorkspaceWorkspace
Other ML VariablesOther ML Variables
WorkspaceWorkspace
ML ML footprint, footprint, Win DLLsWin DLLs
}}
MATLABMATLABProcess virtual Process virtual
memorymemory
Limit OS dependantLimit OS dependante.g. 2GB in Win2ke.g. 2GB in Win2k
Other AppsOther Apps
MATLABMATLABProcessProcessVirtual Virtual memorymemory
All ApplicationsAll Applicationsmemorymemory
requirementrequirement
}} RAMRAM
Page FilePage File
}}
Total SystemTotal SystemMemory Memory
17
Total System Memory
DataDatarand(1e8,1)rand(1e8,1)
New Data New Data RequirementsRequirements
ContiguousContiguousFree BlockFree Block}}
Other FragmentsOther Fragments
MATLABMATLABWorkspaceWorkspace
Other ML VariablesOther ML Variables
WorkspaceWorkspace
ML ML footprint, footprint, Win DLLsWin DLLs
}}
MATLABMATLABProcess virtual Process virtual
memorymemory
Limit OS dependantLimit OS dependante.g. 2GB in Win2ke.g. 2GB in Win2k
Other AppsOther Apps
MATLABMATLABProcessProcessVirtual Virtual memorymemory
All ApplicationsAll Applicationsmemorymemory
RequirementRequirement
}} RAMRAM
Page FilePage File
}}
Total SystemTotal SystemMemory Memory
18
Total System Memory Available
Storage for all processes =
– Physical RAM (fast and expensive)+
– Page file on disk (cheap and slow)
Memory Management Guide Tech note 1106 Amount of RAM affects performance; not direct cause of
“out of memory” errors
19
Viewing Total System Memory Available and Usage: Task Manager Alt-Ctrl-Del Right-click task bar Physical, commit charge Process Explorer (Google
“process explorer”)
20
Maximizing Total System Memory
Size– Ensure non-zero or system managed page file– Max 4 GB
Performance– Add RAM– Max 4 GB
21
The MATLAB Process’s Virtual Memory
DataDatarand(1e8,1)rand(1e8,1)
New Data New Data RequirementsRequirements
ContiguousContiguousFree BlockFree Block}}
Other FragmentsOther Fragments
MATLABMATLABWorkspaceWorkspace
Other ML VariablesOther ML Variables
WorkspaceWorkspace
ML ML footprint, footprint, Win DLLsWin DLLs
}}
MATLABMATLABProcess virtual Process virtual
memorymemory
Limit OS dependantLimit OS dependante.g. 2GB in Win2ke.g. 2GB in Win2k
Other AppsOther Apps
MATLABMATLABProcessProcessVirtual Virtual memorymemory
All ApplicationsAll Applicationsmemorymemory
RequirementRequirement
}} RAMRAM
Page FilePage File
}}
Total SystemTotal SystemMemory Memory
22
It is Limited and OS-Dependent
32-bit platforms– Windows 2000 and XP (by default): 2 GB– Linux/UNIX/MAC system configurable: 3-4 GB– Windows XP with /3gb boot.ini switch: 3 GB
64-bit platforms– Linux: 8TB (not all 64 bits used)
23
Maximizing The MATLAB Process’s Virtual Memory Choose OS with largest process memory (in order):
– 64-bit Linux, (future Win64)– 32-bit UNIX/Linux/MAC– Windows XP with /3gb– Window 2000, Windows XP (by default)
24
Checking The Virtual Memory Limit
>>system_dependent memstats UNDOCUME
NTED
25
Increasing Process Limit on XP to 3G: 3GB Switch
Right-click Properties> Advanced > Startup and Recovery > Edit
Make copy of [Operating system line], change comment and add /3gb
Reboot, select new OS option, check memstats
26
27
28
29
30
31
33
34
35
36
37
38
The MATLAB Process’s Virtual Memory Limit with 3GB Switch>>system_dependent memstats UNDO
CUMENTED
39
Workspace Size and Largest Block
DataDatarand(1e8,1)rand(1e8,1)
New Data New Data RequirementsRequirements
ContiguousContiguousFree BlockFree Block}}
Other FragmentsOther Fragments
MATLABMATLABWorkspaceWorkspace
Other ML VariablesOther ML Variables
WorkspaceWorkspace
ML ML footprint, footprint, Win DLLsWin DLLs
}}
MATLABMATLABProcess virtual Process virtual
memorymemory
Limit OS dependantLimit OS dependante.g. 2GB in Win2ke.g. 2GB in Win2k
Other AppsOther Apps
MATLABMATLABProcessProcessVirtual Virtual memorymemory
All ApplicationsAll Applicationsmemorymemory
RequirementRequirement
}} RAMRAM
Page FilePage File
}}
Total SystemTotal SystemMemory Memory
40
Workspace Size and Largest Block
Workspace Size = 2/3GB minus:– System DLLs– Java – MATLAB.exe and DLLs
Largest block (for numerical arrays)– Fragmentation (Mainly on Windows)– Third party DLLs (e.g., Google Desktop, Fineprint)– Windows security updates
41
Finding Size of Workspace and Largest Block>>system_dependent memstats
Workspace size– 1.7 or 2.7 GB
Largest block– Goal 1.5 GB– Diagnose if less
Atlantis
UNDOCUME
NTED
42
Diagnosing Memory (Workspace) Fragmentation>>system_dependent dumpmem Shows MATLAB memory map and DLLs Shows where largest blocks starts Reveals causing fragmentation
UNDOCUME
NTED
43
Example: Print Driver and New System DLLs Third party DLLs (e.g. Fineprint) Windows Security Updates and service pack DLLs
– Use movedlls.exe– www.mathworks.com/support (Solution Number: 1-1HE4G5)
44
Maximizing Largest Contiguous Block
Uninstall third-party tools if issue Use XP SP2 and movedlls fix utility Other techniques
– Pack (save and load): Useful when using lots of variables after working for a while
– Restart MATLAB
45
Final Memory Available on XP
32-bit Windows XP default– 1.7 GB total, 1.5 GB contiguous
32-bit Windows XP with /3gb switch– 2.7 GB Total, 1.5 GB contiguous
Rough guide what is possible– Can process 100s MB arrays with simple operations– Can process 10s MB arrays with complex operations,
many operations, lines of codes
46
What’s the biggest real array in MATLAB under 64-bit Linux?
>>a=zeros(?,1);>>whos
a) 2 GBb) 16 GBc) 8 TB
47
64-bit Linux MATLAB
Workspace size limit: TB’s Single array number of elements size limit 2e9 elements (2^31-2), mxarray limitation
48
Summary: Maximizing Available Memory
Maximize total system memory and performance– Use non-zero page file and have lots RAM
Minimize the system memory requirements– Close other applications
Maximize MATLAB process’s virtual memory– Use best OS or configure
Maximize largest contiguous block– Diagnose fragmentation
49
Minimizing Required Memory in MATLAB
50
Minimizing Memory Requirements
Data access Data storage Processing Plotting
51
Application Problem Description
Semiconductor wafer thickness test data– Six months of production, millions of wafers
Problem– What percentage of the wafers manufactured last month
meet thickness specifications? Thickness data
– Large text waferdata.csv– Nine position plus other information– Try import– View contents waferdata_start.csv
52
Data Access
53
Take Only What You Need
Take only what you need for the calculation– Not usually problem with databases– Common problem with big flat files
Consider block processing– Independent blocks– State saved (e.g., filtering)
Tip: Clear variable first
54
For Text Files Use textscan
Take rows and columns you need For example: data = textscan(fid, format, N, ‘delimiter’,’,’)– Select columns with format string– Select rows with N
Returns cells for each data type in format string. You need to convert to doubles.
Only read in nine columns and one month (1e6 rows)
Exercise 5
55
Binary Files: Try memmapfile
Map a section of a binary file into memory Benefits
– Faster than fread and fwrite– Access files with MATLAB indexing operations
Other– Random access to sections– Can have multiple views– Take from MATLAB memory
Example– Simple homogenous file – Mixed data types and access as arrays
56
Memory Mapping Example
%% Default, map whole file as uint8m=memmapfile('waferdata_uint8.bin')m.Data(1:20);
%% Specify format and namem=memmapfile('waferdata_uint8.bin','format',{'uint8' [20 100] 'x'},'repeat',20*1000)
A=m.Data;
%% Change format on the flym.format={'uint8' [1 4] 'headerbits';... 'uint8' [4 9] 'middle';... 'uint8' [7 1] 'tail'};A=m.Data;
57
Data Storage
58
Use Smallest Data Type
Depends on intended actions Complicated Math (e.g., Linear Algebra)
– Doubles or singles, 8 or 4 bytes– For example: a=single(7)
Simple arithmetic and original data is integers– Integers,1-4 bytes, for example e.g. a=int8(7)– Can be faster than doubles– Try with waferdata (Exercise 6)
Sparse– Just non-zero values and index stored>>a = sparse(2e9,1,pi);
Exercise 6a cell execute
59
Use Smallest Data Type (cont.)
Categories, dates– Cell arrays of strings, 60 byte header
for each element– Use sparingly
Contiguousness– Numeric arrays must be contiguous– Cell arrays and structures do not
Comparison with C– For numerical processing, similar
choice of data types
60
Load and Store as uint8
Read in uint8– Note: when preallocating need to specify data types
Exercise 6
61
Processing
62
Memory for Processing
Calculate only the results you need MATLAB operators and functions need extra memory
storage– Passing data to functions by value and assignments– Copy on write
Makes MATLAB safer, easier to debug – More memory than in-place operations using pointers
63
Monitoring MATLAB Memory Usage
Process Explorer or Task Manager
MATLAB Monitoring Tool: www.MATLABcentral.com
MATLAB Central > File Exchange > Utilities > Development Environment > MATLAB Monitoring Tool
Example:– x=rand(10e6,1);– y=x; y(2)=1.5;
UNDOCUME
NTED
64
Monitoring MATLAB Memory Usage (cont.) Total memory usage: workspaces, graphics>>system_dependent(‘CheckMallocMemoryUsage’);ans = 6820440 Show results in bytes (not MBytes) Starts with a few MB To use it, set an environmental variable
– MATLAB_MEM_MGR debug– In DOS window, C:\ setx MATLAB_MEM_MGR debug
UNDOCUME
NTED
65
How much temporary memory does data=data+1 require, where data is a 1 MB array take in an M-File?
Run M-file containing:data=zeros(1e6,1,’int8’); data=data+1;
a) 0 MBb) 1 MBc) 2 MB
66
JIT Vs Interpreter for Large Data Set Handling Run loaddata Offset in the thicknesses data needs to be corrected
with data = data+5– At command line– In M-File
Run Loaddata, then exercise 7a
UNDOCUME
NTED
67
Minimizing Copies and Temporaries
Share data rather than pass to functions– Nested Functions – Global
Reduce size of array to scalars or smaller blocks– Memory copies are equal to size of array– Process with for loops, de-vectorize
Can be slower, so must trade off speed for less memory
68
What is the fastest way to process MATLAB matrices with for loops?
a) Down the columnsb) Along the rowsc) Doesn't matter
Exercise 7
UNDOCUME
NTED
69
Example: De-Vectorizing
1D 2D
Exercise 7
70
Find Percentage of Wafers Meeting Specification What percentage of the wafers meet specification
– Must be < Maximum thickness of 200– Must be > Minimum thickness of 70
exercise 8a, 8c
71
Plotting
72
How much extra memory do you need to plot a 10 MB double array?
>>x=rand(125e4,1);>>plot(x);
a) 10 MBb) 20 MBc) 40 MB
Exercise 9a, process explorer on MATLAB task
73
Memory Copies When Plotting
Need extra memory to plot– For example: >>plot(data(:,1));
plot(x,y)– Copy of x, y to xdata and ydata properties of line
plot(y)– Copy of y in ydata and indices in xdata, 1:size(y)
Results– Memory requirements triple at least (more
temporarily)– Integers more (stored as doubles)
74
Minimize Copies: Plot Only What You Need Limited resolution of screen/human eye Sub-select, down-sample
– Plot every Nth element– Plot min/max in each block
75
Minimize Memory Requirements: Review
Accessing data– Take only what you need– Try processing in blocks
Data Storage– Store in smallest data type
Processing– Reduce temporaries with loops, blocking, nested, globals
Plotting– Plot what you need
76
Summary: Strategies For Handling Large Data Sets Ensure available memory > required memory 64-bit and distributed computing 32-bit single CPU
– Maximize available memory– Minimize required memory
77
Appendix
78
Appendix Contents
Resources 1 MB is Not Equal to Million Bytes Setting the Paging File Size File Size vs. Data Size Minimizing Other Applications’ Memory
79
Resources
Technical note 1106– www.mathworks.com/support, enter 1106
“Large Data Set Handling in MATLAB 7” Digest Article November 2004: www.mathworks.com/company/newsletters/digest/nov04/newfeatures.html
80
1 MB is Not Equal to Million Bytes
1 KB= 2^10 bytes = 1024 bytes– Not 1e3 bytes
1 MB=2^20 bytes = 1048576 bytes– Not 1e6 bytes
1 GB=2^30 bytes = 1073741824 bytes– Not 1e9 bytes
81
Setting the Paging File Size
My Computer > Properties > Advanced > Performance > Advanced > Virtual memory, Change then press “Set”
Set to System Managed Size or Custom
Slow down with page file, swapping
82
File Size vs. Data Size
Binary files: File size = data size– Assuming you use same data type as in the binary files
Text files: File size ~= data size, for example:– Data element: double= 8 bytes– Text file element, format 1: 3.4, 4 chars = 4 bytes– Text file element, format 2: 3.47896e+001, 13 chars= 13
bytes
83
Minimizing Other Applications’ Memory
Shut them down Restart Computer, if can’t
get < 300 MB Use windows msconfig
to avoid starting up applications
Not too important if you have a page file