TABLE OF CONTENTS
LIST OF FIGURES
ACKNOWLEDGEMENTS
1. INTRODUCTION
  1.1 Introduction to microarray
  1.2 Image Analysis
  1.3 Data Analysis
2. TECHNOLOGY
  2.1 Excel Database
  2.2 Microsoft .NET
    2.2.1) Why .NET?
    2.2.2) VS.NET for Office Solutions
    2.2.3) Primary Interop Assemblies (PIA)
    2.2.4) Coding with .NET Libraries
3. IMPLEMENTATION OVERVIEW
  3.1 Application Design Overview
  3.2 Algorithm Outline
  3.3 Data description
  3.4 User Interface
    1) Parent Form
    2) Detection Calls Option
    3) T-test of Signal Log Ratios Option
    4) Set the criteria for Fold Change Option
  3.5 Why DLLs
  3.6 Class Diagram of the methods
  3.7 Description of methods used in the dlls
    3.7.1 VedmetDLL.dll – SetWorksheet ()
  3.8 Description of methods used in the Options form
    3.8.1 Class: Form 1
  3.9 Features
  3.10 Screen Shots
4. TESTING
5. ENHANCEMENTS
6. REFLECTIONS
7. CONCLUSION
8. REFERENCES
9. APPENDIX
  a) Transferring the GCOS CAB files into Excel
LIST OF FIGURES
Figure 1: Gene Chip Array
Figure 2: Hybridization of tagged and untagged probes
Figure 3: Gene Chip Technology
Figure 4: Scanning of tagged and untagged probes
Figure 5: Excel Object Model
Figure 6: Application Design Overview
Figure 7: Algorithm Overview
Figure 8: Class Diagram
ACKNOWLEDGEMENTS
I sincerely thank Dr. Dan Andresen, my major professor, for his invaluable
guidance and advice throughout the whole project. I also thank him for being flexible and
adjusting during the course of the project.
I am grateful to Dr. Daniel Marcus who has helped me to understand the
background of this project. I thank him for his valuable suggestions and being supportive.
I thank Dr. Mitch Neilsen for serving on my committee and agreeing to review my
report.
I am indebted to my parents for their love and encouragement. Last but not least, I
thank all my friends, including Palani, and my brother Dinesh for their support.
1. INTRODUCTION
The analysis of microarray data has been a time-consuming task that involves
applying different algorithms to genome databases. It was previously done with
a collection of different software packages, each suited to a particular purpose. As a step
toward automating the procedure, a Windows application was developed that has a
biologist-friendly user interface and interacts with the genome databases (gene lists)
obtained from the GeneChip Operating Software (GCOS) 1.2. GCOS is a software
package used specifically for acquiring and analyzing gene array data from the
Affymetrix GeneChip platform.
1.1 Introduction to microarray:
The fundamental basis of DNA microarrays is the process of hybridization. Two
DNA strands hybridize (stick together) if they are complementary to each other.
Complementarity reflects the Watson-Crick rule that adenine (A) binds to thymine (T)
and cytosine (C) binds to guanine (G). Hybridization has for decades been used in
molecular biology as the basis for such techniques as Southern blotting and Northern
blotting. Where before it was possible to run a couple of Northern blots or a couple of
Southern blots in a day to identify a few expressed genes, it is now possible with DNA
arrays to run hybridizations to test for expression of tens of thousands of genes. This has
in some sense revolutionized molecular biology and medicine. Instead of studying one
gene and one messenger at a time, experimentalists are now studying many genes and
many messages at the same time. In fact, DNA arrays are often used to study all known
messages for genes of an organism. This has opened the possibility of an entirely new,
systematic view of how cells react in response to certain stimuli. It is also an entirely new
way to study human disease by viewing how it affects the expression of all genes inside
the cell.
The Technology behind DNA Microarrays:
A microarray is a solid support on which DNA of known sequence is deposited
in a regular grid-like array. The DNA may take the form of cDNA or oligonucleotides,
although other materials may be deposited as well. Typically, several nanograms (per
chip) of DNA are immobilized on the surface of an array.
Figure 1: GeneChip Array
Image courtesy of Affymetrix -
http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx
RNA is extracted from biological sources of interest, such as cell lines with or without
drug treatment, tissues from wild-type or mutant organisms, or samples studied across a
time course. The RNA (or mRNA) is often converted to cDNA, labeled with fluorescence
or radioactivity, and hybridized to the array.
Figure 2: Hybridization of tagged probes
Image courtesy of Affymetrix -
http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx
During this hybridization, cDNAs derived from RNA molecules in the biological starting
material can hybridize selectively to their corresponding nucleic acids on the microarray
surface. Following washing of the microarray, image analysis and data analysis are
performed to quantitate the signals that are detected. Through this process, microarray
technology allows the simultaneous measurement of the expression levels of thousands of
genes represented on the array.
Figure 3: GeneChip Technology:
Image courtesy of Affymetrix -
http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx
Advantages of Microarrays:
1. It is fast; one can obtain data on the expression levels of over 50,000 genes within
one week.
2. The entire genome can be represented on a chip and thus it is comprehensive.
3. It is flexible because cDNAs or oligonucleotides corresponding to any gene can
be represented on a chip.
Disadvantages of Microarrays:
1. Many researchers find it prohibitively expensive to perform sufficient replicates
and other controls.
2. There are many artifacts associated with image analysis and data analysis.
Researchers are still figuring out how to get the “best” answers from microarray
experiments.
3. Microarrays alone are not enough; the results usually have to be
validated with another technique such as RT-PCR.
4. There is no standard way to analyze microarray data.
5. Getting answers requires combining knowledge of biology, statistics, and computing,
so the learning curve is steep.
Applications of Microarray:
1. Studying the effects of drug treatment
2. Gene knock out effects
3. Gene cloning
4. Cancer research
5. Developmental biology (like stem cell populations)
1.2 Image Analysis:
When the microarray chip is illuminated by a laser beam, the RNA that has been
hybridized fluoresces, producing brightness proportional to the amount of hybridized
RNA. This image is captured by a camera and it is then processed by a computer to get
the expression levels of all the genes.
Figure 4: Scanning of tagged and untagged probes
Image courtesy of Affymetrix -
http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx
Background subtraction:
The first step of data analysis is to correct for background across the entire array.
The array is divided into equally spaced zones, and an average background is assigned to
the center of each zone. The background calculated for each cell establishes
an intensity floor that is subtracted from all intensity values.
1.3 Data Analysis:
Data analysis starts with normalization and/or scaling of the data, which
is a built-in step when the microarray image is acquired by the GeneChip Operating
Software (GCOS) 1.2. (Normalization and scaling can be performed by either the data
acquisition software or the data analysis software.) The correlation between samples
can be examined with methods such as cluster analysis, condition trees, and
principal component analysis. Once the data is acquired, it is generally exported to an
Excel file where the analysis is done. The post-analysis steps include Gene Ontology (i.e.,
classifying the genes into different functional groups) and mapping the genes onto
pathways from different databases (which can be performed in other software).
There are a number of commercially-available software packages that might
include the analysis steps that are implemented in this Windows application. As such,
there is no perfect method for data analysis and it is up to the investigators to decide
which of the steps or algorithms to follow for their data analysis. This application is
highly customized for use in the Cellular Biophysics laboratory of the Department of
Anatomy and Physiology, College of Veterinary Medicine, Kansas State University.
2. TECHNOLOGY:
The front end has been implemented in Visual C# .NET as a Windows application.
On the back end, the gene list obtained from GCOS is a Microsoft Excel spreadsheet, and the
output that biologists expect from such a tool is also an Excel spreadsheet.
2.1 Excel Database:
Excel spreadsheets are a straightforward solution for storing tabular data, and recent
versions of the ubiquitous Microsoft Excel include some surprisingly sophisticated data
access and manipulation functions.
Issues:
There are a number of reasons why Excel is not the preferred choice for data management
and/or statistical analysis. Some of the simpler reasons are:
a) There is no way to record what you have done.
b) Poor statistical routines: it is impossible to view the source code that implements the
statistical routines, and several Excel procedures are misleading.
c) Routines for handling missing data were incorrect in versions prior to Excel 2000. [In
reference to pre-2000, "Excel does not calculate the paired t-test correctly when some
observations have one of the measurements but not the other." E. Goldwater, ref. (1)].
In addition, copying the formulas to all the cells is a manual process, and sorting the
columns and the datasheets can be dangerous.
However, the conventional method of data analysis for huge genomes used by
biologists is Microsoft Excel. Although calculating the cells for the
complete genome, sorting and filtering the data, and copying different results to different
worksheets for subsequent manipulation is menial work, Excel is still considered an easy
solution. The solution I have proposed is therefore a Windows application that
aids the analysis task by interfacing with and programming against Excel worksheets.
2.2 Microsoft .NET:
2.2.1) Why .NET?
The reasons for choosing .NET include the following:
1. .NET provides the ability to create rich clients that execute within the Common
Language Runtime. These applications utilize a new Windows forms processing
engine, called Windows Forms. Any .NET language can use Windows Forms to
build Windows applications. These applications have access to the complete .NET
Framework of namespaces and objects, and have all of the advantages which the
Framework can offer.
2. It is object-oriented and has many programming tools that allow for faster development and more functionality.
3. All applications in .NET are "garbage-collected", which means that objects are
destroyed automatically when they are no longer in use.
2.2.2) VS.NET for Office Solutions:
The key benefits of choosing VS.NET as the development environment for Office solutions
are:
- The power of writing managed .NET code that executes behind Word and Excel
  documents.
- Developers get the full advantages of the Visual Studio .NET environment.
- Developers can create applications with a more robust security model, for example by
  restricting code so that it executes only from a fully trusted corporate server.
- Code-behind .NET projects can be started in .NET with new Office documents,
  applied to existing Excel spreadsheets or Word documents and templates, and
  can even co-exist with current VBA-based logic.
- Using VS.NET facilitates language freedom, easier debugging, better memory
  management, and a more robust security model.
- The VBA programming model still exists; the .NET Office tools are just another
  choice.
2.2.3) Primary Interop Assemblies (PIA)
Microsoft provides official wrapper assemblies for writing managed
code against the programmable unmanaged Microsoft Office libraries. These are called
the primary interop assemblies.
The Office PIAs are installed for the following reasons:
- Develop managed Office applications within the robust Visual Studio
  development environment
  - Create Excel and Word solutions directly from within Visual Studio
- Develop applications more quickly with less code
- Utilize Visual Studio’s vast array of tools, easy access to web services, and
  access to the .NET Framework
- Leverage existing Visual Studio/VB/C# experience
2.2.4) Coding with .NET Libraries:
Here is a quick walkthrough of the Excel object model. The complete Excel object model
is somewhat complicated, but only a few objects are required for our Windows application.
Figure 5 – Excel Object Model
- The Application object is the controller object of all other subsystems in the Excel
  application.
- Each application can have multiple Workbooks; there is one default
  workbook for each application, and the default workbook is returned in the
  ‘ThisWorkbook’ object variable.
- Each Workbook has 3 Worksheets by default (the actual data presentation area).
  Developers can present data in any one of these sheets, or they can create their own
  additional worksheets.
3. IMPLEMENTATION OVERVIEW:
3.1 Application Design Overview:
The application has references to the Microsoft Excel 11.0 Object
Library and to the dlls that contain the analysis modules and the user interface for those
analysis modules. All the references and their associated items can be managed in
Solution Explorer, which is provided as part of the Integrated Development
Environment (IDE). The solution is a container for the projects and solution items that can
be built into the application. The Windows application interacts with the Excel database
through the solution level that has all the build configurations.
Figure 6 – Application Design Overview
3.2 Algorithm Outline
The working of this application is shown in the architecture diagram
given below. The user imports the Excel file containing the gene list to be processed
(see Appendix (a) for transferring a CAB file into Excel) and builds the analysis strategy
by selecting the options given in the first form. All the analysis modules and their user
interfaces are placed in separate dynamic link libraries, and the dlls are added as references
to the Windows application. Depending on the steps selected by the user, the appropriate
forms are opened by calling the dlls, and in them the user can set the criteria for that step. The
actions in those forms call the corresponding functions in the dlls. The modules in the
dlls are interfaced with the Excel spreadsheet. The analysis module
corresponding to the first option selected by the user filters the data set and returns an
array list containing the row numbers of the genes that match the criteria. This array list
is passed to the analysis module corresponding to the second option selected by the
user, and so on, so that the filtering is applied successively. Finally, the dataset that has been
filtered by all the analysis steps is exported as an Excel worksheet. It can also be
previewed within the “Preview Result” textbox of the form.
Figure 7 – Algorithm Overview
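The successive filtering described in this section can be sketched as a chain of steps that each receive the surviving row numbers and return a subset. This is only an illustrative sketch: the names FilterPipelineSketch, FilterStep and Run are hypothetical and do not appear in the application, and plain lists stand in for the Excel-backed modules.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch (not the application's code): each analysis step takes
// the row numbers that survived so far and returns the rows it keeps.
public static class FilterPipelineSketch
{
    // A filter step maps a list of row numbers to the rows that pass it.
    public delegate List<int> FilterStep(List<int> rows);

    // Applies the steps in the order the user selected them.
    public static List<int> Run(List<int> allRows, IEnumerable<FilterStep> steps)
    {
        // Copy the input so the caller's list is never modified.
        List<int> current = new List<int>(allRows);
        foreach (FilterStep step in steps)
        {
            current = step(current);
        }
        return current;
    }
}
```

If the first selected option is the t-test rather than the detection call filter, the same chain applies with the steps in a different order; the initial list would then hold the row numbers of all genes.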
3.3 Data description
The Excel spreadsheet (see Appendix (a) for transferring a GCOS CAB file into
Excel) consists of the following data:
a) Affymetrix id
b) Detection (for every single and comparison analysis):
The values of detection can be
1. “P”, (denoting the Present Call)
2. “M”, (denoting the Marginal Call)
3. “A” (denoting the Absent Call)
c) Signal Log Ratio (for every comparison analysis)
d) Change (for every comparison analysis)
e) Detection p-value (for every single and comparison analysis)
f) Description
g) Different types of annotations like common name, GenBank Accession number,
product, function, GO biological process, GO cellular component, GO molecular
function etc.
All of these fields might vary depending on the options selected by the user.
3.4 User Interface:
1) Parent Form:
The user imports the file for analysis by clicking the button “Import the
samples file for analysis”. The options available in the tool are listed in the left panel; the
user selects an option by clicking on an item in the ListBox, one at a time, and
then clicking the button “Perform this Filter”.
The corresponding panels are opened according to the options selected by the user. After
all the analysis steps are performed, the results can be exported from this form into a separate
Excel sheet by clicking the button “Export the result”;
a save-file dialog box then opens and the user can save the file at the desired location. There is
also a button “Result Preview”, which lets the user preview the genes that
were filtered by the analysis modules.
2) Detection Calls Option:
This is one of the options listed in the listbox of the parent form. The user
interface for handling this option is located in the dll DetectionCalls.dll. On selecting this
option and pressing the button “Perform this Filter”, a panel is displayed on the right.
This panel is for setting the criteria for detection calls. The user must type
the number of samples in each group in the textbox next to the label “Input the number
of Samples in each group”. (It is assumed that there are two groups and that the
number of samples is the same in each group.) The next two textboxes, next to the labels
“Input the number of Present calls” and “Input the number of Marginal Calls”, specify
the criteria for the number of present calls and the number of marginal calls.
The input in these two textboxes is validated so that their
total does not exceed the number of samples entered in the first textbox. On
clicking the “Perform test” button, the application returns the number of genes that satisfy
the criteria set by the user, and this number is displayed in the label at the end of this
form.
3) T-test of Signal Log Ratios Option:
This is another option listed in the left panel. The user interface for handling this
option is in the dll Ttest.dll. The panel has controls to set values for performing a t-test
on the signal log ratio values. The textbox textBox1 holds the p-value criterion
for selecting significant genes. Since the p-value cutoff
should always be 0.05 (as required by the users), the textbox has been made read-only.
On clicking the button “Perform t-test”, the application displays the number of
significant genes that pass the criterion.
4) Set the criteria for Fold Change Option:
This is also one of the options in the listbox. The user interface for handling this
option is created by the dll FoldChange.dll. The purpose of this option is to calculate the
fold change of the significant genes (if significance has been evaluated prior to this)
or of all the genes, and to set the criteria for the up-regulated and the
down-regulated genes. Here, fold change is calculated from the median of the signal log
ratios. The textboxes textBox1 and textBox2 set the cut-off fold change for
determining the up-regulated and down-regulated genes respectively. The two textboxes have been
made read-only and their values are set to 2.0 and -2.0 respectively. The number of
up-regulated and the number of down-regulated genes are displayed on clicking the
“Submit” button.
3.5 Why DLLs:
1. A dynamic link library is shared in memory among the programs that use it, which can
improve system performance compared with duplicating the code in each application.
2. Each DLL can be built and tested separately.
3. DLLs can be loaded and unloaded at run time, which helps application
performance.
4. Large software products can be divided into several DLLs, which makes them easier
for developers to build.
5. DLLs ease the creation of international versions.
A potential disadvantage of using DLLs is that the application is not self-contained; it
depends on the existence of a separate DLL module.
Even though DLLs and applications are both executable program modules, they differ in
several ways. To the end user, the most obvious difference is that DLLs are not programs
that can be directly executed. From the system's point of view, there are two fundamental
differences between applications and DLLs:
- An application can have multiple instances of itself running in the system
  simultaneously, whereas a DLL can have only one instance.
- An application can own things such as a stack, global memory, file handles, and a
  message queue, but a DLL cannot.
Loading DLLs at run time:
The .NET Framework allows an assembly to be inspected and used at run time through the System.Reflection namespace and its associated classes.
A sample of the code that handles the dlls at run time is given below:
    // Requires: using System; using System.Reflection;
    Assembly assembly = Assembly.LoadFrom(@"MyAssembly.dll");
    foreach (Type type in assembly.GetExportedTypes())
    {
        // Look up the public parameterless constructor; skip types that lack one.
        ConstructorInfo constructor = type.GetConstructor(Type.EmptyTypes);
        if (constructor == null || type.IsAbstract) continue;
        object newObject = constructor.Invoke(new object[] {});
    }
In this way a newObject is created for each suitable type exported by that dll.
In this application, a button with a tick mark located at the top-right corner
handles this event. The file names of the dlls that are added at run time are appended to
the list box in the left panel. A dll with the same name as one already listed cannot be added; a
message box pops up indicating that the dll already exists.
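The duplicate-name check described above could be sketched as follows. The class PluginRegistrySketch and its method TryRegister are hypothetical names introduced for illustration, and the case-insensitive comparison is an assumption based on Windows file naming; the report does not say how the application compares names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not from the application): remembers which plug-in
// file names have already been added to the list box.
public class PluginRegistrySketch
{
    // Assumption: names are compared without regard to case.
    private readonly HashSet<string> loadedNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Returns true if the dll was registered, or false if a dll with the same
    // file name is already present (the caller would then show a message box).
    public bool TryRegister(string dllPath)
    {
        string fileName = Path.GetFileName(dllPath);
        return loadedNames.Add(fileName);
    }
}
```

Only the file name is compared, so two different folders containing a dll with the same name would count as duplicates, which matches the behavior described in the text.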
3.6 Class Diagram of the methods:
Figure 8 – Class Diagram
3.7 Description of methods used in the dlls:
All the modules that are used in the analysis and their corresponding user
interfaces are placed in the dlls VedmetDLL.dll, DetectionCalls.dll, Ttest.dll and
FoldChange.dll; they are invoked whenever necessary in the application. These
dynamic link libraries are added as references in the Windows application. The dlls and
their functions are described here:
3.7.1 VedmetDLL.dll – SetWorksheet ():
This method creates an Excel application and opens the Excel file
specified by the user. It also sets the Excel workbooks, Excel sheets and
the Excel worksheet of the Excel object.
3.7.2 DetectionCalls.dll – Test_1():
This method finds the number of present calls and marginal calls for
each sample group for each gene and compares them to the input provided by
the user; the input is passed as parameters to this function. This is
implemented by creating an Excel range object for the first sample group
over the columns containing the detection calls. The find() method of the
range object is used for finding the cells containing the detection values
“P”/“M”, and counters keep the count of the present and marginal calls.
If the values of the counters are greater than the values entered by the user,
the flag of that row is set to true and the row number is added to an
arraylist. For the genes whose flag is not set to true, another range
object is created for the second group and the above procedure is repeated.
If the selected row number is not already in the arraylist, it is added
to it. At the end, the arraylist holds all the rows that satisfy the criteria
in either of the sample groups, so the count of this arraylist is the number of
genes that pass this test, and it is displayed in
the form.
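The counting logic of this filter can be sketched without Excel by passing the detection calls as arrays. Everything named here (DetectionCallSketch, Filter, the array layout) is hypothetical. The comparison is written as "at least" (>=), which is an assumption: the text says "greater than", so the operator should be changed if a strict comparison is intended.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the detection call filter; arrays stand in for
// the Excel ranges used by the application.
public static class DetectionCallSketch
{
    // Counts how many calls in one sample group equal the given code ("P" or "M").
    private static int Count(string[] calls, string code)
    {
        int n = 0;
        foreach (string c in calls)
        {
            if (c == code) n++;
        }
        return n;
    }

    // Assumption: a group passes when it has at least minPresent "P" calls
    // and at least minMarginal "M" calls.
    private static bool GroupPasses(string[] calls, int minPresent, int minMarginal)
    {
        return Count(calls, "P") >= minPresent && Count(calls, "M") >= minMarginal;
    }

    // group1[i] and group2[i] hold the detection calls of gene i in each group.
    // Returns the indices of genes that pass in either group; the second group
    // is only examined when the first group fails.
    public static List<int> Filter(string[][] group1, string[][] group2,
                                   int minPresent, int minMarginal)
    {
        List<int> kept = new List<int>();
        for (int row = 0; row < group1.Length; row++)
        {
            if (GroupPasses(group1[row], minPresent, minMarginal) ||
                GroupPasses(group2[row], minPresent, minMarginal))
            {
                kept.Add(row);
            }
        }
        return kept;
    }
}
```

Because each row is visited once, a row can never be added twice, so the separate "already in the arraylist" check from the description is not needed in this form.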
3.7.3 DetectionCalls.dll – CreatePanel2():
This method creates the user interface for this option and
returns a panel object to the options form for display. It creates a panel
with 3 textboxes, 5 labels and a button. The first textbox is for getting the
number of samples from the user. The next two textboxes are for
specifying the call criteria for selection. On clicking the button “Perform
Test”, the result, i.e. the number of genes that meet the criteria, is
displayed in a label at the bottom.
3.7.4 DetectionCalls.dll – button_PerformTest_Click():
This method checks the validity of the user input; if the total of the
number of present calls and the number of marginal calls is greater than
the number of samples entered in the first textbox, the input is considered
invalid and a message box is displayed to indicate this.
This method invokes the test_1 () method of the DetectionCalls.dll
class. An arraylist is returned from test_1 (), and the count of
that arraylist gives the number of genes that satisfy the criteria.
The count is appended to the label “Number of genes that meet the
criteria:”
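The validity rule stated above (present plus marginal calls must not exceed the number of samples per group) is small enough to sketch directly. DetectionInputSketch and IsValid are hypothetical names, and rejecting non-positive sample counts and negative criteria is an added assumption that the report does not mention.

```csharp
// Hypothetical sketch of the input check performed before the filter runs.
public static class DetectionInputSketch
{
    // Returns true when the requested present and marginal calls fit within
    // the number of samples in each group.
    public static bool IsValid(int samplesPerGroup, int presentCalls, int marginalCalls)
    {
        // Assumption: a non-positive sample count or negative criteria are rejected.
        if (samplesPerGroup <= 0 || presentCalls < 0 || marginalCalls < 0)
        {
            return false;
        }
        return presentCalls + marginalCalls <= samplesPerGroup;
    }
}
```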
3.7.5 DetectionCalls.dll – textBox1_TextChanged():
This method handles the event raised when the text in textBox1 is
changed or entered by the user. The value entered by the user in
textBox1 is appended next to the textboxes for entering the Present and
Marginal Calls criteria.
3.7.6 Ttest.dll – test_2():
This function performs a t-test on the whole set of genes or on the genes
that have already passed the detection call filter, depending on the
options set by the user. The first argument of this function is an arraylist
holding the row numbers of the genes for which the t-test is to be done.
If the t-test is the first option selected by the user, the arraylist
holds the row numbers of all the genes. The second argument
is the user’s input for the p-value, and the third argument is a flag
denoting whether any other analysis module has been performed
previously. The t-test is implemented as follows. An arraylist keeps track
of all the column headers containing the string “Ratio”, which is a
substring of “Signal Log Ratio”.
The Application object includes a property called
WorksheetFunction, which returns an instance of the WorksheetFunction class.
The class provides a number of useful mathematical functions, such as mean and
variance, which perform calculations on ranges and hence are
used in this method.
The p-value of the t-test is derived from the mean and the standard
deviation of the signal log ratio values. If the calculated p-value is less
than the p-value set by the user in the form, that row number is added
to an arraylist, which is returned to the form.
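As a rough illustration of how a test statistic follows from the mean and standard deviation, the sketch below computes a one-sample t statistic for one gene's signal log ratios against a mean of zero. Treating the test as one-sample against zero is an assumption; the report does not state which form of the t-test is used. The names TStatisticSketch, Mean, SampleStdDev and TStatistic are hypothetical. The sketch stops at the statistic: turning it into a p-value requires Student's t distribution with n - 1 degrees of freedom, which is left to a statistics routine.

```csharp
using System;

// Hypothetical sketch: t statistic of one gene's signal log ratios,
// assuming a one-sample test against a mean of zero.
public static class TStatisticSketch
{
    public static double Mean(double[] values)
    {
        double sum = 0.0;
        foreach (double v in values) sum += v;
        return sum / values.Length;
    }

    // Sample standard deviation (divides by n - 1).
    public static double SampleStdDev(double[] values)
    {
        double mean = Mean(values);
        double sumSquares = 0.0;
        foreach (double v in values) sumSquares += (v - mean) * (v - mean);
        return Math.Sqrt(sumSquares / (values.Length - 1));
    }

    // t = mean / (s / sqrt(n)). If all values are identical, s is zero and
    // the result is infinite or NaN, so callers should handle that case.
    public static double TStatistic(double[] values)
    {
        if (values == null || values.Length < 2)
        {
            throw new ArgumentException("At least two values are required.");
        }
        double standardError = SampleStdDev(values) / Math.Sqrt(values.Length);
        return Mean(values) / standardError;
    }
}
```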
3.7.7 Ttest.dll – CreatePanel3():
This method creates the user interface for this particular analysis
option and returns a panel object to the options form for display.
It creates a panel with two labels, a textbox and a button. The
textbox is for setting the p-value criterion. The result obtained after the
button click has been handled is appended to the label at the bottom of the
panel.
3.7.8 Ttest.dll - Button3_Click_2():
This method invokes the test_2 () method of the dll. The third
argument (i.e., the flag) to test_2 () varies depending on
whether the analysis method in form 2, i.e., the detection call filter, has
been evaluated or not. The test_2 () function returns an arraylist, and the
count of the arraylist is appended to the label “Number of Significant
Genes”.
3.7.9 FoldChange.dll – test_3():
This function calculates the fold change of the significant genes
selected by the t-test and compares the calculated fold change with the
input from the user to determine the up-regulated and the down-regulated
genes. The first argument of this function is an arraylist holding
the row numbers of the significant genes, that is, the genes
that passed the t-test. The second and third arguments are the user’s
inputs in the textboxes of the form, which set the cut-off fold change for
finding the up-regulated and down-regulated genes. As in test_2 (), an
arraylist is created for finding the columns (from their headers) that
contain the signal log ratio values. For each row, the median of the
signal log ratios is calculated by a separate function Median (), which sorts
the signal log ratios and finds the median of the sorted list. The fold
change is evaluated from the median in test_3 (). If the fold
change is greater than or equal to the cutoff value for up-regulated genes
specified by the user, the row number is added to an arraylist.
Similarly, if the fold change is less than or equal to the cutoff value for
down-regulated genes specified by the user, the row number
is added to another arraylist. Both
arraylists, containing the up-regulated and down-regulated genes, are returned to
the form.
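The step from a median signal log ratio to a signed fold change, and the comparison against the two cutoffs, might look like the sketch below. The mapping used (2^SLR for non-negative values and -2^(-SLR) for negative ones) is a common convention for base-2 log ratios and is consistent with the symmetric cutoffs 2.0 and -2.0, but it is an assumption: the report does not state the formula. FoldChangeSketch and its members are hypothetical names.

```csharp
using System;

// Hypothetical sketch: signed fold change from a median signal log ratio.
public static class FoldChangeSketch
{
    // Assumption: signal log ratios are base-2 logs, and negative ratios are
    // reported as negative fold changes.
    public static double FromSignalLogRatio(double medianSlr)
    {
        if (medianSlr >= 0.0)
        {
            return Math.Pow(2.0, medianSlr);
        }
        return -Math.Pow(2.0, -medianSlr);
    }

    // Up-regulated when the fold change reaches the upper cutoff (e.g. 2.0).
    public static bool IsUpRegulated(double foldChange, double upCutoff)
    {
        return foldChange >= upCutoff;
    }

    // Down-regulated when the fold change reaches the lower cutoff (e.g. -2.0).
    public static bool IsDownRegulated(double foldChange, double downCutoff)
    {
        return foldChange <= downCutoff;
    }
}
```

Under this convention a median signal log ratio of 1 corresponds to a fold change of 2, and a median of -1 to a fold change of -2, so the fixed cutoffs select genes whose expression at least doubled or halved.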
3.7.10 FoldChange.dll – CreatePanel4():
This method creates the user interface required for this particular
analysis option. It creates a panel with 7 labels, 2 textboxes and a button.
The two textboxes are for setting the cutoff fold change for the up-regulated and
down-regulated genes. This method returns the panel object containing all
the other created controls to the options form.
3.7.11 FoldChange.dll – Median():
The fold change is calculated from the median of the signal log ratios
and so this function sorts all the signal log ratios and finds the median of
the sorted list.
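A sort-based median of this kind might look like the Python sketch below. The report does not say how a list with an even number of values is handled. Averaging the two middle values is assumed here, and the function name is hypothetical.

```python
def median_of_sorted(values):
    """Sort the values and return the middle one.

    Assumption: for an even count, the two middle values are averaged.
    """
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2.0


print(median_of_sorted([0.4, -1.2, 0.9]))
print(median_of_sorted([1.0, 3.0, 2.0, 4.0]))
```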
3.7.12 FoldChange.dll – button1_Click():
This method submits the information entered by the user in the textboxes
of the form, i.e., the cutoff values for the up- and down-regulated
genes. It calls the test_3 () method of the dll, with argument values
that depend on whether the other analysis options have been selected.
The arraylists containing the up- and down-regulated genes are returned,
and the counts of those genes are appended to the label boxes
“Up-regulated genes” and “Down-regulated genes”.
3.8 Description of methods used in the Options form:
3.8.1 Class: Form 1
Methods:
i. buttonImport_Click (): This method is used to import the Excel file
containing the genelist to be processed. An openfiledialogbox is
created and the method SetWorkSheet in the VedmetDLL dynamic
link library is called to set the Excel application and the workbook.
ii. button_PerformFilter_Click (): This button is clicked after selecting the
analysis option in the panel. This method calls another method
DoFiltertest () in the same class.
iii. DoFiltertest (): Depending on the option selected by the user, this method
calls the other method StartAppropriateForm ().
iv. StartAppropriateForm (): This method contains the switch-case
statements that create the user interface for the selected option by
calling the appropriate dll for each analysis step. Depending on
the user’s choice at run time, the other panels are made invisible.
It also adds the created panel to the set of controls of the form.
v. button_Export_Click (): This method gets the results of the analysis
modules from all the forms and exports them to Excel sheets. If the
analysis strategy excludes the last step, i.e., the fold change
calculator, only one Excel sheet needs to be exported. If the fold
change calculator is included in the analysis strategy, the result
set has two arraylists, namely the up- and down-regulated genes,
and these are exported to separate worksheets of the Excel
workbook. The two types of export are performed by
ExportArrayListToExcel () and ExportArrayListToExcelSheets (),
respectively.
vi. ExportArrayListToExcel (): This method is used for exporting the genes
that have satisfied the criteria set by the user to an Excel file. A
savefiledialogbox is opened and the user can save the file in the
desired location.
vii. ExportArrayListToTextBox (): This method is used for giving the user a
preview of the results that have been obtained. The resultant genes
are displayed in a textbox at the bottom of the form.
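The option dispatch performed by StartAppropriateForm () can be illustrated with a small Python sketch. It replaces the switch-case statement with a lookup table, which is a swapped-in technique and not the report's C# code. The factory functions, option keys and dictionary-based "panels" are all hypothetical stand-ins for the panel-creating methods of the dlls.

```python
def create_detection_panel():
    # Hypothetical stand-in for the panel built by DetectionCalls.dll.
    return {"title": "Detection Call Summarizer", "visible": True}


def create_ttest_panel():
    # Hypothetical stand-in for the panel built by Ttest.dll.
    return {"title": "t-test", "visible": True}


def create_fold_change_panel():
    # Hypothetical stand-in for CreatePanel4 () in FoldChange.dll.
    return {"title": "Fold Change Calculator", "visible": True}


# One entry per analysis option; plays the role of the switch-case.
PANEL_FACTORIES = {
    "detection_calls": create_detection_panel,
    "t_test": create_ttest_panel,
    "fold_change": create_fold_change_panel,
}


def start_appropriate_form(option, form_controls):
    """Hide the existing panels, then create and add the selected one."""
    if option not in PANEL_FACTORIES:
        raise ValueError(f"unknown analysis option: {option!r}")
    for existing in form_controls:
        existing["visible"] = False
    panel = PANEL_FACTORIES[option]()
    form_controls.append(panel)
    return panel


controls = []
start_appropriate_form("detection_calls", controls)
active = start_appropriate_form("fold_change", controls)
print([(p["title"], p["visible"]) for p in controls])
```

With a table of this kind, supporting a further analysis step means adding one entry rather than another case branch.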
3.9 Features:
Some of the features of this tool are:
a) Dynamic selection of the analysis method by the user:
The analysis modules are placed in dlls, so the modules corresponding
to the options selected by the user are invoked dynamically. This
feature gives the user the flexibility to determine the analysis
strategy.
b) Dynamic loading of dlls:
At run time, the user can load dlls dynamically. This feature allows
the user to load as many dlls as needed to handle the events and perform
the analysis modules. It is an added feature of this application that
makes it scalable.
c) Extensible:
If another analysis module is to be added to this tool in the future,
it can be added as a dll, since the user interface and the analysis
modules are already invoked through dlls in a modular way. This makes
the Windows application extensible for future enhancements.
d) Minimal user input:
Most of the textboxes used in this application have been
customized to be read only. This spares the user from re-entering the
same values in the text boxes every time an analysis is performed.
e) Implemented navigation – order of tabs:
The order of tabs has been configured so that the users can rapidly
interact with the user interface of the application.
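The report does not show the code that loads a dll at run time. In .NET this is typically done through reflection, for example with Assembly.LoadFrom. The Python sketch below illustrates the same plug-in idea with importlib as an analogue and is not the project's implementation. The plug-in file, its two entry points and all names are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path

# Source of a hypothetical plug-in; it stands in for an analysis dll
# that provides a panel factory and an analysis routine.
PLUGIN_SOURCE = '''
def create_panel():
    return {"title": "Fold Change Calculator"}

def run_analysis(values, cutoff):
    return [v for v in values if v >= cutoff]
'''

REQUIRED_ENTRY_POINTS = ("create_panel", "run_analysis")


def load_plugin(path):
    """Load a module from a file path chosen at run time."""
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load plug-in from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Reject plug-ins that lack the entry points the host relies on.
    for name in REQUIRED_ENTRY_POINTS:
        if not callable(getattr(module, name, None)):
            raise ImportError(f"{path.name} does not define {name}()")
    return module


with tempfile.TemporaryDirectory() as folder:
    plugin_path = Path(folder) / "fold_change_plugin.py"
    plugin_path.write_text(PLUGIN_SOURCE)
    plugin = load_plugin(plugin_path)
    panel = plugin.create_panel()
    kept = plugin.run_analysis([1.5, 2.5, 3.0], cutoff=2.0)

print(panel["title"], kept)
```

Checking the entry points at load time gives the host a single place to reject a module that does not follow the expected interface.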
3.10 Screen Shots
Form1:
This form displays the options that can be selected for the analysis.
On clicking the “Import the Samples file for analysis” button, the open file
dialog box pops up and the user can select the file to be processed.
The user can select from the options and the corresponding user interface pops up.
On clicking the button “Perform the selected filter”, the Detection Call Summarizer panel
appears and the user can set the criteria for selection.
The t-test panel appears on selecting the option in the left panel and on clicking the
button “Perform the selected filter”.
The number of genes that meet the criteria is displayed.
The fold change calculator panel appears when the corresponding option is selected in the
left panel.
4. TESTING:
i. Validated user input:
In the Detection Call Summarizer panel, the user’s input in the text boxes
(in which the call criteria are specified) is checked to ensure that the sum of
the present call value and the marginal call value given by the user is less than
or equal to the user’s input in the first textbox, i.e., the number of samples in
each group.
ii. Difference in performance between a dll function and a functionally identical
stand-alone module:
The performance of an analysis module in a dll was compared with
the performance of a stand-alone function. This test evaluates whether the
procedure calls to the dll slow down the analysis. The performance testing was
done with ANTS Profiler (Advanced .NET Testing System), which records the time
spent in the different modules of the source code. For this test, the test_1 ()
module of DetectionCalls.dll was compared with the same function implemented
directly in the form.
The results from ANTS Profiler for sample files of three different
sizes (500 genes, 5000 genes and 30000 genes) are given below:
Here is a snapshot of the recording from ANTS Profiler when analyzing a
sample with 500 genes.
The event handler button5_Click () in Form 1 took ~14 seconds, whereas
DetectionCalls.test_1 () took ~12 seconds to produce the same result. The
dll version was therefore about 2 seconds faster than the stand-alone
version.
This is a snapshot of a recording from ANTS Profiler when the application was
analyzing a sample file containing 5000 genes.
Here button5_Click () took ~122 seconds to perform the analysis,
whereas DetectionCalls.test_1 () took ~116 seconds to produce the same
result. This is consistent with the previous result: the dll again took
less time than the stand-alone module.
To determine whether the results hold for larger sample files,
a sample with 30000 genes was analyzed. The results from ANTS Profiler are
given below:
The difference between the two functions is approximately 8 seconds
for the sample file with 30000 genes, 6 seconds for the sample file with
5000 genes and 2 seconds for the sample file with 500 genes. In all three
cases the dll performed better than the stand-alone module in the
application.
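The input check described under item i can be sketched in Python as below. The report states only the sum rule. Parsing the text-box strings and rejecting negative numbers are additional assumptions, and the function names are hypothetical.

```python
def parse_count(text):
    """Parse a text-box entry as a non-negative whole number."""
    value = int(text.strip())  # raises ValueError for non-numeric input
    if value < 0:
        raise ValueError("count must not be negative")
    return value


def call_criteria_valid(samples_text, present_text, marginal_text):
    """Check that present + marginal calls do not exceed samples per group."""
    samples = parse_count(samples_text)
    present = parse_count(present_text)
    marginal = parse_count(marginal_text)
    return present + marginal <= samples


print(call_criteria_valid("3", "2", "1"))  # sum equals the group size
print(call_criteria_valid("3", "2", "2"))  # sum exceeds the group size
```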
5. ENHANCEMENTS:
This application can be extended to analyze more than two groups and to
analyze groups that have unequal numbers of samples. Further analysis
modules, such as an ANOVA test or clustering, can also be implemented in the
dynamic link libraries if necessary.
This project can be extended to handle the loading of malformed dlls,
duplicate dlls, deleted dlls, dlls with the same name but a different version number,
and so on. It can also be extended to handle missing data, invalid data, and duplicate
records in the Excel worksheet. Further enhancements would include handling input
formats other than Excel workbooks, such as comma-delimited and tab-delimited text
files. A dynamic way of handling data and presentation would be to implement the data
formats in XML and XSLT.
This project has been designed in a general way. Its design allows for easy
enhancement, and adding other analysis options would be straightforward. The system
is accessible to biologists and other scientists with minimal technical background.
6. REFLECTIONS:
Implementing the user interface and the analysis modules as dynamic link
libraries and coordinating all the dlls in the application was a difficult task at the
beginning, but I learnt a lot in the course of the project. Although developing and
implementing dlls for a Windows application is neither a common nor an easy solution,
I have realized that making an application modular and extensible is good design
practice.
Using Excel as a back-end can be a painful experience for programmers who
are used to programming against databases like SQL and Oracle. Unlike those databases,
Excel has no primary key to refer to records and no commit or rollback operation. The
cells, rows and columns are all referred to as Excel Range objects. Developing such an
Office solution was a whole new experience for me.
7. CONCLUSION:
This application reduces the burden of analyzing the Excel files manually. It
also reduces the load of maintaining multiple Excel tabs with the intermediate results
of the whole procedure.
This is a stand-alone application, and the executable can be installed on as
many workstations as needed. It caters to the needs of individual users who
work on gene array data and simplifies the process. The users can import their gene
lists into this tool, set their own parameters to process them, and export the analyzed
gene lists again.
In the tests described in Section 4, the dll-based module ran slightly faster
than the equivalent stand-alone code. The use of dlls also makes the application
extensible.
9. APPENDIX
Raw Affymetrix GeneChip data are stored in files with the “CAB” extension. A suite of
programs from Affymetrix is used to extract and pre-process these data. The result is a
list of the genes tested in the experiment, together with the fluorescence signal strength
for each probe and associated information. These pre-processed data are placed in an
Excel file that is used as the input to the program developed in this report.
a) Transferring the GCOS CAB files into Excel:
The Affymetrix Data Transfer Tool is one way of transferring the CAB files
into GCOS. The screenshot of the welcome page is given below, in which one can select
the transfer option and click “Next”.
Browse to the location of the data file, specify the type of the data and click on “Next”.
Select the CAB files to be imported and click on “Start”.
After the import is complete, click the “Finish” button. Depending on the number of
samples imported, experiments are created in the GeneChip Operating System (GCOS).
By clicking the analysis results of any comparison array analysis, one can view the
results.
The “Analysis Options” window gives the user various options that can be viewed in the
window.
One can save the results by clicking on the “Save As” button and one can select the
export type as Excel file.
In this way, the user can export all the necessary comparisons to Excel, and the
resulting Excel file can be used as input to the Windows application.