
TABLE OF CONTENTS

LIST OF FIGURES
ACKNOWLEDGEMENTS

1. INTRODUCTION
   1.1 Introduction to microarray
   1.2 Image Analysis
   1.3 Data Analysis

2. TECHNOLOGY
   2.1 Excel Database
   2.2 Microsoft .NET
      2.2.1) Why .NET?
      2.2.2) VS.NET for Office Solutions
      2.2.3) Primary Interop Assemblies (PIA)
      2.2.4) Coding with .NET Libraries

3. IMPLEMENTATION OVERVIEW
   3.1 Application Design Overview
   3.2 Algorithm Outline
   3.3 Data description
   3.4 User Interface
      1) Parent Form
      2) Detection Calls Option
      3) T-test of Signal Log Ratios Option
      4) Set the criteria for Fold Change Option
   3.5 Why DLLs
   3.6 Class Diagram of the methods
   3.7 Description of methods used in the dlls
      3.7.1 VedmetDLL.dll – SetWorksheet ()
   3.8 Description of methods used in the Options form
      3.8.1 Class: Form 1
   3.9 Features
   3.10 Screen Shots

4. TESTING

5. ENHANCEMENTS

6. REFLECTIONS

7. CONCLUSION

8. REFERENCES

9. APPENDIX
   a) Transferring the GCOS CAB files into Excel

LIST OF FIGURES

Figure 1: Gene Chip Array
Figure 2: Hybridization of tagged and untagged probes
Figure 3: Gene Chip Technology
Figure 4: Scanning of tagged and untagged probes
Figure 5: Excel Object Model
Figure 6: Application Design Overview
Figure 7: Algorithm Overview
Figure 8: Class Diagram

ACKNOWLEDGEMENTS

I sincerely thank Dr. Dan Andresen, my major professor, for his invaluable

guidance and advice throughout the whole project. I also thank him for being flexible and

accommodating during the course of the project.

I am grateful to Dr. Daniel Marcus who has helped me to understand the

background of this project. I thank him for his valuable suggestions and being supportive.

I thank Dr. Mitch Neilsen for serving on my committee and agreeing to review my

report.

I am indebted to my parents for their love and encouragement. Last but not least I

thank all my friends including Palani and my brother Dinesh for their support.

1. INTRODUCTION

The analysis of microarray data has been a time-consuming task that involves applying different algorithms to genome databases. Previously this was done with a collection of different software packages, each suited to a particular purpose. As a step toward automating the procedure, a Windows application was developed that provides a biologist-friendly user interface to the genome databases (gene lists) obtained from the GeneChip Operating Software (GCOS) 1.2. GCOS is a software package used specifically for acquiring and analyzing gene array data from the Affymetrix GeneChip platform.

1.1 Introduction to microarray:

The fundamental basis of DNA microarrays is the process of hybridization. Two

DNA strands hybridize (stick together) if they are complementary to each other.

Complementarity reflects the Watson-Crick rule that adenine (A) binds to thymine (T)

and cytosine (C) binds to guanine (G). Hybridization has for decades been used in

molecular biology as the basis for such techniques as Southern blotting and Northern

blotting. Where before it was possible to run a couple of Northern blots or a couple of

Southern blots in a day to identify a few expressed genes, it is now possible with DNA

arrays to run hybridizations to test for expression of tens of thousands of genes. This has

in some sense revolutionized molecular biology and medicine. Instead of studying one

gene and one message at a time, experimentalists are now studying many genes and

many messages at the same time. In fact, DNA arrays are often used to study all known

messages for genes of an organism. This has opened the possibility of an entirely new,

systematic view of how cells react in response to certain stimuli. It is also an entirely new

way to study human disease by viewing how it affects the expression of all genes inside

the cell.

The Technology behind DNA Microarrays:

A microarray is a solid support on which DNA of known sequence is deposited

in a regular grid-like array. The DNA may take the form of cDNA or oligonucleotides,

although other materials may be deposited as well. Typically, several nanograms (per

chip) of DNA are immobilized on the surface of an array.

Figure 1: GeneChip Array

Image courtesy of Affymetrix -

http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx

RNA is extracted from biological sources of interest, such as cell lines with or without

drug treatment, tissues from wild-type or mutant organisms, or samples studied across a

time course. The RNA (or mRNA) is often converted to cDNA, labeled with fluorescence

or radioactivity, and hybridized to the array.

Figure 2: Hybridization of tagged probes



During this hybridization, cDNAs derived from RNA molecules in the biological starting

material can hybridize selectively to their corresponding nucleic acids on the microarray

surface. Following washing of the microarray, image analysis and data analysis are

performed to quantitate the signals that are detected. Through this process, microarray

technology allows the simultaneous measurement of the expression levels of thousands of

genes represented on the array.

Figure 3: GeneChip Technology



Advantages of Microarrays:

1. It is fast; one can obtain data on the expression levels of over 50,000 genes within

one week.

2. The entire genome can be represented on a chip and thus it is comprehensive.

3. It is flexible because cDNAs or oligonucleotides corresponding to any gene can

be represented on a chip.

Disadvantages of Microarrays:

1. Many researchers find it prohibitively expensive to perform sufficient replicates

and other controls.

2. There are many artifacts associated with image analysis and data analysis.

Researchers are still figuring out how to get the “best” answers from microarray

experiments.

3. Microarrays alone are not enough; the results usually have to be validated

using an independent technique such as RT-PCR.

4. There is no standard way to analyze microarray data.

5. Answers are best obtained by combining knowledge of biology, statistics and

computing, so the learning curve is steep.

Applications of Microarray:

1. Studying the effects of drug treatment

2. Gene knock out effects

3. Gene cloning

4. Cancer research

5. Developmental biology (like stem cell populations)

1.2 Image Analysis:

When the microarray chip is illuminated by a laser beam, the RNA that has been

hybridized fluoresces, producing brightness proportional to the amount of hybridized

RNA. This image is captured by a camera and it is then processed by a computer to get

the expression levels of all the genes.

Figure 4: Scanning of tagged and untagged probes



Background subtraction:

The first step of data analysis is to correct for background across the entire array. The array is divided into equally spaced zones, and an average background value is assigned to the center of each zone. The background calculated for each cell establishes an intensity floor that is subtracted from all intensity values.

1.3 Data Analysis:

The data analysis starts with the normalization and/or scaling of the data, which is a built-in step when the microarray image is acquired by the GeneChip Operating Software (GCOS) 1.2. (Normalization and scaling can be performed by either data acquisition software or data analysis software.) The correlation between samples can be examined with methods such as cluster analysis, condition trees and principal component analysis. Once the data is acquired, it is generally exported to an Excel file where the analysis is done. The post-analysis steps include Gene Ontology classification (i.e., classifying the genes into different functional groups) and mapping the genes onto pathways from different databases (which can be performed in different software).

There are a number of commercially-available software packages that might

include the analysis steps that are implemented in this Windows application. However,

there is no perfect method for data analysis and it is up to the investigators to decide

which of the steps or algorithms to follow for their data analysis. This application is

highly customized for use in the Cellular Biophysics laboratory of the Department of

Anatomy and Physiology, College of Veterinary Medicine, Kansas State University.

2. TECHNOLOGY:

The front end has been implemented in Visual C# .NET as a Windows application. On the back end, the gene list obtained from GCOS is a Microsoft Excel spreadsheet, and the output that biologists expect from the tool is also an Excel spreadsheet.

2.1 Excel Database:

Excel spreadsheets are a straightforward solution for storing tabular data, and recent versions of the ubiquitous Microsoft Excel include some surprisingly sophisticated data access and manipulation functions.

Issues:

There are a number of reasons why Excel is not to be preferred for data management and/or statistical analysis; some of the simple reasons are:

a) There is no way to record what you have done.

b) The statistical routines are poor: it is impossible to view the source code that implements them, and several Excel procedures are misleading.

c) Routines for handling missing data were incorrect in versions of Excel prior to Excel 2000. [In reference to pre-2000 versions, "Excel does not calculate the paired t-test correctly when some observations have one of the measurements but not the other." E. Goldwater, ref. (1)].

In addition, copying the formulas to all the cells is a manual process, and sorting the columns and the datasheets can be risky.

However, the conventional tool used by biologists for analyzing large genome data sets is Microsoft Excel. Although calculating cells for the complete genome, sorting and filtering the data, and copying different results to different worksheets for subsequent manipulation are tedious tasks, Excel is still considered an easy solution. The solution I have proposed is therefore a Windows application that aids the analysis task by interfacing with Excel worksheets and manipulating them programmatically.

2.2 Microsoft .NET:

2.2.1) Why .NET?

The reasons for choosing .NET include the following:

1. .NET provides the ability to create rich clients that execute within the Common

Language Runtime. These applications utilize a new Windows forms processing

engine, called Windows Forms. Any .NET language can use Windows Forms to

build Windows applications. These applications have access to the complete .NET

Framework of namespaces and objects, and have all of the advantages which the

Framework can offer.

2. It is object-oriented and has many programming tools that allow for faster development and more functionality.

3. All applications in .NET are "garbage-collected", which means that objects are

destroyed automatically when they are no longer in use.

2.2.2) VS.NET for Office Solutions:

The key benefits of choosing VS.NET as the development environment for Office solutions are:

• Managed .NET code can be written that executes behind Word and Excel documents.

• Developers get the full advantages of the Visual Studio .NET environment.

• Developers can create applications with a more robust security model, for example by restricting code so that it executes only from a fully trusted corporate server.

• Code-behind .NET projects can be started with new Office documents, applied to existing Excel spreadsheets or Word documents and templates, and can even co-exist with current VBA-based logic.

• Using VS.NET facilitates language freedom, easier debugging, better memory management, and a more robust security model.

• The VBA programming model still exists; the .NET Office tools are just another choice.

2.2.3) Primary Interop Assemblies (PIA)

Microsoft provides official wrapper assemblies for writing managed

code against the programmable unmanaged Microsoft Office libraries. These are called

the primary interop assemblies.

The Office PIAs are installed for the following reasons:

• Develop managed Office applications within the robust Visual Studio development environment

   o Create Excel and Word solutions directly from within Visual Studio

• Develop applications more quickly with less code

• Utilize Visual Studio’s vast array of tools, easy access to web services, and access to the .NET Framework

• Leverage existing Visual Studio/VB/C# experience

2.2.4) Coding with .NET Libraries:

Here is a quick walkthrough of the Excel object model. The complete Excel object model is somewhat complicated, but only a few objects are required for our Windows application.

Figure 5 – Excel Object Model

• The Application object is the controller object of all other subsystems in the Excel application.

• Each application can have multiple Workbooks; there is one default workbook for each application, and the default workbook is returned in the ‘ThisWorkbook’ object variable.

• Each new Workbook has 3 Worksheets by default (the actual data presentation area). Developers can present data in any one of these sheets or create their own additional worksheets.

3. IMPLEMENTATION OVERVIEW:

3.1 Application Design Overview:

The application has references to the Microsoft Excel 11.0 Object Library and to the dlls that contain the analysis modules and the user interface for those analysis modules. All the references and their associated items can be managed through Solution Explorer, which is provided as part of the Integrated Development Environment (IDE). The solution is a container for the projects and solution items that can be built into the application. The Windows application interacts with the Excel database through the solution level that has all the build configurations.

Figure 6 – Application Design Overview

3.2 Algorithm Outline

The working of this application is shown in the architecture diagram below. The user imports the Excel file containing the gene list to be processed (see Appendix (a) for transferring a CAB file into Excel) and builds the analysis strategy by selecting options in the first form. All the analysis modules and their user interfaces are placed in separate dynamic link libraries, and the dlls are added as references to the Windows application. Depending on the steps selected by the user, the appropriate forms are opened by calling the dlls, and the user sets the criteria for each step in those forms. The actions in those forms call the corresponding functions in the dlls. The modules in the dlls are interfaced with the Excel spreadsheet: the analysis module corresponding to the first option selected by the user filters the data set and returns an array list containing the row numbers of the genes that match the criteria. This array list is passed to the analysis module corresponding to the second option selected by the user, and so on, so that each step filters the output of the previous one. Finally, the data set that has been filtered by all the analysis steps is exported as an Excel worksheet. It can also be previewed in the “Preview Result” textbox of the form.

Figure 7 – Algorithm Overview
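The chaining of analysis steps described in this section can be sketched as follows. This is only an illustration and not the code of the application: the names `IRowFilter`, `PredicateFilter` and `FilterPipeline` are hypothetical, a generic `List<int>` stands in for the arraylist of row numbers, and no Excel access is shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical contract for one analysis step: it receives the candidate
// row numbers and returns the subset that satisfies its criteria.
public interface IRowFilter
{
    List<int> Apply(List<int> rows);
}

// Adapter so that a step can be written as a simple test on a row number.
public sealed class PredicateFilter : IRowFilter
{
    private readonly Func<int, bool> keep;

    public PredicateFilter(Func<int, bool> keep)
    {
        this.keep = keep;
    }

    public List<int> Apply(List<int> rows)
    {
        return rows.FindAll(row => keep(row));
    }
}

public static class FilterPipeline
{
    // Runs the steps in the order the user selected them; the output of
    // each step becomes the input of the next one.
    public static List<int> Run(List<int> allRows, IEnumerable<IRowFilter> steps)
    {
        List<int> current = new List<int>(allRows);
        foreach (IRowFilter step in steps)
        {
            current = step.Apply(current);
        }
        return current;
    }
}
```

With no steps selected, `Run` returns a copy of all the rows, which corresponds to exporting the unfiltered gene list.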

3.3 Data description

The Excel spreadsheet (see Appendix (a) for transferring a GCOS CAB file into

Excel) consists of the following data:

a) Affymetrix id

b) Detection (for every single and comparison analysis):

The values of detection can be

1. “P”, (denoting the Present Call)

2. “M”, (denoting the Marginal Call)

3. “A” (denoting the Absent Call)

c) Signal Log Ratio (for every comparison analysis)

d) Change (for every comparison analysis)

e) Detection p-value (for every single and comparison analysis)

f) Description

g) Different types of annotations like common name, GenBank Accession number,

product, function, GO biological process, GO cellular component, GO molecular

function etc.

All of these fields might vary depending on the options selected by the user.

3.4 User Interface:

1) Parent Form:

The user imports the file for analysis by clicking the button “Import the samples file for analysis”. The options available in the tool are listed in the left panel; the user selects them one at a time by clicking an item in the ListBox and then clicking the button “Perform this Filter”.

The corresponding panels are opened according to the options selected by the user. After all the analysis steps are performed, the results can be exported into a separate Excel sheet from this form by clicking the button “Export the result”, which opens a SaveFileDialog so that the user can save the file at the desired location. There is also a button “Result Preview” that lets the user preview the genes that have been filtered by the analysis modules.

2) Detection Calls Option:

This is one of the options listed in the listbox of the parent form. The user interface for handling this option is located in the dll DetectionCalls.dll. On selecting this option and pressing the button “Perform this Filter”, a panel is displayed on the right.

This panel is for setting the criteria for detection calls. The user must type the number of samples in each group in the textbox next to the label “Input the number of Samples in each group”. (It is assumed that there are two groups and that each group has the same number of samples.) The next two textboxes, next to the labels “Input the number of Present calls” and “Input the number of Marginal Calls”, specify the criteria for the number of present calls and the number of marginal calls. The input in these two textboxes is validated so that their total does not exceed the number of samples entered in the first textbox (textBox1). On clicking the “Perform test” button, the application returns the number of genes that satisfy the criteria set by the user, and this number is displayed in the label at the bottom of the panel.

3) T-test of Signal Log Ratios Option:

This is another option listed in the left panel. The user interface for handling this option is located in the dll Ttest.dll. The panel has controls to set values for performing a t-test on the signal log ratio values. The textbox textBox1 holds the p-value threshold for identifying significant genes. Since the threshold should always be 0.05 (as required by the users), the textbox has been made read-only. On clicking the button “Perform t-test”, the application displays the number of significant genes that pass the criterion.

4) Set the criteria for Fold Change Option:

This is also one of the options in the listbox. The user interface for handling this option is created by the dll FoldChange.dll. The purpose of this option is to calculate the fold change of the significant genes (if significance has been evaluated before this step) or of all the genes, and to set the criteria for the up-regulated and the down-regulated genes. Here, fold change is calculated from the median of the signal log ratios. The textboxes textBox1 and textBox2 hold the cut-off fold change for determining the up-regulated and down-regulated genes respectively. The two textboxes have been made read-only and their values are set to 2.0 and -2.0 respectively. The numbers of up-regulated and down-regulated genes are displayed on clicking the “Submit” button.

3.5 Why DLLs:

1. A dynamic link library is shared in memory by the applications that use it, so system performance improves compared with building the same code into each application.

2. Each DLL can be built and tested separately.

3. DLLs can be loaded and unloaded at run time, which helps to improve application performance.

4. Large software products can be divided into several DLLs, which makes them easier to develop.

5. DLLs ease the creation of international versions.

A potential disadvantage of using DLLs is that the application is not self-contained; it depends on the existence of a separate DLL module.

Even though DLLs and applications are both executable program modules, they differ in several ways. To the end-user, the most obvious difference is that DLLs are not programs that can be directly executed. From the system's point of view, there are two fundamental differences between applications and DLLs:

• An application can have multiple instances of itself running in the system simultaneously, whereas a DLL can have only one instance.

• An application can own things such as a stack, global memory, file handles, and a message queue, but a DLL cannot.

Loading DLLs in run-time:

The .NET Framework allows an assembly to be inspected and manipulated at run time through the System.Reflection namespace and its associated classes.

A sample of the code that handles the dlls at run time is given below:

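using System;
using System.Reflection;

// Load the assembly from disk, then instantiate every public type
// that exposes a public parameterless constructor.
Assembly assembly = Assembly.LoadFrom(@"MyAssembly.dll");
foreach (Type type in assembly.GetExportedTypes())
{
    ConstructorInfo constructor = type.GetConstructor(Type.EmptyTypes);
    if (constructor == null || type.IsAbstract) continue; // nothing to construct
    object newObject = constructor.Invoke(new object[] {});
}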

This way the newObject is created from that dll.

In this application, a button with a tick mark located at the top-right corner handles this event. The file names of the dlls that are added at run time are appended to the list box in the left panel. A dll with the same name as one already present cannot be added; a message box pops up indicating that the dll already exists.
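The duplicate check on dll names can be sketched as below. The class `LoadedDllRegistry` and its members are hypothetical names chosen for this example; the report does not show how the application keeps track of the dlls it has already added, and comparing names without regard to case is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper that remembers which DLL file names were already added.
public sealed class LoadedDllRegistry
{
    // Assumption: file names are compared without regard to case,
    // as file names are on Windows.
    private readonly HashSet<string> names =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Returns false when a DLL with the same file name was added before;
    // the caller can then show the "already exists" message box.
    public bool TryRegister(string path)
    {
        string fileName = Path.GetFileName(path);
        return names.Add(fileName);
    }

    public int Count
    {
        get { return names.Count; }
    }
}
```

A caller would invoke `TryRegister` before `Assembly.LoadFrom` and append the file name to the list box only when it returns true.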

3.6 Class Diagram of the methods:

Figure 8 – Class Diagram

3.7 Description of methods used in the dlls:

All the modules that are used in the analysis and their corresponding user interfaces are placed in the dlls VedmetDLL.dll, DetectionCalls.dll, Ttest.dll and FoldChange.dll; they are invoked whenever necessary in the application. These dynamic link libraries are added as references in the Windows application. The dlls and their functions are described here:

3.7.1 VedmetDLL.dll – SetWorksheet ():

This method creates an Excel application and opens the Excel file specified by the user. It also sets the Excel workbooks, Excel sheets and the Excel worksheet of the Excel object.

3.7.2 DetectionCalls.dll – Test_1():

This method finds the number of present calls and marginal calls in each sample group for each gene and compares them with the input provided by the user; the input is passed as parameters to this function. It is implemented by creating an Excel range object over the columns of the first sample group that contain the detection calls. The Find() method of the range object is used to locate the cells containing the detection values “P” and “M”, and counters accumulate the number of present and marginal calls. If the values of the counters are greater than the values entered by the user, the flag of that row is set to true and the row number is added to an arraylist. For the genes whose flag is not set to true, another range object is created for the second group and the procedure is repeated. If the selected row number is not already in the arraylist, it is added to it. At the end, the arraylist holds all the rows that satisfy the criteria in either of the sample groups; its count is the number of genes that pass this test, and it is displayed in the form.
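The selection logic of this filter can be sketched without Excel as follows. The sketch is an illustration under stated assumptions: the detection calls are passed in as string arrays instead of being read through a range object, `DetectionCallFilter` and `SelectRows` are hypothetical names, and the criteria are treated as minimum counts (`>=`). The description above says "greater than", so the comparison should be changed to `>` if a strict test is intended.

```csharp
using System;
using System.Collections.Generic;

public static class DetectionCallFilter
{
    // Counts how many calls in one sample group equal the given code ("P" or "M").
    private static int Count(string[] calls, string code)
    {
        int n = 0;
        foreach (string call in calls)
        {
            if (string.Equals(call, code, StringComparison.OrdinalIgnoreCase)) n++;
        }
        return n;
    }

    // Assumption: the user's values are minimum counts for the group.
    private static bool GroupPasses(string[] calls, int minPresent, int minMarginal)
    {
        return Count(calls, "P") >= minPresent && Count(calls, "M") >= minMarginal;
    }

    // group1[i] and group2[i] hold the detection calls of gene i in each group.
    // firstRow is the worksheet row of gene 0 (hypothetical layout).
    public static List<int> SelectRows(string[][] group1, string[][] group2,
                                       int minPresent, int minMarginal, int firstRow)
    {
        List<int> selected = new List<int>();
        for (int i = 0; i < group1.Length; i++)
        {
            // The second group is only examined when the first one fails,
            // so a row number is never added twice.
            if (GroupPasses(group1[i], minPresent, minMarginal) ||
                GroupPasses(group2[i], minPresent, minMarginal))
            {
                selected.Add(firstRow + i);
            }
        }
        return selected;
    }
}
```

The count of the returned list corresponds to the number of genes shown in the form.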

3.7.3 DetectionCalls.dll – CreatePanel2():

This method creates the user interface for this option and returns a panel object to the options form for display. It creates a panel with 3 textboxes, 5 labels and a button. The first textbox is for getting the number of samples from the user. The next two textboxes are for specifying the call criteria for selection. On clicking the button “Perform Test”, the result, i.e. the number of genes that meet the criteria, is displayed in a label at the bottom.

3.7.4 DetectionCalls.dll – button_PerformTest_Click():

This method checks the validity of the user input: if the total of the number of present calls and the number of marginal calls is greater than the number of samples entered in the first textbox, the entry is considered invalid and a message box is displayed to say so.

This method invokes the test_1 () method of the DetectionCalls.dll class. An arraylist is returned from test_1 (), and the count of that arraylist gives the number of genes that satisfy the criteria. The count is appended to the label “Number of genes that meet the criteria:”
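A minimal sketch of the input check is given below. The names `DetectionCallInput`, `IsValid` and `TryParse` are hypothetical. Only the rule that present plus marginal calls must not exceed the number of samples comes from the description above; rejecting negative numbers, a zero sample count and non-numeric text are additional assumptions.

```csharp
using System;

public static class DetectionCallInput
{
    // Returns true when the criteria are usable: present + marginal calls
    // must not exceed the sample count. The sign checks are assumptions.
    public static bool IsValid(int samplesPerGroup, int presentCalls, int marginalCalls)
    {
        if (samplesPerGroup <= 0 || presentCalls < 0 || marginalCalls < 0)
        {
            return false;
        }
        return presentCalls + marginalCalls <= samplesPerGroup;
    }

    // Parses the three textbox strings; non-numeric text counts as invalid.
    public static bool TryParse(string samples, string present, string marginal,
                                out int n, out int p, out int m)
    {
        p = 0;
        m = 0;
        return int.TryParse(samples, out n)
            && int.TryParse(present, out p)
            && int.TryParse(marginal, out m)
            && IsValid(n, p, m);
    }
}
```

In the click handler, a false result would lead to the message box and the filter would not be run.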

3.7.5 DetectionCalls.dll – textBox1_TextChanged():

This method handles the event raised when the text in textBox1 is changed or entered by the user. The value entered by the user in textBox1 is appended next to the textboxes for entering the Present and the Marginal Calls criteria.

3.7.6 Ttest.dll – test_2():

This function performs a t-test on the whole set of genes or on the genes that have already been processed by the Detection call filter, depending on the options set by the user. The first argument of this function is an arraylist that holds the row numbers of the genes for which the t-test is to be done. If the t-test is the first option selected by the user, the arraylist holds the row numbers of all the genes. The second argument is the user’s input for the p-value, and the third argument is a flag denoting whether any other analysis module has been performed previously. The t-test is implemented as follows. An arraylist keeps track of all the column headers containing the string “Ratio”, which is a substring of “Signal Log Ratio”.

The Application object includes a property called WorksheetFunction, which returns an instance of the WorksheetFunction class. The class provides a number of useful mathematical functions, such as mean and variance, that perform calculations on ranges and hence are used in this method.

The p-value of the t-test is derived from the mean and the standard deviation of the signal log ratio values. If the calculated p-value is less than the p-value set by the user in the form, that row number is added to an arraylist, which is returned to the form.
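The calculation can be illustrated with the sketch below, which rests on two assumptions that the description does not confirm. First, it treats the test as a one-sample t-test of the signal log ratios of a gene against zero. Second, to stay self-contained it does not compute a p-value; it compares the t statistic with a two-tailed critical value supplied by the caller from a t table, whereas the application obtains its values through Excel's WorksheetFunction. The class and method names are hypothetical.

```csharp
using System;

public static class SignalLogRatioTest
{
    public static double Mean(double[] values)
    {
        double sum = 0.0;
        foreach (double v in values) sum += v;
        return sum / values.Length;
    }

    // Sample standard deviation (divides by n - 1).
    public static double StdDev(double[] values)
    {
        double mean = Mean(values);
        double sumOfSquares = 0.0;
        foreach (double v in values) sumOfSquares += (v - mean) * (v - mean);
        return Math.Sqrt(sumOfSquares / (values.Length - 1));
    }

    // One-sample t statistic for the null hypothesis "mean log ratio is zero".
    public static double TStatistic(double[] ratios)
    {
        if (ratios.Length < 2) throw new ArgumentException("Need at least two values.");
        return Mean(ratios) / (StdDev(ratios) / Math.Sqrt(ratios.Length));
    }

    // criticalT is the two-tailed critical value for n - 1 degrees of freedom
    // at the chosen level, taken from a t table (4.303 for 0.05 with 2 degrees
    // of freedom, i.e. three ratios per gene).
    public static bool IsSignificant(double[] ratios, double criticalT)
    {
        return Math.Abs(TStatistic(ratios)) > criticalT;
    }
}
```

Rows for which `IsSignificant` returns true would be the ones added to the result list.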

3.7.7 Ttest.dll – CreatePanel3():

This method creates the user interface for this particular analysis option and returns a panel object to the options form for display. The panel contains two labels, a textbox and a button. The textbox is for setting the criterion for the p-value. The result obtained after the button click has been handled is appended to the label at the bottom of the panel.

3.7.8 Ttest.dll - Button3_Click_2():

This method invokes the test_2 () method of the dll. The third argument (i.e., the flag) to test_2 () varies depending on whether the analysis method in form 2, i.e. the detection call filter, has been evaluated or not. The test_2 () function returns an arraylist, and the count of the arraylist is appended to the label “Number of Significant Genes”.

3.7.9 FoldChange.dll – test_3():

This function calculates the fold change of the significant genes identified by the t-test and compares it with the cutoffs entered by the user to determine the up-regulated and the down-regulated genes. The first argument is an arraylist holding the row numbers of the significant genes, that is, the genes that passed the t-test filter. The second and third arguments are the user’s inputs from the textboxes of the form, which set the cutoff fold change for finding the up-regulated and down-regulated genes. As in test_2 (), an arraylist is created to identify, from their headers, the columns that contain the signal log ratio values. For each row, the median of the signal log ratios is calculated by a separate function, Median (), which sorts the signal log ratios and finds the median of the sorted list. test_3 () then evaluates the fold change from this median. If the fold change is greater than or equal to the cutoff for up-regulated genes specified by the user, the row number is added to one arraylist. Similarly, if the fold change is less than or equal to the cutoff for down-regulated genes specified by the user, the row number is added to a second arraylist. Both arraylists are returned to the form.
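The classification performed by test_3 () can be sketched as follows. This is a minimal illustration in Python rather than the C# of the dll, and it makes two assumptions that the description above does not state: the signal log ratios are base-2 logarithms, and the fold change is reported as a signed value (negative for down-regulation), so that the down-regulated cutoff is a negative number such as -2. The names fold_change and classify_genes, and the dictionary layout of the rows, are hypothetical.

```python
from statistics import median


def fold_change(median_slr):
    """Convert a median signal log ratio to a signed fold change.

    Assumption: the ratio is a base-2 logarithm and decreases are
    reported as negative fold changes.
    """
    if median_slr >= 0:
        return 2.0 ** median_slr
    return -(2.0 ** -median_slr)


def classify_genes(rows, significant_rows, up_cutoff, down_cutoff):
    """Split the significant rows into up- and down-regulated lists.

    rows maps a row number to the list of signal log ratios found in
    that row (a hypothetical stand-in for the Excel worksheet).
    """
    up, down = [], []
    for row_number in significant_rows:
        change = fold_change(median(rows[row_number]))
        if change >= up_cutoff:
            up.append(row_number)
        elif change <= down_cutoff:
            down.append(row_number)
    return up, down
```

With cutoffs of 2 and -2, a gene whose median signal log ratio is 1.0 has a fold change of 2.0 and is placed in the up-regulated list.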

3.7.10 FoldChange.dll – CreatePanel4():

This method creates the user interface for the fold change analysis option. It creates a panel with seven labels, two textboxes and a button. The two textboxes set the cutoff fold change for the up-regulated and down-regulated genes. The method returns the panel object, containing all the created controls, to the options form.

3.7.11 FoldChange.dll – Median():

The fold change is calculated from the median of the signal log ratios, so this function sorts the signal log ratios and returns the median of the sorted list.
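A sort-based median of this kind might look like the sketch below, written in Python for illustration. The report does not say how a row with an even number of values is handled; averaging the two middle values is an assumption, as is the name median_of.

```python
def median_of(values):
    """Return the median of a list of numbers by sorting a copy of it."""
    ordered = sorted(values)
    count = len(ordered)
    if count == 0:
        raise ValueError("median of an empty list is undefined")
    middle = count // 2
    if count % 2 == 1:
        # Odd number of values: the middle element is the median.
        return ordered[middle]
    # Even number of values (assumed rule): average the two middle elements.
    return (ordered[middle - 1] + ordered[middle]) / 2.0
```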

3.7.12 FoldChange.dll – button1_Click():

This method submits the values entered by the user in the textboxes of the form, i.e., the cutoff values for the up-regulated and down-regulated genes. It calls the test_3 () method of the dll, with argument values that depend on which analysis options have been selected. The arraylists containing the up-regulated and down-regulated genes are returned, and the count of each is appended to the labels “Up-regulated genes” and “Down-regulated genes”.

3.8 Description of methods used in the Options form:

3.8.1 Class: Form 1

Methods:

i. buttonImport_Click (): This method is used to import the Excel file

containing the genelist to be processed. An openfiledialogbox is

created and the method SetWorkSheet in the VedmetDLL dynamic

link library is called to set the Excel application and the workbook.

ii. button_PerformFilter_Click (): This button is clicked after selecting the

analysis option in the panel. This method calls another method

DoFiltertest () in the same class.

iii. DoFiltertest (): Depending on the option selected by the user, this method

calls the other method StartAppropriateForm ().

iv. StartAppropriateForm (): This method uses switch-case statements to create the user interface for the selected option by calling the appropriate dll for each analysis step. Depending on the user’s choice at run time, the other panels are made invisible. It also adds the created panel to the set of controls of the form.

v. button_Export_Click (): This method gets the results of the analysis modules from all the forms and exports them to Excel sheets. If the analysis strategy excludes the last step, i.e., the fold change calculator, only one Excel sheet needs to be exported. If the fold change calculator is included, the result set consists of two arraylists, the up-regulated and the down-regulated genes, which are exported to separate worksheets of the Excel application. These two types of export are performed by ExportArrayListToExcel () and ExportArrayListToExcelSheets ().

vi. ExportArrayListToExcel (): This method is used for exporting the genes

that have satisfied the criteria set by the user to an Excel file. A

savefiledialogbox is opened and the user can save the file in the

desired location.

vii. ExportArrayListToTextBox (): This method is used for giving the user a

preview of the results that have been obtained. The resultant genes

are displayed in a textbox at the bottom of the form.
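The dispatch in StartAppropriateForm () and the choice between the two export routines can be sketched as below. This is an illustrative Python analogue, not the C# of Form 1: a dictionary plays the role of the switch-case statement, and the panel factories, option keys and worksheet names are all hypothetical.

```python
# Hypothetical stand-ins for the panel-creating methods exported by each dll.
def create_detection_panel():
    return {"title": "Detection Call Summarizer", "visible": True}


def create_ttest_panel():
    return {"title": "t-test", "visible": True}


def create_fold_change_panel():
    return {"title": "Fold Change Calculator", "visible": True}


# One entry per analysis option, playing the role of the switch-case branches.
PANEL_FACTORIES = {
    "detection_calls": create_detection_panel,
    "t_test": create_ttest_panel,
    "fold_change": create_fold_change_panel,
}


def start_appropriate_form(option, controls):
    """Hide the panels already on the form, then create and add the selected one."""
    if option not in PANEL_FACTORIES:
        raise ValueError(f"unknown analysis option: {option!r}")
    for panel in controls:
        panel["visible"] = False
    panel = PANEL_FACTORIES[option]()
    controls.append(panel)
    return panel


def plan_export(result):
    """Map a result to worksheets: one sheet, or two when fold change was run."""
    if isinstance(result, tuple):
        up, down = result
        return {"Up-regulated": list(up), "Down-regulated": list(down)}
    return {"Results": list(result)}
```

Keeping the options in a table means a new analysis step only needs a new entry, which is the same property the switch-case statement provides in the form.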

3.9 Features:

Some of the features of this tool are:

a) Dynamic selection of the analysis method by the user:

The analysis modules are placed in dlls, so depending on the

options selected by the user for analysis, the corresponding modules are

dynamically invoked. This feature gives the user the flexibility to

determine the analysis strategy.

b) Dynamic loading of dlls:

At run time, the user can load dlls dynamically. The user can load as many dlls as are needed to handle the events and perform the analysis modules, which makes the application scalable.

c) Extensible:

If another analysis module is to be added to this tool in the future, it can be added as a dll, because the user interface and the analysis modules are already invoked through dlls in a modular way. This makes the Windows application extensible for future enhancements.

d) Minimal user input:

Most of the textboxes used in this application have been customized to be read-only, so the user does not have to enter the same values in the text boxes every time an analysis is performed.

e) Implemented navigation – order of tabs:

The order of tabs has been configured so that the users can rapidly

interact with the user interface of the application.
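The idea behind features b) and c), loading an analysis module at run time and checking that it offers the expected entry points, can be illustrated with the sketch below. It is a Python analogue using importlib and does not reproduce the .NET loading code of the application, which the report does not show. The function name load_plugin and the required entry points create_panel and run are hypothetical.

```python
import importlib.util
from pathlib import Path

# Entry points every plugin is assumed to provide (hypothetical names).
REQUIRED_ENTRY_POINTS = ("create_panel", "run")


def load_plugin(path):
    """Load one analysis module from a file chosen at run time."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load plugin from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Reject modules that do not expose the expected entry points.
    for name in REQUIRED_ENTRY_POINTS:
        if not callable(getattr(module, name, None)):
            raise ImportError(f"{path.name} has no callable {name}()")
    return module
```

Checking the entry points at load time is one way to reject a malformed module before it is added to the form.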

3.10 Screen Shots

Form1:

This form displays the options that can be selected for the analysis.

On clicking the “Import the Samples file for analysis” button, the open file dialog box pops up

and the user can select the file to be processed.

The user can select from the options and the corresponding user interface pops up.

On clicking the button “Perform the selected filter”, the Detection Call Summarizer panel

appears and the user can set the criteria for selection.

The t-test panel appears on selecting the option in the left panel and on clicking the

button “Perform the selected filter”.

The number of genes that meet the criteria is displayed.

The fold change calculator panel appears when the corresponding option is selected in the

left panel.

4. TESTING:

i. Validated user input:

In the Detection Call Summarizer panel, the user’s input in the text boxes that specify the call criteria is validated: the sum of the present call value and the marginal call value must be less than or equal to the number of samples in each group, which the user enters in the first textbox.
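The check described above amounts to a single inequality. A minimal sketch in Python is given below; the function and parameter names are hypothetical, and the rejection of negative values is an added assumption that the report does not mention.

```python
def call_criteria_are_valid(samples_per_group, present_calls, marginal_calls):
    """Return True when the call criteria fit within one group of samples."""
    values = (samples_per_group, present_calls, marginal_calls)
    # Assumption: negative counts are never meaningful, so they are rejected.
    if any(value < 0 for value in values):
        return False
    return present_calls + marginal_calls <= samples_per_group
```

For example, with 3 samples per group, criteria of 2 present calls and 1 marginal call are accepted, whereas 3 present calls and 1 marginal call are rejected.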

ii. Difference in performance between a dll function and a functionally identical stand-alone module:

The performance of an analysis module in a dll was compared with that of a functionally identical stand-alone function, to evaluate whether the procedure calls to the dll slow down the analysis. The performance testing was done in ANTS Profiler (Advanced .NET Testing System), which records the time spent in the different modules of the source code. For this test, the test_1 () module of DetectionCalls.dll was compared with the same function handled directly by the form.

The results from ANTS Profiler for sample files of 500 genes, 5000 genes and 30000 genes are given below:

Here is a snapshot of the recording from ANTS Profiler when analyzing a sample with 500 genes.

The event handler button5_Click () in Form 1 took ~14 seconds, whereas DetectionCalls.test_1 () took ~12 seconds to produce the same result, so the dll version was about 2 seconds faster.

This is a snapshot of a recording from ANTS Profiler when the application was analyzing a sample file containing 5000 genes.

Here button5_Click () took ~122 seconds to perform the analysis, whereas DetectionCalls.test_1 () took ~116 seconds to produce the same result. This is consistent with the previous result: the dll took less time than the stand-alone module.

To determine whether the results hold for bigger sample files, a sample with 30000 genes was analyzed. The results from ANTS Profiler are given below:

The difference between the two functions is approximately 8 seconds for the sample file with 30000 genes, 6 seconds for the sample file with 5000 genes and 2 seconds for the sample file with 500 genes. In each case the dll performed better than the stand-alone module in the application.
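The timings quoted above can also be expressed relative to the time taken by the form handler. The short Python sketch below does this arithmetic for the two runs whose absolute times are quoted; the 30000-gene run is left out because only its 8-second difference is given. The names TIMINGS and relative_saving are hypothetical, and the figures are the approximate values reported above.

```python
# Approximate timings quoted in the text, in seconds: genes -> (form handler, dll).
TIMINGS = {500: (14.0, 12.0), 5000: (122.0, 116.0)}


def relative_saving(form_seconds, dll_seconds):
    """Fraction of the form handler's time saved by the dll version."""
    return (form_seconds - dll_seconds) / form_seconds


for genes, (form_s, dll_s) in TIMINGS.items():
    saved = form_s - dll_s
    share = relative_saving(form_s, dll_s)
    print(f"{genes} genes: {saved:.0f} s saved, {share:.1%} of the form time")
```

On these figures the saving is about 14.3% of the form handler's time for 500 genes and about 4.9% for 5000 genes.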

5. ENHANCEMENTS:

This application can be extended to analyze more than two groups, and groups that have unequal numbers of samples. Further analysis modules, such as an ANOVA test or clustering, can be implemented in dynamic link libraries if necessary.

The project can be extended to handle the loading of malformed dlls, duplicate dlls, deleted dlls, dlls with the same name but a different version number, etc. It can also be extended to handle missing data, invalid data and duplicate records in the Excel worksheet. Further enhancements would include handling other input formats, such as comma-delimited and tab-delimited text files. A dynamic way of handling data and presentation would be to implement the data formats in XML and XSLT.

The project has been designed in a general way, so that future enhancements and additional options are easy to add. The system is accessible to biologists and other scientists with minimal technical background.

6. REFLECTIONS:

Implementing the user interface and the analysis modules as dynamic link libraries, and coordinating all the dlls in the application, was a difficult task at the beginning, but I learnt a lot in the course of the project. Although developing and implementing dlls for a Windows application is neither a common nor an easy solution, I have realized that it is good design practice to make an application modular and extensible.

Using Excel as a back-end can be a painful experience for programmers who are used to programming in databases like SQL and Oracle. Unlike in those databases, there is no primary key to refer to records, and there is no commit or rollback operation. The cells, rows and columns are all referred to as Excel Range objects. Developing such an Office solution was an entirely new experience for me.

7. CONCLUSION:

This application reduces the burden of analyzing the Excel files manually. It also reduces the load of maintaining multiple Excel tabs with the intermediate results of the whole procedure.

This is a stand-alone application, and the executable can be downloaded to as many workstations as needed. It caters to the needs of individual users who work on gene array data and simplifies the process. The users can import their gene lists into this tool, set their own parameters to process them and export the analyzed gene lists.

The use of dlls improves the performance of the application and makes the application extensible.

9. APPENDIX

Raw Affymetrix GeneChip data are stored in files with the “CAB” extension. A suite of programs from Affymetrix is used to extract and pre-process these data. The result is a list of the genes tested in the experiment, together with the fluorescence signal strength for each probe and associated information. These pre-processed data are placed in an Excel file that is used as the input to the program developed in this report.

a) Transferring the GCOS CAB files into Excel:

The Affymetrix Data Transfer tool is one way of transferring the CAB files into GCOS. The screenshot of the welcome page is given below; select the transfer option and click “Next”.

Browse to the location of the data file, specify the type of the data and click on “Next”.

Select the CAB files to be imported and click on “Start”.

After the importing is complete, click the “Finish” button. Depending on the number of

samples imported, experiments are created in GeneChip Operating System.

By clicking the analysis results of any comparison array analysis, one can view the

results.

The “Analysis Options” window gives the user various options that can be viewed in the

window.

One can save the results by clicking on the “Save As” button and one can select the

export type as Excel file.

In this way, the user can export all the necessary comparisons to Excel, and the resulting Excel file can be used as input to the Windows application.
