final year thesis - hkust

25
Final Year Thesis Multi-Variate Data Visualization: Visual Analysis of Air Pollution Problem in Hong Kong CHAN Wing-Yi, Winnie Supervised by Professor Huamin QU Submitted in partial fulfillment of the requirements for COMP 398H in the Department of Computer Science and Engineering Hong Kong University of Science and Technology 2006 - 2007 Signature: Date: April 16, 2007 i

Upload: others

Post on 16-Oct-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Final Year Thesis - HKUST

Final Year Thesis

Multi-Variate Data Visualization:

Visual Analysis of Air Pollution Problem in Hong Kong

CHAN Wing-Yi, Winnie

Supervised byProfessor Huamin QU

Submitted in partial fulfillment

of the requirements for COMP 398H

in the

Department of Computer Science and Engineering

Hong Kong University of Science and Technology

2006 - 2007

Signature: Date: April 16, 2007

i

Page 2: Final Year Thesis - HKUST

Abstract

In this thesis we present a comprehensive system for weather data visualization. Weatherdata is in essence multivariate and contains vector field such as wind and geographical infor-mation. Different visualization techniques such as parallel coordinates and pixel bar charts areintegrated into the system. We also developed several novel methods such as circular pixelbar charts embedded into polar systems, enhanced parallel coordinates and weighted completegraphs to facilitate knowledge discovery by domain scientists. To evaluate the effectivenessand usefulness of the visualization tool, we analyzed the air pollution problem in Hong Kongusing our system and various interesting patterns have been detected.

ii

Page 3: Final Year Thesis - HKUST

Contents1 Introduction 1

1.1 Background and Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Weather Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.4 Previous Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.5 Our Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Related Work 4

3 System Overview 43.1 Data Collection and Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43.2 Visualization Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53.3 Visualization Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

4 Visualization Techniques 64.1 Polar System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

4.1.1 Circular Pixel Bars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64.1.2 Time-Series Polar System . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4.2 Parallel Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104.3 Weighted Complete Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4.3.1 Definition and Distance Metrics . . . . . . . . . . . . . . . . . . . . . . . 114.3.2 Layout Optimization and Encoding Scheme . . . . . . . . . . . . . . . . . 124.3.3 Axis Order Selection for Parallel Coordinates . . . . . . . . . . . . . . . . 12

5 Experimental Results 135.1 Correlation Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

5.1.1 Using Polar System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145.1.2 Using Parallel Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5.2 Similarities and Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155.2.1 Using Polar System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155.2.2 Using Parallel Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.3 Time-Series Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185.3.1 Using Polar System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185.3.2 Using Parallel Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . 19

6 Conclusions and Future Work 21

7 Acknowledgements 21

References 21

iii

Page 4: Final Year Thesis - HKUST

1 IntroductionWhile information is growing in an exponential way, our world is flooded with data which, webelieve, should contain some kind of valuable information that can possibly expand the humanknowledge. However, extracting meaningful information is a difficult task when large quantitiesof data are presented in plain text or traditional tabular form. Effective graphical representationsof the data thus enjoy popularity by harnessing the human visual perception capabilities.

Information visualization is the use of computer-based interactive visual representations ofabstract and non-physically based data to amplify human cognition. It aims at helping users toeffectively detect and explore the expected, as well as discovering the unexpected to gain insightinto the data. For multivariate data visualization, the dataset to be visually analyzed is of high di-mensionality. Being a specific type of information visualization, multivariate data visualization isan active research field with numerous applications in diverse areas ranging from science commu-nities and engineering design to industry and financial markets, in which the correlations betweenmany attributes are of vital interest.

We concretely studied weather data visualization as one sub-domain of multivariate data vi-sualization. We then applied the visualization technique to aid visual analysis of the air pollutionproblem in Hong Kong, which shaped a paper submitted to the IEEE Visualization Conference2007 (Vis’07) in March. In this section, we will present the background and motivations of ourwork, followed by the challenge of visualization weather data for knowledge discovery.

1.1 Background and MotivationsThe shocking deterioration of air quality in Hong Kong has aroused much attention recently [2].The city is often cloaked in a heavy haze and our picturesque skyline is barely visible over the pastfew years (see Figure 1). The filthy air not only affects our respiratory health and tourist industry,but also tremendously reduces Hong Kong’s sustainable competitive advantages. A recent annualquality-of-life study [1, 4] suggested that serious air pollution has already made Hong Kong aless attractive place for expatriates. Hypotheses including external pollutants generated from themainland China, local pollution from crowded vehicles, use of inferior coal in electric plants, andcurtain wall effect resulted from dense buildings were proposed without any formal proof yet.

Different meteorological organizations over the world have been maintaining huge databasesof weather data to provide scientific information for experts to perform trend analysis and study avariety of environmental issues. In particular, to address the Hong Kong air pollution problem, theInstitute for the Environment of the Hong Kong University of Science and Technology has devel-oped a comprehensive atmospheric and environmental database on Hong Kong and surroundingregions. They attempted to study the correlations between different attributes concerning air qual-ity with classical analysis techniques. Some interesting patterns have been found, but they fail toobtain convincing results for high-level correlations that cannot be computed with solely numericalmethods. Visualization techniques are hence demanded to assist them to detect trends and simi-larities, and to spot possible correlations between multiple attributes which are unique for certainregions.

1

Page 5: Final Year Thesis - HKUST

Figure 1: Hong Kong on a better day. The spectacular harbor view has been increasingly crippledby massive haze [1, 4, 3]

.

1.2 Weather DataWeather data possesses eminent special features. Weather data is usually recorded by automaticmeteorological stations located in representative regions at regular time intervals, thus the data isintrinsically time-varying and contains inherited geographic information. This observed data canalso serve as the basis for computer simulation in conducting sensitivity analysis and forecasting.In addition, weather data is typically multivariate that often consists of more than 10 dimensions.Wind speed and direction resulting in a vector is one of the most important attributes in weatherdata, which differentiates it from ordinary multivariate data with only scalar values.

1.3 ChallengesDespite the significance of weather data, one confronts various problems when processing and vi-sualizing the data for knowledge discovery. Traditional statistical methods do not work well dueto the complexity and size of the data. It would be very time-consuming to apply any of thesemeasures on giant datasets. It is also not intuitive as both the geographic information and vectorfeature such as wind speed and direction are completely neglected, and interesting local patterns orhigh-level correlations are less likely to be revealed through standard analysis. Visualization thusbecomes desirable for experts to gain insights into the data, but it is not trivial to visualize weatherdata because of its uniqueness. People are too familiar with polar coordinates and orientated ar-rows in representing the wind profile; it is not wise to adopt other display and therefore constraintsthe design of visualization tool. The large amount of data items and the high dimensionality alsopresent challenges in visualizing weather data effectively and efficiently for visual analytic. More-over, it is not straightforward to handle multivariate time-series data for comparing across timeand stations. Time delays may occur and different stations may exhibit similar patterns at differentpoints in time, making the situation more complicated with greater uncertainties.

2

Page 6: Final Year Thesis - HKUST

1.4 Previous ApproachesPrevious studies tended to treat weather data as common multivariate data being visualized byscatter-plot matrices, parallel coordinates and pixel-based techniques. The vector value, geo-graphic information and time-series properties are either lost or represented rather tediously. Oth-ers applied non-photorealistic brushes [8] and natural textures [19] to encode the data in a moreaesthetic way. However, the major drawback of these methods is the low scalability. They can onlyencode up to 4 dimensions without severe interference between visual channels.

1.5 Our ApproachesIn this project, we integrate several well established visualization techniques, namely polar sys-tem, parallel coordinates, and weighted complete graph into a comprehensive system for weatherdata visualization and apply our system to analyze the air pollution problem in Hong Kong. Somenovel techniques have been developed to address the special challenges posed by weather data.Specifically, we introduce circular pixel bar charts to detect correlations between wind direction,wind speed and other attributes. We demonstrate how vectors and multiple scalar attributes inweather data can be visualized effectively by polar systems with embedded pixel bar charts andtailored parallel coordinates. The weighted complete graph is employed to reveal the overall cor-relation of all data dimensions and to determine the order of axes in parallel coordinates. Based onour system, some interesting patterns have been detected by domain scientists and some valuablefeedback about various visualization techniques has been obtained. The major contributions of ourwork are as follows:

1. We demonstrated how visualization techniques can be used to attack one of the most seriousproblems in modern days. Domain scientists have gained new insight into the air pollutionproblem in Hong Kong with our advanced visualization techniques. We limit our study toHong Kong weather data but the basic system, visualization techniques, and lessons learnedcan be applied to general air quality analysis problems.

2. We developed novel methods to address the special challenges posed by weather data. Theapplications of these novel methods are not limited to weather data visualization. For exam-ple, circular pixel bar charts embedded in polar systems can be exploited to visualize othervector fields with mutli-variate attributes. The weighted complete graph can be used to de-termine the axis order of general parallel coordinates.

The remaining part of this thesis is organized as follows. Section 2 reviews some related workon weather data visualization. The system overview is then described in Section 3, followed bya detailed discussion of our approach to visualize weather data in Section 4. The experimentalresults are presented in Section 5. Finally, it is concluded with possible directions for future workin Section 6.

3

Page 7: Final Year Thesis - HKUST

2 Related WorkWeather data visualization is rarely considered as a standalone problem, instead it is commonly ad-dressed in the scope of multivariate data visualization, which sometimes overlooks the uniquenessof weather data including important vector values and time-series nature.

Treinish [20, 21, 22] has conducted various research work on weather data visualization buthis approaches are more on scientific visualization than information visualization. Healey et al.[8] used nonphotorealistic brush strokes for visualizing multidimensional information spaces likeweather data; Tang et al. [19], on the other hand, applied a controllable texture synthesis techniqueto harness natural textures for the same purpose. Each attribute value is mapped to an individualvisual channel of the texture such as luminance, scale and orientation. Although these methodsyield effective and aesthetic results, they inevitably suffer from the limited scalability as the numberof individual visual channels is believed to be around three.

In other cases, weather data visualization appears as a concrete application for a particular vi-sualization tool. Sauber et al. [14] extended existing methods to handle spatial distribution dataincluding weather data. Specifically, they used streamlines with color field for visualizing vectordistribution datasets. Wilkinson et al. [24] proposed a statistical measures for organizing mul-tivariate displays and for guiding interactive exploration, which help users discover any patternsor outliners more easily. A comprehensive system VIS-STAMP [13] is especially designed forspace-time and multivariate data, which consists of parallel coordinates, self-organizing maps, andgeneral pixel-based methods. Distribution of each attributes in the multidimensional data or eachtime frame in time-series data is shown in individual maps encoded with color.

3 System OverviewIn this section we will first introduce the data collection and the Environment Facility Center ofHKUST which kindly provided us with the comprehensive data. The necessary visualization tasksand modules are then discussed next.

3.1 Data Collection and ProcessingThe Hong Kong weather data used in our system is collected and processed by the EnvironmentFacility Center (ENVF) at Hong Kong University of Science and Technology. More than 10 mon-itoring stations (see Figure 2), including both ambient and roadside stations, have been establishedto measure various weather parameters hourly. These datasets are then maintained by the ENVFAtmospheric & Environmental Database for future studies. The database is open to the publicthrough the ENVF’s website1 which has accumulated more than 10 million visits. The websiteonly employed some primitive visualization techniques such as scatter-plots to display these data.The attributes recorded at these stations are summarized in Table 1. The data spans more than 10years and contains more than 13 dimensions. Although each station records its own set of windinformation, they are generally replaced by those observed at standard stations during the air qual-ity analysis. This is because local wind speed and direction are easily affected by surroundingbuildings and hence cannot show the actual wind condition in the territory.

1HKUST ENVF Website: http://envf.ust.hk/dataview/gts/current/

4

Page 8: Final Year Thesis - HKUST

Name UnitPrecipitation mm

Wind Direction bearingAir Temperature Degree Celsius

Wind Speed m/sDew Point Degree Celsius

Relative Humidity %Sea Level Pressure hPa

Respirable suspended particulates (RSP) ug/m3Nitrogen oxide (NO) ppb

Nitrogen dioxide (NO2) ppbNitrogen oxides (NOX ) ppbSulphur dioxide (SO2) ppb

Ozone (O3) ppbCarbon monoxide (CO) ppb

Solar Radiation mw/cm2Air Pollution Index (API) scale 100

Contributed Pollutant to API RSP, O3, NO2, SO2 or CO

Table 1: Data attributes collected at different monitoring stations.

Figure 2: Location of different air quality monitoring stations in Hong Kong.

As the worsening air pollution problem has become one of the urgent threats facing HongKong, our society is now desperately pushing for solutions to clean up our sky. ENVF has beendevoting endless efforts in conducting diverse environmental studies from atmospheric modelingto air pollution index forecasting. They also apply scientific simulation to generate continuousweather data over Hong Kong for advanced analysis.

3.2 Visualization TasksThe visualization tasks for analyzing weather data can be classified into three categories:

• Finding correlations between different attributes. For example, any correlations between airpollution index and temperature should be examined for pinpointing air pollution sources.

• Comparing data from different stations. It is always remarkable to examine the similarity ordifference at different locations, as the geographic information can greatly affect the weatherbehavior and thus lead to more accurate analysis and forecast.

5

Page 9: Final Year Thesis - HKUST

• Detecting the trend for Hong Kong’s weather and air quality. For time-series data, one of themost important issues is how to predict the future tendency based on the pattern we observetoday.

3.3 Visualization ModulesBased on the data we have and the visualization task to be accomplished, we develop a compre-hensive visualization system. Our system consists of three major visualization modules: polarsystem, parallel coordinates, and weighted complete graph. These three modules all have theirown advantages and disadvantages and can be applied in different ways to achieve visualizationgoals. Various novel techniques have been developed and integrated into each module.

Unlike common multivariate data, weather data has certain special vector dimensions such aswind direction and speed. Therefore, we introduced the embedded circular pixel bar chart for polarsystems and specifically tailored parallel coordinates to visualize these dimensions in intuitively.As there are complicated relationships among the attributes of multivariate weather data, weightedcomplete graphs have therefore been exploited to give an overview of correlations of all dimensionsand help users determine the order of axes for parallel coordinates.

4 Visualization TechniquesIn this section, we introduce the major visualization techniques employed in our system and ex-tensions we have incorporated to address the challenges posed by weather data visualization.

4.1 Polar SystemIn weather data, wind speed and wind direction are frequently used as the key attributes among allparameters. By visualizing the distribution of other attributes based on the wind profile, interestingpatterns are more likely to be observed especially for data related to air quality, where differentpollutants are correlated with the wind in some ways. Therefore, we apply the polar system, oneof the common representations for vectors, to encode weather data so that the principal wind speedand direction are shown in an intuitive manner.

Figure 3 shows the encoding scheme of the initial polar system scheme. For a particular pixelon the polar plane, its distance from the center and the angle spanning from the north direction canencode wind speed and direction respectively, while the pixel color encodes a third scalar attributesuch that its relationship with wind speed and direction is clearly shown. To generate a morereliable representation for the underlying data, as a common practice in the environmental field, anarea-preserving mapping should be applied on the distance from the center so that points locatedcloser to the center are not overcompressed (see Figure 3(c)). The simplest measure is by takingthe square root of the linearly computed distance value.

4.1.1 Circular Pixel Bars

To cope with higher dimensions, we introduce the circular pixel bar chart which can be conve-niently embedded into a polar system. The circular pixel bar chart is an extension to the PixelBar

6

Page 10: Final Year Thesis - HKUST

North

θ d

Color

(Wind Direction)

(Wind Speed)

xx

(a) (b) (c)

Figure 3: Traditional polar system: (a) Encoding scheme; (b) Mapping radius without preservingthe area; (c) Area-preserving polar system.

[12] which is derived from traditional histogram where each bar represents a group of categorizeddata with its width or height encoding the value of one parameter. The bar is then utilized to plotthose data items contained in the category with pixel-based techniques. Three more attributes maybe visualized with the x-position, y-position, and color of a pixel inside the bar.

The initial polar system first acts as a guideline for users to explore other parameters based onthe existing information provided. When users are interested in some data items lying in a certainrange of wind direction and wind speed, they can select the respective sector on the polar planeas described in Figure 4. The wind information then becomes irrelevant and a circular pixel bar isproduced only for data items falling within the sector of interest. A pixel bar is then banded andplaced in the sector region. The width and height of the original pixel bar are now transformed tothe arc and radius of the sector accordingly, showing the range of wind direction and speed that thepixel bar sector occupied. The x-position, y-position and pixel color in this circular pixel bar nowcan encode 3 additional attribute values. For example, from the original polar system, users mayobserve that the amount of sulphur dioxide (SO2) are remarkably high at certain wind direction andwind speed. They may then further examine the temperature, amount of carbon dioxide (CO2) andnitrogen dioxide (NO2) for higher level correlation under that particular circumstances. It wouldbe difficult to compute this kind of correlations by traditional correlation measures as they are onlyfound in a relatively small subset of the data. Nevertheless, it can be notably important as users aremore interested in extreme or abnormal cases in general.

Sometimes users may want to examine the pattern of a sector against the overall one to see ifthe current sector exhibits similar or exceptional distribution with that of the complete datasets.We thus introduce a complement circular pixel bar, which essentially encodes the data fallingoutside the sector, blended underneath the in-range sector. In effect, it highlights the selecteddata in the plot such that users can obtain an overall picture for tracking any provocative behavior.Subsequently, there is no re-normalization in the sector as opposed to the traditional pixel bar chart;data points are normalized based on the full range of data attribute values.

When the data size increases, multiple data items may feature the identical wind direction andspeed, which are hence mapped to the same pixel location in the polar system. The display would

7

Page 11: Final Year Thesis - HKUST

North

1

2

y

x

color

(a) (b)

Figure 4: Polar system with circular pixel bar: (a) A sector selected by users and a circular barchart embedded into the sector; (b) Blending of circular pixel bar for data falling in the sectoragainst one for its complement.

become chaotic and the overlapping of pixels may affect the accuracy of the visualization. Datafiltering and clustering can reduce the data size to a manageable level. For example, we may onlyshow one representative pixel with the average value of all the data items that possess similar winddirection and wind speed. After users select a sector and zoom in, a corresponding circular pixelbar chart will be drawn and the overlapped data items are usually mapped to different pixels.

(a) (b)

Figure 5: Comparing circular pixel bar with rectangular one: (a) Polar system with multiple circu-lar pixel bar charts; (b) Conventional pixel bars for the sectors. The overall patterns are preservedin the sector for comparison, in-depth numerical analysis may be performed on the supplementrectangular pixel bars.

A polar system with circular pixel bars is more favored by weather data visualization thanthe basic pixel bar mainly due to the unique vector information present in the weather data. Windspeed and direction are the major principal attributes from which further investigation begins. With

8

Page 12: Final Year Thesis - HKUST

(a) (b) (c)

Figure 6: Polar system with time information: (a) x-position, y-position and color of the sectorindicates the month of observation, amount of SO2, and temperature respectively (b) x-positionnow represents which day the entry was recorded; (c) y-position now encodes day and x-positionencodes month.

circularly arranged pixel bar in a well-established polar system for handling vectors, the relation-ship between wind and individual attributes is clearly revealed. Such interaction also allows usersto demand details for other parameters only on a subset of the featured data they are interested in,so that the result is less overwhelming and distracting information is minimized. This is especiallyimportant if multiple sectors are selected and compared by users because the wind information forthese sectors is presented simultaneously in an intuitive manner (see Figure 5(a)).

One major concern for the circular pixel bar chart is that the regular shape of the basic plot isdistorted, which may affect the accuracy of data analysis. In spite of this, it certainly takes advan-tage of arranging the detailed plots in a way effectively related to the wind speed and direction,which are often the most dominating attributes during analysis. While the accuracy of interpretingthe result may be diminished, it is capable for comparing the pattern between sectors efficientlywith the wind profile provided on the spot, which cannot be achieved by placing the original pixelbar in any mean especially when the number of selected sectors is relatively large. To facilitatequantitative studies, the conventional rectangular pixel bars are also provided alongside with thepolar system, as presented in Figure 5(b).

4.1.2 Time-Series Polar System

The above polar system does not distinguish between data recorded in different time periods. Ourpolar system can also visually represent the time information of the data. For easier comparison,we may show the time information in the sector pixel bar. Users may specify a certain time rangeand the corresponding data items are then plotted within the sector, where the x-position and y-position are possibly mapped to time units such as hour, day, month, and year. The data patternagainst time is now clearly shown, helping users to detect any significant distribution at a particularpoint of time. When only one of the time units is chosen, it is scaled to the entire timeline of theinput dataset or the user-defined time range if any. A different scaling strategy is applied whenmore than one time unit are selected as plotting parameters. The smaller time quantity will bescaled according to the larger time quantity, unlike the previous case which will span the complete

9

Page 13: Final Year Thesis - HKUST

(a) (b) (c)

Figure 7: Different layouts of parallel coordinates: (a) Traditional layout; (b) Circular layout; (c)S-style layout

period of time. For instance, if day is the only time unit specified, it will then be scaled over theavailable time periods of the input data. Contrastingly, the day information will be scaled basedon the number of days in a month if the parameter month also serves as one encoding attribute, asillustrated in Figure 6. Moreover, users may define the time interval in which they wish to dividethe data and subsequent polar plots will be generated to show the trend over time.

4.2 Parallel CoordinatesParallel coordinates [9, 10] is a powerful visual technique where attributes are represented byparallel vertical axes. Each data item is encoded by a polygonal line that intersects each axis atrespective attribute data value. Parallel coordinates is also integrated into our system and severalextensions have been made to improve its effectiveness for weather data visualization.

As we mentioned earlier, wind direction is one of the most important attributes in the air qual-ity analysis. However, the vertical axis used in traditional parallel coordinates is not effective atencoding directions. To address this problem, we experimented with different styles of axis forwind direction (see Figure 7) and adopt the S-shape axis in our system. An S-shape line is morenatural than a vertical one to indicate direction and it also stands out among all axes to attract users’attention.

Parallel coordinates has been thoroughly studied and a lot of excellent techniques are availableto reduce visual clutters caused by too many crossing lines [6, 11, 25]. For example, polylineaveraging, which represents a range of polylines with an averaged one, can be used for dynamicallysummarizing a set of polylines [18]. However, the detailed information of the data might be lostduring the “clustering” or “summarization” processes. To mitigate this problem, we also draw ascatterplot above every pair of neighboring axes for accurate quantitative analysis (see Figure 8).Then each line between the parallel axes becomes a point in the scatterplot. Because points useless space than lines do and thus the overall visual clutter is significantly reduced.

4.3 Weighted Complete GraphIn our system, both polar systems and parallel coordinates can be used to detect possible correla-tions between multiple attributes. However, polar systems with embedded circular pixel bar chartscan only reveal relationship between five dimensions, while correlations between attributes repre-sented by neighboring axes can usually be detected in parallel coordinates. Sometimes users maywant to explore the overall relationship among all data dimensions. Therefore we propose a new

10

Page 14: Final Year Thesis - HKUST

Figure 8: Enhanced Parallel Coordinates with S shape axis to encode wind direction and scatterplotto reveal bi-variate relationship between neighbor axes.

technique, called weighted complete graph, to act as a guide map for polar systems and parallelcoordinates for displaying the relationship between dimensions at a high level.

Recently, Sauber et al. [15] proposed multifield graphs as visual aids to intuitively provideinformation about the amount of correlations contained in each correlation field. However, onedrawback of this technique is that the number of nodes in the graph grows exponentially with thedata dimensionality. Although they proposed an optimal strategy to reduce the number of nodes toa certain degree, it is still not straightforward for users to perceive the correlation information ofall nodes at once.

In the weighted complete graph, each node represents one dimension of the data and the weightof edge encodes the correlation between the adjacent nodes. Compared with multifield graphs, thenumber of nodes in weighted complete graphs is greatly reduced and the topology of the graph notonly reveals the relationship between any two dimensions, but also shows the overview relationshipamong all dimensions. The use of the weighted complete graph in the system is two-fold: tovisualize the overview relationship among all dimensions, and to generate an optimized axis orderof parallel coordinates interactively or automatically.

4.3.1 Definition and Distance Metrics

A (finite) graph G (V , E) is a (finite) non-empty set of vertices V and a (finite) set of edges E suchthat all elements vi, v j of E are unordered pairs of distinct elements of V (i.e. G is undirected).Let n denote the number of vertices and m the number of edges of G. A complete graph is a graphin which each pair of graph vertices is connected by an edge. A weighted complete graph is acomplete graph where each edge has an associated weight. In our application, we can use thenodes to encode the attributes and use the weight of edge to encode the strength of correlationsbetween two nodes.

The weight of edge between two nodes in the graph can be computed using different metrics.There are several correlation measures available for two variables. A common measure, called cor-relation coefficient, can detect linear dependencies for normally distributed data. Another metric,pointwise mutual information, which is based on entropy, is able to detect more general dependen-cies but at the cost of much longer computation time. After testing with various metrics, we finally

11

Page 15: Final Year Thesis - HKUST

adopt the standard correlation coefficient in our system. The correlation between two dimensionsXi and X j is defined as:

Cs(Xi,Yj) =‖(Xi−X i)(X j−X j)T‖

((Xi−X i)(Xi−X i)T )12 ((X j−X j)(X j−X j)T )

12

(1)

4.3.2 Layout Optimization and Encoding Scheme

As viewers naturally interpret closely-positioned nodes in a graph as strongly related [7], we needa layout algorithm for all the nodes in a weighted complete graph to reflect the relationships ofall dimensions. Barnes and Hut [5] initially proposed a type of multi-scale algorithm for thesimulation of astronomical systems. The algorithm has been introduced to the graph drawingfield [17, 23]. Energy-based methods are popular for creating straight-line drawings of undirectedgraphs. Noack [16] proposed the edge-repulsion LinLog energy model whose minimum energydrawings reveal the clusters of the drawn graphs.

In our system, we use the LinLog energy model [16] with the Barnes-Hut algorithm to displayweighted complete graphs. The weight of the edge is computed by the correlation metric intro-duced in the previous section. Because the weighted graph in our system is complete, edges in thegraph are eliminated by setting thresholds to avoid visual clutter.

Furthermore, we introduce several encoding schemes to make the weighted complete graphdrawing more meaningful, which can give a better relationship information overview. First, wecan use the size of a node to encode the accumulated correlations between this particular nodeand all other nodes. Under this strategy, the area of the node is proportional to the sum of thecorrelation measures between the node and the other nodes. Thus, a bigger node means that thenode has larger accumulated correlation measures and may have strong relationship with othernodes. Second, we can set a threshold for the edge weight and remove all low values and draw theremaining ones with lines to visually represent the strength of the correlations. The pattern, width,and color of lines can be used to encode the correlation coefficients. This filtering strategy can helpusers locate strong relationships between different nodes easily.

Our layout optimization algorithm can also be applied to a subgraph. Sometimes users maywant to see only the relationship among certain dimensions. Under this circumstance, our systemallows users to select attributes they are interested in and render the corresponding subgraph usingthe layout algorithm.

4.3.3 Axis Order Selection for Parallel Coordinates

In parallel coordinates, different orders of the axes could reveal different aspects of the dataset andthe order in which the axes are drawn is critically important for effective visualization. Axes rep-resenting the attributes with potential correlations should be placed together so that the resultingpatterns can be better shown. In the weighted complete graph, it displays the overall relationshipsof all attributes and attributes having strong correlations tend to appear closer in the graph. There-fore, users can apply the weighted complete graph as a guideline in adjusting the axis order ofparallel coordinates. Three strategies have been proposed to determine the axis order of parallelcoordinates based on the weighted complete graph.

12

Page 16: Final Year Thesis - HKUST

(a) (b) (c)

(d)

Figure 9: Weighted complete graph: (a) the layout of a weighted complete graph with size of nodesencoding the accumulated correlation coefficients; (b) the graph with color of edges encodingthe correlation coefficients between two nodes. The edges with small weights are removed forclarity. (c) Users manually select the order of nodes in the graph. (d) The corresponding parallelcoordinates with color encoding API.

1. Manual method. Based on the weighted complete graph, user can decide the axis order ofthe parallel coordinates manually. In the graph, we also draw some “important” edges ashints to help user in making better decision (see Figure 9).

2. Automatic method. Finding the optimal axis order of parallel coordinates to minimize theoverall path length in the weighted complete graph can be regarded as the classic TravelingSalesman Problem (TSP), which can be solved by various optimization methods [26].

3. Semi-automatical method. This interactive method consists of three steps. First, users selectnodes they are interested in from the graph. These nodes will be placed next to each other inthe parallel coordinates and the order of these nodes can be determined by the users or theautomatic method. We then group these nodes into a new node and redraw the graph. Lastlywe apply the automatic method to compute the order of nodes in the new graph.

As there are only 13 dimensions in our data and the layout can be easily grasped by users, theycan manually choose the axis order of the parallel coordinates (see Figure 9).

5 Experimental ResultsThe whole system is developed and installed on a Dell Precision mobile workstation M70 with 2GB memory and a 256M Nvidia Quadro FX Go 1400 graphics card. The VTK library is used forrendering. For all the experimental results in this section, almost real time feedback is provided tousers after the data is loaded into memory.

13

Page 17: Final Year Thesis - HKUST

(a)

(b) (c)

Figure 10: Detecting the correlation between Air Pollution Index (API) and other attributes whenAPI is high. (a) Initial polar system with color denoting API value. The northwest sector ischosen, plotting RSP against solar radiation. (b) Plotting RSP against SO2 instead, high API value(red pixels) are not found when SO2 is high, revealing SO2 contributed little to API. (c) Y-positionnow becomes O3 clearly correlated with API. For (b) and (c), suspicious clusters are shown (a bluecluster behind a green one), immediately holding domain experts’ attention.

5.1 Correlation DetectionIn the first set of examples, we tested the effectiveness of our method on detecting correlationsbetween different attributes of the weather data. Finding correlations between multiple attributesis always one of the major visualization tasks for multivariate data visualization. In this section,we present how our system can assist users in detecting informative correlations.

5.1.1 Using Polar System

The example in Figure 10 aims at finding the correlation between Air Pollution Index (API) andother dimensions. Our polar plots can facilitate the users to examine any possible correlationsbetween wind direction, wind speed, and any other two attributes chosen by users. We were onlyinterested in the condition under serious air pollution, which can be selected by picking a relevantsector. Obviously, API has no relationship with the solar radiation (Figure 10(a)). The pixel barhere shows that API is highly correlated with Respirable Suspended Particulates (RSP) encoded bycolors. In fact, the positive correlations between API with RSP and Ozone (O3) in Hong Kong areknown to the experts, as the polar system demonstrates in Figure 10(c). In addition, Figure 10(b)suggests that SO2, being mapped to the y-position, does not have a strong correlation with APIthough it is one of the major air pollutants taken into account for computing API. It has explainedby the environmental scientists that the contribution of SO2 is negligible compared with RSP and

14

Page 18: Final Year Thesis - HKUST

Figure 11: Detecting correlations of the same set of data by Parallel Coordinates, with color de-noting API value.

O3 in the API calculation. Significantly, the experts responded instantly to the two distinct clusters,one in blue and another in green in the two latter figures. The rationale behind such findings is stillunknown and could possibly initiate a new research direction in the environmental aspect.

5.1.2 Using Parallel Coordinates

Next we demonstrated how similar conclusions can be drawn from the data represented in par-allel coordinates in Figure 11. A gradual color change is perceived at the axis for RSP and O3as expected, indicating they are positively correlated with API. In contrast, a group of red linespassing through the SO2 axis at a low value which implies a high API reading does not necessarilyattribute to a large amount of SO2. Besides, the fact that the solar radiation and temperature arenot related to API is revealed by the messy colored lines found at their vertical axes. Although itis more difficult to assess the impact of wind direction and wind speed in parallel coordinates, it ismore effective to explore the correlations between multiple dimensions than a polar system. Forinstance, NO2 and CO or NO and NOX display some partial relationships in the graph that is worthinvestigating in the future.

5.2 Similarities and DifferencesThe Hong Kong society mostly weighs external factors more in tackling the air pollution problem.Many of us believe that the air pollutants are blown in from the flourishing factories on the PearlRiver Delta, the manufacturing heart of China, located at the northwest of Hong Kong and veryoften ignore the pollution incurred locally. Other parties hold a conviction that the monopolisticpower plants, the excessive number of vehicles and vessels should be responsible for the poor airwe are suffering in town.

5.2.1 Using Polar System

To judge the two adverse statements, we first visualized the amount of sulphur dioxide (SO2)recorded by nine air-monitoring stations in Hong Kong for the past three years in Figure 12. As theenergy sector and vehicular exhaust are the two major emission sources of SO2, it may provide usclues on less apparent internal pollution. All of them exhibit a relatively high SO2 amount when thewind speed is high and originated from the northwest direction. The high wind speed suggests thatSO2 is likely brought from the northwest region outside Hong Kong, which coincides with whatmost people suspect. Yet, much to our surprise, the station in Kwai Chung depicts a significantly

15

Page 19: Final Year Thesis - HKUST

(a)

(b)

(c)

Figure 12: Tracking possible internal and external pollution source with 9 stations in the past threeyears. Pixel color represents the amount of SO2 recorded in individual station. Most of them showrelatively large SO2 amount with strong northwest wind, but station Kwai Chung has the highestSO2 value when there is a southwest wind of all wind speed.

16

Page 20: Final Year Thesis - HKUST

Figure 13: Comparing two stations, Kwai Chung and Tung Chung with parallel coordinates usingcolor to represent the wind direction. Clusters of wind direction records are found in Tung Chungstation but not in Kwai Chung.

different relationship between SO2 and the wind. It recorded a large amount of SO2 even whenwe are experiencing southwest wind regardless of the wind speed. This probably implies that thepollution should be closely related to internal factors because the wind speed does not play animportant role here. The geographic location is then taken into consideration to explore the causesto the abnormal pattern to the southwest of Kwai Chung. It turns out that one of the world’s busiestport, Kwai Tsing Container Terminals, is operating at the southwest of Kwai Chung. The localpollution observed can hence be attributed to cargo ships frequently entering and leaving the port.Although pollutants generated externally usually affect most stations, we should not overlook theinternal factors particularly causing severe air pollution in several districts with congested trafficon land or in waters.

For more in-depth analysis, we compared Kwai Chung station with another station to studyhow it behaviors differently. In Figure 12(b), we chose the northwest of Tung Chung stationplot which also registers a high SO2 reading with sector selection and plotted the amount of SO2against day with color encoding Air Pollution Index (API). Kwai Chung data generally shows ahigher API value for higher recorded SO2 values than Tung Chung station. As we discussed in theprevious section, SO2 is not the main pollutant contributing to API under normal circumstances,so it again suggests that the local pollution resulted from heavy SO2 emission by vessels is in fact adominating factor in the Kwai Chung region. Prompt remedial action should therefore be initiatedbefore worse pain and greater costs are incurred than an early cure for environmental sores.

5.2.2 Using Parallel Coordinates

Figure 13 compares two stations, Kwai Chung and Tung Chung, using parallel coordinates. AtKwai Chung station, API is not strongly related to the wind direction whereas at Tung Chungstation clusters of red and blue lines, representing winds from the north and northwest, can be seenat the API axis. Moreover, yellow and green lines that denote southwesterly winds are mainlyconnected to the lower API value in Tung Chung. But for Kwai Chung, the color spreads diverselyand a noticeable number of yellowish lines marks the highest API, which agrees with what wediscovered from the polar system. Apart from that, while other dimensions give undifferentiablepatterns, these two stations experience fairly different distributions for O3. Winds originating from

17

Page 21: Final Year Thesis - HKUST

Figure 14: Visualizing time-series data. Each row comprises of polar plots for different station,namely Tung Chung, Yuen Long and Mong Kok, in different period of time from March 2004 toMarch 2007 in an interval of 6 months. Pixel color encodes the Air Pollution Index (API).

the east in cyan apparently yield a larger O3 record. Such instant visual aids are truly beneficial toexperts for advanced exploration.

5.3 Time-Series TrendOne of the outstanding features of weather data is its time-series nature. Weather essentially varieswith time in seasonal basis, and it may also signify certain trends over time when the global climateis changing in the long-run. Consequently our system supports querying with time range and isable to generate desired results based on user-defined time interval.

5.3.1 Using Polar System

In Figure 14, data of the three successive years for three different stations are visualized with timeinterval set to six months. The two time periods, March to August and September to February,display similar distributions over the past three years. The directions of the wind observed arealso opposite to each other, which is typical to subtropical regions like Hong Kong which havedistinguishable seasons. Furthermore, they also clearly show that the air quality is worse in thewinter than in the summer with a higher API value.

Despite the fact that we are experiencing worse air quality in the recent years, these time-seriessequences does not present any growing trend for the API value. It is possible that the overalldistribution remains rather constant and the variation that we are looking for is subtle and obscure.More thorough investigation is then conducted for spotting any minor abnormality, by showing thedetailed distribution on year basis with sector selection described in Figure 15 for Kwai Chung.The red pixels clearly outstand from the first pixel bar, suggesting that local pollution from SO2emission is significant in 2004. It is glad to observe a slight improvement in the following years,

18

Page 22: Final Year Thesis - HKUST

Figure 15: Time-series polar plots for Kwai Chung station focusing on the impact of local pollutionfrom the southwest direction. X-position, y-position and color of the sector encodes day, SO2 andAPI accordingly. Prominent red pixels are mainly seen in year 2004.

inferring that local pollution has become less dominating in the district.In addition to visualizing time-series data in different time frames, we can simultaneously

utilize the pixel sector to review static patterns in the detailed time domain as well. In Figure 16,we are interested in the API value in different time of a day at Mong Kok station when high RSPvalue is logged. The highest API is mainly present in the afternoon period, sharply represented bypixels in red. All these years yield similar API patterns, and the lowest API is found around Aprilto June.

5.3.2 Using Parallel Coordinates

To visualize time-series data, we do not simply render a sequence of parallel coordinates. Instead,we first apply the polar system for sampling and then add a time axis in the parallel coordinatesto show the time domain. As demonstrated in the previous examples, polar system provides theusers an intuitive interface to select the data that they are interested in for in-depth analysis. Al-though we supplied the pixel sector for viewing more attributes, it was not able to give a generaloverview on every dimension in the data as parallel coordinates does. But the major deficiency ofparallel coordinates is that clusters and overlapping become severe that it is difficult to spot anyinteresting pattern. By combining the two techniques, unnecessary data items are filtered, leavingthe important time-series data represented effectively in parallel coordinates for detailed studies,as illustrated in Figure 17(b).

Users first selected the items with high RSP values and one parallel coordinates on this subsetwas shown for each year. With the help of the weighted complete graph (Figure 17(a)) to arrangemore correlated oxygenic attributes closer together, one may quickly notice that polylines for year

19

Page 23: Final Year Thesis - HKUST

Figure 16: Air pollution in different times of a day per year for Mong Kok station; x-position ofsector mapped to day, y-position mapped to hour and color encoding API value. All 3 plots showa similar API trend over a year, but year 2005 generally has less severe air pollution. High APItended to appear in the afternoon and is mostly found in year 2006.

(a)

(b)

Figure 17: 3-year time-series data of Yuen Long district, data is constrained to a range of windspeed and direction by sector selection. (a) Weighted complete graph for each year, year 2006gives the simplest correlation profile (b) Parallel Coordinates with a time axis, color also encodingtime value for clarity.

20

Page 24: Final Year Thesis - HKUST

2006 plot is elegantly clustered together for most dimensions, except for time and temperature. Allthree of them yields similar figures. The temperature also varies dramatically, in contrasting to thatin 2004 and 2005. In the vivid display for 2004, unusual yellow lines are seen at high RSP andNO2 values, resulting in the largest API in this set of data. Such abnormalities are not found in theother result, which indicates during this particular time in 2004, some mysterious factors causedthese unexpected high NO2 records. For SO2, NO and NOX, they reveal a rather constant patternthroughout the three years, which can also be seen from the weighted complete graph that theseparameters remain highly correlated in all three years; while a decreasing trend of O3 is observedwhen strong winds are blowing from the north.

6 Conclusions and Future WorkIn this thesis we proposed a comprehensive system for weather data visualization. Various visual-ization techniques are integrated into the system and several novel techniques have been developed.Our system has been applied to analyze the air pollution problem in Hong Kong and some inter-esting patterns have been detected by domain scientists using our system.

In the future, we plan to continue our work with domain scientists and make our system avail-able to the public through the website of ENVF at HKUST. Meanwhile, we will incorporate newdatasets such as visibility and PM2.5 into the existing system for further exploration. The rela-tionship between these two parameters are known to be not straightforward, but more in-depthanalysis possibly via our system is demanded by domain scientists. Besides, data transformationis also another highly relevant additional features which allows users to focus on a particular setof data. Very often experts are only interested in the oxide content of the pollutant and they maycompute the sum of oxygenate substances with different weight to seek any revealing findings.

7 AcknowledgementsWe wish to thank Professor Alexis Lau and Dr. Zibin Yuan at the Environmental Facility Center ofthe Hong Kong University of Science and Technology for providing the data and valuable feedbackin this project.

Personally, I also wish to thank Professor Huamin Qu for his guidance and encouragement; MrAnbang Xu for his novel ideas on deploying weighted complete graph into our system and othernotable contributions during paper submission period; and Mr Peter Kai-Lun Chung for his helpin system implementation. I am truly grateful that I could be engaged into real research projectsand work on a research paper during my final year undergraduate studies. Finally, I wish to thankeveryone in our research group for their support and tolerance.

References[1] Hong Kong’s air pollution hits the hightest in seven years. BBC Chinese, September 2002.

[2] Hong Kong’s bad air days. Cover Story, Time Asia, December 13 2004.

[3] Let there be light. Time Asia, May 8 2006.

[4] HK’s air pollution drags it down in eyes of expats, survey shows. South China Morning Post, March 15 2007.

21

Page 25: Final Year Thesis - HKUST

[5] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446–449, 1986.

[6] M. de Oliveira and H. Levkowitz. From visual data exploration to visual data mining: A survey. IEEE Transac-tions on Visualization and Computer Graphics, 19(3):378 – 394, 2003.

[7] E. Dengler and W. Cowan. Human perception of laid-out graphs. In Proceedings of the International Symposiumon Graph Drawing, pages 441 – 443, 1998.

[8] C. G. Healey, L. Tateosian, J. T. Enns, and M. Remple. Perceptually based brush strokes for nonphotorealisticvisualization. ACM Trans. Graph., 23(1):64–96, 2004.

[9] A. Inselberg. Multidimensional detective. In Proceedings of the IEEE Symposium on Information Visualization,pages 100–107, 1997.

[10] A. Inselberg and B. Dimsdale. Parallel coordinates: A tool for visualizing multidimensional geometry. InProceedings of the IEEE Symposium on Visualization, pages 361–378, 1990.

[11] D. A. Keim. Information visualization and visual data mining. IEEE Transactions on Visualization and ComputerGraphics, 8(1):1 – 8, 2002.

[12] D A. Keim, M. C. Hao, and U. Dayal. Hierarchical pixel bar charts. IEEE Trans. Vis. Comput. Graph., 8(3):255–269, 2002.

[13] K. Liao. A visualization system for space-time and multivariate patterns (vis-stamp). IEEE Transactions onVisualization and Computer Graphics, 12(6):1461–1474, 2006. Member-D. Guo and Student Member-J. Chenand Member-A. M. MacEachren.

[14] A. Luo, D. Kao, and A. Pang. Visualizing spatial distribution data sets. In VISSYM ’03: Proceedings of the sym-posium on Data visualisation 2003, pages 29–38, Aire-la-Ville, Switzerland, Switzerland, 2003. EurographicsAssociation.

[15] H. Theisel N. Sauber and H.-P. Seidel. Multifield-graphs: An approach to visualizing correlations in multifieldscalar data. IEEE Transactions on Visualization and Computer Graphics, 12(5):917 – 924, 2006.

[16] A. Noack. Energy-based clustering of graphs with nonuniform degrees. In Proceedings of the InternationalSymposium on Graph Drawing, pages 309–320, 2005.

[17] A. J. Quigley and P. Eades. Fade: Graph drawing, clustering, and visual abstraction. In Proceedings of theInternational Symposium on Graph Drawing, pages 197– 210, 2000.

[18] H. Siirtola. Direct manipulation of parallel coordinates. In Proceedings of IEEE Visualization, pages 373 – 378,2000.

[19] Y. Tang, H. Qu, Y. Wu, and H. Zhou. Natural textures for weather data visualization. pages 741–750, 2006.

[20] L. A. Treinish. Task-specific visualization design: a case study in operational weather forecasting. pages 405–409, 1998.

[21] L. A. Treinish. Multi-resolution visualization techniques for nested weather models. pages 513–516, 2000.

[22] L. A. Treinish. Visual data fusion for applications of high-resolution numerical weather prediction. pages 477–480, 2000.

[23] D. Tunkelang. A Numerical Optimization Approach to General Graph Drawing. PhD thesis, School of ComputerScience, Carnegie Mellon University, 1999.

[24] L. Wilkinson and A. Anand. High-dimensional visual analytics: Interactive exploration guided by pairwiseviews of point distributions. IEEE Transactions on Visualization and Computer Graphics, 12(6):1363–1372,2006. Member-Robert Grossman.

[25] J. Yang, W. Peng, M. O. Ward, and E. A. Rundensteiner. Interactive hierarchical dimension ordering, spacingand filtering for exploration of high dimensional datasets. In Proceedings of the IEEE Symposium on InformationVisualization, pages 105 – 112, 2003.

[26] P. Yip and Y. Pao. Combinatorial optimization with use of guided evolutionary simulated annealing. IEEETransactions on Neural Networks, 6(2):290 – 295, 1995.

22