


Synesthesia Information Collection Method through Auditory Conversion of Visual Information

Gyumin Cho†
School of EECS, Gwangju Institute of Science and Technology
Gwangju, Republic of Korea
[email protected]

Younkyung Jwa†
School of EECS, Gwangju Institute of Science and Technology
Gwangju, Republic of Korea
[email protected]

Chang Wook Ahn‡
AI Graduate School, Gwangju Institute of Science and Technology
Gwangju, Republic of Korea
[email protected]

ABSTRACT

The human eardrum collects only sound waves as sensory information. If that sound wave information is lost or distorted, the ear cannot receive the data correctly. To overcome this limitation, we propose the "method of collecting synesthetic information." This method creates sound from data whose initial information is a visual image. Such an image can be created by recording a vibrating object that is affected by sound waves. We probe the possibility of applying an inpainting technique based on a deep convolutional network to sound restoration. Our method achieves 80% similarity between the original sound and the restored sound.

CCS CONCEPTS

• Human-centered computing → Accessibility theory, concepts and paradigms; • Computing methodologies → Artificial intelligence.

KEYWORDS

synesthetic, vibration, waveform

ACM Reference Format:
Gyumin Cho, Younkyung Jwa, and Chang Wook Ahn. 2020. Synesthesia Information Collection Method through Auditory Conversion of Visual Information. In SIG Proceedings Paper in LaTeX Format. ACM, New York, NY, USA, 2 pages.

1 INTRODUCTION

People hear sound through the eardrum, one of the sense organs, which collects auditory information. Similarly, to see an object, another sensory organ, the eye, receives light and collects visual information. If the eardrum or eyes are damaged, the collection function that the sensory organ is responsible for deteriorates, causing great inconvenience in daily life. To diminish this discomfort, people with damaged eardrums use hearing amplifiers or hearing aids, and people with damaged optic nerves use glasses. However, when the sensory organs are not working at all, other methods are needed.

We present a way to transform the initial information supplied to the senses. Our study uses images, not sound waves, as the initial information for hearing sound. Then, even in situations

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SMA 2020, September 17-19, 2020, Jeju, Republic of Korea
© 2020 Copyright held by the owner/author(s).

Figure 1: Example of inpainting. (Left) input image and (right) restored image.

where sound waves are absent or damaged, we can hear sound using only visual data. We call this "the method of collecting synesthetic information." The method helps us sense using other information even when the original initial information is missing.

Based on the characteristics of sound waves, we selected video as the input of the method. The vibration generated by sound waves affects surrounding objects, causing them to vibrate subtly. Using this phenomenon, we captured the vibration of an object on video and obtained our input sources.

We aimed to create a waveform close to the quality of ordinary music from visual information. The amount of data per unit of time determines the quality, and in general the data rates of video and audio differ greatly. To solve this problem, we used the inpainting of deep image prior (DIP) [3]. Figure 1 shows an example of inpainting. As a result, it was possible to create a waveform that looked almost like the original.

2 METHOD

We divide the method into two steps. First, we measure the degree to which an object in the video moves per unit of frames. Subsequently, inpainting reduces the difference in the amount of data per unit of time between video and audio. Our experiment then shows the result as a waveform that turns the video into audio.

In the first step, edge detection is required to measure the degree of movement of an object. We pick the Canny edge detector as the edge detection tool since it performs well at extracting a gray object's boundary [1, 2]. By calculating the movement from the initial position using the edge, the displacement is obtained.
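As a rough illustration of this first step, the sketch below tracks the mean position of edge pixels from frame to frame. It substitutes a simple gradient threshold for the Canny detector (OpenCV's `cv2.Canny` would be the paper's actual tool), and all function names here are our own, not the authors' code.

```python
import numpy as np

def edge_position(frame):
    """Mean column index of strong horizontal intensity transitions.
    A simplified stand-in for the Canny detector used in the paper."""
    grad = np.abs(np.diff(frame.astype(float), axis=1))  # horizontal gradient
    ys, xs = np.nonzero(grad > 128)                      # keep strong edges only
    return xs.mean() if xs.size else None

def displacement_series(frames):
    """Displacement of the object's edge in each frame relative to frame 0."""
    ref = edge_position(frames[0])
    return [edge_position(f) - ref for f in frames]
```

With a high-contrast object on a white background (as in the paper's setup), the gradient threshold isolates the object boundary well enough to recover sub-pixel average motion.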

Among audio data, even low-quality audio has a sampling rate of 22,050 Hz (samples per second), while high-FPS video reaches only about 1200 frames per second. Consequently, when we convert video directly into audio data, the sound is produced at a lower quality than actual audio. We chose inpainting technology as a solution. Inpainting



Figure 2: Experiment process to restore the waveform of the original sound.

is an image restoration technique that fills in the empty parts of an image using one of several methods. DIP [3] proposes an inpainting technique that can reconstruct an image without the original image, using a deep convolutional network. To apply this technique, we made a masked image like Figure 2 (A) and put it into the DIP process as an input. We filled in the empty part of the waveform and finally obtained a reconstructed waveform.
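To make the data-rate mismatch concrete: at a 22,050 Hz target rate and 1200 FPS, each frame supplies one known sample, leaving roughly 17–18 samples in between empty. A minimal sketch of building the mask that marks those empty positions (constants from the paper; names and layout are our own assumptions, not the paper's code):

```python
import numpy as np

AUDIO_RATE = 22050   # target sampling rate (samples per second)
VIDEO_FPS = 1200     # frames per second of the high-speed camera

# Each video frame yields one displacement value, so about 18 audio
# samples between consecutive frames have no value and must be inpainted.
SAMPLES_PER_FRAME = AUDIO_RATE / VIDEO_FPS   # 18.375

def known_sample_mask(n_samples):
    """Boolean mask over the audio timeline: True where a video frame
    supplies a value, False where the waveform image is left empty
    (the masked region handed to the DIP inpainting process)."""
    frame_positions = np.arange(0, n_samples, SAMPLES_PER_FRAME)
    idx = np.round(frame_positions).astype(int)
    mask = np.zeros(n_samples, dtype=bool)
    mask[idx[idx < n_samples]] = True
    return mask
```

For one second of audio, the mask marks exactly 1200 known positions out of 22,050, matching the ratio the paper describes.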

3 EXPERIMENT AND RESULTS

Figure 2 shows the experiment process. As the object for vibration, black pulp paper was used because it is strongly affected by vibration. In addition, the background color was set to white to increase the difference in brightness, facilitating edge extraction. To keep the sound waves and the object's vibration consistent, we experimented in an environment where external factors such as wind were blocked as much as possible. The song for the experiment was the classic "The Carnival of the Animals," played for about 6 seconds. A 1200 FPS ultra-high-speed camera recorded the video.

Subsequently, the captured video was divided into frames at 1/1200-second intervals. We extracted the boundary using the Canny edge detector, acquired the displacement of the object over time, and produced the first-order waveform.
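The first-order waveform is essentially the displacement series rescaled to audio amplitude. A hedged sketch of that normalization step (our own naming; the paper does not publish its code):

```python
import numpy as np

def displacement_to_pcm(displacement):
    """Rescale a displacement series to 16-bit PCM samples,
    i.e. the first-order waveform before inpainting."""
    d = np.asarray(displacement, dtype=float)
    d -= d.mean()                       # remove the DC (resting-position) offset
    peak = np.max(np.abs(d))
    if peak == 0:                       # a motionless object yields silence
        return np.zeros(d.size, dtype=np.int16)
    return np.round(d / peak * 32767).astype(np.int16)
```

The resulting samples could then be written to disk with, e.g., the standard `wave` module at the chosen sampling rate.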

Next, we used inpainting to obtain a waveform closer to the original sound. The waveform is masked at the locations that have no value and used as the inpainting input. The second-order waveform was created by filling in the masked parts after the inpainting process.
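As a minimal stand-in for the DIP network, which fills the masked waveform image using a deep convolutional prior, linear interpolation over the known samples conveys the same mask-and-fill idea in a few lines. This is an illustration only, not the paper's method:

```python
import numpy as np

def fill_masked_waveform(samples, known_mask):
    """Fill unknown samples by linear interpolation between known ones.
    A crude illustration of the mask-and-fill step; the paper instead
    reconstructs the masked waveform image with DIP [3]."""
    x = np.arange(samples.size)
    return np.interp(x, x[known_mask], samples[known_mask])
```

For a smooth signal sampled densely relative to its highest frequency, even this crude fill stays close to the original; DIP's learned image prior is what lets the paper recover sharper structure than interpolation can.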

Figure 3 shows the original waveform and the final waveform. To compare the two images, we grayscaled them and counted the number of pixels with the same value. As a result, the match rate between the waveform of the original sound and the final restored

Figure 3: (Left) waveform of original sound and (right) waveform of final restored sound.

sound reached 80%. We could hear sounds with similar rhythms and pitches.
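The 80% figure corresponds to a per-pixel match count over the two grayscale images. A minimal sketch of that metric (the function name is ours):

```python
import numpy as np

def match_rate(img_a, img_b):
    """Fraction of positions where two equally sized grayscale
    waveform images hold exactly the same pixel value."""
    assert img_a.shape == img_b.shape, "images must be the same size"
    return float(np.mean(img_a == img_b))
```

Note that exact-value matching is a strict criterion: two visually similar waveform plots that differ by one gray level at a pixel count as a mismatch there, so 80% indicates close agreement.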

4 CONCLUSIONS

We presented "the method of collecting synesthetic information" to create sound from visual information. In the first-order waveform, the pitch changed rapidly due to the FPS limitation. However, a final waveform of high fidelity was obtained through the inpainting method. Moreover, the deep convolutional network allowed us to generate the reconstruction without using the original waveform.

The results of our experiments are significant in that they propose a new form of sensory information collection. Previously, humans could not hear sound unless the initial information was a sound wave. However, the experiments in this paper show that auditory information can be collected even when the initial information is transmitted visually. Therefore, this research may be an essential key to addressing the deficit or loss of the sense organs responsible for hearing or vision.

ACKNOWLEDGMENTS

This work was supported by the Korea Foundation for the Advancement of Science and Creativity grant funded by the Korea government, and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1I1A2A01057603).

REFERENCES

[1] J. Biemond, L. Looijenga, D. E. Boekee, and R. H. J. M. Plompen. 1987. A pel-recursive Wiener-based displacement estimation algorithm. Signal Processing 13, 4 (Dec. 1987), 399–412. https://doi.org/10.1016/0165-1684(87)90021-1

[2] J. Canny. 1986. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 6 (Nov. 1986), 679–698. https://doi.org/10.1109/TPAMI.1986.4767851

[3] V. Lempitsky, A. Vedaldi, and D. Ulyanov. 2018. Deep Image Prior. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 9446–9454. https://doi.org/10.1109/CVPR.2018.00984