User:KhalilullahAlFaath/Jaringan saraf konvolusional: Difference between revisions
{{TOC limit|3}}
== Arsitektur ==
[[File:Comparison image neural networks.svg|thumb|480px|Comparison of the convolution, pooling, and dense layers of [[LeNet]] and [[AlexNet]]<br>(AlexNet's input image size should be 227×227×3, not 224×224×3, for the arithmetic to come out right. The original publication states a different number, but Andrej Karpathy, head of computer vision at Tesla, has said it should be 227×227×3 (he noted that Alex never explained why he used 224×224×3). The subsequent convolution should be 11×11 with stride 4, giving 55×55×96 (not 54×54×96). Worked out as an example: [(input width 227 − kernel width 11) / stride 4] + 1 = [(227 − 11) / 4] + 1 = 55. Since the kernel output is as wide as it is tall, its area is 55×55.)]]
{{Main|Lapisan (pemelajaran dalam)}}
A CNN consists of an input layer, [[jaringan saraf tiruan#organisasi|hidden layers]], and an output layer. The hidden layers include one or more layers that perform convolutions. Such a layer typically computes the [[Produk dot|dot product]] of the convolution kernel with the layer's input matrix, usually as a [[produk titik frobenius|Frobenius inner product]], with [[rectifier (jaringan saraf)|ReLU]] as its activation function. As the convolution kernel slides along the layer's input matrix, the convolution operation generates feature maps, which in turn serve as input to the next layer. The convolutional layer is followed by other layers, such as pooling layers, fully connected layers, and normalization layers. Here one can note the similarity between a CNN and a [[matched filter]].<ref>Convolutional Neural Networks Demystified: A Matched Filtering Perspective Based Tutorial https://arxiv.org/abs/2108.11663v3</ref>
=== Lapisan konvolusi ===
The input to a CNN is a [[Tensor (penelajaran mesin)|tensor]] with shape:
(number of inputs) × (input height) × (input width) × (input [[saluran (citra digital)|channels]])
After passing through a convolutional layer, the image is abstracted into a feature map, also called an activation map, with shape:
(number of inputs) × (feature map height) × (feature map width) × (feature map [[saluran (citra digital)|channels]]).
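The spatial size of the resulting feature map follows directly from the input size, kernel size, stride, and padding. A minimal sketch in plain Python (the helper name is ours, not from any library), reproducing the AlexNet arithmetic discussed in the figure caption above:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# AlexNet's first layer: 227x227 input, 11x11 kernel, stride 4
print(conv_output_size(227, 11, stride=4))  # 55
# With the originally published 224x224 input the division leaves a remainder:
print(conv_output_size(224, 11, stride=4))  # 54
```

The same formula applies independently to height and width, which is why a square input with a square kernel yields a square feature map.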
A convolutional layer convolves its input and passes the result to the next layer. This is similar to the response of a neuron in the visual cortex to a specific stimulus.<ref name="deeplearning">{{cite web |title=Convolutional Neural Networks (LeNet) – DeepLearning 0.1 documentation |url=http://deeplearning.net/tutorial/lenet.html |work=DeepLearning 0.1 |publisher=LISA Lab |access-date=31 August 2013 |archive-date=28 December 2017 |archive-url=https://web.archive.org/web/20171228091645/http://deeplearning.net/tutorial/lenet.html |url-status=dead }}</ref> Each convolutional neuron processes data only for its [[bidang reseptif|receptive field]].
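The sliding-kernel operation can itself be sketched in plain Python. This is a naive single-channel "valid" cross-correlation (what deep-learning libraries call convolution); the function and variable names are our own, not from any library:

```python
def conv2d(image, kernel):
    """Naive 'valid' 2D cross-correlation of a single-channel image,
    as computed per channel inside a convolutional layer."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # Frobenius inner product of the kernel with one image patch
            s = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

relu = lambda x: max(x, 0.0)  # typical activation applied to the result

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
edge = [[1, -1]]              # horizontal difference kernel
fmap = conv2d(image, edge)
print(fmap)                                         # [[-1, -1], [-1, -1], [-1, -1]]
print([[relu(v) for v in row] for row in fmap])     # [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
```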
[[File:1D Convolutional Neural Network feed forward example.png|thumb|'''1D Convolutional Neural Network feed forward example''']]
Although [[perseptron multi-lapisan|fully connected feedforward networks]] can be used to learn features and classify data, this architecture is generally impractical for larger inputs such as high-resolution images, which would require a huge number of neurons because each pixel is a separate input feature. A fully connected layer for an image of size 100 × 100 has 10,000 weights for ''each'' neuron in the second layer. Convolution reduces the number of free parameters, allowing the network to be deeper.<ref name="auto1" /> For example, using a 5 × 5 tiling region, each tile with the same shared weights, requires only 25 learnable parameters. Using regularized weights over fewer parameters avoids the vanishing gradients and exploding gradients problems seen during [[backpropagation]] in earlier neural networks.<ref name="auto3" /><ref name="auto2" />
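The parameter counts above can be checked directly; a tiny plain-Python illustration:

```python
# Weights per second-layer neuron for a 100x100 single-channel image
fully_connected = 100 * 100   # every pixel feeds every neuron: 10,000 weights
shared_5x5 = 5 * 5            # one shared 5x5 filter: 25 learnable parameters
print(fully_connected, shared_5x5)  # 10000 25
```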
To speed processing, standard convolutional layers can be replaced by depthwise separable convolutional layers,<ref>{{Cite arXiv |last=Chollet |first=François |date=2017-04-04 |title=Xception: Deep Learning with Depthwise Separable Convolutions |class=cs.CV |eprint=1610.02357 }}</ref> which are based on a depthwise convolution followed by a pointwise convolution. The ''depthwise convolution'' is a spatial convolution applied independently over each channel of the input tensor, while the ''pointwise convolution'' is a standard convolution restricted to the use of <math>1\times1</math> kernels.
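The savings from a depthwise separable convolution can be illustrated by counting weights (bias terms omitted; the function names are ours, and the channel counts are a constructed example):

```python
def standard_conv_params(k, c_in, c_out):
    # one k x k x c_in kernel per output channel
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in          # one k x k kernel per input channel
    pointwise = 1 * 1 * c_in * c_out  # 1x1 convolution mixes the channels
    return depthwise + pointwise

# A 3x3 convolution from 64 to 128 channels:
print(standard_conv_params(3, 64, 128))   # 73728
print(separable_conv_params(3, 64, 128))  # 8768
```

The separable form here needs roughly an eighth of the weights, which is where the speedup comes from.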
=== Pooling layers ===
Convolutional networks may include local and/or global pooling layers along with traditional convolutional layers. Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters; tiling sizes such as 2 × 2 are commonly used. Global pooling acts on all the neurons of the feature map.<ref name="flexible"/><ref>{{cite web |last=[[Alex Krizhevsky|Krizhevsky]] |first=Alex |title=ImageNet Classification with Deep Convolutional Neural Networks |url=https://image-net.org/static_files/files/supervision.pdf |access-date=17 November 2013 |archive-date=25 April 2021 |archive-url=https://web.archive.org/web/20210425025127/http://www.image-net.org/static_files/files/supervision.pdf |url-status=live }}</ref> There are two common types of pooling in popular use: max and average. ''Max pooling'' uses the maximum value of each local cluster of neurons in the feature map,<ref name=Yamaguchi111990>{{cite conference |title=A Neural Network for Speaker-Independent Isolated Word Recognition |last1=Yamaguchi |first1=Kouichi |last2=Sakamoto |first2=Kenji |last3=Akabane |first3=Toshio |last4=Fujimoto |first4=Yoshiji |date=November 1990 |location=Kobe, Japan |conference=First International Conference on Spoken Language Processing (ICSLP 90) |url=https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |access-date=2019-09-04 |archive-date=2021-03-07 |archive-url=https://web.archive.org/web/20210307233750/https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |url-status=dead }}</ref><ref name="mcdns">{{cite book |last1=Ciresan |first1=Dan |first2=Ueli |last2=Meier |first3=Jürgen |last3=Schmidhuber |title=2012 IEEE Conference on Computer Vision and Pattern Recognition |chapter=Multi-column deep neural networks for image classification |date=June 2012 |pages=3642–3649 |doi=10.1109/CVPR.2012.6248110 |arxiv=1202.2745 |isbn=978-1-4673-1226-4 |oclc=812295155 
|publisher=[[Institute of Electrical and Electronics Engineers]] (IEEE) |location=New York, NY |citeseerx=10.1.1.300.3283 |s2cid=2161592}}</ref> while ''average pooling'' takes the average value.
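Both pooling variants can be sketched in plain Python; this hypothetical helper pools non-overlapping 2 × 2 clusters by default:

```python
def pool2d(fmap, size=2, stride=2, op=max):
    """Pool size x size clusters of a feature map with the given operation."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            cluster = [fmap[i + a][j + b]
                       for a in range(size) for b in range(size)]
            row.append(op(cluster))
        out.append(row)
    return out

avg = lambda xs: sum(xs) / len(xs)

fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [9, 2, 1, 0],
        [4, 6, 3, 5]]
print(pool2d(fmap, op=max))  # [[7, 8], [9, 5]]
print(pool2d(fmap, op=avg))  # [[4.0, 5.0], [5.25, 2.25]]
```

Either way, the 4 × 4 map shrinks to 2 × 2, which is the dimensionality reduction described above.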
=== Fully connected layers ===
Fully connected layers connect every neuron in one layer to every neuron in another layer. This is the same as a traditional [[multilayer perceptron]] neural network (MLP). The flattened matrix goes through a fully connected layer to classify the images.
=== Receptive field ===
In neural networks, each neuron receives input from some number of locations in the previous layer. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's ''receptive field''. Typically the area is a square (e.g. 5 by 5 neurons). In a fully connected layer, by contrast, the receptive field is the ''entire previous layer''. Thus, in each convolutional layer, each neuron takes input from a larger area of the input than previous layers do. This is due to applying the convolution over and over, which takes the value of a pixel into account, as well as its surrounding pixels. When using dilated layers, the number of pixels in the receptive field remains constant, but the field is more sparsely populated as its dimensions grow when combining the effect of several layers.
To manipulate the receptive field size as desired, there are some alternatives to the standard convolutional layer. For example, atrous or dilated convolution<ref>{{Cite arXiv|last1=Yu |first1=Fisher |last2=Koltun |first2=Vladlen |date=2016-04-30 |title=Multi-Scale Context Aggregation by Dilated Convolutions |class=cs.CV |eprint=1511.07122 }}</ref><ref>{{Cite arXiv|last1=Chen |first1=Liang-Chieh |last2=Papandreou |first2=George |last3=Schroff |first3=Florian |last4=Adam |first4=Hartwig |date=2017-12-05 |title=Rethinking Atrous Convolution for Semantic Image Segmentation |class=cs.CV |eprint=1706.05587 }}</ref> expands the receptive field size without increasing the number of parameters by interleaving visible and blind regions. Moreover, a single dilated convolutional layer can comprise filters with multiple dilation ratios,<ref>{{Cite arXiv|last1=Duta |first1=Ionut Cosmin |last2=Georgescu |first2=Mariana Iuliana |last3=Ionescu |first3=Radu Tudor |date=2021-08-16 |title=Contextual Convolutional Neural Networks |class=cs.CV |eprint=2108.07387 }}</ref> thus having a variable receptive field size.
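For stride-1 stacks, each layer widens the receptive field by (kernel size − 1) × dilation; a plain-Python sketch (helper name ours) showing how dilation enlarges the field without adding weights:

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions,
    each layer given as a (kernel_size, dilation) pair."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d  # each layer widens the field by (k-1)*dilation
    return rf

# Three 3x3 layers: undilated vs. dilation rates 1, 2, 4.
# Both stacks have the same 3*3 weights per layer.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(3, 1), (3, 2), (3, 4)]))  # 15
```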
=== Weights ===
Each neuron in a neural network computes an output value by applying a specific function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning consists of iteratively adjusting these biases and weights.
The vectors of weights and biases are called ''filters'' and represent particular [[feature (machine learning)|feature]]s of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons can share the same filter. This reduces the [[memory footprint]] because a single bias and a single vector of weights are used across all receptive fields that share that filter, as opposed to each receptive field having its own bias and vector weighting.<ref name="LeCun">{{cite web |url=http://yann.lecun.com/exdb/lenet/ |title=LeNet-5, convolutional neural networks |last=LeCun |first=Yann |access-date=16 November 2013 |archive-date=24 February 2021 |archive-url=https://web.archive.org/web/20210224225707/http://yann.lecun.com/exdb/lenet/ |url-status=live }}</ref>
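The memory saving from filter sharing can be illustrated by counting stored values for one 5 × 5 filter sliding over a 28 × 28 input (the sizes are a constructed example, not from the source):

```python
# Number of receptive-field positions for a 5x5 filter on a 28x28 input
positions = (28 - 5 + 1) ** 2        # 576 positions
per_field = positions * (5 * 5 + 1)  # unshared: weights + bias per position
shared = 5 * 5 + 1                   # shared: one filter, one bias, reused everywhere
print(per_field, shared)  # 14976 26
```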