User:KhalilullahAlFaath/Convolutional neural network: Difference between revisions

{{TOC limit|3}}
 
== Architecture ==
[[File:Comparison image neural networks.svg|thumb|480px|Comparison of the convolution, pooling, and dense layers of [[LeNet]] and [[AlexNet]]<br>(AlexNet's input image size should be 227×227×3, not 224×224×3, so that the arithmetic works out. The original publication gives a different number, but Andrej Karpathy, head of computer vision at Tesla, said that the input image size should be 227×227×3 (he noted that Alex did not explain why he used 224×224×3). The next convolution should be 11×11 with a stride of 4: 55×55×96 (not 54×54×96), since, for example: [(input width 227 − kernel width 11) / stride 4] + 1 = [(227 − 11) / 4] + 1 = 55. Because the kernel output is as tall as it is wide, its area is 55×55.)]]
{{Main|Layer (deep learning)}}
A CNN consists of an input layer, [[artificial neural network#organization|hidden layers]], and an output layer. The hidden layers of a CNN include one or more layers that perform convolutions. Such a layer typically computes the [[dot product]] of a convolution kernel with the layer's input matrix, generally as a [[Frobenius inner product|Frobenius inner product]], with [[rectifier (neural networks)|ReLU]] as its activation function. As the convolution kernel slides along the layer's input matrix, the convolution operation generates a feature map, which in turn serves as the input to the next layer. The convolutional layer is followed by other layers, such as pooling layers, fully connected layers, and normalization layers. Note here the similarity between a CNN and a [[matched filter]].<ref>Convolutional Neural Networks Demystified: A Matched Filtering Perspective Based Tutorial https://arxiv.org/abs/2108.11663v3</ref>
 
=== Convolutional layers ===
In a CNN, the input is a [[tensor (machine learning)|tensor]] with shape:
 
(number of inputs) × (input height) × (input width) × (input [[channel (digital image)|channels]])
 
After passing through a convolutional layer, the image is abstracted to a feature map, also called an activation map, with shape:
 
(number of inputs) × (feature map height) × (feature map width) × (feature map [[channel (digital image)|channels]]).
 
Convolutional layers convolve the input and pass the result to the next layer. This is similar to the response of a neuron in the visual cortex to a specific stimulus.<ref name="deeplearning">{{cite web |title=Convolutional Neural Networks (LeNet) – DeepLearning 0.1 documentation |url=http://deeplearning.net/tutorial/lenet.html |work=DeepLearning 0.1 |publisher=LISA Lab |access-date=31 August 2013 |archive-date=28 December 2017 |archive-url=https://web.archive.org/web/20171228091645/http://deeplearning.net/tutorial/lenet.html |url-status=dead }}</ref> Each convolutional neuron processes data only for its [[receptive field]].
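The sliding-kernel computation described above can be sketched in plain Python (a minimal illustration, not taken from any CNN library; the 4×4 image and the vertical-edge kernel are made-up examples):

```python
def relu(x):
    return max(0.0, x)

def conv2d(image, kernel):
    """Slide the kernel over the image; each output value is the
    Frobenius inner product (element-wise multiply and sum) of the
    kernel with the patch it covers, passed through ReLU."""
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kH + 1):
        row = []
        for j in range(W - kW + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kH) for b in range(kW))
            row.append(relu(s))
        out.append(row)
    return out  # the feature map

# A 3×3 vertical-edge kernel applied to a tiny 4×4 image
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
print(conv2d(image, kernel))  # 2×2 feature map: [[3, 3], [3, 3]]
```

Note how the 4×4 input shrinks to a 2×2 feature map, matching the [(width − kernel width) / stride] + 1 formula from the figure caption above.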
[[File:1D Convolutional Neural Network feed forward example.png|thumb|'''1D Convolutional Neural Network feed forward example''']]
Although [[multilayer perceptron|fully connected feedforward neural networks]] can be used to learn features and classify data, this architecture is generally impractical for larger inputs (e.g., high-resolution images), which would require massive numbers of neurons because each pixel is a relevant input feature. A fully connected layer for an image of size 100 × 100 has 10,000 weights for ''each'' neuron in the second layer. Convolution reduces the number of free parameters, allowing the network to be deeper.<ref name="auto1" /> For example, using a 5 × 5 tiling region, each with the same shared weights, requires only 25 neurons. Using regularized weights over fewer parameters avoids the vanishing gradients and exploding gradients problems seen during [[backpropagation]] in earlier neural networks.<ref name="auto3" /><ref name="auto2" />
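The parameter counts in this comparison can be checked with a line of arithmetic (the 100 × 100 image and the 5 × 5 tiling region are the figures from the paragraph above; the 25 shared values are the weights of a single 5 × 5 filter):

```python
# Fully connected: each second-layer neuron needs one weight per input pixel
pixels = 100 * 100
fc_weights_per_neuron = pixels
print(fc_weights_per_neuron)  # 10000

# Convolutional: one shared 5 × 5 filter, regardless of image size
conv_weights = 5 * 5
print(conv_weights)  # 25
```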
 
To speed processing, standard convolutional layers can be replaced by depthwise separable convolutional layers,<ref>{{Cite arXiv |last=Chollet |first=François |date=2017-04-04 |title=Xception: Deep Learning with Depthwise Separable Convolutions |class=cs.CV |eprint=1610.02357 }}</ref> which are based on a depthwise convolution followed by a pointwise convolution. The ''depthwise convolution'' is a spatial convolution applied independently over each channel of the input tensor, while the ''pointwise convolution'' is a standard convolution restricted to the use of <math>1\times1</math> kernels.
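The savings from the factorization can be illustrated by counting weights (a sketch; the 3 × 3 kernel and the 64 → 128 channel sizes are arbitrary examples, and biases are ignored):

```python
def standard_conv_params(k, c_in, c_out):
    # Each of the c_out filters spans all c_in input channels.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in          # one k×k spatial filter per input channel
    pointwise = 1 * 1 * c_in * c_out  # 1×1 kernels that mix the channels
    return depthwise + pointwise

# e.g. a 3×3 layer mapping 64 channels to 128
print(standard_conv_params(3, 64, 128))   # 73728
print(separable_conv_params(3, 64, 128))  # 576 + 8192 = 8768
```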
 
=== Pooling layers ===
Convolutional networks may include local and/or global pooling layers along with traditional convolutional layers. Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters; tiling sizes such as 2 × 2 are commonly used. Global pooling acts on all the neurons of the feature map.<ref name="flexible"/><ref>{{cite web |last=[[Alex Krizhevsky|Krizhevsky]] |first=Alex |title=ImageNet Classification with Deep Convolutional Neural Networks |url=https://image-net.org/static_files/files/supervision.pdf |access-date=17 November 2013 |archive-date=25 April 2021 |archive-url=https://web.archive.org/web/20210425025127/http://www.image-net.org/static_files/files/supervision.pdf |url-status=live }}</ref> There are two common types of pooling in popular use: max and average. ''Max pooling'' uses the maximum value of each local cluster of neurons in the feature map,<ref name=Yamaguchi111990>{{cite conference |title=A Neural Network for Speaker-Independent Isolated Word Recognition |last1=Yamaguchi |first1=Kouichi |last2=Sakamoto |first2=Kenji |last3=Akabane |first3=Toshio |last4=Fujimoto |first4=Yoshiji |date=November 1990 |location=Kobe, Japan |conference=First International Conference on Spoken Language Processing (ICSLP 90) |url=https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |access-date=2019-09-04 |archive-date=2021-03-07 |archive-url=https://web.archive.org/web/20210307233750/https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |url-status=dead }}</ref><ref name="mcdns">{{cite book |last1=Ciresan |first1=Dan |first2=Ueli |last2=Meier |first3=Jürgen |last3=Schmidhuber |title=2012 IEEE Conference on Computer Vision and Pattern Recognition |chapter=Multi-column deep neural networks for image classification |date=June 2012 |pages=3642–3649 |doi=10.1109/CVPR.2012.6248110 |arxiv=1202.2745 |isbn=978-1-4673-1226-4 |oclc=812295155 
|publisher=[[Institute of Electrical and Electronics Engineers]] (IEEE) |location=New York, NY |citeseerx=10.1.1.300.3283 |s2cid=2161592}}</ref> while ''average pooling'' takes the average value.
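Both variants of local 2 × 2 pooling can be sketched with one helper in plain Python (a minimal illustration; the 4×4 feature map values are made up):

```python
def pool2x2(fmap, op):
    """Collapse each non-overlapping 2×2 cluster of the feature map
    into a single value using the given reduction op."""
    return [[op([fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1]])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [0, 2, 1, 3],
        [4, 6, 5, 7]]

avg = lambda xs: sum(xs) / len(xs)
print(pool2x2(fmap, max))  # max pooling:     [[7, 8], [6, 7]]
print(pool2x2(fmap, avg))  # average pooling: [[4.0, 5.0], [3.0, 4.0]]
```

Either way, the 4×4 map is reduced to 2×2: each output neuron summarizes one cluster of four.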
 
=== Fully connected layers ===
 
Fully connected layers connect every neuron in one layer to every neuron in another layer, just as in a traditional [[multilayer perceptron]] neural network (MLP). The flattened feature map passes through one or more fully connected layers to classify the images.
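The flatten-then-classify step can be sketched in plain Python (a minimal illustration; the 2×2 feature map, the weight values, and the two output classes are made-up examples):

```python
def flatten(fmap):
    # Unroll the 2-D feature map into a flat vector.
    return [v for row in fmap for v in row]

def dense(x, weights, biases):
    """Fully connected layer: every output neuron is connected to
    every input value via its own row of weights."""
    return [sum(w * v for w, v in zip(ws, x)) + b
            for ws, b in zip(weights, biases)]

fmap = [[1.0, 2.0],
        [3.0, 4.0]]          # a tiny 2×2 feature map
x = flatten(fmap)            # [1.0, 2.0, 3.0, 4.0]
weights = [[0.1, 0.2, 0.3, 0.4],   # one weight row per output class
           [0.4, 0.3, 0.2, 0.1]]
biases = [0.0, 1.0]
print(dense(x, weights, biases))   # both class scores ≈ 3.0 here
```

Note the cost this section warns about: each output neuron needs as many weights as there are flattened inputs.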
 
=== Receptive field ===
In neural networks, each neuron receives input from some number of locations in the previous layer. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's ''receptive field''. Typically the area is a square (e.g. 5 by 5 neurons). In a fully connected layer, by contrast, the receptive field is the ''entire previous layer''. Thus, in each successive convolutional layer, each neuron takes input from a larger area of the original input than neurons in earlier layers do. This is due to applying the convolution over and over, which takes into account the value of a pixel as well as its surrounding pixels. When using dilated layers, the number of pixels in the receptive field remains constant, but the field is more sparsely populated as its dimensions grow when combining the effect of several layers.
 
To manipulate the receptive field size as desired, there are some alternatives to the standard convolutional layer. For example, atrous or dilated convolution<ref>{{Cite arXiv|last1=Yu |first1=Fisher |last2=Koltun |first2=Vladlen |date=2016-04-30 |title=Multi-Scale Context Aggregation by Dilated Convolutions |class=cs.CV |eprint=1511.07122 }}</ref><ref>{{Cite arXiv|last1=Chen |first1=Liang-Chieh |last2=Papandreou |first2=George |last3=Schroff |first3=Florian |last4=Adam |first4=Hartwig |date=2017-12-05 |title=Rethinking Atrous Convolution for Semantic Image Segmentation |class=cs.CV |eprint=1706.05587 }}</ref> expands the receptive field size without increasing the number of parameters by interleaving visible and blind regions. Moreover, a single dilated convolutional layer can comprise filters with multiple dilation ratios,<ref>{{Cite arXiv|last1=Duta |first1=Ionut Cosmin |last2=Georgescu |first2=Mariana Iuliana |last3=Ionescu |first3=Radu Tudor |date=2021-08-16 |title=Contextual Convolutional Neural Networks |class=cs.CV |eprint=2108.07387 }}</ref> thus having a variable receptive field size.
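The effect of stacking and dilation on receptive-field size can be computed directly (a sketch for stride-1 layers only; the layer configurations below are arbitrary examples):

```python
def receptive_field(layers):
    """Receptive-field width of stacked stride-1 convolutional layers,
    each given as (kernel_size, dilation). A dilated kernel has an
    effective extent of (k - 1) * d + 1, so each layer adds (k - 1) * d."""
    r = 1
    for k, d in layers:
        r += (k - 1) * d
    return r

# three ordinary 3×3 layers: the field grows linearly
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7

# three 3×3 layers with doubling dilation: same parameter count, wider field
print(receptive_field([(3, 1), (3, 2), (3, 4)]))  # 15
```

This shows the trade-off the paragraph describes: dilation widens the field without adding weights, at the cost of sampling it more sparsely.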
 
=== Weights ===
Each neuron in a neural network computes an output value by applying a specific function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning consists of iteratively adjusting these biases and weights.
 
The vectors of weights and biases are called ''filters'' and represent particular [[feature (machine learning)|feature]]s of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons can share the same filter. This reduces the [[memory footprint]] because a single bias and a single vector of weights are used across all receptive fields that share that filter, as opposed to each receptive field having its own bias and vector weighting.<ref name="LeCun">{{cite web |url=http://yann.lecun.com/exdb/lenet/ |title=LeNet-5, convolutional neural networks |last=LeCun |first=Yann |access-date=16 November 2013 |archive-date=24 February 2021 |archive-url=https://web.archive.org/web/20210224225707/http://yann.lecun.com/exdb/lenet/ |url-status=live }}</ref>
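The memory saving from filter sharing can be quantified with a small count (a sketch; the 28 × 28 input and 5 × 5 filter sizes are arbitrary examples, and each parameter set includes one bias):

```python
def unshared_params(out_positions, k):
    # Without sharing: each receptive field has its own k×k weights and bias.
    return out_positions * (k * k + 1)

def shared_params(k):
    # With sharing: one filter (k×k weights + one bias) reused everywhere.
    return k * k + 1

# a 5×5 filter slid over a 28×28 input (stride 1, no padding) → 24×24 outputs
positions = 24 * 24
print(unshared_params(positions, 5))  # 576 receptive fields × 26 = 14976
print(shared_params(5))               # 26
```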