#ALIH [[Wikipedia:Unduh basis data]]
Wikipedia makes the entire contents of its database available to interested users. This can be used for mirroring, personal use, informal backups, or database queries (for example for [[Wikipedia:pemeliharaan|maintenance]]). All text content is licensed under the [[Lisensi Dokumentasi Bebas GNU|GNU Free Documentation License]] (GFDL). Images and other files are available under other licenses, as described on their respective file description pages. For guidance on complying with those licenses, see [[Wikipedia:Hak cipta]].
== Where to get it ==
* For all Wikimedia Foundation projects: http://download.wikimedia.org/
* Indonesian Wikipedia: http://download.wikimedia.org/idwiki/
<!--==Where do I get...==
* Dumps from any Wikimedia Foundation project: http://download.wikimedia.org/
* English Wikipedia dumps in SQL and XML: http://download.wikimedia.org/enwiki/
** '''pages_articles.xml.bz2 - Current revisions only, no talk or user pages (this is the one you probably want)'''
** pages_current.xml.bz2 - Current revisions only, all pages
** pages_full.xml.bz2/7z - Current revisions, all pages (includes talk and user pages)
** pages-meta-history.xml.bz2 - All revisions, all pages
** abstract.xml.gz - page abstracts
** all_titles_in_ns0.gz - Article titles only
** SQL files for the pages and links are also available
** '''Caution:''' Some dumps may be incomplete - pay attention to such warnings (e.g. "Dump complete, 1 item failed") near the dump file.
* Wiki front-end software: [[Wikipedia:MediaWiki]].
* Database backend software: You want to download [[MySQL]].
* Image dumps: See below.
In the http://download.wikimedia.org/ directory you will find the latest dumps for all projects, not just English. For example (others exist; just select the appropriate two-letter language code and the appropriate project):
* English Wikipedia dumps: http://download.wikimedia.org/enwiki/
* French Wikipedia dumps: http://download.wikimedia.org/frwiki/
* German Wikipedia dumps: http://download.wikimedia.org/dewiki/
Some other directories (e.g. simple, nostalgia) exist, with the same structure.
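For scripted use, those per-language directory URLs can be assembled from the two-letter language code. Below is a minimal Perl sketch, using LWP::Simple, that fetches a dump directory's index page and lists the dump files it mentions; the language code and the crude HTML matching are only illustrative:
 #!/usr/bin/perl
 # Sketch: list the dump files currently offered for one language edition.
 use strict;
 use warnings;
 use LWP::Simple qw(get);
 
 my $lang = shift @ARGV || 'en';                    # two-letter language code: en, fr, de, id, ...
 my $url  = "http://download.wikimedia.org/${lang}wiki/";
 
 my $index = get($url);                             # fetch the directory index page
 die "could not fetch $url\n" unless defined $index;
 
 # Crude scrape: print anything in the index that looks like a dump file name.
 my %seen;
 print "$_\n" for grep { !$seen{$_}++ } ($index =~ /([\w.-]+\.(?:xml|sql)\.(?:gz|bz2|7z))/g);
Saved as, say, list-dumps.pl (the name is arbitrary), it would be run as "perl list-dumps.pl fr".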
==Images and uploaded files==
Unlike the article text, many images are not released under the GFDL or into the public domain. These images are owned by external parties who may not have consented to their use in Wikipedia. Wikipedia uses such images under the doctrine of [[fair use]] under United States law. Use of such images outside the context of '''Wikipedia''' or similar works may be illegal. Also, many images legally require a credit or other attached copyright information, and this copyright information is contained within the text dumps available from [http://download.wikimedia.org/ download.wikimedia.org]. Some images may be restricted to non-commercial use, or may even be licensed exclusively to Wikipedia. Hence, download these images at your own risk. [http://download.wikimedia.org/legal.html Legal]
As of November 2006, the image dump for the English Wikipedia was about 76GB.
If you still want to download images, you can get them at:
* All projects: http://download.wikimedia.org/images/
* English Wikipedia: http://download.wikimedia.org/images/wikipedia/en/
==Dealing with large files==
You may run into problems downloading files of unusual size. Some older operating systems, file systems, and web clients have a hard 2GB limit on file size. If you seem to be hitting this limit, try using [[wget]] version 1.10 or greater, [[cURL]] version 7.11.1-1 or greater, or a recent version of [[lynx (web browser)|lynx]] (using -dump). Users have experienced problems with [[Mozilla Application Suite|Mozilla]] and [[Mozilla Firefox|Firefox]], although recent versions are more likely to have these problems fixed.
It is recommended that you check the [[MD5]] sums (provided in a file in the download directory) to make sure your download was complete and accurate. You can check this by running the "md5sum" command on the files you downloaded. Given how large the files are, this may take some time to calculate. Due to the technical details of how files are stored, ''file sizes'' may be reported differently on different filesystems, and so are not necessarily reliable. Also, you may have experienced corruption during the download, though this is unlikely.
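The same check can be scripted with Perl's core Digest::MD5 module. This is only a sketch: it assumes the published checksum list has been saved locally as "md5sums.txt", whereas the real file name varies from dump to dump:
 #!/usr/bin/perl
 # Sketch: verify one downloaded dump file against a saved checksum list.
 use strict;
 use warnings;
 use Digest::MD5;
 
 my $file = shift @ARGV or die "usage: $0 <dump file>\n";
 
 open my $fh, '<', $file or die "cannot open $file: $!\n";
 binmode $fh;
 my $got = Digest::MD5->new->addfile($fh)->hexdigest;   # slow on multi-gigabyte files
 close $fh;
 
 # The list has one "<md5>  <file name>" pair per line, as written by md5sum.
 open my $list, '<', 'md5sums.txt' or die "cannot open md5sums.txt: $!\n";
 while (<$list>) {
     my ($want, $name) = split ' ', $_, 2;
     next unless defined $name && $name =~ /\Q$file\E\s*$/;
     print $got eq $want ? "$file: OK\n" : "$file: MISMATCH (expected $want, got $got)\n";
     exit;
 }
 print "$file is not listed in md5sums.txt\n";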
The [[Comparison_of_file_systems#Limits|file size limits]] for the various file systems are as follows:
* [[File Allocation Table|FAT16]] (MS-DOS version 6, Windows 3.1, and earlier) supports files up to 2GB.
* [[File Allocation Table|FAT32/VFAT]] (Windows 95, 98, 98SE, and ME) supports files up to 4GB.
* An old FAT16 filesystem under Windows ME or Windows NT can support up to 4GB.
* [[NTFS]] (Windows NT 3.51+, 2000, XP, and Server 2003) supports up to 16 [[exabyte]]s.
If you are running a Linux kernel version 2.4 or greater, ext2 and ext3 filesystems can handle 16GB files and larger, depending on your block size. See http://www.suse.de/~aj/linux_lfs.html for more information.
==Why not just retrieve data from wiki-indonesia.club at runtime?==
Suppose you are building a piece of software that at certain points displays information that came from Wikipedia. If you want your program to display the information in a different way than the live version does, you'll probably need the wikicode that is used to enter it, instead of the finished HTML.
Also, if you want to get all of the data, you'll probably want to transfer it in the most efficient way possible. The wiki-indonesia.club servers need to do quite a bit of work to convert the wikicode into HTML. That's time-consuming both for you and for the wiki-indonesia.club servers, so simply spidering all pages is not the way to go.
To access any article in XML, one at a time, access:
http://en.wiki-indonesia.club/wiki/Special:Export/Title_of_the_article
Read more about this at [[Special:Export]].
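As a small illustration, the following Perl sketch fetches one article through Special:Export with LWP::Simple and prints the XML. The title "Train" is only a placeholder, and spaces in titles must be turned into underscores; remember to request pages one at a time.
 #!/usr/bin/perl
 # Sketch: fetch a single article, as export XML, via Special:Export.
 use strict;
 use warnings;
 use LWP::Simple qw(get);
 
 my $title = shift @ARGV || 'Train';    # placeholder title
 $title =~ s/ /_/g;                     # Special:Export expects underscores, not spaces
 
 my $xml = get("http://en.wiki-indonesia.club/wiki/Special:Export/$title");
 die "export of $title failed\n" unless defined $xml;
 
 print $xml;                            # current revision of the page, wrapped in export XML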
To access any article via an RSS Feed, one at a time, access:
http://www.blinkbits.com/en_wikifeeds_rss/Title_of_the_article
Read more about this at [[User:Blinklmc]].
Please be aware that live mirrors of Wikipedia that are dynamically loaded from the Wikimedia servers are prohibited. Please see [[Wikipedia:Mirrors and forks]].
==Please do not use a web crawler==
Please do not use a [[web crawler]] to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia. Our [http://en.wiki-indonesia.club/robots.txt robots.txt] restricts bots to one page per second and blocks many ill-behaved bots.
===Sample blocked crawler email===
:IP address ''nnn.nnn.nnn.nnn'' was retrieving up to 50 pages per second from wiki-indonesia.club addresses. Robots.txt has a rate limit of one per second set using the Crawl-delay setting. Please respect that setting. If you must exceed it a little, do so only during the least busy times shown in our site load graphs at '''http://stats.wikimedia.org/EN/ChartsWikipediaZZ.htm'''. It's worth noting that to crawl the whole site at one hit per second will take several weeks. The originating IP is now blocked or will be shortly. Please contact us if you want it unblocked. Please don't try to circumvent it - we'll just block your whole IP range.
:If you want information on how to get our content more efficiently, we offer a variety of methods, including weekly database dumps which you can load into MySQL and crawl locally at any rate you find convenient. Tools are also available which will do that for you as often as you like once you have the infrastructure in place. More details are available at http://en.wiki-indonesia.club/wiki/Wikipedia:Database_download.
:Instead of an email reply you may prefer to visit #mediawiki at irc.freenode.net to discuss your options with our team.
=== Doing SQL queries on the current database dump ===
'''''This section may be out of date: the wikisign page was not active as of 11/09/2006.'''''
You can do SQL queries on the current database dump (as a replacement for the disabled [[Special:Asksql]] page) at [http://www.wikisign.org wikisign.org]. For more information about this service, see [[:de:Benutzer:Filzstift/wikisign.org]] (in German only).
==Dealing with compressed files==
Approximate file sizes are given for the compressed dumps; uncompressed they'll be significantly larger.
Some older archives are compressed with gzip, which is compatible with PKZIP (the most common Windows format). Newer archives are available in both [[bzip2]] and [[7zip]] compressed formats.
Windows users may not have a bzip2 decompressor on hand; a [ftp://sources.redhat.com/pub/bzip2/v102/bzip2-102-x86-win32.exe command-line Windows version of bzip2] (from [http://sources.redhat.com/bzip2/ here]) is available for free under a BSD license.
The [[GNU Lesser General Public License|LGPL]]'d GUI file archiver, [[7-zip]] [http://www.7-zip.org/], is also able to open bz2 compressed files, and is available for free.
Mac OS X ships with the command-line bzip2 tool.
Please note that older versions of bzip2 may not be able to handle files larger than 2GB, so make sure you have the latest version if you experience any problems.
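A dump does not have to be uncompressed to disk before it is used; it can be read as a stream instead. The sketch below pipes a .bz2 dump through the bzip2 command from Perl and prints page titles as they stream past. The file name is only an example, and the pattern matching is deliberately crude (a real parser is described in the sections below).
 #!/usr/bin/perl
 # Sketch: stream a compressed XML dump through bzip2 without unpacking it to disk.
 use strict;
 use warnings;
 
 my $dump = shift @ARGV || 'pages_articles.xml.bz2';   # example file name
 
 open my $fh, '-|', 'bzip2', '-dc', $dump
     or die "cannot run bzip2 on $dump: $!\n";
 
 my $pages = 0;
 while (<$fh>) {
     $pages++ if m{<page>};                            # rough count of <page> elements
     print "$1\n" if m{<title>([^<]+)</title>};        # print each title as it goes by
 }
 close $fh;
 print "total pages: $pages\n";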
==Database schema==
===SQL schema===
''See also: [[mw:Manual:Database layout]]''
The database schema is explained [http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?view=markup here]. The ''cur'' tables contain the current revisions of all pages; the ''old'' tables contain the prior edit history.
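As a small worked example of that schema, once a dump has been imported into a local MySQL server (see the import section below), a query against the ''cur'' table can be run from Perl with DBI. The database name, user name and password here are placeholders for whatever your own installation uses:
 #!/usr/bin/perl
 # Sketch: query the "cur" table of a locally imported dump via DBI.
 use strict;
 use warnings;
 use DBI;
 
 # Placeholder connection details; substitute your own database, user and password.
 my $dbh = DBI->connect('DBI:mysql:database=wikidb;host=localhost',
                        'wikiuser', 'secret', { RaiseError => 1 });
 
 # Ten current article titles from the main namespace (cur_namespace = 0).
 my $sth = $dbh->prepare('SELECT cur_title FROM cur WHERE cur_namespace = 0 LIMIT 10');
 $sth->execute();
 while (my ($title) = $sth->fetchrow_array) {
     print "$title\n";
 }
 $dbh->disconnect;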
===XML schema===
The XML schema for each dump is defined at the top of the file.
==Help parsing dumps for use in scripts==
* [[Wikipedia:Computer help desk/ParseMediaWikiDump]] describes the [[Perl]] Parse::MediaWikiDump library, which can parse XML dumps.
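A minimal usage sketch for that library is shown below. The method names follow the module's CPAN documentation; check the documentation of the version you actually install. The dump file name is only an example.
 #!/usr/bin/perl
 # Sketch: walk an XML dump with Parse::MediaWikiDump and print each title and size.
 use strict;
 use warnings;
 use Parse::MediaWikiDump;
 
 my $file  = shift @ARGV || 'pages_articles.xml';       # an uncompressed XML dump
 my $pages = Parse::MediaWikiDump::Pages->new($file);
 
 while (defined(my $page = $pages->next)) {
     my $text = $page->text;                            # reference to the revision's wikitext
     printf "%s (%d bytes)\n", $page->title, length $$text;
 }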
==Help importing dumps into MySQL==
See:
* [[m:Importing a Wikipedia database dump into MediaWiki]] (SQL)
* [[Wikipedia:Database dump import problems]].
=== Importing sections of a dump ===
'''''This section is out of date.'''''
The following [[Perl]] script is a parser for extracting the Help sections from the SQL dump:
# This fragment is meant to be run with "perl -n" (see the command further down),
# so each line of the SQL dump is read into $_ in turn.
s/^INSERT INTO cur VALUES //gi;        # strip the INSERT prefix, leaving only the value tuples
s/\n// if (($j++ % 2) == 0);           # drop the newline from every second input line
s/(\'\d+\',\'\d+\'\)),(\(\d+,\d+,)/$1\;\n$2/gs;   # end each value tuple with ';' and start the next on a new line
foreach (split /\n/) {
next unless (/^\(\d+,12,\'/);          # keep only records from namespace 12 (Help)
# rebuild a full INSERT statement around each surviving record
s/^\(\d+,\d+,/INSERT INTO cur \(cur_namespace,cur_title,cur_text,cur_comment,cur_user,
cur_user_text,cur_timestamp,cur_restrictions,cur_counter,cur_is_redirect,cur_minor_edit,
cur_is_new,cur_random,cur_touched,inverse_timestamp\) VALUES \(12,/;
s/\n\s+//g;                            # collapse the line breaks inside the rebuilt statement
s/$/\n/;                               # make sure each statement ends with a newline
print;
}
NOTE: (as at 2005-05-16) the order of the fields in the cur table has changed: inverse_timestamp now comes BEFORE cur_touched. This can cause Windows users no end of grief, because MediaWiki suddenly starts producing PHP errors about negative dates, or dates before 1 January 1970, being passed to the gmdate and gmmktime functions in GlobalFunctions.php. The reason is that the two fields are swapped around, so both end up holding rubbish data. The Unix versions of these functions may be more forgiving, may not cause PHP to emit a warning into the HTML output, or people may simply have php.ini configured not to display these warnings.
In other words, check that the field order in the script matches the field order in the dump. Better still, the script could be changed to retain whatever field order the dump uses.
You can run the script and get a resulting help.sql file with this command:
bzip2 -dc <Date>_cur_table.sql.bz2 | perl -n <Script Name> > help.sql
The script can be easily modified to acquire any section you need with a few minor changes. Currently, it is set to get all records from [[Wikipedia:Namespace|namespace]] 12, the Help namespace. You can change the two 12's to grab a different namespace, or slightly change a couple of [[regular expressions]] to get, say, all articles that begin with Q:
next unless (/^\(\d+,\d+,\'[qQ]/);
s/^\(\d+,/INSERT INTO cur \(cur_namespace,cur_title,cur_text,cur_comment,cur_user,
cur_user_text,cur_timestamp,cur_restrictions,cur_counter,cur_is_redirect,cur_minor_edit,
cur_is_new,cur_random,cur_touched,inverse_timestamp\) VALUES \(/;
Or you can use a more generic version of this script from [[User:Msm/extract.pl]].
NOTE: While this sounds really straightforward as a way to grab the Help namespace (#12) for use on your newly implemented MediaWiki site, you need more than just that. You also need the Template namespace (#10), since many of the Help: pages rely on templates in some form or another. Of course, you then end up with hundreds of templates that are NOT used by the Help: pages as well. Has anyone got a better idea for a script to do this? [[User:Armistej|Armistej]]
== Static HTML tree dumps for mirroring or CD distribution ==
MediaWiki 1.5 includes routines to dump a wiki to HTML, rendering the HTML with the same parser used on a live wiki. As the following page states, putting one of these dumps on the web unmodified will constitute a trademark violation. They are intended for private viewing in an intranet or desktop installation.
The static version of Wikipedia created by Wikimedia:
*http://static.wiki-indonesia.club/
[http://www.hut.fi/~tkarvine/tero-dump/ Terodump] is an alpha-quality Wikipedia-to-static-HTML dumper, made from the Wikipedia code. A static HTML dump (beta quality) is available at [ftp://ftp.funet.fi/pub/doc/wikipedia/wikipedia-terodump-0.1.tar.bz wikipedia-terodump-0.1.tar.bz]. This dump was made from a 2003 database. - [[User:Tero]]
[http://www.tommasoconforti.com/ Wiki2static] (site down as of October 2005) is an experimental program set up by [[User:Alfio]] to generate HTML dumps, inclusive of images, a search function and an alphabetical index. At the linked site, experimental dumps and the script itself can be downloaded. As an example, it was used to generate these copies: [http://fixedreference.org/en/20040424/wikipedia/Main_Page English WikiPedia 24 April 04] and [http://fixedreference.org/simple/20040501/wikipedia/Main_Page Simple WikiPedia 1 May 04] (old database format), and [http://july.fixedreference.org/en/20040724/wikipedia/Main_Page English WikiPedia 24 July 04], [http://july.fixedreference.org/simple/20040724/wikipedia/Main_Page Simple WikiPedia 24 July 04] and [http://july.fixedreference.org/fr/20040727/wikipedia/Accueil WikiPedia Francais 27 Juillet 2004] (new format). [[User:BozMo|BozMo]] uses a version to generate periodic static copies at [http://fixedreference.org/ fixed reference].
If you want to draft a traditional website in MediaWiki and dump it to HTML format, you might want to try [http://barnesc.blogspot.com/2005/10/mw2html-export-mediawiki-to-static.html mw2html] by [[User:Connelly]].
If you'd like to help develop dump-to-static HTML tools,
please drop us a note on [[Wikipedia:Mailing lists|the developers' mailing list]].
'''See also:'''
* [[Meta:Alternative parsers]] which lists some other options for getting static HTML dumps
* [[Wikipedia:Snapshots]]
* [[Wikipedia:TomeRaider database]]
==Dynamic HTML generation from a local XML database file==
Instead of converting a database file to many pieces of static HTML, one can also use a dynamic HTML generator. Browsing a page then works just like browsing a live wiki site, but the content is fetched and converted from a local dump file when the browser requests it.
[http://wikifilter.sourceforge.net/ WikiFilter] is such a program, and it allows you to browse over 100 dump files without visiting a Wiki site.
== Rsync ==
'''''This section is out of date.'''''
You can use [[rsync]] to download the database. For example, this command will download the current English database:
rsync rsync://download.wikimedia.org/dumps/wikipedia/en/cur_table.sql.bz2 . --partial --progress
The "--partial" switch prevents rsync from deleting the file in the event the download is interrupted. You may then issue the very same command again to resume the download. The "--progress" switch will show the download progress; for less verbose output, do not use this switch.
The rsync utility is designed to synchronize files in a manner such that only the differences between the files are transferred. This provides a considerable performance enhancement, especially when synchronizing large files that have relatively few changes. However, if a file is compressed or encrypted, rsync will not perform well; in fact, it may perform worse than downloading a fresh copy of the file. Many of the database files are only available compressed. Therefore, there is little, if anything, to be gained by attempting to use rsync as a means of expediting an update of an older SQL dump. If the SQL dumps were available uncompressed, this process should work extremely well, especially if rsync is invoked with the on-the-fly compression switch (-z). It is uncertain as to whether uncompressed database dumps will become available. However, rsync does remain a useful and expedient tool for resuming downloads that have been interrupted, repairing downloads that have become corrupted, or updating any files that are not compressed (i.e. upload.tar). For more information, see [[rsync]].
=== Technical notes ===
* There is some discussion about a modified [[gzip]] that can improve rsync performance. This patch to gzip resets the output stream at fixed intervals. This results in fixed-size blocks of compressed data, which are friendlier to rsync. [[Bzip2]] is designed from the start to create blocks of compressed data, and works well with rsync. Since bzip2’s compression ratios are almost always better than gzip’s, there is no reason to switch.
* Technically speaking, upload.tar is compressed, in the sense that it mostly contains compressed files such as images (which is why it should not be compressed otherwise). However, usually the files themselves do not change. The addition, removal, or reordering of static files in an uncompressed [[Tar (file format)|tarball]] should still yield excellent rsync performance, regardless of the content of those files.
==Outdated content==
You may be interested in older en dumps in XML at http://download.wikimedia.org/enwiki/, though using the newest dumps is strongly recommended. These dumps contain:
* pages_current.xml.bz2 - Current revisions only, all pages
* pages_full.xml.bz2 - All revisions, all pages
* '''pages_articles.xml.bz2 - Current revisions only, no talk or user pages (this is the one you probably want)'''
* all_titles_in_ns0.gz - Article titles only-->
== See also ==
* [[m:Help:Downloading pages]]
* [[m:Export]]
{{DEFAULTSORT:{{PAGENAME}}}}
{{Wikipedia-stub}}
[[Kategori:Bantuan]]
[[bg:Уикипедия:Сваляне на Уикипедия]]
[[de:Wikipedia:Download]]
[[en:Wikipedia:Database download]]
[[es:Wikipedia:Descargas]]
[[fi:Wikipedia:Tietokannan lataus]]
[[fr:Wikipédia:Télécharger la base de données]]
[[hy:Վիքիփեդիա:Download]]
[[it:Aiuto:Analisi del database]]
[[ja:Wikipedia:データベースダウンロード]]
[[ko:위키백과:데이터베이스 다운로드]]
[[lt:Vikipedija:Atsisiuntimas]]
[[pl:Wikipedia:Baza danych]]
[[pt:Ajuda:Guia de consulta e reprodução/download]]
[[ru:Википедия:Как сделать копию Википедии]]
[[sk:Wikipédia:Krčma/Technické/Archív]]
[[sv:Wikipedia:Databasnerladdning]]
[[ta:விக்கிப்பீடியா:பதிவிறக்கம்]]
[[uk:Вікіпедія:Викачати базу даних]]
[[zh:Wikipedia:数据库下载]]