Wikipédia:Central de pesquisas/Portal de dados/Tradução

Page for coordinating the translation of Research:Data into Portuguese. Help is welcome. After some discussions on the Esplanada, the reference material does not seem to cover some of the community's needs, namely reference material on development tools, bot creation, the API, and devices.

Platform summary

Data Dumps (details)

Home page | Download

Dumps of all WMF projects for backup, offline use, research, etc.

Wiki content, revisions, metadata, and internal and external links

XML and SQL formats

Once or twice a month

Large files

API (details)

Home page

The API provides access to the contents of the MediaWiki database through HTTP requests to the web service.

Meta information about the wiki and the logged-in user, page properties (such as revisions and content), and lists of pages that match given filters.

JSON, WDDX, XML, YAML, and PHP's native serialization format

Toolserver (details)

Home page

Toolserver is a collaborative platform for tools used and created by people in the Wikimedia movement.

Works as a standard web server for applications

Command-line interface

An account is required

IRC Feeds (details)

Home page

Recent changes updates, delivered via IRC.

Changes are shown as they happen.

Updates for each wiki are in separate channels.

Filtered feeds available with cloak

Pageview statistics (details)

Home page | Download

Raw data from the access logs (non-unique hits) for pages of Wikimedia projects, such as Wikipedia and Wikibooks, in their various languages. Data is extracted from the squid servers.

Project, page title, number of requests, size of the content

Delimited and JSON

Updated hourly

WikiStats (details)

Home page | Download

Reports on activity in Wikimedia projects in roughly 25 languages, based on the dump files.

Unique visitors, page views, active editors, and more

Intermediate files available in CSV

Updated monthly

Charts

DBpedia (details)

Home page

DBpedia extracts structured data from Wikipedia, lets users query that data, and creates links to other datasets.

RDF, N-Triples, SPARQL endpoint, Linked Data

Billions of pieces of structured information in a consistent ontology

DataHub (details)

Home page

Collection of various Wikimedia datasets.

Small, generally derived from studies

dbpedia lite, DBpedia-Live, and others

EPIC/Oxford quality assessment

Data Dumps

Home page

Dumps

Description

WMF publishes copies of the databases of Wikipedia and all other projects. The English Wikipedia dumps are updated once a month because of their size; smaller projects are updated more frequently.^[1]

Content

Text and metadata of every revision of every page, as XML files

Most database tables, as SQL files

- Page-to-page link lists (pagelinks, categorylinks, imagelinks tables)

- Lists of pages with links outside the project (externallinks, iwlinks, langlinks tables)

- Media metadata (image, oldimage tables)

- Information about each page (page, page_props, page_restrictions tables)

- Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)

- List of all pages that are redirects and their targets (redirect table)

- Log data, including blocks, protections, deletions, and uploads (logging table)

- Odds and ends (interwiki, site_stats, user_groups tables)

Experimental add/change dumps (no moves or deletes, plus some other limitations): https://wikitech.wikimedia.org/wiki/Dumps/Adds-changes_dumps

http://dumps.wikimedia.org/other/incr/

Stub-prefixed dumps for some projects which only have header info for pages and revisions without actual content

Media bundles for each project, separated into files uploaded to the project and files from Commons

Images: http://meta.wikimedia.org/wiki/Database_dump#Downloading_Images

Static HTML dumps for 2007-2008

http://dumps.wikimedia.org/other/static_html_dumps/

(see more)

Download

You can download the latest dumps (for the last year) here (http://dumps.wikimedia.org/enwiki/ for English Wikipedia, http://dumps.wikimedia.org/dewiki/ for German Wikipedia, etc.).

Archives: http://dumps.wikimedia.org/archive/

Mirrors offer alternative download locations.

For large files, using a download tool is recommended.

Data format

XML dumps since 2010 are in the wrapper format described at Export format (schema). Files are compressed in bzip2 (.bz2) and .7z formats.

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various other formats.

https://meta.wikimedia.org/wiki/Data_dumps/Dump_format
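Because the XML dumps are large and bzip2-compressed, it usually helps to stream them rather than load them whole. The following is a minimal sketch, assuming Python's standard library only; the `SAMPLE` document is hand-written for illustration (including its namespace version string), and the helper names `iter_pages` and `iter_dump` are hypothetical, not part of any official tool.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text of the last revision listed) for each <page> element."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the memory held by the finished page


def iter_dump(path):
    """Open a .bz2 dump file and stream its pages without unpacking it first."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)


# Hand-written sample standing in for a real dump (assumed shape, for illustration).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Example</title>
    <revision><id>1</id><text>Hello, wiki.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

For a real file, `iter_dump("some-dump.xml.bz2")` (a placeholder path) would be used the same way; full-history dumps list many revisions per page, so this sketch keeps only the last one it sees.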

How to use

See examples of importing dumps in a MySQL database with step-by-step instructions here.

Tools

Available tools are listed in the following locations, but information is not always up-to-date:

Import tools

Other tools

Visualization tools and Data processing tools on Referata

Access

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 4.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

Maintainers

Maintainer: Ariel Glenn

Mailing list: xmldatadumps-l

Research projects using data from this source

"A Breakdown of Quality Flaws in Wikipedia" examines cleanup tags on the English Wikipedia using a January 2011 dump

"There is No Deadline – Time Evolution of Wikipedia Discussions" looks at the time evolution of Wikipedia discussions, and how it correlates to editing activity, based on 9.4 million comments from the March 12, 2010 dump

"Understanding collaboration in Wikipedia" mines a complete dump of the English Wikipedia (225 million article edits) for insights into open collaboration

"Dynamics of Conflicts in Wikipedia" takes the revision history from the dump to extract the reverts based on the text comparison to study the dynamics of editorial wars in multiple language editions

API

Home page

http://www.mediawiki.org/wiki/API

Description

The web service API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests.

Content

Meta information about the wiki and the logged-in user

Properties of pages, including page revisions and content, external links, categories, templates, etc.

Lists of pages that match certain criteria

Endpoint

To query the database, send an HTTP GET request to the desired endpoint (for example, http://en.wikipedia.org/w/api.php for English Wikipedia), setting the action parameter to "query" and defining the query details in the URL.
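As a rough illustration of how such a request URL is assembled, here is a minimal sketch using only Python's standard library. The helper name `build_query_url` is hypothetical; the parameter values are the ones used elsewhere on this page.

```python
from urllib.parse import quote, urlencode

ENDPOINT = "http://en.wikipedia.org/w/api.php"  # English Wikipedia endpoint from the text


def build_query_url(endpoint, **params):
    """Return a GET URL with action=query followed by the given query details."""
    query = {"action": "query"}
    query.update(params)
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return endpoint + "?" + urlencode(query, quote_via=quote)


url = build_query_url(
    ENDPOINT,
    prop="revisions",
    rvprop="content",
    format="xml",
    titles="Main Page",
)
print(url)
```

The resulting string can be pasted into a browser or passed to any HTTP client.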

Data format

The API supports the following formats:

JSON (and JSON with debugging elements (HTML))

WDDX

YAML

PHP's native serialization (also the PHP print_r(), PHP var_export(), and PHP var_dump() formats)

The desired output format can be specified in the query string of the URL. The default format is XML.

Find more details here.

How to use

Here's a simple example: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=Main%20Page

This means: fetch (action=query) the content (rvprop=content) of the most recent revision of Main Page (titles=Main%20Page) of English Wikipedia (http://en.wikipedia.org/w/api.php?) in XML format (format=xml). You can paste the URL into a browser to see the output.
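Once the XML output has been fetched, the revision text can be pulled out with a standard XML parser. The sketch below runs offline on a hand-written `SAMPLE_RESPONSE`; its element layout (`api/query/pages/page/revisions/rev`) is an assumption about the response shape, so check it against real output before relying on it. The function name `extract_revisions` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an API response (assumed shape, for illustration).
SAMPLE_RESPONSE = """<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="Main Page">
        <revisions>
          <rev xml:space="preserve">Welcome to the wiki.</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>"""


def extract_revisions(xml_text):
    """Map each page title to the text of its first listed revision."""
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter("page"):
        rev = page.find("./revisions/rev")
        result[page.get("title")] = rev.text if rev is not None else None
    return result


contents = extract_revisions(SAMPLE_RESPONSE)
print(contents)
```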

Further (and more complex) examples can be found here.

See also:

Tutorial

Documentation

Useful links

Existing tools

To try out the API interactively, use the API Sandbox.

Access

To use the API, your application or client might need to log in.

Before you start, learn about the API etiquette.

Researchers may be given special access rights on a case-by-case basis.

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 4.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

Maintainers

FAQ: http://www.mediawiki.org/wiki/API:FAQ

Mailing list: mediawiki-api

Toolserver

NOTE: Toolserver is about to be migrated to Tool Labs.

Home page

http://toolserver.org/

Description

The Toolserver hosts various tools for working with data related to Wikimedia projects. A copy of the original databases is updated frequently and made available for studies and services.

Content

The Toolserver keeps copies of the databases of all Wikimedia projects in all their languages. Wikimedia Commons, which hosts media (images, audio, and video), is also included. All of this content is available free of charge to anyone interested, provided the use does not violate the project's privacy policies.

Data format

Learn more about the current database schema.

200908261500-Daniel Kinzler-The Toolserver The hackers way of surfing the wiki

How to use

Using the Toolserver requires familiarity with the command-line interface, Unix/Linux, databases, and programming.

A less flexible but easier alternative is the query service, which lets you request data as long as the request is clearly described.

To get started with the Toolserver, more information can be found in Getting started, some code samples, and the database access manuals.

Tools

The most complete list of tools that use the Toolserver is being compiled at mw:Toolserver/List of Tools.

Access

To use the Toolserver, you must request an account.

Before you start, read the rules.

Maintenance

Wiki: https://wiki.toolserver.org

IRC channel: Predefinição:Irc

Maintainer: Sebastian Sooth, of Wikimedia Deutschland

Mailing list: toolserver-l

Projects that use or have used the Toolserver

"Circadian patterns of Wikipedia editorial activity: A demographic analysis" analyzed "34 Wikipedias in different languages [trying] to characterize and find the universalities and differences in temporal activity patterns of editors", with the underlying data provided by the German Wikimedia chapter from the toolserver.

"Feeling the Pulse of a Wiki: Visualization of Recent Changes in Wikipedia" describes a tool hosted on Toolserver providing recent changes visualization to aid admins

IRC Feeds

Home page

http://meta.wikimedia.org/wiki/IRC_channels#Raw_feeds

Description

These are live Recent changes feeds hosted on the irc.wikimedia.org server which show edits on Wikimedia wikis automatically as they happen. Confirmation that an edit has been processed is typically faster through IRC than through the browser.

You can also get custom filtered feeds.

Data and format

Each wiki edit is reflected in the wiki's IRC channel. Displayed URLs give the cumulative differences produced by the edit concerned and any subsequent edits. The time is not listed, but timestamping may be provided by your IRC client.

The format of each edit summary is:

[page_title] [URL_of_the_revision] * [user] * [size_of_the_edit] [edit_summary]

You can see some examples below:

<rc-pmtpa> Talk:Duke of York's Picture House, Brighton http://en.wikipedia.org/w/index.php?diff=542604907&oldid=498947324 *Fortdj33* (-14) Updated classification

<rc-pmtpa> Bloody Sunday (1887) http://en.wikipedia.org/w/index.php?diff=542604908&oldid=542604828 *03184.61.149.187* (-2371) /* Aftermath */
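Lines in this format can be split into fields with a regular expression. The following is a minimal sketch, assuming any IRC color codes have already been stripped and that the line looks like the examples above; the pattern and the name `parse_feed_line` are illustrative, not an official parser.

```python
import re

# Pattern for one feed line, with IRC color codes already removed (an assumption).
LINE_RE = re.compile(
    r"^(?:<[^>]+>\s+)?"                # optional "<nick>" prefix added by the IRC client
    r"(?P<title>.+?)\s+"               # page title (may contain spaces)
    r"(?P<url>https?://\S+)\s+"        # URL of the revision or diff
    r"\*\s*(?P<user>[^*]+?)\s*\*\s+"   # user name between asterisks
    r"\((?P<size>[+-]?\d+)\)\s*"       # size of the edit, in parentheses
    r"(?P<summary>.*)$"                # edit summary (may be empty)
)


def parse_feed_line(line):
    """Return the fields of a feed line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    edit = match.groupdict()
    edit["size"] = int(edit["size"])
    return edit


EXAMPLE = (
    "<rc-pmtpa> Talk:Duke of York's Picture House, Brighton "
    "http://en.wikipedia.org/w/index.php?diff=542604907&oldid=498947324 "
    "*Fortdj33* (-14) Updated classification"
)

edit = parse_feed_line(EXAMPLE)
print(edit["title"], edit["user"], edit["size"])
```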

Location

IRC feeds are hosted on the irc.wikimedia.org server.

Every one of the >730 Wikimedia wikis has an IRC RC feed. The channel name is #lang.project. For example, the channel for German Wikibooks is #de.wikibooks.
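The naming rule is simple enough to capture in two small helpers. This is only a sketch of the #lang.project convention described above; the function names are hypothetical.

```python
def feed_channel(lang, project):
    """Build the feed channel name for a language code and project name."""
    return "#{}.{}".format(lang, project)


def split_channel(channel):
    """Split '#lang.project' back into (lang, project)."""
    lang, project = channel.lstrip("#").split(".", 1)
    return lang, project


print(feed_channel("de", "wikibooks"))
```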

Existing tools

wm-bot lets you get IRC feeds filtered according to your needs. You can define a list of pages and get notifications of revisions on those pages only.

WikiStream uses IRC feeds to illustrate the amount of activity happening on Wikimedia projects.

Access

Anyone can access IRC feeds. Filtered feeds, however, require wm-bot.

Pageview statistics

Home page

http://dumps.wikimedia.org/other/pagecounts-raw/

Description

Raw hourly pageview dumps based on squid server logs run since 2007.

Content

Each request of a page reaches one of Wikimedia's squid caching hosts. The project name, the size of the page requested, and the title of the page requested are logged and aggregated hourly. English statistics are available since 2007 and non-English since 2008.

Files starting with "projectcount" contain total hits per project per hour statistics.

Note: These are not unique hits and changed titles/moves are counted separately.

Download

http://dumps.wikimedia.org/other/pagecounts-raw/

Data format

Delimited format : [Project] [Article_name] [Number_of_requests] [Size of the content returned]

where Project is in the form language.project using abbreviations described here.

Examples:

    fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624

means that the French Wikibooks page with title "Special:Recherche/Achille_Baraguey_d%5C%27Hilliers" was viewed 1 time in the last hour and the size of the content returned was 624.

    en Main_Page 242332 4737756101

means that the main page of the English-language Wikipedia was requested over 240 thousand times during that hour.
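The delimited lines can be read with a few lines of code. Here is a minimal sketch, assuming fields are separated by single spaces as in the examples above; the names `PageCount` and `parse_pagecount_line` are hypothetical.

```python
from collections import namedtuple

PageCount = namedtuple("PageCount", "project title requests size")


def parse_pagecount_line(line):
    """Split one line into project, title, request count, and content size."""
    project, rest = line.strip().split(" ", 1)
    # Split from the right so the two numeric fields are always taken last.
    title, requests, size = rest.rsplit(" ", 2)
    return PageCount(project, title, int(requests), int(size))


row = parse_pagecount_line("en Main_Page 242332 4737756101")
print(row.project, row.title, row.requests, row.size)
```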

Data in JSON format is available at http://stats.grok.se/.

Existing tools

You can interactively browse the page view statistics and get data in JSON format at http://stats.grok.se/.

The following tools also use pageview statistics:

Article traffic statistics

GLAMourous - Commons image usage on Wikimedia projects

Top 100 articles for 2012 for each project

Support

Maintainer:

http://stats.grok.se/ is maintained by User:Henrik

Research projects using data from this source

Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data combines the page view statistics for articles on movies from the raw page view dumps with the editorial activity data from the toolserver database to predict the financial success of movies

Wikipedia-Zugriffszahlen bestätigen Second-Screen-Trend (in German) studies how the TV schedule influences Wikipedia pageviews

More examples

WikiStats

Home page

http://stats.wikimedia.org/

See also: mw:Analytics/Wikistats

Description

Wikistats is a project, conceived and maintained since 2003 by Erik Zachte, that generates a wide range of statistical reports on trends in the wiki projects. These reports are produced from the dump files and the access logs.

Content

Hundreds of monthly reports, covering more than 25 languages, with information on:

unique visitors

editor activity

page views (overall and mobile)

article creation

Special reports (some one-off, others regular) on:

growth per project and language

page views and edits per project and language

server requests and traffic peaks

edits and reverts

user feedback

bot activity

mailing lists

Data format

Final reports are presented in table and chart form. Intermediate files are available in CSV format.

Download

CSV files

Project counts repackaged yearly

Existing tools

The scripts used to generate the CSV files (WikiCounts.pl + WikiCounts*.pm) and reports (WikiReports.pl + WikiReports*.pm) are available for download here.

Maintainer

Maintainer: Erik Zachte

DBpedia

Home page

http://dbpedia.org

Description

DBpedia.org is a community effort to extract structured information from Wikipedia and make it available on the web. DBpedia lets you run sophisticated queries against this data and link other datasets to Wikipedia.

Content

Like Wikipedia, DBpedia has data in many languages. The English version:

Describes 3.77 million things

2.35 million of these are classified in a consistent ontology (persons, places, creative works like music albums, films and video games, organizations like companies and educational institutions, species, diseases, etc.)

Localized versions of DBpedia in 111 languages

together describe 20.8 million things, out of which 10.5 million overlap (are interlinked) with concepts from the English DBpedia

The data set also features:

about 2 billion pieces of information (RDF triples)

labels and abstracts for >10 million unique things in up to 111 different languages

millions of

- links to images

- links to external web pages

- data links into external RDF data sets

- links to Wikipedia categories

- YAGO categories

Data format

RDF/XML

Turtle

N-Triples

SPARQL endpoint
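Of these formats, N-Triples is the simplest to process line by line, since each line holds one subject, predicate, and object followed by a period. The sketch below is a deliberately simplified parser written for illustration: it handles IRIs, simple blank nodes, and quoted literals with an optional language tag or datatype, and it is not a complete implementation of the format. The sample triple and the name `parse_triple` are illustrative assumptions.

```python
import re

IRI = r"<[^<>\s]*>"
BLANK = r"_:[A-Za-z0-9]+"
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + IRI + r")?"

TRIPLE_RE = re.compile(
    r"^\s*(?P<subject>{iri}|{blank})\s+"
    r"(?P<predicate>{iri})\s+"
    r"(?P<object>{iri}|{blank}|{literal})\s*\.\s*$".format(
        iri=IRI, blank=BLANK, literal=LITERAL
    )
)


def parse_triple(line):
    """Return (subject, predicate, object) as raw terms, or None if no match."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None  # comments, blank lines, or syntax this sketch does not cover
    return match.group("subject"), match.group("predicate"), match.group("object")


# Illustrative line in the style of a DBpedia label triple.
SAMPLE_LINE = (
    "<http://dbpedia.org/resource/Berlin> "
    "<http://www.w3.org/2000/01/rdf-schema#label> "
    '"Berlin"@en .'
)

triple = parse_triple(SAMPLE_LINE)
print(triple)
```

For production work, an RDF library is a safer choice than a hand-rolled pattern.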

Download

http://wiki.dbpedia.org/Downloads38 has links to all datasets, formats, and languages.

http://dbpedia.org/sparql - DBpedia's SPARQL endpoint
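A query can be sent to the endpoint as an ordinary HTTP GET request. The following is a minimal offline sketch: it builds the request URL and reads a hand-written result in the standard SPARQL JSON results layout. The `format` parameter is an assumption about what this particular endpoint accepts, and the query, `SAMPLE_RESULT`, and helper names are illustrative.

```python
import json
from urllib.parse import urlencode

SPARQL_ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def build_sparql_url(query, endpoint=SPARQL_ENDPOINT):
    """Return a GET URL carrying the query; 'format' is an assumed endpoint option."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def binding_values(results_json, variable):
    """Collect the values bound to one variable in a SPARQL JSON result."""
    data = json.loads(results_json)
    return [
        row[variable]["value"]
        for row in data["results"]["bindings"]
        if variable in row
    ]


# Hand-written result standing in for a real response (for illustration).
SAMPLE_RESULT = json.dumps({
    "head": {"vars": ["label"]},
    "results": {"bindings": [
        {"label": {"type": "literal", "xml:lang": "en", "value": "Berlin"}}
    ]},
})

sparql_url = build_sparql_url(QUERY)
labels = binding_values(SAMPLE_RESULT, "label")
print(labels)
```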

How to use and examples

Use cases shows the different ways you can use DBpedia data (such as improving Wikipedia search or adding Wikipedia content to your webpage).

Applications shows the various applications of DBpedia including faceted browsers, visualization, URI lookup, NLP and others.

Existing tools

DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.

RelFinder is a tool for interactive relationship discovery in RDF data

Access

DBpedia data from version 3.4 on is licensed under the terms of the Creative Commons Attribution-ShareAlike 4.0 License and the GNU Free Documentation License.

Support

Mailing list: DBpedia Discuss

http://wiki.dbpedia.org/Support

http://wiki.dbpedia.org/Imprint

Research projects using data from this source

"Biographical Social Networks on Wikipedia - A cross-cultural study of links that made history" uses data extracted from DBpedia to study how biographies on Wikipedia vary depending on language/culture.

See more DBpedia related publications, blog posts and projects here.

DataHub

The Wikimedia group on DataHub is a collection of datasets about Wikipedia and other projects run by the Wikimedia Foundation.

The DataHub repository is meant to become the place where all Wikimedia-related data sources are documented. The collection is open to contributions and researchers are encouraged to donate relevant datasets.

The Wikimedia group on DataHub points to some additional data sources not listed on this page. Some examples are:

dbpedia lite, which uses the API to extract structured data from Wikipedia (not affiliated with DBpedia)

EPIC/Oxford quality assessment of Wikipedia by experts

Wikipedia Banner Challenge data

Wikipedia Editor Engagement Experiments: Timestamp position modification

References

↑ To be checked: how long it takes to update the WP:PT data.