Random bits

Some user generated databases

KEI is interested in the development of sustainable mechanisms to strengthen the evidence for public policy decisions. One element of this work concerns user generated databases, an area of considerable interest, but mixed experience, in recent years. The following are examples of several such projects, beginning with the excellent Ensembl project, followed by several others of varying degrees of success in their implementation.

As this brief list shows, there are all sorts of ways to design and manage user generated databases. In some cases, the database services seem to be set up more to showcase a technology or an idea for a platform. In other cases, the database is a focused effort to solve a practical and well identified user interest. Some are run by for profit companies, others by non-profits, individuals or communities. The databases take different approaches in terms of database design, attention to standards for data formats, and governance, among other issues.

The Ensembl Project

The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online. The Ensembl project was started in 1999, some years before the draft human genome was completed. Even at that early stage it was clear that manual annotation of 3 billion base pairs of sequence would not be able to offer researchers timely access to the latest data. The goal of Ensembl was therefore to automatically annotate the genome, integrate this annotation with other available biological data and make all this publicly available via the web. Since the website’s launch in July 2000, many more genomes have been added to Ensembl and the range of available data has also expanded to include comparative genomics, variation and regulatory data.

The number of people involved in the project has also steadily increased. Currently, the Ensembl group consists of between 40 and 50 people, divided into a number of teams. The Genebuild team creates the gene sets for the various species. The result of their work is stored in the core databases, which are taken care of by the Software team. This team also develops and maintains the BioMart data mining tool. The Compara, Variation and Regulation teams are responsible for the comparative, variation and regulatory data, respectively. The Web team makes sure that all data are presented on the website in a clear and user-friendly way. Finally, the Outreach team answers questions from users and gives workshops worldwide about the use of Ensembl. The Ensembl project is headed by Paul Flicek and Steve Searle, and receives input from an independent scientific advisory board.

Ensembl is a joint project between the European Bioinformatics Institute (EBI), an outstation of the European Molecular Biology Laboratory (EMBL), and the Wellcome Trust Sanger Institute (WTSI). Both institutes are located on the Wellcome Trust Genome Campus in Hinxton, south of the city of Cambridge, United Kingdom.

PhotoEnforced.com

PhotoEnforced.com is a user generated or crowd sourced database of photo enforced locations. The open database of locations and fines is continually updated by anonymous users from around the U.S. The majority of the database currently consists of red light cameras and speed cameras. However, as photo enforcement becomes an increasingly popular source of revenue for cities around the U.S., other photo enforcement techniques, such as illegal right turn cameras, bus lane cameras, parking cameras, toll booth cameras, carpool lane enforcement and railroad cameras, will soon be coming online. The database contains more than 6,400 locations and fines, and it is growing every day.

Discogs.com

Discogs is a user-built database containing information on artists, labels, and their recordings. Discogs also incorporates a Marketplace where you can buy and sell the recordings. Discogs is constantly growing as users submit releases to the database.

WhoSampled.com

WhoSampled.com is a community site for discovering and discussing sampled music, remixes and cover songs. Anyone can submit information about a sample, remix or cover and subject to approval it will be published on the site to be discovered and discussed by the world. WhoSampled was created out of our love for sampling, music history and music production. It aims not only to be the most comprehensive, detailed and accurate database of samples, remixes and covers on the web but also a fun and engaging place to be at. This website is all about the discovery of new and old music, the exploration of musical influences and the sharing of knowledge.

Death Row Database

A database of death row prisoners in the U.S. created by OpposingViews.com. The database contains sortable information on death row inmates in each state, including their race, county, and date of birth. The content of the database is largely user-generated. The information in the database is editable, and individuals with knowledge of death row inmates may change or add new information. DPIC has no part in the creation or maintenance of this database, nor can we vouch for its accuracy. It may be a useful tool in exploring how the death penalty is applied.

ReverseNumber.co.uk

Have you ever gotten a call from a number and wished you knew who they were, or why they were calling? ReverseNumber.co.uk is a user generated phone number lookup service made for people in the UK who want this valuable information from other individuals just like them. You can also leave your own feedback about your experience with the caller to help the community.

Bodylogs
While this site has an interesting purpose, it does not look that well implemented in practice:

The purpose of this site is to allow people from around the world to access and contribute to a global database of user-generated health content. Bodylogs collects anecdotal information relating to a certain health topic under the premise that in large amounts this information becomes very powerful. The mission of Bodylogs is to use anecdotal data from everyday people to draw meaningful inferences concerning health and medicine for the sake of furthering the health-related goals of its users. We believe that helpful, unbiased, and accurate information regarding health should be easily accessible to everyone. Bodylogs was created for this very reason.

publicearth.com
This database seems to be largely populated by entries from hotel operators.

Accessibility Mapr

Accessibility Mapr is like a cross between Wikipedia, Google Maps, and a disability access map. Anyone can add information about the accessibility of any location on earth to the map, and the information is then available for all to freely view.

Why did you create Accessibility Mapr?

I’m a wheelchair user and I spend a great deal of time trying to work out if various places that I want to or need to go to are going to be accessible. Often they’re not, which sucks. But equally unjust, I would argue, is how difficult it can be to get accessibility information without going to the location in person. Sure, some large organisations and businesses publish accessibility information on their websites, but the quality of the information varies and can be difficult to find. And most small businesses haven’t even thought about accessibility, let alone publicized it. You can ring up, but what guarantee do you have that the person on the other end knows what they’re talking about? My hope is that Accessibility Mapr can be a repository of grass roots accessibility knowledge: for people with disabilities, by people with disabilities.

National Equestrian Crime Database

NECD is the most advanced equestrian crime database in the world. It is designed to protect your horse, your equestrian estate, your passion. NECD uniquely blends the two worlds of the equestrian and advanced information technology with the sole aim of protecting equestrian estate. Two years in development, NECD has gained the backing and support of the most influential equestrian organisations in the UK.

OpenStreetMap
This is a collaborative project to create a free editable map of the world. From today’s Wikipedia:

OpenStreetMap (OSM) was founded in July 2004 by Steve Coast. In April 2006, the OpenStreetMap Foundation (OSMF) was established to encourage the growth, development and distribution of free geospatial data and provide geospatial data for anybody to use and share. In December 2006, Yahoo confirmed that OpenStreetMap could use its aerial photography as a backdrop for map production. In April 2007, Automotive Navigation Data (AND) donated a complete road data set for the Netherlands and trunk road data for India and China to the project. By July 2007, when the first OSM international conference, The State of the Map, was held, there were 9,000 registered users. Sponsors of the event included Google, Yahoo and Multimap. In August 2007 an independent project, OpenAerialMap, was launched to hold a database of aerial photography available on open licensing, and in October 2007 OpenStreetMap completed the import of a US Census TIGER road dataset. In December 2007 Oxford University became the first major organisation to use OpenStreetMap data on their main website. In January 2008, functionality was made available to download map data into a GPS unit for use by cyclists. In February 2008 a series of workshops were held in India. In March two founders announced that they had received venture capital funding of 2.4m euros for CloudMade, a commercial company that will use OpenStreetMap data.

http://www.plantdatabase.co.uk/

This site has been developed to enable anyone interested in plants to contribute data; whether a gardener, researcher, a student or someone who just likes plants. Everybody knows something. We have the images in categories so that plants can be identified, seasonal influences seen at a glance, and lifecycles shown from seed to maturity.

Freebase
A history of Freebase is available from the Freebase Wiki here, and some more information is available from http://en.wikipedia.org/wiki/Freebase. The Freebase domain and trademark are owned by Metaweb, a firm acquired by Google in 2010. Freebase claims to have 374,600,492 “facts” and nearly 23 million topics.

Freebase has information about approximately 23 million Topics or Entities. Each one has a unique Id, which can help distinguish multiple entities which have similar names, such as Henry Ford the industrialist vs Henry Ford the footballer. Most of our topics are associated with one or more types (such as people, places, books, films, etc) and may have additional properties like “date of birth” for a person or latitude and longitude for a location. These types and properties and related concepts are called Schema. Anyone can contribute data to Freebase, and you can also build your own schema in a Base if Freebase does not yet have schema for a subject you’re interested in.
Where does the data come from? Data in Freebase comes from a variety of Data sources. Some data is loaded by automated Data Pipelines, some uploaded in bulk, either by Metaweb’s Data team or by our Community of contributors using our API or other data loading tools. Other data is manually added piece by piece by individuals who simply use the website to edit topics.
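As a rough illustration of the data model described above (topics with unique ids, one or more types, and optional properties), here is a minimal Python sketch. It is not Freebase's actual API or schema: the `Topic` class, the `find_by_name` helper, the id strings and the type names are all hypothetical, chosen only to show why a unique id matters when two entities share a name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Topic:
    """One entity. The id, not the display name, identifies it."""
    topic_id: str                      # hypothetical id format
    name: str
    types: Set[str] = field(default_factory=set)
    properties: Dict[str, str] = field(default_factory=dict)


def find_by_name(topics: List[Topic], name: str) -> List[Topic]:
    """Return every topic that shares the given display name."""
    return [t for t in topics if t.name == name]


# Two entities with the same name, told apart by id and types.
# The ids and type labels here are made up for illustration.
topics = [
    Topic("topic/1", "Henry Ford", {"person", "industrialist"}),
    Topic("topic/2", "Henry Ford", {"person", "footballer"}),
]

matches = find_by_name(topics, "Henry Ford")
ids = sorted(t.topic_id for t in matches)
```

A name lookup returns both topics, so anything that needs to refer to one of them unambiguously has to use the id rather than the name.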

ClinicalTrials.gov

ClinicalTrials.gov currently contains 111,738 trials sponsored by the National Institutes of Health, other federal agencies, and private industry. Studies listed in the database are conducted in all 50 States and in 175 countries. ClinicalTrials.gov receives over 50 million page views per month and 65,000 visitors daily.

A Protocol Registration System (PRS) account is required for submitting study information to ClinicalTrials.gov. Data submitters must coordinate with all of their partners so that trial information is submitted only once, by one of the entities listed below, to ClinicalTrials.gov. Trial data may be submitted by the following entities:

Sponsors legally responsible for conducting clinical trials, e.g., holders of investigational new drug applications from the U.S. Food and Drug Administration.

Governmental or international agencies conducting or supporting clinical trials, e.g., the U.S. National Institutes of Health.

Lead principal investigators who are responsible for conducting and coordinating the overall clinical investigation across multiple study sites. Trial data should not be submitted from each individual study location.

Where does my money go?

Where Does My Money Go? aims to promote transparency and citizen engagement through the analysis and visualisation of information about UK public spending. It is an independent non-partisan project run by the Open Knowledge Foundation. We’re trying to make government finances much easier to explore and understand – so you can see where every pound of your taxes gets spent.

Where Does My Money Go? was first developed as an idea by the Open Knowledge Foundation’s Jonathan Gray in 2007. In November 2008 the project was a winner of the UK Government’s Show Us a Better Way competition. The project received a small grant in summer 2009 from the UK Government to develop a prototype, which was launched in autumn 2009. In 2010 the project received funding from Channel 4’s 4iP to support further development.

The Open Knowledge Foundation (OKF) is a not-for-profit organization dedicated to making information open, available for anyone to access and re-use. This is a citizen-driven project not only in the sense that citizens like yourself will be its users, but also in that we are going to need your help to gather and analyse the data.

SaferProducts.gov

In August 2008, Congress passed the Consumer Product Safety Improvement Act (CPSIA). Section 212 of the CPSIA requires the U.S. Consumer Product Safety Commission (CPSC) to create, by March 2011, a searchable public database of reports of harm (Reports) related to the use of consumer products and other products or substances within the jurisdiction of the CPSC.

The U.S. Consumer Product Safety Commission’s Publicly Available Consumer Product Safety Information Database (Database) is a publicly searchable database where submitters can report to the CPSC a harm or risk of harm related to the use of a consumer product or other product or substance within the jurisdiction of the CPSC.

Members of the public can search the Database for safety information about products that are in their home already, or that they may be thinking about purchasing. Beginning March 11, 2011, reports of harm or “Reports,” that contain minimum information required by law and that provide the submitter’s consent, will be posted in the Database on our website at: www.SaferProducts.gov. The public can search the Database and review Reports approximately 15 business days after a Report is submitted to the CPSC.

Product manufacturers (including importers) and private labelers that are identified in a Report may submit comments to be displayed in the Database along with the Report. Information about product recalls is also available for search and review in the Database. The Database represents a new level of transparency for the CPSC, allowing the public to have immediate access to safety information about consumer products.

Road Damage Assessment System
According to a March 17, 2011 article in Dailycrowdsource.com:

All motorists are aware of the bane of potholes that occur during each spring thaw. While a motorist’s reflex is to simply avoid a pothole and hope for a repair soon, a new project by the Carnegie Mellon University permits you to take an active role in road repair. If you have a GPS linked cell phone and a Facebook account, you can easily provide instant alerts to the concerned department on potholes in an area.

The project, termed Road Damage Assessment System (RODAS), enables anyone to upload an image of the pothole via their Facebook account. The software will then link the uploaded image to an online map, thus creating a consolidated crowdsourced database of potholes. Pennsylvania Business Council’s PBC Education Foundation and the Pennsylvania Boroughs Association through its Chrostwaite Institute have provided seed funding for the project. While this project is helpful in creating a database of roads that need repair, the power will only fully be utilized if community members continue to upload the latest status of the potholes submitted by them. The project is headed up by Robert Strauss, Professor of Economics and Public Policy, working alongside Takeo Kanade, Professor of Robotics and Computer Science.

TripAdvisor
Claims 50 million reviews and opinions.

World Memory Project

The United States Holocaust Memorial Museum has gathered millions of historical documents containing details about survivors and victims of the Holocaust and Nazi persecution during World War II.

Ancestry.com has spent more than a decade creating advanced technological tools that have allowed billions of historical documents to become searchable online.

Together, the two organizations have created the World Memory Project to allow the public to help make the records from the Museum searchable by name online for free—so more families of survivors and victims can discover what happened to their loved ones during one of the darkest chapters in human history.

The World Memory Project will build the largest free online resource of information about victims and survivors of Nazi persecution—to restore the identities of people the Nazis tried to erase from history and enable families to discover the fates of missing loved ones. The project allows anyone, anywhere to type information from historical records into databases that will be made searchable online for free.

Factual.Com
My impression of Factual.com is that it is a service looking to be sold, without much attention to the practical needs of users. Here is how the owner describes the service:

Factual was founded in 2007 by Gil Elbaz, co-founder of Applied Semantics (which launched ASI’s AdSense product). Applied Semantics was acquired by Google in 2003. Gil has had a lifelong passion for organizing and structuring information, and building smart tools which can make better sense of data. To that end, he set out to develop an open data platform and community in an effort to maximize data accuracy, transparency, and availability. Fellow data lovers — Tim Chklovski, chief scientist & founding engineer, Eva Ho, an ex-Googler, along with Bill Michels, former GM of Yahoo BOSS, have joined him in this ambitious project. We are very excited to have an extremely impressive group of investors join us in our mission: Andreessen Horowitz, Auren Hoffman, Aydin Senkut, Bill Gross, Esther Dyson, Founder Collective, GRP Partners, Gunderson Dettmer, Index Ventures, Lee and June Stein, Marten Mickos, Michael Ovitz, Miramar Venture Partners, Richard Rosenblatt, Scott Kurnit, SV Angel, Thomas Lehrman, and Tom Unterman.

African Origins

African Origins contains information about the migration histories of Africans forcibly carried on slave ships into the Atlantic. Using detailed information on 9,453 Africans liberated by Courts of Mixed Commission, this resource presents geographic, ethnic, and linguistic data on peoples captured in Africa and pulled into the slave trade. Through contributions to this website by Africans, members of the African Diaspora, and others, we hope to realize the history of the millions of Africans captured and sold into slavery during suppression of transatlantic slave trading in the 19th century.

If you are familiar with any African names or naming practices, you will be able to contribute to this project. By suggesting a modern counterpart for an African name recorded in the historical registers, as well as ethno-linguistic groups that use that name, you help to identify the likely linguistic, cultural, and geographic origins of that African.

You can begin by entering an African name you know into the search box on the African-Origins home page. You are encouraged to select a country (if you know what country is associated with the name you entered) and the gender associated with the name (if appropriate). Then click the “Explore” button for a list of exact or similar sounding names from those recorded in the Court of Mixed Commission registers. By clicking on a row in the table of results, you can see all recorded and imputed information for that African, and a link to the form for making a contribution to that African’s identity.

Scholars familiar with African names, naming practices, languages and ethno-linguistic groups will review the cumulative responses for names such as those that you furnished, looking for consensus among respondents like you. It is expected that persons like you and responses like yours will be highly correlated, allowing the editors to determine with degrees of certainty (based on contributions) where certain names likely came from. Since most of the African languages and names are geocentric, the work of contributors like you and the editors will likely result in certain areas on the African continent being highlighted as the geographic origins of the Africans who were enslaved and recaptured, and who ended up on the registers of the Courts of Mixed Commissions.
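The review step described above, looking for consensus among contributors, can be pictured with a small Python sketch. This is only an illustration and not the project's actual method: the `consensus` function, the 60% threshold and the placeholder group labels are all assumptions.

```python
from collections import Counter


def consensus(suggestions, min_share=0.6):
    """Return (group, share) for the most frequently suggested group,
    or None if no group reaches min_share of all suggestions.

    min_share is an assumed threshold, not one taken from the project.
    """
    if not suggestions:
        return None
    counts = Counter(suggestions)
    group, votes = counts.most_common(1)[0]
    share = votes / len(suggestions)
    return (group, share) if share >= min_share else None


# Hypothetical contributions for one recorded name (placeholder labels).
agreeing = ["Group A", "Group A", "Group A", "Group B"]
split = ["Group A", "Group B", "Group C"]

result_agreeing = consensus(agreeing)  # three of four contributors agree
result_split = consensus(split)        # no group reaches the threshold
```

With a rule like this, a name is attributed to a group only when enough contributors agree. Names with split responses stay unresolved for the editors to examine.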

http://www.poderopedia.com/

To promote greater transparency in Chile, Poderopedia (Powerpedia) will be an editorial and crowdsourced database that highlights the links among the country’s elite. Using data visualization, the site will investigate and illustrate the connections among people, companies and institutions, shedding light on any conflicts of interests. Crowdsourced information will be vetted by professional journalists before it is posted. Entries will include an editorial overview, a relationship map and links to the sources of information.

Regards Citoyens database of French lobbying

The influence of lobbyists at the National Assembly is difficult to quantify because of the various forms it can take: informal discussions, one-on-one meetings, suggested amendments… The only reliable information on the contacts they may have with deputies is found in the lists of hearings presented in the annexes of parliamentary reports.

These reports synthesize the work carried out by deputies to study the impact of a bill or to conduct oversight. Numerous interviews are conducted in order to gather the views of as many of the stakeholders concerned as possible. Studying the people heard for these reports should therefore make it possible to draw a first map of the “influencers” that deputies consider important.

To carry out this study, Transparence International France and Regards Citoyens decided to join forces to analyze all the reports published at the National Assembly since the beginning of the legislature. The first stage of this project led us to collect, from these reports, the names of the people heard: more than 15,000 in total! Since identifying each of them is a titanic task, we decided to ask internet users to help us recognize the organizations represented.

The Internet Movie Database (IMDb)

Since 1990, an incredibly diverse range of people across the globe have been adding, refining and correcting the data on our pages. It’s an ever-evolving process and we invite you to become a part of it. You do have to be registered first, but it’s free and it’s painless.

What is a Contributor?

Quite simply, a contributor is anyone who submits information for display on the site. There is a huge variety of data that can be added, such as a new Title (i.e. Movies, TV shows, Video Games etc.), Names (actors, writers, film crew, celebrities etc.) and numerous other categories such as directors, producers, trivia, goofs, soundtracks, quotes, release dates.

Adding Data to the IMDb

Adding information (we call it ‘data’) is a very simple process, and we always welcome new Contributors. Below, you’ll find all the info you’ll need to start updating the site.

Every piece of data submitted to the IMDb is checked by the Database Content Team before it goes live. The team is split into three project groups, each with a specific area of responsibility. For 2011, these responsibilities are:

Contribution – Improving all aspects of the contribution process.

Coverage – Filling data gaps and expanding our content coverage.

Smart-Processing – Improving our internal processing workflows.

According to Richard Sexton, the IMDb project was born at the University of Cardiff, but traces its roots to the Usenet newsgroup rec.arts.movies.

Geni.Com

Geni is solving the problem of genealogy by inviting the world to build the definitive online family tree. Using the basic free service at Geni.com, users add and invite their relatives to join their family tree, which Geni compares to other trees. Matching trees are then merged into the single world family tree, which currently contains nearly 50 million living users and their ancestors. Pay services include enhanced research tools as well as keepsake products created from family tree data. Geni welcomes casual genealogists and experts who wish to discover new relatives and stay in touch with family. Geni is privately held and based in Los Angeles, California.

Mytrees.com

Our mission is to provide a platform where genealogists from around the world can share their research with each other. We are also dedicated to enriching our research archives with resources that will benefit you, our patron and fellow family history researcher. Data freely given is freely shared. It is our policy to freely share data whenever the costs for that data allow us to provide it free of charge. Our Ancestry Archive Index is totally free.
Total Ancestry Archive Names: 532,201,664
Family Tree Names: 278,514,106
1930 US Federal Census Names: 106,016,291
Social Security Death Index Names: 86,029,291
1860 US Federal Census Names: 25,123,623
Internet Names: 20,655,310
Other Flat Record Names: 4,659,837
US Civil War Record Names: 4,050,069
US Naturalization Record Names: 3,684,325
US Revolutionary War Pension Application File Names: 3,468,812

OneGreatFamily.Com

OneGreatFamily is a single, shared family tree built by people all over the world. The OneGreatFamily Tree is a powerful genealogy database that is shared and built by people like you from all over the world. Everyone’s genealogy ties into the OneGreatFamily Tree. Our database has over 190 million unique entries.

Wikipedia
Ed Summers asked that I also include Wikipedia:

Wikipedia is a multilingual, web-based, free-content encyclopedia project based on an openly editable model. Wikipedia’s articles provide links to guide the user to related pages with additional information. Wikipedia is written collaboratively by largely anonymous Internet volunteers who write without pay. Anyone with Internet access can write and make changes to Wikipedia articles (except in certain cases where editing is restricted to prevent disruption or vandalism). Users can contribute anonymously, under a pseudonym, or with their real identity, if they choose. Since its creation in 2001, Wikipedia has grown rapidly into one of the largest reference websites, attracting 400 million unique visitors monthly as of March 2011 according to ComScore. There are more than 82,000 active contributors working on more than 19,000,000 articles in more than 270 languages. As of today, there are 3,705,468 articles in English.

The Wikimedia Foundation also operates several other projects, including Wiktionary, Wikiquote, Wikibooks, Wikisource, Wikinews, Wikiversity, Wikispecies, MediaWiki, Wikimedia Meta-Wiki, Wikimedia Commons, and Wikimedia Incubator.

images.Killi.NET
This is a database of pictures of killifish. The manager of the database is Richard Sexton. The database lists all Cyprinodontiform fishes, organizing them using a three letter mnemonic that was invented by Col. J.J. Scheel of Denmark.

International HapMap Project

The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings. Using the information in the HapMap, researchers will be able to find genes that affect health, disease, and individual responses to medications and environmental factors. One goal of the International HapMap Project is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared. The Project is a collaboration among scientists and funding agencies from Japan, the United Kingdom, Canada, China, Nigeria, and the United States. It officially started with a meeting on October 27 to 29, 2002. It involved phases of data collection and analysis. The complete data obtained in Phase I were published on 27 October 2005. The analysis of the Phase II dataset was published in October 2007. The Phase III dataset was released in spring 2009.

This entry was posted on Monday, August 8th, 2011 at 5:40 pm and is filed under Economics, Intellectual Property, Technology.
