Column-Oriented DBMS

Source: Wikipedia: Column-Oriented DBMS

Column-oriented DBMS

From Wikipedia, the free encyclopedia
Jump to: navigation, search
Question book-new.svg
This article needs additional citations for verification.
Please help improve this article by adding reliable references. Unsourced material may be challenged and removed. (September 2009)

A column-oriented DBMS is a database management system (DBMS) which naturally stores its content by column rather than by row. This has advantages for databases such as data warehouses and library catalogues, where aggregates are computed over large numbers of similar data items. It is possible to achieve some benefits of either column-oriented or row-oriented organization with any database. By denoting one as column-oriented we are referring to both the ease of expression of a column oriented structure and the focus on optimizations for column-oriented workloads. [1][2] This approach is contrasted with row-oriented or row store databases and with correlation databases, which use a value-based storage structure.

* 1 Description
* 2 Benefits
o 2.1 Storage efficiency vs. random access
* 3 Implementations
* 4 References

[edit] Description

A database program must show its data as two-dimensional tables, of columns and rows, but store it as one-dimensional strings. For example, a database might have this table.
EmpId Lastname Firstname Salary
1 Smith Joe 40000
2 Jones Mary 50000
3 Johnson Cathy 44000

This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary).

This table exists in the computer's memory (RAM) and storage (hard drive). Although RAM and hard drives differ mechanically, the computer's operating system abstracts them. Still, the database must coax its two-dimensional table into a one-dimensional series of bytes, for the operating system to write to either the RAM, or hard drive, or both.

A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on.


A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.


This is a simplification. Partitioning, indexing, caching, views, OLAP cubes, and transactional systems such as write ahead logging or multiversion concurrency control all dramatically affect the physical organization. That said, online transaction processing (OLTP)-focused RDBMS systems are more row-oriented, while online analytical processing (OLAP)-focused systems are a balance of row-oriented and column-oriented.
[edit] Benefits

Comparisons between row-oriented and column-oriented systems are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is incredibly long compared to the other delays in computers. Further, because seek time is improving at a slow rate relative to CPU power (see Moore's Law), this focus will likely continue on systems reliant on hard-disks for storage. Following is a set of over-simplified observations which attempt to paint a picture of the trade-offs between column and row oriented organizations. Note: With the advancing of in-memory technology and the low prices of RAM the access time via hard-disk is becoming a less important factor.

1. Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
2. Column-oriented systems are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
3. Row-oriented systems are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
4. Row-oriented systems are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.

In practice, row oriented architectures are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column stores are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes). However, there are a number of proven row-based OLAP RDBMS that handles terabytes, or even petabytes of data, such as Teradata.
[edit] Storage efficiency vs. random access

Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row oriented data. For example, many popular modern compression schemes, such as LZW, make use of the similarity of adjacent data to compress. While the same techniques may be used on row-oriented data, a typical implementation will achieve less effective results. Further, this behavior becomes more dramatic when a large percentage of adjacent column data is either the same or not-present, such as in a sparse column (similar to a sparse matrix). The opposing tradeoff is random access. Retrieving all data from a single row is more efficient when that data is located in a single location, such as in a row-oriented architecture. Further, the greater adjacent compression achieved, the more difficult random-access may become, as data might need to be uncompressed to be read. Therefore, column-oriented architectures are sometimes enriched by additional mechanisms aimed at minimizing the need of access to compressed data[3].
[edit] Implementations

Column stores or transposed files have been implemented from the early days of DBMS development, beginning in the 1970s. For example, Statistics Canada implemented the RAPID system[4] in 1976 and used it for processing and retrieval of the Canadian Census of Population and Housing as well as several other statistical applications. RAPID was shared with other statistical organizations throughout the world and used widely in the 1980s. It continued to be used by Statistics Canada until the 1990s.

For many years, only the Sybase IQ product was commercially available in the column-oriented DBMS class. However, that has changed rapidly in the last few years with many open source and commercial implementations.

Current examples of column-oriented DBMSs include:

* Commercial
o SAND/DNA Analytics
o SenSage
o Sybase IQ
o Vertica and its academic open-source cousin C-Store
o Valentina Database
o Vectornova/Vectorstar High-speed Data Engine
o Addamark, now the Sensage Scalable Log Server
o 1010data's Tenbase database
o DataProbe
o EXASolution
o Skytide XOLAP Server
o SuperSTAR from Space-Time Research
o ParAccel Analytic Database
o FluidDB
o eXtreme-DB from StoneAge tech
* Open-source (proprietary software)
o Calpont's InfiniDB Enterprise Edition, MySQL-front end
o Infobright (formerly Brighthouse) data engine, integrates with MySQL
o RC21 commercial open source project
o Xplain Semantic Database (called transposed files). "The latest release was version 5.8 (1999)"
* Open-source (free software)
o Calpont's InfiniDB Community Edition, MySQL-front end, GPLv2
o C-Store No new release since Oct 2006
o GenoByte Column based storage system and API for handling genotype data
o Lemur Bitmap Index C++ Library (GPL)
o FastBit
o Infobright Community Edition, regular updates
o LucidDB and Eigenbase
o MonetDB academic project
o Ingres & Vectorwise initiative
o Metakit
o The S programming language and GNU R incorporate column-oriented data structures for statistical analyses

[edit] References

1. ^ A decomposition storage model, Copeland, George P. and Khoshafian, Setrag N., SIGMOD '85, 1985.
2. ^ C-Store: A column-oriented DBMS, Stonebraker et al., Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
3. ^ Brighthouse: an analytic data warehouse for ad-hoc queries, Slezak et al., Proceedings of the 34th VLDB Conference, Auckland, New Zealand, 2008
4. ^ A DBMS for Large Statistical Databases, Turner, Hammond, Cotton, Proceedings of VLDB 1979, Rio de Janeiro, Brazil.

Retrieved from ""
Categories: Database management systems | Types of databases

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License