— Ch. 1 · Defining The Big Data Paradigm —
In 1997, a researcher named John Mashey began using the phrase big data to describe datasets that were too large for standard software tools. By the early 2000s, the term had evolved from simply meaning huge volume to encompassing three core characteristics: volume, velocity, and variety. Volume refers to the sheer quantity of generated and stored data, often exceeding terabytes or petabytes. Velocity describes the speed at which data is created and processed, sometimes occurring in real-time. Variety captures the diverse types of information, ranging from structured numbers to unstructured text, images, and audio.
A fourth characteristic, veracity, was later added to address the reliability and quality of the data itself. Without sufficient investment in the expertise needed to assess veracity, the volume and variety of data can produce costs and risks that exceed an organization's capacity to create value. In 2018, a definition emerged stating that big data is data that requires parallel computing tools to handle. This marked a distinct shift in the computer science involved, away from the guarantees of the traditional relational model and toward parallel programming models. The size of big data remains a constantly moving target, ranging from dozens of terabytes to many zettabytes depending on the organization's capabilities.
Architectural Evolution And Technologies
Teradata Corporation marketed its first parallel processing system, the DBC 1012, in 1984, at a time when hard disk drives held only about 2.5 gigabytes. By 1992, Teradata systems were storing and analyzing one terabyte of data for the first time, and in 2007 the company installed the first petabyte-class relational database management system. Until 2008, these systems handled exclusively structured relational data; support for semi-structured types such as XML and JSON was added later.
Google published a paper on MapReduce in 2004, introducing a parallel processing model that splits a query across many nodes and then gathers and combines the partial results. The Apache open-source project Hadoop adopted this framework shortly afterward, and Apache Spark followed in 2012, addressing limitations of MapReduce by adding in-memory processing. Earlier, in 2000, Seisint Inc. had developed the HPCC Systems platform, which automatically partitions and distributes data across multiple commodity servers. LexisNexis acquired Seisint in 2004 and in 2008 used the platform to integrate the systems of ChoicePoint Inc. The HPCC platform was open-sourced under an Apache v2.0 license in 2011.
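To make the split-and-gather idea concrete, here is a minimal sketch of the MapReduce style of computation as a word count over two in-memory shards, using Python's standard multiprocessing pool. The shard contents, function names, and process count are illustrative assumptions; this is not Google's or Hadoop's actual implementation.

from collections import Counter
from multiprocessing import Pool

# Hypothetical input: each "node" holds one shard of text lines.
SHARDS = [
    ["big data needs parallel tools", "data volume grows fast"],
    ["velocity and variety matter", "big data is a moving target"],
]

def map_shard(lines):
    # Map step: each worker counts words in its own shard independently.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    # Reduce step: merge the per-shard counts into a single result.
    total = Counter()
    for counts in partials:
        total.update(counts)
    return total

if __name__ == "__main__":
    with Pool(processes=len(SHARDS)) as pool:
        partials = pool.map(map_shard, SHARDS)   # the work is split across workers
    print(reduce_counts(partials))               # partial results are gathered afterwards

Each worker produces a partial count in the map step, and the reduce step merges them; frameworks such as Hadoop and Spark apply the same structure across machines rather than local processes.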