In the modern world of big data, unstructured data is the most abundant. It’s so prolific because unstructured data could be anything: media, imaging, audio, sensor data, text data, and much more. Unstructured simply means that it is datasets (typical large collections of files) that aren’t stored in a structured database format. Unstructured data has an internal structure, but it’s not predefined through data models. It might be human generated, or machine generated in a textual or a non-textual format.
Unstructured Data vs. Structured Data
Unstructured data can be thought of as data that’s not actively managed in a transactional system; for example, data that doesn’t live in a relational database management system (RDBMS). Structured data can be thought of as records (or transactions) in a database environment; for example, rows in a table of a SQL database.
There is no preference as to whether data is structured or unstructured. Both have tools that allow users to access information. Unstructured data just happens to be in greater abundance than structured data is.
Examples of unstructured data are:
Until the advent of object-based storage, most, if not all, of this unstructured data was stored in file-based systems.
What Challenges Does Working with Unstructured Data Present?
The way to think about how to deal with the challenges of unstructured data is to ask: What do enterprises face with traditional approaches to managing unstructured data?
ScaleIt’s common in many enterprises to encounter unstructured datasets at the scale of tens or hundreds of billions of items. These items, objects, or files can be anything from a few bytes (for example, a temperature reading from a production-line instrument) to terabytes in size (for example, a full-length 8K resolution motion picture). Managing this scale with traditional file approaches rapidly moves from difficult to impossible as more and more resources are required just to maintain a “balance” of servers, file systems, arrays, and so on.
CollaborationIncreasingly, these massive unstructured datasets deliver value as they are shared (for example, researchers at multiple hospitals who share a common massive bank of genomic sequences). With traditional approaches, the ability to share massive sets of unstructured data across geographies, corporate entities, and so on, has required extremely expensive replication and governance.