• Anisha Sharma

Data Management in MarkLogic

MarkLogic can integrate many heterogeneous data sources (different in structure and function) into a single platform architecture, giving applications uniform access to data that would otherwise live in separate systems. The sources do not need to be normalized or otherwise restructured to present a consistent view of the data. MarkLogic also offers several mechanisms for delivering information to end users in the language of their choice.


What's on Disk


This section covers how MarkLogic manages data on disk. The topics are:

  • Databases, Forests, and Stands

  • Tiered Storage

  • Super Databases and Super Clusters

  • Partitions, Partition Keys, and Partition Ranges



Databases, Forests, and Stands

A database consists of one or more forests. A forest is a collection of documents that is implemented as a physical directory on disk. Each forest holds a set of documents and all of their indexes. A single machine may manage several forests or, in a cluster, none at all (when it acts only as an E-node, that is, an evaluator node). Forests can be searched in parallel, so placing more forests on a multi-core server can improve concurrency. As a rule of thumb, plan on one forest for every two cores on a box, with each forest holding millions or tens of millions of documents. In a clustered system, you can have several servers, each managing its own set of forests, all unified into a single database.
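To make the rule of thumb concrete, here is a minimal Python sketch of the arithmetic. It is not part of any MarkLogic API. The function names and the 16-core, 200-million-document example are made up for illustration, and real sizing depends on your workload.

```python
def suggested_forest_count(cores: int, cores_per_forest: int = 2) -> int:
    """Rule-of-thumb forest count for one host: one forest per two cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Integer division, but never suggest fewer than one forest.
    return max(1, cores // cores_per_forest)


def docs_per_forest(total_docs: int, forests: int) -> int:
    """Documents each forest would hold if spread evenly (rounded up)."""
    return (total_docs + forests - 1) // forests


# Hypothetical host: 16 cores, 200 million documents.
forests = suggested_forest_count(16)
print(forests)                                 # 8
print(docs_per_forest(200_000_000, forests))   # 25000000
```

In this example each forest ends up with 25 million documents, which sits inside the "millions or tens of millions" range mentioned above.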


Each forest holds zero or more stands. A stand (like a stand of trees) is a subset of the forest's data that exists as a physical subdirectory under the forest directory. Each stand contains the actual compressed document data (in TreeData) and the indexes (in IndexData).


A forest may contain a single stand, but multiple stands are more common because they improve concurrency and help MarkLogic ingest data more efficiently.


Tiered Storage

MarkLogic lets you manage your data at different tiers of storage and computation environments. The top tier provides the fastest access to your most critical data, and the lowest tier provides the slowest access to your least critical data. Infrastructures such as Hadoop and public clouds make it economically feasible to scale storage in the lower tiers to accommodate enormous volumes of data. Separating data among storage tiers lets you optimize the trade-offs between cost, performance, availability, and flexibility.


Super Databases and Super Clusters

Multiple databases, including those that serve on different storage tiers, can be grouped into a super-database so that a single query can run across several tiers of data. Databases that belong to a super-database are called sub-databases. A single sub-database can belong to multiple super-databases.
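The relationship can be pictured with a small toy model in Python. This is only a sketch of the idea, not how MarkLogic implements it: the `Database` class, its `search` method, and the database names are all hypothetical, and "search" here is a plain substring match.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Toy model: a database with its own documents and optional sub-databases."""
    name: str
    documents: dict[str, str] = field(default_factory=dict)   # URI -> content
    sub_databases: list["Database"] = field(default_factory=list)

    def search(self, term: str) -> list[str]:
        """Return URIs containing term, from this database and its sub-databases."""
        hits = [uri for uri, text in self.documents.items() if term in text]
        for sub in self.sub_databases:
            hits.extend(sub.search(term))
        return sorted(set(hits))


# Two databases that might sit on different storage tiers (hypothetical names).
hot = Database("hot-2024", {"/orders/1.json": "widget order"})
archive = Database("archive", {"/orders/0.json": "older widget order",
                               "/notes/a.txt": "memo"})

# The same sub-database ("archive") belongs to two super-databases.
all_orders = Database("all-orders", sub_databases=[hot, archive])
reporting = Database("reporting", sub_databases=[archive])

print(all_orders.search("widget"))  # ['/orders/0.json', '/orders/1.json']
print(reporting.search("widget"))   # ['/orders/0.json']
```

One query against `all_orders` reaches both tiers, while `reporting` reuses `archive` without copying it.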


Partitions, Partition Keys, and Partition Ranges

MarkLogic Server tiered storage manages data in partitions. Each partition consists of a group of database forests that share the same name prefix and the same partition range.

The range of a partition defines the scope of element or attribute values for the documents to be stored in that partition. This element or attribute is called the partition key. The partition key is based on a range index, collection lexicon, or field set on the database. Because the partition key is set on the database and the partition range is set on each partition, a database can have several partitions with different ranges.
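The following Python sketch shows how a partition key value could be matched to a partition range. It is an illustration only and does not call any MarkLogic API. The `RangePartition` class, the date-based key, the inclusive-lower/exclusive-upper bounds, and the `-0001` style forest suffix are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RangePartition:
    name: str     # also used as the shared forest name prefix (assumed)
    lower: date   # lower bound of the partition range, inclusive (assumed)
    upper: date   # upper bound of the partition range, exclusive (assumed)

    def accepts(self, key_value: date) -> bool:
        """True if the document's partition key value falls in this range."""
        return self.lower <= key_value < self.upper

    def forest_names(self, count: int) -> list[str]:
        """Forest names that share this partition's name prefix."""
        return [f"{self.name}-{i:04d}" for i in range(1, count + 1)]


def find_partition(partitions: list[RangePartition],
                   key_value: date) -> Optional[RangePartition]:
    """Return the first partition whose range covers key_value, else None."""
    for partition in partitions:
        if partition.accepts(key_value):
            return partition
    return None  # no partition range covers this value


# One partition key (a date) on the database, one range per partition.
partitions = [
    RangePartition("2023", date(2023, 1, 1), date(2024, 1, 1)),
    RangePartition("2024", date(2024, 1, 1), date(2025, 1, 1)),
]

print(find_partition(partitions, date(2024, 3, 15)).name)  # 2024
print(find_partition(partitions, date(2022, 6, 1)))        # None
print(partitions[1].forest_names(2))                       # ['2024-0001', '2024-0002']
```

The key type is fixed once for the whole database, and each partition only contributes its own bounds, which is why several partitions with different ranges can coexist.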

