Distributed File System: Description, Features, Benefits

A distributed file system is a special system that provides access to files over a network, ensuring the availability and safety of data spread across multiple server machines. Its local analog is the traditional file system, which manages the mass storage devices attached to a single PC.

Network File System Basics


Network file systems guarantee access to files stored on servers. With their support, a user can build a complete internetwork file system out of the tools provided by various servers.

Distributed file systems provide mirroring, replication, and backup of data across any set of drives, which lets a developer edit, delete, or save files and configurations.

There are several distributed file systems that differ in purpose, interface, and protocol, as well as in functions such as caching, logging, and multi-channel use in local networks. Because clusters demand extremely high throughput, they use special file systems with transfer rates of more than 100 MB/s. These include the Global File System (GFS) and the General Parallel File System (GPFS).





A distributed file system is hierarchically structured and follows a single logical naming convention: the network protocol lets a user access files without knowing which server holds them. The central tree structure makes it easy to find files throughout the company. Files are stored redundantly and remain fully accessible even if a main hard drive fails. In the broader sense, the term also refers to the network protocol used to access the file system.

Examples are:

  1. Network File System (NFS).
  2. Common Internet File System (CIFS), an extension of Server Message Block (SMB).
  3. Apple Filing Protocol (AFP) from Apple.
  4. NetWare Core Protocol (NCP) from Novell.

Microsoft Windows DFS




Well-known implementations of distributed file systems are:

  1. DFS on Microsoft Windows, the standard distributed file system in Microsoft's server operating systems. It first appeared in Windows NT 4 and shipped with Windows 2000 Server. Windows Server 2003 added server improvements such as multiple DFS roots.
  2. AFS (Andrew File System), of which implementations from several vendors exist.
  3. DCE/DFS from the Open Group consortium, a further development of AFS within the Distributed Computing Environment project.
  4. Coda, developed at Carnegie Mellon University.
  5. BeeGFS/FhGFS, for clusters and HPC applications.
  6. GlusterFS, for all POSIX-compatible operating systems.
  7. HDFS, the Hadoop distributed file system.
  8. Ceph, which offers object, block, and file storage; its client is part of the Linux kernel (LGPL).
  9. XtreemFS, a fault-tolerant distributed file system with a POSIX-compatible interface.
  10. Google File System (GFS, GoogleFS) from Google, based on Linux and optimized for high data throughput.

Comparison of distributed file systems


Types of system services

Such a system provides the following services:

  1. Storage service. Allocates and manages space on secondary storage devices, thereby providing a logical view of the storage system.
  2. True file service. Covers file-sharing semantics, the caching mechanism, replication, concurrency control, and the update protocol for multiple copies.
  3. Directory name service. Responsible for directory operations: creating and deleting directories, adding a new file to a directory, removing a file from a directory, renaming it, and moving it from one directory to another.

Required properties of a distributed file system:

  1. Transparency. Clients of a distributed file system do not need to know the number or location of file servers and storage devices. Multiple file servers provide performance, scalability, reliability, and transparency of access.
  2. Uniform access. Both local and remote files must be accessible in the same way. The system should automatically locate an accessed file and transfer it to the client's site. The file name should not reveal the file's location, and it should not change when the file moves from one node to another. If the file is replicated on several nodes, the existence of multiple copies and their locations should be hidden from clients.
  3. User mobility. The user's environment, for example the home directory, should automatically follow the user to whatever node they log in to.
  4. Performance, measured as the average time it takes to satisfy client requests. This time includes processor time, plus the time to access secondary storage, plus network access time. It is desirable that the performance of a distributed file system be comparable to that of a centralized one.
  5. A simple user interface, with as few commands as possible.
  6. Scalability. Growth in the number of nodes and users should not seriously disrupt service.
  7. High availability. The system should continue to function in the face of partial failures, such as a communication link, host, or drive failure, and should have several independent file servers managing multiple storage devices.
  8. High reliability. The probability of losing stored data should be minimized; the system should automatically back up critical files.
  9. Data integrity. Concurrent access requests from multiple users competing for the same file must be correctly synchronized by a concurrency control mechanism.
  10. Security. Users must be able to rely on the confidentiality of their data.
  11. Heterogeneity. Easy access to shared data should be possible from different platforms, for example a Unix workstation or a Wintel machine.

Block Level Transfer Model

In file systems that use a data caching model, an important design issue is the choice of the data transfer unit: the portion of a file that travels between server and client as the result of a single read or write operation.

In the file-level transfer model, when data needs to be transferred, the entire file is moved. Advantages of the model:

  1. A file is transferred only once per client request, which is more efficient than page-by-page transfer with its extra network protocol overhead.
  2. Server load and network traffic are reduced, because the server is contacted only once per file.
  3. Scalability improves, and once the entire file is cached at the client site, it becomes immune to server and network failures.

The disadvantages of the model:

  1. Sufficient storage space is required on the client machine, so the approach is unsuitable for very large files, especially when the client runs on a diskless workstation.
  2. If only a small part of the file is needed, moving the entire file is wasteful.

In the block-level transfer model, file transfer occurs in blocks. A block is a contiguous portion of a file with a fixed length, which may also be set equal to the size of a virtual memory page.

In the byte-level transfer model, the transfer unit is a byte. This model provides maximum flexibility, because it allows storing and retrieving an arbitrary amount of file data specified by an offset within the file and a length. The disadvantage is that cache management is more complicated, since cached data has a variable length across different access requests.

The record-level transfer model is used with structured files, and the transfer unit is the record. Several users can access a shared file at the same time, so an important design problem for any file system is determining when changes made to a file by one user become visible to other users.
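
To make the block-level model concrete, here is a minimal Python sketch (not from the original article): reads are served in fixed-size blocks, so a whole block crosses the network even when only a few bytes are requested. The `fetch_block` callable is a hypothetical stand-in for the server RPC.

    BLOCK_SIZE = 4096  # fixed transfer unit, e.g. one virtual-memory page

    def read(path, offset, length, cache, fetch_block):
        # fetch_block(path, block_no) stands in for the RPC to the file server.
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        data = b""
        for block_no in range(first, last + 1):
            key = (path, block_no)
            if key not in cache:          # miss: one whole block is transferred
                cache[key] = fetch_block(path, block_no)
            data += cache[key]            # hit: served from the local cache
        return data[offset - first * BLOCK_SIZE:][:length]

    # Usage with a dummy server that returns zero-filled blocks:
    print(len(read("/srv/f.dat", 5000, 100, {}, lambda p, n: b"\x00" * BLOCK_SIZE)))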

Caching and cache location

Each distributed file system uses its own form of caching.

Reasons for caching:

  1. Better performance: repeated accesses to the same information are served locally, avoiding additional network accesses and disk transfers. This pays off because of the locality in file access patterns.
  2. Better scalability and reliability of the distributed file system, since data cached on a client node offloads the server.

The main decisions to be made in a file caching scheme for a distributed file system:

  1. Cache location.
  2. Modification propagation.
  3. Cache validation.

Cache location refers to where the cached data is kept, assuming the original file resides on the disk of its server. There are several possible cache locations in a distributed file system:

  1. The server's main memory. A cache hit costs one network access. This does not contribute to the scalability and reliability of the system, since every cache hit still requires contacting the server. The advantages of the method are ease of implementation, transparency for clients, and ease of keeping the cache consistent with the source file.
  2. The client's disk. A cache hit costs one disk access, which is somewhat slower than a cache in main memory. The advantages are resilience to failures, since cached data on disk survives a client crash, and large capacity, which contributes to scalability and reliability because a remote access request can often be serviced locally from the cache without contacting the server.
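
As a rough illustration of the client-disk option, the Python sketch below stores fetched blocks as local files, so cached data survives a client restart, unlike a cache in volatile memory. The cache directory and the `fetch` callable are assumptions of the sketch.

    import hashlib
    import os

    CACHE_DIR = "/var/tmp/dfs-cache"  # assumed local cache directory

    def _cache_path(remote_path, block_no):
        # One local file per cached block, keyed by (path, block number).
        digest = hashlib.sha1(f"{remote_path}:{block_no}".encode()).hexdigest()
        return os.path.join(CACHE_DIR, digest)

    def read_block(remote_path, block_no, fetch):
        # A hit costs one local disk access instead of a network round trip.
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = _cache_path(remote_path, block_no)
        if os.path.exists(path):              # hit: no server contact at all
            with open(path, "rb") as f:
                return f.read()
        data = fetch(remote_path, block_no)   # miss: ask the file server
        with open(path, "wb") as f:           # persists across client crashes
            f.write(data)
        return data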

Modification propagation

When the cache is located on client nodes, file data may be cached simultaneously on several of them. The caches can then become inconsistent: one client changes the file data, while the corresponding data cached on other nodes is neither updated nor discarded.

There are two design issues:

  1. When to propagate changes made to cached data to the corresponding file server.
  2. When and how to validate cached data.

The modification propagation scheme used has a critical effect on system performance and reliability.

In the write-through scheme, whenever a cache entry changes, the new value is immediately sent to the server to update the master copy of the file. The advantage of the method is a high degree of reliability and suitability for UNIX-like semantics: the risk of an update being lost in a client failure is very low, because each modification immediately propagates to the server holding the master copy.

The disadvantage is that the scheme pays off only when the ratio of reads to writes is quite large, and it does not reduce network traffic for writes: every write access must wait until the data has been written to the master copy on the server.
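
A minimal Python sketch of the write-through idea, assuming a `server` object that stands in for the RPC layer: every write updates the local cache and is pushed to the master copy before it completes, so a client crash loses at most the write in flight.

    class WriteThroughCache:
        # `server` is any object with read(key) and write(key, value) methods.
        def __init__(self, server):
            self.server = server
            self.cache = {}

        def read(self, key):
            if key not in self.cache:          # miss: fetch from the server
                self.cache[key] = self.server.read(key)
            return self.cache[key]

        def write(self, key, value):
            self.cache[key] = value
            self.server.write(key, value)      # propagate immediately: the
                                               # master copy is always current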

Write-delay scheme

To reduce network traffic for writes, a write-delay (delayed-write) scheme is used: the new value is written only to the cache, and the updated cache entries are sent to the server later.

There are three commonly used write-delay approaches (a sketch combining them follows this list):

  1. Write on eviction. Modified data is sent to the server only when the cache replacement policy decides to evict it from the cache. This can give good performance, but it can hurt reliability, because some data may age in the client cache for a long time before reaching the server.
  2. Periodic write. The cache is scanned periodically, and any cached data modified since the last scan is sent to the server.
  3. Write on close. Modified cached data is sent to the server when the client closes the file. This does little to reduce network traffic for files that are open only briefly or are rarely changed.
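
The Python sketch below combines the three policies in one toy cache; the `server` object and the 30-second flush interval are assumptions, not values from the article.

    import time

    class DelayedWriteCache:
        def __init__(self, server, flush_interval=30.0):
            self.server = server
            self.cache, self.dirty = {}, set()
            self.flush_interval = flush_interval
            self.last_flush = time.monotonic()

        def write(self, key, value):
            self.cache[key] = value      # fast: touches only the local cache
            self.dirty.add(key)
            if time.monotonic() - self.last_flush >= self.flush_interval:
                self.flush()             # periodic-write policy

        def evict(self, key):
            if key in self.dirty:        # write-on-eviction policy
                self.server.write(key, self.cache[key])
                self.dirty.discard(key)
            self.cache.pop(key, None)

        def flush(self):                 # also called when a file is closed
            for key in self.dirty:
                self.server.write(key, self.cache[key])
            self.dirty.clear()
            self.last_flush = time.monotonic()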

Benefits of the write-delay scheme:

  1. Write accesses complete faster, because the new value is written only to the client's cache. This improves performance.
  2. Modified data may be deleted before it is due to be sent to the server (for example, temporary data). Since such modifications never reach the server at all, this significantly improves performance.
  3. Gathering all updates to a file and sending them to the server together is more efficient than sending each update separately.

The drawback of the write-delay scheme is that reliability can still be a problem: changes that have not yet been sent from the client's cache to the server are lost if the client crashes.

Replication as an availability mechanism

High availability is a necessary feature of a good distributed file system, and file replication is the main mechanism for improving file availability.

A replicated file is a file that has several copies, each on a separate server.

The difference between replication and caching:

  1. A file replica is associated with a server, while a cached copy is usually associated with a client.
  2. The existence of a cached copy depends primarily on locality in file access patterns, while the existence of a replica usually depends on availability and performance requirements.
  3. Compared with a cached copy, a replica is more persistent, widely known, secure, available, complete, and accurate.
  4. A cached copy depends on a replica: it remains useful only if it is periodically validated against one.

Replication benefits:

  1. Increased availability. Alternate copies of replicated data can be used when the primary copy is unavailable (see the failover sketch after this list).
  2. Increased reliability. Redundant copies of file data make it possible to recover from catastrophic failures, for example a hard disk crash.
  3. Improved response time. Data can be accessed either locally or from a host whose access time is lower than that of the primary copy.
  4. Reduced network traffic. If a replica of a file is available on a file server residing at the client's node, the access request can be serviced locally, which reduces network traffic.
  5. Improved system throughput. Several client requests for the same file can be served in parallel by different servers, which increases system throughput.
  6. Improved scalability. Because the file is replicated, several servers are available to service client requests.
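
A short Python sketch of the failover idea behind increased availability: `replicas` lists the servers holding copies, and `read_from` stands in for the actual remote read; both are assumptions of the sketch.

    def read_replicated(path, replicas, read_from):
        # Try each server holding a replica, e.g. nearest first.
        last_error = None
        for server in replicas:
            try:
                return read_from(server, path)   # first live replica wins
            except OSError as err:               # server or network failure
                last_error = err
        raise ConnectionError(f"all replicas of {path} failed") from last_error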

Setting up the client when it is disconnected

A common problem with DFS is the message "DFS Distributed File System Client Disconnected". Microsoft has a solution: enable the DFS components on the server, for example on Windows Server 2012 R2.

Algorithm of actions:

  1. Open Server Manager and select "DFS Management" from the "Tools" menu; if it is not listed, add the "DFS Namespaces" feature.
  2. Right-click and select "New Namespace"; the wizard starts.
  3. Specify the host server and give your DFS namespace a name.
  4. Click "Create", and the DFS namespace is ready.
  5. Include shared folders in DFS.
  6. Select the namespace and click "New Folder".
  7. Merge multiple shared folders into a single virtual folder.
  8. You can now see the created path \\Domain_Name\Namespace_Name\Virtual_Folder_Name, which clients access like any other UNC path (see the sketch after this list).
  9. After this, the user will no longer receive the message "the Distributed File System service is not installed".
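
Once the namespace exists, clients address it like an ordinary UNC share. A minimal Python sketch; the path simply mirrors the placeholder names from the steps above.

    import os

    # Placeholder names from the wizard steps above.
    dfs_path = r"\\Domain_Name\Namespace_Name\Virtual_Folder_Name"

    # Which file server actually answers is decided by the DFS referral,
    # not by the client.
    if os.path.isdir(dfs_path):
        print(os.listdir(dfs_path))
    else:
        print("DFS path not reachable; check the namespace and the DFS client")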

Linux Network Sharing System

NFS is the most common file system for sharing network resources, and NFS v2 is its most widespread version. This distributed file system for Linux behaves like a layer on top of the local file system; remote files are accessed through RPC calls. The protocol is stateless, so the client does not track whether the server is accessible or not, and it uses very little file caching. In addition, security is based on trusting the client: the client transmits its own identifier, which the server uses to check access rights to resources.

NFS v3 extended NFS for Unix systems: with 64-bit file sizes and offsets it can handle files larger than 2 GB. Later revisions added stronger Unix security mechanisms, such as Kerberos authentication. NFS v3 remains the version most widely supported on Linux.

Scalable block storage




Ceph is a distributed storage system that provides object, block, and file storage within a single cluster. Ceph places data with the CRUSH algorithm, which lets clients compute where data is stored instead of asking a central directory server.

Ceph's object storage is compatible with the Amazon Simple Storage Service (S3) and OpenStack Swift (REST) interfaces, and it also exposes a native API. The block device client is part of the mainline Linux kernel. At the core of Ceph lies RADOS (Reliable Autonomic Distributed Object Store), a self-managing, self-healing object store on which the other interfaces are built.

The Ceph RADOS block device integrates with OpenStack as storage for virtual machines. Ceph also provides a POSIX-compliant file system, CephFS, built on top of the object store: CephFS keeps its data and metadata as objects in the Ceph cluster.
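
Because the object interface is S3-compatible, a standard S3 client can talk to a Ceph RADOS Gateway. A minimal sketch using boto3; the endpoint URL, access keys, and bucket name are placeholders, not values from the article.

    import boto3

    # Placeholders: point these at a real RADOS Gateway and its keys.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:7480",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    s3.create_bucket(Bucket="demo")   # the bucket lives in the Ceph cluster
    s3.put_object(Bucket="demo", Key="hello.txt", Body=b"stored in RADOS")
    obj = s3.get_object(Bucket="demo", Key="hello.txt")
    print(obj["Body"].read())         # b'stored in RADOS'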

Benefits of a Distributed File System

Technically, DFS provides access to a shared directory that contains no files, only junctions and, optionally, subdirectories with further junctions. Junctions are similar to the soft links known from Unix file systems, but they refer to shared directories and may point to shares on other servers. Clients first ask the DFS server for a referral, then access the file server that the referral points to.

The main purpose of the DFS distributed file system is to create an alternative namespace (a view of the directory tree) that hides the details of the underlying infrastructure from users. The paths that users see, called DFS names, do not change when servers are renamed or when some directories are moved to another server.

Administrators can simply repoint an obsolete name at a new target. A name can also point to more than one target, giving clients several alternative referrals to different shared folders. Clients of the distributed file system can then access any of the targets, which provides load balancing and automatic failover if one of the servers goes down.
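
The referral mechanism can be modeled as a lookup table from logical names to lists of targets. In the Python sketch below (server names are made up), repointing a name means editing the table, and failover between targets stays invisible to clients.

    # One logical DFS name maps to several physical targets.
    namespace = {
        r"\\corp\public\docs": [r"\\fs1\docs", r"\\fs2\docs"],
    }

    def resolve(dfs_name, is_up):
        # Return the first reachable target for a DFS name.
        for target in namespace[dfs_name]:
            if is_up(target):      # failover: skip targets that are down
                return target
        raise ConnectionError(f"no target for {dfs_name} is reachable")

    # Example: pretend fs1 is down, so the client is referred to fs2.
    print(resolve(r"\\corp\public\docs", is_up=lambda t: "fs2" in t))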

Thanks to DFS, there is no longer a rigid binding to a particular server or share. Storage is presented as one large pool, behind which the individual file systems are hidden from the user. This is a genuinely useful tool for meeting growing demands: disk space on new servers can be allocated to the file system according to availability requirements.

A technology like Windows DFS benefits any company, large or small. For large companies, the more flexible use of storage resources pays off: since all disks are part of one virtual pool, there are no longer unused or overfull disks and arrays.

Smaller companies, in turn, appreciate the standardized administration: with limited resources, it is hard to keep track of full servers, upgrade them to larger disks in time, and allocate space between applications.

DFS presents storage space the way users and applications want to see it, rather than the way it physically exists. And since both the server and the client components are an integral part of the Windows operating system, installation and configuration require little effort from the administrator and hardly affect users' work.

The developers have integrated comprehensive management of Windows DFS: the console is a single point of management for several DFS roots, graphical tools make viewing and monitoring easier, and management is possible even from remote sites.



