Commit 67218e2e authored by Cristiano Urban's avatar Cristiano Urban

Removed README.txt and TODO.txt, added README.md.

parent 46f889ed

README.md

## VOSpace backend

### Introduction

This repository hosts the code of the VOSpace backend.
This is a dockerized version, including all the VOSpace components, that can be run directly on your laptop.

The VOSpace implementation is composed of several parts, each hosted in one of the [following repositories](https://www.ict.inaf.it/gitlab/vospace).

For a production-like demo, please refer to the [vospace-demo](https://www.ict.inaf.it/gitlab/vospace/vospace-demo) repository or simply visit [this page](http://staging.ia2.inaf.it/) and try it out.

For more information about the VOSpace specification, please refer to:
- [IVOA Documents & Standards](https://www.ivoa.net/documents/)
- [VOSpace standard v2.1](https://www.ivoa.net/documents/VOSpace/20180620/REC-VOSpace-2.1.html)
- [Universal Worker Service Pattern v1.1](https://www.ivoa.net/documents/UWS/20161024/REC-UWS-1.1-20161024.html)

Further documentation on the VOSpace implementation can be found [here](https://redmine.ict.inaf.it/projects/401/wiki).

### Main features

- Recursive scan, checksum calculation and .tar generation for data provided by users
- Database interaction for storing and retrieving information about VOSpace nodes, jobs, storage points and users
- Simple FCFS (First Come First Served) job scheduling based on Redis lists
- A set of command-line tools that simplify the administrator's interaction with the backend architecture


### Getting started

First of all, clone the repository on your local Linux machine, open a terminal and move into the *vospace-transfer-service* folder.
You can launch the whole environment by running the following commands (as a regular, non-root user):

```
docker-compose pull
docker-compose up
```
The web interface will be available in your browser at http://localhost:8080/ once all the containers are up and running.

To stop the environment and perform a cleanup, launch the following commands from another shell:

```
docker-compose down
docker system prune -a   # removes ALL unused images on the host, not only the VOSpace ones
docker volume prune      # removes ALL unused volumes
```

### Components

- Client (container_name: client): provides the user with command-line tools to interact with the backend
- Transfer service (container_name: transfer_service): the core of the backend architecture
- RabbitMQ (container_name: rabbitmq): an AMQP broker that delivers messages carrying requests from the user command-line tools and from the VOSpace REST APIs
- Redis (container_name: job_cache): used as a cache for the job queues
- File catalog (container_name: file_catalog), now available [here](https://www.ict.inaf.it/gitlab/vospace/vospace-file-catalog): a PostgreSQL database that stores information on VOSpace nodes, as well as on storage locations, jobs and users


#### Client

The *client* container provides the following command line tools:

- **vos_data**: launches a job that automatically stores data provided by the user on a given storage point (hot or cold)
- **vos_import**: imports VOSpace nodes into the file catalog for data already stored on a given storage point
- **vos_job**: provides information about jobs
- **vos_storage**: adds, lists and removes storage points

You can launch each of these commands without arguments to see its help page.
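
For example, from the host you can invoke a tool directly inside the *client* container (a hypothetical transcript; only the container name *client* and the tool names listed above are taken from this README):

```
# Print the help page of vos_data without opening an interactive shell:
docker exec client vos_data

# Or open a shell in the client container and explore the tools from there:
docker exec -it client bash
```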

#### Transfer service

The transfer service is the core of the VOSpace backend architecture.

You can access the *transfer_service* container with:
```
docker exec -it transfer_service bash
```
On this container, hosted on the so-called transfer node, each user has a home folder containing two subfolders that act, respectively, as the entry and exit point for the user's data:
- */home/name.surname/store*
- */home/name.surname/retrieve*

The user copies the data to be stored into the *store* folder and finds the requested data in the *retrieve* folder.
This use case was implemented to support users providing huge amounts of data, on the order of terabytes.
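
A typical round trip might look like the sketch below (the username *name.surname* is the placeholder used above, and the dataset path is hypothetical; the `hstore` request type comes from the *vos_data* help described later in this repository):

```
# On the transfer node: put the data to be archived into the store folder
cp -r /data/my_dataset /home/name.surname/store/

# From the client container: start a hot-storage job for that user
vos_data hstore name.surname

# Later, once a retrieval request completes, the data shows up here:
ls /home/name.surname/retrieve/
```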

#### RabbitMQ

You can access the RabbitMQ web interface via browser in two steps.

1. Find the IP address of the RabbitMQ broker:
```
docker network inspect vospace-transfer-service_backend_net | grep -i -A 3 rabbitmq
```
2. Open your browser and point it to http://<IPv4Address>:15672 (user: guest, password: guest)
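
Alternatively, Docker can report the address directly; this one-liner only assumes the container name *rabbitmq* listed in the Components section:

```
# Print the IPv4 address of the rabbitmq container and build the management URL:
ip=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' rabbitmq)
echo "http://${ip}:15672"
```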


#### Redis

You can access the Redis server from the **client** container by following the steps below.

1. Execute an interactive bash shell on the client container:
```
docker exec -it client bash
```

2. Use *redis-cli* command to connect to Redis:
```
redis-cli -h job_cache
```

3. You can obtain some info about the jobs by searching them on the following queues:
   - For write operations the queues are *write_pending*, *write_ready* and *write_terminated*
   - For read operations the queues are *read_pending*, *read_ready* and *read_terminated*.

Example: list the first six elements of the *write_ready* queue:
```
redis:6379> lrange write_ready 0 5
```
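
Each queue element is a JSON document describing the job, so you can pretty-print it from the *client* container's shell; the queue and host names come from the steps above, while piping through `python3 -m json.tool` assumes Python 3 is available in the container:

```
# Fetch the first job on the write_ready queue and pretty-print its JSON:
redis-cli -h job_cache lindex write_ready 0 | python3 -m json.tool
```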

#### File catalog

You can access the file catalog from the **client** container by following the steps below.

1. Execute an interactive bash shell on the client container:
```
docker exec -it client bash
```
2. Access the database via *psql* client:
```
psql -h file_catalog -d vospace_testdb -U postgres
```
3. You can now run a query, for example listing some fields of all the tuples in the **node** table:
```
vospace_testdb=# SELECT node_id, path, name, parent_path, type, owner_id, content_md5, async_trans, sticky FROM node;
```

You can also run any query on the other tables: *deleted_node*, *storage*, *location*, *job* and *users*.
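
Queries can also be run non-interactively with `psql -c`, which is convenient for scripting (same connection parameters as above; since this README does not list the columns of the *job* and *users* tables, the sketch below sticks to `SELECT *` and `COUNT(*)`):

```
# Count the registered users and dump the job table without entering psql:
psql -h file_catalog -d vospace_testdb -U postgres -c "SELECT COUNT(*) FROM users;"
psql -h file_catalog -d vospace_testdb -U postgres -c "SELECT * FROM job;"
```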

README.txt

Simple communication test that involves 5 docker containers:
- client (container_name: client, commands available: 'vos_data')
- server (container_name: transfer_service)
- RabbitMQ (container_name: rabbitmq)
- Redis (container_name: redis)
- File catalog (container_name: file_catalog), now available here: 
  https://www.ict.inaf.it/gitlab/vospace/vospace-file-catalog

In addition to these containers, Sonia Zorba modified 'docker-compose.yml' by adding REST, file service and ui portions.
The images used for this purpose are:
- git.ia2.inaf.it:5050/vospace/vospace-rest
- git.ia2.inaf.it:5050/vospace/vospace-file-service
- git.ia2.inaf.it:5050/vospace/vospace-ui

The web interface is available on your browser at http://localhost:8080/ when all the containers are up and 
running (read the section here below).
  
###############################################################################################################

You can start the whole environment from the 'vos-ts' directory with:
$ docker-compose up

Once all the containers are up and running, open another shell and access the 'client' container:
$ docker exec -it client /bin/bash

Now you can launch the 'vos_data' command.
Launching the client without any argument will show you how to use it:

client@28970a09202d:~$ vos_data

NAME
       vos_data

SYNOPSYS
       vos_data COMMAND USERNAME

DESCRIPTION
       The purpose of this client application is to notify to the VOSpace backend that
       data is ready to be saved somewhere.
       
       The client accepts only one (mandatory) command at a time.
       A list of supported commands is shown here below:

       cstore
              performs a 'cold storage' request, data will be saved on tape

       hstore
              performs a 'hot storage' request, data will be saved to disk

       The client also needs to know the username associated to a storage request process.
       The username must be the same used for accessing the transfer node.

       
For example, if we want to perform a 'cold storage' request for the 'curban' user, we do:
client@28970a09202d:~$ vos_data cstore curban

Choose one of the following storage locations:

----------------------------------------------------------------------
[*] storage_id: 1    =>   hostname: tape-fe.ia2.inaf.it
----------------------------------------------------------------------

Please, insert a storage id: 1

!!!!!!!!!!!!!!!!!!!!!!!!!!WARNING!!!!!!!!!!!!!!!!!!!!!!!!!!!
If you confirm, all your data on the transfer node will be
available in read-only mode for all the time the storage
process is running.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Are you sure to proceed? [yes/no]: yes

JobID: c63697eafbf711eaa44d0242ac1c0008
Storage process started successfully!

client@28970a09202d:~$


After receiving this request the application will:
1) Create a job object, insert it into the job table of the file catalog database and push a copy into a 
   'pending' queue stored in Redis for scheduling purposes
2) Scan the content of '/home/curban/store/' to find crowded 'leaf' dirs and replace them with an
   uncompressed tar, according to some constraints defined in the global configuration file
3) Re-scan the folder, move the content into a temporary folder if needed and perform recursive MD5 checksum
4) Re-scan the folder for the last time in order to obtain the final directory structure
5) Insert information about files and folders into the Node table of the file catalog, according to the VOSpace
   specification
6) Move the job from the 'write_pending' queue to the 'write_ready' queue in Redis, if all the previous steps 
   succeeded.
7) Obtain the physical paths from the VOSpace paths of the nodes and copy all the data to the right destination
   according to the information previously inserted by the user
8) Clean up the '/home/curban/store/' directory (remove data and set the right permissions) and update the
   database (the async_trans flag is set to 'true').

   
You can also import nodes on the VOSpace file catalog from data already stored somewhere.
For example, suppose we have a hot storage mounted on /mnt/hot_storage/users and visible from the transfer node.
Our user folder will be, for example, /mnt/hot_storage/users/curban.

On the transfer node you will find a directory called 'test_import' containing some data to be used for an import
test.

First of all, launch vos_import without any argument in order to see how to use it:

client@28970a09202d:~$ vos_import 

NAME
       vos_import

SYNOPSYS
       vos_import DIR_PATH USERNAME

DESCRIPTION
       This tool recursively imports nodes on the VOSpace file catalog.
       
       Two parameters are required:

       DIR_PATH:
           the physical absolute path of a directory located within the 
           user directory for a given mount point.
           
       USERNAME:
           the username used for accessing the transfer node.
           
EXAMPLE
      The following command will import recursively all the nodes contained
      in 'mydir' on the VOSpace for the 'jsmith' user:
      
      # vos_import /mnt/storage/users/jsmith/mydir jsmith   
    
client@28970a09202d:~$

Now, launch the import command to import the 'test_import' directory:

client@28970a09202d:~$ vos_import /mnt/hot_storage/users/curban/test_import curban

Import procedure completed!

client@28970a09202d:~$

This kind of operation works only for directories located at the first level of your user folder.


###############################################################################################################
     
You can access the rabbitmq web interface via browser:
    1) Find the IP address of the RabbitMQ broker:
    $ docker network inspect vos-ts_backend_net | grep -i -A 3 rabbitmq
    2) Open your browser and point it to http://IP_ADDRESS:15672 (user: guest, password: guest)

You can access the redis server from the 'client' container:
    1) Use redis-cli to connect to redis:
    $ redis-cli -h redis
    2) You can obtain some info about the jobs by searching them on the 'write_pending' and 'write_ready' queues 
       using the lrange command. For example, a few seconds after launching three jobs with 'dataArchiverCli.py',
       you should be able to see an output similar to the following one:
    redis:6379[2]> lrange write_ready 0 5
    1) "{\"jobId\": \"56577c8645da11ebbbfe356e379843eb\", \"jobType\": \"other\", \"ownerId\": \"2386\", \"phase\": \"PENDING\", 
    \"quote\": null, \"startTime\": null, \"endTime\": null, \"executionDuration\": null, \"destruction\": null, \"parameters\": null, 
    \"results\": null, \"jobInfo\": {\"requestType\": \"HSTORE\", \"userName\": \"szorba\"}}"
    2) "{\"jobId\": \"53d2f2a545da11ebb7bd356e379843eb\", \"jobType\": \"other\", \"ownerId\": \"2048\", \"phase\": \"PENDING\", 
    \"quote\": null, \"startTime\": null, \"endTime\": null, \"executionDuration\": null, \"destruction\": null, \"parameters\": null, 
    \"results\": null, \"jobInfo\": {\"requestType\": \"CSTORE\", \"userName\": \"sbertocco\"}}"
    3) "{\"jobId\": \"502afdca45da11eb9676356e379843eb\", \"jobType\": \"other\", \"ownerId\": \"3354\", \"phase\": \"PENDING\", 
    \"quote\": null, \"startTime\": null, \"endTime\": null, \"executionDuration\": null, \"destruction\": null, \"parameters\": null, 
    \"results\": null, \"jobInfo\": {\"requestType\": \"CSTORE\", \"userName\": \"curban\"}}"
            
You can access the file catalog from the 'client' container:
    1) Access the db via psql client:
    $ psql -h file_catalog -d vospace_testdb -U postgres
    2) You can now perform a query, for example show all the tuples of the Node table displaying some fields:
    vospace_testdb=# SELECT node_id, parent_path, path, name, type, owner_id, creator_id, content_MD5 FROM Node;
    
The default output of the query after the container initialization should be something like this:
 
vospace_testdb=# SELECT node_id, parent_path, path, name, tstamp_wrapper_dir, type, owner_id, creator_id, content_MD5 FROM node;
 node_id | parent_path |  path   |    name    | tstamp_wrapper_dir |   type    | owner_id | creator_id | content_md5 
---------+-------------+---------+------------+--------------------+-----------+----------+------------+-------------
       1 |             |         |            |                    | container | 0        | 0          | 
       2 |             | 2       | curban     |                    | container | 3354     | 3354       | 
       3 |             | 3       | sbertocco  |                    | container | 2048     | 2048       | 
       4 |             | 4       | szorba     |                    | container | 2386     | 2386       | 
       5 |             | 5       | test       |                    | container | 2386     | 2386       | 
       6 | 5           | 5.6     | f1         |                    | container | 2386     | 2386       | 
       7 | 5.6         | 5.6.7   | f2_renamed |                    | container | 2386     | 2386       | 
       8 | 5.6.7       | 5.6.7.8 | f3         |                    | data      | 2386     | 2386       | 
(8 rows)
       
A few seconds after launching three jobs with 'dataArchiverCli.py', the database will be populated and, launching the previous
SQL query, you will see an output like the one here below:       

vospace_testdb=# SELECT node_id, parent_path, path, name, tstamp_wrapper_dir, type, owner_id, creator_id, content_MD5 FROM node;
 node_id | parent_path |    path    |       name       | tstamp_wrapper_dir  |   type    | owner_id | creator_id |           content_md5          
  
---------+-------------+------------+------------------+---------------------+-----------+----------+------------+----------------------------------
       1 |             |            |                  |                     | container | 0        | 0          | 
       2 |             | 2          | curban           |                     | container | 3354     | 3354       | 
       3 |             | 3          | sbertocco        |                     | container | 2048     | 2048       | 
       4 |             | 4          | szorba           |                     | container | 2386     | 2386       | 
       5 |             | 5          | test             |                     | container | 2386     | 2386       | 
       6 | 5           | 5.6        | f1               |                     | container | 2386     | 2386       | 
       7 | 5.6         | 5.6.7      | f2_renamed       |                     | container | 2386     | 2386       | 
       8 | 5.6.7       | 5.6.7.8    | f3               |                     | data      | 2386     | 2386       | 
       9 | 2           | 2.9        | mydir            | 2021_01_12-14_48_07 | container | 3354     | 3354       | 
      10 | 2           | 2.10       | foo2.txt         | 2021_01_12-14_48_07 | data      | 3354     | 3354       | e07f37a6bfe96ad66e408380a5e3a899
      11 | 2.9         | 2.9.11     | another_foo2.txt | 2021_01_12-14_48_07 | data      | 3354     | 3354       | e048e5108d71191158b50052d531b0ca
      12 | 3           | 3.12       | foo4.txt         | 2021_01_12-14_48_22 | data      | 2048     | 2048       | 5f429d803340bb7748c52b3931ed54cf
      13 | 4           | 4.13       | aaa              |                     | container | 2386     | 2386       | 
      14 | 4.13        | 4.13.14    | bbb              |                     | container | 2386     | 2386       | 
      15 | 4.13.14     | 4.13.14.15 | foo5.txt         |                     | data      | 2386     | 2386       | 262214d5cde30a74997199fb4e220a26
(15 rows)
 
 
 From the file catalog database you can also obtain information about jobs, according to the UWS specification. 
 Just try the following query:
 vospace_testdb=# SELECT * FROM job;
 
 ###############################################################################################################
 
 Stop the whole environment:
 $ docker-compose down
 
 Cleanup:
 $ docker image prune -a
 $ docker volume prune

TODO.txt

- Paths in config file

vospace_path_prefix = /vos

[transfer_node]
base_path = /home/{username}/store

[servers]
hostname =
base_path = /home/{username}

[tape]
frontend = 
base_path = /home/users/{username}

- We should keep {username} coherent across the different storage points
  ({username} should be the same in all places)

- Temporary dir with timestamp => flag?

- Add hostname parameter to dataArchiver.py

- How to scale the system: multiple queues (more complex scheduling) + FSM
- If we have more than one tape library, do we need an entry on the configuration file for each one?
  Does spectrum archive send data to the right tape according to a defined policy?
- And how many IA2 servers?