Small updates to the jupyter notebook (457e8eef) · Commits · odmc / hdf5_example

hdf5_example.ipynb

+138 −6

		%% Cell type:markdown id: tags:

		## HDF5 Datasets

		The following examples show how to create datasets in HDF5 (with h5py) and update their values.

		### First dataset

		%% Cell type:code id: tags:

		``` python
		import h5py

		from timeit import timeit # To measure execution time
		import numpy as np # this is the main python numerical library

		f = h5py.File("testdata.hdf5",'w')

		# We create a test 2-d array filled with 1s and with 10 rows and 6 columns
		data = np.ones((10, 6))

		f["dataset_one"] = data

		# We now retrieve the dataset from file (it is still in memory in fact)
		dset = f["dataset_one"]
		```

		%% Output

		<frozen importlib._bootstrap>:491: RuntimeWarning: The global interpreter lock (GIL) has been enabled to load module 'h5py._errors', which has not declared that it can run safely without the GIL. To override this behavior and keep the GIL disabled (at your own risk), run with PYTHON_GIL=0 or -Xgil=0.

		%% Cell type:code id: tags:

		``` python
		#The following instructions show some dataset metadata
		print(dset)
		print(dset.dtype)
		print(dset.shape)
		```

		%% Output

		<HDF5 dataset "dataset_one": shape (10, 6), type "<f8">
		float64
		(10, 6)
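
%% Cell type:markdown id: tags:

The cells above keep the file handle `f` open across several cells. As a minimal sketch of the same steps in self-contained form, the block below uses a `with` block so that the file is closed automatically. The file name `sketch_first.hdf5` and the in-memory `core` driver (used here so that nothing is written to disk) are assumptions made for illustration and are not part of the original notebook.

%% Cell type:code id: tags:

```python
import h5py
import numpy as np

# Assumption: an in-memory file (core driver, no backing store), so no file is left on disk.
with h5py.File("sketch_first.hdf5", "w", driver="core", backing_store=False) as f:
    # Assigning a NumPy array to a name creates a dataset with matching shape and dtype
    f["dataset_one"] = np.ones((10, 6))
    dset = f["dataset_one"]

    # Copy out what we need while the file is still open
    shape, dtype = dset.shape, dset.dtype
    total = float(dset[...].sum())

print(shape, dtype, total)
```

Values are copied out inside the `with` block because a dataset object can no longer be read once its file has been closed.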

		%% Cell type:markdown id: tags:

		### Dataset slicing

		With h5py, datasets support slicing operations analogous to those of NumPy arrays. h5py translates each selection into a portion of the dataset, and HDF5 then reads only that portion from "disk". Slicing into a dataset object returns a NumPy array.

		%% Cell type:code id: tags:

		``` python
		# The ellipsis means "as many ':' as needed"
		# here we use it to get a numpy array of the
		# entire dataset
		out = dset[...]

		print(out)
		type(out)
		```

		%% Output

		[[1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1.]]

		numpy.ndarray

		%% Cell type:code id: tags:

		``` python
		dset[1:5, 1] = 0.0
		dset[...]

		# but we cannot use negative steps with a dataset
		try:
		    dset[0,::-1]
		except Exception:
		    print('No no no!')
		```

		%% Output

		No no no!
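
%% Cell type:markdown id: tags:

Since slicing a dataset returns a NumPy array, one possible workaround for the missing negative steps is to read the data with a forward slice and reverse it in memory. The sketch below assumes an in-memory file and a small hypothetical dataset named `d`.

%% Cell type:code id: tags:

```python
import h5py
import numpy as np

# Assumption: in-memory file and a small illustrative dataset "d"
with h5py.File("sketch_slicing.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("d", data=np.arange(12.0).reshape(3, 4))

    row = dset[0, :]          # forward slice, read by HDF5, returned as numpy.ndarray
    reversed_row = row[::-1]  # negative step applied by NumPy, in memory

print(reversed_row)
```

This reads the whole forward selection first, so it is only a reasonable choice when that selection fits comfortably in memory.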

		%% Cell type:code id: tags:

		``` python
		# random 2-d array with values uniformly distributed in [-1, 1)
		data = np.random.rand(15, 10)*2 - 1

		dset = f.create_dataset('random', data=data)

		# print the first 5 even rows and the first two columns
		out = dset[0:10:2, :2]
		print(out)

		# clipping to zero all negative values, using boolean indexing
		dset[data<0] = 0
		```

		%% Output

		[[ 0.76179305 -0.20226613]
		[-0.44336689 0.76110537]
		[-0.53034403 0.96049982]
		[-0.73538148 -0.20414973]
		[ 0.79388089 0.10900864]]
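
%% Cell type:markdown id: tags:

In the cell above the boolean mask is computed from the in-memory array `data`, which happens to hold the same values as the dataset. When only the file is available, the values have to be read first. The following is a minimal sketch of that variant, assuming an in-memory file and a fixed random seed chosen only to make the example repeatable.

%% Cell type:code id: tags:

```python
import h5py
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for repeatability of the sketch

# Assumption: in-memory file, nothing is written to disk
with h5py.File("sketch_mask.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("random", data=rng.uniform(-1, 1, size=(15, 10)))

    values = dset[...]      # read the current content into a NumPy array
    dset[values < 0] = 0    # boolean mask with the same shape as the dataset
    clipped = dset[...]     # read back the updated content

print(clipped.min())
```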

		%% Cell type:markdown id: tags:

		### Resizable datasets

		If we do not know the dataset size in advance and need to append new data several times, we have to create a resizable dataset (by passing maxshape) and then grow it with resize before each append.

		%% Cell type:code id: tags:

		``` python
		dset = f.create_dataset('dataset_two', (1,1000), dtype=np.float32,
		                        maxshape=(None, 1000))

		a = np.ones((1000,1000))

		num_rows = dset.shape[0]
		dset.resize((num_rows+a.shape[0], 1000))

		dset[num_rows:] = a

		print(dset[1:5,:20])
		print(dset[0:5,:20])

		f.close()
		```

		%% Output

		[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
		[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
		[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
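
%% Cell type:markdown id: tags:

The resize-then-write steps above can be wrapped in a small helper when data arrives in several batches. The sketch below is one possible way to do it: the helper name `append_rows`, the column count and the in-memory file are hypothetical. It starts from zero rows, which avoids the zero-filled first row visible in the output above.

%% Cell type:code id: tags:

```python
import h5py
import numpy as np

def append_rows(dset, rows):
    """Grow dset along axis 0 and write rows at the end (hypothetical helper)."""
    start = dset.shape[0]
    dset.resize(start + rows.shape[0], axis=0)  # only axis 0 is changed
    dset[start:] = rows
    return dset.shape[0]

# Assumption: in-memory file and a narrow dataset with 4 columns
with h5py.File("sketch_append.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("dataset_two", shape=(0, 4), maxshape=(None, 4),
                            dtype=np.float32)
    append_rows(dset, np.ones((2, 4)))
    n_rows = append_rows(dset, np.full((3, 4), 2.0))
    content = dset[...]

print(n_rows)
print(content)
```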

		%% Cell type:markdown id: tags:

		## Groups

		We can create nested groups with a single instruction. For instance, to create the group 'nisp_frame', its subgroup 'detectors' and finally the child group 'det11', we can use the instruction below.

		%% Cell type:code id: tags:

		``` python
		f = h5py.File("nisp_frame.hdf5",'w')

		grp = f.create_group('nisp_frame/detectors/det11')
		# store the detector science image as a float32 dataset
		grp['sci_image'] = np.zeros((2040,2040), dtype=np.float32)

		print(grp.name) # the group name property
		print(grp.parent) # the parent group property
		print(grp.file) # the file property
		print(grp) # prints some group information. It has one member, the dataset
		```

		%% Output

		/nisp_frame/detectors/det11
		<HDF5 group "/nisp_frame/detectors" (1 members)>
		<HDF5 file "nisp_frame.hdf5" (mode r+)>
		<HDF5 group "/nisp_frame/detectors/det11" (1 members)>
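
%% Cell type:markdown id: tags:

To check what a nested `create_group` call actually produced, the hierarchy can be walked with `visit`. The sketch below also uses `require_group`, which returns the group if it already exists instead of raising an error. The in-memory file and the small 8x8 image are assumptions made to keep the example light.

%% Cell type:code id: tags:

```python
import h5py
import numpy as np

# Assumption: in-memory file and a tiny stand-in for the science image
with h5py.File("sketch_groups.hdf5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("nisp_frame/detectors/det11")
    grp.create_dataset("sci_image", shape=(8, 8), dtype=np.float32)

    # require_group opens the existing group rather than failing
    same = f.require_group("nisp_frame/detectors/det11")
    same_name = same.name

    # visit calls the function with the path of every object below the root
    names = []
    f.visit(names.append)

print(same_name)
print(sorted(names))
```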

		%% Cell type:markdown id: tags:

		## Attributes

		Attributes can be attached to a group or to a dataset. Both expose the .attrs property, which is used to read existing attributes and to define new ones. With h5py, the attribute type is inferred from the passed value, but it is also possible to assign a type explicitly.

		%% Cell type:code id: tags:

		``` python
		grp = f['nisp_frame']
		grp.attrs['telescope'] = 'Euclid'
		grp.attrs['instrument'] = 'NISP'
		grp.attrs['pointing'] = np.array([8.48223045516, -20.4610801911, 64.8793517547])
		grp.attrs.create('detector_id', '11', dtype="|S2")

		print(grp.attrs['pointing'])
		print(grp.attrs['detector_id'])
		```

		%% Output

		[ 8.48223046 -20.46108019 64.87935175]
		b'11'

		%% Cell type:code id: tags:

		``` python
		f.close()
		!h5ls -vlr nisp_frame.hdf5
		```

		%% Output

		Opened "nisp_frame.hdf5" with sec2 driver.
		/ Group
		Location: 1:96
		Links: 1
		/nisp_frame Group
		Attribute: detector_id scalar
		Type: 2-byte null-padded ASCII string
		Attribute: instrument scalar
		Type: variable-length null-terminated UTF-8 string
		Attribute: pointing {3}
		Type: native double
		Attribute: telescope scalar
		Type: variable-length null-terminated UTF-8 string
		Location: 1:800
		Links: 1
		/nisp_frame/detectors Group
		Location: 1:1832
		Links: 1
		/nisp_frame/detectors/det11 Group
		Location: 1:2864
		Links: 1
		/nisp_frame/detectors/det11/sci_image Dataset {2040/2040, 2040/2040}
		Location: 1:3896
		Links: 1
		Storage: 16646400 logical bytes, 16646400 allocated bytes, 100.00% utilization
		Type: native float
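
%% Cell type:markdown id: tags:

The `h5ls` listing shows that the inferred string attributes and the explicitly typed `detector_id` are stored differently. As a small sketch of what this means when reading them back from Python, the block below recreates a few of the attributes in an in-memory file (an assumption, so that the example does not depend on `nisp_frame.hdf5`).

%% Cell type:code id: tags:

```python
import h5py
import numpy as np

# Assumption: in-memory file with a subset of the attributes used above
with h5py.File("sketch_attrs.hdf5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("nisp_frame")
    grp.attrs["telescope"] = "Euclid"                     # type inferred from the value
    grp.attrs["pointing"] = np.array([8.48, -20.46, 64.88])
    grp.attrs.create("detector_id", "11", dtype="S2")     # explicit fixed-length type

    keys = sorted(grp.attrs.keys())
    telescope = grp.attrs["telescope"]
    pointing = grp.attrs["pointing"]
    # the fixed-length ASCII attribute comes back as bytes and can be decoded
    detector_id = grp.attrs["detector_id"].decode("ascii")

print(keys)
print(telescope, detector_id, pointing)
```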

		%% Cell type:markdown id: tags:

		## Tables (compound types)

		Tables can be stored as datasets where the elements (rows) have the same compound type.

		%% Cell type:code id: tags:

		``` python
		f = h5py.File("nisp_frame.hdf5",'a')
		dt = np.dtype([('source_id', np.uint32), ('ra', np.float32), ('dec', np.float32), ('magnitude', np.float64)])

		grp = f.create_group('source_catalog/det11')
		dset = grp.create_dataset('star_catalog', (100,), dtype=dt)

		dset['source_id', 0] = 1
		print(dset['source_id', 'ra', :20])
		print(dset[0])
		```

		%% Output

		[(1, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.)
		(0, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.) (0, 0.)
		(0, 0.) (0, 0.)]
		(1, 0.0, 0.0, 0.0)
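
%% Cell type:markdown id: tags:

Instead of setting one field of one row at a time, several rows can be prepared as a NumPy structured array with the same compound type and written in one assignment. The sketch below assumes an in-memory file and made-up catalog values.

%% Cell type:code id: tags:

```python
import h5py
import numpy as np

dt = np.dtype([('source_id', np.uint32), ('ra', np.float32),
               ('dec', np.float32), ('magnitude', np.float64)])

# Illustrative values only, not real catalog data
rows = np.zeros(3, dtype=dt)
rows['source_id'] = [1, 2, 3]
rows['ra'] = [10.5, 11.0, 11.5]
rows['dec'] = [-20.0, -20.5, -21.0]
rows['magnitude'] = [18.0, 19.5, 21.0]

# Assumption: in-memory file, nothing is written to disk
with h5py.File("sketch_table.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset('star_catalog', (100,), dtype=dt)
    dset[:3] = rows                        # write three complete rows at once

    first = dset[0]                        # one row, all fields
    ids = dset[:3]['source_id'].tolist()   # read three rows, then pick one field

print(first)
print(ids)
```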

		%% Cell type:markdown id: tags:

		## References and Region references

		The following cell creates a reference from the detector 11 scientific image to the corresponding star catalog, which is stored in the same file.

		%% Cell type:code id: tags:

		``` python
		sci_image = f['/nisp_frame/detectors/det11/sci_image']
		sci_image.attrs['star_catalog'] = dset.ref
		cat_ref = sci_image.attrs['star_catalog']

		print(cat_ref)
		dset = f[cat_ref]
		print(dset[0])
		dt = h5py.ref_dtype
		# the above data type dt can be used to create a dataset of references or just an attribute
		```

		%% Output

		<HDF5 object reference>
		(1, 0.0, 0.0, 0.0)
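
%% Cell type:markdown id: tags:

As a sketch of the dataset-of-references case mentioned in the comment above, the block below stores two object references in a dataset created with `h5py.ref_dtype` and then dereferences them through the file. The dataset names `a`, `b` and `refs` and the in-memory file are hypothetical.

%% Cell type:code id: tags:

```python
import h5py
import numpy as np

# Assumption: in-memory file with two small target datasets
with h5py.File("sketch_refs.hdf5", "w", driver="core", backing_store=False) as f:
    a = f.create_dataset("a", data=np.arange(3))
    b = f.create_dataset("b", data=np.arange(5))

    # a 1-d dataset whose elements are object references
    refs = f.create_dataset("refs", (2,), dtype=h5py.ref_dtype)
    refs[0] = a.ref
    refs[1] = b.ref

    # indexing the file with a reference returns the referenced object
    names = [f[refs[i]].name for i in range(2)]
    sizes = [f[refs[i]].shape[0] for i in range(2)]

print(names, sizes)
```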

		%% Cell type:code id: tags:

		``` python
		roi = sci_image.regionref[15:20, 36:78]
		print(sci_image[roi])
		f.close()
		```

		%% Output

		[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
		0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
		[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
		0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
		[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
		0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
		[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
		0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
		[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
		0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
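
%% Cell type:markdown id: tags:

With an all-zero image it is hard to see which pixels a region reference selects. The sketch below repeats the same two steps on a small dataset with distinct values, and also shows that indexing the file with the region reference returns the dataset it points into. The dataset name `image` and the in-memory file are assumptions.

%% Cell type:code id: tags:

```python
import h5py
import numpy as np

data = np.arange(20).reshape(4, 5)

# Assumption: in-memory file and a small stand-in for the science image
with h5py.File("sketch_regionref.hdf5", "w", driver="core", backing_store=False) as f:
    image = f.create_dataset("image", data=data)

    roi = image.regionref[1:3, 2:5]   # reference to rows 1-2, columns 2-4
    subset = image[roi]               # data selected by the region reference
    target_name = f[roi].name         # dataset the reference points into

print(target_name)
print(subset)
```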

		%% Cell type:markdown id: tags:

		## Chunking

		%% Cell type:code id: tags:

		``` python
		rdata = np.random.randint(0,2**16,(100,2048,2048), dtype=np.uint16)

		f = h5py.File('image_sequence.hdf5','w')
		dset = f.create_dataset('nochunk', data=rdata)
		f.flush()
		f.close()

		f = h5py.File('image_sequence_chunked.hdf5','w')
		dset = f.create_dataset('chunked', data=rdata, chunks=(100,64,64))
		f.flush()
		f.close()

		f = h5py.File('image_sequence.hdf5','r')
		dset = f['nochunk']
		```

		%% Cell type:code id: tags:

		``` python
		%%time
		for i in range(32):
		    for j in range(32):
		        # read a 64x64 block across all 100 frames
		        block = dset[:, 64*i:64*(i+1), 64*j:64*(j+1)]
		        np.median(block, 0)
		```

		%% Output

		CPU times: user 9.51 s, sys: 2.55 s, total: 12.1 s
		Wall time: 12.1 s

		%% Cell type:code id: tags:

		``` python
		f.close()
		f = h5py.File('image_sequence_chunked.hdf5','r')
		dset = f['chunked']
		```

		%% Cell type:code id: tags:

		``` python
		%%time
		for i in range(32):
		    for j in range(32):
		        # read a 64x64 block across all 100 frames
		        block = dset[:, 64*i:64*(i+1), 64*j:64*(j+1)]
		        np.median(block, 0)
		```

		%% Output

		CPU times: user 9.51 s, sys: 89.8 ms, total: 9.59 s
		Wall time: 9.59 s
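
%% Cell type:markdown id: tags:

One way to reason about a chunk shape is to count how many chunks a single read has to touch. The sketch below is plain arithmetic and does not call h5py. The helper `chunks_touched` is hypothetical, and the alternative chunk shape `(1, 2048, 2048)` (one chunk per frame) is an assumption used only for comparison with the `(100, 64, 64)` shape chosen above.

%% Cell type:code id: tags:

```python
from math import prod

def chunks_touched(selection, chunk_shape):
    """Count the chunks intersected by a selection.

    selection: one (start, stop) pair per axis, stop exclusive.
    chunk_shape: chunk size along each axis.
    """
    counts = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk touched on this axis
        last = (stop - 1) // size    # index of the last chunk touched on this axis
        counts.append(last - first + 1)
    return prod(counts)

# One block read from the loop above: all 100 frames, a 64x64 window
block = [(0, 100), (0, 64), (0, 64)]

aligned = chunks_touched(block, (100, 64, 64))       # chunk shape used above
per_frame = chunks_touched(block, (1, 2048, 2048))   # hypothetical alternative
shifted = chunks_touched([(0, 100), (32, 96), (0, 64)], (100, 64, 64))

print(aligned, per_frame, shifted)
```

With the chunk shape used above, each block read maps onto exactly one chunk, while a window that is not aligned to the chunk grid touches more than one.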

		%% Cell type:code id: tags:

		``` python
		f.close()
		f = h5py.File('image_sequence_chunked.hdf5','a')
		dset = f.require_dataset('auto_chunked', (2048,2048), dtype=np.float32, compression="gzip")
		print(dset.compression)
		print(dset.compression_opts)
		print(dset.chunks)
		```

		%% Output

		gzip
		4
		(64, 128)
