
Query arrays

We saw how LaminDB lets you query and search across artifacts and collections using registries (see Query & search registries).

Let us now look at the following case:

# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file labeled with "setosa"
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# select all observations (rows) in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]

Because the artifact was validated, querying the DataFrame is guaranteed to succeed!
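
To try the row selection above without a LaminDB instance, here is a minimal sketch on a toy DataFrame. The column name iris_organism_name is taken from the example; the sepal_length column and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the loaded artifact (values are made up).
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
        "iris_organism_name": ["setosa", "versicolor", "setosa", "virginica"],
    }
)

# A boolean mask with one entry per row ...
mask = df["iris_organism_name"] == "setosa"
# ... passed as the row indexer of .loc keeps only the matching observations.
df_setosa = df.loc[mask]
print(df_setosa)
```

Note that the mask goes in the row position of .loc; placing it after a colon (the column position) would try to select columns instead.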

Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.

In this notebook, we show how to subset an AnnData object and how to access generic HDF5 and zarr collections stored in the cloud.

!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-array-notebook --name test-array-notebook
✓ logged in with email testuser1@lamin.ai (uid: DzTjkKse)
→ go to: https://lamin.ai/testuser1/test-array-notebook
! updating cloud SQLite 's3://lamindb-ci/test-array-notebook/58eab9b6d7965975a7dc17a4bcbc5306.lndb' of instance 'testuser1/test-array-notebook'
→ connected lamindb: testuser1/test-array-notebook
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)
import lamindb as ln
→ connected lamindb: testuser1/test-array-notebook
ln.settings.verbosity = "info"

We’ll need some test data:

ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/lndb-storage/testfile.hdf5").save()
ln.Artifact("s3://lamindb-ci/lndb-storage/sharded_parquet").save()
! no run & transform got linked, call `ln.track()` & re-run
! no run & transform got linked, call `ln.track()` & re-run
! no run & transform got linked, call `ln.track()` & re-run
Artifact(uid='ijdIAo9Ae7WpFbhF0000', is_latest=True, key='lndb-storage/sharded_parquet', suffix='', size=42767, hash='Y6gxB0O0gbdTtiodATN79w', n_objects=11, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=2, created_by_id=1, created_at=2024-10-29 21:24:07 UTC)

AnnData

An h5ad artifact stored on s3:

artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")
artifact.path
S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')
adata = artifact.open()
! run input wasn't tracked, call `ln.track()` and re-run

This object is an AnnDataAccessor object, an AnnData object backed in the cloud:

adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">

You can subset it like a normal AnnData object:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']
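
The same boolean-mask logic can be tried on plain in-memory objects. The sketch below is a stand-in, not the AnnDataAccessor itself: obs is a toy pandas DataFrame and X a toy numpy array, both with made-up values.

```python
import numpy as np
import pandas as pd

# Toy per-observation metadata (made-up values), one row per cell.
obs = pd.DataFrame(
    {
        "cell_type": ["Dendritic cells", "CD19+ B", "CD14+ Monocytes", "Dendritic cells"],
        "percent_mito": [0.01, 0.02, 0.04, 0.09],
    }
)
# Toy data matrix with one row per observation.
X = np.arange(8, dtype="float32").reshape(4, 2)

# Combine two conditions into a single boolean mask over observations.
obs_idx = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
# Apply the mask to the rows of the matrix.
X_subset = X[obs_idx.to_numpy()]
print(X_subset.shape)
```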

Subsets load arrays into memory upon direct access:

adata_subset.X
array([[-0.326, -0.191,  0.499, ..., -0.21 , -0.636, -0.49 ],
       [ 0.811, -0.191, -0.728, ..., -0.21 ,  0.604, -0.49 ],
       [-0.326, -0.191,  0.643, ..., -0.21 ,  2.303, -0.49 ],
       ...,
       [-0.326, -0.191, -0.728, ..., -0.21 ,  0.626, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
      dtype=float32)

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Generic HDF5

Let us query a generic HDF5 artifact:

artifact = ln.Artifact.get(key="lndb-storage/testfile.hdf5")

And get a backed accessor:

backed = artifact.open()
! run input wasn't tracked, call `ln.track()` and re-run

The returned object holds the open connection in .connection and an h5py.File or zarr.Group in .storage:

backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/lndb-storage/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5>" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5>" (mode r)>
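
When .storage is an h5py.File, it can be explored with the regular h5py API. The following sketch assumes this and uses a small in-memory HDF5 file as a stand-in for backed.storage; the dataset names X and obs/n_genes are hypothetical and not taken from testfile.hdf5.

```python
import h5py
import numpy as np

# In-memory HDF5 file (core driver, nothing is written to disk);
# a hypothetical stand-in for backed.storage.
with h5py.File("toy.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("X", data=np.arange(12, dtype="float32").reshape(4, 3))
    f.create_group("obs").create_dataset("n_genes", data=np.array([10, 20, 30, 40]))

    # Collect the names of all groups and datasets in the file.
    names = []
    f.visit(names.append)

    dset = f["X"]          # a Dataset handle; the values are not read yet
    shape = dset.shape     # metadata is available without reading the data
    first_rows = dset[:2]  # slicing reads only the requested rows into numpy

print(sorted(names), shape)
print(first_rows)
```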

Parquet

A DataFrame stored as sharded parquet:

artifact = ln.Artifact.get(key="lndb-storage/sharded_parquet")
artifact.path.view_tree()
11 sub-directories & 11 files with suffixes '.parquet'
s3://lamindb-ci/lndb-storage/sharded_parquet
├── louvain=0/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
    └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
backed = artifact.open()
! run input wasn't tracked, call `ln.track()` and re-run

This returns a pyarrow Dataset:

backed
<pyarrow._dataset.FileSystemDataset at 0x7f315bbc26e0>
backed.head(5).to_pandas()
index             cell_type                     n_genes  percent_mito
CGTTATACAGTACC-8  CD4+/CD45RO+ Memory           1034     0.010163
AGATATTGACCACA-1  CD4+/CD45RO+ Memory           1078     0.012831
GCAGGGCTGTATGC-8  CD8+/CD45RA+ Naive Cytotoxic  1055     0.012287
TTATGGCTGGCAAG-2  CD4+/CD25 T Reg               1236     0.023963
CACGACCTGGGAGT-7  CD4+/CD25 T Reg               1010     0.016620
# clean up test instance
!lamin delete --force test-array-notebook
• deleting instance testuser1/test-array-notebook
→ deleted storage record on hub e0641645e20f57989a1a3e3364b9e548
→ deleted instance record on hub 58eab9b6d7965975a7dc17a4bcbc5306