Ingest the RxRx1 dataset

!lamin init --storage test-rxrx --schema bionty,wetlab
 connected lamindb: testuser1/test-rxrx
import lamindb as ln
import bionty as bt
import wetlab as wl

ln.track("Zo0qJt4IQPsb0000")
 connected lamindb: testuser1/test-rxrx
 created Transform('Zo0qJt4I'), started new Run('wsyAjnik') at 2024-12-20 15:36:46 UTC
 notebook imports: bionty==0.53.2 lamindb==0.77.3 wetlab==0.39.1

The metadata.csv was originally downloaded from here and deposited on S3.

Load metadata

Read in the raw metadata of the wells:

meta = ln.Artifact(
    "s3://lamindata/rxrx1/metadata.csv",
    description=(
        "Experimental design of RxRx1, e.g. what cell type and"
        " treatment are in each well."
    ),
).load()
meta.head()
site_id well_id cell_type dataset experiment plate well site well_type sirna sirna_id
0 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138
1 HEPG2-08_1_B02_2 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 2 negative_control EMPTY 1138
2 HEPG2-08_1_B03_1 HEPG2-08_1_B03 HEPG2 test HEPG2-08 1 B03 1 treatment s21721 855
3 HEPG2-08_1_B03_2 HEPG2-08_1_B03 HEPG2 test HEPG2-08 1 B03 2 treatment s21721 855
4 HEPG2-08_1_B04_1 HEPG2-08_1_B04 HEPG2 test HEPG2-08 1 B04 1 treatment s20894 710

The column storing cell lines is erroneously called cell_type, and dataset refers to what is typically called a split. Let’s rename both:

meta.rename({"cell_type": "cell_line", "dataset": "split"}, axis=1, inplace=True)

Add a paths column. Each entry is an aggregate that stands for the 6 image paths of a site, one per channel. We’ll expand it into individual paths further down:

paths = []
for _, row in meta.iterrows():
    well = row.well
    site = row.site
    paths.append(
        f"images/{row.split}/{row.experiment}/Plate{row.plate}/{well}_s{site}_w1-w6.png"
    )
meta["paths"] = paths
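The path template used in the loop can be isolated into a small helper, which makes it easy to check against a known row. This is only a sketch: the function name `site_path_pattern` is hypothetical, and it takes plain values rather than a DataFrame row.

```python
def site_path_pattern(split: str, experiment: str, plate: int, well: str, site: int) -> str:
    """Return the aggregate path for one site; `w1-w6` stands for the 6 channels."""
    return f"images/{split}/{experiment}/Plate{plate}/{well}_s{site}_w1-w6.png"


# Values taken from the first row of the metadata table shown above.
pattern = site_path_pattern("test", "HEPG2-08", 1, "B02", 1)
print(pattern)
```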

Use more meaningful plate names:

meta["plate"] = meta["plate"].apply(lambda name: f"Plate{name}")

Create a DataFrame with each row as a single image, similar to a link table but with multiple metadata columns:

meta_with_path = meta.copy()
keys_list = []
for key in meta_with_path["paths"]:
    keys = [key.replace("w1-w6.png", f"w{channel}.png") for channel in range(1, 7)]
    keys_list.append(keys)
meta_with_path["path"] = keys_list
meta_with_path = meta_with_path.explode("path").reset_index(drop=True)
del meta_with_path["paths"]
meta_with_path
site_id well_id cell_line split experiment plate well site well_type sirna sirna_id path
0 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 Plate1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w1.png
1 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 Plate1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w2.png
2 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 Plate1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w3.png
3 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 Plate1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w4.png
4 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 Plate1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w5.png
... ... ... ... ... ... ... ... ... ... ... ... ...
753055 U2OS-03_4_O23_2 U2OS-03_4_O23 U2OS train U2OS-03 Plate4 O23 2 treatment s21454 509 images/train/U2OS-03/Plate4/O23_s2_w2.png
753056 U2OS-03_4_O23_2 U2OS-03_4_O23 U2OS train U2OS-03 Plate4 O23 2 treatment s21454 509 images/train/U2OS-03/Plate4/O23_s2_w3.png
753057 U2OS-03_4_O23_2 U2OS-03_4_O23 U2OS train U2OS-03 Plate4 O23 2 treatment s21454 509 images/train/U2OS-03/Plate4/O23_s2_w4.png
753058 U2OS-03_4_O23_2 U2OS-03_4_O23 U2OS train U2OS-03 Plate4 O23 2 treatment s21454 509 images/train/U2OS-03/Plate4/O23_s2_w5.png
753059 U2OS-03_4_O23_2 U2OS-03_4_O23 U2OS train U2OS-03 Plate4 O23 2 treatment s21454 509 images/train/U2OS-03/Plate4/O23_s2_w6.png

753060 rows × 12 columns
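The expansion step can be illustrated without pandas. The sketch below turns one aggregate path into its six per-channel paths; `expand_channels` is a hypothetical helper name, and the channel count of 6 comes from the text above.

```python
def expand_channels(pattern: str, n_channels: int = 6) -> list[str]:
    """Replace the `w1-w6.png` suffix with one concrete file name per channel."""
    suffix = "w1-w6.png"
    if not pattern.endswith(suffix):
        raise ValueError(f"unexpected path pattern: {pattern}")
    stem = pattern[: -len(suffix)]
    return [f"{stem}w{channel}.png" for channel in range(1, n_channels + 1)]


channel_paths = expand_channels("images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png")
for path in channel_paths:
    print(path)
```

Because every site contributes six rows, the 753060 rows of the exploded table correspond to 753060 / 6 = 125510 sites.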

Validate and register metadata

rxrx_curator = ln.Curator.from_df(
    meta_with_path,
    categoricals={
        "cell_line": bt.CellLine.name,
        "split": ln.ULabel.name,
        "experiment": wl.Experiment.name,
        "plate": ln.ULabel.name,
        "well": wl.Well.name,
        "well_type": ln.ULabel.name,
        "sirna": wl.GeneticPerturbation.name,
    },
)
 added 7 records with Feature.name for "columns": 'cell_line', 'split', 'experiment', 'plate', 'well', 'well_type', 'sirna'
rxrx_curator.validate()
 saving validated records of 'cell_line'
 added 4 records from public with CellLine.name for "cell_line": 'U-2 OS cell', 'Hep G2 cell', 'hTERT RPE-1 cell', 'HUV-EC-C cell'
 mapping "cell_line" on CellLine.name
!   4 terms are not validated: 'HEPG2', 'HUVEC', 'RPE', 'U2OS'
    4 synonyms found: "HEPG2" → "Hep G2 cell", "HUVEC" → "HUV-EC-C cell", "RPE" → "hTERT RPE-1 cell", "U2OS" → "U-2 OS cell"
    → curate synonyms via .standardize("cell_line")
 mapping "split" on ULabel.name
!   2 terms are not validated: 'test', 'train'
    → fix typos, remove non-existent values, or save terms via .add_new_from("split")
 mapping "experiment" on Experiment.name
!   51 terms are not validated: 'HEPG2-08', 'HEPG2-09', 'HEPG2-10', 'HEPG2-11', 'HUVEC-17', 'HUVEC-18', 'HUVEC-19', 'HUVEC-20', 'HUVEC-21', 'HUVEC-22', 'HUVEC-23', 'HUVEC-24', 'RPE-08', 'RPE-09', 'RPE-10', 'RPE-11', 'U2OS-04', 'U2OS-05', 'HEPG2-01', 'HEPG2-02', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from("experiment")
 mapping "plate" on ULabel.name
!   4 terms are not validated: 'Plate1', 'Plate2', 'Plate3', 'Plate4'
    → fix typos, remove non-existent values, or save terms via .add_new_from("plate")
 mapping "well" on Well.name
!   308 terms are not validated: 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', 'B12', 'B13', 'B14', 'B15', 'B16', 'B17', 'B18', 'B19', 'B20', 'B21', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from("well")
 mapping "well_type" on ULabel.name
!   3 terms are not validated: 'negative_control', 'treatment', 'positive_control'
    → fix typos, remove non-existent values, or save terms via .add_new_from("well_type")
 mapping "sirna" on GeneticPerturbation.name
!   1139 terms are not validated: 'EMPTY', 's21721', 's20894', 's19827', 's19792', 's19935', 's21398', 's223097', 's348', 's19975', 's19911', 's21543', 's195030', 's20290', 's20345', 's20305', 's20110', 's21048', 's20519', 's21045', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from("sirna")
False
rxrx_curator.standardize("cell_line")
rxrx_curator.add_new_from("split")
rxrx_curator.add_new_from("experiment")
rxrx_curator.add_new_from("plate")
rxrx_curator.add_new_from("well_type")
# well requires row and column information so we'll create records manually
# sirna requires system information so we'll create records manually
 standardized 4 synonyms in "cell_line": "HEPG2" → "Hep G2 cell", "HUVEC" → "HUV-EC-C cell", "RPE" → "hTERT RPE-1 cell", "U2OS" → "U-2 OS cell"
 added 2 records with ULabel.name for "split": 'train', 'test'
 added 51 records with Experiment.name for "experiment": 'HUVEC-23', 'HUVEC-07', 'HUVEC-10', 'U2OS-01', 'HEPG2-03', 'HEPG2-09', 'HEPG2-01', 'RPE-01', 'RPE-05', 'HEPG2-04', 'HEPG2-10', 'U2OS-05', 'HUVEC-16', 'HEPG2-07', 'HUVEC-24', 'HUVEC-13', 'HUVEC-09', 'RPE-08', 'HUVEC-06', 'RPE-02', ...
 added 4 records with ULabel.name for "plate": 'Plate3', 'Plate2', 'Plate4', 'Plate1'
 added 3 records with ULabel.name for "well_type": 'treatment', 'positive_control', 'negative_control'
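
Conceptually, standardizing synonyms replaces each abbreviation with the ontology name it maps to. The following sketch reproduces that idea with a plain dictionary built from the mapping reported in the log above; it is not how the curator is implemented, and `standardize_names` is a hypothetical helper.

```python
# Synonym -> ontology name, copied from the validation log above.
CELL_LINE_SYNONYMS = {
    "HEPG2": "Hep G2 cell",
    "HUVEC": "HUV-EC-C cell",
    "RPE": "hTERT RPE-1 cell",
    "U2OS": "U-2 OS cell",
}


def standardize_names(values, synonyms):
    """Map known synonyms to their standard name; leave other values unchanged."""
    return [synonyms.get(value, value) for value in values]


standardized = standardize_names(["HEPG2", "U2OS", "Hep G2 cell"], CELL_LINE_SYNONYMS)
print(standardized)
```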

well

We also want to add the well information to link image files and parse images based on well coordinates. We first extract well locations from the table:

# Temporarily disable synonyms search to reduce standard output
ln.settings.creation.search_names = False
wells = [
    wl.Well(name=well, row=well[0], column=int(well[1:]))
    for well in meta["well"].unique()
]
ln.save(wells)
ln.settings.creation.search_names = True
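
The list comprehension above relies on every well name being one row letter followed by a column number. A slightly more defensive sketch of that parsing step is shown here; `parse_well` is a hypothetical helper, and the single-letter, two-digit format is an assumption based on the names visible in the table (such as B02 and O23).

```python
import re

# Assumed format: one uppercase row letter followed by a two-digit column.
WELL_PATTERN = re.compile(r"([A-Z])(\d{2})")


def parse_well(name: str) -> tuple[str, int]:
    """Split a well name such as 'B02' into its row letter and column number."""
    match = WELL_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid well name: {name!r}")
    row, column = match.groups()
    return row, int(column)


print(parse_well("B02"))
print(parse_well("O23"))
```

Raising on unexpected names surfaces malformed entries before any records are created, instead of failing inside `int()` with a less specific message.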

sirna

Add the siRNAs to the GeneticPerturbation table:

# Temporarily disable synonyms search to reduce standard output
ln.settings.creation.search_names = False
sirnas = [
    wl.GeneticPerturbation(
        name=sirna,
        system="siRNA",
    )
    for sirna in meta["sirna"].unique()
]
ln.save(sirnas)
ln.settings.creation.search_names = True
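
Both cells above switch a setting off and back on by hand, so the setting would stay off if saving raised an error. One way to guard against that is a small context manager, sketched below with a stand-in settings object. `temporary_setting` is a hypothetical helper, and using it with `ln.settings.creation` assumes that object accepts plain attribute assignment, as the cells above do.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(obj, name, value):
    """Set `obj.<name>` to `value` inside the block and restore the old value after."""
    previous = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs on normal exit and on exceptions alike.
        setattr(obj, name, previous)


# Stand-in for a settings object (hypothetical, for illustration only).
creation = SimpleNamespace(search_names=True)

with temporary_setting(creation, "search_names", False):
    print(creation.search_names)  # False inside the block

print(creation.search_names)  # True again afterwards
```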

cell_line

Add commonly used abbreviations:

bt.CellLine.get("30n7ByjL").set_abbr("HUVEC")
bt.CellLine.get("6EK4GXdy").set_abbr("U2OS")
bt.CellLine.get("og6IaxOV").set_abbr("RPE")
bt.CellLine.get("4ea731nb").set_abbr("HEPG2")

Register metadata file

meta_af = rxrx_curator.save_artifact(
    key="rxrx1/metadata.parquet",
    description="Metadata with file paths for each RxRx1 image.",
)

# Add a `readout` label using the Experimental Factor Ontology:
readout_feat = ln.Feature(name="readout", dtype="cat").save()
readout = bt.ExperimentalFactor.from_source(name="high content screen").save()
meta_af.labels.add(readout, readout_feat)
 "cell_line" is validated against CellLine.name
 "split" is validated against ULabel.name
 "experiment" is validated against Experiment.name
 "plate" is validated against ULabel.name
 "well" is validated against Well.name
 "well_type" is validated against ULabel.name
 "sirna" is validated against GeneticPerturbation.name
! 5 unique terms (41.70%) are not validated for name: 'site_id', 'well_id', 'site', 'sirna_id', 'path'
! did not create Feature records for 5 non-validated names: 'path', 'sirna_id', 'site', 'site_id', 'well_id'
 created 1 ExperimentalFactor record from Bionty matching name: 'high content screen'
 created 1 ExperimentalFactor record from Bionty matching ontology_id: 'EFO:0005397'
meta_af.describe()
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'ObUn5B5Ytjp8Ucx90000'
│   ├── .key = 'rxrx1/metadata.parquet'
│   ├── .size = 5720995
│   ├── .hash = '8KaHcVpmukO1EhY2qL7IGA'
│   ├── .path = /home/runner/work/lamin-spatial/lamin-spatial/docs/notebooks/test-rxrx/.lamindb/ObUn5B5Ytjp8Ucx90000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2024-12-20 15:37:09
│   └── .transform = 'Ingest the RxRx1 dataset'
├── Dataset features/.feature_sets
│   └── columns • 7                 [Feature]
│       cell_line                   cat[bionty.CellLine]       HUV-EC-C cell, Hep G2 cell, U-2 OS cell,…
│       experiment                  cat[wetlab.Experiment]     HEPG2-01, HEPG2-02, HEPG2-03, HEPG2-04, …
│       plate                       cat[ULabel]                Plate1, Plate2, Plate3, Plate4
│       sirna                       cat[wetlab.GeneticPertur…  EMPTY, n337250, s1174, s12279, s134, s13…
│       split                       cat[ULabel]                test, train
│       well                        cat[wetlab.Well]           B02, B03, B04, B05, B06, B07, B08, B09, …
│       well_type                   cat[ULabel]                negative_control, positive_control, trea…
├── Linked features
│   └── readout                     cat[bionty.ExperimentalF…  high content screen                      
└── Labels
    └── .experiments                wetlab.Experiment          HEPG2-01, HEPG2-02, HEPG2-03, HEPG2-04, …
        .wells                      wetlab.Well                B02, B03, B04, B05, B06, B07, B08, B09, …
        .genetic_perturbations      wetlab.GeneticPerturbati…  EMPTY, n337250, s1174, s12279, s134, s13…
        .cell_lines                 bionty.CellLine            U-2 OS cell, Hep G2 cell, hTERT RPE-1 ce…
        .experimental_factors       bionty.ExperimentalFactor  high content screen                      
        .ulabels                    ULabel                     train, test, Plate3, Plate2, Plate4, Pla…

Register images

ln.UPath("gs://rxrx1-europe-west4/images").view_tree(level=2)
53 sub-directories & 0 files
gs://rxrx1-europe-west4/images
├── test/
│   ├── HEPG2-08/
│   ├── HEPG2-09/
│   ├── HEPG2-10/
│   ├── HEPG2-11/
│   ├── HUVEC-17/
│   ├── HUVEC-18/
│   ├── HUVEC-19/
│   ├── HUVEC-20/
│   ├── HUVEC-21/
│   ├── HUVEC-22/
│   ├── HUVEC-23/
│   ├── HUVEC-24/
│   ├── RPE-08/
│   ├── RPE-09/
│   ├── RPE-10/
│   ├── RPE-11/
│   ├── U2OS-04/
│   └── U2OS-05/
└── train/
    ├── HEPG2-01/
    ├── HEPG2-02/
    ├── HEPG2-03/
    ├── HEPG2-04/
    ├── HEPG2-05/
    ├── HEPG2-06/
    ├── HEPG2-07/
    ├── HUVEC-01/
    ├── HUVEC-02/
    ├── HUVEC-03/
    ├── HUVEC-04/
    ├── HUVEC-05/
    ├── HUVEC-06/
    ├── HUVEC-07/
    ├── HUVEC-08/
    ├── HUVEC-09/
    ├── HUVEC-10/
    ├── HUVEC-11/
    ├── HUVEC-12/
    ├── HUVEC-13/
    ├── HUVEC-14/
    ├── HUVEC-15/
    ├── HUVEC-16/
    ├── RPE-01/
    ├── RPE-02/
    ├── RPE-03/
    ├── RPE-04/
    ├── RPE-05/
    ├── RPE-06/
    ├── RPE-07/
    ├── U2OS-01/
    ├── U2OS-02/
    └── U2OS-03/

Take a subset to run on CI:

images = ln.Artifact(
    "gs://rxrx1-europe-west4/images/test/HEPG2-08", description="RxRx1 image files"
)
images.n_objects
_request non-retriable exception: Anonymous caller does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist)., 401
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/gcsfs/retry.py", line 130, in retry_request
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/gcsfs/core.py", line 440, in _request
    validate_response(status, contents, path, args)
  File "/opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/gcsfs/retry.py", line 117, in validate_response
    raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist)., 401
 due to lack of write access, LaminDB won't manage storage location: gs://rxrx1-europe-west4/
• path in storage 'gs://rxrx1-europe-west4' with key 'images/test/HEPG2-08'
14772
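
A folder artifact like this one can be cross-checked against the metadata table by comparing the paths the table expects with the objects actually present. The sketch below uses plain sets; `compare_paths` and the toy inputs are hypothetical, and in practice the two inputs would come from the `path` column and a listing of the bucket.

```python
def compare_paths(expected, found):
    """Return (missing, unexpected): paths absent from storage, and paths absent from the table."""
    expected_set, found_set = set(expected), set(found)
    missing = sorted(expected_set - found_set)
    unexpected = sorted(found_set - expected_set)
    return missing, unexpected


# Toy inputs for illustration only.
expected = [
    "images/test/HEPG2-08/Plate1/B02_s1_w1.png",
    "images/test/HEPG2-08/Plate1/B02_s1_w2.png",
]
found = [
    "images/test/HEPG2-08/Plate1/B02_s1_w1.png",
    "images/test/HEPG2-08/Plate1/B02_s1_w3.png",
]
missing, unexpected = compare_paths(expected, found)
print(missing)
print(unexpected)
```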
images.save()
Artifact(uid='SGmmlV2hEJYeni9w0000', is_latest=True, key='images/test/HEPG2-08', description='RxRx1 image files', suffix='', size=994441606, hash='6r5Hkce0UTy7X6gLeaqzBA', n_objects=14772, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=3, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:37:35 UTC)
collection = ln.Collection(
    images, name="Annotated RxRx1 images", meta_artifact=meta_af, version="1"
)
collection.save()
Collection(uid='rOzX4PVMzUcTac0Q0000', version='1', is_latest=True, name='Annotated RxRx1 images', hash='dycM8ypgnRRF9zXLSeD_sA', meta_artifact=Artifact(uid='ObUn5B5Ytjp8Ucx90000', is_latest=True, key='rxrx1/metadata.parquet', description='Metadata with file paths for each RxRx1 image.', suffix='.parquet', type='dataset', size=5720995, hash='8KaHcVpmukO1EhY2qL7IGA', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:37:09 UTC), visibility=1, created_by_id=1, transform_id=1, run_id=1, created_at=2024-12-20 15:37:35 UTC)
collection.meta_artifact
Artifact(uid='ObUn5B5Ytjp8Ucx90000', is_latest=True, key='rxrx1/metadata.parquet', description='Metadata with file paths for each RxRx1 image.', suffix='.parquet', type='dataset', size=5720995, hash='8KaHcVpmukO1EhY2qL7IGA', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:37:09 UTC)
collection.data_artifact
Artifact(uid='SGmmlV2hEJYeni9w0000', is_latest=True, key='images/test/HEPG2-08', description='RxRx1 image files', suffix='', size=994441606, hash='6r5Hkce0UTy7X6gLeaqzBA', n_objects=14772, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=3, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:37:35 UTC)
collection.describe()
Collection 
└── General
    ├── .uid = 'rOzX4PVMzUcTac0Q0000'
    ├── .hash = 'dycM8ypgnRRF9zXLSeD_sA'
    ├── .version = '1'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2024-12-20 15:37:35
    └── .transform = 'Ingest the RxRx1 dataset'
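
Since the metadata table has one row per image, downstream code usually needs to gather the six channel files of a site before loading them. A minimal sketch with the standard library follows; `group_paths_by_site` is a hypothetical helper, and the rows are a hand-written excerpt of the table above.

```python
from collections import defaultdict


def group_paths_by_site(rows):
    """Collect image paths per site_id, sorted so that channels appear in order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["site_id"]].append(row["path"])
    return {site_id: sorted(paths) for site_id, paths in grouped.items()}


# Hand-written excerpt; a real run would iterate over the full metadata table.
rows = [
    {"site_id": "HEPG2-08_1_B02_1", "path": "images/test/HEPG2-08/Plate1/B02_s1_w2.png"},
    {"site_id": "HEPG2-08_1_B02_1", "path": "images/test/HEPG2-08/Plate1/B02_s1_w1.png"},
    {"site_id": "U2OS-03_4_O23_2", "path": "images/train/U2OS-03/Plate4/O23_s2_w6.png"},
]
by_site = group_paths_by_site(rows)
print(by_site["HEPG2-08_1_B02_1"])
```

Sorting the paths as strings keeps the channels in order here because the channel index is a single digit (w1 to w6).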