Ingest the RxRx1 dataset¶
!lamin init --storage test-rxrx --schema bionty,wetlab
→ connected lamindb: testuser1/test-rxrx
import lamindb as ln
import bionty as bt
import wetlab as wl
ln.track("Zo0qJt4IQPsb0000")
→ connected lamindb: testuser1/test-rxrx
→ created Transform('Zo0qJt4I'), started new Run('wsyAjnik') at 2024-12-20 15:36:46 UTC
→ notebook imports: bionty==0.53.2 lamindb==0.77.3 wetlab==0.39.1
The metadata.csv was originally downloaded from here and deposited on S3.
Load metadata¶
Read in the raw metadata of the wells:
meta = ln.Artifact(
    "s3://lamindata/rxrx1/metadata.csv",
    description=(
        "Experimental design of RxRx1, e.g. what cell type and"
        " treatment are in each well."
    ),
).load()
meta.head()
 | site_id | well_id | cell_type | dataset | experiment | plate | well | site | well_type | sirna | sirna_id
---|---|---|---|---|---|---|---|---|---|---|---
0 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 |
1 | HEPG2-08_1_B02_2 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 2 | negative_control | EMPTY | 1138 |
2 | HEPG2-08_1_B03_1 | HEPG2-08_1_B03 | HEPG2 | test | HEPG2-08 | 1 | B03 | 1 | treatment | s21721 | 855 |
3 | HEPG2-08_1_B03_2 | HEPG2-08_1_B03 | HEPG2 | test | HEPG2-08 | 1 | B03 | 2 | treatment | s21721 | 855 |
4 | HEPG2-08_1_B04_1 | HEPG2-08_1_B04 | HEPG2 | test | HEPG2-08 | 1 | B04 | 1 | treatment | s20894 | 710 |
The column storing cell lines is erroneously called cell_type. Also, dataset refers to what is typically called split. Let’s rename both columns:
meta.rename({"cell_type": "cell_line", "dataset": "split"}, axis=1, inplace=True)
Add a paths column. Each entry stands for the 6 per-channel image paths of one site; we’ll split it into individual paths further down:
paths = []
for _, row in meta.iterrows():
    well = row.well
    site = row.site
    paths.append(
        f"images/{row.split}/{row.experiment}/Plate{row.plate}/{well}_s{site}_w1-w6.png"
    )
meta["paths"] = paths
Use more meaningful plate names:
meta["plate"] = meta["plate"].apply(lambda name: f"Plate{name}")
Create a DataFrame with each row as a single image, similar to a link table but with multiple metadata columns:
meta_with_path = meta.copy()
keys_list = []
for key in meta_with_path["paths"]:
    # One path per channel, w1 through w6
    keys = [key.replace("w1-w6.png", f"w{channel}.png") for channel in range(1, 7)]
    keys_list.append(keys)
meta_with_path["path"] = keys_list
meta_with_path = meta_with_path.explode("path").reset_index(drop=True)
del meta_with_path["paths"]
meta_with_path
 | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | Plate1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1.png |
1 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | Plate1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w2.png |
2 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | Plate1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w3.png |
3 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | Plate1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w4.png |
4 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | Plate1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w5.png |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
753055 | U2OS-03_4_O23_2 | U2OS-03_4_O23 | U2OS | train | U2OS-03 | Plate4 | O23 | 2 | treatment | s21454 | 509 | images/train/U2OS-03/Plate4/O23_s2_w2.png |
753056 | U2OS-03_4_O23_2 | U2OS-03_4_O23 | U2OS | train | U2OS-03 | Plate4 | O23 | 2 | treatment | s21454 | 509 | images/train/U2OS-03/Plate4/O23_s2_w3.png |
753057 | U2OS-03_4_O23_2 | U2OS-03_4_O23 | U2OS | train | U2OS-03 | Plate4 | O23 | 2 | treatment | s21454 | 509 | images/train/U2OS-03/Plate4/O23_s2_w4.png |
753058 | U2OS-03_4_O23_2 | U2OS-03_4_O23 | U2OS | train | U2OS-03 | Plate4 | O23 | 2 | treatment | s21454 | 509 | images/train/U2OS-03/Plate4/O23_s2_w5.png |
753059 | U2OS-03_4_O23_2 | U2OS-03_4_O23 | U2OS | train | U2OS-03 | Plate4 | O23 | 2 | treatment | s21454 | 509 | images/train/U2OS-03/Plate4/O23_s2_w6.png |
753060 rows × 12 columns
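The expansion from one aggregate path to six per-channel paths can be sketched in isolation, without pandas. The helper names `expand_channels` and `parse_image_path` and the regular expression are illustrative assumptions, not part of the notebook; the path layout is taken from the table above.

```python
import re

N_CHANNELS = 6  # the aggregate path covers channels w1 through w6


def expand_channels(aggregate_path: str, n_channels: int = N_CHANNELS) -> list[str]:
    """Turn a '..._w1-w6.png' path into one path per channel."""
    suffix = f"w1-w{n_channels}.png"
    if not aggregate_path.endswith(suffix):
        raise ValueError(f"unexpected path format: {aggregate_path}")
    stem = aggregate_path[: -len(suffix)]
    return [f"{stem}w{channel}.png" for channel in range(1, n_channels + 1)]


# Hypothetical pattern describing the layout images/<split>/<experiment>/<plate>/<well>_s<site>_w<channel>.png
PATH_PATTERN = re.compile(
    r"images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/(?P<plate>Plate\d+)/"
    r"(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d+)\.png"
)


def parse_image_path(path: str) -> dict[str, str]:
    """Recover the metadata fields that are encoded in a per-channel path."""
    match = PATH_PATTERN.fullmatch(path)
    if match is None:
        raise ValueError(f"unexpected path format: {path}")
    return match.groupdict()


paths = expand_channels("images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png")
fields = parse_image_path(paths[-1])
print(paths[0])
print(fields)
```

Parsing a path back into its fields is a cheap way to check that the path column agrees with the other columns of the same row.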
Validate and register metadata¶
rxrx_curator = ln.Curator.from_df(
    meta_with_path,
    categoricals={
        "cell_line": bt.CellLine.name,
        "split": ln.ULabel.name,
        "experiment": wl.Experiment.name,
        "plate": ln.ULabel.name,
        "well": wl.Well.name,
        "well_type": ln.ULabel.name,
        "sirna": wl.GeneticPerturbation.name,
    },
)
✓ added 7 records with Feature.name for "columns": 'cell_line', 'split', 'experiment', 'plate', 'well', 'well_type', 'sirna'
rxrx_curator.validate()
• saving validated records of 'cell_line'
✓ added 4 records from public with CellLine.name for "cell_line": 'U-2 OS cell', 'Hep G2 cell', 'hTERT RPE-1 cell', 'HUV-EC-C cell'
• mapping "cell_line" on CellLine.name
! 4 terms are not validated: 'HEPG2', 'HUVEC', 'RPE', 'U2OS'
4 synonyms found: "HEPG2" → "Hep G2 cell", "HUVEC" → "HUV-EC-C cell", "RPE" → "hTERT RPE-1 cell", "U2OS" → "U-2 OS cell"
→ curate synonyms via .standardize("cell_line")
• mapping "split" on ULabel.name
! 2 terms are not validated: 'test', 'train'
→ fix typos, remove non-existent values, or save terms via .add_new_from("split")
• mapping "experiment" on Experiment.name
! 51 terms are not validated: 'HEPG2-08', 'HEPG2-09', 'HEPG2-10', 'HEPG2-11', 'HUVEC-17', 'HUVEC-18', 'HUVEC-19', 'HUVEC-20', 'HUVEC-21', 'HUVEC-22', 'HUVEC-23', 'HUVEC-24', 'RPE-08', 'RPE-09', 'RPE-10', 'RPE-11', 'U2OS-04', 'U2OS-05', 'HEPG2-01', 'HEPG2-02', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from("experiment")
• mapping "plate" on ULabel.name
! 4 terms are not validated: 'Plate1', 'Plate2', 'Plate3', 'Plate4'
→ fix typos, remove non-existent values, or save terms via .add_new_from("plate")
• mapping "well" on Well.name
! 308 terms are not validated: 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', 'B12', 'B13', 'B14', 'B15', 'B16', 'B17', 'B18', 'B19', 'B20', 'B21', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from("well")
• mapping "well_type" on ULabel.name
! 3 terms are not validated: 'negative_control', 'treatment', 'positive_control'
→ fix typos, remove non-existent values, or save terms via .add_new_from("well_type")
• mapping "sirna" on GeneticPerturbation.name
! 1139 terms are not validated: 'EMPTY', 's21721', 's20894', 's19827', 's19792', 's19935', 's21398', 's223097', 's348', 's19975', 's19911', 's21543', 's195030', 's20290', 's20345', 's20305', 's20110', 's21048', 's20519', 's21045', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from("sirna")
False
rxrx_curator.standardize("cell_line")
rxrx_curator.add_new_from("split")
rxrx_curator.add_new_from("experiment")
rxrx_curator.add_new_from("plate")
rxrx_curator.add_new_from("well_type")
# well requires row and column information so we'll create records manually
# sirna requires system information so we'll create records manually
✓ standardized 4 synonyms in "cell_line": "HEPG2" → "Hep G2 cell", "HUVEC" → "HUV-EC-C cell", "RPE" → "hTERT RPE-1 cell", "U2OS" → "U-2 OS cell"
✓ added 2 records with ULabel.name for "split": 'train', 'test'
✓ added 51 records with Experiment.name for "experiment": 'HUVEC-23', 'HUVEC-07', 'HUVEC-10', 'U2OS-01', 'HEPG2-03', 'HEPG2-09', 'HEPG2-01', 'RPE-01', 'RPE-05', 'HEPG2-04', 'HEPG2-10', 'U2OS-05', 'HUVEC-16', 'HEPG2-07', 'HUVEC-24', 'HUVEC-13', 'HUVEC-09', 'RPE-08', 'HUVEC-06', 'RPE-02', ...
✓ added 4 records with ULabel.name for "plate": 'Plate3', 'Plate2', 'Plate4', 'Plate1'
✓ added 3 records with ULabel.name for "well_type": 'treatment', 'positive_control', 'negative_control'
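The three operations used above (validate, standardize, add new terms) can be pictured with a toy model in plain Python. This is only a conceptual sketch: the functions `validate` and `standardize` below are hypothetical stand-ins and do not reproduce how LaminDB implements its curator. The cell line names and synonyms are copied from the output above.

```python
def validate(values, registered):
    """Return the distinct values that are missing from the registry, in first-seen order."""
    distinct = dict.fromkeys(values)  # de-duplicate while keeping order
    return [value for value in distinct if value not in registered]


def standardize(values, synonyms):
    """Replace known synonyms by their registered name; leave other values untouched."""
    return [synonyms.get(value, value) for value in values]


registered = {"Hep G2 cell", "HUV-EC-C cell", "hTERT RPE-1 cell", "U-2 OS cell"}
synonyms = {
    "HEPG2": "Hep G2 cell",
    "HUVEC": "HUV-EC-C cell",
    "RPE": "hTERT RPE-1 cell",
    "U2OS": "U-2 OS cell",
}

column = ["HEPG2", "HEPG2", "HUVEC", "RPE", "U2OS"]
not_validated = validate(column, registered)  # the four abbreviations
standardized = standardize(column, synonyms)
remaining = validate(standardized, registered)  # nothing left to fix

# Analogue of adding new terms: register whatever is still unknown
split_registry = set()
split_values = ["test", "train", "test"]
split_registry.update(validate(split_values, split_registry))

print(not_validated, remaining, sorted(split_registry))
```

Synonym mapping fits the cell lines because the registry already holds the canonical names, whereas the split, experiment, plate and well_type values have no public counterpart and are registered as new terms.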
well¶
We also want to add the well information to link image files and parse images based on well coordinates. We first extract well locations from the table:
# Temporarily disable synonyms search to reduce standard output
ln.settings.creation.search_names = False
wells = [
    wl.Well(name=well, row=well[0], column=int(well[1:]))
    for well in meta["well"].unique()
]
ln.save(wells)
ln.settings.creation.search_names = True
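The well names follow a one-letter row plus two-digit column scheme, which is what the `well[0]` and `int(well[1:])` expressions rely on. A slightly more defensive parser could look like the sketch below; `parse_well` is a hypothetical helper. The grid at the end assumes that wells span rows B to O and columns 2 to 23, which would be consistent with the 308 unique wells reported by the curator, though the notebook does not state the layout explicitly.

```python
import re
import string

WELL_PATTERN = re.compile(r"([A-Z])(\d{2})")


def parse_well(name: str) -> tuple[str, int]:
    """Split a well name such as 'B02' into (row letter, column number)."""
    match = WELL_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected well name: {name!r}")
    return match.group(1), int(match.group(2))


# Assumed layout: rows B..O (14 rows) and columns 2..23 (22 columns)
rows = string.ascii_uppercase[1:15]
columns = range(2, 24)
grid = [f"{row}{column:02d}" for row in rows for column in columns]

print(parse_well("B02"), parse_well("O23"), len(grid))
```

Raising on malformed names surfaces typos before any record is created, instead of failing later inside `int(...)`.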
sirna¶
Add sirna to the GeneticPerturbation table:
# Temporarily disable synonyms search to reduce standard output
ln.settings.creation.search_names = False
sirnas = [
    wl.GeneticPerturbation(
        name=sirna,
        system="siRNA",
    )
    for sirna in meta["sirna"].unique()
]
ln.save(sirnas)
ln.settings.creation.search_names = True
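Both cells above switch a setting off and back on by hand, so the setting stays off if saving fails midway. One way to make the toggle exception-safe is a small context manager. The sketch below is generic Python: `temporarily_set` is a hypothetical helper, and `creation_settings` is a stand-in object used here so the example runs without LaminDB; in the notebook the target would be `ln.settings.creation`.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_set(obj, attribute, value):
    """Set obj.<attribute> to value inside the with-block and restore it afterwards."""
    previous = getattr(obj, attribute)
    setattr(obj, attribute, value)
    try:
        yield
    finally:
        # Runs even if the body raises, so the old value always comes back
        setattr(obj, attribute, previous)


# Stand-in for ln.settings.creation (assumption for the sake of a runnable example)
creation_settings = SimpleNamespace(search_names=True)

with temporarily_set(creation_settings, "search_names", False):
    inside = creation_settings.search_names  # False while records are created
after = creation_settings.search_names  # restored to True

print(inside, after)
```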
cell_line¶
Add commonly used abbreviations:
bt.CellLine.get("30n7ByjL").set_abbr("HUVEC")
bt.CellLine.get("6EK4GXdy").set_abbr("U2OS")
bt.CellLine.get("og6IaxOV").set_abbr("RPE")
bt.CellLine.get("4ea731nb").set_abbr("HEPG2")
Register metadata file¶
meta_af = rxrx_curator.save_artifact(
    key="rxrx1/metadata.parquet",
    description="Metadata with file paths for each RxRx1 image.",
)
# Add a `readout` label using the Experimental Factor Ontology:
readout_feat = ln.Feature(name="readout", dtype="cat").save()
readout = bt.ExperimentalFactor.from_source(name="high content screen").save()
meta_af.labels.add(readout, readout_feat)
✓ "cell_line" is validated against CellLine.name
✓ "split" is validated against ULabel.name
✓ "experiment" is validated against Experiment.name
✓ "plate" is validated against ULabel.name
✓ "well" is validated against Well.name
✓ "well_type" is validated against ULabel.name
✓ "sirna" is validated against GeneticPerturbation.name
! 5 unique terms (41.70%) are not validated for name: 'site_id', 'well_id', 'site', 'sirna_id', 'path'
! did not create Feature records for 5 non-validated names: 'path', 'sirna_id', 'site', 'site_id', 'well_id'
✓ created 1 ExperimentalFactor record from Bionty matching name: 'high content screen'
✓ created 1 ExperimentalFactor record from Bionty matching ontology_id: 'EFO:0005397'
meta_af.describe()
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'ObUn5B5Ytjp8Ucx90000'
│   ├── .key = 'rxrx1/metadata.parquet'
│   ├── .size = 5720995
│   ├── .hash = '8KaHcVpmukO1EhY2qL7IGA'
│   ├── .path = /home/runner/work/lamin-spatial/lamin-spatial/docs/notebooks/test-rxrx/.lamindb/ObUn5B5Ytjp8Ucx90000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2024-12-20 15:37:09
│   └── .transform = 'Ingest the RxRx1 dataset'
├── Dataset features/.feature_sets
│   └── columns • 7 [Feature]
│       cell_line   cat[bionty.CellLine]       HUV-EC-C cell, Hep G2 cell, U-2 OS cell,…
│       experiment  cat[wetlab.Experiment]     HEPG2-01, HEPG2-02, HEPG2-03, HEPG2-04, …
│       plate       cat[ULabel]                Plate1, Plate2, Plate3, Plate4
│       sirna       cat[wetlab.GeneticPertur…  EMPTY, n337250, s1174, s12279, s134, s13…
│       split       cat[ULabel]                test, train
│       well        cat[wetlab.Well]           B02, B03, B04, B05, B06, B07, B08, B09, …
│       well_type   cat[ULabel]                negative_control, positive_control, trea…
├── Linked features
│   └── readout     cat[bionty.ExperimentalF…  high content screen
└── Labels
    └── .experiments            wetlab.Experiment          HEPG2-01, HEPG2-02, HEPG2-03, HEPG2-04, …
        .wells                  wetlab.Well                B02, B03, B04, B05, B06, B07, B08, B09, …
        .genetic_perturbations  wetlab.GeneticPerturbati…  EMPTY, n337250, s1174, s12279, s134, s13…
        .cell_lines             bionty.CellLine            U-2 OS cell, Hep G2 cell, hTERT RPE-1 ce…
        .experimental_factors   bionty.ExperimentalFactor  high content screen
        .ulabels                ULabel                     train, test, Plate3, Plate2, Plate4, Pla…
Register images¶
ln.UPath("gs://rxrx1-europe-west4/images").view_tree(level=2)
53 sub-directories & 0 files
gs://rxrx1-europe-west4/images
├── test/
│ ├── HEPG2-08/
│ ├── HEPG2-09/
│ ├── HEPG2-10/
│ ├── HEPG2-11/
│ ├── HUVEC-17/
│ ├── HUVEC-18/
│ ├── HUVEC-19/
│ ├── HUVEC-20/
│ ├── HUVEC-21/
│ ├── HUVEC-22/
│ ├── HUVEC-23/
│ ├── HUVEC-24/
│ ├── RPE-08/
│ ├── RPE-09/
│ ├── RPE-10/
│ ├── RPE-11/
│ ├── U2OS-04/
│ └── U2OS-05/
└── train/
    ├── HEPG2-01/
    ├── HEPG2-02/
    ├── HEPG2-03/
    ├── HEPG2-04/
    ├── HEPG2-05/
    ├── HEPG2-06/
    ├── HEPG2-07/
    ├── HUVEC-01/
    ├── HUVEC-02/
    ├── HUVEC-03/
    ├── HUVEC-04/
    ├── HUVEC-05/
    ├── HUVEC-06/
    ├── HUVEC-07/
    ├── HUVEC-08/
    ├── HUVEC-09/
    ├── HUVEC-10/
    ├── HUVEC-11/
    ├── HUVEC-12/
    ├── HUVEC-13/
    ├── HUVEC-14/
    ├── HUVEC-15/
    ├── HUVEC-16/
    ├── RPE-01/
    ├── RPE-02/
    ├── RPE-03/
    ├── RPE-04/
    ├── RPE-05/
    ├── RPE-06/
    ├── RPE-07/
    ├── U2OS-01/
    ├── U2OS-02/
    └── U2OS-03/
Take a subset to run on CI:
images = ln.Artifact(
    "gs://rxrx1-europe-west4/images/test/HEPG2-08", description="RxRx1 image files"
)
images.n_objects
_request non-retriable exception: Anonymous caller does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist)., 401
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/gcsfs/retry.py", line 130, in retry_request
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/gcsfs/core.py", line 440, in _request
validate_response(status, contents, path, args)
File "/opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/gcsfs/retry.py", line 117, in validate_response
raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.objects.create access to the Google Cloud Storage object. Permission 'storage.objects.create' denied on resource (or it may not exist)., 401
→ due to lack of write access, LaminDB won't manage storage location: gs://rxrx1-europe-west4/
• path in storage 'gs://rxrx1-europe-west4' with key 'images/test/HEPG2-08'
14772
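As a rough plausibility check, the reported object count can be compared with what the metadata would predict for one experiment. The sketch below assumes that the folder holds only image files and that a complete experiment has 4 plates, 308 wells per plate, 2 sites per well and 6 channels per site; these figures are read off the metadata and validation output above and are not stated as a rule anywhere in the notebook.

```python
# Assumed layout of a complete experiment (see lead-in for where the numbers come from)
n_plates = 4
n_wells = 308
n_sites = 2
n_channels = 6

upper_bound = n_plates * n_wells * n_sites * n_channels
reported = 14772  # images.n_objects from the output above

missing_images = upper_bound - reported
images_per_well = n_sites * n_channels

print(upper_bound, missing_images, images_per_well)
```

Under these assumptions the folder is 12 objects short of the upper bound, which equals the image count of a single well. Whether a well is actually absent from HEPG2-08 would need to be confirmed against the metadata.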
images.save()
Artifact(uid='SGmmlV2hEJYeni9w0000', is_latest=True, key='images/test/HEPG2-08', description='RxRx1 image files', suffix='', size=994441606, hash='6r5Hkce0UTy7X6gLeaqzBA', n_objects=14772, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=3, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:37:35 UTC)
collection = ln.Collection(
    images, name="Annotated RxRx1 images", meta_artifact=meta_af, version="1"
)
collection.save()
Collection(uid='rOzX4PVMzUcTac0Q0000', version='1', is_latest=True, name='Annotated RxRx1 images', hash='dycM8ypgnRRF9zXLSeD_sA', meta_artifact=Artifact(uid='ObUn5B5Ytjp8Ucx90000', is_latest=True, key='rxrx1/metadata.parquet', description='Metadata with file paths for each RxRx1 image.', suffix='.parquet', type='dataset', size=5720995, hash='8KaHcVpmukO1EhY2qL7IGA', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:37:09 UTC), visibility=1, created_by_id=1, transform_id=1, run_id=1, created_at=2024-12-20 15:37:35 UTC)
collection.meta_artifact
Artifact(uid='ObUn5B5Ytjp8Ucx90000', is_latest=True, key='rxrx1/metadata.parquet', description='Metadata with file paths for each RxRx1 image.', suffix='.parquet', type='dataset', size=5720995, hash='8KaHcVpmukO1EhY2qL7IGA', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:37:09 UTC)
collection.data_artifact
Artifact(uid='SGmmlV2hEJYeni9w0000', is_latest=True, key='images/test/HEPG2-08', description='RxRx1 image files', suffix='', size=994441606, hash='6r5Hkce0UTy7X6gLeaqzBA', n_objects=14772, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=3, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:37:35 UTC)
collection.describe()
Collection
└── General
    ├── .uid = 'rOzX4PVMzUcTac0Q0000'
    ├── .hash = 'dycM8ypgnRRF9zXLSeD_sA'
    ├── .version = '1'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2024-12-20 15:37:35
    └── .transform = 'Ingest the RxRx1 dataset'
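The metadata artifact covers all experiments, while the image artifact registered here only covers the folder with key 'images/test/HEPG2-08'. A consumer of the collection would therefore typically restrict the metadata to rows whose path lies under that key. The sketch below shows the idea on plain dictionaries; `rows_under_key` is a hypothetical helper, and the three example rows are taken from the tables above.

```python
from pathlib import PurePosixPath


def rows_under_key(rows, key):
    """Select the rows whose 'path' lies inside the folder identified by key."""
    folder = PurePosixPath(key)
    # Comparing against parent folders avoids false matches such as 'HEPG2-08' vs 'HEPG2-080'
    return [row for row in rows if folder in PurePosixPath(row["path"]).parents]


rows = [
    {"path": "images/test/HEPG2-08/Plate1/B02_s1_w1.png", "sirna": "EMPTY"},
    {"path": "images/test/HEPG2-08/Plate1/B03_s1_w1.png", "sirna": "s21721"},
    {"path": "images/train/U2OS-03/Plate4/O23_s2_w6.png", "sirna": "s21454"},
]

subset = rows_under_key(rows, "images/test/HEPG2-08")
print([row["sirna"] for row in subset])
```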