TFDS - custom dataset (images, xmls)

2022. 10. 17. 20:31

1. Tensorflow_datasets (in detail)

2. Background and goal

You can use the images and XML files directly from a local directory after downloading the data.

But I want to use the voc/2007 and voc/2012 datasets from tfds via tfds.load(), and I want custom_dataset to go through the same pipeline.
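Concretely, the goal is that a single input pipeline works for both the built-in VOC datasets and the custom one. Below is a minimal sketch of that shared pipeline; the helper name load_detection_dataset is just an illustration, not part of TFDS.

import tensorflow_datasets as tfds

def load_detection_dataset(name, split="train"):
    """Shared pipeline: the same call should work for 'voc/2007', 'voc/2012' or 'custom_dataset'."""
    dataset, info = tfds.load(name, split=split, with_info=True)
    # Both VOC and the custom dataset built below expose 'image' and 'objects' ({'bbox', 'label'}),
    # so any downstream mapping/batching code can be shared.
    return dataset, info

ds, info = load_detection_dataset("voc/2007")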

 

3. Prepare the dataset

You can prepare the dataset in any of the following layouts; I will use the second one.


data                  data                  data
└ train               └ train_images        └ train
    └ images          └ train_xmls              └ train.csv
    └ xmls            └ test_images         └ test
└ test                └ test_xmls               └ test.csv
    └ images
    └ xmls

Once the dataset is prepared, zip it as data.zip.
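For example, with the second layout the archive can be created from the parent of the data directory like this (a minimal sketch using Python's standard library; the paths are assumptions):

import shutil

# Creates data.zip whose top level directly contains train_images/, train_xmls/,
# test_images/ and test_xmls/ (this matches what _split_generators expects later).
shutil.make_archive("data", "zip", root_dir="data")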

 

4. Generate custom dataset

4-1. Install tfds module and generate custom_dataset project

4-2. Modify custom_dataset.py

4-3. Build custom_dataset.py

4-1. Install tfds module and generate custom_dataset project

1) (env) C:\...> pip install tensorflow_datasets

2) (env) C:\...> cd workspace

3) (env) C:\...> tfds new custom_dataset

[IMG-1] custom_dataset directory

Now we only need to modify custom_dataset.py to generate the custom dataset.

4-2. Modify custom_dataset.py

We can get data either from the web or from local storage. I will show how to get it from local storage.

 

custom_dataset.py has three methods:

_info: defines the dataset metadata and features

_split_generators: reads data.zip and defines the splits ("train", "val", "test")

_generate_examples: yields one (key, example) pair per record

 

default code

class CustomDataset(tfds.core.GeneratorBasedBuilder):
  """DatasetBuilder for custom_dataset dataset."""

  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {
      '1.0.0': 'Initial release.',
  }

  def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata."""
    # TODO(custom_dataset): Specifies the tfds.core.DatasetInfo object
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            # These are the features of your dataset like images, labels ...
            'image': tfds.features.Image(shape=(None, None, 3)),
            'label': tfds.features.ClassLabel(names=['no', 'yes']),
        }),
        # If there's a common (input, target) tuple from the
        # features, specify them here. They'll be used if
        # `as_supervised=True` in `builder.as_dataset`.
        supervised_keys=('image', 'label'),  # Set to `None` to disable
        homepage='https://dataset-homepage/',
        citation=_CITATION,
    )

  def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    """Returns SplitGenerators."""
    # TODO(custom_dataset): Downloads the data and defines the splits
    path = dl_manager.download_and_extract('https://todo-data-url')

    # TODO(custom_dataset): Returns the Dict[split names, Iterator[Key, Example]]
    return {
        'train': self._generate_examples(path / 'train_imgs'),
    }

  def _generate_examples(self, path):
    """Yields examples."""
    # TODO(custom_dataset): Yields (key, example) tuples from the dataset
    for f in path.glob('*.jpeg'):
      yield 'key', {
          'image': f,
          'label': 'yes',
      }

modified code

"""custom_dataset dataset."""

import tensorflow_datasets as tfds
from tensorflow_datasets.core.features import BBoxFeature
import xmltodict
from PIL import Image
import numpy as np
import tensorflow as tf

# TODO(custom_dataset): Markdown description  that will appear on the catalog page.
_DESCRIPTION = """
Description is **formatted** as markdown.

It should also contain any processing which has been applied (if any),
(e.g. corrupted example skipped, images cropped,...):
"""

# TODO(custom_dataset): BibTeX citation
_CITATION = """
"""


class CustomDataset(tfds.core.GeneratorBasedBuilder):
  MANUAL_DOWNLOAD_INSTRUCTIONS = """
   The data.zip file should be placed at /root/tensorflow_datasets/downloads/manual
   """ # modified

  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {
  }

  def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata."""
    # TODO(custom_dataset): Specifies the tfds.core.DatasetInfo object
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            # These are the features of your dataset like images, labels ...
            'image': tfds.features.Image(shape=(None, None, 3)), # modified
            'objects': tfds.features.Sequence({
                                  'bbox': tfds.features.BBoxFeature(),
                                  'label': tfds.features.ClassLabel(names=['Choi Woo-shik',
                                                                           'Kim Da-mi',
                                                                           'Kim Seong-cheol',
                                                                           'Kim Tae-ri',
                                                                           'Nam Joo-hyuk',
                                                                           'Yoo Jae-suk']), # modified
            })
        }),
        # If there's a common (input, target) tuple from the
        # features, specify them here. They'll be used if
        # `as_supervised=True` in `builder.as_dataset`.
        supervised_keys=('image', 'objects'),  # Set to `None` to disable # modified
        homepage='https://dataset-homepage/',
        citation=_CITATION,
    )

  def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    """Returns SplitGenerators."""
    # TODO(custom_dataset): Downloads the data and defines the splits
    archive_path = dl_manager.manual_dir / 'data.zip' # modified
    extracted_path = dl_manager.extract(archive_path) # modified

    # TODO(custom_dataset): Returns the Dict[split names, Iterator[Key, Example]]
    return {
        'train': self._generate_examples(img_path=extracted_path / 'train_images',
                                         xml_path=extracted_path / 'train_xmls'), # modified
        'test': self._generate_examples(img_path=extracted_path / 'test_images',
                                        xml_path=extracted_path / 'test_xmls'), # modified
    }

  def _generate_examples(self, img_path, xml_path):
    """Yields examples."""
    # TODO(custom_dataset): Yields (key, example) tuples from the dataset
    # Sort both globs so each image is paired with its matching annotation file.
    for i, (img, xml) in enumerate(zip(sorted(img_path.glob('*.jpg')),
                                       sorted(xml_path.glob('*.xml')))):
      yield i, {
        'image': img, # modified
        'objects': self._get_objects(xml) # modified
      }

  def _get_objects(self, xml): # custom method
    data=dict()
    f=open(xml)
    xml_file=xmltodict.parse(f.read())
    bbox=[]
    label=[]
    height, width = xml_file['annotation']['size']['height'], xml_file['annotation']['size']['width']
    for obj in xml_file['annotation']['object']:
      if isinstance(obj, dict):  # also matches the OrderedDict returned by older xmltodict versions
        label.append(obj['name'])
        x1=obj['bndbox']['xmin']
        y1=obj['bndbox']['ymin']
        x2=obj['bndbox']['xmax']
        y2=obj['bndbox']['ymax']
        y1, y2 = float(y1)/float(height), float(y2)/float(height)
        x1, x2 = float(x1)/float(width), float(x2)/float(width)
        bbox.append(tfds.features.BBox(ymin=y1, xmin=x1, ymax=y2, xmax=x2))
      else:
        if obj=='name':
          label.append(xml_file['annotation']['object'][obj])
        elif obj=='bndbox':
          x1 = xml_file['annotation']['object'][obj]['xmin']
          y1 = xml_file['annotation']['object'][obj]['ymin']
          x2 = xml_file['annotation']['object'][obj]['xmax']
          y2 = xml_file['annotation']['object'][obj]['ymax']
          y1, y2 = float(y1)/float(height), float(y2)/float(height)
          x1, x2 = float(x1)/float(width), float(x2)/float(width)
          bbox.append(tfds.features.BBox(ymin=y1, xmin=x1, ymax=y2, xmax=x2))
    f.close()
    data['bbox']=bbox
    data['label']=label
    return data
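For reference, _get_objects needs two branches because of how xmltodict handles repeated tags: with several <object> elements it returns a list of per-object dicts, while with a single <object> it returns one dict, so iterating over it yields the keys 'name', 'bndbox', and so on. A small sketch with made-up annotations (the label is just one of the class names above):

import xmltodict

OBJ = ("<object><name>Kim Tae-ri</name>"
       "<bndbox><xmin>10</xmin><ymin>20</ymin><xmax>50</xmax><ymax>60</ymax></bndbox>"
       "</object>")

single = f"<annotation><size><width>100</width><height>80</height></size>{OBJ}</annotation>"
multi = f"<annotation><size><width>100</width><height>80</height></size>{OBJ}{OBJ}</annotation>"

print(type(xmltodict.parse(single)['annotation']['object']))  # dict -> loop sees keys 'name', 'bndbox'
print(type(xmltodict.parse(multi)['annotation']['object']))   # list -> loop sees one dict per object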

 

For more details and further examples, see the TFDS documentation (linked in the Reference section below).

1) Define MANUAL_DOWNLOAD_INSTRUCTIONS

2) Modify _info: this is where you define the dataset features

3) Modify _split_generators

    you have two options for manual_dir (see the Build step below for details):

        1. put data.zip in the default manual directory

        2. put data.zip in a directory of your own and pass it with --manual_dir

    archive_path = dl_manager.manual_dir / 'data.zip' : joins manual_dir with 'data.zip'

    extracted_path = dl_manager.extract(archive_path) : path of the extracted data.zip

    the method returns a dict mapping split names ("train", "test", "val", ...) to example generators

4) Modify _generate_examples

    Here you define the feature values: image, bbox, classes, ...

    TFDS needs a unique key for every example, so you must supply one; simply using enumerate works (see also the sketch after this list).
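Besides enumerate, a hedged alternative is to use the image file name as the key, which stays stable even if files are added later (a sketch, assuming the file names are unique):

  def _generate_examples(self, img_path, xml_path):
    """Yields examples keyed by the image file name instead of an index."""
    for img, xml in zip(sorted(img_path.glob('*.jpg')), sorted(xml_path.glob('*.xml'))):
      yield img.stem, {
        'image': img,
        'objects': self._get_objects(xml),
      }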

4-3. Build custom_dataset.py

cd .../custom_dataset
tfds build                           # use the default manual_dir
                                     # = "/.../tensorflow_datasets/downloads/manual"
                                     # (create the manual directory if it does not exist)
                                     # or
tfds build --manual_dir "C:\...\..." # use a custom manual_dir

With either approach, you will find the generated dataset in /.../tensorflow_datasets/custom_dataset/...

5. Load custom_dataset

dataset, info = tfds.load("voc/2007", split="train", data_dir="~/tensorflow_datasets", with_info=True)
dataset, info = tfds.load("custom_dataset", split="train", data_dir=".../tensorflow_datasets", with_info=True)

The VOC dataset can be loaded with data_dir="~/tensorflow_datasets", but the same call does not work for custom_dataset.

So for custom_dataset you need to pass the directory where the dataset was actually generated as data_dir. You can locate that directory with tf.io.gfile.glob, as in the sketch below.
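A minimal sketch of that (the /root/tensorflow_datasets path is an assumed example; replace it with your own data_dir):

import tensorflow as tf
import tensorflow_datasets as tfds

# Find where the dataset was generated.
print(tf.io.gfile.glob("/root/tensorflow_datasets/custom_dataset/*"))
# e.g. ['/root/tensorflow_datasets/custom_dataset/1.0.0']

dataset, info = tfds.load("custom_dataset", split="train",
                          data_dir="/root/tensorflow_datasets", with_info=True)

# The same access pattern works for voc/2007 and custom_dataset.
for example in dataset.take(1):
    print(example['image'].shape)        # (H, W, 3) uint8 image
    print(example['objects']['bbox'])    # normalized [ymin, xmin, ymax, xmax]
    print(example['objects']['label'])   # integer class ids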

 

Now you can make your own custom dataset.

 

Reference

https://www.tensorflow.org/datasets/api_docs/python/tfds/all_symbols
