Skip to main content

SeedSpark is an Extensible PySpark utility package to create production spark pipelines and dev-test them in dev environments

Project description

SeedSpark

SeedSpark is an open-source is an Extensible PySpark utility package to create production spark pipelines and dev-test them in dev environments or to perform end to end tests. The goal is to enable rapid development of Spark pipelines via PySpark on Spark clusters and locally test the pipeline by using various utilities.

TODO

  1. Move logwrap [on top of loguru] extension out as a seperate package.
  2. Add Test containers for amundsen, etc..

Getting Started

  1. Setup SDKMAN
  2. Setup Java
  3. Setup Apache Spark
  4. Install Poetry
  5. Install Pre-commit and follow instruction in here
  6. Run tests locally

Setup SDKMAN

SDKMAN is a tool for managing parallel Versions of multiple Software Development Kits on any Unix based system. It provides a convenient command line interface for installing, switching, removing and listing Candidates. SDKMAN! installs smoothly on Mac OSX, Linux, WSL, Cygwin, etc... Support Bash and ZSH shells. See documentation on the SDKMAN! website.

Open your favourite terminal and enter the following:

$ curl -s https://get.sdkman.io | bash
If the environment needs tweaking for SDKMAN to be installed,
the installer will prompt you accordingly and ask you to restart.

Next, open a new terminal or enter:

$ source "$HOME/.sdkman/bin/sdkman-init.sh"

Lastly, run the following code snippet to ensure that installation succeeded:

$ sdk version

Setup Java

Install Java Now open favourite terminal and enter the following:

List the AdoptOpenJDK OpenJDK versions
$ sdk list java

To install For Java 11
$ sdk install java 11.0.10.hs-adpt

To install For Java 11
$ sdk install java 8.0.292.hs-adpt

Setup Apache Spark

Install Java Now open favourite terminal and enter the following:

List the Apache Spark versions:
$ sdk list spark

To install For Spark 3
$ sdk install spark 3.0.2

To install For Spark 3.1
$ sdk install spark 3.0.2

Poetry

Poetry Commands

poetry install

poetry update

# --tree: List the dependencies as a tree.
# --latest (-l): Show the latest version.
# --outdated (-o): Show the latest version but only for packages that are outdated.
poetry show -o

Running Tests Locally

Take a look at tests in tests/dataquality and tests/jobs

$ poetry run pytest
Ran 95 tests in 96.95s

NOTE: It's just curated stuff in this repo for personal usage.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

seedspark-0.4.3.tar.gz (18.7 kB view hashes)

Uploaded Source

Built Distribution

seedspark-0.4.3-py3-none-any.whl (21.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page