Package to scrap tweets

These details have not been verified by PyPI

Project description

stweet

Python package

A modern fast python library to scrap tweets and users quickly from Twitter unofficial API.

This tool helps you to scrap tweets by a search phrase, tweets by ids and user by usernames. It uses the Twitter API, the same API is used on a website.

Inspiration for the creation of the library

I have used twint to scrap tweets, but it has many errors, and it doesn't work properly. The code was not simple to understand. All tasks have one config, and the user has to know the exact parameter. The last important thing is the fact that Api can change — Twitter is the API owner and changes depend on it. It is annoying when something does not work and users must report bugs as issues.

Main advantages of the library

Simple code — the code is not only mine, every user can contribute to the library
Domain objects and interfaces — the main part of functionalities can be replaced (eg. calling web requests), the library has basic simple solution — if you want to expand it, you can do it without any problems and forks
100% coverage with integration tests — this advantage can find the API changes, tests are carried out every week and when the task fails, we can find the source of change easily – not in version 2.0
Custom tweets and users output — it is a part of the interface, if you want to save tweets and users custom format, it takes you a brief moment

Installation

pip install -U stweet

Donate

If you want to sponsor me, in thanks for the project, please send me some crypto 😁:

Coin	Wallet address
Bitcoin	3EajE9DbLvEmBHLRzjDfG86LyZB4jzsZyg
Etherum	0xE43d8C2c7a9af286bc2fc0568e2812151AF9b1FD

Basic usage

To make a simple request the scrap task must be prepared. The task should be processed by ** runner**.

import stweet as st


def try_search():
    search_tweets_task = st.SearchTweetsTask(all_words='#covid19')
    output_jl_tweets = st.JsonLineFileRawOutput('output_raw_search_tweets.jl')
    output_jl_users = st.JsonLineFileRawOutput('output_raw_search_users.jl')
    output_print = st.PrintRawOutput()
    st.TweetSearchRunner(search_tweets_task=search_tweets_task,
                         tweet_raw_data_outputs=[output_print, output_jl_tweets],
                         user_raw_data_outputs=[output_print, output_jl_users]).run()


def try_user_scrap():
    user_task = st.GetUsersTask(['iga_swiatek'])
    output_json = st.JsonLineFileRawOutput('output_raw_user.jl')
    output_print = st.PrintRawOutput()
    st.GetUsersRunner(get_user_task=user_task, raw_data_outputs=[output_print, output_json]).run()


def try_tweet_by_id_scrap():
    id_task = st.TweetsByIdTask('1447348840164564994')
    output_json = st.JsonLineFileRawOutput('output_raw_id.jl')
    output_print = st.PrintRawOutput()
    st.TweetsByIdRunner(tweets_by_id_task=id_task,
                        raw_data_outputs=[output_print, output_json]).run()


if __name__ == '__main__':
    try_search()
    try_user_scrap()
    try_tweet_by_id_scrap()

Example above shows that it is few lines of code required to scrap tweets.

Export format

Stweet uses api from website so there is no documentation about receiving response. Response is saving as raw so final user must parse it on his own. Maybe parser will be added in feature.

Scrapped data can be exported in different ways by using RawDataOutput abstract class. List of these outputs can be passed in every runner – yes it is possible to export in two different ways.

Currently, stweet have implemented:

CollectorRawOutput – can save data in memory and return as list of objects
JsonLineFileRawOutput – can export data as json lines
PrintEveryNRawOutput – prints every N-th item
PrintFirstInBatchRawOutput – prints first item in batch
PrintRawOutput – prints all items (not recommended in large scrapping)

Using tor proxy

Library is integrated with tor-python-easy. It allows using tor proxy with exposed control port – to change ip when it is needed.

If you want to use tor proxy client you need to prepare custom web client and use it in runner.

You need to run tor proxy -- you can run it on your local OS, or you can use this docker-compose.

Code snippet below show how to use proxy:

import stweet as st

if __name__ == '__main__':
    web_client = st.DefaultTwitterWebClientProvider.get_web_client_preconfigured_for_tor_proxy(
        socks_proxy_url='socks5://localhost:9050',
        control_host='localhost',
        control_port=9051,
        control_password='test1234'
    )

    search_tweets_task = st.SearchTweetsTask(all_words='#covid19')
    output_jl_tweets = st.JsonLineFileRawOutput('output_raw_search_tweets.jl')
    output_jl_users = st.JsonLineFileRawOutput('output_raw_search_users.jl')
    output_print = st.PrintRawOutput()
    st.TweetSearchRunner(search_tweets_task=search_tweets_task,
                         tweet_raw_data_outputs=[output_print, output_jl_tweets],
                         user_raw_data_outputs=[output_print, output_jl_users],
                         web_client=web_client).run()

Divide scrap periods recommended

Twitter on guest client block multiple pagination. Sometimes in one query there is possible to call for 3 paginations. To avoid this limitation divide scrapping period for smaller parts.

Twitter in 2023 block in API putting time range in timestamp – only format YYYY-MM-DD is acceptable. In arrow you can only put time without hours.

Twint inspiration

Small part of library uses code from twint. Twint was also main inspiration to create stweet.

Algorithm	Hash digest
SHA256	`f34852b5b6bf8e48e61a9b73311738ec1e016aa4372c8953ef25ccd9139d4ab6`
MD5	`dddd43c2d874c27370bcd70dc25544ad`
BLAKE2b-256	`1e713759442609777acb231e0d423595beb122f549110c50d89757a3c7186cac`

Algorithm	Hash digest
SHA256	`b6f07ba2a3172d659449433704543b3c855887c22851735ed5bbed1c588bff9f`
MD5	`242a7804e684d75f44d191ff7a783a8b`
BLAKE2b-256	`e0e3f4c5b6bdd85b80f8d790cbdb989e80c9c01a962c1937ba661987f267820e`

stweet 2.1.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers