Text-Mining Differences in Chinese Newspaper Articles
1.0

Workflow Type: Galaxy

V 16 - including updated diff-tool

Associated Tutorial

This workflow is part of the tutorial Text-Mining Differences in Chinese Newspaper Articles, available in the GTN.

Features

Includes a Galaxy Workflow Report
Uses Galaxy Workflow Comments

Thanks to...

Workflow Author(s): Daniela Schneider

Tutorial Author(s): Daniela Schneider

Tutorial Contributor(s): Björn Grüning, Daniela Schneider, Saskia Hiltemann, Teresa Müller

Funder(s): German Competence Center Cloud Technologies for Data Management and Processing (de.KCD)

SEEK ID: https://workflowhub.eu/workflows/1623?version=1

Inputs

ID	Name	Description	Type
Input censored text	#main/Input censored text	Upload the censored text containing replacement characters like ‘×’.	File
Input uncensored text	#main/Input uncensored text	Upload the uncensored text without replacement characters.	File

Steps

ID	Name	Description
2	Preprocessing of censored text	This step uses regular expressions to remove all whitespace (\s) and place each character on its own line (\1\n). The result is a cleaned, reformatted text with one character per line. toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy0
3	Preprocessing of uncensored text	This step uses regular expressions to remove all whitespace (\s) and place each character on its own line (\1\n). The result is a cleaned, reformatted text with one character per line. toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy0
4	Comparison with diff - user version	The diff tool compares the two cleaned texts. This version produces an HTML file that colour-codes the differences between the texts as additions (green) or deletions (red). toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy1
5	Comparison with diff - computer version	The diff tool compares the two cleaned texts. This version produces the raw output, which is used in the further steps of the analysis. Because the raw output is less intuitive for users, the other diff step (user version) provides a more visual HTML output. toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy1
6	Extracting only censored passages	This step selects all lines from the diff file that contain the censorship symbol ×. The condition "ord(c1) == 215" selects a line if the character in column c1, which holds the censored text, is ×. Because the symbol × is ambiguous, the condition uses the character's decimal Unicode code point (215) for clarity. This step does not show an output. If the filtered symbol has no counterpart in the second text, the resulting file lacks columns that the following steps need. This is invisible to users but would cause a technical error. The Compute step covers this case and makes sure all necessary columns exist; it shows the output for both steps (Extracting and Compute) correctly. Add another Unicode code point here if you want to select a different character, for example '□' or '△'. You can look up the respective code, for example, on this website: https://www.mauvecloud.net/charsets/CharCodeFinder.html Paste the character you want to filter into the "input" window and select "Decimal Character Codes" as the output. For the symbol ×, this gives 215. Filter1
7	Compute	This step unifies the formatting and adds any columns that are missing because a previously extracted line came up empty in the second text. This ensures the correct number of columns so that the next steps run smoothly. toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.1
8	Cut	This step selects only column 9, which contains the uncensored characters from the second text. The result is a single column of Chinese characters, one per row. This allows the word cloud to scale words by frequency, meaning that characters that appear more often are drawn bigger, which makes the results evident at first sight. Cut1
9	Datamash	This step counts how often each character appears in the preceding table. toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0
10	Generate a word cloud	This step shows which characters were censored in the first text. The bigger a word is drawn, the more often it appeared in the text. toolshed.g2.bx.psu.edu/repos/bgruening/wordcloud/wordcloud/1.9.4+galaxy1
11	Sort	Sorts the quantified results from those appearing most to those appearing least. sort1

Outputs

ID	Name	Description	Type
output_csv	#main/output_csv	n/a	File
output_graphic	#main/output_graphic	n/a	File

Version History

1.0 (earliest) Created 2nd Jun 2025 at 11:01 by GTN Bot

Added/updated 4 files

Open master 3f39b6d

Creators and Submitter

Creators

Not specified

Submitter

GTN Bot

Discussion Channel

GTN Matrix

License

Creative Commons Attribution 4.0 International

Attributions

None

Total size: 104 KB
