Text-Mining Differences in Chinese Newspaper Articles
1.0

Workflow Type: Galaxy

V 16 - including updated diff-tool

Associated Tutorial

This workflow is part of the tutorial Text-Mining Differences in Chinese Newspaper Articles, available in the GTN.

Features

Thanks to...

Workflow Author(s): Daniela Schneider

Tutorial Author(s): Daniela Schneider

Tutorial Contributor(s): Björn Grüning, Daniela Schneider, Saskia Hiltemann, Teresa Müller

Funder(s): German Competence Center Cloud Technologies for Data Management and Processing (de.KCD)


Inputs

ID Name Description Type
Input censored text #main/Input censored text Upload the censored text containing replacement characters like ‘×’.
  • File
Input uncensored text #main/Input uncensored text Upload the uncensored text without replacement characters.
  • File

Steps

ID Name Description
2 Preprocessing of censored text This step uses regular expressions to delete all whitespace (\s) and to place each character on its own line (\1\n). The result is a cleaned, reformatted text with one character per line. toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy0
3 Preprocessing of uncensored text This step uses regular expressions to delete all whitespace (\s) and to place each character on its own line (\1\n). The result is a cleaned, reformatted text with one character per line. toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy0
4 Comparison with diff - user version The diff tool compares the two cleaned texts. This version (HTML version) creates an HTML file that colour-codes the differences between the texts as additions (green) or removals (red). toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy1
5 Comparison with diff - computer version The diff tool compares the two cleaned texts. This version of the output (raw output) is used for the further steps of the analysis. Because it is less intuitive for users, the other diff step (user version) provides a more visual HTML version of the output. toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy1
6 Extracting only censored passages This step selects all lines from the diff file that contain the censorship symbol ×. The condition "ord(c1) == 215" means that a line is selected if the character in column c1, which holds the censored text, is ×. Because the symbol × is ambiguous, the condition uses the decimal Unicode code of the character (215) for clarity. This step does not show an output. If the filtered symbol corresponds to an empty entry in the second text, the file lacks columns that the following steps need. This is invisible to users but would cause a technical error. The Compute step covers this case and makes sure all necessary columns exist; it shows the output for both steps (Extracting and Compute) correctly. Add another Unicode code here if you want to select a different character, for example, '□' or '△'. You can look up the respective code, for example, on this website: https://www.mauvecloud.net/charsets/CharCodeFinder.html Paste the character you want to filter into the "input" window and select "Decimal Character Codes" as the output. If you do this for the symbol ×, you get 215. Filter1
7 Compute This step unifies the formatting and adds any missing columns in case the lines extracted in the previous step are empty in the second text. This ensures the correct number of columns so that the next steps run smoothly. toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.1
8 Cut This step selects only column 9, which contains the uncensored characters from text two. The result is a single column with one Chinese character per row. This makes it possible to scale the characters by frequency in the word cloud step that follows, meaning that characters that appear more often are drawn bigger, which makes the results evident at first sight. Cut1
9 Datamash This step counts how often each character appears in the preceding table. toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0
10 Generate a word cloud This step shows which characters were censored in the first text. The bigger a character is drawn, the more often it appeared in the text. toolshed.g2.bx.psu.edu/repos/bgruening/wordcloud/wordcloud/1.9.4+galaxy1
11 Sort This step sorts the counted characters from most frequent to least frequent. sort1
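
The preprocessing, comparison and filtering steps above can be approximated outside Galaxy with a minimal Python sketch. It is not the workflow's implementation: Python's difflib stands in for the Galaxy diff tool, and the function names one_char_per_line and censored_characters are hypothetical. Only the two regular expressions (\s and \1\n) and the code 215 for × come from the step descriptions.

```python
import difflib
import re

# Decimal Unicode code of the censorship symbol '×', as used in "ord(c1) == 215".
DEFAULT_CODES = frozenset({215})


def one_char_per_line(text):
    """Delete all whitespace (\\s), then put every character on its own line (\\1\\n)."""
    without_whitespace = re.sub(r"\s", "", text)
    return re.sub(r"(.)", r"\1\n", without_whitespace)


def censored_characters(censored, uncensored, codes=DEFAULT_CODES):
    """Return the uncensored characters that sit where the censored text has a symbol.

    Assumption: difflib.SequenceMatcher replaces the Galaxy diff tool here, and
    replaced blocks are paired position by position, which is a simplification
    when the two sides of a block differ in length.
    """
    a = one_char_per_line(censored).splitlines()
    b = one_char_per_line(uncensored).splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    hidden = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        for old, new in zip(a[i1:i2], b[j1:j2]):
            if ord(old) in codes:  # same test as "ord(c1) == 215"
                hidden.append(new)
    return hidden


if __name__ == "__main__":
    print(censored_characters("明天 ××\n开会", "明天上午开会"))  # ['上', '午']
```

To filter a different replacement character, pass its code as well, for example codes={215, ord("□")}; ord returns the same decimal value as the character code website mentioned in step 6.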

Outputs

ID Name Description Type
output_csv #main/output_csv n/a
  • File
output_graphic #main/output_graphic n/a
  • File
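
The outputs are listed without a description. Judging from the Datamash and Sort steps, output_csv presumably holds the character counts, sorted from most to least frequent. The sketch below shows that counting and sorting in plain Python under this assumption; the function names and the column headers "character" and "count" are hypothetical and are not taken from the workflow.

```python
import csv
import io
from collections import Counter


def frequency_table(characters):
    """Count each character and sort by descending count (ties by character)."""
    counts = Counter(characters)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_csv(rows):
    """Render (character, count) rows as CSV text with a hypothetical header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["character", "count"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    table = frequency_table(["上", "午", "上"])
    print(to_csv(table), end="")
```

The same counts are what a word cloud would use to scale each character, so the table and the graphic are two views of one result.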

Version History

1.0 (earliest) Created 2nd Jun 2025 at 11:01 by GTN Bot

Added/updated 4 files

